# Model Training Demonstration:
This notebook demonstrates how to train our GMF and NCF models, covering both the preprocessing steps and the training steps.

## Import Packages and Modules:
We'll use the following packages and modules in this notebook.

In [13]:
## Import necessary Python packages.
import tensorflow as tf
import numpy as np
import pandas as pd
import random

## Import the preprocessing and training modules.
import preprocess
import training

## Import the models.
import models.gmf_model as GMF
import models.ncf_model as NCF

## Preprocess Dataset:
First, we need to load in the interactions dataset. The Million Song Dataset (MSD) was converted into an interactions dataset and saved as `interactions.csv`, and we can load it in using `pandas`. We'll then convert each user-item interaction to binary (0 for negative interaction and 1 for positive interaction). 

Since the original MSD dataset has string-valued IDs for both users and songs, we need to convert them to numerical values so that we can properly train our models. This is done using the `MapUserItemID` function.

Finally, since we're working with a very sparse dataset, the standard training-validation split wasn't optimal. Instead, we used the leave-one-out technique, where a random positive interaction was left out and stored in the testing dataset. This is done by using the `LeaveOneOut` function.

In [2]:
## Load in the MSD interaction csv file.
interaction_df = pd.read_csv('interactions.csv')
## Convert whether the user liked a song into binary (0 or 1).
interaction_df['liked'] = (interaction_df['count'] > 0).astype(int)

## Map the DataFrame with numerical IDs.
mapped_df = preprocess.MapUserItemID(df = interaction_df)

## Split the training and testing dataset using the leave-one-out technique.
train_df, test_df = preprocess.LeaveOneOut(df = mapped_df)
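
As a rough illustration of what these two steps do, the sketch below maps string IDs to consecutive integers and then holds out one random positive interaction per user. The helper names `map_ids` and `leave_one_out`, the per-user hold-out, and the toy rows are assumptions made for this example; the actual `preprocess.MapUserItemID` and `preprocess.LeaveOneOut` functions operate on DataFrames and may differ in detail.

```python
import random
from collections import defaultdict

def map_ids(values):
    """Map each distinct value to a consecutive integer ID (hypothetical helper)."""
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)
    return mapping

def leave_one_out(interactions, seed=0):
    """Hold out one random (user, item) pair per user (assumed behavior)."""
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for user, item in interactions:
        by_user[user].append(item)

    train, test = [], []
    for user, items in by_user.items():
        held_out = rng.choice(items)
        test.append((user, held_out))
        # Everything except the held-out item stays in the training split.
        train.extend((user, item) for item in items if item != held_out)
    return train, test

# Toy interactions with string IDs, standing in for rows of interactions.csv.
raw = [("u_a", "s_x"), ("u_a", "s_y"), ("u_b", "s_x"), ("u_b", "s_z"), ("u_b", "s_y")]
user_map = map_ids(user for user, _ in raw)
song_map = map_ids(song for _, song in raw)
mapped = [(user_map[u], song_map[s]) for u, s in raw]
toy_train, toy_test = leave_one_out(mapped)
```

Each user contributes exactly one row to `toy_test`, and the two splits never share a row. Note that under this scheme a user with a single interaction would end up with no training rows.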

We also stored each user's positive interactions as a `set`. Because a set supports constant-time membership checks on average, this let us sample random interactions more efficiently than other storage methods.

In [3]:
## Create the sets of positive interactions and the pool of unique item IDs.
user_positive_itemsets, item_pool = preprocess.CreatePositiveInteractions(df = mapped_df)

Processing: 100%|██████████| 404103/404103 [02:35<00:00, 2605.83it/s]
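
To make the data structures concrete, here is a minimal sketch of how such sets could be built from a list of `(user, item)` pairs. The function name `build_positive_sets` and the toy data are hypothetical, and the real `preprocess.CreatePositiveInteractions` function may return differently shaped objects.

```python
from collections import defaultdict

def build_positive_sets(interactions):
    """Return a {user: set of positive items} dict and the set of all item IDs."""
    positive_sets = defaultdict(set)
    all_items = set()
    for user, item in interactions:
        positive_sets[user].add(item)
        all_items.add(item)
    return dict(positive_sets), all_items

# Toy (user_id, song_id) pairs that already use numerical IDs.
toy_interactions = [(0, 0), (0, 1), (1, 0), (1, 2)]
toy_positive_sets, toy_item_pool = build_positive_sets(toy_interactions)

# Checking whether a song is a known positive for a user is a set lookup.
song_2_is_positive_for_user_1 = 2 in toy_positive_sets[1]
song_2_is_positive_for_user_0 = 2 in toy_positive_sets[0]
```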


## Model Initialization:
We can now initialize our model. For this demonstration, we used the General Matrix Factorization (GMF) model because it is less complex than the Neural Collaborative Filtering (NCF) model and therefore trains faster. We used the Adam optimizer with an initial learning rate of $10^{-4}$ and the binary cross-entropy loss function. This process is shown below.

In [10]:
## Initialize the GMF model.
gmf_model = GMF.GMFModel(
                num_users = mapped_df['user_id'].nunique(), 
                num_items = mapped_df['song_id'].nunique(), 
                num_latent = 8
            )

## Compile the GMF model using the following optimizer, loss, and metrics.
gmf_optimizer = tf.keras.optimizers.Adam(learning_rate = 1e-4)
gmf_model.compile(optimizer = gmf_optimizer, loss = 'binary_crossentropy', metrics = ['accuracy'])
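
For intuition about what the compiled model computes, here is a minimal sketch of a GMF forward pass and the binary cross-entropy loss in plain Python. It assumes the usual GMF formulation, in which the prediction is a sigmoid applied to a weighted sum of the element-wise product of the user and item embeddings. The embedding values, the output weights `h`, and the helper names are made up for illustration; the actual `GMF.GMFModel` class may be structured differently.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gmf_score(user_vec, item_vec, h):
    """Sigmoid of the weighted sum of the element-wise embedding product."""
    product = [p * q for p, q in zip(user_vec, item_vec)]
    return sigmoid(sum(w * x for w, x in zip(h, product)))

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy, with predictions clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Illustrative 8-dimensional embeddings (matching num_latent = 8) and weights.
user_vec = [0.1, -0.2, 0.3, 0.0, 0.5, -0.1, 0.2, 0.4]
item_vec = [0.3, 0.1, -0.2, 0.4, 0.1, 0.0, -0.3, 0.2]
h = [1.0] * 8

score = gmf_score(user_vec, item_vec, h)
loss_if_liked = binary_cross_entropy([1], [score])
```

In the trained model, the embeddings and output weights are learned by minimizing this loss over the sampled interactions.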

## Model Training:
Since we're working with a very sparse interaction dataset, we couldn't use all the interactions (positive and negative) when training our model. For example, suppose user A has 5 positive interactions. Given the number of songs in the dataset, user A has 3,419 possible interactions. If we fed in the 5 positive interactions together with the remaining 3,414 negative interactions, the negatives would heavily outnumber the positives and very likely skew the model. Therefore, we used the `GenerateTrainingData` function, which randomly samples a fixed number of negative interactions for each positive one. In the case of user A, we would sample $4 \times 5 = 20$ negative interactions, giving us a total of 25 interactions for the training input. This keeps the class imbalance under control when training the model.

We used a batch size of 256 and trained the GMF model over 5 epochs for this demonstration. Note, however, that we found the model's learning does not saturate until around the 20th epoch.
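
The negative sampling scheme described above can be sketched for a single user as follows. This is a simplified stand-in, not the actual `training.GenerateTrainingData` implementation: the function names are hypothetical, negatives are drawn by rejection sampling without repeats, and the item pool of 3,419 songs simply mirrors the user A example.

```python
import random

def sample_negatives(positive_items, item_list, n_negatives, rng):
    """Draw distinct items the user has not interacted with.

    Assumes item_list holds unique IDs and contains every positive item.
    """
    n_negatives = min(n_negatives, len(item_list) - len(positive_items))
    negatives = set()
    while len(negatives) < n_negatives:
        candidate = rng.choice(item_list)
        # Set membership checks keep the rejection step cheap.
        if candidate not in positive_items and candidate not in negatives:
            negatives.add(candidate)
    return sorted(negatives)

def build_training_rows(user, positive_items, item_list, n_neg_multiplier=4, seed=None):
    """Return (user, item, label) rows: 1 for positives, 0 for sampled negatives."""
    rng = random.Random(seed)
    n_negatives = n_neg_multiplier * len(positive_items)
    negatives = sample_negatives(positive_items, item_list, n_negatives, rng)
    rows = [(user, item, 1) for item in sorted(positive_items)]
    rows += [(user, item, 0) for item in negatives]
    return rows

# User A: 5 positive interactions out of 3,419 possible songs.
toy_items = list(range(3419))
user_a_positives = {10, 20, 30, 40, 50}
user_a_rows = build_training_rows(0, user_a_positives, toy_items, n_neg_multiplier=4, seed=42)
```

Here `user_a_rows` contains the 5 positive rows followed by 20 sampled negative rows, which matches the 25 training interactions from the example.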

In [12]:
## Initialize a dictionary to keep track of the training loss and accuracy.
total_history = {}

with tf.device('/GPU:0'):
    ## Train the model over the given number of epochs.
    for epoch, (inputs, labels) in enumerate(training.GenerateTrainingData(train_df, user_positive_itemsets, item_pool, epochs = 5, n_neg_multiplier = 4)):
        print(f"Epoch {epoch+1}")
        
        ## Get the individual users and items.
        users = inputs[:, 0]
        items = inputs[:, 1]

        ## Fit the GMF model and append to history.
        history = gmf_model.fit(x = [users, items], y = labels, batch_size = 256, epochs = 1, verbose = 1)

        ## Store the training loss and accuracy to our custom total_history dictionary.
        for key, values in history.history.items():
            if key not in total_history:
                total_history[key] = [] 
            total_history[key].extend(values)

Generating Negative Samples: 100%|██████████| 162534/162534 [00:05<00:00, 28775.84it/s]


Epoch 1


Generating Negative Samples: 100%|██████████| 162534/162534 [00:05<00:00, 28002.95it/s]


Epoch 2


Generating Negative Samples: 100%|██████████| 162534/162534 [00:05<00:00, 29169.32it/s]


Epoch 3


Generating Negative Samples: 100%|██████████| 162534/162534 [00:05<00:00, 27481.40it/s]


Epoch 4


Generating Negative Samples: 100%|██████████| 162534/162534 [00:05<00:00, 28848.98it/s]


Epoch 5


## Applying Model:
Using the trained model, we can now use the `predict` function to predict the interaction score given a random user and item. This process is shown below.

**NOTE:** Since the model has only been trained for 5 epochs, its performance may be very poor. Please train the model for approximately 20 to 25 epochs for optimal performance.

In [15]:
## Choose a random row position to evaluate the model on (iloc is positional).
rand_index = random.randrange(len(mapped_df))

## Extract the user and song from this row.
user_test = np.array([mapped_df.iloc[rand_index]['user_id']])
song_test = np.array([mapped_df.iloc[rand_index]['song_id']])

## Create interaction array for both user and song.
user_song = [user_test, song_test]

## Input the interaction into the model to get a predicted interaction score.
pred_score = gmf_model.predict(user_song, verbose = 0).item()
print(f'Predicted Interaction: {pred_score}')

Predicted Interaction: 0.2119390368461609
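
Single predictions like this one can also be used to rank several candidate songs for a user. The sketch below shows one way to do that with a generic scoring function. The `top_k_items` helper and the `toy_score` function are hypothetical stand-ins; with the trained model, `score_fn` could instead wrap a call to `gmf_model.predict`.

```python
def top_k_items(user, candidate_items, score_fn, k=3):
    """Score every candidate item for a user and return the k best (item, score) pairs."""
    scored = [(item, score_fn(user, item)) for item in candidate_items]
    # Sort by predicted interaction score, highest first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

def toy_score(user, item):
    """Stand-in for a model prediction: higher when the two IDs are close."""
    return 1.0 / (1.0 + abs(user - item))

recommendations = top_k_items(user=2, candidate_items=[0, 1, 3, 5, 8], score_fn=toy_score, k=3)
recommended_items = [item for item, _ in recommendations]
```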
