# Data Preprocessing

Before we start building and training our model, let's do some preprocessing to get the data in the required format.

In [1]:
# Install the Kaggle API package
!pip install kaggle

# Upload your Kaggle API key (the kaggle.json file downloaded from your Kaggle account)
from google.colab import files
files.upload()

# Copy the key to where the Kaggle CLI expects it and restrict its permissions
# (-p keeps mkdir from failing if the cell is re-run)
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

# Use the Kaggle API to download the dataset
!kaggle datasets download -d grouplens/movielens-20m-dataset



Saving kaggle.json to kaggle.json
Downloading movielens-20m-dataset.zip to /content
 99% 193M/195M [00:08<00:00, 27.1MB/s]
100% 195M/195M [00:08<00:00, 24.5MB/s]


In [2]:
!unzip movielens-20m-dataset

Archive:  movielens-20m-dataset.zip
  inflating: genome_scores.csv       
  inflating: genome_tags.csv         
  inflating: link.csv                
  inflating: movie.csv               
  inflating: rating.csv              
  inflating: tag.csv                 


In [4]:
!pip install pytorch_lightning

Collecting pytorch_lightning
  Downloading pytorch_lightning-2.1.1-py3-none-any.whl (776 kB)
Collecting torchmetrics>=0.7.0 (from pytorch_lightning)
  Downloading torchmetrics-1.2.0-py3-none-any.whl (805 kB)
Collecting lightning-utilities>=0.8.0 (from pytorch_lightning)
  Downloading lightning_utilities-0.9.0-py3-none-any.whl (23 kB)
Installing collected packages: lightning-utilities, torchmetrics, pytorch_lightning
Successfully installed lightning-utilities-0.9.0 pytorch_lightning-2.1.1 torchmetrics-1.2.0


In [5]:
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import pytorch_lightning as pl

np.random.seed(123)

First, we import the ratings dataset.

In [6]:
ratings = pd.read_csv('rating.csv',
                      parse_dates=['timestamp'])

To keep memory usage manageable in the notebook environment, we will only use data from 30% of the users in this dataset. We'll randomly select those users further below, after a quick look at the movie metadata, and only use data from the selected users.

In [9]:
movies = pd.read_csv('movie.csv')

In [10]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [12]:
# Use a regular expression to find a four-digit year stored between parentheses.
# Matching the parentheses avoids picking up years that are part of a movie's title.
movies['year'] = movies['title'].str.extract(r'\((\d{4})\)', expand=False)
# Remove the "(YYYY)" part from the 'title' column.
# regex=True is explicit because pandas 2.0 changed the default to regex=False.
movies['title'] = movies['title'].str.replace(r'\(\d{4}\)', '', regex=True)
# Strip any leading or trailing whitespace left behind
movies['title'] = movies['title'].str.strip()



In [13]:
movies.head()

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1995
1,2,Jumanji,Adventure|Children|Fantasy,1995
2,3,Grumpier Old Men,Comedy|Romance,1995
3,4,Waiting to Exhale,Comedy|Drama|Romance,1995
4,5,Father of the Bride Part II,Comedy,1995
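
To see what the pattern does in isolation, here is a minimal sketch using Python's standard `re` module instead of pandas. The helper name `split_title_and_year` and the sample titles are illustrative assumptions, not part of the notebook.

```python
import re

# Matches a four-digit year wrapped in parentheses, e.g. "(1995)".
YEAR_PATTERN = re.compile(r"\((\d{4})\)")

def split_title_and_year(raw_title):
    """Return (title, year); year is None when no "(YYYY)" is present."""
    match = YEAR_PATTERN.search(raw_title)
    if match is None:
        return raw_title.strip(), None
    title = YEAR_PATTERN.sub("", raw_title).strip()
    return title, match.group(1)

# Illustrative titles; the second has a year-like number inside the title itself.
examples = ["Toy Story (1995)", "2001: A Space Odyssey (1968)", "Untitled Project"]
for raw in examples:
    print(split_title_and_year(raw))
```

Because the pattern requires the surrounding parentheses, the leading "2001" in the second title is left alone and only "(1968)" is treated as the year.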


In [6]:
rand_userIds = np.random.choice(ratings['userId'].unique(),
                                size=int(len(ratings['userId'].unique())*0.3),
                                replace=False)

ratings = ratings.loc[ratings['userId'].isin(rand_userIds)]

print('There are {} rows of data from {} users'.format(len(ratings), len(rand_userIds)))

There are 6027314 rows of data from 41547 users


In [7]:
ratings.sample(5)

Unnamed: 0,userId,movieId,rating,timestamp
3840312,26182,3704,4.0,2007-01-31 21:56:52
7608731,52439,3365,4.0,2004-03-21 08:02:56
19363634,134060,1027,3.0,2003-07-15 22:43:45
17181947,118860,2629,1.0,2007-11-29 21:27:08
9344779,64638,4723,2.0,2001-09-10 20:11:41


After filtering the dataset, there are now 6,027,314 rows of data from 41,547 users. Each row in the dataframe corresponds to a movie review made by a single user.

### Train-test split

Along with the rating, there is also a `timestamp` column that shows the date and time the review was submitted. Using the `timestamp` column, we will implement our train-test split strategy using the leave-one-out methodology. For each user, the most recent review is used as the test set (i.e. leave one out), while the rest are used as training data.

To illustrate this, the movies reviewed by user 39,849 are shown below. The last movie reviewed by the user is the 2014 hit movie Guardians of the Galaxy. We'll use this movie as the testing data for this user, and use the rest of the reviewed movies as training data.

![](https://i.imgur.com/oNJnLqU.png)
> **Movie posters from themoviedb.org (free to use)**
>



This train-test split strategy is often used when training and evaluating recommender systems. Doing a random split would not be fair, as we could potentially be using a user's recent reviews for training and earlier reviews for testing. This introduces data leakage with a look-ahead bias, and the performance of the trained model would not be generalizable to real-world performance.

The code below will split our ratings dataset into a train and test set using the leave-one-out methodology.

In [8]:
ratings['rank_latest'] = ratings.groupby(['userId'])['timestamp'] \
                                .rank(method='first', ascending=False)

train_ratings = ratings[ratings['rank_latest'] != 1]
test_ratings = ratings[ratings['rank_latest'] == 1]

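# Drop columns that we no longer need. The .copy() gives us independent frames,
# so modifying them later does not trigger pandas' SettingWithCopyWarning.
train_ratings = train_ratings[['userId', 'movieId', 'rating']].copy()
test_ratings = test_ratings[['userId', 'movieId', 'rating']].copy()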
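
As a sanity check on the leave-one-out logic, the same ranking trick can be run on a tiny hand-made table. This is only a sketch: the user ids, movie ids and timestamps below are invented for illustration.

```python
import pandas as pd

# Toy ratings table; all values are made up.
toy = pd.DataFrame({
    "userId":    [1, 1, 1, 2, 2],
    "movieId":   [10, 20, 30, 10, 40],
    "timestamp": pd.to_datetime([
        "2020-01-01", "2020-03-01", "2020-02-01",
        "2021-05-01", "2021-04-01",
    ]),
})

# Rank each user's ratings from newest (rank 1) to oldest.
toy["rank_latest"] = (
    toy.groupby("userId")["timestamp"].rank(method="first", ascending=False)
)

toy_test = toy[toy["rank_latest"] == 1]    # one row per user: the latest rating
toy_train = toy[toy["rank_latest"] != 1]   # everything else

print(toy_test[["userId", "movieId"]])
print(toy_train[["userId", "movieId"]])
```

Each user contributes exactly one row to the test set (movie 20 for user 1, movie 10 for user 2), and every training row is older than that user's test row.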

### Converting the dataset into an implicit feedback dataset

As discussed earlier, we will train a recommender system using implicit feedback. However, the MovieLens dataset that we're using is based on explicit feedback. To convert this dataset into an implicit feedback dataset, we'll simply binarize the ratings such that they are all '1' (i.e. positive class). The value of '1' represents that the user has interacted with the item.

It is important to note that using implicit feedback reframes the problem that our recommender is trying to solve. Instead of trying to predict movie ratings (when using explicit feedback), we are trying to predict whether the user will interact (i.e. click/buy/watch) with each movie, with the aim of presenting to users the movies with the highest interaction likelihood.






In [9]:
train_ratings.loc[:, 'rating'] = 1

train_ratings.sample(5)


Unnamed: 0,userId,movieId,rating
3411906,23263,5481,1
1815983,12245,5464,1
16198592,112109,6,1
13914487,96124,1247,1
1445807,9790,1690,1


We do have a problem now though. After binarizing our dataset, we see that every sample in the dataset now belongs to the positive class. However, we also require negative samples to train our models, to indicate movies that the user has not interacted with. We assume that such movies are those that the user is not interested in. This is a sweeping assumption that may not be true, but it usually works out rather well in practice.

The code below generates 4 negative samples for each row of data. In other words, the ratio of negative to positive samples is 4:1. This ratio is chosen arbitrarily, but I found that it works rather well (feel free to find the best ratio yourself!).

In [10]:
# Get a list of all movie IDs
all_movieIds = ratings['movieId'].unique()

# Placeholders that will hold the training data
users, items, labels = [], [], []

# Set of (user, item) pairs that users have interacted with in the training data
user_item_set = set(zip(train_ratings['userId'], train_ratings['movieId']))

# 4:1 ratio of negative to positive samples
num_negatives = 4

for (u, i) in tqdm(user_item_set):
    users.append(u)
    items.append(i)
    labels.append(1) # items that the user has interacted with are positive
    for _ in range(num_negatives):
        # randomly select an item
        negative_item = np.random.choice(all_movieIds)
        # check that the user has not interacted with this item
        while (u, negative_item) in user_item_set:
            negative_item = np.random.choice(all_movieIds)
        users.append(u)
        items.append(negative_item)
        labels.append(0) # items not interacted with are negative

  0%|          | 0/5985767 [00:00<?, ?it/s]
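
The sampling loop is easier to inspect on a handful of rows. The sketch below wraps the same rejection-sampling idea in a hypothetical helper, `sample_negatives`, and runs it on invented (userId, movieId) pairs; the seed and the toy ids are assumptions for illustration only.

```python
import numpy as np

def sample_negatives(positive_pairs, all_item_ids, num_negatives, rng):
    """Return (users, items, labels) with num_negatives negatives per positive pair."""
    positives = set(positive_pairs)
    users, items, labels = [], [], []
    for user, item in sorted(positives):
        users.append(user)
        items.append(item)
        labels.append(1)  # observed interaction
        for _ in range(num_negatives):
            candidate = int(rng.choice(all_item_ids))
            # Resample until we draw an item this user has not interacted with.
            while (user, candidate) in positives:
                candidate = int(rng.choice(all_item_ids))
            users.append(user)
            items.append(candidate)
            labels.append(0)  # sampled non-interaction
    return users, items, labels

rng = np.random.default_rng(0)            # arbitrary seed
toy_pairs = [(1, 10), (1, 20), (2, 30)]   # made-up (userId, movieId) pairs
toy_items = np.array([10, 20, 30, 40, 50])

users, items, labels = sample_negatives(toy_pairs, toy_items, num_negatives=4, rng=rng)
print(len(labels), sum(labels))  # 15 rows in total, 3 of them positive
```

With 3 positive pairs and 4 negatives each, the result always has 15 rows. Note that the same negative item can be drawn more than once for a user, which is also true of the loop above.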

Great! We now have the data in the format required by our model. Before we move on, let's define a PyTorch Dataset to facilitate training. The class below simply encapsulates the code we have written above into a PyTorch Dataset class.

In [20]:
class MovieLensTrainDataset(Dataset):
    """MovieLens PyTorch Dataset for Training

    Args:
        ratings (pd.DataFrame): Dataframe containing the movie ratings
        all_movieIds (list): List containing all movieIds

    """

    def __init__(self, ratings, all_movieIds):
        self.users, self.items, self.labels = self.get_dataset(ratings, all_movieIds)

    def __len__(self):
        return len(self.users)

    def __getitem__(self, idx):
        return self.users[idx], self.items[idx], self.labels[idx]

    def get_dataset(self, ratings, all_movieIds):
        users, items, labels = [], [], []
        user_item_set = set(zip(ratings['userId'], ratings['movieId']))

        num_negatives = 4
        for u, i in user_item_set:
            users.append(u)
            items.append(i)
            labels.append(1)
            for _ in range(num_negatives):
                negative_item = np.random.choice(all_movieIds)
                while (u, negative_item) in user_item_set:
                    negative_item = np.random.choice(all_movieIds)
                users.append(u)
                items.append(negative_item)
                labels.append(0)

        return torch.tensor(users), torch.tensor(items), torch.tensor(labels)

# Our model - Neural Collaborative Filtering (NCF)

While there are many deep learning based architectures for recommendation systems, I find that the framework proposed by [He et al.](https://arxiv.org/abs/1708.05031) is the most straightforward, and it is simple enough to be implemented in a tutorial such as this.

### User Embeddings

Before we dive into the architecture of the model, let's familiarize ourselves with the concept of embeddings. An embedding is a low-dimensional space that captures the relationship of vectors from a higher dimensional space. To better understand this concept, let's take a closer look at user embeddings.

Imagine that we want to represent our users according to their preference for two genres of movies - action and romance movies. Let the first dimension be how much the user likes action movies, and the second dimension be how much the user likes romance movies.

![](https://i.imgur.com/XENzqXq.png)

Now, assume that Bob is our first user. Bob likes action movies but isn't a fan of romance movies. To represent Bob as a two dimensional vector, we place him in the graph according to his preference.

![](https://i.imgur.com/rSStTCj.png)

Our next user is Joe. Joe is a huge fan of both action and romance movies. We represent Joe using a two dimensional vector just like Bob.

![](https://i.imgur.com/gmmkrEU.png)

This two dimensional space is known as an embedding. Essentially, the embedding reduces our users such that they can be represented in a meaningful manner in a lower dimensional space. In this embedding, users with similar movie preferences are placed near to each other, and vice versa.

![](https://i.imgur.com/9s9Z7JT.png)
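
To make "near to each other" concrete, here is a small sketch that measures distances in a two-dimensional embedding. The coordinates are hypothetical values chosen to match the descriptions of Bob and Joe, and Alice is an invented third user whose tastes resemble Bob's.

```python
import numpy as np

# Hypothetical 2-D embeddings: [liking for action, liking for romance], each in [0, 1].
users = {
    "Bob":   np.array([0.9, 0.1]),  # likes action, not romance
    "Joe":   np.array([0.9, 0.9]),  # likes both genres
    "Alice": np.array([0.8, 0.2]),  # made-up user with tastes close to Bob's
}

def distance(a, b):
    """Euclidean distance between two users in the embedding space."""
    return float(np.linalg.norm(users[a] - users[b]))

print(f"Bob-Alice: {distance('Bob', 'Alice'):.3f}")
print(f"Bob-Joe:   {distance('Bob', 'Joe'):.3f}")
```

Bob ends up much closer to Alice than to Joe, which is the property we want a learned embedding to have for users with similar preferences.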

Of course, we are not restricted to using just 2 dimensions to represent our users. We can use an arbitrary number of dimensions to represent our users. A larger number of dimensions would allow us to capture the traits of each user more accurately, at the cost of model complexity. In our code, we'll use 8 dimensions (which we will see later).

### Learned Embeddings

Similarly, we will use a separate item embedding layer to represent the traits of the items (i.e. movies) in a lower dimensional space.

You might be wondering, how can we learn the weights of the embedding layer, such that it provides an accurate representation of users and items? In our previous example, we used Bob and Joe's preference for action and romance movies to manually create our embedding. Is there a way to learn such preferences automatically?

The answer is **Collaborative Filtering** - by using the ratings dataset, we can identify similar users and movies, creating user and item embeddings learned from existing ratings.

### Model Architecture

Now that we have a better understanding of embeddings, we are ready to define the model architecture. As you'll see, the user and item embeddings are key to the model.

<!-- ![NCF](https://i.imgur.com/EZh1HHf.png)
 -->

Let's walk through the model architecture using the following training sample:

| userId | movieId | interacted |
|-|-|-|
| 3 | 1 | 1 |


![](https://i.imgur.com/cNWbIce.png)


The inputs to the model are the one-hot encoded user and item vector for `userId = 3` and `movieId = 1`. Because this is a positive sample (movie actually rated by the user), the true label (`interacted`) is 1.

The user input vector and item input vector are fed to the user embedding and item embedding respectively, which results in smaller, denser user and item vectors.

The embedded user and item vectors are concatenated before passing through a series of fully connected layers, which map the concatenated embeddings into a prediction as output. Finally, we apply a `Sigmoid` function to turn that output into a probability between 0 and 1 that the user interacts with the item. In the example above, the predicted probability of the positive class is 0.8, so the most probable class is 1 (positive class), since 0.8 > 0.2.
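
The one-hot description is conceptual: multiplying a one-hot vector by an embedding weight matrix simply selects one row, which is why embedding layers such as `nn.Embedding` take integer ids as input. The NumPy sketch below illustrates this equivalence with random stand-in weights; the toy sizes and the seed are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)           # arbitrary seed
num_users, num_items, dim = 5, 4, 8      # toy sizes; dim=8 mirrors the embedding size used here

# Random stand-ins for learned embedding weights.
user_matrix = rng.normal(size=(num_users, dim))
item_matrix = rng.normal(size=(num_items, dim))

user_id, movie_id = 3, 1                 # the training sample from the table above

# One-hot view: multiplying a one-hot vector by the weight matrix...
one_hot_user = np.zeros(num_users)
one_hot_user[user_id] = 1.0
via_matmul = one_hot_user @ user_matrix

# ...selects exactly one row, which is what an index lookup does directly.
via_lookup = user_matrix[user_id]
print(np.allclose(via_matmul, via_lookup))

# Concatenating the user and item vectors gives the input to the first dense layer.
concatenated = np.concatenate([via_lookup, item_matrix[movie_id]])
print(concatenated.shape)
```

The concatenated vector has 8 + 8 = 16 entries, which is why the first fully connected layer takes 16 input features.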


Now, let's define this NCF model using PyTorch Lightning!

In [21]:
class NCF(pl.LightningModule):
    """ Neural Collaborative Filtering (NCF)

        Args:
            num_users (int): Number of unique users
            num_items (int): Number of unique items
            ratings (pd.DataFrame): Dataframe containing the movie ratings for training
            all_movieIds (list): List containing all movieIds (train + test)
    """

    def __init__(self, num_users, num_items, ratings, all_movieIds):
        super().__init__()
        self.user_embedding = nn.Embedding(num_embeddings=num_users, embedding_dim=8)
        self.item_embedding = nn.Embedding(num_embeddings=num_items, embedding_dim=8)
        self.fc1 = nn.Linear(in_features=16, out_features=64)
        self.fc2 = nn.Linear(in_features=64, out_features=32)
        self.output = nn.Linear(in_features=32, out_features=1)
        self.ratings = ratings
        self.all_movieIds = all_movieIds

    def forward(self, user_input, item_input):

        # Pass through embedding layers
        user_embedded = self.user_embedding(user_input)
        item_embedded = self.item_embedding(item_input)

        # Concat the two embedding layers
        vector = torch.cat([user_embedded, item_embedded], dim=-1)

        # Pass through dense layer
        vector = nn.ReLU()(self.fc1(vector))
        vector = nn.ReLU()(self.fc2(vector))

        # Output layer
        pred = nn.Sigmoid()(self.output(vector))

        return pred

    def training_step(self, batch, batch_idx):
        user_input, item_input, labels = batch
        predicted_labels = self(user_input, item_input)
        loss = nn.BCELoss()(predicted_labels, labels.view(-1, 1).float())
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())

    def train_dataloader(self):
        return DataLoader(MovieLensTrainDataset(self.ratings, self.all_movieIds),
                          batch_size=512, num_workers=4)

We instantiate the NCF model using the class that we have defined above.

In [22]:
num_users = ratings['userId'].max()+1
num_items = ratings['movieId'].max()+1

all_movieIds = ratings['movieId'].unique()

model = NCF(num_users, num_items, train_ratings, all_movieIds)

Let's train our NCF model for 3 epochs using the GPU. Older PyTorch Lightning releases accepted the Trainer argument `reload_dataloaders_every_epoch=True`, which creates a new randomly chosen set of negative samples for each epoch so that the model is not biased by one particular selection of negative samples. In PyTorch Lightning 2.x, the equivalent argument is `reload_dataloaders_every_n_epochs=1`. The cell below does not pass it, so the negative samples are drawn once and reused across epochs.

Note: One advantage of [PyTorch Lightning](https://github.com/PyTorchLightning/pytorch-lightning) over vanilla PyTorch is that you don't need to write your own boilerplate training code. Notice how the [Trainer](https://pytorch-lightning.readthedocs.io/en/latest/trainer.html) class allows us to train our model with just a few lines of code.

In [23]:
trainer = pl.Trainer(max_epochs=3)

trainer.fit(model)

INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.callbacks.model_summary:
  | Name           | Type      | Params
---------------------------------------------
0 | user_embedding | Embedding | 1.1 M 
1 | item_embedding | Embedding | 1.1 M 
2 | fc1            | Linear    | 1.1 K 
3 | fc2            | Linear    | 2.1 K 
4 | output         | Linear    | 33    
---------------------------------------------
2.2 M     Trainable params
0         Non-trainable params
2.2 M     Total params
8.645     Total estimated model params size (MB)


Training: |          | 0/? [00:00<?, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=3` reached.


# Evaluating our Recommender System

Now that our model is trained, we are ready to evaluate it using the test data. In traditional Machine Learning projects, we evaluate our models using metrics such as Accuracy (for classification problems) and RMSE (for regression problems). However, such metrics are too simplistic for evaluating recommender systems.

To design a good metric for evaluating recommender systems, we need to first understand how modern recommender systems are used.

Looking at Netflix, we see a list of recommendations like the one below:

![](https://i.imgur.com/5QRWcYy.jpg)

Similarly, Amazon uses a list of recommendations:

![](https://i.imgur.com/XZZ2Ni8.png)

The key here is that we don't need the user to interact with *every* single item in the list of recommendations. Instead, we just need the user to interact with at least one item on the list. As long as the user does that, the recommendations have worked.

To simulate this, let's run the following evaluation protocol to generate a list of 10 recommended items for each user.

* For each user, randomly select 99 items that the user **has not interacted with**
* Combine these 99 items with the test item (the actual item that the user interacted with). We now have 100 items.
* Run the model on these 100 items, and rank them according to their predicted probabilities
* Select the top 10 items from the list of 100 items. If the test item is present within the top 10 items, then we say that this is a hit.
* Repeat the process for all users. The Hit Ratio is then the fraction of users for whom we got a hit.

This evaluation protocol is known as **Hit Ratio @ 10**, and it is commonly used to evaluate recommender systems.
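
The per-user check in this protocol can be sketched in a few lines. The helper name `hit_at_k` and the scores below are made up for illustration; in the real evaluation the scores come from the trained model.

```python
import numpy as np

def hit_at_k(scores, test_index, k=10):
    """Return 1 if the candidate at test_index is among the k highest scores, else 0."""
    top_k = np.argsort(scores)[::-1][:k]
    return int(test_index in top_k)

# Made-up scores for 100 candidates; the held-out test item sits at the last position.
rng = np.random.default_rng(0)            # arbitrary seed
scores = rng.uniform(0.0, 0.5, size=100)

scores[-1] = 0.9                          # test item scored highest -> hit
hit_a = hit_at_k(scores, test_index=99)

scores[-1] = -1.0                         # test item scored lowest -> miss
hit_b = hit_at_k(scores, test_index=99)

# The Hit Ratio is the mean of the per-user hits.
print(hit_a, hit_b, np.mean([hit_a, hit_b]))
```

For context, a scorer that ranks the 100 candidates uniformly at random would place the test item in the top 10 with probability 10/100 = 0.10, so that is the baseline a useful model should beat under this protocol.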

### Hit Ratio @ 10

Now, let's evaluate our model using the described protocol.

In [24]:
# User-item pairs for testing
test_user_item_set = set(zip(test_ratings['userId'], test_ratings['movieId']))

# Dict of all items that are interacted with by each user
user_interacted_items = ratings.groupby('userId')['movieId'].apply(list).to_dict()

hits = []
for (u,i) in tqdm(test_user_item_set):
    interacted_items = user_interacted_items[u]
    not_interacted_items = set(all_movieIds) - set(interacted_items)
    # Sample 99 distinct items that the user has not interacted with
    selected_not_interacted = list(np.random.choice(list(not_interacted_items), 99, replace=False))
    test_items = selected_not_interacted + [i]

    predicted_labels = np.squeeze(model(torch.tensor([u]*100),
                                        torch.tensor(test_items)).detach().numpy())

    # Use a separate index name (j) so it is not confused with the test item i
    top10_items = [test_items[j] for j in np.argsort(predicted_labels)[::-1][0:10].tolist()]

    if i in top10_items:
        hits.append(1)
    else:
        hits.append(0)

print("The Hit Ratio @ 10 is {:.2f}".format(np.average(hits)))

  0%|          | 0/41547 [00:00<?, ?it/s]

The Hit Ratio @ 10 is 0.84


We got a pretty good Hit Ratio @ 10 score! To put this into context, what this means is that 84% of the users were recommended the actual item (among a list of 10 items) that they eventually interacted with. Not bad!