# Comparing matrix factorization with transformers for MovieLens recommendations using PyTorch-accelerated.

Reference: Chris Hughes, https://medium.com/data-science-at-microsoft/comparing-matrix-factorization-with-transformers-for-movielens-recommendations-using-8e3cd3ec8bd8

In [5]:
!pip install --user statsmodels
!pip install --user torchmetrics
!pip install --user pytorch-accelerated

Collecting statsmodels
  Downloading statsmodels-0.14.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (10.8 MB)
Collecting patsy>=0.5.4
  Downloading patsy-0.5.6-py2.py3-none-any.whl (233 kB)
Collecting scipy!=1.9.2,>=1.4
  Downloading scipy-1.12.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (38.5 MB)
Installing collected packages: scipy, patsy, statsmodels
Successfully installed patsy-0.5.6 scipy-1.12.0 statsmodels-0.14.1
Collecting torchmetrics
  Downloading torchmetrics-1.3.2-py3-none-any.whl (841 kB)

In [1]:
!cd
!rm -rf .local
!ln -s /storage/config/.local/

In [15]:
import torchmetrics

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import torch
from torch import nn
import seaborn as sns
from IPython.display import Image, display
from statsmodels.distributions.empirical_distribution import ECDF

In [12]:
from pathlib import Path

In [36]:
from scipy.sparse import csr_matrix
from implicit.als import AlternatingLeastSquares

ModuleNotFoundError: No module named 'implicit'

## Download Data

We use the MovieLens `ml-latest-small` dataset, which contains roughly 100,000 ratings.

### Download MovieLens (ml-latest-small)


In [5]:
! wget http://files.grouplens.org/datasets/movielens/ml-latest-small.zip
! unzip ml-latest-small.zip

--2024-03-30 15:04:49--  http://files.grouplens.org/datasets/movielens/ml-latest-small.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 978202 (955K) [application/zip]
Saving to: ‘ml-latest-small.zip’


2024-03-30 15:04:49 (1.39 MB/s) - ‘ml-latest-small.zip’ saved [978202/978202]

Archive:  ml-latest-small.zip
   creating: ml-latest-small/
  inflating: ml-latest-small/links.csv  
  inflating: ml-latest-small/tags.csv  
  inflating: ml-latest-small/ratings.csv  
  inflating: ml-latest-small/README.txt  
  inflating: ml-latest-small/movies.csv  



# MovieLens Dataset Documentation
## Overview
This dataset (ml-latest-small) describes 5-star rating and free-text tagging activity from MovieLens, a movie recommendation service. It contains 100836 ratings and 3683 tag applications across 9742 movies. These data were created by 610 users between March 29, 1996 and September 24, 2018. This dataset was generated on September 26, 2018.

Users were selected at random for inclusion. All selected users had rated at least 20 movies. No demographic information is included. Each user is represented by an id, and no other information is provided.

The data are contained in the files links.csv, movies.csv, ratings.csv and tags.csv. More details about the contents and use of all these files follow.

### User Ids

- MovieLens users were selected at random for inclusion.
- Their ids have been anonymized.
- User ids are consistent between `ratings.csv` and `tags.csv` (i.e., the same id refers to the same user across the two files).

### Movie Ids

- Only movies with at least one rating or tag are included in the dataset.
- These movie ids are consistent with those used on the MovieLens web site (e.g., id 1 corresponds to the URL <https://movielens.org/movies/1>).
- Movie ids are consistent between `ratings.csv`, `tags.csv`, `movies.csv`, and `links.csv` (i.e., the same id refers to the same movie across these four data files).

### Files and Their Structures

#### Ratings Data File Structure (`ratings.csv`)

- Contains all user movie ratings.
- **Format**: `userId,movieId,rating,timestamp`
- **Order**: First by `userId`, then by `movieId`.
- **Ratings**: On a 5-star scale, with half-star increments (0.5 stars - 5.0 stars).
- **Timestamps**: Represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.
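
As a small illustration of the timestamp format, the sketch below converts one value, 964982703 (the first timestamp in the `ratings` preview shown later in this notebook), into a UTC datetime using only the standard library. The helper name `to_utc_datetime` is our own and is not part of the dataset documentation.

```python
from datetime import datetime, timezone

def to_utc_datetime(unix_seconds):
    """Convert seconds since the Unix epoch into a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc)

first_rating_time = to_utc_datetime(964982703)
print(first_rating_time.isoformat())
```

For a whole column, `pd.to_datetime(ratings["timestamp"], unit="s")` should give the equivalent result in pandas.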

#### Tags Data File Structure (`tags.csv`)

- Contains all user-generated tags for movies.
- **Format**: `userId,movieId,tag,timestamp`
- **Order**: First by `userId`, then by `movieId`.
- **Tags**: Typically a single word or short phrase, with meaning determined by the user.
- **Timestamps**: Represent seconds since midnight UTC of January 1, 1970.

#### Movies Data File Structure (`movies.csv`)

- Contains information on movies.
- **Format**: `movieId,title,genres`
- **Title**: Includes the year of release in parentheses. May contain errors or inconsistencies.
- **Genres**: A pipe-separated list, from a predefined set including Action, Comedy, Drama, etc., or "(no genres listed)".
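
The pipe-separated genres field can be turned into a Python list with a plain string split. The sketch below is one possible way to do it; the helper name `parse_genres` and the choice to map "(no genres listed)" to an empty list are our assumptions, not part of the dataset documentation.

```python
def parse_genres(genres_field):
    """Split a pipe-separated genres string into a list of genre names."""
    # Assumption: treat the placeholder value as "no genres" rather than a genre
    if genres_field == "(no genres listed)":
        return []
    return genres_field.split("|")

print(parse_genres("Adventure|Animation|Children|Comedy|Fantasy"))
print(parse_genres("(no genres listed)"))
```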

#### Links Data File Structure (`links.csv`)

- Contains identifiers linking to other movie data sources.
- **Format**: `movieId,imdbId,tmdbId`
- **movieId**: Identifier used by <https://movielens.org>.
- **imdbId**: Identifier for movies used by <http://www.imdb.com>.
- **tmdbId**: Identifier for movies used by <https://www.themoviedb.org>.



## Load Data

In [13]:
dataset_path = Path('ml-latest-small')

tags = pd.read_csv(
    dataset_path/"tags.csv",
    sep=",",
)
ratings = pd.read_csv(
    dataset_path/"ratings.csv",
    sep=",",
)

movies = pd.read_csv(
    dataset_path/"movies.csv",
    sep=","
)
links = pd.read_csv(
    dataset_path/"links.csv",
    sep=","
)

In [32]:
links

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0
...,...,...,...
9737,193581,5476944,432131.0
9738,193583,5914996,445030.0
9739,193585,6397426,479308.0
9740,193587,8391976,483455.0


In [33]:
tags

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200
...,...,...,...,...
3678,606,7382,for katie,1171234019
3679,606,7936,austere,1173392334
3680,610,3265,gun fu,1493843984
3681,610,3265,heroic bloodshed,1493843978


In [34]:
movies

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy
9738,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy
9739,193585,Flint (2017),Drama
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation


In [35]:
ratings

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
...,...,...,...,...
100831,610,166534,4.0,1493848402
100832,610,168248,5.0,1493850091
100833,610,168250,5.0,1494273047
100834,610,168252,5.0,1493846352


In [17]:
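# ratings.csv names its columns `userId` and `timestamp`; rename them to the
# names used in the rest of the notebook before selecting the columns we need
ratings_df = (
    pd.merge(ratings, movies)
    .rename(columns={"userId": "user_id", "timestamp": "unix_timestamp"})
    [["user_id", "title", "rating", "unix_timestamp"]]
)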

In [18]:
ratings_df.head(1)

Unnamed: 0,user_id,title,rating,unix_timestamp
0,1,One Flew Over the Cuckoo's Nest (1975),5,978300760


In [19]:
ratings_df["user_id"] = ratings_df["user_id"].astype(str)

In [20]:
ratings_df.head(3)

Unnamed: 0,user_id,title,rating,unix_timestamp
0,1,One Flew Over the Cuckoo's Nest (1975),5,978300760
1,2,One Flew Over the Cuckoo's Nest (1975),5,978298413
2,12,One Flew Over the Cuckoo's Nest (1975),4,978220179


In [21]:
ratings_df.dtypes

user_id           object
title             object
rating             int64
unix_timestamp     int64
dtype: object

Using pandas, we can print some high-level statistics about the dataset, which may be useful to us.

In [22]:
ratings_per_user = ratings_df.groupby('user_id').rating.count()
ratings_per_item = ratings_df.groupby('title').rating.count()

print(f"Total No. of users: {len(ratings_df.user_id.unique())}")
print(f"Total No. of items: {len(ratings_df.title.unique())}")
print("\n")

print(f"Max observed rating: {ratings_df.rating.max()}")
print(f"Min observed rating: {ratings_df.rating.min()}")
print("\n")

print(f"Max no. of user ratings: {ratings_per_user.max()}")
print(f"Min no. of user ratings: {ratings_per_user.min()}")
print(f"Median no. of ratings per user: {ratings_per_user.median()}")
print("\n")

print(f"Max no. of item ratings: {ratings_per_item.max()}")
print(f"Min no. of item ratings: {ratings_per_item.min()}")
print(f"Median no. of ratings per item: {ratings_per_item.median()}")


Total No. of users: 6040
Total No. of items: 3706


Max observed rating: 5
Min observed rating: 1


Max no. of user ratings: 2314
Min no. of user ratings: 20
Median no. of ratings per user: 96.0


Max no. of item ratings: 3428
Min no. of item ratings: 1
Median no. of ratings per item: 123.5


### Splitting into training and validation sets

In [23]:
def get_last_n_ratings_by_user(
    df, n, min_ratings_per_user=1, user_colname="user_id", timestamp_colname="unix_timestamp"
):
    return (
        df.groupby(user_colname)
        .filter(lambda x: len(x) >= min_ratings_per_user)
        .sort_values(timestamp_colname)
        .groupby(user_colname)
        .tail(n)
        .sort_values(user_colname)
    )

In [24]:
get_last_n_ratings_by_user(ratings_df, 1)

Unnamed: 0,user_id,title,rating,unix_timestamp
28501,1,Pocahontas (1995),5,978824351
482398,10,Hero (1992),5,980638688
800008,100,Apocalypse Now (1979),2,977594963
496041,1000,"Streetcar Named Desire, A (1951)",5,975042421
305563,1001,Austin Powers: The Spy Who Shagged Me (1999),2,1028605534
...,...,...,...,...
767773,995,French Kiss (1995),3,975099776
573889,996,Almost Famous (2000),5,1001227064
76463,997,Gladiator (2000),4,978915132
998801,998,See the Sea (Regarde la mer) (1997),5,975192573


In [25]:
def mark_last_n_ratings_as_validation_set(
    df, n, min_ratings=1, user_colname="user_id", timestamp_colname="unix_timestamp"
):
    """
    Mark the chronologically last n ratings as the validation set.
    This is done by adding the additional 'is_valid' column to the df.
    :param df: a DataFrame containing user item ratings
    :param n: the number of ratings to include in the validation set
    :param min_ratings: only include users with at least this many ratings
    :param user_colname: the name of the column containing user ids
    :param timestamp_colname: the name of the column containing the timestamps
    :return: the same df with the additional 'is_valid' column added
    """
    df["is_valid"] = False
    df.loc[
        get_last_n_ratings_by_user(
            df,
            n,
            min_ratings,
            user_colname=user_colname,
            timestamp_colname=timestamp_colname,
        ).index,
        "is_valid",
    ] = True

    return df

We hold out the chronologically last two ratings of each user as the validation set:

In [26]:
mark_last_n_ratings_as_validation_set(ratings_df, 2)

Unnamed: 0,user_id,title,rating,unix_timestamp,is_valid
0,1,One Flew Over the Cuckoo's Nest (1975),5,978300760,False
1,2,One Flew Over the Cuckoo's Nest (1975),5,978298413,False
2,12,One Flew Over the Cuckoo's Nest (1975),4,978220179,False
3,15,One Flew Over the Cuckoo's Nest (1975),4,978199279,False
4,17,One Flew Over the Cuckoo's Nest (1975),5,978158471,False
...,...,...,...,...,...
1000204,5949,Modulations (1998),5,958846401,False
1000205,5675,Broken Vessels (1998),3,976029116,False
1000206,5780,White Boys (1999),1,958153068,False
1000207,5851,One Little Indian (1973),5,957756608,False


In [27]:
ratings_df.head(3)

Unnamed: 0,user_id,title,rating,unix_timestamp,is_valid
0,1,One Flew Over the Cuckoo's Nest (1975),5,978300760,False
1,2,One Flew Over the Cuckoo's Nest (1975),5,978298413,False
2,12,One Flew Over the Cuckoo's Nest (1975),4,978220179,False


In [28]:
train_df = ratings_df[~ratings_df.is_valid]
valid_df = ratings_df[ratings_df.is_valid]

In [29]:
len(valid_df)

12080

In [30]:
len(train_df)

988129

## Creating a Baseline Model

As a naive baseline, we predict the median training rating for every user-item pair and measure the error on the validation set.


In [31]:
import math
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Naive baseline: predict the median training rating for every validation example
median_rating = train_df.rating.median()
predictions = np.array([median_rating] * len(valid_df))

mae = mean_absolute_error(valid_df.rating, predictions)
mse = mean_squared_error(valid_df.rating, predictions)
rmse = math.sqrt(mse)

print(f'mae: {mae}')
print(f'mse: {mse}')
print(f'rmse: {rmse}')

4.0

## Matrix factorization with bias

One very popular approach toward recommendations, both in academia and industry, is matrix factorization.

In addition to representing ratings in a table, such as our DataFrame, we can represent a set of user-item ratings as a matrix, with one row per user and one column per movie. We can visualize this on a sample of our data:

In [33]:
ratings_df[((ratings_df.user_id == '1') | 
            (ratings_df.user_id == '2')| 
            (ratings_df.user_id == '4')) 
           & ((ratings_df.title == "One Flew Over the Cuckoo's Nest (1975)") | 
              (ratings_df.title == "To Kill a Mockingbird (1962)")| 
              (ratings_df.title == "Saving Private Ryan (1998)"))].pivot_table('rating', index='user_id', columns='title').fillna('?')

title,One Flew Over the Cuckoo's Nest (1975),Saving Private Ryan (1998),To Kill a Mockingbird (1962)
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,5.0,5.0,4.0
2,5.0,4.0,4.0
4,?,5.0,?


In [34]:
# Map each user id to an integer index starting at 1 (index 0 is reserved for padding)
user_lookup = {v: i+1 for i, v in enumerate(ratings_df['user_id'].unique())}

In [35]:
user_lookup.get('1000')

3093

In [36]:
user_lookup.get('100')

3629

In [37]:
ratings_df['user_id'].unique()

array(['1', '2', '12', ..., '2982', '3893', '4211'], dtype=object)

In [39]:
movie_lookup = {v: i+1 for i, v in enumerate(ratings_df['title'].unique())}

In [40]:
list(movie_lookup.keys())[:5]

["One Flew Over the Cuckoo's Nest (1975)",
 'James and the Giant Peach (1996)',
 'My Fair Lady (1964)',
 'Erin Brockovich (2000)',
 "Bug's Life, A (1998)"]

In [41]:
movie_lookup.get("Bug's Life, A (1998)")

5

Now that we can encode users and movies as integer indices, we need to define a PyTorch `Dataset` that wraps our DataFrame and returns each encoded user-item pair together with its rating.

In [42]:
from torch.utils.data import Dataset
class UserItemRatingDataset(Dataset):
    def __init__(self, df, movie_lookup, user_lookup):
        self.df = df
        self.movie_lookup = movie_lookup
        self.user_lookup = user_lookup

    def __getitem__(self, index):
        row = self.df.iloc[index]
        user_id = self.user_lookup[row.user_id]
        movie_id = self.movie_lookup[row.title]
        
        rating = torch.tensor(row.rating, dtype=torch.float32)
        
        return (user_id, movie_id), rating

    def __len__(self):
        return len(self.df)

In [43]:
train_df.head(2)

Unnamed: 0,user_id,title,rating,unix_timestamp,is_valid
0,1,One Flew Over the Cuckoo's Nest (1975),5,978300760,False
1,2,One Flew Over the Cuckoo's Nest (1975),5,978298413,False


We can now use this to create our training and validation datasets:

In [44]:
train_dataset = UserItemRatingDataset(train_df, movie_lookup, user_lookup)
valid_dataset = UserItemRatingDataset(valid_df, movie_lookup, user_lookup)

In [45]:
len(train_dataset)

988129

In [46]:
len(valid_dataset)

12080

In [47]:
train_dataset[0]

((1, 1), tensor(5.))

Next, let's define the model.
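
In equation form, and using notation that is ours rather than the source's, the model in the following cell computes a dot product of a user vector and an item vector, adds a bias for each, and optionally squashes the result into the ratings range:

$$ \hat{r}_{ui} = r_{\min} + (r_{\max} - r_{\min}) \, \sigma\left(p_u \cdot q_i + b_u + b_i\right) $$

Here $p_u$ and $q_i$ are the learned `n_factors`-dimensional embeddings for user $u$ and item $i$, $b_u$ and $b_i$ are the learned scalar biases, $\sigma$ is the sigmoid function, and $(r_{\min}, r_{\max})$ is `ratings_range`. When `ratings_range` is `None`, the model returns the raw score $p_u \cdot q_i + b_u + b_i$ instead.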

In [48]:
class MfDotBias(nn.Module):

    def __init__(
        self, n_factors, n_users, n_items, ratings_range=None, use_biases=True
    ):
        super().__init__()
        self.bias = use_biases
        self.y_range = ratings_range
        self.user_embedding = nn.Embedding(n_users+1, n_factors, padding_idx=0)
        self.item_embedding = nn.Embedding(n_items+1, n_factors, padding_idx=0)

        if use_biases:
            self.user_bias = nn.Embedding(n_users+1, 1, padding_idx=0)
            self.item_bias = nn.Embedding(n_items+1, 1, padding_idx=0)

    def forward(self, inputs):
        users, items = inputs
        dot = self.user_embedding(users) * self.item_embedding(items)
        result = dot.sum(1)
        if self.bias:
            result = (
                result + self.user_bias(users).squeeze() + self.item_bias(items).squeeze()
            )

        if self.y_range is None:
            return result
        else:
            return (
                torch.sigmoid(result) * (self.y_range[1] - self.y_range[0])
                # sigmoid squashes the score into (0, 1); scale and shift it into the ratings range
                + self.y_range[0]
            )
        


The sigmoid function used in the model is:

$$ \mathrm{out}_i = \frac{1}{1 + e^{-\mathrm{input}_i}} $$
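
To see how the rescaling in `forward` behaves, the sketch below reproduces the same arithmetic in plain Python for a few raw scores, using the range `[0.5, 5.5]` that is passed as `ratings_range` during training. The function name `scaled_sigmoid` is ours. Because the sigmoid only approaches 0 and 1 asymptotically, a range slightly wider than the observed ratings presumably keeps the extreme ratings reachable.

```python
import math

def scaled_sigmoid(x, low=0.5, high=5.5):
    """Squash a raw score into the open interval (low, high)."""
    return 1.0 / (1.0 + math.exp(-x)) * (high - low) + low

# Large negative scores approach `low`, large positive scores approach `high`,
# and a raw score of 0 lands exactly on the midpoint of the range.
for raw in (-10.0, 0.0, 10.0):
    print(f"{raw:+.1f} -> {scaled_sigmoid(raw):.3f}")
```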

### Train

In [49]:
from functools import partial

from pytorch_accelerated import Trainer, notebook_launcher 
from pytorch_accelerated.trainer import TrainerPlaceholderValues, DEFAULT_CALLBACKS
from pytorch_accelerated.callbacks import EarlyStoppingCallback, SaveBestModelCallback, TrainerCallback, StopTrainingError
import torchmetrics

In [50]:
Trainer

pytorch_accelerated.trainer.Trainer

In [51]:
class RecommenderMetricsCallback(TrainerCallback):
    def __init__(self):
        self.metrics = torchmetrics.MetricCollection(
            {
                "mse": torchmetrics.MeanSquaredError(),
                "mae": torchmetrics.MeanAbsoluteError(),
            }
        )

    def _move_to_device(self, trainer):
        self.metrics.to(trainer.device)

    def on_training_run_start(self, trainer, **kwargs):
        self._move_to_device(trainer)

    def on_evaluation_run_start(self, trainer, **kwargs):
        self._move_to_device(trainer)

    def on_eval_step_end(self, trainer, batch, batch_output, **kwargs):
        preds = batch_output["model_outputs"]
        self.metrics.update(preds, batch[1])

    def on_eval_epoch_end(self, trainer, **kwargs):
        metrics = self.metrics.compute()
        
        mse = metrics["mse"].cpu()
        trainer.run_history.update_metric("mae", metrics["mae"].cpu())
        trainer.run_history.update_metric("mse", mse)
        trainer.run_history.update_metric("rmse",  math.sqrt(mse))

        self.metrics.reset()

In [53]:
def train_mf_model():
    model = MfDotBias(
        120, len(user_lookup), len(movie_lookup), ratings_range=[0.5, 5.5]
    )
    loss_func = torch.nn.MSELoss()

    optimizer = torch.optim.AdamW(model.parameters(), lr=0.01)

    create_sched_fn = partial(
        torch.optim.lr_scheduler.OneCycleLR,
        max_lr=0.01,
        epochs=TrainerPlaceholderValues.NUM_EPOCHS,
        steps_per_epoch=TrainerPlaceholderValues.NUM_UPDATE_STEPS_PER_EPOCH,
    )

    trainer = Trainer(
        model=model,
        loss_func=loss_func,
        optimizer=optimizer,
        callbacks=(
            RecommenderMetricsCallback,
            *DEFAULT_CALLBACKS,
            SaveBestModelCallback(watch_metric="mae"),
            EarlyStoppingCallback(
                early_stopping_patience=1,
                early_stopping_threshold=0.001,
                watch_metric="mae",
            ),
        ),
    )

    trainer.train(
        train_dataset=train_dataset,
        eval_dataset=valid_dataset,
        num_epochs=20,
        per_device_batch_size=512,
        create_scheduler_fn=create_sched_fn,
    )


In [54]:
notebook_launcher(train_mf_model, num_processes=1)

Launching training on one GPU.

Starting training run

Starting epoch 1


100%|██████████| 1930/1930 [00:48<00:00, 40.16it/s]



train_loss_epoch: 6.900541305541992


100%|██████████| 24/24 [00:01<00:00, 17.49it/s]



mae: 2.2829654216766357

rmse: 2.636351609277347

mse: 6.950349807739258

eval_loss_epoch: 6.950349807739258

Starting epoch 2


100%|██████████| 1930/1930 [00:44<00:00, 43.16it/s]



train_loss_epoch: 6.414422512054443


100%|██████████| 24/24 [00:01<00:00, 21.96it/s]



mae: 2.224461793899536

rmse: 2.583437907245549

mse: 6.674151420593262

eval_loss_epoch: 6.6741509437561035

Improvement of 0.05850362777709961 observed, resetting counter. 
Early stopping counter: 0/1

Starting epoch 3


100%|██████████| 1930/1930 [00:47<00:00, 40.25it/s]



train_loss_epoch: 5.7183427810668945


100%|██████████| 24/24 [00:01<00:00, 18.63it/s]



mae: 2.1058084964752197

rmse: 2.470480686472798

mse: 6.103274822235107

eval_loss_epoch: 6.103274822235107

Improvement of 0.1186532974243164 observed, resetting counter. 
Early stopping counter: 0/1

Starting epoch 4


100%|██████████| 1930/1930 [00:48<00:00, 39.40it/s]



train_loss_epoch: 4.713072776794434


100%|██████████| 24/24 [00:01<00:00, 20.93it/s]



mae: 1.835720181465149

rmse: 2.2003604506934935

mse: 4.841586112976074

eval_loss_epoch: 4.841585636138916

Improvement of 0.2700883150100708 observed, resetting counter. 
Early stopping counter: 0/1

Starting epoch 5


100%|██████████| 1930/1930 [00:41<00:00, 46.36it/s]



train_loss_epoch: 3.4956231117248535


100%|██████████| 24/24 [00:01<00:00, 15.06it/s]



mae: 1.6005858182907104

rmse: 1.9537655199719464

mse: 3.81719970703125

eval_loss_epoch: 3.81719970703125

Improvement of 0.23513436317443848 observed, resetting counter. 
Early stopping counter: 0/1

Starting epoch 6


100%|██████████| 1930/1930 [00:49<00:00, 38.68it/s]



train_loss_epoch: 2.4345502853393555


100%|██████████| 24/24 [00:01<00:00, 19.83it/s]



mae: 1.3266347646713257

rmse: 1.6710259964642606

mse: 2.792327880859375

eval_loss_epoch: 2.792328119277954

Improvement of 0.27395105361938477 observed, resetting counter. 
Early stopping counter: 0/1

Starting epoch 7


100%|██████████| 1930/1930 [00:47<00:00, 40.64it/s]



train_loss_epoch: 1.4527050256729126


100%|██████████| 24/24 [00:01<00:00, 19.99it/s]



mae: 1.086105465888977

rmse: 1.4066936640762648

mse: 1.9787870645523071

eval_loss_epoch: 1.9787873029708862

Improvement of 0.24052929878234863 observed, resetting counter. 
Early stopping counter: 0/1

Starting epoch 8


100%|██████████| 1930/1930 [00:49<00:00, 38.65it/s]



train_loss_epoch: 1.013391137123108


100%|██████████| 24/24 [00:01<00:00, 19.34it/s]



mae: 0.9966462254524231

rmse: 1.3007130647951188

mse: 1.691854476928711

eval_loss_epoch: 1.6918545961380005

Improvement of 0.08945924043655396 observed, resetting counter. 
Early stopping counter: 0/1

Starting epoch 9


100%|██████████| 1930/1930 [00:50<00:00, 38.23it/s]



train_loss_epoch: 0.8693548440933228


100%|██████████| 24/24 [00:01<00:00, 17.82it/s]



mae: 0.9598191976547241

rmse: 1.2529181273975913

mse: 1.5698038339614868

eval_loss_epoch: 1.5698038339614868

Improvement of 0.036827027797698975 observed, resetting counter. 
Early stopping counter: 0/1

Starting epoch 10


100%|██████████| 1930/1930 [00:49<00:00, 38.88it/s]



train_loss_epoch: 0.7818276286125183


100%|██████████| 24/24 [00:01<00:00, 16.68it/s]



mae: 0.9281371235847473

rmse: 1.204053842259479

mse: 1.4497456550598145

eval_loss_epoch: 1.449745535850525

Improvement of 0.03168207406997681 observed, resetting counter. 
Early stopping counter: 0/1

Starting epoch 11


100%|██████████| 1930/1930 [00:50<00:00, 38.25it/s]



train_loss_epoch: 0.6833054423332214


100%|██████████| 24/24 [00:01<00:00, 17.29it/s]



mae: 0.9005589485168457

rmse: 1.1742487453708195

mse: 1.3788601160049438

eval_loss_epoch: 1.3788601160049438

Improvement of 0.02757817506790161 observed, resetting counter. 
Early stopping counter: 0/1

Starting epoch 12


100%|██████████| 1930/1930 [00:55<00:00, 34.93it/s]



train_loss_epoch: 0.5848037004470825


100%|██████████| 24/24 [00:01<00:00, 17.37it/s]



mae: 0.8754093647003174

rmse: 1.1397823199529993

mse: 1.2991037368774414

eval_loss_epoch: 1.2991037368774414

Improvement of 0.02514958381652832 observed, resetting counter. 
Early stopping counter: 0/1

Starting epoch 13


100%|██████████| 1930/1930 [00:49<00:00, 39.15it/s]



train_loss_epoch: 0.4860416650772095


100%|██████████| 24/24 [00:01<00:00, 16.67it/s]



mae: 0.8570231795310974

rmse: 1.1097128447697562

mse: 1.2314625978469849

eval_loss_epoch: 1.2314624786376953

Improvement of 0.01838618516921997 observed, resetting counter. 
Early stopping counter: 0/1

Starting epoch 14


100%|██████████| 1930/1930 [00:49<00:00, 39.22it/s]



train_loss_epoch: 0.3910124599933624


100%|██████████| 24/24 [00:01<00:00, 19.12it/s]



mae: 0.8421646356582642

rmse: 1.0928716675054062

mse: 1.1943684816360474

eval_loss_epoch: 1.194368600845337

Improvement of 0.014858543872833252 observed, resetting counter. 
Early stopping counter: 0/1

Starting epoch 15


100%|██████████| 1930/1930 [00:49<00:00, 38.86it/s]



train_loss_epoch: 0.30612730979919434


100%|██████████| 24/24 [00:01<00:00, 16.07it/s]



mae: 0.8347188830375671

rmse: 1.0777848854409553

mse: 1.1616202592849731

eval_loss_epoch: 1.1616202592849731

Improvement of 0.0074457526206970215 observed, resetting counter. 
Early stopping counter: 0/1

Starting epoch 16


100%|██████████| 1930/1930 [00:51<00:00, 37.29it/s]



train_loss_epoch: 0.23469701409339905


100%|██████████| 24/24 [00:01<00:00, 17.87it/s]



mae: 0.8328484892845154

rmse: 1.0766420531535035

mse: 1.1591581106185913

eval_loss_epoch: 1.1591581106185913

Improvement of 0.0018703937530517578 observed, resetting counter. 
Early stopping counter: 0/1

Starting epoch 17


100%|██████████| 1930/1930 [00:50<00:00, 38.46it/s]



train_loss_epoch: 0.17985893785953522


100%|██████████| 24/24 [00:01<00:00, 19.54it/s]


mae: 0.8370133638381958

rmse: 1.0813726080027004

mse: 1.169366717338562

eval_loss_epoch: 1.169366717338562
No improvement above threshold observed, incrementing counter. 
Early stopping counter: 1/1
Stopping training due to no improvement after 1 epochs
Finishing training run
Loading checkpoint with mae: 0.8328484892845154 from epoch 16
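
The early-stopping behaviour visible in the log above can be replayed with a few lines of plain Python. This is a sketch of the general technique with `patience=1` and `threshold=0.001`, not the actual implementation of `EarlyStoppingCallback`; in particular, the rule that an epoch only counts as an improvement when it beats the best value so far by more than the threshold is our assumption. The MAE values are rounded from the log.

```python
def find_early_stop(metric_per_epoch, patience=1, threshold=0.001):
    """Replay a lower-is-better metric; return (stop_epoch, best_epoch, best_value)."""
    best_value = metric_per_epoch[0]
    best_epoch = 1
    counter = 0
    for epoch, value in enumerate(metric_per_epoch[1:], start=2):
        if best_value - value > threshold:
            # Improvement above the threshold: remember it and reset the counter
            best_value, best_epoch, counter = value, epoch, 0
        else:
            counter += 1
            if counter >= patience:
                return epoch, best_epoch, best_value
    # Ran out of epochs without triggering early stopping
    return None, best_epoch, best_value

# Validation MAE per epoch, rounded from the training log
mae_log = [2.2830, 2.2245, 2.1058, 1.8357, 1.6006, 1.3266, 1.0861, 0.9966,
           0.9598, 0.9281, 0.9006, 0.8754, 0.8570, 0.8422, 0.8347, 0.8328, 0.8370]

stop_epoch, best_epoch, best_mae = find_early_stop(mae_log)
print(f"stopped after epoch {stop_epoch}, best epoch {best_epoch}, best MAE {best_mae}")
```

Under these assumptions the replay stops after epoch 17 and keeps epoch 16 as the best checkpoint, which matches the log.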





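Training stopped early after epoch 17, and the checkpoint from epoch 16 was restored, with a validation MAE of about 0.833 and an RMSE of about 1.077. Comparing this to our baseline, we can see that there is an improvement!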

## Sequential recommendations using a transformer

Using matrix factorization, we are treating each rating as being independent from the ratings around it; however, incorporating information about other movies that a user recently rated could provide an additional signal that could boost performance. For example, suppose that a user is watching a trilogy of films; if they have rated the first two instalments highly, it is likely that they may do the same for the finale!

One way to approach this is to use a transformer network, specifically the encoder portion, to encode additional context into the learned embeddings for each movie, and then to use a fully connected neural network to make the rating predictions.

### Pre-processing the data

The first step is to process our data so that we have a time-sorted list of movies for each user. Let's start by grouping all the ratings by user:

In [90]:
grouped_ratings = ratings_df.sort_values(by='unix_timestamp').groupby('user_id').agg(tuple).reset_index()

In [92]:
grouped_ratings

Unnamed: 0,user_id,title,rating,unix_timestamp,is_valid
0,1,"(Girl, Interrupted (1999), Cinderella (1950), ...","(4, 5, 4, 5, 3, 5, 4, 4, 5, 4, 5, 3, 4, 4, 4, ...","(978300019, 978300055, 978300055, 978300055, 9...","(False, False, False, False, False, False, Fal..."
1,10,"(Godfather, The (1972), Pretty Woman (1990), S...","(3, 4, 3, 4, 4, 3, 5, 5, 5, 3, 3, 4, 5, 4, 4, ...","(978224375, 978224375, 978224375, 978224400, 9...","(False, False, False, False, False, False, Fal..."
2,100,"(Starship Troopers (1997), Star Wars: Episode ...","(3, 4, 4, 3, 4, 3, 1, 1, 5, 4, 4, 3, 4, 2, 3, ...","(977593595, 977593595, 977593607, 977593624, 9...","(False, False, False, False, False, False, Fal..."
3,1000,"(Cat on a Hot Tin Roof (1958), Licence to Kill...","(4, 4, 5, 3, 5, 5, 2, 5, 4, 4, 5, 3, 5, 5, 5, ...","(975040566, 975040566, 975040566, 975040629, 9...","(False, False, False, False, False, False, Fal..."
4,1001,"(Raiders of the Lost Ark (1981), Guinevere (19...","(4, 4, 4, 2, 2, 1, 5, 4, 5, 4, 4, 4, 4, 3, 4, ...","(975039591, 975039702, 975039702, 975039898, 9...","(False, False, False, False, False, False, Fal..."
...,...,...,...,...,...
6035,995,"(Six Days Seven Nights (1998), Star Wars: Epis...","(2, 4, 5, 4, 3, 3, 4, 4, 3, 5, 5, 5, 5, 5, 5, ...","(975054785, 975054785, 975054785, 975054853, 9...","(False, False, False, False, False, False, Fal..."
6036,996,"(Nightmare on Elm Street, A (1984), St. Elmo's...","(4, 3, 5, 3, 5, 5, 5, 5, 4, 2, 5, 5, 5, 4, 5, ...","(975052132, 975052132, 975052195, 975052284, 9...","(False, False, False, False, False, False, Fal..."
6037,997,(Star Wars: Episode V - The Empire Strikes Bac...,"(4, 3, 3, 3, 2, 5, 5, 5, 4, 4, 5, 4, 4, 3, 4, ...","(975044235, 975044425, 975044426, 975044426, 9...","(False, False, False, False, False, False, Fal..."
6038,998,"(Butcher's Wife, The (1991), E.T. the Extra-Te...","(3, 5, 4, 5, 3, 4, 4, 3, 4, 4, 4, 4, 4, 5, 4, ...","(975043499, 975043593, 975043593, 975043593, 9...","(False, False, False, False, False, False, Fal..."


Now that we have grouped by user, we can add a column holding the number of ratings associated with each user:

In [94]:
grouped_ratings['num_ratings'] = grouped_ratings['rating'].apply(len)

Let's take a look at the new DataFrame:

In [95]:
grouped_ratings

Unnamed: 0,user_id,title,rating,unix_timestamp,is_valid,num_ratings
0,1,"(Girl, Interrupted (1999), Cinderella (1950), ...","(4, 5, 4, 5, 3, 5, 4, 4, 5, 4, 5, 3, 4, 4, 4, ...","(978300019, 978300055, 978300055, 978300055, 9...","(False, False, False, False, False, False, Fal...",53
1,10,"(Godfather, The (1972), Pretty Woman (1990), S...","(3, 4, 3, 4, 4, 3, 5, 5, 5, 3, 3, 4, 5, 4, 4, ...","(978224375, 978224375, 978224375, 978224400, 9...","(False, False, False, False, False, False, Fal...",401
2,100,"(Starship Troopers (1997), Star Wars: Episode ...","(3, 4, 4, 3, 4, 3, 1, 1, 5, 4, 4, 3, 4, 2, 3, ...","(977593595, 977593595, 977593607, 977593624, 9...","(False, False, False, False, False, False, Fal...",76
3,1000,"(Cat on a Hot Tin Roof (1958), Licence to Kill...","(4, 4, 5, 3, 5, 5, 2, 5, 4, 4, 5, 3, 5, 5, 5, ...","(975040566, 975040566, 975040566, 975040629, 9...","(False, False, False, False, False, False, Fal...",84
4,1001,"(Raiders of the Lost Ark (1981), Guinevere (19...","(4, 4, 4, 2, 2, 1, 5, 4, 5, 4, 4, 4, 4, 3, 4, ...","(975039591, 975039702, 975039702, 975039898, 9...","(False, False, False, False, False, False, Fal...",377
...,...,...,...,...,...,...
6035,995,"(Six Days Seven Nights (1998), Star Wars: Epis...","(2, 4, 5, 4, 3, 3, 4, 4, 3, 5, 5, 5, 5, 5, 5, ...","(975054785, 975054785, 975054785, 975054853, 9...","(False, False, False, False, False, False, Fal...",49
6036,996,"(Nightmare on Elm Street, A (1984), St. Elmo's...","(4, 3, 5, 3, 5, 5, 5, 5, 4, 2, 5, 5, 5, 4, 5, ...","(975052132, 975052132, 975052195, 975052284, 9...","(False, False, False, False, False, False, Fal...",296
6037,997,(Star Wars: Episode V - The Empire Strikes Bac...,"(4, 3, 3, 3, 2, 5, 5, 5, 4, 4, 5, 4, 4, 3, 4, ...","(975044235, 975044425, 975044426, 975044426, 9...","(False, False, False, False, False, False, Fal...",30
6038,998,"(Butcher's Wife, The (1991), E.T. the Extra-Te...","(3, 5, 4, 5, 3, 4, 4, 3, 4, 4, 4, 4, 4, 5, 4, ...","(975043499, 975043593, 975043593, 975043593, 9...","(False, False, False, False, False, False, Fal...",135


Now that we have grouped all the ratings for each user, let's divide these into smaller sequences. To make the most out of the data, we would like the model to have the opportunity to predict a rating for every movie in the training set. To do this, let's specify a sequence length s and use the previous s-1 ratings as our user history.

As the model expects each sequence to be a fixed length, we will fill empty spaces with a padding token, so that sequences can be batched and passed to the model. Let's create a function to do this.

We are going to arbitrarily choose a length of 10 here.

In [96]:
sequence_length = 10

In [97]:
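def create_sequences(values, sequence_length):
    """Build one fixed-length window per element, left-padded with '[PAD]'."""
    sequences = []
    for i, v in enumerate(values):
        # All values up to and including position i
        seq = values[:i+1]
        if len(seq) > sequence_length:
            # Keep only the most recent `sequence_length` values
            seq = seq[i-sequence_length+1:i+1]
        elif len(seq) < sequence_length:
            # Left-pad short histories so every sequence has the same length
            seq = (*(['[PAD]'] * (sequence_length - len(seq))), *seq)

        sequences.append(seq)
    return sequences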
        

To visualize how this function works, let's apply it, with a sequence length of 3, to the first 10 movies rated by the first user. These movies are:

In [98]:
grouped_ratings.iloc[0]['title'][:10]

('Girl, Interrupted (1999)',
 'Cinderella (1950)',
 'Titanic (1997)',
 'Back to the Future (1985)',
 'Meet Joe Black (1998)',
 'Last Days of Disco, The (1998)',
 'Erin Brockovich (2000)',
 'To Kill a Mockingbird (1962)',
 'Christmas Story, A (1983)',
 'Star Wars: Episode IV - A New Hope (1977)')

Applying our function, we have:

In [99]:
create_sequences(grouped_ratings.iloc[0]['title'][:10], 3)

[('[PAD]', '[PAD]', 'Girl, Interrupted (1999)'),
 ('[PAD]', 'Girl, Interrupted (1999)', 'Cinderella (1950)'),
 ('Girl, Interrupted (1999)', 'Cinderella (1950)', 'Titanic (1997)'),
 ('Cinderella (1950)', 'Titanic (1997)', 'Back to the Future (1985)'),
 ('Titanic (1997)', 'Back to the Future (1985)', 'Meet Joe Black (1998)'),
 ('Back to the Future (1985)',
  'Meet Joe Black (1998)',
  'Last Days of Disco, The (1998)'),
 ('Meet Joe Black (1998)',
  'Last Days of Disco, The (1998)',
  'Erin Brockovich (2000)'),
 ('Last Days of Disco, The (1998)',
  'Erin Brockovich (2000)',
  'To Kill a Mockingbird (1962)'),
 ('Erin Brockovich (2000)',
  'To Kill a Mockingbird (1962)',
  'Christmas Story, A (1983)'),
 ('To Kill a Mockingbird (1962)',
  'Christmas Story, A (1983)',
  'Star Wars: Episode IV - A New Hope (1977)')]

As we can see, we have 10 sequences of length 3, where the final movie in the sequence is unchanged from the original list.
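
The same windowing behaviour can be checked on a small numeric input. The following is a minimal standalone sketch, not the notebook's own function: `sliding_windows` and the toy ratings are hypothetical names and values used only for illustration.

```python
PAD = "[PAD]"


def sliding_windows(values, sequence_length):
    """Return one left-padded window per element, each ending at that element."""
    windows = []
    for i in range(len(values)):
        # Take at most `sequence_length` values ending at position i
        window = tuple(values[max(0, i - sequence_length + 1): i + 1])
        # Fill the remaining slots on the left with the padding token
        padding = (PAD,) * (sequence_length - len(window))
        windows.append(padding + window)
    return windows


ratings = (4, 5, 3, 2, 1)  # toy data
windows = sliding_windows(ratings, 3)
for window in windows:
    print(window)
```

Every input element produces exactly one window, and the last entry of each window is the element being predicted, so no rating is lost when the data is split this way.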

Now, let's apply this function to all of the features in our DataFrame:

In [100]:
grouped_cols = ['title', 'rating', 'unix_timestamp', 'is_valid'] 
for col in grouped_cols:
    grouped_ratings[col] = grouped_ratings[col].apply(lambda x: create_sequences(x, sequence_length))

In [101]:
grouped_ratings.head(2)

Unnamed: 0,user_id,title,rating,unix_timestamp,is_valid,num_ratings
0,1,"[([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [P...","[([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [P...","[([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [P...","[([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [P...",53
1,10,"[([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [P...","[([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [P...","[([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [P...","[([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [P...",401


Currently, we have one row that contains all the sequences for a certain user. However, during training, we would like to create batches made up of sequences from many different users. To do this, we will have to transform the data so that each sequence has its own row, while remaining associated with the user ID. We can use the pandas 'explode' function for each feature, and then aggregate these DataFrames together.

In [102]:
exploded_ratings = grouped_ratings[['user_id', 'title']].explode('title', ignore_index=True)
dfs = [grouped_ratings[[col]].explode(col, ignore_index=True) for col in grouped_cols[1:]]
seq_df = pd.concat([exploded_ratings, *dfs], axis=1)

In [103]:
seq_df.head()

Unnamed: 0,user_id,title,rating,unix_timestamp,is_valid
0,1,"([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PA...","([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PA...","([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PA...","([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PA..."
1,1,"([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PA...","([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PA...","([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PA...","([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PA..."
2,1,"([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PA...","([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PA...","([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PA...","([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PA..."
3,1,"([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], Gir...","([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], 4, ...","([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], 978...","([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], Fal..."
4,1,"([PAD], [PAD], [PAD], [PAD], [PAD], Girl, Inte...","([PAD], [PAD], [PAD], [PAD], [PAD], 4, 5, 4, 5...","([PAD], [PAD], [PAD], [PAD], [PAD], 978300019,...","([PAD], [PAD], [PAD], [PAD], [PAD], False, Fal..."
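
To make the explode-and-concatenate step easier to follow, here is a small sketch on toy data. The DataFrame contents and the name `toy_seq_df` are made up for illustration; only the pattern of calls mirrors the cell above.

```python
import pandas as pd

# Two users; each cell holds a list of already-windowed sequences (toy data)
toy = pd.DataFrame({
    "user_id": ["1", "2"],
    "title": [[("[PAD]", "A"), ("A", "B")], [("[PAD]", "C")]],
    "rating": [[("[PAD]", 4), (4, 5)], [("[PAD]", 3)]],
})

# explode unpacks only the outer list, so each tuple becomes its own row
titles = toy[["user_id", "title"]].explode("title", ignore_index=True)
ratings = toy[["rating"]].explode("rating", ignore_index=True)

# Both frames now share the same 0..n-1 index, so they line up column-wise
toy_seq_df = pd.concat([titles, ratings], axis=1)
print(toy_seq_df)
```

This works because every feature column holds the same number of sequences per user, so the exploded frames have identical row order. In pandas 1.3 and later, `DataFrame.explode` also accepts a list of columns, which may be a simpler alternative.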


Now, we can see that each sequence has its own row. However, for the is_valid column, we don't care about the whole sequence and only need the last value, as this corresponds to the movie for which we will be trying to predict the rating. Let's create a function to extract this value and apply it to this column.

In [104]:
def get_last_entry(sequence):
    return sequence[-1]

seq_df['is_valid'] = seq_df['is_valid'].apply(get_last_entry)

In [105]:
seq_df

Unnamed: 0,user_id,title,rating,unix_timestamp,is_valid
0,1,"([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PA...","([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PA...","([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PA...",False
1,1,"([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PA...","([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PA...","([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PA...",False
2,1,"([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PA...","([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PA...","([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PA...",False
3,1,"([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], Gir...","([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], 4, ...","([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], 978...",False
4,1,"([PAD], [PAD], [PAD], [PAD], [PAD], Girl, Inte...","([PAD], [PAD], [PAD], [PAD], [PAD], 4, 5, 4, 5...","([PAD], [PAD], [PAD], [PAD], [PAD], 978300019,...",False
...,...,...,...,...,...
1000204,999,"(General's Daughter, The (1999), Powder (1995)...","(3, 3, 2, 1, 3, 2, 3, 2, 4, 3)","(975364681, 975364717, 975364717, 975364717, 9...",False
1000205,999,"(Powder (1995), We're No Angels (1989), Out of...","(3, 2, 1, 3, 2, 3, 2, 4, 3, 3)","(975364717, 975364717, 975364717, 975364743, 9...",False
1000206,999,"(We're No Angels (1989), Out of Africa (1985),...","(2, 1, 3, 2, 3, 2, 4, 3, 3, 3)","(975364717, 975364717, 975364743, 975364743, 9...",False
1000207,999,"(Out of Africa (1985), Instinct (1999), Corrup...","(1, 3, 2, 3, 2, 4, 3, 3, 3, 2)","(975364717, 975364743, 975364743, 975364784, 9...",False


Also, to make it easy to access the rating that we are trying to predict, let's separate this into its own column.

In [106]:
seq_df['target_rating'] = seq_df['rating'].apply(get_last_entry)
seq_df['previous_ratings'] = seq_df['rating'].apply(lambda seq: seq[:-1])
seq_df.drop(columns=['rating'], inplace=True)

To prevent the model from including padding tokens when calculating attention scores, we can provide an attention mask to the transformer; the mask should be 'True' for a padding token and 'False' otherwise. Let's calculate this for each row, as well as creating a column to show the number of padding tokens present.

In [107]:
seq_df['pad_mask'] = seq_df['title'].apply(lambda x: (np.array(x) == '[PAD]'))
seq_df['num_pads'] = seq_df['pad_mask'].apply(sum)
seq_df['pad_mask'] = seq_df['pad_mask'].apply(lambda x: x.tolist()) # in case we serialize later
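
To illustrate what this mask does inside attention, here is a minimal sketch using `nn.MultiheadAttention` directly, rather than the full encoder used later. The tensor sizes are arbitrary toy values.

```python
import torch
from torch import nn

torch.manual_seed(0)
attention = nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True)

x = torch.randn(1, 4, 8)  # (batch, sequence, embedding)
# True marks padding: here the first two positions are padding tokens
pad_mask = torch.tensor([[True, True, False, False]])

# With need_weights left at its default, the attention weights are returned too
_, weights = attention(x, x, x, key_padding_mask=pad_mask)
weights = weights.detach()

print(weights[0])  # rows: query positions, columns: key positions
```

The columns belonging to the padded positions receive zero attention weight, so real movies never attend to padding, while each row still sums to one over the remaining positions.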

Let's inspect the transformed data:

In [108]:
seq_df

Unnamed: 0,user_id,title,unix_timestamp,is_valid,target_rating,previous_ratings,pad_mask,num_pads
0,1,"([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PA...","([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PA...",False,4,"([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PA...","[True, True, True, True, True, True, True, Tru...",9
1,1,"([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PA...","([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PA...",False,5,"([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PA...","[True, True, True, True, True, True, True, Tru...",8
2,1,"([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PA...","([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PA...",False,4,"([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PA...","[True, True, True, True, True, True, True, Fal...",7
3,1,"([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], Gir...","([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], 978...",False,5,"([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], 4, ...","[True, True, True, True, True, True, False, Fa...",6
4,1,"([PAD], [PAD], [PAD], [PAD], [PAD], Girl, Inte...","([PAD], [PAD], [PAD], [PAD], [PAD], 978300019,...",False,3,"([PAD], [PAD], [PAD], [PAD], [PAD], 4, 5, 4, 5)","[True, True, True, True, True, False, False, F...",5
...,...,...,...,...,...,...,...,...
1000204,999,"(General's Daughter, The (1999), Powder (1995)...","(975364681, 975364717, 975364717, 975364717, 9...",False,3,"(3, 3, 2, 1, 3, 2, 3, 2, 4)","[False, False, False, False, False, False, Fal...",0
1000205,999,"(Powder (1995), We're No Angels (1989), Out of...","(975364717, 975364717, 975364717, 975364743, 9...",False,3,"(3, 2, 1, 3, 2, 3, 2, 4, 3)","[False, False, False, False, False, False, Fal...",0
1000206,999,"(We're No Angels (1989), Out of Africa (1985),...","(975364717, 975364717, 975364743, 975364743, 9...",False,3,"(2, 1, 3, 2, 3, 2, 4, 3, 3)","[False, False, False, False, False, False, Fal...",0
1000207,999,"(Out of Africa (1985), Instinct (1999), Corrup...","(975364717, 975364743, 975364743, 975364784, 9...",False,2,"(1, 3, 2, 3, 2, 4, 3, 3, 3)","[False, False, False, False, False, False, Fal...",0


All looks as it should! Let's split this into training and validation sets.

In [109]:
train_seq_df = seq_df[seq_df.is_valid == False]
valid_seq_df = seq_df[seq_df.is_valid == True]

### Training the model

As we saw previously, before we can feed this data into the model, we need to create lookup tables to encode our movies and users. However, this time, we need to include the padding token in our movie lookup.

In [110]:
user_lookup = {v: i+1 for i, v in enumerate(ratings_df['user_id'].unique())}

In [111]:
def create_feature_lookup(df, feature):
    lookup = {v: i+1 for i, v in enumerate(df[feature].unique())}
    lookup['[PAD]'] = 0
    return lookup

In [112]:
movie_lookup = create_feature_lookup(ratings_df, 'title')
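
As a quick illustration of how such a lookup encodes a padded sequence, here is a standalone sketch. `build_lookup` and the toy titles are hypothetical and only mirror the logic of `create_feature_lookup`.

```python
def build_lookup(values, pad_token="[PAD]"):
    """Map each distinct value to 1..N in first-seen order; reserve 0 for padding."""
    # dict.fromkeys removes duplicates while keeping the order of first appearance
    lookup = {v: i + 1 for i, v in enumerate(dict.fromkeys(values))}
    lookup[pad_token] = 0
    return lookup


titles = ["Titanic (1997)", "Cinderella (1950)", "Titanic (1997)"]  # toy data
toy_lookup = build_lookup(titles)

sequence = ("[PAD]", "Cinderella (1950)", "Titanic (1997)")
encoded = [toy_lookup[title] for title in sequence]
print(encoded)  # [0, 2, 1]
```

Reserving index 0 for the padding token is what allows an embedding layer created with `padding_idx=0` to map padded positions to a zero vector. Note that the lookup's length already counts the padding entry.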

Now, we are dealing with sequences of ratings, rather than individual ones, so we will need to create a new dataset to wrap our processed DataFrame:

In [113]:
class MovieSequenceDataset(Dataset):
    def __init__(self, df, movie_lookup, user_lookup):
        super().__init__()
        self.df = df
        self.movie_lookup = movie_lookup
        self.user_lookup = user_lookup

    def __len__(self):
        return len(self.df)

    def __getitem__(self, index):
        data = self.df.iloc[index]
        user_id = self.user_lookup[str(data.user_id)]
        movie_ids = torch.tensor([self.movie_lookup[title] for title in data.title])

        previous_ratings = torch.tensor(
            [rating if rating != "[PAD]" else 0 for rating in data.previous_ratings]
        )

        attention_mask = torch.tensor(data.pad_mask)
        target_rating = data.target_rating
        encoded_features = {
            "user_id": user_id,
            "movie_ids": movie_ids,
            "ratings": previous_ratings,
        }

        return (encoded_features, attention_mask), torch.tensor(
            target_rating, dtype=torch.float32
        )


In [114]:
train_dataset = MovieSequenceDataset(train_seq_df, movie_lookup, user_lookup)
valid_dataset = MovieSequenceDataset(valid_seq_df, movie_lookup, user_lookup)
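
Before defining the model, it can help to see what a batch of these nested items looks like. The sketch below uses a hypothetical `ToySequenceDataset` filled with dummy values, assuming the default PyTorch collate function is used to build batches.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ToySequenceDataset(Dataset):
    """Returns items shaped like MovieSequenceDataset, filled with dummy values."""

    def __len__(self):
        return 8

    def __getitem__(self, index):
        features = {
            "user_id": index + 1,                         # a plain Python int
            "movie_ids": torch.arange(10),                # 10 movies per sequence
            "ratings": torch.zeros(9, dtype=torch.long),  # 9 previous ratings
        }
        mask = torch.zeros(10, dtype=torch.bool)
        return (features, mask), torch.tensor(3.0)


loader = DataLoader(ToySequenceDataset(), batch_size=4)
(features, mask), targets = next(iter(loader))
print(features["user_id"].shape, features["movie_ids"].shape)
print(features["ratings"].shape, mask.shape, targets.shape)
```

The default collation stacks each entry of the dictionary separately, so the model receives a batch of user IDs with shape `(batch,)`, movie IDs with shape `(batch, 10)`, and previous ratings with shape `(batch, 9)`.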

Now, let's define our transformer model! As a start, given that the matrix factorization model can achieve good performance using only the user and movie ids, let's only include this information for now.

In [115]:
class BstTransformer(nn.Module):
    def __init__(
        self,
        movies_num_unique,
        users_num_unique,
        sequence_length=10,
        embedding_size=120,
        num_transformer_layers=1,
        ratings_range=(0.5, 5.5),
    ):
        super().__init__()
        self.sequence_length = sequence_length
        self.y_range = ratings_range
        self.movies_embeddings = nn.Embedding(
            movies_num_unique + 1, embedding_size, padding_idx=0
        )
        self.user_embeddings = nn.Embedding(users_num_unique + 1, embedding_size)
        self.position_embeddings = nn.Embedding(sequence_length, embedding_size)

        self.encoder = nn.TransformerEncoder(
            encoder_layer=nn.TransformerEncoderLayer(
                d_model=embedding_size,
                nhead=12,
                dropout=0.1,
                batch_first=True,
                activation="gelu",
            ),
            num_layers=num_transformer_layers,
        )

        self.linear = nn.Sequential(
            nn.Linear(
                embedding_size + (embedding_size * sequence_length),
                1024,
            ),
            nn.BatchNorm1d(1024),
            nn.Mish(),
            nn.Linear(1024, 512),
            nn.BatchNorm1d(512),
            nn.Mish(),
            nn.Dropout(0.2),
            nn.Linear(512, 256),
            nn.BatchNorm1d(256),
            nn.Mish(),
            nn.Linear(256, 1),
            nn.Sigmoid(),
        )

    def forward(self, inputs):
        features, mask = inputs

        encoded_user_id = self.user_embeddings(features["user_id"])

        user_features = encoded_user_id

        encoded_movies = self.movies_embeddings(features["movie_ids"])

        positions = torch.arange(
            0, self.sequence_length, 1, dtype=int, device=features["movie_ids"].device
        )
        positions = self.position_embeddings(positions)

        transformer_features = encoded_movies + positions

        transformer_output = self.encoder(
            transformer_features, src_key_padding_mask=mask
        )
        transformer_output = torch.flatten(transformer_output, start_dim=1)

        combined_output = torch.cat((transformer_output, user_features), dim=1)

        rating = self.linear(combined_output)
        rating = rating.squeeze()
        if self.y_range is None:
            return rating
        else:
            return rating * (self.y_range[1] - self.y_range[0]) + self.y_range[0]


We can see that, as a default, we feed our sequence of movie embeddings into a single transformer layer, before concatenating the output with the user features - here, just the user ID - and using this as the input to a fully connected network. Here, we are using only a simple positional encoding that is learned to represent the sequence in which the movies were rated; using a sine- and cosine-based approach provided no benefit during my experiments, but feel free to try it out if you are interested!
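
For readers who do want to try the sine- and cosine-based alternative, the following is a minimal sketch of the standard fixed encoding. The function name `sinusoidal_positions` is hypothetical, and the sketch assumes an even embedding size; it is not the code used in the experiments mentioned above.

```python
import math

import torch


def sinusoidal_positions(sequence_length, embedding_size):
    """Fixed sine/cosine encodings of shape (sequence_length, embedding_size)."""
    position = torch.arange(sequence_length, dtype=torch.float32).unsqueeze(1)
    # Geometric progression of frequencies, one per pair of dimensions
    div_term = torch.exp(
        torch.arange(0, embedding_size, 2, dtype=torch.float32)
        * (-math.log(10000.0) / embedding_size)
    )
    encodings = torch.zeros(sequence_length, embedding_size)
    encodings[:, 0::2] = torch.sin(position * div_term)  # even dimensions
    encodings[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
    return encodings


positions = sinusoidal_positions(10, 120)
print(positions.shape)
```

One way to use this would be to add the resulting tensor to the movie embeddings in place of the learned `position_embeddings` output, since both have shape `(sequence_length, embedding_size)`.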

Once again, let's define a training function for this model; except for the model initialization, this is identical to the one we used to train the matrix factorization model.

In [116]:
def train_seq_model():
    model = BstTransformer(
        len(movie_lookup), len(user_lookup), sequence_length, embedding_size=120
    )
    loss_func = torch.nn.MSELoss()

    optimizer = torch.optim.AdamW(model.parameters(), lr=0.01)

    create_sched_fn = partial(
        torch.optim.lr_scheduler.OneCycleLR,
        max_lr=0.01,
        epochs=TrainerPlaceholderValues.NUM_EPOCHS,
        steps_per_epoch=TrainerPlaceholderValues.NUM_UPDATE_STEPS_PER_EPOCH,
    )

    trainer = Trainer(
        model=model,
        loss_func=loss_func,
        optimizer=optimizer,
        callbacks=(
            RecommenderMetricsCallback,
            *DEFAULT_CALLBACKS,
            SaveBestModelCallback(watch_metric="mae"),
            EarlyStoppingCallback(
                early_stopping_patience=2,
                early_stopping_threshold=0.001,
                watch_metric="mae",
            ),
        ),
    )

    trainer.train(
        train_dataset=train_dataset,
        eval_dataset=valid_dataset,
        num_epochs=10,
        per_device_batch_size=512,
        create_scheduler_fn=create_sched_fn,
    )
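
The placeholder values passed to `OneCycleLR` stand in for the number of epochs and update steps, which are only known once training starts. To see the shape of the resulting schedule, here is a plain PyTorch sketch with small, made-up values for both quantities.

```python
import torch

epochs, steps_per_epoch = 10, 5  # illustrative values only
params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.AdamW(params, lr=0.01)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=0.01, epochs=epochs, steps_per_epoch=steps_per_epoch
)

# Record the learning rate at every update step of the run
learning_rates = [scheduler.get_last_lr()[0]]
for _ in range(epochs * steps_per_epoch - 1):
    optimizer.step()
    scheduler.step()
    learning_rates.append(scheduler.get_last_lr()[0])

print(f"start={learning_rates[0]:.5f}")
print(f"peak={max(learning_rates):.5f}")
print(f"end={learning_rates[-1]:.2e}")
```

With the default settings, the learning rate warms up from a fraction of `max_lr`, peaks at `max_lr` roughly 30% of the way through, and then anneals to a much smaller value by the final step.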


In [117]:
notebook_launcher(train_seq_model, num_processes=2)

Launching training on 2 GPUs.


We can see that this is a significant improvement over the matrix factorization approach!

### Adding additional data

So far, we have only considered the user ID and a sequence of movie IDs to predict the rating; it seems likely that including information about the previous ratings made by the user would improve performance. Thankfully, this is easy to do, and the data is already being returned by our dataset. Let's tweak our architecture to include this:

In [118]:
class BstTransformer(nn.Module):
    def __init__(
        self,
        movies_num_unique,
        users_num_unique,
        sequence_length=10,
        embedding_size=120,
        num_transformer_layers=1,
        ratings_range=(0.5, 5.5),
    ):
        super().__init__()
        self.sequence_length = sequence_length
        self.y_range = ratings_range
        self.movies_embeddings = nn.Embedding(
            movies_num_unique + 1, embedding_size, padding_idx=0
        )
        self.user_embeddings = nn.Embedding(users_num_unique + 1, embedding_size)
        self.ratings_embeddings = nn.Embedding(6, embedding_size, padding_idx=0)
        self.position_embeddings = nn.Embedding(sequence_length, embedding_size)

        self.encoder = nn.TransformerEncoder(
            encoder_layer=nn.TransformerEncoderLayer(
                d_model=embedding_size,
                nhead=12,
                dropout=0.1,
                batch_first=True,
                activation="gelu",
            ),
            num_layers=num_transformer_layers,
        )

        self.linear = nn.Sequential(
            nn.Linear(
                embedding_size + (embedding_size * sequence_length),
                1024,
            ),
            nn.BatchNorm1d(1024),
            nn.Mish(),
            nn.Linear(1024, 512),
            nn.BatchNorm1d(512),
            nn.Mish(),
            nn.Dropout(0.2),
            nn.Linear(512, 256),
            nn.BatchNorm1d(256),
            nn.Mish(),
            nn.Linear(256, 1),
            nn.Sigmoid(),
        )

    def forward(self, inputs):
        features, mask = inputs

        encoded_user_id = self.user_embeddings(features["user_id"])

        user_features = encoded_user_id

        movie_history = features["movie_ids"][:, :-1]
        target_movie = features["movie_ids"][:, -1]

        ratings = self.ratings_embeddings(features["ratings"])

        encoded_movies = self.movies_embeddings(movie_history)
        encoded_target_movie = self.movies_embeddings(target_movie)

        positions = torch.arange(
            0,
            self.sequence_length - 1,
            1,
            dtype=int,
            device=features["movie_ids"].device,
        )
        positions = self.position_embeddings(positions)

        encoded_sequence_movies_with_position_and_rating = (
            encoded_movies + ratings + positions
        )
        encoded_target_movie = encoded_target_movie.unsqueeze(1)

        transformer_features = torch.cat(
            (encoded_sequence_movies_with_position_and_rating, encoded_target_movie),
            dim=1,
        )
        transformer_output = self.encoder(
            transformer_features, src_key_padding_mask=mask
        )
        transformer_output = torch.flatten(transformer_output, start_dim=1)

        combined_output = torch.cat((transformer_output, user_features), dim=1)

        rating = self.linear(combined_output)
        rating = rating.squeeze()
        if self.y_range is None:
            return rating
        else:
            return rating * (self.y_range[1] - self.y_range[0]) + self.y_range[0]


We can see that, to use the ratings data, we have added an additional embedding layer. For each previously rated movie, we then add together the movie embedding, the positional encoding and the rating embedding before feeding this sequence into the transformer. Alternatively, the rating data could be concatenated to, or multiplied with, the movie embedding, but adding them together worked the best out of the approaches that I tried.
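
The three ways of combining the movie and rating embeddings differ mainly in the shape they produce. The sketch below compares them on random toy inputs; the vocabulary size of 50 and the batch size are arbitrary assumptions.

```python
import torch
from torch import nn

torch.manual_seed(0)
batch_size, history_length, embedding_size = 4, 9, 120
movie_embeddings = nn.Embedding(50, embedding_size, padding_idx=0)
rating_embeddings = nn.Embedding(6, embedding_size, padding_idx=0)

# Random toy indices; 0 is avoided here because it is the padding index
movie_ids = torch.randint(1, 50, (batch_size, history_length))
ratings = torch.randint(1, 6, (batch_size, history_length))
movies, rated = movie_embeddings(movie_ids), rating_embeddings(ratings)

summed = movies + rated                             # keeps d_model unchanged
multiplied = movies * rated                         # keeps d_model unchanged
concatenated = torch.cat((movies, rated), dim=-1)   # doubles d_model

print(summed.shape, multiplied.shape, concatenated.shape)
```

Adding or multiplying keeps the feature size at `embedding_size`, so the encoder can be left as it is. Concatenating doubles it, which would require changing the encoder's `d_model` and the input size of the first linear layer accordingly.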

As Jupyter maintains a live state for each class definition, we don't need to update our training function; the new class will be used when we launch training:

In [119]:
notebook_launcher(train_seq_model, num_processes=2)

Launching training on 2 GPUs.
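
The reason the training function picks up the new class is that Python resolves the name when the function is called, not when it is defined. A minimal sketch with hypothetical names:

```python
class Model:
    name = "first definition"


def build_model():
    # `Model` is looked up when the function is called, not when it is defined
    return Model()


before = build_model().name


class Model:  # re-running a cell with a new definition rebinds the name
    name = "second definition"


after = build_model().name
print(before, "->", after)
```

The flip side is that results depend on which cells were run most recently, so it is worth re-running the class definition cell before launching training.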


We can see that incorporating the ratings data has improved our results slightly!

### Adding user features

In addition to the ratings data, we also have more information about the users that we could add into the model. To remind ourselves, let's take a look at the users table:

In [120]:
users

Unnamed: 0,user_id,sex,age_group,occupation,zip_code,sex_encoded,age_group_encoded
0,1,F,1,10,48067,0,0
1,2,M,56,16,70072,1,6
2,3,M,25,15,55117,1,2
3,4,M,45,7,02460,1,4
4,5,M,25,20,55455,1,2
...,...,...,...,...,...,...,...
6035,6036,F,25,15,32603,0,2
6036,6037,F,45,1,76006,0,4
6037,6038,F,56,1,14706,0,6
6038,6039,F,45,0,01060,0,4


Let's try adding the categorical variables representing the users' sex, age group, and occupation to the model, and check whether performance improves. While occupation looks like it is already encoded as sequential integers, we must do the same for the sex and age_group columns. We can use the 'LabelEncoder' class from scikit-learn to do this for us, and append the encoded columns to the DataFrame:

In [121]:
from sklearn.preprocessing import LabelEncoder

In [122]:
le = LabelEncoder()

In [123]:
users['sex_encoded'] = le.fit_transform(users.sex)

In [124]:
users['age_group_encoded'] = le.fit_transform(users.age_group)
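
`LabelEncoder` assigns integer codes according to the sorted order of the distinct values. The sketch below reproduces that mapping with NumPy alone, on a handful of age group values, to show where codes such as 0, 2, 4, and 6 in the table above come from.

```python
import numpy as np

# A few age group values, as they appear in the users table (toy sample)
age_groups = np.array([1, 56, 25, 45, 25, 18, 35, 50])

# `classes` holds the sorted distinct values;
# `encoded` holds the position of each original value within `classes`
classes, encoded = np.unique(age_groups, return_inverse=True)

print(classes)
print(encoded)
```

Because the same `le` object is fitted twice, it only remembers the classes of the last column it was fitted on; a separate encoder per column would be needed to map codes back to the original values later.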

In [125]:
users["user_id"] = users["user_id"].astype(str)

Now that we have all the features that we are going to use encoded, let's join the user features to our sequences DataFrame, and update our training and validation sets.

In [126]:
seq_with_user_features = pd.merge(seq_df, users)
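
Called without further arguments, `pd.merge` performs an inner join on the columns the two DataFrames have in common, which here is `user_id`. The sketch below makes those defaults explicit on toy data; the frames and the `validate` check are illustrative additions, not part of the notebook.

```python
import pandas as pd

toy_seq_df = pd.DataFrame({"user_id": ["1", "1", "2"], "target_rating": [4, 5, 3]})
toy_users = pd.DataFrame({"user_id": [1, 2], "occupation": [10, 16]})

# The join key must have the same dtype on both sides
toy_users["user_id"] = toy_users["user_id"].astype(str)

merged = pd.merge(
    toy_seq_df,
    toy_users,
    on="user_id",
    how="inner",
    validate="many_to_one",  # fail loudly if a user appears twice in toy_users
)
print(merged)
```

Each sequence row keeps its own values and gains the features of its user, so the number of rows is unchanged as long as every user ID is present in the users table.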

In [127]:
train_df = seq_with_user_features[seq_with_user_features.is_valid == False]
valid_df = seq_with_user_features[seq_with_user_features.is_valid == True]

Let's update our dataset to include these features.

In [128]:
class MovieSequenceDataset(Dataset):
    def __init__(self, df, movie_lookup, user_lookup):
        super().__init__()
        self.df = df
        self.movie_lookup = movie_lookup
        self.user_lookup = user_lookup

    def __len__(self):
        return len(self.df)

    def __getitem__(self, index):
        data = self.df.iloc[index]
        user_id = self.user_lookup[str(data.user_id)]
        movie_ids = torch.tensor([self.movie_lookup[title] for title in data.title])

        previous_ratings = torch.tensor(
            [rating if rating != "[PAD]" else 0 for rating in data.previous_ratings]
        )

        attention_mask = torch.tensor(data.pad_mask)
        target_rating = data.target_rating
        encoded_features = {
            "user_id": user_id,
            "movie_ids": movie_ids,
            "ratings": previous_ratings,
            "age_group": data["age_group_encoded"],
            "sex": data["sex_encoded"],
            "occupation": data["occupation"],
        }

        return (encoded_features, attention_mask), torch.tensor(
            target_rating, dtype=torch.float32
        )


In [129]:
train_dataset = MovieSequenceDataset(train_df, movie_lookup, user_lookup)
valid_dataset = MovieSequenceDataset(valid_df, movie_lookup, user_lookup)

We can now modify our architecture to include embeddings for these features and concatenate these embeddings to the output of the transformer; then we pass this into the feed-forward network.

In [130]:
class BstTransformer(nn.Module):
    def __init__(
        self,
        movies_num_unique,
        users_num_unique,
        sequence_length=10,
        embedding_size=120,
        num_transformer_layers=1,
        ratings_range=(0.5, 5.5),
    ):
        super().__init__()
        self.sequence_length = sequence_length
        self.y_range = ratings_range
        self.movies_embeddings = nn.Embedding(
            movies_num_unique + 1, embedding_size, padding_idx=0
        )
        self.user_embeddings = nn.Embedding(users_num_unique + 1, embedding_size)
        self.ratings_embeddings = nn.Embedding(6, embedding_size, padding_idx=0)
        self.position_embeddings = nn.Embedding(sequence_length, embedding_size)

        self.sex_embeddings = nn.Embedding(
            3,
            2,
        )
        self.occupation_embeddings = nn.Embedding(
            22,
            11,
        )
        self.age_group_embeddings = nn.Embedding(
            8,
            4,
        )

        self.encoder = nn.TransformerEncoder(
            encoder_layer=nn.TransformerEncoderLayer(
                d_model=embedding_size,
                nhead=12,
                dropout=0.1,
                batch_first=True,
                activation="gelu",
            ),
            num_layers=num_transformer_layers,
        )

        self.linear = nn.Sequential(
            nn.Linear(
                embedding_size + (embedding_size * sequence_length) + 4 + 11 + 2,
                1024,
            ),
            nn.BatchNorm1d(1024),
            nn.Mish(),
            nn.Linear(1024, 512),
            nn.BatchNorm1d(512),
            nn.Mish(),
            nn.Dropout(0.2),
            nn.Linear(512, 256),
            nn.BatchNorm1d(256),
            nn.Mish(),
            nn.Linear(256, 1),
            nn.Sigmoid(),
        )

    def forward(self, inputs):
        features, mask = inputs

        user_id = self.user_embeddings(features["user_id"])

        age_group = self.age_group_embeddings(features["age_group"])
        sex = self.sex_embeddings(features["sex"])
        occupation = self.occupation_embeddings(features["occupation"])

        user_features = torch.cat(
            (user_id, sex, age_group, occupation), 1
        )

        movie_history = features["movie_ids"][:, :-1]
        target_movie = features["movie_ids"][:, -1]

        ratings = self.ratings_embeddings(features["ratings"])

        encoded_movies = self.movies_embeddings(movie_history)
        encoded_target_movie = self.movies_embeddings(target_movie)

        positions = torch.arange(
            0,
            self.sequence_length - 1,
            1,
            dtype=int,
            device=features["movie_ids"].device,
        )
        positions = self.position_embeddings(positions)

        encoded_sequence_movies_with_position_and_rating = (
            encoded_movies + ratings + positions
        )
        encoded_target_movie = encoded_target_movie.unsqueeze(1)

        transformer_features = torch.cat(
            (encoded_sequence_movies_with_position_and_rating, encoded_target_movie),
            dim=1,
        )
        transformer_output = self.encoder(
            transformer_features, src_key_padding_mask=mask
        )
        transformer_output = torch.flatten(transformer_output, start_dim=1)

        combined_output = torch.cat((transformer_output, user_features), dim=1)

        rating = self.linear(combined_output)
        rating = rating.squeeze()
        if self.y_range is None:
            return rating
        else:
            return rating * (self.y_range[1] - self.y_range[0]) + self.y_range[0]
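
The input size of the first linear layer follows directly from the tensors being concatenated. As a sanity check, here is a shape-only sketch using zero tensors with the default sizes from the class above; the batch size of 4 is arbitrary.

```python
import torch

batch_size, sequence_length, embedding_size = 4, 10, 120
sex_dim, occupation_dim, age_group_dim = 2, 11, 4

# Stand-ins for the encoder output and the concatenated user features
transformer_output = torch.zeros(batch_size, sequence_length, embedding_size)
user_features = torch.zeros(
    batch_size, embedding_size + sex_dim + age_group_dim + occupation_dim
)

combined = torch.cat(
    (torch.flatten(transformer_output, start_dim=1), user_features), dim=1
)
expected = embedding_size + (embedding_size * sequence_length) + 4 + 11 + 2

print(combined.shape, expected)
```

The flattened transformer output contributes 1,200 features, the user ID embedding 120, and the three demographic embeddings 17, giving 1,337 inputs in total.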


In [131]:
notebook_launcher(train_seq_model, num_processes=2)

Launching training on 2 GPUs.


Here, we can see a slight decrease in the MAE, but a small increase in the MSE and RMSE, so it looks like these features made a negligible difference to the overall performance.

In writing this article, my main objective has been to illustrate how these approaches can be used, and so I've picked the hyperparameters somewhat arbitrarily; with some hyperparameter tweaks and different combinations of features, these metrics can probably be improved upon!

Hopefully this has provided a good introduction to using both matrix factorization and transformer-based approaches in PyTorch, and how pytorch-accelerated can speed up our process when experimenting with different models!