# Comparing matrix factorization with transformers for MovieLens recommendations using PyTorch-accelerated.

By Chris Hughes

The package versions used are:

In [None]:
torch==1.10.0
torchmetrics==0.6.0
pytorch-accelerated==0.1.7

In [1]:
from pathlib import Path

import numpy as np
import pandas as pd
from statsmodels.distributions.empirical_distribution import ECDF
import matplotlib.pyplot as plt

## Prepare Data

For our dataset, we shall use MovieLens-1M, a collection of roughly one million ratings from around 6,000 users on around 4,000 movies. This dataset was collected and is maintained by GroupLens, a research group at the University of Minnesota, and was released in 2003; it has been frequently used in the machine learning community and is commonly presented as a benchmark in academic papers.

### Download Movielens-1M dataset

In [2]:
!wget http://files.grouplens.org/datasets/movielens/ml-1m.zip

--2021-12-05 11:22:27--  http://files.grouplens.org/datasets/movielens/ml-1m.zip
Resolving files.grouplens.org... 128.101.65.152
Connecting to files.grouplens.org|128.101.65.152|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5917549 (5.6M) [application/zip]
Saving to: ‘ml-1m.zip.1’


2021-12-05 11:22:28 (6.56 MB/s) - ‘ml-1m.zip.1’ saved [5917549/5917549]



In [3]:
!unzip ml-1m.zip

Archive:  ml-1m.zip
replace ml-1m/movies.dat? [y]es, [n]o, [A]ll, [N]one, [r]ename: ^C


## Load Data

MovieLens-1M consists of three files, 'movies.dat', 'users.dat', and 'ratings.dat', each of which uses '::' as the field separator. We can load them into DataFrames as follows:

In [2]:
dataset_path = Path('ml-1m')

In [3]:
users = pd.read_csv(
    dataset_path/"users.dat",
    sep="::",
    names=["user_id", "sex", "age_group", "occupation", "zip_code"],
    encoding='latin-1',
    engine='python'
)

ratings = pd.read_csv(
    dataset_path/"ratings.dat",
    sep="::",
    names=["user_id", "movie_id", "rating", "unix_timestamp"],
    encoding='latin-1',
    engine='python'
)

movies = pd.read_csv(
    dataset_path/"movies.dat", sep="::", names=["movie_id", "title", "genres"],
    encoding='latin-1',
    engine='python'
)


In [4]:
users

Unnamed: 0,user_id,sex,age_group,occupation,zip_code
0,1,F,1,10,48067
1,2,M,56,16,70072
2,3,M,25,15,55117
3,4,M,45,7,02460
4,5,M,25,20,55455
...,...,...,...,...,...
6035,6036,F,25,15,32603
6036,6037,F,45,1,76006
6037,6038,F,56,1,14706
6038,6039,F,45,0,01060


In [5]:
movies

Unnamed: 0,movie_id,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
3878,3948,Meet the Parents (2000),Comedy
3879,3949,Requiem for a Dream (2000),Drama
3880,3950,Tigerland (2000),Drama
3881,3951,Two Family House (2000),Drama


In [6]:
ratings

Unnamed: 0,user_id,movie_id,rating,unix_timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291
...,...,...,...,...
1000204,6040,1091,1,956716541
1000205,6040,1094,5,956704887
1000206,6040,562,5,956704746
1000207,6040,1096,4,956715648


Let's combine some of this information into a single DataFrame, to make it easier for us to work with.

In [7]:
ratings_df = pd.merge(ratings, movies)[['user_id', 'title', 'rating', 'unix_timestamp']]

In [8]:
ratings_df["user_id"] = ratings_df["user_id"].astype(str)

Using pandas, we can print some high-level statistics about the dataset, which may be useful to us.

In [9]:
ratings_per_user = ratings_df.groupby('user_id').rating.count()
ratings_per_item = ratings_df.groupby('title').rating.count()

print(f"Total No. of users: {len(ratings_df.user_id.unique())}")
print(f"Total No. of items: {len(ratings_df.title.unique())}")
print("\n")

print(f"Max observed rating: {ratings_df.rating.max()}")
print(f"Min observed rating: {ratings_df.rating.min()}")
print("\n")

print(f"Max no. of user ratings: {ratings_per_user.max()}")
print(f"Min no. of user ratings: {ratings_per_user.min()}")
print(f"Median no. of ratings per user: {ratings_per_user.median()}")
print("\n")

print(f"Max no. of item ratings: {ratings_per_item.max()}")
print(f"Min no. of item ratings: {ratings_per_item.min()}")
print(f"Median no. of ratings per item: {ratings_per_item.median()}")


Total No. of users: 6040
Total No. of items: 3706


Max observed rating: 5
Min observed rating: 1


Max no. of user ratings: 2314
Min no. of user ratings: 20
Median no. of ratings per user: 96.0


Max no. of item ratings: 3428
Min no. of item ratings: 1
Median no. of ratings per item: 123.5


From this, we can see that all ratings are between 1 and 5 and that every item has been rated at least once. As every user has rated at least 20 movies, we don't have to worry about how to recommend items to a user whose preferences we know nothing about (the 'cold-start' problem), but this is often not the case in the real world!

### Splitting into training and validation sets

Before we start modeling, we need to split this dataset into training and validation sets. Often, splitting the dataset is done by randomly sampling a selection of rows, which is a good approach in some cases. However, as we intend to train a transformer model on sequences of ratings, this approach will not work for our purposes. If we were to simply remove a set of random rows, the split would not be a good representation of the task that we are trying to model, as it is likely that, for some users, ratings from the middle of a sequence would end up in the validation set.

To avoid this, one approach is to use a strategy known as 'leave-one-out' validation, in which we hold out the chronologically last rating for each user, provided that they have rated at least a defined minimum number of items. As this is a good representation of the task we are trying to model, this is the approach we shall use here.

Let's define a function to get the last n ratings for each user:

In [10]:
def get_last_n_ratings_by_user(
    df, n, min_ratings_per_user=1, user_colname="user_id", timestamp_colname="unix_timestamp"
):
    return (
        df.groupby(user_colname)
        .filter(lambda x: len(x) >= min_ratings_per_user)
        .sort_values(timestamp_colname)
        .groupby(user_colname)
        .tail(n)
        .sort_values(user_colname)
    )

In [11]:
get_last_n_ratings_by_user(ratings_df, 1)

Unnamed: 0,user_id,title,rating,unix_timestamp
28501,1,Pocahontas (1995),5,978824351
482398,10,Hero (1992),5,980638688
800008,100,Apocalypse Now (1979),2,977594963
496041,1000,"Streetcar Named Desire, A (1951)",5,975042421
305563,1001,Austin Powers: The Spy Who Shagged Me (1999),2,1028605534
...,...,...,...,...
767773,995,French Kiss (1995),3,975099776
573889,996,Almost Famous (2000),5,1001227064
76463,997,Gladiator (2000),4,978915132
998801,998,See the Sea (Regarde la mer) (1997),5,975192573


We can now use this to define another function to mark the last n ratings per user as our validation set; representing this using the is_valid column:

In [12]:
def mark_last_n_ratings_as_validation_set(
    df, n, min_ratings=1, user_colname="user_id", timestamp_colname="unix_timestamp"
):
    """
    Mark the chronologically last n ratings for each user as the validation set.
    This is done by adding the additional 'is_valid' column to the df, which is modified in place.
    :param df: a DataFrame containing user item ratings
    :param n: the number of ratings per user to include in the validation set
    :param min_ratings: only include users with at least this many ratings
    :param user_colname: the name of the column containing user ids
    :param timestamp_colname: the name of the column containing the timestamps
    :return: the same df with the additional 'is_valid' column added
    """
    df["is_valid"] = False
    df.loc[
        get_last_n_ratings_by_user(
            df,
            n,
            min_ratings,
            user_colname=user_colname,
            timestamp_colname=timestamp_colname,
        ).index,
        "is_valid",
    ] = True

    return df

Applying this to our DataFrame, we can see that we now have a validation set of 6040 rows - one for each user.

In [13]:
mark_last_n_ratings_as_validation_set(ratings_df, 1)

Unnamed: 0,user_id,title,rating,unix_timestamp,is_valid
0,1,One Flew Over the Cuckoo's Nest (1975),5,978300760,False
1,2,One Flew Over the Cuckoo's Nest (1975),5,978298413,False
2,12,One Flew Over the Cuckoo's Nest (1975),4,978220179,False
3,15,One Flew Over the Cuckoo's Nest (1975),4,978199279,False
4,17,One Flew Over the Cuckoo's Nest (1975),5,978158471,False
...,...,...,...,...,...
1000204,5949,Modulations (1998),5,958846401,False
1000205,5675,Broken Vessels (1998),3,976029116,False
1000206,5780,White Boys (1999),1,958153068,False
1000207,5851,One Little Indian (1973),5,957756608,False


In [14]:
train_df = ratings_df[~ratings_df.is_valid]
valid_df = ratings_df[ratings_df.is_valid]

In [15]:
len(valid_df)

6040

Even when considering model benchmarks on the same dataset, to have a fair comparison, it is important to understand how the data has been split and to make sure that the approaches taken are consistent!

## Creating a Baseline Model

When starting a new modeling task, it is often a good idea to create a very simple model - known as a baseline model - to perform the task in a straightforward way that requires minimal effort to implement. We can then use the metrics from this model as a comparison for all future approaches; if a complex model is getting worse results than the baseline model, this is a bad sign!

Here, an approach that we can use for this is to simply predict a single, global average rating for every movie, irrespective of context. As the mean can be heavily affected by outliers, let's use the median for this. We can easily calculate the median rating from our training set as follows:

In [17]:
median_rating = train_df.rating.median(); median_rating

4.0

We can then use this as the prediction for every rating in the validation set and calculate our metrics:

In [18]:
import math
from sklearn.metrics import mean_squared_error, mean_absolute_error

predictions = np.array([median_rating]* len(valid_df))

mae = mean_absolute_error(valid_df.rating, predictions)
mse = mean_squared_error(valid_df.rating, predictions)
rmse = math.sqrt(mse)

print(f'mae: {mae}')
print(f'mse: {mse}')
print(f'rmse: {rmse}')

mae: 0.91158940397351
mse: 1.5304635761589405
rmse: 1.2371190630488806
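
To make these metrics concrete, here is a minimal pure-Python sketch of how they are computed for a constant prediction. The toy ratings below are made up for illustration and are not taken from MovieLens; the scikit-learn functions used above perform the same calculations.

```python
import math

# Hypothetical toy ratings, for illustration only
true_ratings = [5, 3, 4, 1]
constant_prediction = 4.0

errors = [rating - constant_prediction for rating in true_ratings]

# MAE: the average size of the errors, ignoring their direction
toy_mae = sum(abs(e) for e in errors) / len(errors)
# MSE: the average squared error, which penalizes large errors more heavily
toy_mse = sum(e ** 2 for e in errors) / len(errors)
# RMSE: the square root of the MSE, expressed in the same units as the ratings
toy_rmse = math.sqrt(toy_mse)

print(f"mae: {toy_mae}, mse: {toy_mse}, rmse: {toy_rmse:.3f}")
```

Because the MSE squares each error, the single rating of 1 dominates it here, which is why it can be useful to report MAE alongside RMSE.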


## Matrix factorization with bias

One very popular approach toward recommendations, both in academia and industry, is matrix factorization.

In addition to representing ratings in a table, such as our DataFrame, an alternative view is to represent a set of user-item ratings as a matrix, with one row per user and one column per movie. We can visualize this on a sample of our data as presented below:

In [19]:
sample_users = ['1', '2', '4']
sample_titles = [
    "One Flew Over the Cuckoo's Nest (1975)",
    "To Kill a Mockingbird (1962)",
    "Saving Private Ryan (1998)",
]

ratings_df[
    ratings_df.user_id.isin(sample_users) & ratings_df.title.isin(sample_titles)
].pivot_table('rating', index='user_id', columns='title').fillna('?')

user_id,One Flew Over the Cuckoo's Nest (1975),Saving Private Ryan (1998),To Kill a Mockingbird (1962)
1,5.0,5.0,4.0
2,5.0,4.0,4.0
4,?,5.0,?


As not every user will have rated every movie, we can see that some values are missing. Therefore, we can formulate our recommendation problem in the following way:

How can we fill in the blanks, such that the values are consistent with the existing ratings in the matrix?

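One way that we can approach this is to assume that there are two smaller matrices, one holding a vector of factors for each user and one holding a vector of factors for each movie, that can be multiplied together to approximate our ratings matrix. The predicted rating for a user-movie pair is then the dot product of the corresponding two vectors.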
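
As a rough illustration of this idea, the sketch below predicts a rating as the dot product of a user's factor vector and a movie's factor vector. The user names, movie names, factor values, and the choice of two factors are all invented for this example; in practice, the factors are learned during training.

```python
# Hypothetical latent factors (2 per user and per movie), invented for illustration
user_factors = {
    "user_a": [1.0, 0.5],
    "user_b": [0.2, 2.0],
}
movie_factors = {
    "movie_x": [4.0, 1.0],
    "movie_y": [1.0, 2.0],
}


def predict_rating(user, movie):
    """Dot product of the user's and the movie's factor vectors."""
    return sum(u * m for u, m in zip(user_factors[user], movie_factors[movie]))


# Multiplying the two small matrices fills in every cell of the ratings matrix,
# including cells for user-movie pairs that were never observed
ratings_matrix = {
    user: {movie: predict_rating(user, movie) for movie in movie_factors}
    for user in user_factors
}
print(ratings_matrix)
```

Here, each user ends up with a high predicted rating for the movie whose factors line up with their own, and a lower one for the other movie.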

Before we think about training a model, we first need to get the data into the correct format. Currently, we have a title that represents each movie, which is a string; we need to convert this to an integer format so that we can feed it into the model. While we already have an ID representing each user, let's also create our own encoding for this. I generally find it good practice to control all the encodings related to a training process, rather than relying on predefined ID systems defined elsewhere; you will be surprised how many IDs that are supposed to be immutable and unique turn out to be otherwise in the real world!

Here, we can do this very simply by enumerating every unique value for both users and movies. We can create lookup tables for this as shown below:

In [20]:
user_lookup = {v: i+1 for i, v in enumerate(ratings_df['user_id'].unique())}

In [21]:
movie_lookup = {v: i+1 for i, v in enumerate(ratings_df['title'].unique())}

Now that we can encode our features, as we are using PyTorch, we need to define a Dataset to wrap our DataFrame and return the user-item ratings.

In [22]:
import torch
from torch.utils.data import Dataset

class UserItemRatingDataset(Dataset):
    def __init__(self, df, movie_lookup, user_lookup):
        self.df = df
        self.movie_lookup = movie_lookup
        self.user_lookup = user_lookup

    def __getitem__(self, index):
        row = self.df.iloc[index]
        user_id = self.user_lookup[row.user_id]
        movie_id = self.movie_lookup[row.title]
        
        rating = torch.tensor(row.rating, dtype=torch.float32)
        
        return (user_id, movie_id), rating

    def __len__(self):
        return len(self.df)


We can now use this to create our training and validation datasets:

In [23]:
train_dataset = UserItemRatingDataset(train_df, movie_lookup, user_lookup)
valid_dataset = UserItemRatingDataset(valid_df, movie_lookup, user_lookup)

Next, let's define the model.

In [24]:
import torch
from torch import nn

class MfDotBias(nn.Module):

    def __init__(
        self, n_factors, n_users, n_items, ratings_range=None, use_biases=True
    ):
        super().__init__()
        self.bias = use_biases
        self.y_range = ratings_range
        self.user_embedding = nn.Embedding(n_users+1, n_factors, padding_idx=0)
        self.item_embedding = nn.Embedding(n_items+1, n_factors, padding_idx=0)

        if use_biases:
            self.user_bias = nn.Embedding(n_users+1, 1, padding_idx=0)
            self.item_bias = nn.Embedding(n_items+1, 1, padding_idx=0)

    def forward(self, inputs):
        users, items = inputs
        dot = self.user_embedding(users) * self.item_embedding(items)
        result = dot.sum(1)
        if self.bias:
            result = (
                result + self.user_bias(users).squeeze() + self.item_bias(items).squeeze()
            )

        if self.y_range is None:
            return result
        else:
            return (
                torch.sigmoid(result) * (self.y_range[1] - self.y_range[0])
                + self.y_range[0]
            )

As we can see, this is very simple to define. Note that, because an embedding layer is simply a lookup table, its size must cover every value that will be seen during training and evaluation. Because of this, we use the number of unique users and items observed in the full dataset, not just the training set. We have also specified a padding embedding at index 0, which can be used for any unknown values. PyTorch handles this by setting this entry to a zero-vector, which is not updated during training.
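
One way to make use of the padding index is sketched below: values that are missing from a lookup table are mapped to index 0. The `encode` helper and the example lookup table are hypothetical and are not part of this notebook, which indexes the lookup tables directly.

```python
# Hypothetical lookup table built in the same way as above: known values start
# at 1, leaving index 0 free for padding and unknown values
example_lookup = {
    v: i + 1 for i, v in enumerate(["Toy Story (1995)", "Jumanji (1995)"])
}


def encode(value, lookup, unknown_index=0):
    """Return the integer index for a value, falling back to the padding index."""
    return lookup.get(value, unknown_index)


print(encode("Jumanji (1995)", example_lookup))
print(encode("A movie we have never seen", example_lookup))
```

Because the embedding at index 0 is a zero-vector, an unknown user or movie encoded in this way would contribute nothing to the dot product, and its bias term would also be zero.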

Additionally, as this is a regression task, the range that the model could predict is potentially unbounded. While the model can learn to restrict the output values to between 1 and 5, we can make this easier for the model by modifying the architecture to restrict this range prior to training. We have done this by applying the sigmoid function to the model's output - which restricts the range to between 0 and 1 - and then scaling this to within a range that we can define.
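
The scaling step can be sketched in plain Python as follows. The `scaled_sigmoid` helper is a name introduced here for illustration; the model above does the same thing with `torch.sigmoid`. The range of 0.5 to 5.5 matches the `ratings_range` passed to the model in the training function; a likely reason for choosing a range slightly wider than 1 to 5 is that the sigmoid only approaches its limits asymptotically, so the extreme ratings would otherwise be unreachable.

```python
import math


def scaled_sigmoid(x, low, high):
    """Squash x into (0, 1) with a sigmoid, then rescale it to (low, high)."""
    return 1 / (1 + math.exp(-x)) * (high - low) + low


# Very negative outputs approach `low`, very positive outputs approach `high`,
# and an output of zero lands in the middle of the range
for raw_output in (-10.0, 0.0, 10.0):
    print(raw_output, scaled_sigmoid(raw_output, 0.5, 5.5))
```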

### Train with PyTorch-accelerated

At this point, we would usually start writing the training loop; as we are using pytorch-accelerated, this will largely be taken care of for us. However, pytorch-accelerated tracks only the training and validation losses by default, so we will need a callback to track our metrics. First, let's import everything that we need:

In [25]:
from functools import partial

from pytorch_accelerated import Trainer, notebook_launcher 
from pytorch_accelerated.trainer import TrainerPlaceholderValues, DEFAULT_CALLBACKS
from pytorch_accelerated.callbacks import EarlyStoppingCallback, SaveBestModelCallback, TrainerCallback, StopTrainingError
import torchmetrics

Let's create a callback to track our metrics

In [26]:
class RecommenderMetricsCallback(TrainerCallback):
    def __init__(self):
        self.metrics = torchmetrics.MetricCollection(
            {
                "mse": torchmetrics.MeanSquaredError(),
                "mae": torchmetrics.MeanAbsoluteError(),
            }
        )

    def _move_to_device(self, trainer):
        self.metrics.to(trainer.device)

    def on_training_run_start(self, trainer, **kwargs):
        self._move_to_device(trainer)

    def on_evaluation_run_start(self, trainer, **kwargs):
        self._move_to_device(trainer)

    def on_eval_step_end(self, trainer, batch, batch_output, **kwargs):
        preds = batch_output["model_outputs"]
        self.metrics.update(preds, batch[1])

    def on_eval_epoch_end(self, trainer, **kwargs):
        metrics = self.metrics.compute()
        
        mse = metrics["mse"].cpu()
        trainer.run_history.update_metric("mae", metrics["mae"].cpu())
        trainer.run_history.update_metric("mse", mse)
        trainer.run_history.update_metric("rmse",  math.sqrt(mse))

        self.metrics.reset()

Now, all that is left to do is to train the model. PyTorch-accelerated provides a notebook_launcher function, which enables us to run multi-GPU training runs from within a notebook. To use this, all we need to do is to define a training function that instantiates our Trainer object and calls the train method.

Components such as the model and dataset can be defined anywhere in the notebook, but it is important that the trainer is only ever instantiated within a training function.

In [27]:
def train_mf_model():
    model = MfDotBias(
        120, len(user_lookup), len(movie_lookup), ratings_range=[0.5, 5.5]
    )
    loss_func = torch.nn.MSELoss()

    optimizer = torch.optim.AdamW(model.parameters(), lr=0.01)

    create_sched_fn = partial(
        torch.optim.lr_scheduler.OneCycleLR,
        max_lr=0.01,
        epochs=TrainerPlaceholderValues.NUM_EPOCHS,
        steps_per_epoch=TrainerPlaceholderValues.NUM_UPDATE_STEPS_PER_EPOCH,
    )

    trainer = Trainer(
        model=model,
        loss_func=loss_func,
        optimizer=optimizer,
        callbacks=(
            RecommenderMetricsCallback,
            *DEFAULT_CALLBACKS,
            SaveBestModelCallback(watch_metric="mae"),
            EarlyStoppingCallback(
                early_stopping_patience=2,
                early_stopping_threshold=0.001,
                watch_metric="mae",
            ),
        ),
    )

    trainer.train(
        train_dataset=train_dataset,
        eval_dataset=valid_dataset,
        num_epochs=30,
        per_device_batch_size=512,
        create_scheduler_fn=create_sched_fn,
    )


In [79]:
notebook_launcher(train_mf_model, num_processes=2)

Launching a training on 2 GPUs.

Starting training run

Starting epoch 1


100%|██████████| 971/971 [00:22<00:00, 43.97it/s]



train_loss_epoch: 6.917553340860793


100%|██████████| 6/6 [00:00<00:00, 15.46it/s]



mae: 2.295109510421753

eval_loss_epoch: 7.015153566996257

rmse: 2.6486134878555867

mse: 7.015153408050537

Starting epoch 2


100%|██████████| 971/971 [00:22<00:00, 43.85it/s]



train_loss_epoch: 6.613832648087726


100%|██████████| 6/6 [00:00<00:00, 15.00it/s]



mae: 2.2753686904907227

eval_loss_epoch: 6.9166419506073

rmse: 2.629950895394954

mse: 6.916641712188721

Improvement of 0.019740819931030273 observed, resetting counter. 
Early stopping counter: 0/2

Starting epoch 3


100%|██████████| 971/971 [00:22<00:00, 42.62it/s]



train_loss_epoch: 6.165679518643663


100%|██████████| 6/6 [00:00<00:00, 14.32it/s]



mae: 2.237168550491333

eval_loss_epoch: 6.745323101679484

rmse: 2.59717600118905

mse: 6.745323181152344

Improvement of 0.03820013999938965 observed, resetting counter. 
Early stopping counter: 0/2

Starting epoch 4


100%|██████████| 971/971 [00:23<00:00, 42.14it/s]



train_loss_epoch: 5.713309306685883


100%|██████████| 6/6 [00:00<00:00, 14.77it/s]



mae: 2.1757984161376953

eval_loss_epoch: 6.463704665501912

rmse: 2.5423816759151356

mse: 6.463704586029053

Improvement of 0.061370134353637695 observed, resetting counter. 
Early stopping counter: 0/2

Starting epoch 5


100%|██████████| 971/971 [00:22<00:00, 43.71it/s]



train_loss_epoch: 5.240608819849582


100%|██████████| 6/6 [00:00<00:00, 15.74it/s]



mae: 2.0645127296447754

eval_loss_epoch: 5.93857479095459

rmse: 2.4369190208370552

mse: 5.938574314117432

Improvement of 0.11128568649291992 observed, resetting counter. 
Early stopping counter: 0/2

Starting epoch 6


100%|██████████| 971/971 [00:22<00:00, 42.76it/s]



train_loss_epoch: 4.5496183526994765


100%|██████████| 6/6 [00:00<00:00, 15.47it/s]



mae: 1.8991206884384155

eval_loss_epoch: 5.143191337585449

rmse: 2.2678606249993862

mse: 5.143191814422607

Improvement of 0.16539204120635986 observed, resetting counter. 
Early stopping counter: 0/2

Starting epoch 7


100%|██████████| 971/971 [00:22<00:00, 42.67it/s]



train_loss_epoch: 3.7405364967644767


100%|██████████| 6/6 [00:00<00:00, 14.57it/s]



mae: 1.739580750465393

eval_loss_epoch: 4.3907707532246905

rmse: 2.095416644052063

mse: 4.39077091217041

Improvement of 0.15953993797302246 observed, resetting counter. 
Early stopping counter: 0/2

Starting epoch 8


100%|██████████| 971/971 [00:22<00:00, 42.26it/s]



train_loss_epoch: 3.0049563406915794


100%|██████████| 6/6 [00:00<00:00, 14.61it/s]



mae: 1.5949243307113647

eval_loss_epoch: 3.7915724913279214

rmse: 1.9471960791868859

mse: 3.7915725708007812

Improvement of 0.14465641975402832 observed, resetting counter. 
Early stopping counter: 0/2

Starting epoch 9


100%|██████████| 971/971 [00:22<00:00, 43.52it/s]



train_loss_epoch: 2.226169708828479


100%|██████████| 6/6 [00:00<00:00, 15.21it/s]



mae: 1.383549690246582

eval_loss_epoch: 3.02661395072937

rmse: 1.7397166294340496

mse: 3.02661395072937

Improvement of 0.21137464046478271 observed, resetting counter. 
Early stopping counter: 0/2

Starting epoch 10


100%|██████████| 971/971 [00:22<00:00, 43.40it/s]



train_loss_epoch: 1.4580235015965393


100%|██████████| 6/6 [00:00<00:00, 15.30it/s]



mae: 1.18148672580719

eval_loss_epoch: 2.3194796641667685

rmse: 1.5229838160345626

mse: 2.3194797039031982

Improvement of 0.2020629644393921 observed, resetting counter. 
Early stopping counter: 0/2

Starting epoch 11


100%|██████████| 971/971 [00:22<00:00, 43.93it/s]



train_loss_epoch: 0.9879166315070879


100%|██████████| 6/6 [00:00<00:00, 13.92it/s]



mae: 1.0579732656478882

eval_loss_epoch: 1.898934801419576

rmse: 1.3780184040667658

mse: 1.8989347219467163

Improvement of 0.12351346015930176 observed, resetting counter. 
Early stopping counter: 0/2

Starting epoch 12


100%|██████████| 971/971 [00:22<00:00, 42.43it/s]



train_loss_epoch: 0.7626882120583256


100%|██████████| 6/6 [00:00<00:00, 14.72it/s]



mae: 0.9977782368659973

eval_loss_epoch: 1.6841108997662861

rmse: 1.2977330230472526

mse: 1.6841109991073608

Improvement of 0.06019502878189087 observed, resetting counter. 
Early stopping counter: 0/2

Starting epoch 13


100%|██████████| 971/971 [00:22<00:00, 43.88it/s]



train_loss_epoch: 0.6493910940250825


100%|██████████| 6/6 [00:00<00:00, 15.11it/s]



mae: 0.9653714299201965

eval_loss_epoch: 1.5768212874730427

rmse: 1.2557154642710555

mse: 1.5768213272094727

Improvement of 0.03240680694580078 observed, resetting counter. 
Early stopping counter: 0/2

Starting epoch 14


100%|██████████| 971/971 [00:22<00:00, 43.17it/s]



train_loss_epoch: 0.5876042361608126


100%|██████████| 6/6 [00:00<00:00, 15.26it/s]



mae: 0.9496553540229797

eval_loss_epoch: 1.5230658650398254

rmse: 1.2341255708575487

mse: 1.5230659246444702

Improvement of 0.015716075897216797 observed, resetting counter. 
Early stopping counter: 0/2

Starting epoch 15


100%|██████████| 971/971 [00:22<00:00, 43.23it/s]



train_loss_epoch: 0.5365160376478052


100%|██████████| 6/6 [00:00<00:00, 15.23it/s]



mae: 0.9332759976387024

eval_loss_epoch: 1.4715567429860432

rmse: 1.2130773526503509

mse: 1.4715566635131836

Improvement of 0.016379356384277344 observed, resetting counter. 
Early stopping counter: 0/2

Starting epoch 16


100%|██████████| 971/971 [00:22<00:00, 42.80it/s]



train_loss_epoch: 0.48783980191797477


100%|██████████| 6/6 [00:00<00:00, 14.65it/s]



mae: 0.9213075637817383

eval_loss_epoch: 1.4255497852961223

rmse: 1.193963929425417

mse: 1.425549864768982

Improvement of 0.011968433856964111 observed, resetting counter. 
Early stopping counter: 0/2

Starting epoch 17


100%|██████████| 971/971 [00:23<00:00, 42.15it/s]



train_loss_epoch: 0.43791735835227613


100%|██████████| 6/6 [00:00<00:00, 14.22it/s]



mae: 0.9033301472663879

eval_loss_epoch: 1.3603304823239644

rmse: 1.1663321060765837

mse: 1.360330581665039

Improvement of 0.017977416515350342 observed, resetting counter. 
Early stopping counter: 0/2

Starting epoch 18


100%|██████████| 971/971 [00:22<00:00, 43.40it/s]



train_loss_epoch: 0.38761754252638064


100%|██████████| 6/6 [00:00<00:00, 14.99it/s]



mae: 0.892224133014679

eval_loss_epoch: 1.3282848397890727

rmse: 1.1525123602148473

mse: 1.328284740447998

Improvement of 0.011106014251708984 observed, resetting counter. 
Early stopping counter: 0/2

Starting epoch 19


100%|██████████| 971/971 [00:22<00:00, 43.21it/s]



train_loss_epoch: 0.3391927664852044


100%|██████████| 6/6 [00:00<00:00, 14.48it/s]



mae: 0.8964769244194031

eval_loss_epoch: 1.3362776637077332

rmse: 1.1559747419831838

mse: 1.3362776041030884
No improvement above threshold observed, incrementing counter. 
Early stopping counter: 1/2

Starting epoch 20


100%|██████████| 971/971 [00:22<00:00, 42.99it/s]



train_loss_epoch: 0.29504243754824944


100%|██████████| 6/6 [00:00<00:00, 15.36it/s]



mae: 0.8905065655708313

eval_loss_epoch: 1.321727176507314

rmse: 1.1496639320423596

mse: 1.3217271566390991

Improvement of 0.0017175674438476562 observed, resetting counter. 
Early stopping counter: 0/2

Starting epoch 21


100%|██████████| 971/971 [00:22<00:00, 43.01it/s]



train_loss_epoch: 0.25342329616833914


100%|██████████| 6/6 [00:00<00:00, 14.97it/s]



mae: 0.8901656270027161

eval_loss_epoch: 1.3163430293401082

rmse: 1.1473199506138374

mse: 1.316343069076538
No improvement above threshold observed, incrementing counter. 
Early stopping counter: 1/2

Starting epoch 22


100%|██████████| 971/971 [00:22<00:00, 42.68it/s]



train_loss_epoch: 0.2163390919612193


100%|██████████| 6/6 [00:00<00:00, 14.88it/s]



mae: 0.8915655612945557

eval_loss_epoch: 1.3194273908933003

rmse: 1.148663297500658

mse: 1.3194273710250854
No improvement above threshold observed, incrementing counter. 
Early stopping counter: 2/2
Stopping training due to no improvement after 2 epochs
Finishing training run
Loading checkpoint with mae: 0.8901656270027161


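Comparing this to our baseline, we can see that there is an improvement: the best checkpoint reaches an MAE of around 0.890 and an RMSE of around 1.147, compared with 0.912 and 1.237 for the median baseline.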

## Sequential recommendations using a transformer

Using matrix factorization, we are treating each rating as being independent of the ratings around it; however, incorporating information about other movies that a user recently rated could provide an additional signal that could boost performance. For example, suppose that a user is watching a trilogy of films; if they have rated the first two instalments highly, they are likely to do the same for the finale!

One way that we can approach this is to use a transformer network, specifically the encoder portion, to encode additional context into the learned embeddings for each movie, and then use a fully connected neural network to make the rating predictions.

### Pre-processing the data

The first step is to process our data so that we have a time-sorted list of movies for each user. Let's start by grouping all the ratings by user:

In [28]:
grouped_ratings = ratings_df.sort_values(by='unix_timestamp').groupby('user_id').agg(tuple).reset_index()

In [29]:
grouped_ratings

Unnamed: 0,user_id,title,rating,unix_timestamp,is_valid
0,1,"(Girl, Interrupted (1999), Cinderella (1950), ...","(4, 5, 4, 5, 3, 5, 4, 4, 5, 4, 5, 3, 4, 4, 4, ...","(978300019, 978300055, 978300055, 978300055, 9...","(False, False, False, False, False, False, Fal..."
1,10,"(Godfather, The (1972), Pretty Woman (1990), S...","(3, 4, 3, 4, 4, 3, 5, 5, 5, 3, 3, 4, 5, 4, 4, ...","(978224375, 978224375, 978224375, 978224400, 9...","(False, False, False, False, False, False, Fal..."
2,100,"(Starship Troopers (1997), Star Wars: Episode ...","(3, 4, 4, 3, 4, 3, 1, 1, 5, 4, 4, 3, 4, 2, 3, ...","(977593595, 977593595, 977593607, 977593624, 9...","(False, False, False, False, False, False, Fal..."
3,1000,"(Cat on a Hot Tin Roof (1958), Licence to Kill...","(4, 4, 5, 3, 5, 5, 2, 5, 4, 4, 5, 3, 5, 5, 5, ...","(975040566, 975040566, 975040566, 975040629, 9...","(False, False, False, False, False, False, Fal..."
4,1001,"(Raiders of the Lost Ark (1981), Guinevere (19...","(4, 4, 4, 2, 2, 1, 5, 4, 5, 4, 4, 4, 4, 3, 4, ...","(975039591, 975039702, 975039702, 975039898, 9...","(False, False, False, False, False, False, Fal..."
...,...,...,...,...,...
6035,995,"(Six Days Seven Nights (1998), Star Wars: Epis...","(2, 4, 5, 4, 3, 3, 4, 4, 3, 5, 5, 5, 5, 5, 5, ...","(975054785, 975054785, 975054785, 975054853, 9...","(False, False, False, False, False, False, Fal..."
6036,996,"(Nightmare on Elm Street, A (1984), St. Elmo's...","(4, 3, 5, 3, 5, 5, 5, 5, 4, 2, 5, 5, 5, 4, 5, ...","(975052132, 975052132, 975052195, 975052284, 9...","(False, False, False, False, False, False, Fal..."
6037,997,(Star Wars: Episode V - The Empire Strikes Bac...,"(4, 3, 3, 3, 2, 5, 5, 5, 4, 4, 5, 4, 4, 3, 4, ...","(975044235, 975044425, 975044426, 975044426, 9...","(False, False, False, False, False, False, Fal..."
6038,998,"(Butcher's Wife, The (1991), E.T. the Extra-Te...","(3, 5, 4, 5, 3, 4, 4, 3, 4, 4, 4, 4, 4, 5, 4, ...","(975043499, 975043593, 975043593, 975043593, 9...","(False, False, False, False, False, False, Fal..."


Now that we have grouped by user, we can create an additional column so that we can see the number of ratings associated with each user:

In [30]:
grouped_ratings['num_ratings'] = grouped_ratings['rating'].apply(len)

Let's take a look at the new DataFrame:

In [31]:
grouped_ratings

Unnamed: 0,user_id,title,rating,unix_timestamp,is_valid,num_ratings
0,1,"(Girl, Interrupted (1999), Cinderella (1950), ...","(4, 5, 4, 5, 3, 5, 4, 4, 5, 4, 5, 3, 4, 4, 4, ...","(978300019, 978300055, 978300055, 978300055, 9...","(False, False, False, False, False, False, Fal...",53
1,10,"(Godfather, The (1972), Pretty Woman (1990), S...","(3, 4, 3, 4, 4, 3, 5, 5, 5, 3, 3, 4, 5, 4, 4, ...","(978224375, 978224375, 978224375, 978224400, 9...","(False, False, False, False, False, False, Fal...",401
2,100,"(Starship Troopers (1997), Star Wars: Episode ...","(3, 4, 4, 3, 4, 3, 1, 1, 5, 4, 4, 3, 4, 2, 3, ...","(977593595, 977593595, 977593607, 977593624, 9...","(False, False, False, False, False, False, Fal...",76
3,1000,"(Cat on a Hot Tin Roof (1958), Licence to Kill...","(4, 4, 5, 3, 5, 5, 2, 5, 4, 4, 5, 3, 5, 5, 5, ...","(975040566, 975040566, 975040566, 975040629, 9...","(False, False, False, False, False, False, Fal...",84
4,1001,"(Raiders of the Lost Ark (1981), Guinevere (19...","(4, 4, 4, 2, 2, 1, 5, 4, 5, 4, 4, 4, 4, 3, 4, ...","(975039591, 975039702, 975039702, 975039898, 9...","(False, False, False, False, False, False, Fal...",377
...,...,...,...,...,...,...
6035,995,"(Six Days Seven Nights (1998), Star Wars: Epis...","(2, 4, 5, 4, 3, 3, 4, 4, 3, 5, 5, 5, 5, 5, 5, ...","(975054785, 975054785, 975054785, 975054853, 9...","(False, False, False, False, False, False, Fal...",49
6036,996,"(Nightmare on Elm Street, A (1984), St. Elmo's...","(4, 3, 5, 3, 5, 5, 5, 5, 4, 2, 5, 5, 5, 4, 5, ...","(975052132, 975052132, 975052195, 975052284, 9...","(False, False, False, False, False, False, Fal...",296
6037,997,(Star Wars: Episode V - The Empire Strikes Bac...,"(4, 3, 3, 3, 2, 5, 5, 5, 4, 4, 5, 4, 4, 3, 4, ...","(975044235, 975044425, 975044426, 975044426, 9...","(False, False, False, False, False, False, Fal...",30
6038,998,"(Butcher's Wife, The (1991), E.T. the Extra-Te...","(3, 5, 4, 5, 3, 4, 4, 3, 4, 4, 4, 4, 4, 5, 4, ...","(975043499, 975043593, 975043593, 975043593, 9...","(False, False, False, False, False, False, Fal...",135


Now that we have grouped all the ratings for each user, let's divide these into smaller sequences. To make the most out of the data, we would like the model to have the opportunity to predict a rating for every movie in the training set. To do this, let's specify a sequence length s and use the previous s-1 ratings as our user history.

As the model expects each sequence to be a fixed length, we will fill empty spaces with a padding token, so that sequences can be batched and passed to the model. Let's create a function to do this.

We are going to arbitrarily choose a length of 10 here.

In [32]:
sequence_length = 10

In [33]:
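def create_sequences(values, sequence_length):
    """Return one fixed-length window per element, each ending at that element."""
    sequences = []
    for i in range(len(values)):
        # Keep at most the last `sequence_length` values up to and including index i
        seq = tuple(values[max(0, i - sequence_length + 1):i + 1])
        # Left-pad short windows so that every sequence has the same length
        if len(seq) < sequence_length:
            seq = (*(['[PAD]'] * (sequence_length - len(seq))), *seq)
        sequences.append(seq)
    return sequences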

To visualize how this function works, let's apply it, with a sequence length of 3, to the first 10 movies rated by the first user. These movies are:

In [34]:
grouped_ratings.iloc[0]['title'][:10]

('Girl, Interrupted (1999)',
 'Cinderella (1950)',
 'Titanic (1997)',
 'Back to the Future (1985)',
 'Meet Joe Black (1998)',
 'Last Days of Disco, The (1998)',
 'Erin Brockovich (2000)',
 'To Kill a Mockingbird (1962)',
 'Christmas Story, A (1983)',
 'Star Wars: Episode IV - A New Hope (1977)')

Applying our function, we have:

In [35]:
create_sequences(grouped_ratings.iloc[0]['title'][:10], 3)

[('[PAD]', '[PAD]', 'Girl, Interrupted (1999)'),
 ('[PAD]', 'Girl, Interrupted (1999)', 'Cinderella (1950)'),
 ('Girl, Interrupted (1999)', 'Cinderella (1950)', 'Titanic (1997)'),
 ('Cinderella (1950)', 'Titanic (1997)', 'Back to the Future (1985)'),
 ('Titanic (1997)', 'Back to the Future (1985)', 'Meet Joe Black (1998)'),
 ('Back to the Future (1985)',
  'Meet Joe Black (1998)',
  'Last Days of Disco, The (1998)'),
 ('Meet Joe Black (1998)',
  'Last Days of Disco, The (1998)',
  'Erin Brockovich (2000)'),
 ('Last Days of Disco, The (1998)',
  'Erin Brockovich (2000)',
  'To Kill a Mockingbird (1962)'),
 ('Erin Brockovich (2000)',
  'To Kill a Mockingbird (1962)',
  'Christmas Story, A (1983)'),
 ('To Kill a Mockingbird (1962)',
  'Christmas Story, A (1983)',
  'Star Wars: Episode IV - A New Hope (1977)')]

As we can see, we have 10 sequences of length 3, each ending with the corresponding movie from the original list; the first two sequences are padded because fewer than 3 movies had been rated at that point.

Now, let's apply this function to all of the features in our DataFrame:

In [36]:
grouped_cols = ['title', 'rating', 'unix_timestamp', 'is_valid'] 
for col in grouped_cols:
    grouped_ratings[col] = grouped_ratings[col].apply(lambda x: create_sequences(x, sequence_length))

In [37]:
grouped_ratings.head(2)

Unnamed: 0,user_id,title,rating,unix_timestamp,is_valid,num_ratings
0,1,"[([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [P...","[([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [P...","[([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [P...","[([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [P...",53
1,10,"[([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [P...","[([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [P...","[([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [P...","[([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [P...",401


Currently, we have one row that contains all the sequences for a certain user. However, during training, we would like to create batches made up of sequences from many different users. To do this, we will have to transform the data so that each sequence has its own row, while remaining associated with the user ID. We can use the pandas 'explode' function for each feature, and then aggregate these DataFrames together.

In [38]:
exploded_ratings = grouped_ratings[['user_id', 'title']].explode('title', ignore_index=True)
dfs = [grouped_ratings[[col]].explode(col, ignore_index=True) for col in grouped_cols[1:]]
seq_df = pd.concat([exploded_ratings, *dfs], axis=1)
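
To see what the explode step does, here is a minimal sketch on a toy DataFrame; the toy data and the toy_ names are made up for illustration. Each list of sequences becomes one row per sequence, and because every column is exploded in the same order, concatenating the results keeps the features aligned. As an aside, pandas 1.3 and later can explode several columns in a single call, which should give the same result.

```python
import pandas as pd

# Toy stand-in for grouped_ratings: one row per user, each cell a list of sequences
toy = pd.DataFrame(
    {
        "user_id": ["1", "2"],
        "title": [[("[PAD]", "A"), ("A", "B")], [("[PAD]", "C")]],
        "rating": [[("[PAD]", 4), (4, 5)], [("[PAD]", 3)]],
    }
)

# Same approach as above: explode each column separately, then concatenate
toy_titles = toy[["user_id", "title"]].explode("title", ignore_index=True)
toy_ratings = toy[["rating"]].explode("rating", ignore_index=True)
toy_seq = pd.concat([toy_titles, toy_ratings], axis=1)

# Alternative for pandas >= 1.3: explode both columns at once
toy_seq_multi = toy.explode(["title", "rating"], ignore_index=True)

print(toy_seq)
```

Only the outer lists are exploded; the tuples that hold each sequence are left intact.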

In [39]:
seq_df.head()

Unnamed: 0,user_id,title,rating,unix_timestamp,is_valid
0,1,"([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PA...","([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PA...","([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PA...","([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PA..."
1,1,"([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PA...","([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PA...","([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PA...","([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PA..."
2,1,"([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PA...","([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PA...","([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PA...","([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PA..."
3,1,"([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], Gir...","([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], 4, ...","([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], 978...","([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], Fal..."
4,1,"([PAD], [PAD], [PAD], [PAD], [PAD], Girl, Inte...","([PAD], [PAD], [PAD], [PAD], [PAD], 4, 5, 4, 5...","([PAD], [PAD], [PAD], [PAD], [PAD], 978300019,...","([PAD], [PAD], [PAD], [PAD], [PAD], False, Fal..."


Now, we can see that each sequence has its own row. However, for the is_valid column, we don't care about the whole sequence and only need the last value, as this corresponds to the movie for which we will be trying to predict the rating. Let's create a function to extract this value and apply it to this column.

In [40]:
def get_last_entry(sequence):
    return sequence[-1]

seq_df['is_valid'] = seq_df['is_valid'].apply(get_last_entry)

In [41]:
seq_df

Unnamed: 0,user_id,title,rating,unix_timestamp,is_valid
0,1,"([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PA...","([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PA...","([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PA...",False
1,1,"([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PA...","([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PA...","([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PA...",False
2,1,"([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PA...","([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PA...","([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PA...",False
3,1,"([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], Gir...","([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], 4, ...","([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], 978...",False
4,1,"([PAD], [PAD], [PAD], [PAD], [PAD], Girl, Inte...","([PAD], [PAD], [PAD], [PAD], [PAD], 4, 5, 4, 5...","([PAD], [PAD], [PAD], [PAD], [PAD], 978300019,...",False
...,...,...,...,...,...
1000204,999,"(General's Daughter, The (1999), Powder (1995)...","(3, 3, 2, 1, 3, 2, 3, 2, 4, 3)","(975364681, 975364717, 975364717, 975364717, 9...",False
1000205,999,"(Powder (1995), We're No Angels (1989), Out of...","(3, 2, 1, 3, 2, 3, 2, 4, 3, 3)","(975364717, 975364717, 975364717, 975364743, 9...",False
1000206,999,"(We're No Angels (1989), Out of Africa (1985),...","(2, 1, 3, 2, 3, 2, 4, 3, 3, 3)","(975364717, 975364717, 975364743, 975364743, 9...",False
1000207,999,"(Out of Africa (1985), Instinct (1999), Corrup...","(1, 3, 2, 3, 2, 4, 3, 3, 3, 2)","(975364717, 975364743, 975364743, 975364784, 9...",False


Also, to make it easy to access the rating that we are trying to predict, let's separate this into its own column.

In [42]:
seq_df['target_rating'] = seq_df['rating'].apply(get_last_entry)
seq_df['previous_ratings'] = seq_df['rating'].apply(lambda seq: seq[:-1])
seq_df.drop(columns=['rating'], inplace=True)

To prevent the model from including padding tokens when calculating attention scores, we can provide an attention mask to the transformer; the mask should be 'True' for a padding token and 'False' otherwise. Let's calculate this for each row, as well as creating a column to show the number of padding tokens present.

In [43]:
seq_df['pad_mask'] = seq_df['title'].apply(lambda x: (np.array(x) == '[PAD]'))
seq_df['num_pads'] = seq_df['pad_mask'].apply(sum)
seq_df['pad_mask'] = seq_df['pad_mask'].apply(lambda x: x.tolist()) # in case we serialize later
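
As a rough illustration of what this mask does, the sketch below passes a toy sequence through a standalone PyTorch attention layer. The toy sequence, the random inputs and the layer sizes are all assumptions made for the example; the model defined later uses a full transformer encoder rather than this single layer.

```python
import numpy as np
import torch
from torch import nn

torch.manual_seed(0)

# A toy title sequence with two padding tokens; the mask is built as for pad_mask above
toy_titles = ("[PAD]", "[PAD]", "Titanic (1997)", "Cinderella (1950)")
toy_mask = torch.tensor((np.array(toy_titles) == "[PAD]").tolist()).unsqueeze(0)  # (1, 4)

# Random vectors stand in for the movie embeddings: (batch, sequence, embedding)
toy_inputs = torch.randn(1, 4, 8)

attention = nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True)
_, attention_weights = attention(
    toy_inputs, toy_inputs, toy_inputs, key_padding_mask=toy_mask
)

# Rows are query positions, columns are key positions; the first two columns
# (the padded keys) receive zero weight, and each row still sums to one
print(attention_weights[0])
```

Positions marked True are excluded as keys, so no position can attend to the padding.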

Let's inspect the transformed data:

In [44]:
seq_df

Unnamed: 0,user_id,title,unix_timestamp,is_valid,target_rating,previous_ratings,pad_mask,num_pads
0,1,"([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PA...","([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PA...",False,4,"([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PA...","[True, True, True, True, True, True, True, Tru...",9
1,1,"([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PA...","([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PA...",False,5,"([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PA...","[True, True, True, True, True, True, True, Tru...",8
2,1,"([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PA...","([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PA...",False,4,"([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PA...","[True, True, True, True, True, True, True, Fal...",7
3,1,"([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], Gir...","([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], 978...",False,5,"([PAD], [PAD], [PAD], [PAD], [PAD], [PAD], 4, ...","[True, True, True, True, True, True, False, Fa...",6
4,1,"([PAD], [PAD], [PAD], [PAD], [PAD], Girl, Inte...","([PAD], [PAD], [PAD], [PAD], [PAD], 978300019,...",False,3,"([PAD], [PAD], [PAD], [PAD], [PAD], 4, 5, 4, 5)","[True, True, True, True, True, False, False, F...",5
...,...,...,...,...,...,...,...,...
1000204,999,"(General's Daughter, The (1999), Powder (1995)...","(975364681, 975364717, 975364717, 975364717, 9...",False,3,"(3, 3, 2, 1, 3, 2, 3, 2, 4)","[False, False, False, False, False, False, Fal...",0
1000205,999,"(Powder (1995), We're No Angels (1989), Out of...","(975364717, 975364717, 975364717, 975364743, 9...",False,3,"(3, 2, 1, 3, 2, 3, 2, 4, 3)","[False, False, False, False, False, False, Fal...",0
1000206,999,"(We're No Angels (1989), Out of Africa (1985),...","(975364717, 975364717, 975364743, 975364743, 9...",False,3,"(2, 1, 3, 2, 3, 2, 4, 3, 3)","[False, False, False, False, False, False, Fal...",0
1000207,999,"(Out of Africa (1985), Instinct (1999), Corrup...","(975364717, 975364743, 975364743, 975364784, 9...",False,2,"(1, 3, 2, 3, 2, 4, 3, 3, 3)","[False, False, False, False, False, False, Fal...",0


All looks as it should! Let's split this into training and validation sets.

In [45]:
train_seq_df = seq_df[seq_df.is_valid == False]
valid_seq_df = seq_df[seq_df.is_valid == True]

### Training the model

As we saw previously, before we can feed this data into the model, we need to create lookup tables to encode our movies and users. However, this time, we need to include the padding token in our movie lookup.

In [46]:
user_lookup = {v: i+1 for i, v in enumerate(ratings_df['user_id'].unique())}

In [47]:
def create_feature_lookup(df, feature):
    lookup = {v: i+1 for i, v in enumerate(df[feature].unique())}
    lookup['[PAD]'] = 0
    return lookup

In [48]:
movie_lookup = create_feature_lookup(ratings_df, 'title')
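
The reason for reserving index 0 for the padding token becomes clearer when the lookup is paired with an embedding layer. The following sketch uses a made-up toy DataFrame and an arbitrary embedding size of 4; it is only meant to show that, with padding_idx=0, the padding token maps to a vector of zeros.

```python
import pandas as pd
import torch
from torch import nn

toy_ratings = pd.DataFrame(
    {"title": ["Titanic (1997)", "Cinderella (1950)", "Titanic (1997)"]}
)

# Same logic as create_feature_lookup: real titles start at 1, padding takes 0
toy_lookup = {v: i + 1 for i, v in enumerate(toy_ratings["title"].unique())}
toy_lookup["[PAD]"] = 0

# One row per entry in the lookup; padding_idx=0 pins row 0 to zeros
toy_embeddings = nn.Embedding(len(toy_lookup), 4, padding_idx=0)

toy_encoded = torch.tensor([toy_lookup[t] for t in ("[PAD]", "Titanic (1997)")])
toy_vectors = toy_embeddings(toy_encoded)
print(toy_vectors)
```

The first row of the output is all zeros, and PyTorch also keeps the gradient for that row at zero, so the padding vector stays fixed during training.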

Now, we are dealing with sequences of ratings, rather than individual ones, so we will need to create a new dataset to wrap our processed DataFrame:

In [49]:
class MovieSequenceDataset(Dataset):
    def __init__(self, df, movie_lookup, user_lookup):
        super().__init__()
        self.df = df
        self.movie_lookup = movie_lookup
        self.user_lookup = user_lookup

    def __len__(self):
        return len(self.df)

    def __getitem__(self, index):
        data = self.df.iloc[index]
        user_id = self.user_lookup[str(data.user_id)]
        movie_ids = torch.tensor([self.movie_lookup[title] for title in data.title])

        previous_ratings = torch.tensor(
            [rating if rating != "[PAD]" else 0 for rating in data.previous_ratings]
        )

        attention_mask = torch.tensor(data.pad_mask)
        target_rating = data.target_rating
        encoded_features = {
            "user_id": user_id,
            "movie_ids": movie_ids,
            "ratings": previous_ratings,
        }

        return (encoded_features, attention_mask), torch.tensor(
            target_rating, dtype=torch.float32
        )


In [50]:
train_dataset = MovieSequenceDataset(train_seq_df, movie_lookup, user_lookup)
valid_dataset = MovieSequenceDataset(valid_seq_df, movie_lookup, user_lookup)
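
It can help to check what a batch looks like once samples of this form are collated. The sketch below uses a hypothetical ToySequenceDataset that only mimics the return structure of MovieSequenceDataset, and it assumes that batches are built with PyTorch's default collation.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ToySequenceDataset(Dataset):
    """Hypothetical dataset that mimics the return structure of MovieSequenceDataset."""

    def __len__(self):
        return 4

    def __getitem__(self, index):
        encoded_features = {
            "user_id": index + 1,
            "movie_ids": torch.tensor([0, 3, 7]),
            "ratings": torch.tensor([0, 4]),
        }
        attention_mask = torch.tensor([True, False, False])
        return (encoded_features, attention_mask), torch.tensor(4.0)


toy_loader = DataLoader(ToySequenceDataset(), batch_size=2)
(toy_features, toy_mask), toy_targets = next(iter(toy_loader))

# Every entry gains a leading batch dimension, including the plain integer user_id
for name, value in toy_features.items():
    print(name, tuple(value.shape))
print("mask", tuple(toy_mask.shape), "targets", tuple(toy_targets.shape))
```

The dictionary of features survives collation, so the model can index it by name in its forward method.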

Now, let's define our transformer model! As a start, given that the matrix factorization model can achieve good performance using only the user and movie IDs, let's only include this information for now.

In [51]:
class BstTransformer(nn.Module):
    def __init__(
        self,
        movies_num_unique,
        users_num_unique,
        sequence_length=10,
        embedding_size=120,
        num_transformer_layers=1,
        ratings_range=(0.5, 5.5),
    ):
        super().__init__()
        self.sequence_length = sequence_length
        self.y_range = ratings_range
        self.movies_embeddings = nn.Embedding(
            movies_num_unique + 1, embedding_size, padding_idx=0
        )
        self.user_embeddings = nn.Embedding(users_num_unique + 1, embedding_size)
        self.position_embeddings = nn.Embedding(sequence_length, embedding_size)

        self.encoder = nn.TransformerEncoder(
            encoder_layer=nn.TransformerEncoderLayer(
                d_model=embedding_size,
                nhead=12,
                dropout=0.1,
                batch_first=True,
                activation="gelu",
            ),
            num_layers=num_transformer_layers,
        )

        self.linear = nn.Sequential(
            nn.Linear(
                embedding_size + (embedding_size * sequence_length),
                1024,
            ),
            nn.BatchNorm1d(1024),
            nn.Mish(),
            nn.Linear(1024, 512),
            nn.BatchNorm1d(512),
            nn.Mish(),
            nn.Dropout(0.2),
            nn.Linear(512, 256),
            nn.BatchNorm1d(256),
            nn.Mish(),
            nn.Linear(256, 1),
            nn.Sigmoid(),
        )

    def forward(self, inputs):
        features, mask = inputs

        encoded_user_id = self.user_embeddings(features["user_id"])

        user_features = encoded_user_id

        encoded_movies = self.movies_embeddings(features["movie_ids"])

        positions = torch.arange(
            0, self.sequence_length, 1, dtype=int, device=features["movie_ids"].device
        )
        positions = self.position_embeddings(positions)

        transformer_features = encoded_movies + positions

        transformer_output = self.encoder(
            transformer_features, src_key_padding_mask=mask
        )
        transformer_output = torch.flatten(transformer_output, start_dim=1)

        combined_output = torch.cat((transformer_output, user_features), dim=1)

        rating = self.linear(combined_output)
        rating = rating.squeeze()
        if self.y_range is None:
            return rating
        else:
            return rating * (self.y_range[1] - self.y_range[0]) + self.y_range[0]


We can see that, by default, we feed our sequence of movie embeddings into a single transformer layer, before concatenating the output with the user features (here, just the user ID embedding) and using this as the input to a fully connected network. To represent the order in which the movies were rated, we use a simple learned positional embedding; using a sine- and cosine-based approach provided no benefit during my experiments, but feel free to try it out if you are interested!
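
For readers who do want to try the sine- and cosine-based alternative, here is a minimal sketch of the fixed encoding described in the original transformer paper, "Attention Is All You Need". The function name is hypothetical, the sketch assumes an even embedding size, and it is not what was used to produce the results in this article.

```python
import math

import torch


def sinusoidal_positions(sequence_length, embedding_size):
    """Return a (sequence_length, embedding_size) tensor of fixed positional encodings."""
    positions = torch.arange(sequence_length, dtype=torch.float32).unsqueeze(1)
    # Frequencies decrease geometrically across pairs of embedding dimensions
    div_term = torch.exp(
        torch.arange(0, embedding_size, 2, dtype=torch.float32)
        * (-math.log(10000.0) / embedding_size)
    )
    encoding = torch.zeros(sequence_length, embedding_size)
    encoding[:, 0::2] = torch.sin(positions * div_term)  # even dimensions
    encoding[:, 1::2] = torch.cos(positions * div_term)  # odd dimensions
    return encoding


toy_encoding = sinusoidal_positions(10, 120)
print(toy_encoding.shape)
```

One way to experiment with this would be to add the resulting tensor to the movie embeddings in place of the output of position_embeddings, after moving it to the same device as the inputs.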

Once again, let's define a training function for this model; except for the model initialization, this is identical to the one we used to train the matrix factorization model.

In [52]:
def train_seq_model():
    model = BstTransformer(
        len(movie_lookup), len(user_lookup), sequence_length, embedding_size=120
    )
    loss_func = torch.nn.MSELoss()

    optimizer = torch.optim.AdamW(model.parameters(), lr=0.01)

    create_sched_fn = partial(
        torch.optim.lr_scheduler.OneCycleLR,
        max_lr=0.01,
        epochs=TrainerPlaceholderValues.NUM_EPOCHS,
        steps_per_epoch=TrainerPlaceholderValues.NUM_UPDATE_STEPS_PER_EPOCH,
    )

    trainer = Trainer(
        model=model,
        loss_func=loss_func,
        optimizer=optimizer,
        callbacks=(
            RecommenderMetricsCallback,
            *DEFAULT_CALLBACKS,
            SaveBestModelCallback(watch_metric="mae"),
            EarlyStoppingCallback(
                early_stopping_patience=2,
                early_stopping_threshold=0.001,
                watch_metric="mae",
            ),
        ),
    )

    trainer.train(
        train_dataset=train_dataset,
        eval_dataset=valid_dataset,
        num_epochs=10,
        per_device_batch_size=512,
        create_scheduler_fn=create_sched_fn,
    )
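
To get a feel for the learning rate schedule configured above, the following sketch steps a OneCycleLR scheduler on a dummy parameter and records the learning rate. The number of steps per epoch is reduced to 100 purely to keep the loop short, and the dummy parameter and toy_ names are assumptions for the example; in the training function, the real values are filled in by the trainer through the placeholder values.

```python
import torch

toy_params = [torch.nn.Parameter(torch.zeros(1))]
toy_optimizer = torch.optim.AdamW(toy_params, lr=0.01)

# Smaller than the real run so that the loop is quick
num_epochs, steps_per_epoch = 10, 100
toy_scheduler = torch.optim.lr_scheduler.OneCycleLR(
    toy_optimizer, max_lr=0.01, epochs=num_epochs, steps_per_epoch=steps_per_epoch
)

learning_rates = []
for _ in range(num_epochs * steps_per_epoch - 1):
    learning_rates.append(toy_optimizer.param_groups[0]["lr"])
    toy_optimizer.step()  # no gradients here, so the parameter is left unchanged
    toy_scheduler.step()

peak = max(learning_rates)
print(f"start: {learning_rates[0]:.6f}")
print(f"peak:  {peak:.6f} at step {learning_rates.index(peak)}")
print(f"end:   {learning_rates[-1]:.8f}")
```

With PyTorch's default settings, the learning rate starts at max_lr divided by 25, warms up to max_lr over roughly the first 30% of the steps, and then anneals to a value far below the starting point.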


In [69]:
notebook_launcher(train_seq_model, num_processes=2)

Launching a training on 2 GPUs.

Starting training run

Starting epoch 1


100%|██████████| 971/971 [00:44<00:00, 21.81it/s]



train_loss_epoch: 0.9955023087630188


100%|██████████| 6/6 [00:00<00:00, 10.53it/s]



mae: 0.7939572930335999

mse: 0.9927792549133301

eval_loss_epoch: 0.9927793244520823

rmse: 0.9963830864247597

Starting epoch 2


100%|██████████| 971/971 [00:44<00:00, 21.87it/s]



train_loss_epoch: 0.8509480904722557


100%|██████████| 6/6 [00:00<00:00, 10.69it/s]



mae: 0.7802140116691589

mse: 0.9521594047546387

eval_loss_epoch: 0.9521594146887461

rmse: 0.9757865569655275

Improvement of 0.013743281364440918 observed, resetting counter. 
Early stopping counter: 0/2

Starting epoch 3


100%|██████████| 971/971 [00:44<00:00, 21.90it/s]



train_loss_epoch: 0.8159997655974603


100%|██████████| 6/6 [00:00<00:00, 10.80it/s]



mae: 0.7579830288887024

mse: 0.915351152420044

eval_loss_epoch: 0.9153511722882589

rmse: 0.9567398561887364

Improvement of 0.022230982780456543 observed, resetting counter. 
Early stopping counter: 0/2

Starting epoch 4


100%|██████████| 971/971 [00:44<00:00, 21.79it/s]



train_loss_epoch: 0.7925404456322274


100%|██████████| 6/6 [00:00<00:00, 10.75it/s]



mae: 0.7406826615333557

mse: 0.8825389742851257

eval_loss_epoch: 0.8825389941533407

rmse: 0.9394354550926454

Improvement of 0.01730036735534668 observed, resetting counter. 
Early stopping counter: 0/2

Starting epoch 5


100%|██████████| 971/971 [00:44<00:00, 21.96it/s]



train_loss_epoch: 0.7654586238546793


100%|██████████| 6/6 [00:00<00:00, 10.43it/s]



mae: 0.7357890009880066

mse: 0.8756368160247803

eval_loss_epoch: 0.8756367762883505

rmse: 0.935754677265778

Improvement of 0.004893660545349121 observed, resetting counter. 
Early stopping counter: 0/2

Starting epoch 6


100%|██████████| 971/971 [00:44<00:00, 21.81it/s]



train_loss_epoch: 0.7475585157912988


100%|██████████| 6/6 [00:00<00:00, 10.07it/s]



mae: 0.7258664965629578

mse: 0.8621974587440491

eval_loss_epoch: 0.8621974885463715

rmse: 0.9285458840273049

Improvement of 0.009922504425048828 observed, resetting counter. 
Early stopping counter: 0/2

Starting epoch 7


100%|██████████| 971/971 [00:44<00:00, 21.92it/s]



train_loss_epoch: 0.7325706990199772


100%|██████████| 6/6 [00:00<00:00, 10.68it/s]



mae: 0.7262701988220215

mse: 0.8640387654304504

eval_loss_epoch: 0.8640388051668803

rmse: 0.9295368553373505
No improvement above threshold observed, incrementing counter. 
Early stopping counter: 1/2

Starting epoch 8


100%|██████████| 971/971 [00:44<00:00, 21.80it/s]



train_loss_epoch: 0.7145110318393099


100%|██████████| 6/6 [00:00<00:00, 10.11it/s]



mae: 0.7315012812614441

mse: 0.8688936829566956

eval_loss_epoch: 0.8688936630884806

rmse: 0.9321446684698119
No improvement above threshold observed, incrementing counter. 
Early stopping counter: 2/2
Stopping training due to no improvement after 2 epochs
Finishing training run
Loading checkpoint with mae: 0.7258664965629578


The best checkpoint reaches a validation MAE of about 0.726, which is a significant improvement over the matrix factorization approach!
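
The early stopping messages in the log can be reproduced with a few lines of plain Python. This is a simplified re-implementation written for illustration, not the callback's actual code; it assumes that each epoch is compared against the best MAE seen so far, which is consistent with the improvements reported in the log.

```python
# Validation MAE per epoch, copied from the training log above
mae_per_epoch = [
    0.7939572930335999,
    0.7802140116691589,
    0.7579830288887024,
    0.7406826615333557,
    0.7357890009880066,
    0.7258664965629578,
    0.7262701988220215,
    0.7315012812614441,
]
patience, threshold = 2, 0.001

best_mae, counter, stopped_at = mae_per_epoch[0], 0, None
for epoch, mae in enumerate(mae_per_epoch[1:], start=2):
    improvement = best_mae - mae  # lower MAE is better
    if improvement > threshold:
        best_mae, counter = mae, 0
    else:
        counter += 1
    if counter >= patience:
        stopped_at = epoch
        break

print(f"stopped after epoch {stopped_at}, best mae {best_mae}")
```

Replaying the log in this way stops after epoch 8 and keeps the MAE from epoch 6 as the best value, matching the checkpoint that is loaded at the end of the run.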

### Adding additional data

So far, we have only considered the user ID and a sequence of movie IDs to predict the rating; it seems likely that including information about the previous ratings made by the user would improve performance. Thankfully, this is easy to do, and the data is already being returned by our dataset. Let's tweak our architecture to include this:

In [53]:
class BstTransformer(nn.Module):
    def __init__(
        self,
        movies_num_unique,
        users_num_unique,
        sequence_length=10,
        embedding_size=120,
        num_transformer_layers=1,
        ratings_range=(0.5, 5.5),
    ):
        super().__init__()
        self.sequence_length = sequence_length
        self.y_range = ratings_range
        self.movies_embeddings = nn.Embedding(
            movies_num_unique + 1, embedding_size, padding_idx=0
        )
        self.user_embeddings = nn.Embedding(users_num_unique + 1, embedding_size)
        self.ratings_embeddings = nn.Embedding(6, embedding_size, padding_idx=0)
        self.position_embeddings = nn.Embedding(sequence_length, embedding_size)

        self.encoder = nn.TransformerEncoder(
            encoder_layer=nn.TransformerEncoderLayer(
                d_model=embedding_size,
                nhead=12,
                dropout=0.1,
                batch_first=True,
                activation="gelu",
            ),
            num_layers=num_transformer_layers,
        )

        self.linear = nn.Sequential(
            nn.Linear(
                embedding_size + (embedding_size * sequence_length),
                1024,
            ),
            nn.BatchNorm1d(1024),
            nn.Mish(),
            nn.Linear(1024, 512),
            nn.BatchNorm1d(512),
            nn.Mish(),
            nn.Dropout(0.2),
            nn.Linear(512, 256),
            nn.BatchNorm1d(256),
            nn.Mish(),
            nn.Linear(256, 1),
            nn.Sigmoid(),
        )

    def forward(self, inputs):
        features, mask = inputs

        encoded_user_id = self.user_embeddings(features["user_id"])

        user_features = encoded_user_id

        movie_history = features["movie_ids"][:, :-1]
        target_movie = features["movie_ids"][:, -1]

        ratings = self.ratings_embeddings(features["ratings"])

        encoded_movies = self.movies_embeddings(movie_history)
        encoded_target_movie = self.movies_embeddings(target_movie)

        positions = torch.arange(
            0,
            self.sequence_length - 1,
            1,
            dtype=int,
            device=features["movie_ids"].device,
        )
        positions = self.position_embeddings(positions)

        encoded_sequence_movies_with_position_and_rating = (
            encoded_movies + ratings + positions
        )
        encoded_target_movie = encoded_target_movie.unsqueeze(1)

        transformer_features = torch.cat(
            (encoded_sequence_movies_with_position_and_rating, encoded_target_movie),
            dim=1,
        )
        transformer_output = self.encoder(
            transformer_features, src_key_padding_mask=mask
        )
        transformer_output = torch.flatten(transformer_output, start_dim=1)

        combined_output = torch.cat((transformer_output, user_features), dim=1)

        rating = self.linear(combined_output)
        rating = rating.squeeze()
        if self.y_range is None:
            return rating
        else:
            return rating * (self.y_range[1] - self.y_range[0]) + self.y_range[0]


We can see that, to use the ratings data, we have added an additional embedding layer. For each previously rated movie, we then add together the movie embedding, the positional encoding and the rating embedding before feeding this sequence into the transformer. Alternatively, the rating data could be concatenated to, or multiplied with, the movie embedding, but adding them together worked best out of the approaches that I tried.
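
The way these embeddings are combined is easier to follow with small tensors. The sketch below repeats the relevant lines of the forward method on toy inputs; the toy_ names, the vocabulary sizes and the embedding size of 8 are assumptions chosen to keep the shapes readable.

```python
import torch
from torch import nn

toy_sequence_length, toy_embedding_size = 4, 8

toy_movies = nn.Embedding(10, toy_embedding_size, padding_idx=0)
toy_ratings = nn.Embedding(6, toy_embedding_size, padding_idx=0)
toy_positions = nn.Embedding(toy_sequence_length, toy_embedding_size)

# Two sequences of four movies; the last movie in each is the prediction target
toy_movie_ids = torch.tensor([[0, 3, 5, 7], [2, 4, 6, 8]])
# Ratings exist only for the three movies in the history (0 marks padding)
toy_previous_ratings = torch.tensor([[0, 4, 5], [3, 2, 1]])

# History: movie + rating + position, added element-wise -> (2, 3, 8)
toy_history = (
    toy_movies(toy_movie_ids[:, :-1])
    + toy_ratings(toy_previous_ratings)
    + toy_positions(torch.arange(toy_sequence_length - 1))
)

# Target: movie embedding only, with a sequence dimension added -> (2, 1, 8)
toy_target = toy_movies(toy_movie_ids[:, -1]).unsqueeze(1)

toy_transformer_input = torch.cat((toy_history, toy_target), dim=1)
print(toy_transformer_input.shape)
```

Note that a padded position still receives its positional embedding, so it is not a vector of zeros; this is one reason why the attention mask is still passed to the encoder.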

Because train_seq_model looks up BstTransformer by name at the time it is called, and Jupyter keeps the most recent class definition in memory, we don't need to update our training function; the new class will be used when we launch training:

In [71]:
notebook_launcher(train_seq_model, num_processes=2)

Launching a training on 2 GPUs.

Starting training run

Starting epoch 1


100%|██████████| 971/971 [00:44<00:00, 21.60it/s]



train_loss_epoch: 0.9109353272111973


100%|██████████| 6/6 [00:00<00:00, 10.39it/s]



mae: 0.8022098541259766

mse: 0.9802775979042053

eval_loss_epoch: 0.9802776078383127

rmse: 0.9900896918482716

Starting epoch 2


100%|██████████| 971/971 [00:44<00:00, 21.80it/s]



train_loss_epoch: 0.8358323996393614


100%|██████████| 6/6 [00:00<00:00, 10.25it/s]



mae: 0.7573742866516113

mse: 0.9179417490959167

eval_loss_epoch: 0.9179418087005615

rmse: 0.9580927664354412

Improvement of 0.044835567474365234 observed, resetting counter. 
Early stopping counter: 0/2

Starting epoch 3


100%|██████████| 971/971 [00:44<00:00, 21.76it/s]



train_loss_epoch: 0.8017225273482954


100%|██████████| 6/6 [00:00<00:00, 10.69it/s]



mae: 0.7416232228279114

mse: 0.8967887759208679

eval_loss_epoch: 0.8967887858549753

rmse: 0.9469893219677126

Improvement of 0.01575106382369995 observed, resetting counter. 
Early stopping counter: 0/2

Starting epoch 4


100%|██████████| 971/971 [00:44<00:00, 21.73it/s]



train_loss_epoch: 0.7820610726898657


100%|██████████| 6/6 [00:00<00:00, 10.47it/s]



mae: 0.7375993728637695

mse: 0.8765184283256531

eval_loss_epoch: 0.8765184382597605

rmse: 0.9362256289621925

Improvement of 0.004023849964141846 observed, resetting counter. 
Early stopping counter: 0/2

Starting epoch 5


100%|██████████| 971/971 [00:44<00:00, 21.60it/s]



train_loss_epoch: 0.7703093529729715


100%|██████████| 6/6 [00:00<00:00, 10.07it/s]



mae: 0.7289111018180847

mse: 0.8735694885253906

eval_loss_epoch: 0.8735695282618204

rmse: 0.9346493933691877

Improvement of 0.008688271045684814 observed, resetting counter. 
Early stopping counter: 0/2

Starting epoch 6


100%|██████████| 971/971 [00:44<00:00, 21.65it/s]



train_loss_epoch: 0.7511685453777333


100%|██████████| 6/6 [00:00<00:00,  9.98it/s]



mae: 0.7231311798095703

mse: 0.8583566546440125

eval_loss_epoch: 0.8583566149075826

rmse: 0.926475393436875

Improvement of 0.005779922008514404 observed, resetting counter. 
Early stopping counter: 0/2

Starting epoch 7


100%|██████████| 971/971 [00:44<00:00, 21.72it/s]



train_loss_epoch: 0.7281422661089872


100%|██████████| 6/6 [00:00<00:00, 10.58it/s]



mae: 0.7262148261070251

mse: 0.8491864204406738

eval_loss_epoch: 0.849186360836029

rmse: 0.9215131146330332
No improvement above threshold observed, incrementing counter. 
Early stopping counter: 1/2

Starting epoch 8


100%|██████████| 971/971 [00:44<00:00, 21.64it/s]



train_loss_epoch: 0.709694542980587


100%|██████████| 6/6 [00:00<00:00, 10.26it/s]



mae: 0.7182666659355164

mse: 0.8506280779838562

eval_loss_epoch: 0.8506280283133189

rmse: 0.9222950059410797

Improvement of 0.004864513874053955 observed, resetting counter. 
Early stopping counter: 0/2

Starting epoch 9


100%|██████████| 971/971 [00:44<00:00, 21.70it/s]



train_loss_epoch: 0.6928928330556003


100%|██████████| 6/6 [00:00<00:00, 10.42it/s]



mae: 0.7204784750938416

mse: 0.8569676876068115

eval_loss_epoch: 0.8569677571455637

rmse: 0.9257254925769364
No improvement above threshold observed, incrementing counter. 
Early stopping counter: 1/2

Starting epoch 10


100%|██████████| 971/971 [00:44<00:00, 21.75it/s]



train_loss_epoch: 0.6806765550442508


100%|██████████| 6/6 [00:00<00:00, 10.45it/s]



mae: 0.7206871509552002

mse: 0.8620250225067139

eval_loss_epoch: 0.8620249728361765

rmse: 0.9284530265483084
No improvement above threshold observed, incrementing counter. 
Early stopping counter: 2/2
Stopping training due to no improvement after 2 epochs
Finishing training run
Loading checkpoint with mae: 0.7182666659355164


We can see that incorporating the ratings data has improved our results slightly: the best validation MAE drops from roughly 0.726 to roughly 0.718!

### Adding user features

In addition to the ratings data, we also have more information about the users that we could add into the model. To remind ourselves, let's take a look at the users table:

In [54]:
users

Unnamed: 0,user_id,sex,age_group,occupation,zip_code
0,1,F,1,10,48067
1,2,M,56,16,70072
2,3,M,25,15,55117
3,4,M,45,7,02460
4,5,M,25,20,55455
...,...,...,...,...,...
6035,6036,F,25,15,32603
6036,6037,F,45,1,76006
6037,6038,F,56,1,14706
6038,6039,F,45,0,01060


Let's try adding in the categorical variables representing the users' sex, age groups, and occupation to the model, and see if we see any improvement. While occupation looks like it is already sequentially numerically encoded, we must do the same for the sex and age_group columns. We can use the 'LabelEncoder' class from scikit-learn to do this for us, and append the encoded columns to the DataFrame:

In [55]:
from sklearn.preprocessing import LabelEncoder

In [56]:
le = LabelEncoder()

In [57]:
users['sex_encoded'] = le.fit_transform(users.sex)

In [58]:
users['age_group_encoded'] = le.fit_transform(users.age_group)
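
As a small illustration of what the encoder produces, the sketch below applies LabelEncoder to a toy DataFrame built from a few of the values shown in the users table. It uses one encoder per column, which is a slight variation on the cells above: reusing a single encoder works for producing the encoded columns, but after the second call to fit_transform it only remembers the classes of the last column.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy_users = pd.DataFrame({"sex": ["F", "M", "M", "M"], "age_group": [1, 56, 25, 45]})

# One encoder per column, so that each fitted mapping stays available
sex_encoder = LabelEncoder()
age_group_encoder = LabelEncoder()

toy_users["sex_encoded"] = sex_encoder.fit_transform(toy_users["sex"])
toy_users["age_group_encoded"] = age_group_encoder.fit_transform(toy_users["age_group"])

print(toy_users)
# Classes are sorted, and each value is replaced by its position in that order
print(list(age_group_encoder.classes_))
```

Keeping the fitted encoders around makes it possible to map encoded values back with inverse_transform, for example when inspecting predictions.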

In [59]:
users["user_id"] = users["user_id"].astype(str)

Now that we have all the features that we are going to use encoded, let's join the user features to our sequences DataFrame, and update our training and validation sets.

In [60]:
seq_with_user_features = pd.merge(seq_df, users, on="user_id")
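
The cast of user_id to a string in the previous cell matters for this join, because the key columns on both sides need to share a type. The following sketch shows the same pattern on toy data; the toy frames are made up, and the how and validate arguments are optional extras added here as a safety check rather than something the cell above uses.

```python
import pandas as pd

toy_seq = pd.DataFrame({"user_id": ["1", "1", "2"], "target_rating": [4, 5, 3]})
toy_users = pd.DataFrame({"user_id": [1, 2], "sex_encoded": [0, 1]})

# The key columns must share a dtype, so cast the integer IDs to strings first
toy_users["user_id"] = toy_users["user_id"].astype(str)

# validate raises an error if a user appears more than once in toy_users
toy_merged = pd.merge(
    toy_seq, toy_users, on="user_id", how="left", validate="many_to_one"
)
print(toy_merged)
```

Each sequence row keeps its position and gains the features of its user, so the number of rows is unchanged by the merge.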

In [61]:
train_df = seq_with_user_features[seq_with_user_features.is_valid == False]
valid_df = seq_with_user_features[seq_with_user_features.is_valid == True]

Let's update our dataset to include these features.

In [62]:
class MovieSequenceDataset(Dataset):
    def __init__(self, df, movie_lookup, user_lookup):
        super().__init__()
        self.df = df
        self.movie_lookup = movie_lookup
        self.user_lookup = user_lookup

    def __len__(self):
        return len(self.df)

    def __getitem__(self, index):
        data = self.df.iloc[index]
        user_id = self.user_lookup[str(data.user_id)]
        movie_ids = torch.tensor([self.movie_lookup[title] for title in data.title])

        previous_ratings = torch.tensor(
            [rating if rating != "[PAD]" else 0 for rating in data.previous_ratings]
        )

        attention_mask = torch.tensor(data.pad_mask)
        target_rating = data.target_rating
        encoded_features = {
            "user_id": user_id,
            "movie_ids": movie_ids,
            "ratings": previous_ratings,
            "age_group": data["age_group_encoded"],
            "sex": data["sex_encoded"],
            "occupation": data["occupation"],
        }

        return (encoded_features, attention_mask), torch.tensor(
            target_rating, dtype=torch.float32
        )


In [63]:
train_dataset = MovieSequenceDataset(train_df, movie_lookup, user_lookup)
valid_dataset = MovieSequenceDataset(valid_df, movie_lookup, user_lookup)

We can now modify our architecture to include embeddings for these features and concatenate these embeddings to the output of the transformer; then we pass this into the feed-forward network.
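
Before looking at the full class, a quick shape check shows where the input size of the first linear layer comes from. Random tensors stand in for the real embeddings, using the sizes that appear in the class below (4, 11 and 2 for the age group, occupation and sex embeddings); the order of concatenation here is an assumption for the sketch.

```python
import torch

toy_batch_size, toy_sequence_length, toy_embedding_size = 2, 10, 120

# Stand-ins for the tensors produced inside the forward method
toy_transformer_output = torch.randn(
    toy_batch_size, toy_sequence_length, toy_embedding_size
)
toy_user_id = torch.randn(toy_batch_size, toy_embedding_size)
toy_age_group = torch.randn(toy_batch_size, 4)
toy_occupation = torch.randn(toy_batch_size, 11)
toy_sex = torch.randn(toy_batch_size, 2)

# Flatten the sequence dimension: (2, 10, 120) -> (2, 1200)
toy_flattened = torch.flatten(toy_transformer_output, start_dim=1)

toy_combined = torch.cat(
    (toy_flattened, toy_user_id, toy_age_group, toy_occupation, toy_sex), dim=1
)
print(toy_combined.shape)
```

The final dimension is 1200 + 120 + 4 + 11 + 2 = 1337, which matches the expression used for the first linear layer.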

In [64]:
class BstTransformer(nn.Module):
    def __init__(
        self,
        movies_num_unique,
        users_num_unique,
        sequence_length=10,
        embedding_size=120,
        num_transformer_layers=1,
        ratings_range=(0.5, 5.5),
    ):
        super().__init__()
        self.sequence_length = sequence_length
        self.y_range = ratings_range
        self.movies_embeddings = nn.Embedding(
            movies_num_unique + 1, embedding_size, padding_idx=0
        )
        self.user_embeddings = nn.Embedding(users_num_unique + 1, embedding_size)
        self.ratings_embeddings = nn.Embedding(6, embedding_size, padding_idx=0)
        self.position_embeddings = nn.Embedding(sequence_length, embedding_size)

        self.sex_embeddings = nn.Embedding(
            3,
            2,
        )
        self.occupation_embeddings = nn.Embedding(
            22,
            11,
        )
        self.age_group_embeddings = nn.Embedding(
            8,
            4,
        )

        self.encoder = nn.TransformerEncoder(
            encoder_layer=nn.TransformerEncoderLayer(
                d_model=embedding_size,
                nhead=12,
                dropout=0.1,
                batch_first=True,
                activation="gelu",
            ),
            num_layers=num_transformer_layers,
        )

        self.linear = nn.Sequential(
            nn.Linear(
                embedding_size + (embedding_size * sequence_length) + 4 + 11 + 2,
                1024,
            ),
            nn.BatchNorm1d(1024),
            nn.Mish(),
            nn.Linear(1024, 512),
            nn.BatchNorm1d(512),
            nn.Mish(),
            nn.Dropout(0.2),
            nn.Linear(512, 256),
            nn.BatchNorm1d(256),
            nn.Mish(),
            nn.Linear(256, 1),
            nn.Sigmoid(),
        )

    def forward(self, inputs):
        features, mask = inputs

        user_id = self.user_embeddings(features["user_id"])

        age_group = self.age_group_embeddings(features["age_group"])
        sex = self.sex_embeddings(features["sex"])
        occupation = self.occupation_embeddings(features["occupation"])

        user_features = torch.cat(
            (user_id, sex, age_group, occupation), 1
        )

        movie_history = features["movie_ids"][:, :-1]
        target_movie = features["movie_ids"][:, -1]

        ratings = self.ratings_embeddings(features["ratings"])

        encoded_movies = self.movies_embeddings(movie_history)
        encoded_target_movie = self.movies_embeddings(target_movie)

        positions = torch.arange(
            0,
            self.sequence_length - 1,
            1,
            dtype=torch.long,
            device=features["movie_ids"].device,
        )
        positions = self.position_embeddings(positions)

        encoded_sequence_movies_with_position_and_rating = (
            encoded_movies + ratings + positions
        )
        encoded_target_movie = encoded_target_movie.unsqueeze(1)

        transformer_features = torch.cat(
            (encoded_sequence_movies_with_position_and_rating, encoded_target_movie),
            dim=1,
        )
        transformer_output = self.encoder(
            transformer_features, src_key_padding_mask=mask
        )
        transformer_output = torch.flatten(transformer_output, start_dim=1)

        combined_output = torch.cat((transformer_output, user_features), dim=1)

        rating = self.linear(combined_output)
        rating = rating.squeeze()
        if self.y_range is None:
            return rating
        else:
            return rating * (self.y_range[1] - self.y_range[0]) + self.y_range[0]
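The input size of the first linear layer, `embedding_size + (embedding_size * sequence_length) + 4 + 11 + 2`, follows directly from the tensors that are concatenated in `forward`. As a rough check, the sketch below reproduces just the shapes, using zero tensors in place of real embeddings and an arbitrary batch size of 4.

```python
import torch

batch_size, sequence_length, embedding_size = 4, 10, 120

# Transformer output: one embedding per sequence position, flattened per sample
transformer_output = torch.zeros(batch_size, sequence_length, embedding_size)
flattened = torch.flatten(transformer_output, start_dim=1)  # (4, 1200)

# User features: user embedding plus the three smaller categorical embeddings
user_id = torch.zeros(batch_size, embedding_size)
sex = torch.zeros(batch_size, 2)
age_group = torch.zeros(batch_size, 4)
occupation = torch.zeros(batch_size, 11)
user_features = torch.cat((user_id, sex, age_group, occupation), dim=1)  # (4, 137)

combined = torch.cat((flattened, user_features), dim=1)

expected_in_features = embedding_size + embedding_size * sequence_length + 4 + 11 + 2
print(combined.shape, expected_in_features)
```

If an embedding size or the sequence length is changed, the first `nn.Linear` has to change with it, so deriving this number from the embedding dimensions may be less error-prone than hard-coding `4 + 11 + 2`.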


In [68]:
notebook_launcher(train_seq_model, num_processes=2)

Launching a training on 2 GPUs.

Starting training run

Starting epoch 1


100%|██████████| 971/971 [00:46<00:00, 20.93it/s]



train_loss_epoch: 0.9115137239317692


100%|██████████| 6/6 [00:00<00:00,  9.88it/s]



mae: 0.7698847651481628

eval_loss_epoch: 0.9531584481398264

rmse: 0.9762983956839264

mse: 0.9531585574150085

Starting epoch 2


100%|██████████| 971/971 [00:46<00:00, 21.11it/s]



train_loss_epoch: 0.8351250770285986


100%|██████████| 6/6 [00:00<00:00, 10.07it/s]



mae: 0.7515485882759094

eval_loss_epoch: 0.9225256244341532

rmse: 0.9604819543842161

mse: 0.9225255846977234

Improvement of 0.018336176872253418 observed, resetting counter. 
Early stopping counter: 0/2

Starting epoch 3


100%|██████████| 971/971 [00:45<00:00, 21.14it/s]



train_loss_epoch: 0.804713054002866


100%|██████████| 6/6 [00:00<00:00,  9.38it/s]



mae: 0.743607223033905

eval_loss_epoch: 0.8977507948875427

rmse: 0.9474971527620479

mse: 0.8977508544921875

Improvement of 0.007941365242004395 observed, resetting counter. 
Early stopping counter: 0/2

Starting epoch 4


100%|██████████| 971/971 [00:45<00:00, 21.14it/s]



train_loss_epoch: 0.7829881388421653


100%|██████████| 6/6 [00:00<00:00,  9.94it/s]



mae: 0.7408876419067383

eval_loss_epoch: 0.8879891335964203

rmse: 0.9423317060300227

mse: 0.8879890441894531

Improvement of 0.002719581127166748 observed, resetting counter. 
Early stopping counter: 0/2

Starting epoch 5


100%|██████████| 971/971 [00:45<00:00, 21.21it/s]



train_loss_epoch: 0.7723518149000732


100%|██████████| 6/6 [00:00<00:00,  9.62it/s]



mae: 0.7306551337242126

eval_loss_epoch: 0.8741245567798615

rmse: 0.9349462695671549

mse: 0.8741245269775391

Improvement of 0.010232508182525635 observed, resetting counter. 
Early stopping counter: 0/2

Starting epoch 6


100%|██████████| 971/971 [00:45<00:00, 21.19it/s]



train_loss_epoch: 0.7589085090418186


100%|██████████| 6/6 [00:00<00:00,  9.82it/s]



mae: 0.7242081761360168

eval_loss_epoch: 0.87059153119723

rmse: 0.9330549081691162

mse: 0.8705914616584778

Improvement of 0.006446957588195801 observed, resetting counter. 
Early stopping counter: 0/2

Starting epoch 7


100%|██████████| 971/971 [00:45<00:00, 21.14it/s]



train_loss_epoch: 0.7346186338934422


100%|██████████| 6/6 [00:00<00:00,  8.77it/s]



mae: 0.7160568833351135

eval_loss_epoch: 0.8519508838653564

rmse: 0.9230118546721686

mse: 0.8519508838653564

Improvement of 0.00815129280090332 observed, resetting counter. 
Early stopping counter: 0/2

Starting epoch 8


100%|██████████| 971/971 [00:45<00:00, 21.14it/s]



train_loss_epoch: 0.7128203637690794


100%|██████████| 6/6 [00:00<00:00,  9.90it/s]



mae: 0.7230656743049622

eval_loss_epoch: 0.8604253133138021

rmse: 0.9275911240657637

mse: 0.8604252934455872
No improvement above threshold observed, incrementing counter. 
Early stopping counter: 1/2

Starting epoch 9


100%|██████████| 971/971 [00:46<00:00, 21.07it/s]



train_loss_epoch: 0.6947063981074875


100%|██████████| 6/6 [00:00<00:00,  9.97it/s]



mae: 0.723215639591217

eval_loss_epoch: 0.8628555238246918

rmse: 0.928900152881013

mse: 0.8628554940223694
No improvement above threshold observed, incrementing counter. 
Early stopping counter: 2/2
Stopping training due to no improvement after 2 epochs
Finishing training run
Loading checkpoint with mae: 0.7160568833351135
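The log above shows early stopping tracking the validation MAE: the reported improvements match the epoch-to-epoch drops in `mae`, and the counter runs up to 2. The sketch below replays that logic on the logged values. It is an illustration only, not pytorch-accelerated's implementation; the function name `run_early_stopping` and the `min_delta=0.0` threshold are assumptions, since the actual threshold is not shown.

```python
# Validation MAE per epoch, copied from the training log
val_mae = [
    0.7698847651481628,
    0.7515485882759094,
    0.743607223033905,
    0.7408876419067383,
    0.7306551337242126,
    0.7242081761360168,
    0.7160568833351135,
    0.7230656743049622,
    0.723215639591217,
]


def run_early_stopping(metrics, patience=2, min_delta=0.0):
    """Return (epoch training stopped at, best epoch, best metric value)."""
    best_value, best_epoch, counter = float("inf"), None, 0
    for epoch, value in enumerate(metrics, start=1):
        if best_value - value > min_delta:
            # Improvement above the threshold: remember it and reset the counter
            best_value, best_epoch, counter = value, epoch, 0
        else:
            counter += 1
            if counter >= patience:
                return epoch, best_epoch, best_value
    return len(metrics), best_epoch, best_value


stopped_at, best_epoch, best_mae = run_early_stopping(val_mae)
print(stopped_at, best_epoch, best_mae)
```

Under these assumptions, training stops after epoch 9 and the best value comes from epoch 7, which is consistent with the checkpoint loaded at the end of the run.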


Compared with the previous model, the MAE has decreased slightly while the MSE and RMSE have increased slightly, so it looks like these features made a negligible difference to the overall performance.

In writing this article, my main objective has been to illustrate how these approaches can be used, so I've picked the hyperparameters somewhat arbitrarily; with some hyperparameter tweaks and different combinations of features, these metrics can probably be improved upon!

Hopefully this has provided a good introduction to using both matrix factorization and transformer-based approaches in PyTorch, and how pytorch-accelerated can speed up our process when experimenting with different models!