# Movie recommendations

In this notebook, we will take a stab at the task of recommending movies to potential viewers.

Specifically, we will start from information about how viewers rated movies they _have_ seen, and predict how they would rate movies that they _have not_ seen, based on how other people with similar movie tastes rated those other movies. This kind of task is called _collaborative filtering_.

## Prelude

The only new imports are `pandas` and `csv`. Pandas is a very useful library for working with tables (called _dataframes_). While it can sometimes be counter-intuitive, it's very powerful and well worth getting to grips with.

Here, we only really use Pandas to read in the dataset CSV files, and to do some elementary preprocessing.

In [None]:
import torch, torchtext, numpy as np
import pandas as pd, csv
from torch import nn, optim
from tqdm.auto import tqdm
import matplotlib.pyplot as plt
import pdb
torch.manual_seed(291)
np.random.seed(291)

## Data

We're using the small MovieLens dataset (`ml-latest-small`), which has about 100,000 ratings. MovieLens also has much larger movie rating datasets, and our model does even better on those, but training takes a bit more time than we have during class.

In [None]:
!wget http://files.grouplens.org/datasets/movielens/ml-latest-small.zip
!unzip ml-latest-small.zip

There are several CSV files. To warm up, let's look at the `movies.csv` table first.

In [None]:
df_movies = pd.read_csv('ml-latest-small/movies.csv')

In [None]:
df_movies

Okay, so each movie has an ID (`movieId`) and other information. Because our focus is on getting recommendations based only on how other users review each movie, we will actually ignore the title, year, and genre this time.

What appears to be the first, unlabelled column is actually just the row index, which Pandas keeps explicitly as part of the dataframe structure.

Next, let's look at what we are _actually_ interested in, namely the movie ratings.

In [None]:
df = pd.read_csv('ml-latest-small/ratings.csv')

In [None]:
df

We have what we need: the user ID, movie ID, and the rating. First, let's look at what the ratings look like. The `df['column']` syntax just selects the column, and `.unique()` collects the unique elements.

In [None]:
df['rating'].unique()

So actually the ratings go up to 5, but they are at 0.5 intervals, so there are 10 of them.

But there is something suspicious here. There are only about 100,000 rows, yet movie IDs as large as 170,875 appear in the table. Even if every row referred to a different movie, there would not be enough rows to use every ID up to that value (a pigeonhole-style argument). This means that the `movieId` dimension is sparse, but it would be most convenient for us to work with dense embedding tensors.

Let's verify how many unique movies we have.

In [None]:
len(df['movieId'].unique())

Not even 10,000. We will definitely have to renumber them when we create our dataset.
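
To see why raw IDs are awkward for dense embeddings, here is a minimal sketch on a made-up ratings table (the three rows are invented for illustration, not taken from `ratings.csv`). An embedding table indexed directly by raw `movieId` would need one row per possible ID, and most of those rows would never be used.

```python
import pandas as pd

# Toy ratings table; the values are invented for illustration.
toy = pd.DataFrame({
    'userId':  [1, 1, 2],
    'movieId': [1, 3, 170875],
    'rating':  [4.0, 3.5, 5.0],
})

n_unique = toy['movieId'].nunique()        # movies that actually occur
n_rows_needed = toy['movieId'].max() + 1   # rows needed if indexed by raw ID

print(n_unique, n_rows_needed)
```

The gap between the two numbers is the wasted space that renumbering removes.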

We will also need to convert these dataframes to PyTorch tensors, so we can feed them into our model during training and testing. To do this, we need to first retrieve the actual values in the table using `.values` (this is a NumPy array), and then call `LongTensor()` on that to create a tensor of integers.

The `[[ ... ]]` syntax is how one selects multiple columns in Pandas.

In [None]:
torch.LongTensor(df[['userId', 'movieId']].values)
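
As a quick check of what this conversion produces, the sketch below runs the same two steps on a made-up two-row table (invented values) and inspects the intermediate NumPy array and the resulting tensor.

```python
import pandas as pd
import torch

# Toy table; the values are invented for illustration.
toy = pd.DataFrame({'userId': [1, 2], 'movieId': [10, 20], 'rating': [4.0, 3.5]})

arr = toy[['userId', 'movieId']].values   # NumPy array, one row per rating
coords = torch.LongTensor(arr)            # integer tensor with the same layout

print(type(arr).__name__, coords.shape, coords.dtype)
```

Each row of `coords` is one (user, movie) pair, which is the layout the model will index into later.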

Time to build the dataset class as usual.

## Dataset class

This is actually simpler and more straightforward than some datasets we've built before.

The only new thing is that we renumber movie and user IDs so that they start from 0 and are contiguous. `u2n` and `m2n` are dictionary comprehensions — just like the list comprehensions we've seen before but these build a lookup table. Finally, `lambda` creates an unnamed function in place: for example, `lambda x: x+x` is a function that doubles its argument.

In [None]:
class MovieDataset(torch.utils.data.Dataset):
    def __init__(self, fn):
        df = pd.read_csv(fn)
        u2n = { u: n for n, u in enumerate(df['userId'].unique()) }
        m2n = { m: n for n, m in enumerate(df['movieId'].unique()) }
        df['userId'] = df['userId'].apply(lambda u: u2n[u])
        df['movieId'] = df['movieId'].apply(lambda m: m2n[m])
        self.coords = torch.LongTensor(df[['userId','movieId']].values)
        self.ratings = torch.FloatTensor(df['rating'].values)
        self.n_users = df['userId'].nunique()
        self.n_movies = df['movieId'].nunique()

    def __len__(self):
        return len(self.coords)

    def __getitem__(self, i):
        return (self.coords[i], self.ratings[i])
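
The renumbering trick from `__init__` can be tried in isolation. This is a small sketch on a made-up Series of sparse IDs; the names `raw_ids`, `id2n`, and `dense_ids` are only for illustration.

```python
import pandas as pd

raw_ids = pd.Series([50, 7, 50, 1200, 7])   # made-up sparse IDs

# Map each distinct ID to its position in order of first appearance.
id2n = {raw: n for n, raw in enumerate(raw_ids.unique())}

# Apply the lookup table to every element.
dense_ids = raw_ids.apply(lambda raw: id2n[raw])

print(dense_ids.tolist())   # [0, 1, 0, 2, 1]
```

The new IDs start at 0 and have no gaps, so they can index an embedding table directly.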

Splitting the dataset is also exactly the same as we've seen before.

In [None]:
ds_full = MovieDataset('ml-latest-small/ratings.csv')
n_train = int(0.8 * len(ds_full))
n_test = len(ds_full) - n_train
rng = torch.Generator().manual_seed(291)
ds_train, ds_test = torch.utils.data.random_split(ds_full, [n_train, n_test], rng)
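
To illustrate what the seeded generator buys us, here is a minimal sketch on a made-up ten-element dataset. It passes the generator by keyword (`generator=rng`), which should be equivalent to the positional form used above. The helper `split` is hypothetical.

```python
import torch
from torch.utils.data import TensorDataset, random_split

toy = TensorDataset(torch.arange(10))   # ten made-up samples
n_train = int(0.8 * len(toy))
n_test = len(toy) - n_train

def split(seed):
    # A fresh generator with a fixed seed gives a repeatable permutation.
    rng = torch.Generator().manual_seed(seed)
    return random_split(toy, [n_train, n_test], generator=rng)

train_a, test_a = split(291)
train_b, test_b = split(291)

print(len(train_a), len(test_a))
print(train_a.indices == train_b.indices)   # same seed, same split
```

With a fixed seed, rerunning the notebook evaluates on the same held-out ratings, which makes results comparable between runs.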

## Recommender model

Now that we have the dataset, we can build our model. Recall that our plan is to create embeddings from both users and movies into the same embedding space, and estimate how well a user and a movie match by taking the dot product of their embeddings.

In [None]:
class MovieRecs(nn.Module):
    def __init__(self, n_users, n_movies, emb_dim):
        super(MovieRecs, self).__init__()
        self.user_emb = nn.Embedding(n_users, emb_dim)
        self.movie_emb = nn.Embedding(n_movies, emb_dim)
        nn.init.xavier_uniform_(self.user_emb.weight)
        nn.init.xavier_uniform_(self.movie_emb.weight)
    
    def forward(self, samples):
        users = self.user_emb(samples[:,0])
        movies = self.movie_emb(samples[:,1])
        return (users * movies).sum(1)
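
The last line of `forward` computes one dot product per row of the batch. Here is a small sketch with hand-written stand-ins for the embedding lookups (the numbers are made up):

```python
import torch

# Two made-up (user, movie) pairs with 3-dimensional embeddings.
users = torch.tensor([[1.0, 0.0, 2.0],
                      [0.5, -1.0, 0.0]])
movies = torch.tensor([[2.0, 5.0, 1.0],
                       [2.0, 1.0, 4.0]])

# Elementwise product, then sum over the embedding axis.
scores = (users * movies).sum(1)
print(scores)   # tensor([4., 0.])
```

Each output entry is the predicted rating for one pair, so a batch of shape `(batch, 2)` going in yields a tensor of shape `(batch,)` coming out.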

Almost nothing new in the train and test code. The only difference is that `sched.step()` is called _inside_ the iteration loop, rather than outside — this is because we are using the single-cycle learning rate scheduler, which makes smooth adjustments after every minibatch rather than after every epoch.

And the dataset is small enough that we don't even need a GPU.

In [None]:
device = torch.device('cpu')

def run_test(model, ldr, crit):
    total_loss, total_count = 0, 0
    model.eval()
    tq_iters = tqdm(ldr, leave=False, desc='test iter')
    with torch.no_grad():
        for coords, labels in tq_iters:
            coords, labels = coords.to(device), labels.to(device)
            preds = model(coords)
            loss = crit(preds, labels)
            total_loss += loss.item() * labels.size(0)
            total_count += labels.size(0)
            tq_iters.set_postfix({'loss': total_loss/total_count}, refresh=True)
    return total_loss / total_count

def run_train(model, ldr, crit, opt, sched):
    model.train()
    total_loss, total_count = 0, 0
    tq_iters = tqdm(ldr, leave=False, desc='train iter')
    for (coords, labels) in tq_iters:
        opt.zero_grad()
        coords, labels = coords.to(device), labels.to(device)
        preds = model(coords)
        loss = crit(preds, labels)
        loss.backward()
        opt.step()
        sched.step()
        total_loss += loss.item() * labels.size(0)
        total_count += labels.size(0)
        tq_iters.set_postfix({'loss': total_loss/total_count}, refresh=True)
    return total_loss / total_count

def run_all(model, ldr_train, ldr_test, crit, opt, sched, n_epochs=10):
    best_loss = np.inf
    tq_epochs = tqdm(range(n_epochs), desc='epochs', unit='ep')
    for epoch in tq_epochs:
        train_loss = run_train(model, ldr_train, crit, opt, sched)
        test_loss = run_test(model, ldr_test, crit)
        tqdm.write(f'epoch {epoch}   train loss {train_loss:.6f}    test loss {test_loss:.6f}')
        if test_loss < best_loss:
            best_loss = test_loss
            tq_epochs.set_postfix({'bE': epoch, 'bL': best_loss}, refresh=True)
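
The per-minibatch `sched.step()` call in `run_train` can be looked at in isolation. The sketch below records the learning rate at every step, with a single dummy parameter standing in for the model; the step counts are illustrative, and the schedule shape relies on `OneCycleLR`'s default settings.

```python
import torch
from torch import nn, optim

# A single dummy parameter stands in for the model.
param = nn.Parameter(torch.zeros(1))
opt = optim.SGD([param], lr=1e-6, momentum=0.9)

n_epochs, steps_per_epoch = 2, 50   # illustrative sizes
sched = optim.lr_scheduler.OneCycleLR(
    opt, max_lr=0.1, steps_per_epoch=steps_per_epoch, epochs=n_epochs)

lrs = []
for _ in range(n_epochs * steps_per_epoch):
    lrs.append(opt.param_groups[0]['lr'])
    opt.step()     # no gradients here, so the parameter is left unchanged
    sched.step()   # one scheduler step per minibatch

peak = max(range(len(lrs)), key=lrs.__getitem__)
print(f'start {lrs[0]:.4f}  peak {lrs[peak]:.4f} at step {peak}  end {lrs[-1]:.7f}')
```

The rate climbs to `max_lr` and then decays over the remaining steps. Note that the `lr` passed to the optimizer is overwritten by the scheduler, and that the scheduler expects exactly `steps_per_epoch * epochs` calls in total.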

Again, only two things new here.

First, we are using mean squared error (MSE) as the loss function. This is only because results on this dataset are normally reported using RMSE (R = root), which PyTorch does not have. We could, of course, easily write an RMSE loss — but minimizing MSE will also minimize the RMSE (because sqrt is strictly increasing), so MSE works just as well.

Second, we are using `OneCycleLR()`, mostly because I happened to try that first and it worked well. Often this gives very good results more quickly than other LR schedules, so it's almost always worth trying. This means that we will need to make a `sched.step()` adjustment as described above.
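
For reference, converting between the two metrics is a one-liner. A minimal sketch with made-up predictions and labels:

```python
import torch
from torch import nn

preds = torch.tensor([3.0, 4.5, 2.0, 5.0])    # made-up predictions
labels = torch.tensor([3.5, 4.0, 1.0, 5.0])   # made-up true ratings

mse = nn.MSELoss()(preds, labels)   # mean of the squared errors
rmse = torch.sqrt(mse)              # same ordering of models, different scale

print(mse.item(), rmse.item())
```

Since `run_test` returns the MSE averaged over all test samples, taking `test_loss ** 0.5` gives an RMSE that can be compared against reported results.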


In [None]:
model = MovieRecs(ds_full.n_users, ds_full.n_movies, 20)
model.to(device)

ldr_train = torch.utils.data.DataLoader(ds_train, batch_size=32, shuffle=True)
ldr_test = torch.utils.data.DataLoader(ds_test, batch_size=32)

n_epochs = 5

crit = nn.MSELoss().to(device)
opt = optim.SGD(model.parameters(), lr=1e-6, momentum=0.9)
sched = optim.lr_scheduler.OneCycleLR(opt, max_lr=0.1, steps_per_epoch=len(ldr_train), epochs=n_epochs)

run_all(model, ldr_train, ldr_test, crit, opt, sched, n_epochs)

This is already pretty good — try for yourself to evaluate a random, untrained model and compare — but it turns out that we can do even better by making the model slightly more complicated.
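
As a rough idea of what to expect from an untrained model, here is a sketch under stated assumptions: the table sizes only approximate this dataset, and the (user, movie) pairs and ratings are randomly generated rather than read from `ratings.csv`. With Xavier initialization the dot products start out close to zero, so the untrained MSE is roughly the mean of the squared ratings.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Illustrative sizes, roughly the scale of this dataset (assumed).
n_users, n_movies, emb_dim = 600, 9000, 20

user_emb = nn.Embedding(n_users, emb_dim)
movie_emb = nn.Embedding(n_movies, emb_dim)
nn.init.xavier_uniform_(user_emb.weight)
nn.init.xavier_uniform_(movie_emb.weight)

# Made-up (user, movie) pairs and ratings on the 0.5 ... 5.0 scale.
n = 1000
users = torch.randint(0, n_users, (n,))
movies = torch.randint(0, n_movies, (n,))
ratings = torch.randint(1, 11, (n,)).float() / 2

with torch.no_grad():
    preds = (user_emb(users) * movie_emb(movies)).sum(1)
    untrained_mse = nn.functional.mse_loss(preds, ratings).item()

mean_sq_rating = (ratings ** 2).mean().item()
print(preds.abs().max().item(), untrained_mse, mean_sq_rating)
```

For the real comparison, construct a fresh `MovieRecs` and pass it to `run_test` with `ldr_test` and `crit` before any training.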

As we wrote it, the model computes ratings from (user, movie) pairs. But this makes it difficult to account for people who always write bad (or good) reviews, and for movies that are universally considered terrible or amazing. (Or even so terrible that one finds oneself transfixed.)

Actually, the model _could_ learn about grumpy reviewers, but it would have to learn this _separately_ for every movie. We can make learning this much easier by adding bias. Let's try.

## Recommender model with bias

Let's think about what bias means in our case. We want it to learn one value (i.e., the bias offset) separately for every reviewer, and another set of values for the movies. This means that our bias is also an embedding: we index it using the movie (or user) ID, and we get back a single number.

The only fly in the ointment is that this gives us a rank-two tensor of shape `(batch, 1)`, whose last dimension has size one, but we can `squeeze()` that extra encapsulation away.

In [None]:
class MovieRecs(nn.Module):
    def __init__(self, n_users, n_movies, emb_dim):
        super(MovieRecs, self).__init__()
        self.user_emb = nn.Embedding(n_users, emb_dim)
        self.user_bias = nn.Embedding(n_users, 1)
        self.movie_emb = nn.Embedding(n_movies, emb_dim)
        self.movie_bias = nn.Embedding(n_movies, 1)
        nn.init.xavier_uniform_(self.user_emb.weight)
        nn.init.xavier_uniform_(self.movie_emb.weight)
        nn.init.zeros_(self.user_bias.weight)
        nn.init.zeros_(self.movie_bias.weight)
    
    def forward(self, samples):
        users = self.user_emb(samples[:,0])
        movies = self.movie_emb(samples[:,1])
        dot = (users * movies).sum(1)
        # Each bias lookup has shape (batch, 1); squeeze(1) drops only that last axis.
        user_b = self.user_bias(samples[:,0]).squeeze(1)
        movie_b = self.movie_bias(samples[:,1]).squeeze(1)
        return dot + user_b + movie_b
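
The shapes involved in the bias lookup can be checked on a small made-up embedding. The sketch also shows how `squeeze()` without an argument behaves on a batch of one, which is why naming the axis explicitly can be the safer habit.

```python
import torch
from torch import nn

bias = nn.Embedding(5, 1)            # one scalar bias for each of 5 made-up IDs
ids = torch.LongTensor([0, 2, 4])    # a batch of three IDs

looked_up = bias(ids)
print(looked_up.shape)               # torch.Size([3, 1])
print(looked_up.squeeze(1).shape)    # torch.Size([3])

# With a batch of one, a bare squeeze() also removes the batch axis.
single = bias(torch.LongTensor([3]))
print(single.squeeze().shape)        # torch.Size([])
print(single.squeeze(1).shape)       # torch.Size([1])
```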

In [None]:
model = MovieRecs(ds_full.n_users, ds_full.n_movies, 20)
model.to(device)

ldr_train = torch.utils.data.DataLoader(ds_train, batch_size=32, shuffle=True)
ldr_test = torch.utils.data.DataLoader(ds_test, batch_size=32)

n_epochs = 5

crit = nn.MSELoss().to(device)
opt = optim.SGD(model.parameters(), lr=1e-6, momentum=0.9)
sched = optim.lr_scheduler.OneCycleLR(opt, max_lr=0.1, steps_per_epoch=len(ldr_train), epochs=n_epochs)

run_all(model, ldr_train, ldr_test, crit, opt, sched, n_epochs)

This converged much faster for me than the previous version and gave me better results.

There are a few other simple things I tried, like clamping the predicted rating range in various ways (e.g., sigmoid), regularization, messing with the embedding dimension, and so on. Some of them help a little bit — but they're easy enough to try for yourself.
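
As one possible starting point for the sigmoid idea, here is a minimal sketch of a scaled sigmoid. The helper name `scaled_sigmoid` and the bounds `lo=0.5`, `hi=5.0` are assumptions based on the rating values seen earlier, not something the notebook prescribes.

```python
import torch

def scaled_sigmoid(raw, lo=0.5, hi=5.0):
    """Squash unbounded scores into the range between lo and hi.

    The default bounds are an assumption based on the observed ratings.
    """
    return lo + (hi - lo) * torch.sigmoid(raw)

raw = torch.tensor([-10.0, 0.0, 10.0])   # made-up unbounded model outputs
print(scaled_sigmoid(raw))
```

To try it, one could wrap the value returned by `forward` in this function. Because a sigmoid only approaches its limits, it may be worth experimenting with a range slightly wider than the ratings themselves.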
