In [1]:
import pandas as pd
import numpy as np

from fastai.collab import *
from fastai.tabular.all import *

# Lesson 7: Collaborative Filtering

In lecture #7, one of the topics we discussed was *Collaborative Filtering*, the mechanism behind recommender systems. We walked through an example of how to build a recommender system by creating *embeddings* for the users and the items being rated, and applied these ideas to predict user ratings and recommend items to users for the *MovieLens* dataset.

In this mini-project, I'll attempt to build a similar system for another popular recommender dataset, the Jester *Jokes* dataset. The dataset contains 100 jokes rated by 24983 users. Ratings are on a scale of -10 to 10, with 99 marking a "null" (missing) rating. The dataset is available at http://eigentaste.berkeley.edu/dataset/.

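First, we'll load the dataset into a dataframe. The file has no header row, so we have to create column names manually: the first column holds a per-user ratings count, which we drop, and I'll number the jokes 1-100. I'll also create a user_id column derived from the index (index + 1), since each row corresponds to a different user.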

In [37]:
joke_cols = [str(i) for i in range(1, 101)]
df = pd.read_csv('./jester-data-1.csv', names=['ratings_count'] + joke_cols)
df.drop('ratings_count', axis=1, inplace=True)
df['user_id'] = df.index + 1
col = df.pop('user_id')
df.insert(0, col.name, col)
df.head(20)


Unnamed: 0,user_id,1,2,3,4,5,6,7,8,9,...,91,92,93,94,95,96,97,98,99,100
0,1,-7.82,8.79,-9.66,-8.16,-7.52,-8.5,-9.85,4.17,-8.98,...,2.82,99.0,99.0,99.0,99.0,99.0,-5.63,99.0,99.0,99.0
1,2,4.08,-0.29,6.36,4.37,-2.38,-9.66,-0.73,-5.34,8.88,...,2.82,-4.95,-0.29,7.86,-0.19,-2.14,3.06,0.34,-4.32,1.07
2,3,99.0,99.0,99.0,99.0,9.03,9.27,9.03,9.27,99.0,...,99.0,99.0,99.0,9.08,99.0,99.0,99.0,99.0,99.0,99.0
3,4,99.0,8.35,99.0,99.0,1.8,8.16,-2.82,6.21,99.0,...,99.0,99.0,99.0,0.53,99.0,99.0,99.0,99.0,99.0,99.0
4,5,8.5,4.61,-4.17,-5.39,1.36,1.6,7.04,4.61,-0.44,...,5.19,5.58,4.27,5.19,5.73,1.55,3.11,6.55,1.8,1.6
5,6,-6.17,-3.54,0.44,-8.5,-7.09,-4.32,-8.69,-0.87,-6.65,...,-3.54,-6.89,-0.68,-2.96,-2.18,-3.35,0.05,-9.08,-5.05,-3.45
6,7,99.0,99.0,99.0,99.0,8.59,-9.85,7.72,8.79,99.0,...,99.0,99.0,99.0,99.0,99.0,2.33,99.0,99.0,99.0,99.0
7,8,6.84,3.16,9.17,-6.21,-8.16,-1.7,9.27,1.41,-5.19,...,7.23,-1.12,-0.1,-5.68,-3.16,-3.35,2.14,-0.05,1.31,0.0
8,9,-3.79,-3.54,-9.42,-6.89,-8.74,-0.29,-5.29,-8.93,-7.86,...,4.37,-0.29,4.17,-0.29,-0.29,-0.29,-0.29,-0.29,-3.4,-4.95
9,10,3.01,5.15,5.15,3.01,6.41,5.15,8.93,2.52,3.01,...,99.0,4.47,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0


It'd be good to get this into a format that a fastai dataloader can understand, with one row per rating in the form (*user_id*, *joke_id*, *rating*). Fortunately, pandas has *melt*, which does exactly this. We just need to specify which columns we'd like to unpivot into rows.

In [38]:
# turn df into user_id, joke_id, rating format
df = df.melt(id_vars='user_id', var_name='joke_id', value_vars=joke_cols, value_name='rating')
df.head()

Unnamed: 0,user_id,joke_id,rating
0,1,1,-7.82
1,2,1,4.08
2,3,1,99.0
3,4,1,99.0
4,5,1,8.5


Now would also be a good time to remove any rows that don't have valid ratings (i.e., where rating == 99):

In [39]:
df = df[df['rating'] != 99]
df.head()

Unnamed: 0,user_id,joke_id,rating
0,1,1,-7.82
1,2,1,4.08
4,5,1,8.5
5,6,1,-6.17
7,8,1,6.84


Now we're ready to create a dataloader. Fastai has a special *CollabDataLoaders* class built for collaborative filtering, so we can just use that.

In [40]:
dls = CollabDataLoaders.from_df(df, item_name='joke_id', bs=64)
dls.show_batch()

Unnamed: 0,user_id,joke_id,rating
0,19805,7,2.77
1,5214,88,-4.08
2,18512,27,-0.34
3,21640,47,-8.16
4,20180,51,-8.74
5,10405,12,4.76
6,6982,2,-9.13
7,15088,30,6.55
8,18054,25,-7.23
9,16196,78,-4.17


Since we'll first try our hand at building a collaborative filtering model ourselves, it's important to outline the necessary components:

* We need embeddings for each user and each joke id
* We need a way to access these embeddings for certain user and joke ids
* We also need bias terms for each user and each joke

Then, when we pass in a batch of (user_id, joke_id) pairs, we can use the embeddings and bias terms to predict a rating for each pair. We can then compare these predictions to the actual ratings and use the loss to update the embeddings and bias terms.
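
To make the prediction step concrete, here is a minimal NumPy sketch of the same idea on made-up data. The sizes, the random values, and the three ratings are all hypothetical and chosen only for illustration; the real model below uses PyTorch parameters instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes: 4 users, 3 jokes, 2 latent factors.
n_users, n_jokes, n_factors = 4, 3, 2
user_emb = rng.normal(size=(n_users, n_factors))
joke_emb = rng.normal(size=(n_jokes, n_factors))
user_bias = rng.normal(size=n_users)
joke_bias = rng.normal(size=n_jokes)

# A batch of (user index, joke index) pairs and made-up observed ratings.
batch = np.array([[0, 2], [3, 1], [1, 1]])
ratings = np.array([4.5, -2.0, 7.25])

u, j = batch[:, 0], batch[:, 1]
# Row-wise dot product of the looked-up embeddings, plus both bias terms.
preds = (user_emb[u] * joke_emb[j]).sum(axis=1) + user_bias[u] + joke_bias[j]

# Mean absolute error between the predictions and the observed ratings.
loss = np.abs(preds - ratings).mean()
print(preds.shape, loss)
```

In training, the gradient of this loss with respect to the embeddings and biases is what drives the updates.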

Fastai implements an 'Embedding' class, but we can also write it ourselves.

First, I'll create a function that generates the parameters for a single set of embeddings: a matrix of size *m* by *n*, where *m* is the number of items and *n* is the number of embedding factors. The same function also returns a vector of *m* bias terms, one per item. We wrap both in nn.Parameter so that they are registered as trainable parameters and can be updated by gradient descent during training.

In [41]:
def create_embedding(num_items, num_factors):
    return nn.Parameter(torch.randn(num_items, num_factors)), nn.Parameter(torch.randn(num_items))

Then, we can create a CollabFilter module to put everything together:

* We create embedding matrices for users and jokes, and bias terms for users and jokes
* In the forward function, we access the embeddings for each (user, joke) pair in the batch
* We calculate the dot product of the user and joke embeddings and add the bias terms to get predictions
* We constrain the predictions via torch.sigmoid and rescale them so that they fall within our specified rating range (-10.5, 10.5)

In [45]:
n_users  = len(dls.classes['user_id'])
n_jokes = len(dls.classes['joke_id'])

class CollabFilter(Module):
    def __init__(self, num_users, num_jokes, num_factors, rating_range = (-10.5, 10.5)) -> None:
        self.user_embedding, self.user_bias = create_embedding(num_users, num_factors)
        self.joke_embedding, self.joke_bias = create_embedding(num_jokes, num_factors)
        self.rating_range = rating_range

    def forward(self, x):
        
        user_indices, joke_indices = x[:,0], x[:,1]

        user_factors = self.user_embedding[user_indices]
        joke_factors = self.joke_embedding[joke_indices]
        user_bias = self.user_bias[user_indices]
        joke_bias = self.joke_bias[joke_indices]

        prediction = (user_factors * joke_factors).sum(dim=1) + user_bias + joke_bias
        
        return torch.sigmoid(prediction) * (self.rating_range[1] - self.rating_range[0]) + self.rating_range[0]
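
The final line of the forward function squashes the raw score with a sigmoid and then stretches it over the rating range. The sketch below reproduces that arithmetic in NumPy; the helper name `scaled_sigmoid` is hypothetical. A plausible reason for using (-10.5, 10.5) rather than (-10, 10) is that a sigmoid never reaches its endpoints, so a slightly wider range lets the model get close to the true extremes.

```python
import numpy as np

def scaled_sigmoid(x, low, high):
    """Map any real number into the open interval (low, high)."""
    return 1.0 / (1.0 + np.exp(-x)) * (high - low) + low

# Three raw scores: strongly negative, neutral, strongly positive.
raw = np.array([-5.0, 0.0, 5.0])
scaled = scaled_sigmoid(raw, -10.5, 10.5)
print(scaled)
```

A raw score of 0 lands exactly in the middle of the range, and large positive or negative scores approach the upper and lower bounds without ever reaching them.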

We can pass our model to a fastai Learner to train it. We'll try using 40 embedding factors with a weight decay of 0.01. *Weight decay* is a regularization mechanism that penalizes large weights, by adding a squared-weights term to the loss function, in order to prevent overfitting.
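
As a rough illustration of that penalty, the sketch below adds `wd` times the sum of squared weights to a made-up loss value. The function name and all of the numbers are hypothetical, and this shows only the classic L2 formulation; an optimizer may instead apply the decay directly in its update step, which this sketch does not model.

```python
import numpy as np

def loss_with_weight_decay(base_loss, weights, wd):
    """Add wd * (sum of squared weights) to a base loss value."""
    return base_loss + wd * np.sum(weights ** 2)

weights = np.array([0.5, -2.0, 3.0])   # hypothetical parameter values
base_loss = 1.0                         # hypothetical data loss
wd = 0.01

penalised = loss_with_weight_decay(base_loss, weights, wd)
# Gradient of the penalty alone: each weight is pulled toward zero
# in proportion to its own size.
penalty_grad = 2 * wd * weights
print(penalised, penalty_grad)
```

Because the penalty gradient scales with each weight, larger weights are shrunk more aggressively than small ones.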

In [46]:
cf = CollabFilter(n_users, n_jokes, 40)
learn = Learner(dls, cf, loss_func=mae)
learn.fit_one_cycle(6, 4e-3, wd=0.01)

epoch,train_loss,valid_loss,time
0,5.125642,5.219092,01:57
1,3.385003,3.44995,01:53
2,3.167202,3.248522,01:51
3,2.94829,3.153553,01:54
4,2.592825,3.146925,01:55
5,2.216538,3.16314,01:57


A validation error (mean absolute error) of 3.16 might not seem like the greatest result, but if my interpretation of the original collaborative filtering paper for this dataset, [Goldberg et al., 2000](https://goldberg.berkeley.edu/pubs/eigentaste.pdf), is correct, we actually beat their un-normalized results of ~3.7, albeit with 23 more years of research on our side.

In [68]:
preds = learn.model(torch.tensor([[1, i] for i in range(1, 101)])).detach().numpy()[:9]
actuals = df[df['user_id'] == 1]['rating'].values[:9]
print(preds)
print(actuals)

[ 1.075839    0.5206909   2.1809263   0.04873371  2.2790222   1.6494293
 -0.14922619  2.043972    1.6697493 ]
[-7.82  8.79 -9.66 -8.16 -7.52 -8.5  -9.85  4.17 -8.98]


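Looking at a few predictions confirms that the model is not especially accurate for this user: the predicted ratings sit in a narrow band between roughly 0 and 2, while the actual ratings are mostly strongly negative. So, at least on these nine jokes, the model does not capture that this user tends to view most jokes negatively. We can also inspect a learned user bias term; the value below is only very slightly negative, essentially zero. Note that the cell below reads index 0, whereas the predictions above used index 1 for the user, so the two may not refer to the same user.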

In [63]:
learn.model.user_bias[0]

tensor(-0.0033, grad_fn=<SelectBackward0>)
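
One caveat when indexing the model by hand, as in the cells above: the row index in an embedding matrix is a position in the dataloader's vocabulary, not necessarily the raw id. The sketch below is a hypothetical stand-in that assumes the vocabulary is built by sorting the unique string labels and reserving index 0 for a placeholder entry; whether that matches the real vocabulary should be checked by inspecting `dls.classes['joke_id']`.

```python
# Hypothetical stand-in for a vocabulary built by sorting string labels,
# with a placeholder entry at index 0.
joke_ids = [str(i) for i in range(1, 101)]
vocab = ['#na#'] + sorted(joke_ids)   # lexicographic order: '1', '10', '100', '11', ...

# Map each raw label to its row in the embedding matrix.
label_to_index = {label: idx for idx, label in enumerate(vocab)}

print(vocab[:5])
print(label_to_index['2'])
```

Under that assumption, joke '2' would not sit at row 2, so looking rows up through a label-to-index mapping is safer than assuming the id and the row number coincide.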

Another technique often seen in collaborative filtering is converting the model into a simple neural network. Instead of taking a dot product, we concatenate the user and joke embeddings and pass the result through a small stack of linear layers. This also allows us to use embeddings of different sizes for users and jokes, since we no longer need to compute the dot product of the embeddings.

In [84]:
class CollabFilterNN(Module):
    def __init__(self, num_users, num_user_factors, num_jokes, num_joke_factors, hidden_activations, rating_range = (-10.5, 10.5)) -> None:
        self.user_embedding, _ = create_embedding(num_users, num_user_factors)
        self.joke_embedding, _ = create_embedding(num_jokes, num_joke_factors)
        self.rating_range = rating_range

        self.layers = nn.Sequential(
            nn.Linear(num_user_factors + num_joke_factors, hidden_activations),
            nn.ReLU(),
            nn.Linear(hidden_activations, 1)
        )

    def forward(self, x):
        embs = torch.cat([self.user_embedding[x[:,0]], self.joke_embedding[x[:,1]]], dim=1)
        x = self.layers(embs)
        
        return torch.sigmoid(x) * (self.rating_range[1] - self.rating_range[0]) + self.rating_range[0]

Fastai provides *get_emb_sz*, which returns a suggested number of embedding factors for each categorical variable. We can use this to choose the embedding sizes for users and jokes.

In [66]:
((num_users, user_factors), (num_jokes, joke_factors)) = get_emb_sz(dls)
((num_users, user_factors), (num_jokes, joke_factors))

((24984, 464), (101, 21))
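
For intuition about where these numbers come from, the sketch below implements what I understand to be the library's default rule of thumb: an embedding width that grows slowly with the number of categories and is capped at 600. The function name `emb_size_rule` is my own, and the formula is written from memory rather than taken from this notebook, though it does reproduce the sizes shown above.

```python
def emb_size_rule(n_categories):
    """Heuristic embedding width: grows slowly with cardinality, capped at 600."""
    return min(600, round(1.6 * n_categories ** 0.56))

# Cardinalities reported above (each includes one extra placeholder class).
print(emb_size_rule(24984), emb_size_rule(101))
```

Note that the user embedding is far wider than the joke embedding simply because there are many more users than jokes.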

Next, we can try training this new model:

In [85]:
cfnn = CollabFilterNN(num_users, user_factors, num_jokes, joke_factors, 100)
learn = Learner(dls, cfnn, loss_func=mae)
learn.fit_one_cycle(6, 4e-3, wd=0.01)

epoch,train_loss,valid_loss,time
0,3.592479,3.620988,13:09
1,3.493495,3.457361,13:26
2,3.409755,3.355438,13:35
3,3.2605,3.277101,13:29
4,3.041634,3.20046,13:14
5,2.85558,3.200542,13:36


This performed slightly worse than the previous model (a validation loss of 3.20 versus 3.16). We can also try again with a higher weight decay, this time for 5 epochs.

In [86]:
cfnn = CollabFilterNN(num_users, user_factors, num_jokes, joke_factors, 100)
learn = Learner(dls, cfnn, loss_func=mae)
learn.fit_one_cycle(5, 4e-3, wd=0.05)

epoch,train_loss,valid_loss,time
0,3.606181,3.61641,13:34
1,3.499767,3.548164,13:53
2,3.472067,3.4491,13:59
3,3.264365,3.307691,13:50
4,3.143815,3.241281,14:03


No improvement (a validation loss of 3.24 versus 3.20), but it was worth a shot. We could continue to try different hyperparameters if we wanted to achieve better performance.

At this point we could use this model as a backend for a recommender system, where we can attempt to show jokes to users that they have not seen before but we think they will like. We could also find the highest scoring jokes across all users and show those to users who have not rated any yet.
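
The first of those ideas can be sketched as follows, assuming we already have one user's predicted ratings as an array and a boolean mask of the jokes they have rated. The function name `recommend_unseen` and the sample numbers are hypothetical.

```python
import numpy as np

def recommend_unseen(pred_row, rated_mask, k=3):
    """Return indices of the k highest-predicted items the user has not rated."""
    # Push already-rated items to the bottom of the ranking.
    scores = np.where(rated_mask, -np.inf, pred_row)
    n_unseen = int((~rated_mask).sum())
    top = np.argsort(scores)[::-1][:min(k, n_unseen)]
    return top.tolist()

# Hypothetical predicted ratings for one user across six jokes.
pred_row = np.array([2.1, -4.0, 7.5, 0.3, 6.2, -1.1])
rated_mask = np.array([False, False, True, False, False, True])

print(recommend_unseen(pred_row, rated_mask, k=2))
```

For users with no ratings at all, the same function could be fed the per-joke average rating in place of `pred_row`, with an all-False mask.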