# Collaborative Filtering
These are notes from lesson 7 of fast.ai's Practical Deep Learning for Coders.

::: {.callout-tip title="Homework Task"}
- Create a collaborative filtering model in a [spreadsheet](https://docs.google.com/spreadsheets/d/1tIzbgwu3qmAJounfKtr4n-uwAkupetmVqkHAWllYRDw/edit?usp=sharing)
:::

## 1. The Intuition Behind Collaborative Filtering
We have users' ratings of movies. 

Say we had “embeddings” that score each movie and each user against a set of categories. For a given **movie**, we have a vector of `[action, sci-fi, romance]` scores, and for a given **user** we have their preference for `[action, sci-fi, romance]`. The dot product of the user embedding and the movie embedding then measures how well the movie matches the user's tastes, and we use it as the predicted user rating. 
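
As a quick worked example of that dot product, here is a minimal sketch in plain Python. The embedding values and the names `movie`, `fan` and `romantic` are made up for illustration; real embeddings are learned and their dimensions are not labelled.

```python
# Hypothetical embeddings over the categories [action, sci-fi, romance].
movie = [0.9, 0.8, -0.6]      # an action-heavy sci-fi film with little romance
fan = [0.9, 0.7, -0.8]        # a user who likes action and sci-fi
romantic = [-0.5, -0.3, 0.9]  # a user who mostly likes romance


def dot(a, b):
    """Dot product: multiply matching elements and sum the results."""
    return sum(x * y for x, y in zip(a, b))


fan_score = dot(fan, movie)
romantic_score = dot(romantic, movie)

# The better the match between tastes and movie, the higher the score.
print(round(fan_score, 2), round(romantic_score, 2))
```

The raw scores are not yet on a 0 to 5 rating scale; they only rank how well each user matches the movie.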

So the problem boils down to: 

1. What are the embeddings? i.e. the salient factors (`[action, sci-fi, romance]` in the example above)
2. How do we get them?

The answer to both questions is: we just let the model learn them.

Let’s start with a randomised embedding for each movie and each user. Our loss function then measures the error between the predicted rating and the actual rating, for example the mean squared error (MSE) used in the implementation in section 3. 
Now we can use SGD to optimise those embeddings to find the best values. 


## 2. A deep learning spreadsheet (!)
To build an intuition for the calculations behind a collaborative filter, we can work through a (smaller) example in Excel. 
This allows us to see the logic and dig into the calculations before we create them "for real" in Python.

::: {.callout-tip}
This can be found in [this spreadsheet](https://docs.google.com/spreadsheets/d/1tIzbgwu3qmAJounfKtr4n-uwAkupetmVqkHAWllYRDw/edit?usp=sharing).
:::

We first look at an example where the ratings are laid out in a cross-table (users by movies) and each prediction is the dot product of a user embedding and a movie embedding.

Then we reshape the problem slightly by placing all of the embeddings in a matrix and doing a lookup. This is essentially what PyTorch does: an embedding layer is an array lookup, which gives the same result as a matrix multiplication by a one-hot encoded vector but avoids the wasted computation.

We then add a bias term to account for users who love all movies or hate all movies, and for movies that are universally beloved.

## 3. Implementing a Collaborative Filter

The broad idea behind collaborative filtering is:

- If we could quantify the most salient "latent factors" about a movie, and...
- Quantify how much a user cares about that factor, then...
- If we multiplied the two (dot product) it would give a measure of their rating.

But what are those latent factors? We let the model learn them. 

1. We initialise randomised latent factors (called embeddings).
2. We use them to predict the user's rating for each movie. Initially, those randomised weights will give terrible predictions.
3. Our loss function is the MSE between the actual ratings (the ground truth) and the predicted ratings.
4. We optimise the embedding values to minimise this loss function.
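
The four steps above can be sketched without any libraries. The toy ratings, the learning rate and all of the names below are made up for illustration, and this is plain per-sample SGD rather than what fastai does internally.

```python
import random

random.seed(0)

# Hypothetical toy data: (user index, movie index, rating).
toy_ratings = [
    (0, 0, 5.0), (0, 1, 1.0),
    (1, 0, 4.0), (1, 2, 2.0),
    (2, 1, 5.0), (2, 2, 4.0),
]
n_users, n_movies, n_factors = 3, 3, 2


def random_embeddings(n_rows, n_cols):
    """Step 1: small random latent factors."""
    return [[random.gauss(0, 0.1) for _ in range(n_cols)] for _ in range(n_rows)]


user_emb = random_embeddings(n_users, n_factors)
movie_emb = random_embeddings(n_movies, n_factors)


def predict(u, m):
    """Step 2: the predicted rating is the dot product of the two embeddings."""
    return sum(a * b for a, b in zip(user_emb[u], movie_emb[m]))


def mse():
    """Step 3: mean squared error between predicted and actual ratings."""
    return sum((predict(u, m) - r) ** 2 for u, m, r in toy_ratings) / len(toy_ratings)


initial_loss = mse()

# Step 4: SGD. For one rating, d(err^2)/d(user factor) = 2 * err * movie factor,
# and symmetrically for the movie factor.
lr = 0.01
for epoch in range(500):
    random.shuffle(toy_ratings)
    for u, m, r in toy_ratings:
        err = predict(u, m) - r
        for k in range(n_factors):
            u_k, m_k = user_emb[u][k], movie_emb[m][k]
            user_emb[u][k] -= lr * 2 * err * m_k
            movie_emb[m][k] -= lr * 2 * err * u_k

final_loss = mse()
print(f"MSE before: {initial_loss:.3f}, after: {final_loss:.3f}")
```

With so few ratings the embeddings can simply memorise the data, so this only illustrates the mechanics, not a model that generalises.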

### 3.1. Loading MovieLens data
We use data on user ratings of movies sourced from [MovieLens](https://grouplens.org/datasets/movielens/).
The `ml-latest-small` data set is downloaded and saved in the `DATA_DIR` folder.

In [4]:
from pathlib import Path

from fastai.collab import CollabDataLoaders, Module, Embedding
from fastai.tabular.all import one_hot, sigmoid_range, MSELossFlat, Learner
import pandas as pd 
import torch


DATA_DIR = Path("/Users/gurpreetjohl/workspace/python/ml-practice/ml-practice/datasets/ml-latest-small")

Load the `ratings` data which we will use for this task:

In [5]:
ratings = pd.read_csv(DATA_DIR / 'ratings.csv')
ratings

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
...,...,...,...,...
100831,610,166534,4.0,1493848402
100832,610,168248,5.0,1493850091
100833,610,168250,5.0,1494273047
100834,610,168252,5.0,1493846352


The users and movies are encoded as integers. 

For reference, we can load the `movies` data to see what each `movieId` corresponds to:

In [6]:
movies = pd.read_csv(DATA_DIR / 'movies.csv')
movies

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy
9738,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy
9739,193585,Flint (2017),Drama
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation


We'll merge the two for easier human readability.

In [7]:
ratings = ratings.merge(movies)
ratings

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,1,3,4.0,964981247,Grumpier Old Men (1995),Comedy|Romance
2,1,6,4.0,964982224,Heat (1995),Action|Crime|Thriller
3,1,47,5.0,964983815,Seven (a.k.a. Se7en) (1995),Mystery|Thriller
4,1,50,5.0,964982931,"Usual Suspects, The (1995)",Crime|Mystery|Thriller
...,...,...,...,...,...,...
100831,610,166534,4.0,1493848402,Split (2017),Drama|Horror|Thriller
100832,610,168248,5.0,1493850091,John Wick: Chapter Two (2017),Action|Crime|Thriller
100833,610,168250,5.0,1494273047,Get Out (2017),Horror
100834,610,168252,5.0,1493846352,Logan (2017),Action|Sci-Fi


### 3.2. Prepare the Data

In [9]:
dls = CollabDataLoaders.from_df(ratings, item_name='title', bs=64)
dls.show_batch()

Unnamed: 0,userId,title,rating
0,514,Blockers (2018),3.0
1,307,It's Pat (1994),1.5
2,572,Phenomenon (1996),3.0
3,249,Final Fantasy: The Spirits Within (2001),3.5
4,232,Smokin' Aces (2006),3.0
5,400,Blade Runner (1982),4.5
6,249,She's Out of My League (2010),4.0
7,202,Pulp Fiction (1994),4.0
8,425,Falling Down (1993),4.0
9,200,Tomorrow Never Dies (1997),4.5


Initialise randomised 5-dimensional embeddings.

How should we choose the number of latent factors (5 in the example below)?
Jeremy wrote down some ballpark values for models of different sizes in Excel, then fit a function to them to get a heuristic. This heuristic is the default used by fastai. 
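
The fitted function itself is not written down in these notes. To my knowledge, fastai's `emb_sz_rule` helper computes `min(600, round(1.6 * n_cat**0.56))`, so the sketch below re-implements that formula under that assumption. The name `heuristic_emb_size` is made up, and the constants are worth checking against the library source before relying on them.

```python
def heuristic_emb_size(n_categories):
    """Rule-of-thumb embedding size for a categorical variable.

    Assumed to mirror fastai's emb_sz_rule; the constants are from memory.
    """
    return min(600, round(1.6 * n_categories ** 0.56))


# The suggested size grows slowly with the number of categories and is capped at 600.
for n in (10, 1_000, 100_000):
    print(n, heuristic_emb_size(n))
```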


In [10]:
n_users  = len(dls.classes['userId'])
n_movies = len(dls.classes['title'])
n_factors = 5

user_factors = torch.randn(n_users, n_factors)
movie_factors = torch.randn(n_movies, n_factors)

Doing a matrix multiply by a one-hot encoded vector gives the same result as doing a lookup in an array; the lookup is just the more computationally efficient way to get it. Recall the softmax example. 

An embedding is essentially just “look up in an array”. 
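
To convince ourselves of that equivalence, here is a minimal sketch in plain Python. The embedding values and the helper names are made up for illustration; in the real model the lookup is done by the `Embedding` layer.

```python
# Hypothetical 4 x 3 embedding matrix: one row per user, one column per factor.
embeddings = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]


def one_hot(index, size):
    """Vector of zeros with a single 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def vector_matrix_product(vector, matrix):
    """Multiply a row vector by a matrix, column by column."""
    n_cols = len(matrix[0])
    return [sum(v * row[c] for v, row in zip(vector, matrix)) for c in range(n_cols)]


user_index = 2

# The slow way: multiply by a one-hot vector (mostly multiplying by zero).
via_matmul = vector_matrix_product(one_hot(user_index, len(embeddings)), embeddings)

# The fast way: just index into the array.
via_lookup = embeddings[user_index]

print(via_matmul == via_lookup)
```

The one-hot version touches every row of the matrix only to throw all but one away, which is why the lookup is preferred.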



## 4. Collaborative Filtering From Scratch

Putting a `sigmoid_range` on the final layer squashes the predictions into the 0 to 5 range, which means “the model doesn’t have to work as hard” to get ratings into the right range. 
In practice we use 5.5 as the upper bound because a sigmoid never quite reaches 1, but we want predicted ratings to be able to reach 5. 
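
A minimal sketch of why the upper bound matters, assuming `sigmoid_range(x, low, high)` is a sigmoid rescaled to `sigmoid(x) * (high - low) + low`. The function name `sigmoid_range_sketch` is made up so that it is not mistaken for the fastai function.

```python
import math


def sigmoid_range_sketch(x, low, high):
    """Sigmoid rescaled so that outputs lie strictly between low and high."""
    return low + (high - low) / (1 + math.exp(-x))


# With an upper bound of 5, even a fairly large raw output stays short of 5.
capped_at_5 = sigmoid_range_sketch(3.0, 0, 5)

# With an upper bound of 5.5, a modest raw output of ln(10) maps to a rating of 5.
reaches_5 = sigmoid_range_sketch(math.log(10), 0, 5.5)

print(round(capped_at_5, 3), round(reaches_5, 3))
```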


In [26]:
class DotProduct(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0,5.5)):
        self.user_factors = Embedding(n_users, n_factors)
        self.movie_factors = Embedding(n_movies, n_factors)
        self.y_range = y_range

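    def forward(self, x):
        # Column 0 of x holds user indices, column 1 holds movie indices
        users = self.user_factors(x[:, 0])
        movies = self.movie_factors(x[:, 1])
        # Dot product of each user embedding with its movie embedding
        raw_output = (users * movies).sum(dim=1)
        # Squash the raw output into y_range with a scaled sigmoid
        return sigmoid_range(raw_output, *self.y_range)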

We can now fit a model:

In [29]:
embedding_dim = 50
num_epochs = 5
max_learning_rate = 5e-3

model = DotProduct(n_users, n_movies, embedding_dim)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(num_epochs, max_learning_rate)

epoch,train_loss,valid_loss,time
0,0.914778,0.895329,00:05
1,0.73677,0.808955,00:05
2,0.514396,0.794814,00:05
3,0.297406,0.796249,00:05
4,0.20721,0.800616,00:05


### 4.2. Adding a bias term
Adding a user bias term and a movie bias term to the prediction helps account for the fact that some users tend to rate everything high (4 or 5) while others tend to rate everything low. The same applies to movies that almost everyone rates a 5 or a 1. 

In [30]:
class DotProductBias(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0,5.5)):
        self.user_factors = Embedding(n_users, n_factors)
        self.user_bias = Embedding(n_users, 1)
        self.movie_factors = Embedding(n_movies, n_factors)
        self.movie_bias = Embedding(n_movies, 1)
        self.y_range = y_range
        
    def forward(self, x):
        users = self.user_factors(x[:,0])
        movies = self.movie_factors(x[:,1])
        # keepdim=True keeps shape (batch, 1) so it lines up with the bias embeddings
        raw_output = (users * movies).sum(dim=1, keepdim=True)
        # Add the per-user and per-movie bias terms
        raw_output += self.user_bias(x[:,0]) + self.movie_bias(x[:,1])
        return sigmoid_range(raw_output, *self.y_range)

In [32]:
model = DotProductBias(n_users, n_movies, embedding_dim)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(num_epochs, max_learning_rate)

epoch,train_loss,valid_loss,time
0,0.787737,0.798411,00:07
1,0.684797,0.740917,00:06
2,0.430143,0.754881,00:07
3,0.220881,0.769882,00:07
4,0.141639,0.775056,00:07


### 4.3. Weight Decay
The validation loss in the previous model decreases then increases while the training loss keeps falling, which is a clear indication of overfitting.

We want to avoid overfitting, but data augmentation isn’t possible here. One approach is to use **weight decay**, AKA L2 regularisation: we add the sum of the squared weights to the loss function.

How does this prevent overfitting? The larger the coefficients, the sharper the canyons the model is able to produce, which allows it to fit individual data points. By penalising larger weights, the model will only produce sharp changes if this helps it fit many points well, so it should generalise better.

We essentially want to modify our loss function with an additional term:

```
loss_with_weight_decay = loss + weight_decay * (parameters**2).sum()
```

In practice, computing this large sum and adding it to the loss would be inefficient and potentially numerically unstable. We only actually care about the *gradient* of the loss, so we can instead add the gradient of the additional term directly to the existing gradient:
```
parameters.grad += weight_decay * 2 * parameters
```

But `weight_decay` is just a constant that we choose, so we can fold the `2*` term into it.
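
As a sanity check on that gradient, here is a small sketch in plain Python that compares `weight_decay * 2 * parameters` against a finite-difference estimate of the penalty's gradient. The parameter values, `eps` and `lr` are arbitrary choices for illustration.

```python
weight_decay = 0.1
parameters = [0.5, -2.0, 3.0]


def penalty(params):
    """The extra loss term: weight_decay * (parameters**2).sum()."""
    return weight_decay * sum(p ** 2 for p in params)


# Gradient of the penalty, as used in the update above.
analytic_grad = [weight_decay * 2 * p for p in parameters]

# Central finite differences: nudge one parameter at a time.
eps = 1e-6
numeric_grad = []
for i in range(len(parameters)):
    up = parameters.copy()
    down = parameters.copy()
    up[i] += eps
    down[i] -= eps
    numeric_grad.append((penalty(up) - penalty(down)) / (2 * eps))

# One SGD step using only the penalty gradient pulls every weight towards zero.
lr = 0.1
decayed = [p - lr * g for p, g in zip(parameters, analytic_grad)]

print(analytic_grad, numeric_grad, decayed)
```

Each weight is multiplied by `1 - 2 * lr * weight_decay` on every step, which is where the name "weight decay" comes from.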

In [33]:
weight_decay = 0.1

model = DotProductBias(n_users, n_movies, embedding_dim)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(num_epochs, max_learning_rate, wd=weight_decay)

epoch,train_loss,valid_loss,time
0,0.809498,0.814317,00:07
1,0.721856,0.742981,00:07
2,0.552588,0.720124,00:07
3,0.382654,0.714763,00:07
4,0.28517,0.716273,00:07



## References
- [Course lesson page](https://course.fast.ai/Lessons/lesson7.html)
- [Collaborative filtering notebook](https://www.kaggle.com/code/jhoward/collaborative-filtering-deep-dive/notebook)
