# "A Journey Through Fastbook (AJTFB) - Chapter 8: Collaborative Filtering"
> This chapter of ["Deep Learning for Coders with fastai & PyTorch"](https://github.com/fastai/fastbook) moves us away from computer vision to collaborative filtering (think recommendation systems). We'll explore building these models using the traditional "dot product" approach and also using a neural network, but we'll begin by covering the idea of "latent factors," which are important for both collaborative filtering and tabular models.  Let's go!

- toc: true
- branch: master
- badges: true
- hide_binder_badge: true
- comments: true
- author: Wayde Gilliam
- categories: [fastai, fastbook, collaborative filtering, latent factors, embeddings, recommender systems, recsys]
- image: images/articles/fastbook.jpg
- search_exclude: false
- hide: false

Other posts in this series:  
[A Journey Through Fastbook (AJTFB) - Chapter 1](https://ohmeow.com/posts/2020/11/06/ajtfb-chapter-1.html)  
[A Journey Through Fastbook (AJTFB) - Chapter 2](https://ohmeow.com/posts/2020/11/16/ajtfb-chapter-2.html)  
[A Journey Through Fastbook (AJTFB) - Chapter 3](https://ohmeow.com/posts/2020/11/22/ajtfb-chapter-3.html)  
[A Journey Through Fastbook (AJTFB) - Chapter 4](https://ohmeow.com/posts/2021/05/23/ajtfb-chapter-4.html)  
[A Journey Through Fastbook (AJTFB) - Chapter 5](https://ohmeow.com/posts/2021/06/03/ajtfb-chapter-5.html)  
[A Journey Through Fastbook (AJTFB) - Chapter 6a](https://ohmeow.com/posts/2021/06/10/ajtfb-chapter-6-multilabel.html)  
[A Journey Through Fastbook (AJTFB) - Chapter 6b](https://ohmeow.com/posts/2022/02/09/ajtfb-chapter-6-regression.html)  
[A Journey Through Fastbook (AJTFB) - Chapter 7](https://ohmeow.com/posts/2022/03/28/ajtfb-chapter-7.html)  




In [None]:
#hide
! pip install fastai -Uqq

import pdb

In [None]:
#hide
def plot_function(f, tx=None, ty=None, title=None, min=-2, max=2, figsize=(6,4)):
    x = torch.linspace(min,max)
    fig,ax = plt.subplots(figsize=figsize)
    ax.plot(x,f(x))
    if tx is not None: ax.set_xlabel(tx)
    if ty is not None: ax.set_ylabel(ty)
    if title is not None: ax.set_title(title)

## Collaborative Filtering

**What is it?**

Think recommender systems which "look at which products the current user has used or liked, find other users who have used or liked similar products, and then recommend other products that those users have used or liked."

The key to making collaborative filtering (and tabular) models work is the idea of **latent factors**.

---
## What are "latent factors" and what is the problem they solve?

Remember that models can only work with numbers, and while something like "price" can be used to accurately reflect the value of a house, how do we numerically represent concepts like the day of week, the make/model of a car, or the job function of an employee?

The answer is with latent factors.

In a nutshell, latent factors are numbers associated with a thing (e.g., day of week, model of car, job function, etc...) that are ***learnt*** during model training. At the end of this process, we have numbers that provide a representation of that thing we can use and explore in a variety of ways.  These factors are called "latent" because we don't know what they are beforehand.



> Note: The learnt numbers for "a thing" may vary to one degree or another based on the data used during training and your objective.  For example, what "Sunday" means may be represented differently when you are trying to forecast how many bottles of scotch will be sold that week than if you were trying to predict the number of options that will be traded for a certain equity.



> Important: Latent factors allow us to learn a numerical representation of a thing (especially those for which a single number would not do it justice).



If we had something like this ...

![](https://raw.githubusercontent.com/fastai/fastbook/035016fb0cc826542aef77864f36df88a5055d06/images/att_00040.png)

... how could we predict what users would rate movies they have yet to see?  Let's take a look.

In [None]:
from fastai.collab import *
from fastai.tabular.all import *

path = untar_data(URLs.ML_100k)

In [None]:
ratings_df = pd.read_csv(path/"u.data", delimiter="\t", header=None, names=["user", "movie", "rating", "timestamp"])
ratings_df.head()

Unnamed: 0,user,movie,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


How do we numerically represent user `196` and movie `242`?  With latent factors we don't have to know in advance; we can have such a representation learnt using SGD.

**How do we set this up?**

1. "... randomly initialized some parameters [which] will be a set of latent factors for each user and movie."

2. "... to calculate our predictions [take] the **dot product** of each movie with each user."

3. "... to calculate our loss ... let's pick mean squared error for now, since that is one reasonable way to represent the accuracy of a prediction"



> Note: **dot product** = element-wise multiplication of two vectors summed up.
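To make the note above concrete, here is a minimal plain-Python sketch of a dot product. The factor values are made up for illustration, and the helper name `dot` is mine rather than anything from fastai or PyTorch.

```python
# Hypothetical latent factors for one user and one movie (illustrative values only)
user_factors = [0.9, 0.8, -0.6]
movie_factors = [0.98, 0.9, -0.9]

def dot(a, b):
    """Multiply matching elements of two vectors and sum the results."""
    return sum(x * y for x, y in zip(a, b))

# The larger the result, the better the match between this user and this movie
prediction = dot(user_factors, movie_factors)
print(round(prediction, 3))
```

In the models below the same arithmetic is done for a whole batch at once with `(users * movies).sum(dim=1)`.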



With this in place, "we can optimize our parameters (the latent factors) using stochastic gradient descent, such as to minimize the loss."  In a picture, it looks like this ...

![](https://raw.githubusercontent.com/fastai/fastbook/035016fb0cc826542aef77864f36df88a5055d06/images/att_00041.png)


> Important: The parameters we want to optimize ***are*** the latent factors!
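The three steps above (random initialization, dot-product predictions, MSE loss) plus SGD can be sketched in plain Python. This is only a toy under stated assumptions: the ratings are made up, the learning rate and epoch count are arbitrary, and the gradients are written out by hand instead of using PyTorch's autograd.

```python
import random

random.seed(0)

# Toy data (made up): (user index, movie index, rating)
ratings = [(0, 0, 5.0), (0, 1, 1.0), (1, 0, 4.0), (1, 2, 2.0), (2, 1, 5.0), (2, 2, 4.0)]
n_users, n_movies, n_factors = 3, 3, 2

def init(rows, cols):
    """Step 1: randomly initialize a table of latent factors."""
    return [[random.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]

users, movies = init(n_users, n_factors), init(n_movies, n_factors)

def predict(u, m):
    """Step 2: the prediction is the dot product of the user and movie factors."""
    return sum(a * b for a, b in zip(users[u], movies[m]))

def mse():
    """Step 3: mean squared error over all known ratings."""
    return sum((predict(u, m) - r) ** 2 for u, m, r in ratings) / len(ratings)

loss_before = mse()
lr = 0.02  # arbitrary learning rate for this toy problem
for epoch in range(200):
    for u, m, r in ratings:
        err = predict(u, m) - r
        for k in range(n_factors):
            # gradient of err**2 with respect to each factor
            grad_user, grad_movie = 2 * err * movies[m][k], 2 * err * users[u][k]
            users[u][k] -= lr * grad_user
            movies[m][k] -= lr * grad_movie
loss_after = mse()
print(loss_before, loss_after)
```

Nothing here tells the model what each factor "means"; the numbers simply move in whatever direction lowers the loss.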

In [None]:
movies_df = pd.read_csv(path/"u.item", delimiter="|", header=None, names=["movie", "title"], usecols=(0,1), encoding="latin-1")
movies_df.head()

Unnamed: 0,movie,title
0,1,Toy Story (1995)
1,2,GoldenEye (1995)
2,3,Four Rooms (1995)
3,4,Get Shorty (1995)
4,5,Copycat (1995)


In [None]:
ratings_df = ratings_df.merge(movies_df)
ratings_df.head()

Unnamed: 0,user,movie,rating,timestamp,title
0,196,242,3,881250949,Kolya (1996)
1,63,242,3,875747190,Kolya (1996)
2,226,242,5,883888671,Kolya (1996)
3,154,242,3,879138235,Kolya (1996)
4,306,242,5,876503793,Kolya (1996)


In [None]:
dls = CollabDataLoaders.from_df(ratings_df, item_name="title", user_name="user", rating_name="rating")
dls.show_batch()

Unnamed: 0,user,title,rating
0,59,"Grifters, The (1990)",5
1,787,G.I. Jane (1997),4
2,554,Sabrina (1995),3
3,292,Trainspotting (1996),5
4,327,Air Force One (1997),2
5,838,Hercules (1997),3
6,561,"Remains of the Day, The (1993)",4
7,877,Starship Troopers (1997),4
8,500,"Birds, The (1963)",4
9,595,Happy Gilmore (1996),3


**So how do we create these latent factors for our users and movies?**

"We can represent our movie and user latent factor tables as simple matrices" that we can index into. But as looking up in an index is not something our models know how to do, we need to use a special PyTorch layer that will do this for us (and more efficiently than using a one-hot-encoded, OHE, vector to do the same). 

And that layer is called an **embedding**. It "indexes into a vector using an integer, but has its derivative calculated in such a way that it is identical to what it would have been if it had done a matrix multiplication with a one-hot-encoded vector."



> Important: An embedding is the "thing that you multiply the one-hot-encoded matrix by (or, using the computational shortcut, index into directly)."
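A quick way to convince yourself of that equivalence is to do both operations by hand. The sketch below uses plain Python lists and a made-up 4x3 embedding matrix; it covers only the forward lookup, not the gradient side of the story.

```python
# A made-up embedding matrix: 4 "movies", 3 latent factors each
emb = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]

idx = 2
# One-hot-encode the index: all zeros except a 1 at position `idx`
one_hot = [1.0 if i == idx else 0.0 for i in range(len(emb))]

# One-hot vector (1x4) times the matrix (4x3) gives a 1x3 row
via_matmul = [
    sum(one_hot[i] * emb[i][k] for i in range(len(emb)))
    for k in range(len(emb[0]))
]

# The shortcut: just index into the matrix
via_index = emb[idx]

print(via_matmul == via_index)
```

The multiplication spends most of its time multiplying by zero, which is why indexing directly is the cheaper route.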




---
## Collaborative Filtering: From Scratch (dot product)

A **dot product** approach

In [None]:
n_users = len(dls.classes["user"])
n_movies = len(dls.classes["title"])
n_factors = 5

print(n_users, n_movies, n_factors)

944 1665 5


In [None]:
class DotProduct(Module):
  def __init__(self, n_users, n_movies, n_factors):
      super().__init__()
      self.users_emb = Embedding(n_users, n_factors)
      self.movies_emb = Embedding(n_movies, n_factors)

  def forward(self, inp):
    users = self.users_emb(inp[:,0])
    movies = self.movies_emb(inp[:,1])
    return (users * movies).sum(dim=1)

In [None]:
model = DotProduct(n_users=n_users, n_movies=n_movies, n_factors=n_factors)
learn = Learner(dls, model, loss_func=MSELossFlat())

learn.fit_one_cycle(5, 5e-3)

epoch,train_loss,valid_loss,time
0,4.274058,3.688388,00:14
1,1.120103,1.142191,00:15
2,0.963392,1.020399,00:08
3,0.943534,0.987578,00:09
4,0.856938,0.984708,00:08


### Tip 1: Constrain your range of predictions using `sigmoid_range`

"... to make this model a little bit better ... force those predictions to be between 0 and 5. **One thing we discovered empirically is that it's better to have the range go a little bit over 5**, so we use (0, 5.5)"

In [None]:
class DotProduct(Module):
  def __init__(self, n_users, n_movies, n_factors, y_range=(0, 5.5)):
      super().__init__()
      self.users_emb = Embedding(n_users, n_factors)
      self.movies_emb = Embedding(n_movies, n_factors)
      self.y_range = y_range

  def forward(self, inp):
    users = self.users_emb(inp[:,0])
    movies = self.movies_emb(inp[:,1])
    return sigmoid_range((users * movies).sum(dim=1), *self.y_range)
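To see what `sigmoid_range` does to a raw score, here is a plain-Python stand-in. To my knowledge fastai computes `sigmoid(x) * (high - low) + low`; the function name `sigmoid_range_py` is mine, and it works on single floats, not tensors.

```python
import math

def sigmoid_range_py(x, low, high):
    """Squash a raw score into the open interval (low, high)."""
    return 1 / (1 + math.exp(-x)) * (high - low) + low

# Very negative scores approach `low`, very positive ones approach `high`
for raw in (-10, 0, 3, 10):
    print(raw, round(sigmoid_range_py(raw, 0, 5.5), 3))
```

Because a sigmoid only approaches its upper bound and never reaches it, a cap of exactly 5 would make a 5-star prediction unreachable. That is one plausible reason a slightly larger upper bound works better.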

In [None]:
model = DotProduct(n_users=n_users, n_movies=n_movies, n_factors=50)
learn = Learner(dls, model, loss_func=MSELossFlat())

learn.fit_one_cycle(5, 5e-3)

epoch,train_loss,valid_loss,time
0,0.97703,1.006438,00:09
1,0.862363,0.924024,00:09
2,0.693378,0.879653,00:09
3,0.468937,0.886437,00:09
4,0.376035,0.892566,00:09


### Tip 2: Add a "bias"



> Important: A **bias** allows your model to learn an overall representation of a thing, rather than just a bunch of characteristics.



"One obvious missing piece is that some users are just more positive or negative in their recommendations than others, and some movies are just plain better or worse than others. But in our dot product representation, we do not have any way to encode either of these things ... **because at this point we have only weights; we don't have biases**."


In [None]:
class DotProduct(Module):
  def __init__(self, n_users, n_movies, n_factors, y_range=(0, 5.5)):
      super().__init__()
      self.users_emb = Embedding(n_users, n_factors)
      self.users_bias = Embedding(n_users, 1)

      self.movies_emb = Embedding(n_movies, n_factors)
      self.movies_bias = Embedding(n_movies, 1)

      self.y_range = y_range

  def forward(self, inp):
    # embeddings
    users = self.users_emb(inp[:,0])
    movies = self.movies_emb(inp[:,1])

    # calc our dot product and add in biases 
    # (important to include "keepdim=True" => res.shape = (64,1), else will get rid of dims equal to 1 and you just get (64))
    res = (users * movies).sum(dim=1, keepdim=True)
    res += self.users_bias(inp[:,0]) + self.movies_bias(inp[:,1])

    # return our target constrained prediction
    return sigmoid_range(res, *self.y_range)
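Stripped of the PyTorch machinery, the score computed before `sigmoid_range` is just "dot product plus two biases". The sketch below uses made-up numbers and a hypothetical `raw_score` helper to show why that matters: two movies with identical latent factors can still get different predictions.

```python
def raw_score(user_vec, movie_vec, user_bias, movie_bias):
    """Dot product of the factors plus a per-user and a per-movie bias."""
    dot = sum(u * m for u, m in zip(user_vec, movie_vec))
    return dot + user_bias + movie_bias

user_vec, user_bias = [1.0, 2.0], 0.0

# Same "kind" of movie (identical factors), but one is simply better liked
shared_factors = [0.5, 0.25]
well_liked = raw_score(user_vec, shared_factors, user_bias, movie_bias=0.5)
disliked = raw_score(user_vec, shared_factors, user_bias, movie_bias=-0.5)

print(well_liked, disliked)
```

Without the bias terms, both movies would be forced to receive the same score from this user.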

In [None]:
model = DotProduct(n_users=n_users, n_movies=n_movies, n_factors=50)
learn = Learner(dls, model, loss_func=MSELossFlat())

learn.fit_one_cycle(5, 5e-3)

epoch,train_loss,valid_loss,time
0,0.930593,0.949263,00:11
1,0.837153,0.877913,00:09
2,0.612976,0.874214,00:09
3,0.416554,0.899321,00:10
4,0.288284,0.905329,00:10


### Tip 3: Add "weight decay"

Adding in bias has made our model more complex and therefore more prone to overfitting (which seems to be happening here).



> Note: **Overfitting** is where your validation loss stops improving and actually starts to get worse, even as the training loss keeps falling.



**What do you do when your model overfits?**

We can address this via data augmentation or by including one or more forms of **regularization** (e.g., a means to "encourage the weights to be as small as possible").


**What is "weight decay" (aka "L2 regularization")?**

"... consists of adding to your loss function the sum of all the weights squared."

**Why do that?**

"Because when we compute the gradients, it will add a contribution to them that will encourage the weights to be as small as possible."

**Why would this prevent overfitting?**

"The idea is that the larger the coefficients are, the sharper canyons we will have in the loss function.... **Letting our model learn high parameters might cause it to fit all the data points in the training set with an overcomplex function that has very sharp changes, which will lead to overfitting**."



> Important: "Limiting our weights from growing too much is going to hinder the training of the model ***but*** it will yield a state where it generalizes better."
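The quotes above boil down to two small formulas: add `wd * sum(w**2)` to the loss, which adds `2 * wd * w` to each weight's gradient. Here is a plain-Python sketch of just that idea; the function names and numbers are made up, and this is not a description of how fastai's optimizers apply `wd` internally.

```python
def loss_with_wd(loss, params, wd):
    """The original loss plus wd times the sum of squared weights."""
    return loss + wd * sum(p ** 2 for p in params)

def wd_grad(params, wd):
    """Extra gradient from the penalty: d/dp of wd * p**2 is 2 * wd * p."""
    return [2 * wd * p for p in params]

params = [3.0, -0.5, 0.0]
penalized = loss_with_wd(1.0, params, wd=0.1)
extra_grads = wd_grad(params, wd=0.1)

# Large weights get a large push back toward zero; a zero weight gets none
print(penalized, extra_grads)
```

Since each SGD step subtracts the gradient, the extra term always nudges a weight toward zero, and harder the larger the weight is.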



**How do we add weight decay into our training?**

"... `wd` is a parameter that **controls that sum of squares we add to our loss**" as such:


In [None]:
model = DotProduct(n_users=n_users, n_movies=n_movies, n_factors=50)
learn = Learner(dls, model, loss_func=MSELossFlat())

learn.fit_one_cycle(5, 5e-3, wd=0.1)

epoch,train_loss,valid_loss,time
0,0.951403,0.961861,00:10
1,0.848258,0.886002,00:10
2,0.732632,0.849419,00:10
3,0.598517,0.83337,00:10
4,0.473185,0.833645,00:10


### Creating our own Embedding Module

pp.265-267 show how to write your own `nn.Module` that does what `Embedding` does.  Here are some of the important bits to pay attention to ...

"... optimizers require that they can get all the parameters of a module from the module's `parameters` method," so make sure to tell `nn.Module` that you want to treat a tensor as a parameter by wrapping it in the `nn.Parameter` class like so:

```
class T(Module):
  def __init__(self):
    self.a = nn.Parameter(torch.ones(3))
```


> Important: "All PyTorch modules use `nn.Parameter` for any trainable parameters."

```
class T(Module):
  def __init__(self):
    self.a = nn.Linear(1, 3, bias=False)

t = T()
L(t.parameters()) #=> lists all the weights of your nn.Linear (parameters() alone returns a generator)
type(t.a.weight) #=> torch.nn.parameter.Parameter
```

Now, given a method like this ...

```
def create_params(size):
  return nn.Parameter(torch.zeros(*size).normal_(0, 0.01))
```

... we can create randomly initialized parameters, including parameters for our latent factors and biases, like this:

```
self.users_emb = create_params([n_users, n_factors])
self.users_bias = create_params([n_users])
```


---
## Interpreting Embeddings and Biases



> Note: "... interesting to see what parameters it has discovered ... easiest to interpret are the biases"




In [None]:
movie_bias = learn.model.movies_bias.weight.squeeze() # => squeeze will get rid of all the single dimensions
idxs = movie_bias.argsort()[:5]                       # => "argsort()" returns the indices sorted by value
[dls.classes["title"][i] for i in idxs]               # => look up the movie title in dls.classes

['Children of the Corn: The Gathering (1996)',
 'Crow: City of Angels, The (1996)',
 'Mortal Kombat: Annihilation (1997)',
 'Robocop 3 (1993)',
 'Cable Guy, The (1996)']

"Think about what this means .... It tells us not just whether a movie is of a kind that people tend not to enjoy watching, but that people tend to not like watching it even if it is of a kind that they would otherwise enjoy!"

To get the movies by highest bias:

In [None]:
idxs = movie_bias.argsort(descending=True)[:5]
[dls.classes["title"][i] for i in idxs]

["Schindler's List (1993)",
 'Titanic (1997)',
 'As Good As It Gets (1997)',
 'Shawshank Redemption, The (1994)',
 'Silence of the Lambs, The (1991)']

> Note: To visualize embeddings with many factors, you "can pull out the most important underlying directions" using a dimensionality reduction model like **principal components analysis** (PCA).



See p.268 and these three StatQuest videos for more on how PCA works (btw, StatQuest is one of my top data science references so consider subscribing to his channel). [Video 1](https://www.youtube.com/watch?v=HMOI_lkzW08&t=13s), [Video 2](https://www.youtube.com/watch?v=FgakZw6K1QQ), and [Video 3](https://www.youtube.com/watch?v=oRvgq966yZg)

---
## Collaborative Filtering: Using `fastai.collab`

In [None]:
learn = collab_learner(dls, n_factors=50, y_range=(0, 5.5))
learn.fit_one_cycle(5, 5e-3, wd=0.1)

epoch,train_loss,valid_loss,time
0,0.973411,0.974048,00:10
1,0.872098,0.892812,00:10
2,0.714793,0.84739,00:10
3,0.60744,0.830938,00:10
4,0.491107,0.832031,00:10


In [None]:
learn.model

EmbeddingDotBias(
  (u_weight): Embedding(944, 50)
  (i_weight): Embedding(1665, 50)
  (u_bias): Embedding(944, 1)
  (i_bias): Embedding(1665, 1)
)

In [None]:
movie_bias = learn.model.i_bias.weight.squeeze() 
idxs = movie_bias.argsort()[:5]                   
[dls.classes["title"][i] for i in idxs]              

['Children of the Corn: The Gathering (1996)',
 'Island of Dr. Moreau, The (1996)',
 'Mortal Kombat: Annihilation (1997)',
 'Crow: City of Angels, The (1996)',
 'Vampire in Brooklyn (1995)']

---
## Embedding Distance

"Another thing we can do with these learned embeddings is to look at distance."

**Why do this?**

"If there were two movies that were nearly identical, their embedding vectors would also have to be nearly identical .... There is a more general idea here: movie similarity can be defined by the similarity of users who like those movies. And that directly means that the distance between two movies' embedding vectors can define that similarity."

In [None]:
movie_factors = learn.model.i_weight.weight
idx = dls.classes["title"].o2i["Silence of the Lambs, The (1991)"]
dists = nn.CosineSimilarity(dim=1)(movie_factors, movie_factors[idx][None])
targ_idx = dists.argsort(descending=True)[1]
dls.classes["title"][targ_idx]

'Quiet Room, The (1996)'
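The cosine similarity used above can be written out by hand. This is a plain-Python sketch with made-up three-factor embeddings and made-up titles; it only illustrates the measure itself. Note that in the cell above, the top-ranked match is the movie itself (similarity 1), which is why index `[1]` is taken from the sorted results.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up embeddings: two similar thrillers and one very different musical
movies = {
    "thriller_a": [0.9, 0.1, -0.3],
    "thriller_b": [0.8, 0.2, -0.2],
    "musical": [-0.7, 0.9, 0.4],
}

target = "thriller_a"
# Score every other movie against the target, skipping the target itself
scores = {
    title: cosine_similarity(movies[target], vec)
    for title, vec in movies.items()
    if title != target
}
closest = max(scores, key=scores.get)
print(closest, scores)
```

Cosine similarity compares direction only, so a movie with small factor values can still be the closest match to one with large values.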

---
## Bootstrapping

The **bootstrapping problem** asks how we can make recommendations for a new user for whom no data exists, or for a new product/movie that has no reviews yet.

The recommended approach "is to **use a tabular model based on user metadata to construct your initial embedding vector.** When a new user signs up, think about what questions you could ask to help you understand their tastes. Then you can create a model in which **the dependent variable is a user's embedding vector**, and **the independent variables are the results of the questions that you ask them, along with their signup metadata**."



> Important: Be aware of the "problem of **representation bias**" (e.g., where a few very active users end up skewing the results).



See p.271 for more information on how collaborative models may contribute to positive feedback loops and how humans can mitigate them by being part of the process.

---
## Collaborative Filtering: From Scratch (NN)

A **neural network** approach requires we "take the results of the embedding lookup and concatenate those activations together. This gives us a matrix we can then pass through linear layers and nonlinearities..."



> Note: Because "we'll be concatenating the embedding matrices, rather than taking their dot product, **the two embedding matrices can have different sizes (different numbers of latent factors)**"
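Here is a toy forward pass showing why the widths no longer have to match once we concatenate. It is a plain-Python sketch with made-up embeddings, random weights, and hypothetical helper names (`linear`, `relu`, `rand_matrix`); no training happens.

```python
import random

random.seed(0)

def linear(x, weights, bias):
    """One output per row of `weights`: dot(row, x) + bias."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    """Replace negative activations with zero."""
    return [max(0.0, v) for v in x]

def rand_matrix(rows, cols):
    return [[random.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]

# Made-up embeddings with different widths: 2 factors for the user, 3 for the movie
user_vec = [0.3, -0.1]
movie_vec = [0.5, 0.2, -0.4]

x = user_vec + movie_vec                      # concatenation gives 5 inputs
w1, b1 = rand_matrix(4, len(x)), [0.0] * 4    # hidden layer: 5 -> 4
w2, b2 = rand_matrix(1, 4), [0.0]             # output layer: 4 -> 1

hidden = relu(linear(x, w1, b1))
raw_score = linear(hidden, w2, b2)[0]
print(len(x), len(hidden), raw_score)
```

The first linear layer only cares about the total width of the concatenated vector, which is the sum of the two embedding sizes.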



**How do we determine the number of latent factors a "thing" should have?**

Use `get_emb_sz` to return "the recommended sizes for embedding matrices for your data, **based on a heuristic that fast.ai has found tends to work well in practice**."

In [None]:
embs = get_emb_sz(dls)
embs

[(944, 74), (1665, 102)]
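As far as I can tell, the heuristic behind those numbers is fastai's `emb_sz_rule`, which picks `min(600, round(1.6 * n_cat**0.56))`. Treat the exact formula as my assumption and check the library source for your version. This plain-Python sketch applies it to the two vocabulary sizes above.

```python
def emb_sz_rule(n_cat):
    """Assumed fastai heuristic: embedding width for a category with `n_cat` levels."""
    return min(600, round(1.6 * n_cat ** 0.56))

# Vocabulary sizes from the output above: 944 users, 1665 titles
sizes = [(n, emb_sz_rule(n)) for n in (944, 1665)]
print(sizes)
```

The width grows more slowly than the number of categories and is capped at 600.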

In [None]:
class CollabNN(Module):
  def __init__(self, user_sz, item_sz, y_range=(0, 0.5), n_act=100):
    self.user_factors = Embedding(*user_sz)
    self.item_factors = Embedding(*item_sz)
    self.layers = nn.Sequential(
      nn.Linear(user_sz[1] + item_sz[1], n_act),
      nn.ReLU(),
      nn.Linear(n_act, 1)
    )
    self.y_range = y_range

  def forward(self, x):
    embs = self.user_factors(x[:,0]), self.item_factors(x[:,1])
    x = self.layers(torch.cat(embs, dim=1))
    return sigmoid_range(x, *self.y_range)

In [None]:
model = CollabNN(*embs)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3, wd=0.1)

epoch,train_loss,valid_loss,time
0,10.389621,10.483508,00:11
1,10.487264,10.4835,00:10
2,10.538147,10.4835,00:10
3,10.407659,10.4835,00:10
4,10.344148,10.4835,00:10


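> Note: The validation loss is stuck around 10.48 because `y_range=(0, 0.5)` caps every prediction below 0.5, while the ratings run from 1 to 5. The dot-product models above used `y_range=(0, 5.5)`, and that is the range to use here as well. The `collab_learner` example below has the same issue.



If we use the `collab_learner`, it will calculate our embedding sizes for us and also give us the option of defining how many more layers we want to tack on via the `layers` parameter.  All we have to do is tell it to `use_nn=True` to use a NN rather than the default dot-product model.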

In [None]:
learn = collab_learner(dls, use_nn=True, y_range=(0,0.5), layers=[100,50])

In [None]:
learn.model

EmbeddingNN(
  (embeds): ModuleList(
    (0): Embedding(944, 74)
    (1): Embedding(1665, 102)
  )
  (emb_drop): Dropout(p=0.0, inplace=False)
  (bn_cont): BatchNorm1d(0, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (layers): Sequential(
    (0): LinBnDrop(
      (0): Linear(in_features=176, out_features=100, bias=False)
      (1): ReLU(inplace=True)
      (2): BatchNorm1d(100, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (1): LinBnDrop(
      (0): Linear(in_features=100, out_features=50, bias=False)
      (1): ReLU(inplace=True)
      (2): BatchNorm1d(50, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (2): LinBnDrop(
      (0): Linear(in_features=50, out_features=1, bias=True)
    )
    (3): SigmoidRange(low=0, high=0.5)
  )
)

In [None]:
learn.fit_one_cycle(5, 5e-3, wd=0.1)

epoch,train_loss,valid_loss,time
0,10.397267,10.487314,00:12
1,10.366409,10.484529,00:12
2,10.546273,10.4835,00:12
3,10.562647,10.48363,00:12
4,10.475612,10.483509,00:12


**Why use a neural network (NN)?**

Because "we can now directly incorporate other user and movie information, date and time information, or any other information that may be relevant to the recommendation."

We'll see this when we look at `TabularModel` (of which `EmbeddingNN` is a subclass with no continuous data [`n_cont=0`] and an `out_sz=1`).

---
## `kwargs` and `@delegates`

Some helpful notes for both are included on pp.273-274.  In short ...

`**kwargs`:

1. `**kwargs` as a **parameter** = "put any additional keyword arguments into a dict called `kwargs`"
2. `**kwargs` passed as an **argument** = "insert all key/value pairs in the `kwargs` dict as named arguments here."
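Both uses of `**kwargs` fit in a few lines of plain Python. The function names `base` and `wrapper` are made up for this sketch.

```python
def base(a, b=2, c=3):
    return a + b + c

def wrapper(a, **kwargs):
    # As a parameter: any extra keyword arguments are collected into the dict `kwargs`
    print(kwargs)
    # As an argument: the dict is unpacked back into named arguments for `base`
    return base(a, **kwargs)

result = wrapper(1, b=10, c=20)
print(result)
```

The catch is that `wrapper`'s signature only shows `a` and `**kwargs`, so tab completion and docs cannot tell you that `b` and `c` are accepted.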

`@delegates`:
 
 "... fastai resolves [the issue of using `**kwargs` to avoid having to write out all the arguments of the base class] by providing a special `@delegates` decorator, which automatically **changes the signature of the class or function** ... to insert all of its keyword arguments into the signature."


---
## Resources

1. https://book.fast.ai - The book's website; it's updated regularly with new content and recommendations on everything from which GPUs to use to how to run things locally and in the cloud, etc...
