*Collaborative Filtering:* look at which products the current user has used or liked, find other users who have used or liked similar products, and then recommend other products that those users have used or liked.

*Latent factors:* there are some underlying concepts such as sci-fi, action, and movie age, and these concepts must be relevant to at least some people's movie-watching decisions.

Learn:
- Movie recommendation problem
- start by getting some data suitable for a collaborative filtering model



## A First Look at the Data
MovieLens: a dataset of movie-watching history containing tens of millions of movie ratings, each a combination of:
- a movie ID
- a user ID
- a numeric rating

In [1]:
# use a subset of 100k of them:

from fastai.collab import *
from fastai.tabular.all import *
path = untar_data(URLs.ML_100k)

The main table is in the file u.data. It is tab-separated and the columns are:
- user
- movie
- rating
- timestamp

In [2]:
# indicate names when reading the file with Pandas
# open and take a look:

ratings = pd.read_csv(path/'u.data', delimiter='\t', header=None,
                      names=['user', 'movie', 'rating', 'timestamp'])

ratings.head()

,user,movie,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


If we cross-tabulate users against movies, the empty cells (where a user has not reviewed the movie yet) are what we want the model to learn to fill in: for each user, it should figure out which of those movies they might be most likely to enjoy.
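As a rough illustration of that crosstab, here is a minimal sketch that builds a small user-by-movie grid and lists its empty cells. The ratings and the names toy_ratings, n_toy_users, and n_toy_movies are hypothetical; nothing here is taken from MovieLens.

```python
import numpy as np

# Hypothetical toy ratings as (user, movie, rating) triples; not MovieLens data.
toy_ratings = [(0, 0, 5), (0, 2, 3), (1, 1, 4), (2, 0, 1), (2, 2, 2)]
n_toy_users, n_toy_movies = 3, 3

# Start with every cell empty (NaN), then fill in the ratings we know.
crosstab = np.full((n_toy_users, n_toy_movies), np.nan)
for user, movie, rating in toy_ratings:
    crosstab[user, movie] = rating

# Cells that are still NaN are the (user, movie) pairs the model must predict.
missing = np.argwhere(np.isnan(crosstab))
print(crosstab)
print(missing.tolist())
```

In the real dataset the grid is far larger and mostly empty, which is why the notes later store ratings as rows rather than as a crosstab.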

Assume the factors range between -1 (indicating a weaker match) and +1 (indicating a stronger match), and that the categories are:
- [sci-fi, action, old movies]

In [3]:
# represent the movie The Last Skywalker as follows:

last_skywalker = np.array([0.98, 0.9, -0.9])

In [4]:
# represent a user who likes modern sci-fi action movies as follows:

user1 = np.array([0.9, 0.8, -0.6])

In [5]:
# calculate the match between this combination:

(user1*last_skywalker).sum()

2.1420000000000003

Multiplying the elements of two vectors together and adding up the results is called the dot product.

In [6]:
# represent the movie Casablanca as follows:

casablanca = np.array([-0.99, -0.3, 0.8])

In [7]:
# match between this combination is shown here:

(user1*casablanca).sum()

-1.611

## Learning the Latent Factors
STEP 1: Randomly initialize some parameters (a set of latent factors for each user and each movie).

STEP 2: Calculate our predictions by taking the dot product of each movie with each user.
- high product: the user likes action movies and the movie has lots of action, or the user dislikes action movies and the movie has none.
- low product: there is a mismatch.

STEP 3: Calculate our loss. We pick mean squared error (MSE), a reasonable way to represent the accuracy of a prediction.

With this in place, we can optimize our parameters (the latent factors) using stochastic gradient descent to minimize the loss. At each step, the optimizer will:
- calculate the match between each movie and each user using the dot product
- compare it to the actual rating that user gave that movie
- calculate the derivative of the loss with respect to the parameters
- step the parameters by subtracting this derivative multiplied by the learning rate (LR)
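The steps above can be sketched in plain NumPy. This is only an illustration under stated assumptions: the six ratings, the two-factor size, the learning rate, and the number of passes are all hypothetical choices, and the gradients are written out by hand rather than computed by a library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: (user index, movie index, rating) triples.
triples = np.array([(0, 0, 5.0), (0, 1, 1.0), (1, 0, 4.0),
                    (1, 2, 2.0), (2, 1, 5.0), (2, 2, 1.0)])
users = triples[:, 0].astype(int)
movies = triples[:, 1].astype(int)
targets = triples[:, 2]

n_factors, lr = 2, 0.02
# Step 1: randomly initialize the latent factors.
user_factors = rng.normal(0, 0.1, size=(3, n_factors))
movie_factors = rng.normal(0, 0.1, size=(3, n_factors))

def mse():
    preds = (user_factors[users] * movie_factors[movies]).sum(axis=1)
    return ((preds - targets) ** 2).mean()

initial_loss = mse()
for epoch in range(500):
    for u, m, r in zip(users, movies, targets):
        # Step 2: predict with a dot product and compare to the actual rating.
        err = user_factors[u] @ movie_factors[m] - r
        # Step 3: gradients of the squared error (err ** 2) for this one rating.
        grad_u = 2 * err * movie_factors[m]
        grad_m = 2 * err * user_factors[u]
        # Step against the gradient, scaled by the learning rate.
        user_factors[u] -= lr * grad_u
        movie_factors[m] -= lr * grad_m
final_loss = mse()
print(initial_loss, final_loss)
```

Updating after every single rating is the "stochastic" part; the fastai code later in these notes does the same thing on mini-batches of 64 and lets PyTorch compute the derivatives.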

## Creating the DataLoaders

In [8]:
# The table u.item contains the correspondence of IDs to titles:
movies = pd.read_csv(path/'u.item', delimiter='|', encoding='latin-1',
                     usecols=(0,1), names=('movie','title'), header=None)

movies.head()

,movie,title
0,1,Toy Story (1995)
1,2,GoldenEye (1995)
2,3,Four Rooms (1995)
3,4,Get Shorty (1995)
4,5,Copycat (1995)


In [9]:
# merge this with our ratings table to get the user ratings by title:

ratings = ratings.merge(movies)
ratings.head()

,user,movie,rating,timestamp,title
0,196,242,3,881250949,Kolya (1996)
1,63,242,3,875747190,Kolya (1996)
2,226,242,5,883888671,Kolya (1996)
3,154,242,3,879138235,Kolya (1996)
4,306,242,5,876503793,Kolya (1996)


In [10]:
# build a DataLoaders object from this table
# need to change the value of item_name to use titles instead of IDs:

dls = CollabDataLoaders.from_df(ratings, item_name='title', bs=64)
dls.show_batch()

,user,title,rating
0,640,Hard Target (1993),3
1,795,Everyone Says I Love You (1996),4
2,344,Searching for Bobby Fischer (1993),4
3,201,Escape from New York (1981),2
4,207,"People vs. Larry Flynt, The (1996)",4
5,435,"Prophecy II, The (1998)",2
6,645,Apollo 13 (1995),4
7,429,Little Lord Fauntleroy (1936),4
8,308,"Last of the Mohicans, The (1992)",4
9,406,Dial M for Murder (1954),4


In [11]:
# can't use the crosstab representation directly
# so represent our movie and user latent factor tables as simple matrices:

n_users = len(dls.classes['user'])
n_movies = len(dls.classes['title'])
n_factors = 5

user_factors = torch.randn(n_users, n_factors)
movie_factors = torch.randn(n_movies, n_factors)

To calculate the result for a particular movie and user combination:
- look up the index of the movie in the movie latent factor matrix and the index of the user in the user latent factor matrix.
- take the dot product of the two latent factor vectors.

We can represent a lookup in an index as a matrix product: we replace our indices with one-hot-encoded vectors.

In [12]:
# if we multiply the user factor matrix by a one-hot-encoded vector representing the index 3:

one_hot_3 = one_hot(3, n_users).float()
user_factors.t() @ one_hot_3

tensor([-0.2728, -2.2764, -0.1875,  1.3184, -1.8212])

In [13]:
# it gives us the same vector as the one at index 3 in the matrix:

user_factors[3]

tensor([-0.2728, -2.2764, -0.1875,  1.3184, -1.8212])

PyTorch includes a special layer that indexes into a vector using an integer, but has its derivative calculated in such a way that it is identical to what it would have been if it had done a matrix multiplication with a one-hot-encoded vector. This is called an embedding.
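To make the equivalence concrete, here is a minimal NumPy sketch (not PyTorch's implementation) that checks both directions by hand. The matrix sizes, the index, and the upstream gradient vector are arbitrary assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_factors, idx = 6, 4, 3
weights = rng.normal(size=(n_rows, n_factors))  # stand-in for a latent factor matrix

# Forward pass: indexing and one-hot matrix multiplication select the same row.
one_hot = np.zeros(n_rows)
one_hot[idx] = 1.0
by_index = weights[idx]
by_matmul = weights.T @ one_hot

# Backward pass for loss = (row * upstream).sum(), with an arbitrary upstream vector.
upstream = rng.normal(size=n_factors)
# Through the matrix product, d(loss)/d(weights) is the outer product below.
grad_matmul = np.outer(one_hot, upstream)
# Through indexing, only the looked-up row receives a gradient.
grad_index = np.zeros_like(weights)
grad_index[idx] = upstream

print(np.allclose(by_index, by_matmul), np.allclose(grad_index, grad_matmul))
```

The indexing route never materializes the one-hot vector, which is the practical reason for having a dedicated layer.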

In computer vision, it is easy to get the value of a pixel from its three RGB numbers, but we don't have the same kind of easy way to characterize a user or a movie. There are probably relations with genres.

We don't decide which numbers characterize them; instead, we let our model learn them by analyzing the existing relations between users and movies.

Embedding:
- assign to each user and each movie a random vector of a certain length
- make those vectors learnable parameters

## Collaborative Filtering from Scratch
Creating a new PyTorch module requires inheriting from __Module__ (inheritance lets us add additional behavior to an existing class).

The other thing to know when creating a new PyTorch module is that when your module is called, PyTorch will call a method in your class called __forward__, and will pass along any parameters that are included in the call.

In [14]:
# class defining dot product model:

class DotProduct(Module):
  def __init__(self, n_users, n_movies, n_factors):
    self.user_factors = Embedding(n_users, n_factors)
    self.movie_factors = Embedding(n_movies, n_factors)

  def forward(self, x):
    users = self.user_factors(x[:,0])
    movies = self.movie_factors(x[:,1])
    return (users*movies).sum(dim=1)


The input of the model is a tensor of shape batch_size x 2:
- first column (x[:,0]): user IDs
- second column (x[:,1]): movie IDs

We use the embedding layers to represent our matrices of user and movie latent factors:

In [15]:
x, y = dls.one_batch()
x.shape

torch.Size([64, 2])

In [16]:
# create a Learner to optimize our model:

model = DotProduct(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())

In [17]:
# fit our model:

learn.fit_one_cycle(5, 5e-3)

epoch,train_loss,valid_loss,time
0,1.338174,1.233714,00:10
1,1.110292,1.05685,00:09
2,0.94919,0.948856,00:09
3,0.858994,0.867887,00:09
4,0.791723,0.852385,00:09


One way to make the model a bit better is to force its predictions to be between 0 and 5, by using sigmoid_range.
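As a rough sketch of the idea, sigmoid_range can be thought of as a sigmoid rescaled to a chosen interval. The helper below, sigmoid_range_sketch, is a hypothetical stand-in written in plain Python for single numbers; it is meant to mirror the behavior, not to reproduce fastai's source.

```python
import math

def sigmoid_range_sketch(x, low, high):
    """Squash x into the open interval (low, high) with a rescaled sigmoid."""
    return 1 / (1 + math.exp(-x)) * (high - low) + low

# Compare an upper bound of 5 with an upper bound of 5.5 for a few raw scores.
for raw in (-10.0, 0.0, 3.0, 10.0):
    print(raw,
          round(sigmoid_range_sketch(raw, 0, 5), 3),
          round(sigmoid_range_sketch(raw, 0, 5.5), 3))
```

Because a sigmoid only approaches its bounds and never reaches them, an upper bound of exactly 5 would make a prediction of 5 unattainable, which is one reason to let the range extend a little past the top rating.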

In [18]:
# it's better to have the range go a little bit over 5, so we use (0, 5.5):

class DotProduct(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0,5.5)):
        self.user_factors = Embedding(n_users, n_factors)
        self.movie_factors = Embedding(n_movies, n_factors)
        self.y_range = y_range
    def forward(self, x):
        users = self.user_factors(x[:,0])
        movies = self.movie_factors(x[:,1])
        return sigmoid_range((users * movies).sum(dim=1), *self.y_range)

model = DotProduct(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3)

epoch,train_loss,valid_loss,time
0,1.014979,0.968196,00:10
1,0.867497,0.873684,00:10
2,0.67231,0.8427,00:10
3,0.480007,0.852948,00:10
4,0.363641,0.857822,00:10


One obvious missing piece is that we have only weights and no biases, so the model cannot express that:
- some users are just more positive or negative in their recommendations than others
- some movies are just plain better or worse than others

We can handle this with a single number for each user that we add to our scores, and likewise for each movie.

In [19]:
# adjust our model architecture:

class DotProductBias(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0,5.5)):
        self.user_factors = Embedding(n_users, n_factors)
        self.user_bias = Embedding(n_users, 1)
        self.movie_factors = Embedding(n_movies, n_factors)
        self.movie_bias = Embedding(n_movies, 1)
        self.y_range = y_range

    def forward(self, x):
        users = self.user_factors(x[:,0])
        movies = self.movie_factors(x[:,1])
        res = (users * movies).sum(dim=1, keepdim=True)
        res += self.user_bias(x[:,0]) + self.movie_bias(x[:,1])
        return sigmoid_range(res, *self.y_range)

In [20]:
# try training this and see how it goes:

model = DotProductBias(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3)

epoch,train_loss,valid_loss,time
0,0.951718,0.926971,00:11
1,0.844999,0.828742,00:10
2,0.621171,0.835285,00:11
3,0.42122,0.856117,00:10
4,0.308912,0.863082,00:10


This ends up being worse: the validation loss stopped improving in the middle of training and then started to get worse, which is a sign of overfitting.

Data augmentation is not an option here, so we use another regularization technique.

### Weight Decay (*L2 regularization*)
Weight decay consists of adding to the loss function the sum of all the weights squared. When we compute the gradients, this adds a contribution to them that encourages the weights to be as small as possible.

Why does this prevent overfitting? The larger the coefficients are, the sharper the canyons in the loss function, so keeping the weights small keeps the function smoother:

- letting the model learn large parameters can make it fit every data point in the training set
- but the result is an overcomplex function with very sharp changes, which leads to overfitting

Weight decay (wd) is a parameter that controls the sum of squares we add to our loss (assuming parameters is a tensor of all parameters):
```
loss_with_wd = loss + wd * (parameters ** 2).sum()
```
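In practice it would be inefficient to compute that big sum and add it to the loss. Since the derivative of p**2 with respect to p is 2*p, adding that sum to our loss is exactly the same as doing this: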
```
parameters.grad += wd * 2 * parameters
```
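The equivalence between the two forms can be checked numerically. The sketch below is an illustration only: it uses a small random parameter vector and an arbitrary wd value, and compares the analytic gradient of the penalty term with a finite-difference estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
parameters = rng.normal(size=5)
wd = 0.1

def penalty(p):
    # The extra term that weight decay adds to the loss.
    return wd * (p ** 2).sum()

# Analytic gradient of the penalty, matching the shortcut shown above.
analytic = wd * 2 * parameters

# Central finite differences as an independent check.
eps = 1e-6
numeric = np.zeros_like(parameters)
for i in range(len(parameters)):
    step = np.zeros_like(parameters)
    step[i] = eps
    numeric[i] = (penalty(parameters + step) - penalty(parameters - step)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))
```

Since wd is just a number we choose, the factor of 2 is often folded into it, so an update may simply add wd * parameters to the gradient.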

In [21]:
# use weight decay in fastai, pass wd in fit_one_cycle:

model = DotProductBias(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3, wd=0.1)

epoch,train_loss,valid_loss,time
0,0.943534,0.911988,00:11
1,0.865086,0.850096,00:10
2,0.746888,0.809093,00:10
3,0.598758,0.793388,00:11
4,0.498961,0.793942,00:11


### Creating Our Own Embedding Module
Let's re-create DotProductBias without using the Embedding class. We need a randomly initialized weight matrix for each of the embeddings.

In [23]:
# a tensor added as an attribute to a Module will not be included in its parameters:

class T(Module):
    def __init__(self): self.a = torch.ones(3)

L(T().parameters())

(#0) []

To tell Module that we want to treat a tensor as a parameter, we have to wrap it in the nn.Parameter class.

In [25]:
# It's used only as a "marker" to show what to include in parameters:

class T(Module):
    def __init__(self): self.a = nn.Parameter(torch.ones(3))

L(T().parameters())

(#1) [Parameter containing:
tensor([1., 1., 1.], requires_grad=True)]

In [27]:
# all PyTorch modules use nn.Parameter for any trainable parameters:

class T(Module):
    def __init__(self): self.a = nn.Linear(1, 3, bias=False)

t = T()
L(t.parameters())

(#1) [Parameter containing:
tensor([[0.4440],
        [0.3791],
        [0.2014]], requires_grad=True)]

In [28]:
type(t.a.weight)

torch.nn.parameter.Parameter

In [29]:
# create a tensor as a parameter, with random initialization:

def create_params(size):
    return nn.Parameter(torch.zeros(*size).normal_(0, 0.01))

In [30]:
# use this to create DotProductBias, without Embedding:

class DotProductBias(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0,5.5)):
        self.user_factors = create_params([n_users, n_factors])
        self.user_bias = create_params([n_users])
        self.movie_factors = create_params([n_movies, n_factors])
        self.movie_bias = create_params([n_movies])
        self.y_range = y_range

    def forward(self, x):
        users = self.user_factors[x[:,0]]
        movies = self.movie_factors[x[:,1]]
        res = (users*movies).sum(dim=1)
        res += self.user_bias[x[:,0]] + self.movie_bias[x[:,1]]
        return sigmoid_range(res, *self.y_range)

In [31]:
# train it again to check we get around the same results as before:

model = DotProductBias(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3, wd=0.1)

epoch,train_loss,valid_loss,time
0,0.942291,0.912609,00:11
1,0.886895,0.851541,00:11
2,0.717332,0.80579,00:11
3,0.593627,0.792649,00:11
4,0.483319,0.792937,00:11


## Interpreting Embeddings and Biases
Our model can provide us with movie recommendations for our users. Of its parameters, the easiest to interpret are the biases.

In [34]:
# the movies with the lowest values in the bias vector:

movie_bias = learn.model.movie_bias.squeeze()
idxs = movie_bias.argsort()[:5]
[dls.classes['title'][i] for i in idxs]

['Children of the Corn: The Gathering (1996)',
 'Mortal Kombat: Annihilation (1997)',
 'Lawnmower Man 2: Beyond Cyberspace (1996)',
 'Barb Wire (1996)',
 'Robocop 3 (1993)']

What this says is that for each of these movies, even when a user is very well matched to its latent factors, they still generally don't like it.

In [37]:
# by the same token, here are the movies with the highest bias:

idxs = movie_bias.argsort(descending=True)[:5]
[dls.classes['title'][i] for i in idxs]

['Titanic (1997)',
 'Shawshank Redemption, The (1994)',
 'L.A. Confidential (1997)',
 "Schindler's List (1993)",
 'Silence of the Lambs, The (1991)']

It's not easy to interpret the embedding matrices directly. There are too many factors for a human to look at. But there is a technique that can pull out the most important underlying directions in such a matrix, called principal component analysis (PCA).
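As a minimal sketch of what PCA does to an embedding matrix, the block below reduces 50 factors to 3 using NumPy's SVD. The matrix is random stand-in data rather than learned embeddings, and the choice of 3 components and 20 rows is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for a learned embedding matrix: 20 hypothetical movies, 50 factors each.
embeddings = rng.normal(size=(20, 50))

# PCA via SVD: center the columns, then project onto the top right-singular vectors.
centered = embeddings - embeddings.mean(axis=0)
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
components = vt[:3]                  # the three most important directions
projected = centered @ components.T  # each movie described by 3 numbers instead of 50

print(projected.shape)
```

With real embeddings, plotting the first two columns of projected against movie titles is a common way to see what the leading directions capture.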

### Using fastai.collab

In [38]:
# create and train a collaborative filtering model:

learn = collab_learner(dls, n_factors=50, y_range=(0, 5.5))
learn.fit_one_cycle(5, 5e-3, wd=0.1)

epoch,train_loss,valid_loss,time
0,0.951376,0.910982,00:10
1,0.869922,0.84478,00:10
2,0.734434,0.806652,00:10
3,0.597746,0.794313,00:10
4,0.476012,0.795036,00:11


In [39]:
# the names of the layers can be seen by printing the model:

learn.model

EmbeddingDotBias(
  (u_weight): Embedding(944, 50)
  (i_weight): Embedding(1665, 50)
  (u_bias): Embedding(944, 1)
  (i_bias): Embedding(1665, 1)
)

In [42]:
# we can use these to replicate any of the analyses we did before:

movie_bias = learn.model.i_bias.weight.squeeze()
idxs = movie_bias.argsort(descending=True)[:5]
[dls.classes['title'][i] for i in idxs]

['Titanic (1997)',
 'Shawshank Redemption, The (1994)',
 "Schindler's List (1993)",
 'L.A. Confidential (1997)',
 'Star Wars (1977)']

### Embedding Distance
Movie similarity can be defined by the similarity of users who like those movies.
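One common way to compare two embedding vectors is cosine similarity, where a higher value means a closer match. Here is a minimal NumPy sketch on hypothetical three-factor vectors for four made-up titles; the names titles, factors, and most_similar are illustrative and not part of fastai.

```python
import numpy as np

# Hypothetical 3-factor embeddings for four made-up movies.
titles = ["A", "B", "C", "D"]
factors = np.array([
    [0.9, 0.8, -0.6],
    [0.8, 0.9, -0.5],
    [-0.9, -0.3, 0.8],
    [0.1, -0.7, 0.2],
])

def most_similar(query_idx):
    query = factors[query_idx]
    # Cosine similarity: dot product divided by the product of the vector lengths.
    sims = factors @ query / (np.linalg.norm(factors, axis=1) * np.linalg.norm(query))
    # The best match is the query itself, so take the runner-up.
    order = np.argsort(-sims)
    return titles[order[1]]

print(most_similar(0))
```

Sorting in descending order and skipping the first entry is the same trick the fastai cell uses to avoid returning the query movie itself.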

In [44]:
# use this to find the most similar movie to Silence of the Lambs:

movie_factors = learn.model.i_weight.weight
idx = dls.classes['title'].o2i['Silence of the Lambs, The (1991)']
distances = nn.CosineSimilarity(dim=1)(movie_factors, movie_factors[idx][None])
idx = distances.argsort(descending=True)[1]
dls.classes['title'][idx]

"One Flew Over the Cuckoo's Nest (1975)"

## Bootstrapping a Collaborative Filtering Model
The biggest challenge with using collaborative filtering models in practice is the bootstrapping problem. The most extreme version of this problem is having no users, and therefore no history to learn from: what product do you recommend to your very first user?

Our dot product model works quite well, and it is the basis of many successful real-world recommendation systems. This approach to collaborative filtering is known as probabilistic matrix factorization (PMF).

## Deep Learning for Collaborative Filtering
To turn our architecture into a deep learning model, the first step is to take the results of the embedding lookup and concatenate those activations together. This gives a matrix that can be passed through linear layers and nonlinearities in the usual way.

Since we are concatenating the embedding matrices rather than taking their dot product, the two embedding matrices can have different sizes (that is, different numbers of latent factors).

fastai has a function, get_emb_sz, that returns recommended sizes for embedding matrices for your data.

In [45]:
# based on a heuristic that fast.ai has found tends to work well in practice:

embs = get_emb_sz(dls)
embs

[(944, 74), (1665, 102)]
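For intuition about where these numbers come from, the sketch below assumes the heuristic is a sublinear power of the number of categories, capped at 600. The function name emb_sz_rule_sketch is hypothetical, and the formula is stated here as an assumption about the library's rule rather than a guarantee; it does reproduce the two sizes printed above.

```python
def emb_sz_rule_sketch(n_cat):
    """Assumed heuristic: embedding width grows sublinearly with cardinality, capped at 600."""
    return min(600, round(1.6 * n_cat ** 0.56))

# Apply the assumed rule to the user and title vocabulary sizes shown above.
for n_cat in (944, 1665):
    print((n_cat, emb_sz_rule_sketch(n_cat)))
```

The sublinear growth means a variable with ten times as many categories gets an embedding only a few times wider.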

In [46]:
# implement this class:

class CollabNN(Module):
    def __init__(self, user_sz, item_sz, y_range=(0, 5.5), n_act=100):
        self.user_factors = Embedding(*user_sz)
        self.item_factors = Embedding(*item_sz)
        self.layers = nn.Sequential(
            nn.Linear(user_sz[1]+item_sz[1], n_act),
            nn.ReLU(),
            nn.Linear(n_act, 1)
        )
        self.y_range = y_range

    def forward(self, x):
        embs = self.user_factors(x[:,0]), self.item_factors(x[:,1])
        x = self.layers(torch.cat(embs, dim=1))
        return sigmoid_range(x, *self.y_range)

In [48]:
# and use it to create a model:

model = CollabNN(*embs)

- CollabNN creates our Embedding layers as before, except that we now use the embs sizes. self.layers is the mini neural net: a linear layer, a ReLU, and a final linear layer with a single output.
- In forward,
    - we apply embeddings
    - concatenate the results
    - pass this through the mini-neural net
    - apply sigmoid_range
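The forward pass described above can be traced step by step in NumPy. This is a sketch under stated assumptions: the vocabulary sizes, embedding widths, hidden size, and random weights are all hypothetical, and nothing is trained.

```python
import numpy as np

rng = np.random.default_rng(0)
n_toy_users, n_toy_items = 10, 12      # hypothetical vocabulary sizes
user_dim, item_dim, n_act = 4, 6, 8    # the two embedding widths may differ

user_emb = rng.normal(0, 0.01, size=(n_toy_users, user_dim))
item_emb = rng.normal(0, 0.01, size=(n_toy_items, item_dim))
w1 = rng.normal(0, 0.1, size=(user_dim + item_dim, n_act))
b1 = np.zeros(n_act)
w2 = rng.normal(0, 0.1, size=(n_act, 1))
b2 = np.zeros(1)

def forward(x, y_range=(0, 5.5)):
    # x has shape (batch, 2): column 0 holds user indices, column 1 item indices.
    embs = np.concatenate([user_emb[x[:, 0]], item_emb[x[:, 1]]], axis=1)
    hidden = np.maximum(embs @ w1 + b1, 0)   # linear layer followed by ReLU
    out = hidden @ w2 + b2                   # final linear layer, one output per row
    low, high = y_range
    return 1 / (1 + np.exp(-out)) * (high - low) + low  # rescaled sigmoid

batch = np.array([[0, 3], [5, 11], [9, 0]])
preds = forward(batch)
print(preds.shape)
```

Note that the first linear layer's input width is the sum of the two embedding widths, which is what allows them to differ.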

In [49]:
# see if it trains:
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3, wd=0.01)

epoch,train_loss,valid_loss,time
0,0.949596,0.923606,00:11
1,0.9191,0.874231,00:11
2,0.84716,0.850019,00:11
3,0.837006,0.8437,00:11
4,0.789033,0.843859,00:11


fastai provides this model in fastai.collab if you pass use_nn=True to collab_learner, and it lets you easily create more layers.

In [50]:
# creating two hidden layers, of size 100 and 50:

learn = collab_learner(dls, use_nn=True, y_range=(0, 5.5), layers=[100, 50])
learn.fit_one_cycle(5, 5e-3, wd=0.1)

epoch,train_loss,valid_loss,time
0,0.97678,0.93548,00:13
1,0.920364,0.884013,00:13
2,0.884336,0.859847,00:13
3,0.834246,0.831388,00:13
4,0.7984,0.831519,00:13


In [51]:
# learn.model is an object of type EmbeddingNN:

@delegates(TabularModel)
class EmbeddingNN(TabularModel):
    def __init__(self, emb_szs, layers, **kwargs):
        super().__init__(emb_szs, layers=layers, n_cont=0, out_sz=1, **kwargs)