# Collaborative Filtering with Neural Networks

In this notebook we will write a matrix factorization model in pytorch to solve a recommendation problem. Then we will write a more general neural model for the same problem.

Collaborative filtering: systems recommend items based on similarity measures between users and/or items. The items recommended to a user are those preferred by similar users. 

The MovieLens dataset (ml-latest-small) describes 5-star rating and free-text tagging activity from MovieLens, a movie recommendation service. It contains 100004 ratings and 1296 tag applications across 9125 movies. https://grouplens.org/datasets/movielens/. To get the data:

`wget http://files.grouplens.org/datasets/movielens/ml-latest-small.zip`

In [1]:
!wget http://files.grouplens.org/datasets/movielens/ml-latest-small.zip

--2018-10-18 23:15:05--  http://files.grouplens.org/datasets/movielens/ml-latest-small.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.34.235
Connecting to files.grouplens.org (files.grouplens.org)|128.101.34.235|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 978202 (955K) [application/zip]
Saving to: ‘ml-latest-small.zip’


2018-10-18 23:15:06 (1.94 MB/s) - ‘ml-latest-small.zip’ saved [978202/978202]



In [3]:
!ls
import zipfile as zip

lesson1-intro.ipynb	      ml-latest-small.zip
lesson2-recommendation.ipynb  README.md


In [11]:
path="/home/jyoti/Deep-Learning-with-Pytorch/"
zip_ref = zip.ZipFile(path+"ml-latest-small.zip", 'r')
zip_ref.extractall(path)
zip_ref.close()

In [12]:
!ls

lesson1-intro.ipynb	      ml-latest-small	   README.md
lesson2-recommendation.ipynb  ml-latest-small.zip


We can see that the zip file has been extracted correctly in the same folder

## MovieLens dataset

In [13]:
from pathlib import Path
import pandas as pd
import numpy as np

In [14]:
PATH = Path("/home/jyoti/Deep-Learning-with-Pytorch/ml-latest-small/")
list(PATH.iterdir())

[PosixPath('/home/jyoti/Deep-Learning-with-Pytorch/ml-latest-small/movies.csv'),
 PosixPath('/home/jyoti/Deep-Learning-with-Pytorch/ml-latest-small/links.csv'),
 PosixPath('/home/jyoti/Deep-Learning-with-Pytorch/ml-latest-small/ratings.csv'),
 PosixPath('/home/jyoti/Deep-Learning-with-Pytorch/ml-latest-small/tags.csv'),
 PosixPath('/home/jyoti/Deep-Learning-with-Pytorch/ml-latest-small/README.txt')]

In [15]:
! head /home/jyoti/Deep-Learning-with-Pytorch/ml-latest-small/ratings.csv

userId,movieId,rating,timestamp
1,1,4.0,964982703
1,3,4.0,964981247
1,6,4.0,964982224
1,47,5.0,964983815
1,50,5.0,964982931
1,70,3.0,964982400
1,101,5.0,964980868
1,110,4.0,964982176
1,151,5.0,964984041


In [16]:
# reading a csv into pandas
data = pd.read_csv(PATH/"ratings.csv")

In [17]:
data.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


### Encoding data
We enconde the data to have contiguous ids for users and movies. You can think about this as a categorical encoding of our two categorical variables userId and movieId.

In [18]:
# split train and validation before encoding
np.random.seed(3)
msk = np.random.rand(len(data)) < 0.8
train = data[msk].copy()
val = data[~msk].copy()

In [19]:
# here is a handy function modified from fast.ai
def proc_col(col, train_col=None):
    """Encodes a pandas column with continous ids. 
    """
    if train_col is not None:
        uniq = train_col.unique()
    else:
        uniq = col.unique()
    name2idx = {o:i for i,o in enumerate(uniq)}
    return name2idx, np.array([name2idx.get(x, -1) for x in col]), len(uniq) #-1 if name is not present

In [28]:
def encode_data(df, train=None):
    """ Encodes rating data with continous user and movie ids. 
    If train is provided, encodes df with the same encoding as train.
    """
    df = df.copy()
    for col_name in ["userId", "movieId"]:
        train_col = None
        if train is not None:
            train_col = train[col_name]
        _,col,_ = proc_col(df[col_name], train_col)
        df[col_name] = col
        print("Not found",len(df[df[col_name] <= 0]))
        df = df[df[col_name] >= 0] #Keep only ids which have name2idx
    return df

In [30]:
df_train = encode_data(train)
df_val = encode_data(val, train)
print(df_train.head())
print(df_val.head())

Not found 193
Not found 178
Not found 39
Not found 832
   userId  movieId  rating  timestamp
0       0        0     4.0  964982703
1       0        1     4.0  964981247
2       0        2     4.0  964982224
3       0        3     5.0  964983815
6       0        4     5.0  964980868
    userId  movieId  rating  timestamp
4        0      388     5.0  964982931
5        0      995     3.0  964982400
29       0      841     4.0  964981179
30       0      567     4.0  964982653
32       0      402     4.0  964982546


80450

## Embedding layer

In [31]:
import torch
import torch.nn as nn
import torch.nn.functional as F

In [33]:
# an Embedding module containing 10 users or items embedding size 3
# embedding will be initialized at random
embed = nn.Embedding(10, 3)
embed.weight

Parameter containing:
tensor([[-0.5182, -0.2305,  1.1904],
        [ 2.0218,  0.0332,  0.3650],
        [ 0.1089,  1.0361, -0.3134],
        [-1.5838,  0.5323, -0.7510],
        [-0.5565,  0.2681, -0.5809],
        [-0.3779,  1.5249, -0.3305],
        [ 1.5884, -0.3437,  0.9804],
        [ 0.0469,  0.5491,  0.5290],
        [ 0.3627, -0.6631, -1.3172],
        [-0.2401, -1.0836,  0.7456]], requires_grad=True)

In [34]:
# given a list of ids we can "look up" the embedding corresponing to each id
# can you see that some vectors are the same?
a = torch.LongTensor([[1,0,1,4,5,1]])
embed(a)

tensor([[[ 2.0218,  0.0332,  0.3650],
         [-0.5182, -0.2305,  1.1904],
         [ 2.0218,  0.0332,  0.3650],
         [-0.5565,  0.2681, -0.5809],
         [-0.3779,  1.5249, -0.3305],
         [ 2.0218,  0.0332,  0.3650]]], grad_fn=<EmbeddingBackward>)

## Matrix factorization model

In [35]:
class MF(nn.Module):
    def __init__(self, num_users, num_items, emb_size=100):
        super(MF, self).__init__()
        self.user_emb = nn.Embedding(num_users, emb_size)
        self.item_emb = nn.Embedding(num_items, emb_size)
        # initlializing weights
        self.user_emb.weight.data.uniform_(0,0.05) #Why we need initialization
        self.item_emb.weight.data.uniform_(0,0.05)
        
    def forward(self, u, v):
        u = self.user_emb(u)
        v = self.item_emb(v)
        return (u*v).sum(1)   #Element wise multiplication

## Debugging MF model

In [38]:
num_users = 7
num_items = 4
emb_size = 3

user_emb = nn.Embedding(num_users, emb_size)
item_emb = nn.Embedding(num_items, emb_size)
users = torch.LongTensor(df_t_e.userId.values)
items = torch.LongTensor(df_t_e.movieId.values)

In [39]:
U = user_emb(users)
V = item_emb(items)

NameError: name 'users' is not defined

In [20]:
U

tensor([[ 0.2887, -0.1039, -0.6517],
        [ 0.2887, -0.1039, -0.6517],
        [-0.7562,  0.7185, -2.2700],
        [-0.7562,  0.7185, -2.2700],
        [ 1.6527, -0.2885,  0.0281],
        [ 1.6527, -0.2885,  0.0281],
        [-1.0987, -1.5382,  0.3912],
        [-1.0987, -1.5382,  0.3912],
        [-2.2866, -0.6564, -0.5094],
        [-2.2866, -0.6564, -0.5094],
        [ 0.1742, -1.2741,  0.6683],
        [-0.1845, -1.2902, -0.1542],
        [-0.1845, -1.2902, -0.1542]])

In [21]:
# element wise multiplication
U*V 

tensor([[-0.4151,  0.0906,  0.1340],
        [-0.2531,  0.0403, -0.3552],
        [ 0.6629, -0.2785, -1.2372],
        [ 0.1777, -0.8371, -1.0000],
        [-2.3761,  0.2516, -0.0058],
        [-1.4488,  0.1118,  0.0153],
        [ 1.5797,  1.3414, -0.0805],
        [ 0.7645, -0.0590, -0.1178],
        [ 3.2875,  0.5724,  0.1048],
        [ 1.5910, -0.0252,  0.1533],
        [-0.1212, -0.0489, -0.2012],
        [ 0.1617,  0.5000, -0.0840],
        [ 0.1283, -0.0495,  0.0464]])

In [22]:
# what we want is a dot product per row
(U*V).sum(1) 

tensor([-0.1905, -0.5680, -0.8528, -1.6593, -2.1303, -1.3217,  2.8406,
         0.5877,  3.9646,  1.7191, -0.3713,  0.5777,  0.1252])

## Training MF model

In [41]:
num_users = len(df_train.userId.unique())
num_items = len(df_train.movieId.unique())
print(num_users, num_items) 

610 8998


In [42]:
model = MF(num_users, num_items, emb_size=100)  # if you have a GPU .cuda()

In [43]:
# here we are not using data loaders because our data fits well in memory
def train_epocs(model, epochs=10, lr=0.01, wd=0.0, unsqueeze=False):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=wd)
    model.train()
    for i in range(epochs):
        users = torch.LongTensor(df_train.userId.values)  #.cuda()
        items = torch.LongTensor(df_train.movieId.values) #.cuda()
        ratings = torch.FloatTensor(df_train.rating.values)  #.cuda()
        if unsqueeze:
            ratings = ratings.unsqueeze(1)
        y_hat = model(users, items)
        loss = F.mse_loss(y_hat, ratings)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        print(loss.item()) # used to be loss.data[0]
    test_loss(model, unsqueeze)

In [48]:
# Here is what unsqueeze does
ratings = torch.FloatTensor(df_train.rating.values)
print(ratings.shape)
ratings = ratings.unsqueeze(1) #.cuda()
ratings.shape

torch.Size([80450])


torch.Size([80450, 1])

In [51]:
def test_loss(model, unsqueeze=False):
    model.eval() #Explicitly letting the model know that it is a evaluation and not training
    users = torch.LongTensor(df_val.userId.values) # .cuda()
    items = torch.LongTensor(df_val.movieId.values) #.cuda()
    ratings = torch.FloatTensor(df_val.rating.values) #.cuda()
    if unsqueeze:
        ratings = ratings.unsqueeze(1)
    y_hat = model(users, items)
    loss = F.mse_loss(y_hat, ratings)
    print("test loss %.3f " % loss.item())

In [52]:
train_epocs(model, epochs=10, lr=0.1)

1.6454849243164062
5.6897969245910645
4.107996463775635
1.0472257137298584
2.813699960708618
2.506274700164795
0.7590300440788269
1.2165346145629883
2.0785794258117676
1.9998751878738403
test loss 1.438 


In [53]:
train_epocs(model, epochs=15, lr=0.01)

1.2020612955093384
0.8295372724533081
0.6758217215538025
0.677545964717865
0.7039048075675964
0.6925514340400696
0.6531720161437988
0.6152136325836182
0.5976650714874268
0.6002582907676697
0.6099865436553955
0.613412618637085
0.6045016050338745
0.5854303240776062
0.5633542537689209
test loss 0.787 


In [54]:
train_epocs(model, epochs=15, lr=0.01)

0.5459926128387451
0.5513851046562195
0.5175593495368958
0.5068127512931824
0.5004147887229919
0.4849693179130554
0.4688773453235626
0.455656498670578
0.44224584102630615
0.4272359609603882
0.4124053120613098
0.3983566462993622
0.3839205205440521
0.36875271797180176
0.353739857673645
test loss 0.790 


## MF with bias

In [55]:
class MF_bias(nn.Module):
    def __init__(self, num_users, num_items, emb_size=100):
        super(MF_bias, self).__init__()
        self.user_emb = nn.Embedding(num_users, emb_size)
        self.user_bias = nn.Embedding(num_users, 1)
        self.item_emb = nn.Embedding(num_items, emb_size)
        self.item_bias = nn.Embedding(num_items, 1)
        # init 
        self.user_emb.weight.data.uniform_(0,0.05)
        self.item_emb.weight.data.uniform_(0,0.05)
        self.user_bias.weight.data.uniform_(-0.01,0.01)
        self.item_bias.weight.data.uniform_(-0.01,0.01)
        
    def forward(self, u, v):
        U = self.user_emb(u)
        V = self.item_emb(v)
        b_u = self.user_bias(u).squeeze()
        b_v = self.item_bias(v).squeeze()
        return (U*V).sum(1) +  b_u  + b_v

In [56]:
model = MF_bias(num_users, num_items, emb_size=100) #.cuda()

In [57]:
train_epocs(model, epochs=10, lr=0.1, wd=1e-5)

12.909826278686523
4.196595668792725
3.5819344520568848
2.319197654724121
0.7545202970504761
1.8002912998199463
2.476569890975952
2.0964980125427246
1.2626861333847046
0.9100744128227234
test loss 1.468 


In [58]:
train_epocs(model, epochs=10, lr=0.01, wd=1e-5)

1.2451554536819458
0.8298717737197876
0.6669487953186035
0.6649450659751892
0.7212553024291992
0.7655099630355835
0.7730416059494019
0.7484903335571289
0.707660436630249
0.6671319007873535
test loss 0.778 


In [59]:
train_epocs(model, epochs=10, lr=0.001, wd=1e-5)

0.6388686299324036
0.6308144330978394
0.6240432858467102
0.6184990406036377
0.6140540242195129
0.6105394959449768
0.6077687740325928
0.605559766292572
0.6037524342536926
0.6022184491157532
test loss 0.758 


In [60]:
train_epocs(model, epochs=10, lr=0.001, wd=1e-5)

0.6008635759353638
0.5988619923591614
0.5973002314567566
0.5958811044692993
0.5944750308990479
0.5930507779121399
0.5916163325309753
0.5901870727539062
0.5887723565101624
0.5873723030090332
test loss 0.757 


Note that these models are susceptible to weight initialization, optimization algorithm and regularization.

## Neural Network Model

In [61]:
# Note here there is no matrix multiplication, we could potentially make the embeddings 
# for users and items of different sizes.
# Here we could get better results by keep playing with regularization.
    
class CollabFNet(nn.Module):
    def __init__(self, num_users, num_items, emb_size=100, n_hidden=10):
        super(CollabFNet, self).__init__()
        self.user_emb = nn.Embedding(num_users, emb_size)
        self.item_emb = nn.Embedding(num_items, emb_size)
        self.lin1 = nn.Linear(emb_size*2, n_hidden) # we take 2 because later we are going to concat user and item and feed it as input
        self.lin2 = nn.Linear(n_hidden, 1)
        self.drop1 = nn.Dropout(0.1)
        self.drop2 = nn.Dropout(0.0)
        
    def forward(self, u, v):
        U = self.user_emb(u)
        V = self.item_emb(v)
        x = torch.cat([U, V], dim=1) #Concating the features and feeding to network
        x = self.drop1(x)
        x = F.relu(self.lin1(x))
        x = self.drop2(x)
        x = self.lin2(x)
        return x

In [62]:
model = CollabFNet(num_users, num_items, emb_size=100) #.cuda()

In [63]:
train_epocs(model, epochs=20, lr=0.1, wd=1e-6, unsqueeze=True) 

12.771885871887207
5.834911823272705
10.26247501373291
1.8494447469711304
2.24432373046875
2.5424840450286865
2.2457563877105713
2.140498161315918
1.7329623699188232
1.2860621213912964
1.1354800462722778
1.0819933414459229
1.0054795742034912
0.9785750508308411
0.9333520531654358
0.8680918216705322
0.8539161682128906
0.7914043664932251
0.785370945930481
0.7429141998291016
test loss 0.959 


In [64]:
train_epocs(model, epochs=20, lr=0.01, wd=1e-6, unsqueeze=True)

0.7443450093269348
0.9130212068557739
0.7124481201171875
0.7190241813659668
0.7762979865074158
0.7371277809143066
0.6773689985275269
0.6637667417526245
0.692821741104126
0.6934863328933716
0.6671110391616821
0.6384292244911194
0.6416347026824951
0.6559163928031921
0.6541212797164917
0.6409287452697754
0.6235097646713257
0.62278813123703
0.6308428049087524
0.6326353549957275
test loss 0.873 


In [65]:
train_epocs(model, epochs=10, lr=0.001, wd=1e-6, unsqueeze=True)

0.6213779449462891
0.6156904697418213
0.6094982624053955
0.6145181655883789
0.6146385669708252
0.6145569086074829
0.6110187768936157
0.612385094165802
0.6096471548080444
0.6068136692047119
test loss 0.854 


In [66]:
train_epocs(model, epochs=20, lr=0.001, wd=1e-6, unsqueeze=True)

0.6100354790687561
0.6111634373664856
0.6062862277030945
0.606310248374939
0.606798529624939
0.6073122024536133
0.6038792133331299
0.6040692925453186
0.603810727596283
0.6053792238235474
0.6031851172447205
0.6048499345779419
0.6034085154533386
0.601294219493866
0.6003644466400146
0.6020768880844116
0.6012876033782959
0.5980045795440674
0.5987824201583862
0.5970166325569153
test loss 0.858 


## TODO
* use t-sne to visualize embeddings

# Lab
* Can you use `tags.csv` and `timestamp` to improve your predictions?
* Play with the hyperparameters
* Look at fastai version of this network and try his transformation https://github.com/fastai/fastai/blob/master/courses/dl1/lesson5-movielens.ipynb
* You may need a dataloader if you data is larger. Can you construct a dataset? Here is an example:
https://stanford.edu/~shervine/blog/pytorch-how-to-generate-data-parallel.html
* Work with the largest dataset http://files.grouplens.org/datasets/movielens/ml-latest.zip

# References
* This notebook is based on [lesson 5 of Jeremy Howard's Deep Learning Course](https://github.com/fastai/fastai/blob/master/courses/dl1/lesson5-movielens.ipynb)