This notebook explores movies, users, and their ratings, and uses collaborative filtering to generate movie suggestions.

In [1]:
import numpy as np
import pandas as pd

In [2]:
movies= pd.read_csv('archive/movies.csv')
ratings= pd.read_csv('archive/ratings.csv')

In [3]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [4]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [5]:
ratings.userId.unique()# 610 unique userId

array([  1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,  13,
        14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,
        27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,  39,
        40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,  52,
        53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,  65,
        66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,  78,
        79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,  91,
        92,  93,  94,  95,  96,  97,  98,  99, 100, 101, 102, 103, 104,
       105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117,
       118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130,
       131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143,
       144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156,
       157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169,
       170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 18

In [6]:
movies.movieId.unique()

array([     1,      2,      3, ..., 193585, 193587, 193609])

In [7]:
movies.shape,ratings.shape

((9742, 3), (100836, 4))

In [8]:
global_avg= ratings.rating.mean()

In [9]:
ratings.rating= ratings.rating - global_avg

In [10]:
user_avg= ratings.groupby('userId')['rating'].agg('mean')

In [11]:
movie_avg= ratings.groupby('movieId')['rating'].agg('mean')

The idea comes from the Netflix Prize competition. Matrix factorization learns an embedding for every user and every movie from the known ratings. The dot product of a user embedding and a movie embedding then predicts the rating for a movie the user has not seen, and the movies with the highest predicted ratings are recommended. Subtracting the global, per-user, and per-movie averages first removes much of the noise, so the embeddings only have to model what remains. This works better than factorizing the raw ratings and makes the optimization better suited to our needs.
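
To make the baseline-plus-factorization idea concrete, here is a minimal sketch on a made-up toy table. The data, the `toy` frame, and the `predict` helper are illustrative assumptions, not part of the notebook. It follows the same order as the cells here: subtract the global mean, take per-user and per-movie means of the centered ratings, and leave the residual for the embeddings to model. A prediction adds the pieces back.

```python
import numpy as np
import pandas as pd

# Toy ratings, invented for illustration only.
toy = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3],
    "movieId": [10, 20, 10, 30, 20],
    "rating":  [4.0, 5.0, 3.0, 2.0, 4.0],
})

global_avg = toy.rating.mean()
centered = toy.rating - global_avg
user_avg = centered.groupby(toy.userId).mean()    # per-user mean of centered ratings
movie_avg = centered.groupby(toy.movieId).mean()  # per-movie mean of centered ratings

# Residual that the embeddings are trained to reproduce.
residual = centered - toy.userId.map(user_avg) - toy.movieId.map(movie_avg)

def predict(user_id, movie_id, user_vec, movie_vec):
    """Add the baseline terms back onto the embedding dot product."""
    return (global_avg + user_avg.loc[user_id] + movie_avg.loc[movie_id]
            + float(np.dot(user_vec, movie_vec)))

# With zero embeddings the prediction is just the baseline.
print(predict(1, 10, np.zeros(2), np.zeros(2)))
```

The residual has mean zero by construction, which matches the check made later in the notebook.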

In [12]:
user_avg.index

Index([  1,   2,   3,   4,   5,   6,   7,   8,   9,  10,
       ...
       601, 602, 603, 604, 605, 606, 607, 608, 609, 610],
      dtype='int64', name='userId', length=610)

In [13]:
# now let's create the user and movie embedding matrices
hidden_dim= 2
Um= np.random.randn(len(user_avg),hidden_dim)
Mm= np.random.randn(len(movie_avg),hidden_dim)

In [14]:
# subtract each rating's user average and movie average so that only the residual is left for the embeddings to model
ratings.rating= ratings.rating - ratings.userId.map(user_avg) - ratings.movieId.map(movie_avg)

In [15]:
ratings.rating

0        -0.785753
1        -0.124438
2        -0.810901
3         0.159808
4        -0.102567
            ...   
100831    0.479668
100832    0.670144
100833    1.179668
100834    0.533001
100835    0.479668
Name: rating, Length: 100836, dtype: float64

In [16]:
import torch

In [17]:
Um= torch.tensor(Um,requires_grad=True)
Mm= torch.tensor(Mm,requires_grad=True)

In [18]:
ratings.rating.mean()# mean is 0

np.float64(-2.057583391268488e-16)

In [19]:
# what we have done above can be done with just embeddings in PyTorch; an embedding is the same thing, a lookup table of vectors that can be fine-tuned for the task.
#for i,u in enumerate(Um):
    #for j,m in enumerate(Mm):
        #user_avg.index[i] 
        #movie_avg.index[j] 

In [20]:
# let's create a dataset with userId and movieId as the lookup keys and the rating as the value
dataset= ratings[['userId','movieId','rating']]

In [21]:
# using nn.Embedding is better: the lookup by index is then taken care of by PyTorch itself.
import torch.nn as nn
import torch.nn.functional as F

In [22]:
Um,Mm=None,None

In [23]:
class RecSys(nn.Module):
    def __init__(self,emb_dim=2):
        super().__init__()
        self.users= nn.Embedding(max(ratings.userId)+1,emb_dim)
        self.movies= nn.Embedding(max(ratings.movieId)+1,emb_dim)
    def forward(self,uid,mid):
        #print(uid,mid)
        return (self.users(uid)*self.movies(mid)).sum(dim=-1)
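
One thing the class above leaves implicit: sizing the tables with `max(id) + 1` allocates a row for every integer up to the largest ID, including IDs that never occur (the movie IDs shown earlier run up to 193609, while `movies` has 9742 rows). A common alternative, sketched here as an assumption rather than as what this notebook does, is to map IDs to contiguous indices first with `pd.factorize`. The toy frame and the `user_idx`/`movie_idx` column names are made up.

```python
import pandas as pd

# Toy IDs with large gaps, invented for illustration only.
toy = pd.DataFrame({
    "userId":  [1, 1, 7, 7, 610],
    "movieId": [1, 193609, 1, 50, 193609],
})

# factorize returns (codes, uniques); codes run 0..n-1 in order of first appearance.
toy["user_idx"], user_ids = pd.factorize(toy["userId"])
toy["movie_idx"], movie_ids = pd.factorize(toy["movieId"])

# Table sizes that nn.Embedding would need with contiguous indices.
n_users, n_movies = len(user_ids), len(movie_ids)
print(n_users, n_movies)

# Map an index back to the original movieId.
print(movie_ids[toy.movie_idx[1]])
```

With this mapping the model would be built from `n_users` and `n_movies` and fed the `*_idx` columns; `user_ids` and `movie_ids` translate results back to the original IDs.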

In [24]:
loss= nn.MSELoss()

In [25]:
bs= 300

In [26]:
from torch.utils.data import DataLoader

In [27]:
dataloader= DataLoader(dataset.values,batch_size=bs)

In [28]:
model= RecSys()
optimizer= torch.optim.Adam(model.parameters(),lr=3e-4)

In [30]:
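for epoch in range(100):
    for batch in dataloader:
        # columns of each batch are (userId, movieId, rating)
        x,y= batch[:,:-1],batch[:,-1]
        optimizer.zero_grad()
        outputs= model(x[:,0].int(),x[:,1].int())
        L= loss(outputs,y.float())
        L.backward()
        optimizer.step()
    # L is the loss of the last batch in the epoch, not the epoch average
    print(f'Epoch: {epoch} Loss: {L.item():.3f}')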

Epoch: 0 Loss: 4.242
Epoch: 1 Loss: 4.156
Epoch: 2 Loss: 4.072
Epoch: 3 Loss: 3.990
Epoch: 4 Loss: 3.910
Epoch: 5 Loss: 3.831
Epoch: 6 Loss: 3.753
Epoch: 7 Loss: 3.677
Epoch: 8 Loss: 3.602
Epoch: 9 Loss: 3.528
Epoch: 10 Loss: 3.456
Epoch: 11 Loss: 3.384
Epoch: 12 Loss: 3.314
Epoch: 13 Loss: 3.245
Epoch: 14 Loss: 3.178
Epoch: 15 Loss: 3.111
Epoch: 16 Loss: 3.045
Epoch: 17 Loss: 2.981
Epoch: 18 Loss: 2.918
Epoch: 19 Loss: 2.855
Epoch: 20 Loss: 2.794
Epoch: 21 Loss: 2.734
Epoch: 22 Loss: 2.675
Epoch: 23 Loss: 2.617
Epoch: 24 Loss: 2.560
Epoch: 25 Loss: 2.504
Epoch: 26 Loss: 2.448
Epoch: 27 Loss: 2.394
Epoch: 28 Loss: 2.341
Epoch: 29 Loss: 2.289
Epoch: 30 Loss: 2.238
Epoch: 31 Loss: 2.187
Epoch: 32 Loss: 2.138
Epoch: 33 Loss: 2.089
Epoch: 34 Loss: 2.042
Epoch: 35 Loss: 1.995
Epoch: 36 Loss: 1.949
Epoch: 37 Loss: 1.904
Epoch: 38 Loss: 1.860
Epoch: 39 Loss: 1.816
Epoch: 40 Loss: 1.774
Epoch: 41 Loss: 1.732
Epoch: 42 Loss: 1.691
Epoch: 43 Loss: 1.651
Epoch: 44 Loss: 1.612
Epoch: 45 Loss: 1.57
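
The loop above prints the loss of the last batch only and reads the data in file order. A possible variation, sketched here on synthetic data, shuffles the batches and reports the mean loss over the whole epoch. Everything in it (the toy sizes, `DotModel`, `history`, and the learning rate of 0.05) is an assumption chosen for a fast demonstration, not a setting taken from this notebook.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic data: 20 users, 15 movies, 200 ratings with low-rank structure.
n_users, n_movies, dim = 20, 15, 2
true_u, true_m = torch.randn(n_users, dim), torch.randn(n_movies, dim)
uid = torch.randint(0, n_users, (200,))
mid = torch.randint(0, n_movies, (200,))
target = (true_u[uid] * true_m[mid]).sum(dim=-1)

class DotModel(nn.Module):
    """Same dot-product structure as RecSys, sized for the toy data."""
    def __init__(self):
        super().__init__()
        self.users = nn.Embedding(n_users, dim)
        self.movies = nn.Embedding(n_movies, dim)

    def forward(self, u, m):
        return (self.users(u) * self.movies(m)).sum(dim=-1)

model = DotModel()
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)
loss_fn = nn.MSELoss()
loader = DataLoader(TensorDataset(uid, mid, target), batch_size=50, shuffle=True)

history = []  # mean loss per epoch
for epoch in range(30):
    total, count = 0.0, 0
    for u, m, y in loader:
        optimizer.zero_grad()
        batch_loss = loss_fn(model(u, m), y)
        batch_loss.backward()
        optimizer.step()
        # Weight each batch by its size so the epoch mean is exact.
        total += batch_loss.item() * len(y)
        count += len(y)
    history.append(total / count)

print(f"first epoch: {history[0]:.3f}, last epoch: {history[-1]:.3f}")
```

An epoch average is less sensitive to which ratings happen to fall in the final batch, which makes the printed curve easier to compare between runs.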

After training, we have embeddings that we can use for many things: for example, we can plot them and expect similar movies to lie close to each other, or find the best recommendations for a user.
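
Both uses can be sketched with plain NumPy once the embedding tables are available as arrays. In the sketch below the tables are random stand-ins; with the model from this notebook they could presumably be taken from `model.users.weight.detach().numpy()` and `model.movies.weight.detach().numpy()`. The helper names `recommend` and `most_similar` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins for trained embedding tables (5 users, 8 movies, 2 dims).
user_emb = rng.normal(size=(5, 2))
movie_emb = rng.normal(size=(8, 2))

def recommend(user_row, seen_rows, k=3):
    """Return the k unseen movie rows with the highest predicted score."""
    scores = movie_emb @ user_emb[user_row]
    scores[list(seen_rows)] = -np.inf  # never recommend already-rated movies
    return np.argsort(scores)[::-1][:k]

def most_similar(movie_row, k=3):
    """Return the k movie rows closest to movie_row by cosine similarity."""
    unit = movie_emb / np.linalg.norm(movie_emb, axis=1, keepdims=True)
    sims = unit @ unit[movie_row]
    sims[movie_row] = -np.inf  # exclude the movie itself
    return np.argsort(sims)[::-1][:k]

print(recommend(0, seen_rows={1, 4}))
print(most_similar(2))
```

Because the model here is trained on residuals, the dot product alone ranks movies by how much a user deviates from the baseline; adding each movie's average back to `scores` before sorting would rank by the full predicted rating instead.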