# Predicting movie ratings 

The aim of this notebook is to show how neural networks can be used for building the backbone of recommendation systems. 

The dataset used here is the Netflix dataset that just has the Movie Id, User Id, Date and Rating. 
Dataset: https://www.kaggle.com/netflix-inc/netflix-prize-data/data

To build a recommendation system, the most important part is to predict which movies a given user will like, and recommend those that he may give the highest rating to.

For this, I have built 3 models in PyTorch:
1. Matrix Factorization to create Movie and User embedded respresentations.
2. Matrix Factorization with Bias - Similar to above, but adds bias vectors for Movie and User to capture the inherent nature of both (i.e. inherent ratings that user gives to any movie, overall rating that a movie gets).
3. Neural Network Model - Using embeddings for Movie and User, and date-time features, I have built a simple neural network using only Linear layers.

Note - You do not need a GPU to run this, if the size of the data can be handled by your CPU. Or you can just run it for one file as a demo (instead of all 4).

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import Dataset, DataLoader
import numpy as np
import pandas as pd
from torch.autograd import Variable as V

import matplotlib.pyplot as plt
%matplotlib inline

## Data Preprocessing
https://www.kaggle.com/laowingkin/netflix-movie-recommendation

In [2]:
filenames = [
    'combined_data_1.txt', 'combined_data_2.txt', 'combined_data_3.txt',
    'combined_data_4.txt'
]

In [3]:
f = filenames[0]

In [4]:
f

'combined_data_1.txt'

In [5]:
df = pd.read_csv(f, names=['Cust_Id', 'Rating', 'Date'], parse_dates=True)
df['Rating'] = df['Rating'].astype(float)

In [6]:
print('Dataset 1 shape: {}'.format(df.shape))
print('-Dataset examples-')
print(df.iloc[::5000000, :])

Dataset 1 shape: (24058263, 3)
-Dataset examples-
          Cust_Id  Rating        Date
0              1:     NaN         NaN
5000000   2560324     4.0  2005-12-06
10000000  2271935     2.0  2005-04-11
15000000  1921803     2.0  2005-01-31
20000000  1933327     3.0  2004-11-10


Combine all the 4 files

In [7]:
for f in filenames[1:]:
    df1 = pd.read_csv(f, names=['Cust_Id', 'Rating', 'Date'], parse_dates=True)
    df1['Rating'] = df1['Rating'].astype(float)
    df.append(df1)

In [8]:
df.index = np.arange(0, len(df))
print('Full dataset shape: {}'.format(df.shape))
print('-Dataset examples-')
print(df.iloc[::5000000, :])

Full dataset shape: (24058263, 3)
-Dataset examples-
          Cust_Id  Rating        Date
0              1:     NaN         NaN
5000000   2560324     4.0  2005-12-06
10000000  2271935     2.0  2005-04-11
15000000  1921803     2.0  2005-01-31
20000000  1933327     3.0  2004-11-10


In [9]:
del df1

In [10]:
df_nan = pd.DataFrame(pd.isnull(df.Rating))
df_nan = df_nan[df_nan['Rating'] == True]
df_nan = df_nan.reset_index()
df_nan.head()

Unnamed: 0,index,Rating
0,0,True
1,548,True
2,694,True
3,2707,True
4,2850,True


Create an array that contains the movie id. For this, number of rows between successive NaN's are taken into account.

In [11]:
movie_np = []
movie_id = 1

for i, j in zip(df_nan['index'][1:], df_nan['index'][:-1]):
    temp = np.full((1, i - j - 1), movie_id)
    movie_np = np.append(movie_np, temp)
    movie_id += 1

# Account for last record and corresponding length
last_record = np.full((1, len(df) - df_nan.iloc[-1, 0] - 1), movie_id)
movie_np = np.append(movie_np, last_record)

print('Movie numpy: {}'.format(movie_np))
print('Length: {}'.format(len(movie_np)))

Movie numpy: [  1.00000000e+00   1.00000000e+00   1.00000000e+00 ...,   4.49900000e+03
   4.49900000e+03   4.49900000e+03]
Length: 24053764


In [12]:
df = df[pd.notnull(df['Rating'])]

df['Movie_Id'] = movie_np.astype(int)
df['Cust_Id'] = df['Cust_Id'].astype(int)
print('-Dataset examples-')
print(df.iloc[::5000000, :])

-Dataset examples-
          Cust_Id  Rating        Date  Movie_Id
1         1488844     3.0  2005-09-06         1
5000996    501954     2.0  2004-08-26       996
10001962   404654     5.0  2005-08-29      1962
15002876   886608     2.0  2005-09-19      2876
20003825  1193835     2.0  2003-08-13      3825


In [13]:
df.to_pickle('temp_df_full')

In [2]:
df = pd.read_pickle('temp_df_full')
df.head()

Unnamed: 0,Cust_Id,Rating,Date,Movie_Id
1,1488844,3.0,2005-09-06,1
2,822109,5.0,2005-05-13,1
3,885013,4.0,2005-10-19,1
4,30878,4.0,2005-12-26,1
5,823519,3.0,2004-05-03,1


Filter out the movies and users that do not have too many reviews associated with them.

In [14]:
f = ['count', 'mean']

df_movie_summary = df.groupby('Movie_Id')['Rating'].agg(f)
df_movie_summary.index = df_movie_summary.index.map(int)
movie_benchmark = round(df_movie_summary['count'].quantile(0.8), 0)
drop_movie_list = df_movie_summary[
    df_movie_summary['count'] < movie_benchmark].index

print('Movie minimum times of review: {}'.format(movie_benchmark))

df_cust_summary = df.groupby('Cust_Id')['Rating'].agg(f)
df_cust_summary.index = df_cust_summary.index.map(int)
cust_benchmark = round(df_cust_summary['count'].quantile(0.8), 0)
drop_cust_list = df_cust_summary[
    df_cust_summary['count'] < cust_benchmark].index

print('Customer minimum times of review: {}'.format(cust_benchmark))

Movie minimum times of review: 3884.0
Customer minimum times of review: 79.0


In [15]:
print('Original Shape: {}'.format(df.shape))
df = df[~df['Movie_Id'].isin(drop_movie_list)]
df = df[~df['Cust_Id'].isin(drop_cust_list)]
print('After Trim Shape: {}'.format(df.shape))
print('-Data Examples-')
print(df.iloc[::5000000, :])

Original Shape: (24053764, 4)
After Trim Shape: (13528427, 4)
-Data Examples-
          Cust_Id  Rating        Date  Movie_Id
5109       785314     1.0  2005-07-13         8
8889698    332300     3.0  2004-07-13      1770
17751978   629874     4.0  2005-11-22      3391


### Processing the data for training the model
1. Sort the data by date
2. Create date-time features
3. Encode the data to help with embedding lookup
4. Split the data into train and validation. The latest 20% of the data is used as the validation set

In [16]:
df.sort_values(by=['Date'], axis=0, inplace=True)

In [17]:
df.Date = pd.to_datetime(df.Date)

In [18]:
df['year'] = df.Date.dt.year
df['dow'] = df.Date.dt.dayofweek
df['month'] = df.Date.dt.month
df.head()

Unnamed: 0,Cust_Id,Rating,Date,Movie_Id,year,dow,month
20397788,510180,2.0,1999-11-11,3870,1999,3,11
19589582,510180,4.0,1999-11-11,3730,1999,3,11
14895543,510180,3.0,1999-11-11,2866,1999,3,11
9057969,510180,5.0,1999-11-11,1798,1999,3,11
6902840,510180,5.0,1999-11-11,1367,1999,3,11


In [19]:
# here is a handy function modified from fast.ai
def proc_col(col, train_col=None):
    """Encodes a pandas column with continous ids. 
    """
    if train_col is not None:
        uniq = train_col.unique()
    else:
        uniq = col.unique()
    name2idx = {o: i for i, o in enumerate(uniq)}
    return name2idx, np.array([name2idx.get(x, -1) for x in col]), len(uniq)

In [20]:
def encode_data(df, train=None):
    """ Encodes rating data with continous user and movie ids. 
    If train is provided, encodes df with the same encoding as train.
    """
    df = df.copy()
    for col_name in ['Cust_Id', 'Movie_Id', 'year', 'dow', 'month']:
        train_col = None
        if train is not None:
            train_col = train[col_name]
        _, col, _ = proc_col(df[col_name], train_col)
        df[col_name] = col
        df = df[df[col_name] >= 0]
    return df

In [21]:
df.shape

(13528427, 7)

In [22]:
df_encode = encode_data(df)

print(df_encode.shape)

(13528427, 7)


In [23]:
k = round(len(df_encode) * 0.8)
train = df_encode[:k]
val = df_encode[k:]

In [24]:
train.shape, val.shape

((10822742, 7), (2705685, 7))

In [25]:
val.head()

Unnamed: 0,Cust_Id,Rating,Date,Movie_Id,year,dow,month
8425486,77058,4.0,2005-07-02,127,6,4,8
978848,26979,4.0,2005-07-02,771,6,4,8
139518,75880,3.0,2005-07-02,758,6,4,8
956431,55845,5.0,2005-07-02,832,6,4,8
20756448,85825,1.0,2005-07-02,155,6,4,8


In [26]:
val.describe()

Unnamed: 0,Cust_Id,Rating,Movie_Id,year,dow,month
count,2705685.0,2705685.0,2705685.0,2705685.0,2705685.0,2705685.0
mean,68783.58,3.649158,483.3145,6.0,2.931223,7.528857
std,26718.84,1.042806,279.8225,0.0,2.066913,3.863186
min,0.0,1.0,0.0,6.0,0.0,0.0
25%,51831.0,3.0,247.0,6.0,1.0,8.0
50%,81167.0,4.0,494.0,6.0,3.0,9.0
75%,89793.0,4.0,749.0,6.0,5.0,10.0
max,95324.0,5.0,899.0,6.0,6.0,11.0


In [27]:
train.describe()

Unnamed: 0,Cust_Id,Rating,Movie_Id,year,dow,month
count,10822740.0,10822740.0,10822740.0,10822740.0,10822740.0,10822740.0
mean,40517.22,3.563321,421.4489,4.952607,2.893269,5.161487
std,24514.81,1.056134,255.5251,1.065177,2.061699,3.179469
min,0.0,1.0,0.0,0.0,0.0,0.0
25%,18995.0,3.0,203.0,5.0,1.0,2.0
50%,39884.0,4.0,421.0,5.0,3.0,5.0
75%,61138.0,4.0,653.0,6.0,5.0,7.0
max,86767.0,5.0,898.0,6.0,6.0,11.0


In [28]:
# train = df_encode_train[['Rating','Cust_Id', 'Movie_Id', 'Date']]
# val = df_encode_val[['Rating','Cust_Id', 'Movie_Id', 'Date']]

**Now that the data is processed, we can start building the components of out model. There are 3 major components:**
1. DataSet - It passes the data to the DataLoader and in turn to the model in a form that the model can consume.
2. Data Loader - Splits the data (as processed in the DataSet) into mini batches and passes it into the model on the fly. (I haven't defined a custom one here.)
3. The model itself - This is where we initialse weights, define what layers to use, and the sequence.
4. Training loop - This is where we actually train for multiple epochs and also generate test predictions.

## Dataset

In [29]:
class CustomDataset(Dataset):
    def __init__(self, df):
        self.u = torch.LongTensor(df.Cust_Id.values)
        self.v = torch.LongTensor(df.Movie_Id.values)
        self.y = torch.LongTensor(df.Rating.values)

    def __len__(self):
        self.len = len(self.u)
        return self.len

    def __getitem__(self, index):
        return self.u[index], self.v[index], self.y[index]

## Baseline Model - MF without Bias

In [30]:
num_users = len(df_encode.Cust_Id.unique())
num_movies = len(df_encode.Movie_Id.unique())
emb_size = 50
num_users, num_movies

(95325, 900)

In [31]:
class MF(nn.Module):
    def __init__(self, num_user, num_movie, emb_size):
        super(MF, self).__init__()
        self.user_emb = nn.Embedding(num_user, emb_size)
        self.movie_emb = nn.Embedding(num_movie, emb_size)
        self.user_emb.weight.data.uniform_(0, 0.05)
        self.movie_emb.weight.data.uniform_(0, 0.05)

    def forward(self, u, v):
        u = self.user_emb(u)
        v = self.movie_emb(v)
        return F.sigmoid((u * v).sum(1)) * 4 + 1

In [32]:
train_ds = CustomDataset(train[['Cust_Id', 'Rating', 'Movie_Id']])
train_dl = DataLoader(train_ds, batch_size=100000, shuffle=True)

# val_ds = CustomDataset(val)
# val_dl = DataLoader(val_ds, batch_size = 100000, shuffle=False)
# u,v,y = next(iter(train_dl))

In [33]:
model = MF(num_users, num_movies, emb_size).cuda()
model

MF(
  (user_emb): Embedding(95325, 50)
  (movie_emb): Embedding(900, 50)
)

In [34]:
def test_loss(model, val):
    model.eval()
    preds = []
    user = V(torch.LongTensor(val.Cust_Id.values)).cuda()
    movie = V(torch.LongTensor(val.Movie_Id.values)).cuda()
    rating = V(torch.LongTensor(val.Rating.values)).float().cuda()
    y_hat = model(user, movie)
    loss = F.mse_loss(y_hat, rating)
    #     print("Validation loss %.3f " % loss.data[0])
    #     return y_hat
    return loss.data[0]

In [35]:
def train_loop(model, train_dl, val, epochs, learning_rate, wd=0):
    parameters = filter(lambda p: p.requires_grad, model.parameters())
    optimiser = torch.optim.Adam(parameters, learning_rate, weight_decay=wd)
    for i in range(epochs):
        model.train()
        for j, d in enumerate(train_dl):
            mb_loss = []
            user = V(d[0]).cuda()
            movie = V(d[1]).cuda()
            rating = V(d[2]).float().cuda()
            y_hat = model(user, movie)
            loss = F.mse_loss(y_hat, rating)
            optimiser.zero_grad()
            loss.backward()
            mb_loss.append(loss.data[0])
            optimiser.step()
        print(f'Training loss for epoch {i} = {np.mean(mb_loss)}')
        print(f'Validation loss for epoch {i} = {test_loss(model, val)}')


#     return test_loss(model, val)

In [36]:
train_loop(
    model,
    train_dl,
    val[['Cust_Id', 'Rating', 'Movie_Id']],
    3,
    0.05,
    wd=0.00001)

Training loss for epoch 0 = 0.8476807475090027
Validation loss for epoch 0 = 1.2101482152938843
Training loss for epoch 1 = 0.8126561045646667
Validation loss for epoch 1 = 1.1889935731887817
Training loss for epoch 2 = 0.7813032269477844
Validation loss for epoch 2 = 1.1767539978027344


This approach gives us an MSE Loss od 1.176 on the validation set. I am using this as my baseline model and training 2 more models to improve the performance.

## MF with Bias

In [37]:
class MFBias(nn.Module):
    def __init__(self, num_user, num_movie, emb_size):
        super(MFBias, self).__init__()
        self.user_emb = nn.Embedding(num_user, emb_size)
        self.movie_emb = nn.Embedding(num_movie, emb_size)
        self.user_bias = nn.Embedding(num_user, 1)
        self.movie_bias = nn.Embedding(num_movie, 1)

        self.user_emb.weight.data.uniform_(0, 0.05)
        self.movie_emb.weight.data.uniform_(0, 0.05)
        self.user_bias.weight.data.uniform_(-0.01, 0.01)
        self.movie_bias.weight.data.uniform_(-0.01, 0.01)

    def forward(self, u, v):
        U = self.user_emb(u)
        V = self.movie_emb(v)

        b_u = self.user_bias(u).squeeze()
        b_v = self.movie_bias(v).squeeze()

        return F.sigmoid((U * V).sum(1) + b_u + b_v) * 4 + 1

In [38]:
bias_model = MFBias(num_users, num_movies, emb_size).cuda()

In [39]:
train_loop(bias_model, train_dl, val, 3, 0.05, wd=0.00001)

Training loss for epoch 0 = 0.7856623530387878
Validation loss for epoch 0 = 0.9371397495269775
Training loss for epoch 1 = 0.7730052471160889
Validation loss for epoch 1 = 0.9305683970451355
Training loss for epoch 2 = 0.777308464050293
Validation loss for epoch 2 = 0.9317906498908997


Thus adding user and movie bias made the model better by reducing MSE Loss to 0.93 from 1.176.

## Neural Net Approach

For this approach, I have created embeddings for user, movie, year, month and day of the week.

In [40]:
class CollabNN(nn.Module):
    def __init__(self,
                 num_users,
                 num_movies,
                 emb_size,
                 num_years,
                 num_months,
                 num_dow,
                 layer_size=100):
        super(CollabNN, self).__init__()
        self.user_emb = nn.Embedding(num_users, emb_size)
        self.movie_emb = nn.Embedding(num_movies, emb_size)
        self.user_emb.weight.data.uniform_(0, 0.05)
        self.movie_emb.weight.data.uniform_(0, 0.05)

        self.year_emb = nn.Embedding(num_years, 10)
        self.month_emb = nn.Embedding(num_months, 10)
        self.dow_emb = nn.Embedding(num_dow, 10)

        self.year_emb.weight.data.uniform_(0, 0.05)
        self.month_emb.weight.data.uniform_(0, 0.05)
        self.dow_emb.weight.data.uniform_(0, 0.05)

        self.lin1 = nn.Linear(emb_size * 2 + 30, layer_size)
        self.lin2 = nn.Linear(layer_size, 10)
        self.lin3 = nn.Linear(10, 1)
        self.drop1 = nn.Dropout(0.2)
        self.drop2 = nn.Dropout(0.2)

    def forward(self, u, v, year, month, dow):
        u = self.user_emb(u)
        v = self.movie_emb(v)
        year = self.year_emb(year)
        dow = self.dow_emb(dow)
        month = self.month_emb(month)

        x = self.drop1(torch.cat([u, v, year, month, dow], dim=1))
        x = F.relu(self.lin1(x))
        x = self.drop2(F.relu(self.lin2(x)))
        x = self.lin3(x)
        return F.sigmoid(x) * 4 + 1

In [41]:
class CustomDataset2(Dataset):
    def __init__(self, df):
        self.u = torch.LongTensor(df.Cust_Id.values)
        self.v = torch.LongTensor(df.Movie_Id.values)
        self.year = torch.LongTensor(df.year.values)
        self.month = torch.LongTensor(df.month.values)
        self.dow = torch.LongTensor(df.dow.values)
        self.y = torch.LongTensor(df.Rating.values)

    def __len__(self):
        self.len = len(self.u)
        return self.len

    def __getitem__(self, index):
        return self.u[index], self.v[index], self.year[index], self.month[
            index], self.dow[index], self.y[index]

In [42]:
num_years = len(df_encode.year.unique())
num_months = len(df_encode.month.unique())
num_dow = len(df_encode.dow.unique())
num_dow, num_months, num_years

(7, 12, 7)

In [43]:
train_ds = CustomDataset2(train)
train_dl = DataLoader(train_ds, batch_size=100000, shuffle=True)

In [44]:
def test_loss(model, val):
    model.eval()
    preds = []
    user = V(torch.LongTensor(val.Cust_Id.values)).cuda()
    movie = V(torch.LongTensor(val.Movie_Id.values)).cuda()
    year = V(torch.LongTensor(val.year.values)).cuda()
    month = V(torch.LongTensor(val.month.values)).cuda()
    dow = V(torch.LongTensor(val.dow.values)).cuda()
    rating = V(torch.LongTensor(val.Rating.values)).float().cuda()

    y_hat = model(user, movie, year, month, dow)
    loss = F.mse_loss(y_hat, rating)
    #     print("Validation loss %.3f " % loss.data[0])
    #     return y_hat
    return loss.data[0]

In [45]:
def train_loop(model, train_dl, val, epochs, learning_rate, wd=0):
    parameters = filter(lambda p: p.requires_grad, model.parameters())
    optimiser = torch.optim.Adam(parameters, learning_rate, weight_decay=wd)
    for i in range(epochs):
        model.train()
        for j, d in enumerate(train_dl):
            mb_loss = []
            user = V(d[0]).cuda()
            movie = V(d[1]).cuda()
            year = V(d[2]).cuda()
            month = V(d[3]).cuda()
            dow = V(d[4]).cuda()
            rating = V(d[5]).float().cuda()
            y_hat = model(user, movie, year, month, dow)
            loss = F.mse_loss(y_hat, rating)
            optimiser.zero_grad()
            loss.backward()
            mb_loss.append(loss.data[0])
            optimiser.step()
        print(f'Training loss for epoch {i} = {np.mean(mb_loss)}')
        print(f'Validation loss for epoch {i} = {test_loss(model, val)}')


#     return test_loss(model, val)

In [46]:
nn_model = CollabNN(num_users, num_movies, 10, num_years, num_months, num_dow,
                    200).cuda()

In [47]:
train_loop(nn_model, train_dl, val, 3, 0.05, 0)

Training loss for epoch 0 = 0.8176421523094177
Validation loss for epoch 0 = 0.9230867624282837
Training loss for epoch 1 = 0.7879918813705444
Validation loss for epoch 1 = 0.9167137742042542
Training loss for epoch 2 = 0.7823916077613831
Validation loss for epoch 2 = 0.9127696752548218


Thus we see that this model gives lower MSE Loss (0.91) than the previous 2 (1.176 and 0.93).