# COLLABORATIVE FILTERING FOR THE MOVIELENS DATA

We are going to build two models:
1. A simple dot product model that illustrates the construction and use of latent factors. We will optimise it with gradient descent.
2. A simple neural network, fitted to the same data and compared with the academic state of the art for this problem.

I am using the smaller dataset of 100,000 ratings to make things faster.


-----------------------------------------------------------------------------------------------------------

###### Imports 

I would normally keep these in a utility file with all my helper functions, but here they are inline so you can see everything.

In [1]:
from __future__ import division,print_function
import math, os, json, sys, re
import  pickle
from importlib import reload
from glob import glob
import numpy as np
from matplotlib import pyplot as plt
from operator import itemgetter, attrgetter, methodcaller
from collections import OrderedDict
import itertools
from itertools import chain

import pandas as pd
import PIL
from PIL import Image
from numpy.random import random, permutation, randn, normal, uniform, choice
from numpy import newaxis
import scipy
from scipy import misc, ndimage
from scipy.ndimage.interpolation import zoom
from scipy.ndimage import imread
from sklearn.metrics import confusion_matrix
import bcolz
from sklearn.preprocessing import OneHotEncoder
from sklearn.manifold import TSNE

In [2]:
%matplotlib inline
import keras
from keras import backend as K
from keras.utils.data_utils import get_file
from keras.utils import np_utils
from keras.utils.np_utils import to_categorical
from keras.models import Sequential, Model
from keras.layers import Input, Embedding, Reshape, merge, LSTM, Bidirectional
from keras.layers import TimeDistributed, Activation, SimpleRNN, GRU
from keras.layers.core import Flatten, Dense, Dropout, Lambda
from keras.regularizers import l2, activity_l2, l1, activity_l1
from keras.layers.normalization import BatchNormalization
from keras.optimizers import SGD, RMSprop, Adam
from keras.utils.layer_utils import layer_from_config
from keras.metrics import categorical_crossentropy, categorical_accuracy
from keras.layers.convolutional import *
from keras.preprocessing import image, sequence
from keras.preprocessing.text import Tokenizer

Using TensorFlow backend.


### IMPORTING & DATA SETUP 

We are using the Movielens data 
- Small: 100,000 ratings and 3,600 tag applications applied to 9,000 movies by 600 users
- Large: 20 million ratings and 465,000 tag applications applied to 27,000 movies by 138,000 users. 

We will first build our model on the smaller dataset and look at the results we get.
We also define a folder where we can store the models we train.
Finally, we define a hyperparameter, batch_size, whose value depends on what my GPU can handle.

In [20]:
#path = "data/ml-20m/"   # the large dataset
path = "./data/ml-latest-small/"
model_path = path + 'models/'
os.makedirs(model_path, exist_ok=True)  # create the folder for saved weights if missing
batch_size = 64  # sized to fit my GPU

Let us import the data using the pandas *read_csv* method and take a look at the first five entries.

In [4]:
ratings = pd.read_csv(path+'ratings.csv')
ratings.head()

| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 1 | 4.0 | 964982703 |
| 1 | 1 | 3 | 4.0 | 964981247 |
| 2 | 1 | 6 | 4.0 | 964982224 |
| 3 | 1 | 47 | 5.0 | 964983815 |
| 4 | 1 | 50 | 5.0 | 964982931 |


In [5]:
len(ratings) # get the total number of ratings

100836

We also load the other file, which holds the movie titles, so we can use it to display some results later.
- We use the movieId as the key and the title as the value of the dict

In [6]:
movie_names = pd.read_csv(path+'movies.csv').set_index('movieId')['title'].to_dict()

In [7]:
movie_names[1]

'Toy Story (1995)'

Let us now get all the users and movies in the ratings list. Since each row is a single user rating a single movie, we use the unique method to get the arrays of distinct users and movies.


In [8]:
users = ratings.userId.unique()
movies = ratings.movieId.unique()

Let us now build the user id to index and movie id to index mappings so we can look them up.

In [9]:
userid2idx = {o:i for i,o in enumerate(users)}
movieid2idx = {o:i for i,o in enumerate(movies)}

Let us now replace the movieId and userId columns in the ratings dataframe with the corresponding indices from movieid2idx and userid2idx.

In [10]:
ratings.movieId = ratings.movieId.apply(lambda x: movieid2idx[x])
ratings.userId = ratings.userId.apply(lambda x: userid2idx[x])
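
The remapping above can be checked on a toy column. This is a minimal sketch with made-up ids (`raw_ids` is hypothetical, not taken from the dataset). It shows how the lookup turns arbitrary ids into contiguous indices starting at 0, which is what an embedding table expects.

```python
# Toy raw ids, in the order they appear in the data (hypothetical values).
raw_ids = [10, 47, 10, 3, 47]

# Same construction as userid2idx / movieid2idx: first-seen order -> 0, 1, 2, ...
seen = list(dict.fromkeys(raw_ids))          # unique ids, order preserved
id2idx = {o: i for i, o in enumerate(seen)}

remapped = [id2idx[x] for x in raw_ids]
print(id2idx)    # {10: 0, 47: 1, 3: 2}
print(remapped)  # [0, 1, 0, 2, 1]
```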

Let us now take a look at the index ranges we have in the ratings table.

In [11]:
user_min, user_max, movie_min, movie_max = (ratings.userId.min(), 
    ratings.userId.max(), ratings.movieId.min(), ratings.movieId.max())
user_min, user_max, movie_min, movie_max

(0, 609, 0, 9723)

Next, the number of unique users and unique movies.

In [12]:
n_users = ratings.userId.nunique()
n_movies = ratings.movieId.nunique()
n_users, n_movies

(610, 9724)
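
A quick bit of arithmetic on the numbers above shows how sparse the ratings matrix is. The sketch below only reuses the counts already printed in this notebook; the variable names `n_cells` and `density` are mine.

```python
n_ratings = 100836                    # len(ratings) from above
n_users, n_movies = 610, 9724

n_cells = n_users * n_movies          # every possible (user, movie) pair
density = n_ratings / n_cells

print(n_cells)                        # 5931640
print(round(100 * density, 2))        # 1.7 (percent of pairs that have a rating)
```

Fewer than 2% of the user-movie pairs are observed, and the latent factors are what let the model say something about the rest.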

Since we are using collaborative filtering, we assume there are latent factors describing both the kinds of users and the kinds of movies, and we have to decide how many latent factors to use for each.
I picked 50.

In [13]:
n_factors = 50

Setting the seed for repeatability

In [14]:
np.random.seed(42)  # call the function; assigning to it would overwrite it and seed nothing

We now split the data into a training set and a validation set; I have chosen an 80-20 split.
Because rows are sampled at random, most users and most movies end up with some ratings in each set.

In [15]:
msk = np.random.rand(len(ratings)) < 0.8
trn = ratings[msk]
val = ratings[~msk]
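
As a minimal sketch of how such a mask can be made repeatable, the snippet below draws it from a private generator. `make_mask` is a hypothetical helper and `np.random.RandomState` is one option among several; the notebook itself uses the global NumPy generator.

```python
import numpy as np

n_rows = 100836                      # number of ratings, as above

def make_mask(seed, n, frac=0.8):
    # A private generator keeps the split independent of other random calls.
    rng = np.random.RandomState(seed)
    return rng.rand(n) < frac

msk_a = make_mask(42, n_rows)
msk_b = make_mask(42, n_rows)

print((msk_a == msk_b).all())        # True: same seed, same split
print(round(msk_a.mean(), 2))        # fraction of rows in training, close to 0.8
```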

### DOT PRODUCT MODEL

Let us first create a dot product model, where we take the dot product of the latent factors of a user and a movie and try to minimize the error against the actual rating.

- We randomly initialize the latent factors and use gradient descent to learn them
- We use the "Embedding" layer of Keras, which conveniently looks up the latent factors for each user and movie and saves us from implementing the one-hot encoding matrix multiplication, which could be slow.
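
To see why the lookup can stand in for the one-hot multiplication, here is a small NumPy sketch. The sizes and the matrix `W` are toy values chosen for illustration, and this is plain NumPy rather than the Keras layer itself.

```python
import numpy as np

n_users, n_factors = 5, 3                       # toy sizes, not the notebook's
rng = np.random.RandomState(0)
W = rng.randn(n_users, n_factors)               # one row of latent factors per user

user_idx = 2
one_hot = np.zeros(n_users)
one_hot[user_idx] = 1.0

via_matmul = one_hot @ W                        # one-hot vector times weight matrix
via_lookup = W[user_idx]                        # a plain row lookup

print(np.allclose(via_matmul, via_lookup))      # True
```

The row lookup gives the same vector without ever building or multiplying the mostly-zero one-hot matrix.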

We create an Input layer for the users and another for the movies.<br>
Each feeds an Embedding layer with L2 weight regularization.

In [16]:
user_in = Input(shape=(1,), dtype='int64', name='user_in')
u = Embedding(n_users, n_factors, input_length=1, W_regularizer=l2(1e-4))(user_in)

movie_in = Input(shape=(1,), dtype='int64', name='movie_in')
m = Embedding(n_movies, n_factors, input_length=1, W_regularizer=l2(1e-4))(movie_in)

Instructions for updating:
keep_dims is deprecated, use keepdims instead


We take the dot product of the two embedding layers and define the model.<br>
We compile it with a learning rate of 0.001 and train it for just 1 epoch first,<br>
using the mean squared error loss function.
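
The same idea can be sketched without Keras. The snippet below is an illustration under my own assumptions: toy sizes, made-up ratings, and plain full-batch gradient descent instead of Adam. It predicts each rating as a dot product of latent factors and lowers the mean squared error step by step.

```python
import numpy as np

rng = np.random.RandomState(0)
n_users, n_movies, n_factors = 4, 5, 3          # toy sizes
U = 0.1 * rng.randn(n_users, n_factors)         # user latent factors
M = 0.1 * rng.randn(n_movies, n_factors)        # movie latent factors

# A few (user index, movie index, rating) triples, made up for illustration.
users = np.array([0, 1, 2, 3, 0])
movies = np.array([1, 0, 4, 2, 3])
ratings = np.array([4.0, 3.0, 5.0, 2.0, 4.5])

def mse(U, M):
    preds = (U[users] * M[movies]).sum(axis=1)  # row-wise dot products
    return ((preds - ratings) ** 2).mean()

before = mse(U, M)
lr = 0.05
for _ in range(200):
    err = (U[users] * M[movies]).sum(axis=1) - ratings
    grad_U = np.zeros_like(U)
    grad_M = np.zeros_like(M)
    # Accumulate per-row gradients; np.add.at handles repeated indices.
    np.add.at(grad_U, users, 2 * err[:, None] * M[movies] / len(ratings))
    np.add.at(grad_M, movies, 2 * err[:, None] * U[users] / len(ratings))
    U -= lr * grad_U
    M -= lr * grad_M
after = mse(U, M)

print(after < before)                           # the loss should have dropped
```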

In [17]:
x = merge([u, m], mode='dot')
x = Flatten()(x)
model = Model([user_in, movie_in], x)
model.compile(Adam(0.001), loss='mse')

Instructions for updating:
keep_dims is deprecated, use keepdims instead


In [18]:
model.fit([trn.userId, trn.movieId], trn.rating, batch_size=64, nb_epoch=1, 
          validation_data=([val.userId, val.movieId], val.rating))

Train on 80727 samples, validate on 20109 samples
Epoch 1/1


<keras.callbacks.History at 0x7fe9d0f22be0>

Some learning rate annealing

In [21]:
K.set_value(model.optimizer.lr, 0.01)  # update the learning-rate variable in place

In [22]:
model.fit([trn.userId, trn.movieId], trn.rating, batch_size=64, nb_epoch=3, 
          validation_data=([val.userId, val.movieId], val.rating))

Train on 80727 samples, validate on 20109 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7fea98446240>

In [23]:
K.set_value(model.optimizer.lr, 0.001)

In [24]:
model.fit([trn.userId, trn.movieId], trn.rating, batch_size=64, nb_epoch=6, 
          validation_data=([val.userId, val.movieId], val.rating))

Train on 80727 samples, validate on 20109 samples
Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6


<keras.callbacks.History at 0x7fe9d1ca8e80>
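
The three `fit` calls above amount to a hand-written step schedule. As a rough sketch, the phases can be written down as data and expanded into one learning rate per epoch; `lr_per_epoch` is a hypothetical helper, not part of Keras.

```python
# (learning rate, epochs) phases, copied from the cells above.
phases = [(0.001, 1), (0.01, 3), (0.001, 6)]

def lr_per_epoch(phases):
    # Expand the phases into one learning rate per epoch.
    for lr, n_epochs in phases:
        for _ in range(n_epochs):
            yield lr

schedule = list(lr_per_epoch(phases))
print(len(schedule))   # 10
print(schedule[:5])    # [0.001, 0.01, 0.01, 0.01, 0.001]
```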

### ADDING BIAS

Our model is missing something: some users are movie buffs who rate most movies generously, and some movies are liked by most people. To account for this we add a user bias and a movie bias to the model and train again.
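
In plain Python the biased prediction looks like the sketch below. `predict` and the toy vectors are illustrative only; in the notebook the same sum is built from Keras layers.

```python
def predict(u, m, user_bias, movie_bias):
    # Dot product of the latent factors plus one scalar bias per user and per movie.
    return sum(a * b for a, b in zip(u, m)) + user_bias + movie_bias

u = [0.5, 1.0]          # toy user factors
m = [2.0, 1.0]          # toy movie factors

print(predict(u, m, 0.0, 0.0))    # 2.0
print(predict(u, m, 0.5, 1.0))    # 3.5
```

The biases shift the prediction up or down regardless of how well the user's and the movie's factors line up.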

This helper function creates an input layer together with its embedding.

In [25]:
def create_embed(name, n_in, n_out, reg):
    inp = Input(shape=(1,), dtype='int64', name=name)
    return inp, Embedding(n_in, n_out, input_length=1, W_regularizer=l2(reg))(inp)

In [26]:
user_in, u = create_embed('user_in', n_users, n_factors, 1e-4)
movie_in, m = create_embed('movie_in', n_movies, n_factors, 1e-4)

Another helper creates a bias as an embedding with a single output.

In [27]:
def create_bias(inp, n_in):
    x = Embedding(n_in, 1, input_length=1)(inp)
    return Flatten()(x)

Creating the user bias and the movie bias

In [28]:
ub = create_bias(user_in, n_users)
mb = create_bias(movie_in, n_movies)

In [29]:
x = merge([u, m], mode='dot')
x = Flatten()(x)
x = merge([x, ub], mode='sum')
x = merge([x, mb], mode='sum')
model = Model([user_in, movie_in], x)
model.compile(Adam(0.01), loss='mse')

In [30]:
model.fit([trn.userId, trn.movieId], trn.rating, batch_size=64, nb_epoch=1, 
          validation_data=([val.userId, val.movieId], val.rating))

Train on 80727 samples, validate on 20109 samples
Epoch 1/1


<keras.callbacks.History at 0x7fe9d02f3630>

##### Some Learning Rate Annealing 

In [31]:
K.set_value(model.optimizer.lr, 0.01)  # update the learning-rate variable in place

In [32]:
model.fit([trn.userId, trn.movieId], trn.rating, batch_size=64, nb_epoch=6, 
          validation_data=([val.userId, val.movieId], val.rating))

Train on 80727 samples, validate on 20109 samples
Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6


<keras.callbacks.History at 0x7fe9d041ab00>

In [33]:
K.set_value(model.optimizer.lr, 0.001)

In [34]:
model.fit([trn.userId, trn.movieId], trn.rating, batch_size=64, nb_epoch=10, 
          validation_data=([val.userId, val.movieId], val.rating))

Train on 80727 samples, validate on 20109 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fea3eb04588>

So that took us to a validation loss (MSE) of about 1.1. That is not great, since the academic state of the art for this dataset is an RMSE of 0.89. So we are not there yet!
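
The loss reported by Keras here is a mean squared error, while the benchmark is a root mean squared error, so the two are easier to compare after taking a square root. A quick check using the approximate figure quoted above:

```python
import math

val_mse = 1.1                 # approximate validation loss quoted above
val_rmse = math.sqrt(val_mse)
print(round(val_rmse, 3))     # 1.049
```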

###### Saving the Model & Reloading 

In [35]:
model.save_weights(model_path+'bias.h5')

In [36]:
model.load_weights(model_path+'bias.h5')

Making a prediction for the user at index 3 and the movie at index 6

In [52]:
model.predict([np.array([3]), np.array([6])])

array([[3.0126]], dtype=float32)
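
Note that the model works with the remapped indices, not the raw MovieLens ids. The sketch below shows one way to move between the two; the dictionaries are small stand-ins for the notebook's `movieid2idx` and `movie_names`, the titles other than Toy Story are placeholders, and `idx2movieid` and `title_for_index` are names I made up.

```python
# Toy stand-ins for the notebook's lookup tables (values are illustrative).
movieid2idx = {1: 0, 3: 1, 6: 2}                 # raw movieId -> embedding index
movie_names = {1: 'Toy Story (1995)', 3: 'Movie B', 6: 'Movie C'}

# Invert the mapping to get from an embedding index back to a raw id.
idx2movieid = {i: o for o, i in movieid2idx.items()}

def title_for_index(idx):
    return movie_names[idx2movieid[idx]]

print(title_for_index(0))     # Toy Story (1995)
print(movieid2idx[6])         # 2, the index to pass to the model for raw id 6
```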

### NEURAL NET MODEL

Let us now create a neural net model that concatenates the user and movie vectors and feeds them through dense layers.
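
Before the Keras version, here is a rough NumPy sketch of the forward pass for a single user-movie pair. The weights are random stand-ins rather than learned values, and dropout is left out because it is only active during training.

```python
import numpy as np

rng = np.random.RandomState(0)
n_factors, n_hidden = 50, 70                     # sizes used in this notebook

# Random stand-ins for learned parameters.
u = rng.randn(n_factors)                         # one user's embedding
m = rng.randn(n_factors)                         # one movie's embedding
W1, b1 = rng.randn(2 * n_factors, n_hidden) * 0.1, np.zeros(n_hidden)
W2, b2 = rng.randn(n_hidden, 1) * 0.1, np.zeros(1)

x = np.concatenate([u, m])                       # concatenated input, shape (100,)
h = np.maximum(0.0, x @ W1 + b1)                 # hidden layer with ReLU
rating = h @ W2 + b2                             # single linear output

print(x.shape, h.shape, rating.shape)            # (100,) (70,) (1,)
```

Unlike the dot product model, the network is free to learn interactions between any user factor and any movie factor.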

In [39]:
user_in, u = create_embed('user_in', n_users, n_factors, 1e-4)
movie_in, m = create_embed('movie_in', n_movies, n_factors, 1e-4)

In [41]:
x = merge([u, m], mode='concat')
x = Flatten()(x)
x = Dropout(0.3)(x)
x = Dense(70, activation='relu')(x)
x = Dropout(0.75)(x)
x = Dense(1)(x)
nn = Model([user_in, movie_in], x)
nn.compile(Adam(0.001), loss='mse')

In [42]:
nn.fit([trn.userId, trn.movieId], trn.rating, batch_size=128, nb_epoch=10, 
          validation_data=([val.userId, val.movieId], val.rating))

Train on 80727 samples, validate on 20109 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fe9c8533a20>

In [43]:
K.set_value(nn.optimizer.lr, 0.001)  # update the learning-rate variable in place

In [44]:
nn.fit([trn.userId, trn.movieId], trn.rating, batch_size=128, nb_epoch=5, 
          validation_data=([val.userId, val.movieId], val.rating))

Train on 80727 samples, validate on 20109 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7fe9d0589438>

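In this run the validation loss is well below the academic state-of-the-art figure quoted earlier, keeping in mind that the loss here is an MSE while that benchmark is an RMSE.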

###### Saving the Model & Making a Prediction

In [45]:
nn.save_weights(model_path+'nn_model.h5')

In [46]:
nn.load_weights(model_path+'nn_model.h5')

We can use the model to get a prediction for a user-movie pair by passing in a user index and a movie index.<br>
In this case we pass in user #4 and movie #6 to estimate how that user would rate the movie.

In [54]:
nn.predict([np.array([4]), np.array([6])])

array([[3.9541526]], dtype=float32)