## Neural Collaborative Filtering

### Motivation

In [Lesson 5](http://course.fast.ai/lessons/lesson5.html) of the Fast.ai course, Jeremy Howard discusses using neural networks for collaborative filtering. The course implements this using PyTorch and the Fast.ai library. Here I try to implement the same idea using Keras.

### Recommender Systems

Recommender systems are all around us. From Amazon to Google to Netflix, everyone is trying to use recommender systems to suggest which products to buy next or which movies to watch.  
Below are some statistics that illustrate the importance of these systems.  
* **Netflix**: 2/3 of the movies watched are recommended  
* **Google News**: recommendations generate 38% more clickthrough  
* **Amazon**: 35% of sales come from recommendations  

### Collaborative Filtering

Collaborative filtering (CF) is a method of making automatic predictions (filtering) about the interests of a user by collecting preferences or taste information from many users (collaborating). The underlying assumption is that if person A has the same opinion as person B on one issue, A is more likely to share B's opinion on a different issue than the opinion of a randomly chosen person.  
The traditional approach to CF uses matrix factorization to represent the preferences of users and the characteristics of items as vectors in a small latent space. These vectors are called **Embeddings**.  
Let us use deep learning to learn these embedding matrices instead of the traditional matrix factorization methods.
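
To make the idea of embeddings concrete, here is a minimal NumPy sketch of classic matrix factorization trained with stochastic gradient descent. It is not part of the original notebook: the 4x3 toy rating matrix, the helper name `factorize` and all hyperparameters (`k`, `lr`, `reg`, `epochs`) are illustrative assumptions.

```python
import numpy as np

def factorize(ratings, k=2, lr=0.05, reg=0.01, epochs=500, seed=0):
    """Learn embeddings so that U[u] @ V[i] approximates ratings[u, i].

    Entries equal to 0 are treated as missing (an assumption of this toy example).
    """
    rng = np.random.default_rng(seed)
    n_users, n_items = ratings.shape
    U = rng.normal(scale=0.1, size=(n_users, k))  # user embeddings
    V = rng.normal(scale=0.1, size=(n_items, k))  # item embeddings
    observed = np.argwhere(ratings > 0)
    for _ in range(epochs):
        for u, i in observed:
            err = ratings[u, i] - U[u] @ V[i]
            u_old = U[u].copy()
            # Gradient step on the squared error with L2 regularization.
            U[u] += lr * (err * V[i] - reg * U[u])
            V[i] += lr * (err * u_old - reg * V[i])
    return U, V

# Made-up ratings: rows are users, columns are movies, 0 means "not rated".
toy = np.array([[5, 3, 0],
                [4, 0, 1],
                [1, 1, 5],
                [0, 1, 4]], dtype=float)

U, V = factorize(toy)
pred = U @ V.T                      # predicted rating for every (user, movie) pair
mask = toy > 0
mae = np.abs(pred[mask] - toy[mask]).mean()
print("MAE on observed entries:", round(mae, 3))
```

The zero entries of `toy` are filled in by `pred`, which is exactly the recommendation step: unseen movies are scored from the learned user and movie vectors.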

### Data Set

We will be using the new MovieLens data set (`ml-latest-small`), which has approximately 100,000 ratings of 9,000 movies by 700 users. It is available here: [https://grouplens.org/datasets/movielens/](https://grouplens.org/datasets/movielens/)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
from keras.models import Sequential, Model
from keras.layers import Embedding, Reshape, Activation, Input, Dense, Flatten, Dropout
from keras.layers.merge import Dot, multiply, concatenate
from keras.utils import np_utils
from keras.layers import merge, Merge
from keras.optimizers import Adagrad, Adam, SGD, RMSprop
from sklearn.metrics import mean_absolute_error

%matplotlib inline

Using TensorFlow backend.


In [2]:
data = pd.read_csv('ml-latest-small/ratings.csv')

In [3]:
data.head()

|   | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 31 | 2.5 | 1260759144 |
| 1 | 1 | 1029 | 3.0 | 1260759179 |
| 2 | 1 | 1061 | 3.0 | 1260759182 |
| 3 | 1 | 1129 | 2.0 | 1260759185 |
| 4 | 1 | 1172 | 4.0 | 1260759205 |


In [4]:
data.shape

(100004, 4)

In [5]:
print('Number of unique users ' + str(len(data.userId.unique())))
print('Number of unique movies ' + str(len(data.movieId.unique())))

Number of unique users 671
Number of unique movies 9066


Assign each user a unique integer between 0 and the number of users minus 1, so that the ids can be used directly as row indices into an embedding matrix. Do the same for movies.

In [6]:
data.userId = data.userId.astype('category').cat.codes.values
data.movieId = data.movieId.astype('category').cat.codes.values

In [7]:
data.head()

|   | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 0 | 30 | 2.5 | 1260759144 |
| 1 | 0 | 833 | 3.0 | 1260759179 |
| 2 | 0 | 859 | 3.0 | 1260759182 |
| 3 | 0 | 906 | 2.0 | 1260759185 |
| 4 | 0 | 931 | 4.0 | 1260759205 |


In [8]:
print('Number of unique users ' + str(len(data.userId.unique())))
print('Number of unique movies ' + str(len(data.movieId.unique())))

Number of unique users 671
Number of unique movies 9066
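
The re-indexing matters because an `Embedding` layer looks rows up by integer position, so the ids should be contiguous and start at 0. The sketch below, on a made-up frame that is not part of the data set, shows what `cat.codes` does and one way to keep a lookup back to the original ids (the name `id_lookup` is hypothetical).

```python
import pandas as pd

# Hypothetical raw ids with gaps, similar in spirit to MovieLens movieId values.
toy = pd.DataFrame({"movieId": [31, 1029, 31, 1061, 1029]})

as_cat = toy.movieId.astype("category")
codes = as_cat.cat.codes.values                                # contiguous integers from 0
id_lookup = dict(enumerate(as_cat.cat.categories.tolist()))   # code -> original id

print(codes.tolist())   # [0, 1, 0, 2, 1]
print(id_lookup)        # {0: 31, 1: 1029, 2: 1061}
```

Keeping such a lookup is useful later, when predicted codes have to be translated back into real movie ids.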


Split the data set into a training set (80%) and a test set (20%).

In [9]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(data, test_size=0.2)
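
Because `train_test_split` shuffles the rows randomly, scores will vary slightly from run to run. If reproducibility matters, one option is to pass a fixed `random_state`. The sketch below uses a made-up frame and an arbitrary seed of 42; neither comes from the original notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the ratings frame.
toy = pd.DataFrame({
    "userId": [0, 0, 1, 1, 2, 2, 3, 3, 4, 4],
    "movieId": [0, 1, 0, 2, 1, 2, 0, 3, 2, 3],
    "rating": [2.5, 3.0, 4.0, 3.5, 5.0, 1.0, 4.5, 2.0, 3.0, 4.0],
})

# Fixing random_state makes the shuffle, and therefore the split, repeatable.
train_a, test_a = train_test_split(toy, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(toy, test_size=0.2, random_state=42)

print(len(train_a), len(test_a))            # 8 2
print(test_a.index.equals(test_b.index))    # True
```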

### Model 1

The first model is a simple one in which we learn a dense representation (an embedding) for each movie and each user in our data set.

In [10]:
dim_embedddings = 30
num_movies = len(data.movieId.unique())
num_users = len(data.userId.unique())

In [11]:
m_inputs = Input(shape=(1,), dtype='int32')
m = Embedding(num_movies + 1, dim_embedddings, name="movie")(m_inputs)

u_inputs = Input(shape=(1,), dtype='int32')
u = Embedding(num_users + 1, dim_embedddings, name="user")(u_inputs)

o = multiply([m, u])
o = Dropout(0.5)(o)
o = Flatten()(o)
o = Dense(1)(o)

rec_model = Model(inputs=[m_inputs, u_inputs], outputs=o)
rec_model.compile(loss='mae', optimizer='adam', metrics=["mae"])

In [12]:
history = rec_model.fit([train.movieId, train.userId], train.rating, epochs=5, verbose=2, validation_split=0.1)

Train on 72002 samples, validate on 8001 samples
Epoch 1/5
 - 28s - loss: 2.2587 - mean_absolute_error: 2.2587 - val_loss: 0.9891 - val_mean_absolute_error: 0.9891
Epoch 2/5
 - 10s - loss: 0.8437 - mean_absolute_error: 0.8437 - val_loss: 0.7412 - val_mean_absolute_error: 0.7412
Epoch 3/5
 - 10s - loss: 0.7211 - mean_absolute_error: 0.7211 - val_loss: 0.7190 - val_mean_absolute_error: 0.7190
Epoch 4/5
 - 10s - loss: 0.6803 - mean_absolute_error: 0.6803 - val_loss: 0.7172 - val_mean_absolute_error: 0.7172
Epoch 5/5
 - 10s - loss: 0.6524 - mean_absolute_error: 0.6524 - val_loss: 0.7160 - val_mean_absolute_error: 0.7160


In [13]:
print(mean_absolute_error(test.rating, rec_model.predict([test.movieId, test.userId])))

0.70013000026869
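
It can help to see what Model 1 computes for a single (movie, user) pair at prediction time, when dropout is inactive: the two embedding vectors are multiplied element-wise and the final `Dense(1)` layer takes a weighted sum plus a bias. With all weights equal to 1 and a zero bias this reduces to the plain dot product used in matrix factorization. The NumPy sketch below uses random stand-in weights, not the trained ones, and the function name `predict_model1` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 30                      # same size as dim_embedddings in the notebook
n_movies, n_users = 5, 4      # tiny made-up vocabulary sizes

movie_emb = rng.normal(size=(n_movies, dim))   # stand-in for the "movie" layer
user_emb = rng.normal(size=(n_users, dim))     # stand-in for the "user" layer
w = rng.normal(size=dim)                       # stand-in for the Dense(1) kernel
b = 0.1                                        # stand-in for the Dense(1) bias

def predict_model1(movie_id, user_id):
    """Forward pass of Model 1 at inference time (dropout is inactive)."""
    interaction = movie_emb[movie_id] * user_emb[user_id]   # element-wise product
    return interaction @ w + b

score = predict_model1(2, 1)

# With w = 1 and b = 0 the same computation is an ordinary dot product.
plain_dot = (movie_emb[2] * user_emb[1]) @ np.ones(dim)
print(np.isclose(plain_dot, movie_emb[2] @ user_emb[1]))   # True
```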


### Model 2

In this model we introduce bias terms. The first model does not explicitly account for the tendency of some users to give consistently high scores to every movie they watch, or of some movies to receive consistently low scores from all users.

In [14]:
bias = 1
m_inputs = Input(shape=(1,), dtype='int32')
m = Embedding(num_movies + 1, dim_embedddings, name="movie")(m_inputs)
m_bias = Embedding(num_movies + 1, bias, name="moviebias")(m_inputs)

u_inputs = Input(shape=(1,), dtype='int32')
u = Embedding(num_users + 1, dim_embedddings, name="user")(u_inputs)
u_bias = Embedding(num_users + 1, bias, name="userbias")(u_inputs)

o = multiply([m, u])
o = concatenate([o, m_bias, u_bias])
o = Dropout(0.5)(o)
o = Flatten()(o)
o = Dense(1)(o)

rec_model = Model(inputs=[m_inputs, u_inputs], outputs=o)
rec_model.compile(loss='mae', optimizer='adam', metrics=["mae"])

In [15]:
history = rec_model.fit([train.movieId, train.userId], train.rating, epochs=5, verbose=2, validation_split=0.1)

Train on 72002 samples, validate on 8001 samples
Epoch 1/5
 - 12s - loss: 2.0693 - mean_absolute_error: 2.0693 - val_loss: 0.9141 - val_mean_absolute_error: 0.9141
Epoch 2/5
 - 11s - loss: 0.9005 - mean_absolute_error: 0.9005 - val_loss: 0.7349 - val_mean_absolute_error: 0.7349
Epoch 3/5
 - 11s - loss: 0.7172 - mean_absolute_error: 0.7172 - val_loss: 0.7117 - val_mean_absolute_error: 0.7117
Epoch 4/5
 - 11s - loss: 0.6581 - mean_absolute_error: 0.6581 - val_loss: 0.7104 - val_mean_absolute_error: 0.7104
Epoch 5/5
 - 13s - loss: 0.6223 - mean_absolute_error: 0.6223 - val_loss: 0.7100 - val_mean_absolute_error: 0.7100


In [16]:
print(mean_absolute_error(test.rating, rec_model.predict([test.movieId, test.userId])))

0.6915087548263502
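
The forward pass of Model 2 can be sketched in NumPy in the same way. The one-dimensional bias embeddings are appended to the element-wise product before the final dense layer, so each movie and each user gets a learned scalar offset. All values below are random placeholders and `predict_model2` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 30
n_movies, n_users = 5, 4

movie_emb = rng.normal(size=(n_movies, dim))
user_emb = rng.normal(size=(n_users, dim))
movie_bias = rng.normal(size=(n_movies, 1))    # stand-in for "moviebias"
user_bias = rng.normal(size=(n_users, 1))      # stand-in for "userbias"
w = rng.normal(size=dim + 2)                   # Dense(1) kernel over dim + 2 features
b = 0.0

def predict_model2(movie_id, user_id):
    """Forward pass of Model 2 at inference time (dropout is inactive)."""
    features = np.concatenate([
        movie_emb[movie_id] * user_emb[user_id],   # interaction term
        movie_bias[movie_id],                      # per-movie offset
        user_bias[user_id],                        # per-user offset
    ])
    return features @ w + b

# Raising a movie's bias shifts every prediction for that movie by the same
# amount, regardless of the user.
before = np.array([predict_model2(3, u) for u in range(n_users)])
movie_bias[3] += 1.0
after = np.array([predict_model2(3, u) for u in range(n_users)])
print(np.allclose(after - before, w[dim]))   # True
```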


### Model 3

Let's build a deeper model. This model is an implementation of the [Neural Collaborative Filtering](https://arxiv.org/abs/1708.05031) paper; the authors' code can be found in this [GitHub repository](https://github.com/hexiangnan/neural_collaborative_filtering).  
I would recommend reading the paper and going through the code to gain a deeper understanding of the model.

In [17]:
mf_dim = 8
layers = [64, 32, 16, 8]
user_input = Input(shape=(1,), dtype='int32', name = 'user_input')
movie_input = Input(shape=(1,), dtype='int32', name = 'movie_input')

MF_Embedding_User = Embedding(input_dim = num_users + 1, output_dim = mf_dim, name = 'mf_embedding_user', input_length=1)
MF_Embedding_Movie = Embedding(input_dim = num_movies + 1, output_dim = mf_dim, name = 'mf_embedding_movie', input_length=1)

MLP_Embedding_User = Embedding(input_dim = num_users + 1, output_dim = int(layers[0]/2), name = "mlp_embedding_user", input_length=1)
MLP_Embedding_Movie = Embedding(input_dim = num_movies + 1, output_dim = int(layers[0]/2), name = 'mlp_embedding_movie', input_length=1)

mf_user_latent = Flatten()(MF_Embedding_User(user_input))
mf_movie_latent = Flatten()(MF_Embedding_Movie(movie_input))

mf_vector = multiply([mf_user_latent, mf_movie_latent])

mlp_user_latent = Flatten()(MLP_Embedding_User(user_input)) 
mlp_movie_latent = Flatten()(MLP_Embedding_Movie(movie_input))

mlp_vector = concatenate([mlp_user_latent, mlp_movie_latent])

for idx in range(1, len(layers)):
    layer = Dense(layers[idx], activation='relu', name="layer%d" %idx)
    mlp_vector = layer(mlp_vector)
                  
predict_vector = concatenate([mf_vector, mlp_vector])
# Note: a sigmoid output with binary cross-entropy assumes targets in [0, 1]
prediction = Dense(1, activation='sigmoid', kernel_initializer='lecun_uniform', name = "prediction")(predict_vector)

model = Model(inputs=[user_input, movie_input], outputs=prediction)
model.compile(optimizer=Adam(lr=0.001), loss='binary_crossentropy')

In [18]:
model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
user_input (InputLayer)         (None, 1)            0                                            
__________________________________________________________________________________________________
movie_input (InputLayer)        (None, 1)            0                                            
__________________________________________________________________________________________________
mlp_embedding_user (Embedding)  (None, 1, 32)        21504       user_input[0][0]                 
__________________________________________________________________________________________________
mlp_embedding_movie (Embedding) (None, 1, 32)        290144      movie_input[0][0]                
__________________________________________________________________________________________________
flatten_5 

In [19]:
history = model.fit([train.userId, train.movieId], train.rating, epochs=1, verbose=2, validation_split=0.1)

Train on 72002 samples, validate on 8001 samples
Epoch 1/1
 - 15s - loss: -3.8657e+01 - val_loss: -4.0298e+01


In [20]:
print(mean_absolute_error(test.rating, model.predict([test.userId, test.movieId])))

2.5652217389130545


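As you can see, this model performs far worse than the first two. The negative loss suggests that this is not ordinary overfitting. The output layer is a sigmoid, which can only predict values between 0 and 1, and binary cross-entropy expects targets in that range, whereas the ratings run from 0.5 to 5. The original paper works with implicit (0/1) feedback, so the setup needs to be adapted for explicit ratings, for example by rescaling the ratings to [0, 1] or by using a linear output with an MAE loss. Adding an L1 or L2 regularizer or training on a bigger data set may help further. Maybe I will try this out at a later stage.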
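
The negative loss in the training log above can be reproduced in a few lines of NumPy. The sketch writes binary cross-entropy out by hand and feeds it made-up ratings; the rescaling at the end is one possible remedy and an assumption on my part, not something tested in this notebook.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Element-wise binary cross-entropy, written out in NumPy for illustration."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

ratings = np.array([0.5, 2.5, 4.0, 5.0])    # made-up ratings on the MovieLens scale
saturated = np.full_like(ratings, 0.999)     # a sigmoid output pushed towards 1

# Targets above 1 make the (1 - y_true) term negative, so the loss can fall
# below zero and keeps falling as the sigmoid saturates.
raw_loss = binary_crossentropy(ratings, saturated).mean()

# Possible remedy (an assumption): rescale ratings to [0, 1] for training and
# multiply the predictions by 5 afterwards to get back to the rating scale.
scaled = ratings / 5.0
scaled_loss = binary_crossentropy(scaled, saturated).mean()
restored = saturated * 5.0

print(raw_loss < 0, scaled_loss >= 0)   # True True
```

Since the sigmoid can never output more than 1, the predictions also stay far below most ratings, which would explain a large test MAE.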

## References

* [Neural Collaborative Filtering](https://arxiv.org/abs/1708.05031)
* [Xavier Amatriain Lecture](https://www.youtube.com/watch?v=bLhq63ygoU8)
* [Fast.ai Lesson 5](http://course.fast.ai/lessons/lesson5.html)