# Recommender System for movies

In this notebook we will create a Recommender System for movies using a Restricted Boltzman Machine. 
The Restricted Boltzman Machine is a Generative Model, with undirectional edges, composed of visible and hidden nodes.
We can see the structure of the network in this picture:

<img src="restrictedBoltzman.png">

In particular, in this model we will use the following datasets:

https://grouplens.org/datasets/movielens/1m/

https://grouplens.org/datasets/movielens/100k/

MovieLens data sets were collected by the GroupLens Research Project
at the University of Minnesota.
 
This data set consists of:
 * 100,000 ratings (1-5) from 943 users on 1682 movies. 
 * Each user has rated at least 20 movies. 
 * Simple demographic info for the users (age, gender, occupation, zip)


In [1]:
# Libraries
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.parallel
import torch.optim as optim
import torch.utils.data
from torch.autograd import Variable

## 1. Data

In [3]:
# Movies Data
movies = pd.read_csv('ml-1m/movies.dat', sep = '::', header = None, engine = 'python', encoding = 'latin-1', names = ['MovieID', 'Title', 'Genre' ])
movies.head()

Unnamed: 0,MovieID,Title,Genre
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


In [4]:
# Users Data
users = pd.read_csv('ml-1m/users.dat', sep = '::', header = None, engine = 'python', encoding = 'latin-1', names = ['UserID', 'Gender', 'Age', 'Job', 'ZipCode' ])
users.head()

Unnamed: 0,UserID,Gender,Age,Job,ZipCode
0,1,F,1,10,48067
1,2,M,56,16,70072
2,3,M,25,15,55117
3,4,M,45,7,2460
4,5,M,25,20,55455


In [5]:
# Ratings Data
ratings = pd.read_csv('ml-1m/ratings.dat', sep = '::', header = None, engine = 'python', encoding = 'latin-1', names = ['UserID', 'MovieID', 'Rating', 'Timestamp' ])
ratings.head()

Unnamed: 0,UserID,MovieID,Rating,Timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291


### 1.1 Prepare Training Set and Test set

We will use an holdout train test split. In particular we will use 80% of the data as training set and the remaining 20 % as test set

In [6]:
# Train set
training_set = pd.read_csv('ml-100k/u1.base', sep = '\t')
training_set.head()

Unnamed: 0,1,1.1,5,874965758
0,1,2,3,876893171
1,1,3,4,878542960
2,1,4,3,876893119
3,1,5,3,889751712
4,1,7,4,875071561


In [7]:
training_set.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 79999 entries, 0 to 79998
Data columns (total 4 columns):
1            79999 non-null int64
1.1          79999 non-null int64
5            79999 non-null int64
874965758    79999 non-null int64
dtypes: int64(4)
memory usage: 2.4 MB


In [8]:
# Convert the df into an array
training_set = np.array(training_set, dtype = 'int')
training_set

array([[        1,         2,         3, 876893171],
       [        1,         3,         4, 878542960],
       [        1,         4,         3, 876893119],
       ...,
       [      943,      1188,         3, 888640250],
       [      943,      1228,         3, 888640275],
       [      943,      1330,         3, 888692465]])

In [10]:
# Test set
test_set = pd.read_csv('ml-100k/u1.test', sep = '\t')
test_set.head()

Unnamed: 0,1,6,5,887431973
0,1,10,3,875693118
1,1,12,5,878542960
2,1,14,5,874965706
3,1,17,3,875073198
4,1,20,4,887431883


In [11]:
test_set.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 4 columns):
1            19999 non-null int64
6            19999 non-null int64
5            19999 non-null int64
887431973    19999 non-null int64
dtypes: int64(4)
memory usage: 625.0 KB


In [12]:
test_set = np.array(test_set, dtype = 'int')
test_set

array([[        1,        10,         3, 875693118],
       [        1,        12,         5, 878542960],
       [        1,        14,         5, 874965706],
       ...,
       [      459,       934,         3, 879563639],
       [      460,        10,         3, 882912371],
       [      462,       682,         5, 886365231]])

In [13]:
# Getting the number of users and movies
n_users = int(max(max(training_set[:,0]), max(test_set[:,0])))
n_movies = int(max(max(training_set[:,1]), max(test_set[:,1])))

Now we will convert the data into an array whith users in line and movies in columns. For this reason we extracted the number of users and movies. The values will be the ratings. In particular we will create a list of list, in particular a list of movie ratings for each user.

In [14]:
def convert(data):
    new_data = []
    for id_users in range(1, n_users + 1):
        id_movies = data[:, 1][data[:,0] == id_users] # Select all the movies ID that corresponds to the user id
        id_ratings = data[:, 2][data[:,0] == id_users] # Select all the ratings ID that corresponds to the user id
        ratings = np.zeros(n_movies)
        ratings[id_movies - 1] = id_ratings # id_movies start at 0
        new_data.append(list(ratings))
    return new_data

In [15]:
training_set = convert(training_set)
test_set = convert(test_set)

Now we convert the data into Torch Tensor 

In [16]:
training_set = torch.FloatTensor(training_set)
test_set = torch.FloatTensor(test_set)

Convert the ratings into binary ratings: 1(liked) or 0(not liked). We put a kind of threshold on ratings, we will consider all the ratings grater or equal to 3 as liked and all the ratings above 3 will be considered not liked.
The movie with ratings equal to 0 will be converted to -1, that means they don't have a rating.

In [17]:
training_set[training_set == 0] = -1
training_set[training_set == 1] = 0
training_set[training_set == 2] = 0
training_set[training_set >= 3] = 1

Let's do the same for the test set

In [18]:
test_set[test_set == 0] = -1
test_set[test_set == 1] = 0
test_set[test_set == 2] = 0
test_set[test_set >= 3] = 1

## 2. Restricted Boltzman Machine

Now we will create the architecture of our Restricted Boltzman Machine

In [19]:
class RBM():
    
    def __init__(self, n_visible, n_hidden):
        '''
        Args:
        -----
                n_visible(int): number of visible nodes 
                n_hidden(int): number of hidden nodes 
        '''
        self.weights = torch.randn(n_hidden, n_visible) # Weights initialization using normal distribution
        self.bias_hidden = torch.randn(1, n_hidden) # Hidden layer bias initialization
        self.bias_visible = torch.randn(1, n_visible) # Visible layer bias initialization
        
    def sample_hidden(self, x):
        '''
        Args:
        -----
                x(ndarray): visible nodes  
        Return:
        -------
                prob_h_v(float): probability that the hidden node is activated given the value of the visible node
                sample_h(): sample of the hidden nodes
        '''
        wx = torch.mm(x, self.weights.t()) # Product of x(visible node) and the tensor of weights
        activation = wx + self.bias_hidden.expand_as(wx) # Sum the bias to the product obtained before
        prob_h_v = torch.sigmoid(activation) # Apply sigmoid activation function
        sample_h = torch.bernoulli(prob_h_v) # Use prob_h_V to return a sample of the hidden nodes using bernoulli    
        return prob_h_v, sample_h  
    
    def sample_visible(self, y):
        '''
        Args:
        -----
                x(ndarray): hidden nodes       
        Return:
        -------
                prob_v_h(float): probability that the visible node is activated given the value of the hidden node
                sample_h(): sample of the visible nodes
        '''
        wy = torch.mm(y, self.weights) # Product of y(visible nodes) and the tensor of weights
        activation = wy + self.bias_visible.expand_as(wy) # Sum the bias to the product obtained before
        prob_v_h = torch.sigmoid(activation) # Apply sigmoid activation function
        sample_v = torch.bernoulli(prob_v_h) # Use prob_h_V to return a sample of the visible nodes using bernoulli 
        return prob_v_h, sample_v
    
    def train(self, vo, vk, ph0, phk):
        '''
        Args:
        -----
                v0(torch.tensor): input vector containing the ratings by one user
                vk(torch.tensor): visible nodes obtained after k sampling
                ph0(torch.tensor): vector of probabilities that at the first iteration equal 1 gives the value of v0
                phk(torch.tensor): probabilities of the hidden nodes after k sampling
        Return:
        -------
        
        '''
        self.weights += torch.mm(v0.t(), ph0) - torch.mm(vk.t(), phk)
        self.bias_visible += torch.sum((vo - vk), 0)
        self.bias_hidden += torch.sum((ph0 - phk), 0)

Now we will define the parameters of the Neural Network

In [20]:
n_visible = len(training_set[0])
n_hidden = 100
batch_size = 100

In [21]:
# Neural Network initialization
rbm = RBM(n_visible, n_hidden)

In [22]:
# Train the RNN
nb_epoch = 10
for epoch in range(1, nb_epoch + 1):
    train_loss = 0
    s = 0.
    for id_user in range(0, n_users - batch_size, batch_size): # Batch Learning implementation
        vk = training_set[id_user:id_user + batch_size]
        v0 = training_set[id_user:id_user + batch_size]
        ph0,_ = rbm.sample_hidden(v0)
        for k in range(10): # Gibbs sampling
            _,hk = rbm.sample_hidden(vk)
            _,vk = rbm.sample_visible(hk)
            vk[v0 < 0] = v0 [ v0 < 0] # In this way we avoid to do the training on raiting that are not present
        phk,_ = rbm.sample_hidden(vk) 
        rbm.train(v0, vk, ph0, phk)
        train_loss += torch.mean(torch.abs(v0[v0 >= 0] - vk[vk >= 0])) # Loss calculation between target(v0) and prediction(vk)
        s += 1.
    print('epoch: ' + str(epoch) + ' loss: ' + str(train_loss/s)) 

  return self.add_(other)


epoch: 1 loss: 0.2912355886781126
epoch: 2 loss: 0.2530179739246638
epoch: 3 loss: 0.24918280971576479
epoch: 4 loss: 0.251760895520998
epoch: 5 loss: 0.25291972365011123
epoch: 6 loss: 0.2511846800507589
epoch: 7 loss: 0.25291324627891787
epoch: 8 loss: 0.2511104385825301
epoch: 9 loss: 0.2510083409654801
epoch: 10 loss: 0.24845888094162605


In [27]:
# Test the RBM
test_loss = 0
s = 0.
for id_user in range(n_users): 
    v = training_set[id_user:id_user + 1]
    vt = test_set[id_user:id_user + 1] # Target used to calculate the erro
    if len(vt[vt >= 0]) > 0: # Gibbs sampling with only one step
        _,h = rbm.sample_hidden(v)
        _,v = rbm.sample_visible(h)
        test_loss += torch.mean(torch.abs(vt[vt >= 0] - v[vt >= 0])) # Loss calculation between target(vt) and prediction(v)
        s += 1.
print('test loss: ' + str(test_loss/s)) 

test loss: 0.24040919556467844


## 4. Discussion

The test loss is quite good, since for any new observations(new movies) we manage to obtain a correct rating prediction around 3 times over 4. 

Anyway this was a basic case in which we predicted just binary rating 'Like' or 'Not Liked'. An extension can be to predict the rating from 1 to 5 