# Multi-layer perceptron for recommender systems

## Libraries and packages

The usual numerical libraries are used. In particular:
- NumPy provides fast array and numerical operations
- SciPy is mainly used to create sparse matrices
- `time` is imported to track the time elapsed while running the cells
- Keras is used to create and train the neural networks
- `heapq` allows fast creation of a ranking list within the evaluation process

In [26]:
import scipy.sparse as sp
import numpy as np
import time

from keras.regularizers import l2
from keras.layers import Input, Dense, Flatten, Embedding, Concatenate, Multiply
from keras.models import Sequential, Model
from keras.initializers import RandomNormal 

import heapq

## Loading data

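The loading function comes from the notebooks seen during classes: it converts a .csv (or .dat) file into a list of triples of the form $(userid,itemid,rating)$, with user and item ids re-mapped to contiguous indices starting from $0$. It also returns the number of distinct users and items.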

In [27]:
def load_data(filename):
    input_lines = []
    users = {}
    num_users = 0
    items = {}
    num_items = 0
    # read all the lines, closing the file afterwards
    with open(filename, 'r') as f:
        raw_lines = f.read().splitlines()
    # remove the first line (treated as a header)
    del raw_lines[0]
    for line in raw_lines:
        line_content = line.split('::')
        user_id = int(line_content[0])
        item_id = int(line_content[1])
        rating = float(line_content[2])
        if user_id not in users:
            users[user_id] = num_users
            num_users += 1
        if item_id not in items:
            items[item_id] = num_items
            num_items += 1
        input_lines.append([users[user_id], items[item_id], rating])
    return input_lines, num_users, num_items

In [28]:
# The MovieLens 1M dataset is imported
input_file = "./ratings_1m.dat"
input_ratings, num_users, num_items = load_data(input_file)
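
As a small illustration of the re-indexing performed by `load_data`, the sketch below parses a few in-memory lines instead of a file. The sample lines, the `reindex` helper and the four-field layout are assumptions made for the example, not part of the original notebook.

```python
# Toy lines in a "user::item::rating::timestamp" layout (hypothetical values).
sample_lines = [
    "10::500::5::978300760",
    "10::731::3::978302109",
    "42::500::4::978301968",
]

def reindex(lines, sep="::"):
    """Map raw user/item ids to contiguous 0-based indices (hypothetical helper)."""
    users, items, triples = {}, {}, []
    for line in lines:
        raw_user, raw_item, rating = line.split(sep)[:3]
        # setdefault assigns the next free index the first time an id is seen
        u = users.setdefault(int(raw_user), len(users))
        i = items.setdefault(int(raw_item), len(items))
        triples.append([u, i, float(rating)])
    return triples, len(users), len(items)

triples, n_users, n_items = reindex(sample_lines)
print(triples)           # [[0, 0, 5.0], [0, 1, 3.0], [1, 0, 4.0]]
print(n_users, n_items)  # 2 2
```

Contiguous indices matter because they are later used directly as row and column positions in the sparse matrices and as inputs to the embedding layers.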

## Preprocessing data

The list obtained above is processed to build the (sparse) `ratings` matrix. Entry $(i,j)$ of this matrix is $1$ if $(i,j,x)$ is a triple in the loaded list and $0$ otherwise. Here $x$ is the rating given by user $i$ to item $j$; since we only analyze implicit feedback, every observed rating is converted to $1$.

An explanation of why this is not necessarily an oversimplification of the problem is given in the report.

A `test` matrix is created too. Each row (user) contains exactly one non-zero entry: the test interaction for that user, which is used in the evaluation part below. The code holds out the first interaction listed for each user, so it relies on the input being grouped by user.

Sparse matrices save memory because they store only the non-zero elements and their locations. A dense `ratings` matrix should not be used with a massive dataset, although it would still be manageable in our case.

In [29]:
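ratings = sp.dok_matrix((num_users, num_items))
test = sp.dok_matrix((num_users, num_items))

# Assuming input_ratings is grouped by user, the first interaction of each
# user goes to the test matrix and all the others go to the ratings matrix
for i in range(len(input_ratings)):
    if i > 0 and input_ratings[i - 1][0] == input_ratings[i][0]:
        ratings[input_ratings[i][0], input_ratings[i][1]] = 1
    else:
        test[input_ratings[i][0], input_ratings[i][1]] = 1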
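
The split can be checked on a toy list. In this minimal sketch a plain `dict` keyed by `(user, item)` stands in for `sp.dok_matrix`, which also stores only the non-zero entries by position; the toy triples and variable names are assumptions made for the example.

```python
# Toy (user, item, rating) triples, grouped by user as in the loaded list.
toy = [[0, 0, 5.0], [0, 1, 3.0], [0, 2, 4.0], [1, 0, 2.0], [1, 3, 5.0]]

train, held_out = {}, {}
previous_user = None
for user, item, _rating in toy:
    if user == previous_user:
        train[(user, item)] = 1      # every later interaction is used for training
    else:
        held_out[(user, item)] = 1   # the first interaction of each user is held out
    previous_user = user

print(sorted(train))     # [(0, 1), (0, 2), (1, 3)]
print(sorted(held_out))  # [(0, 0), (1, 0)]
```

Only five entries are stored, whereas a dense 2 x 4 matrix would hold eight values, most of them zero.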

## Building the neural network

Keras is used to create the neural networks. Both the Multi-Layer Perceptron (MLP) and Generalized Matrix Factorization (GMF) are feed-forward neural networks.

The MLP structure is read from a .txt file. The structure of an MLP is:

input -> embedding -> hidden layers -> output

The dimension of the latent space is the same for users and items and equals the first number in the .txt file divided by 2. The same number also determines the dimension of the latent space for the GMF.


In [None]:
# Hidden layers and their dimensions are read from a .txt file
# containing comma-separated integers
with open("MLP_architecture.txt", "r") as f:
    layers = [int(n) for n in f.read().split(',')]
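
To make the role of the architecture file concrete, here is a minimal sketch that parses a string instead of the file. The content `"64,32,16,8"` is a hypothetical example, not necessarily the file used for the results in the report.

```python
# Hypothetical content of the architecture file: one comma-separated line.
architecture_text = "64,32,16,8\n"

layers = [int(width) for width in architecture_text.strip().split(",")]
embedding_dim = layers[0] // 2   # the user and item embeddings each get half

print(layers)         # [64, 32, 16, 8]
print(embedding_dim)  # 32
```

With these values the concatenation of the user and item embeddings has 64 components, which matches the width of the first hidden layer.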

In [58]:
# This boolean variable states whether to build a GMF or a MLP
# If GMF = True, hidden layers are skipped and embedding layers are multiplied instead of concatenated.
GMF = False

# The functional API is used below, since the model has two inputs

user_in = Input(shape=(1,), dtype='int32', name = 'user_in')
item_in = Input(shape=(1,), dtype='int32', name = 'item_in')

Embedding_User = Embedding(input_dim = num_users, output_dim = int(layers[0]/2), name = 'user_em',
                                  embeddings_initializer = RandomNormal(), embeddings_regularizer = l2(0),
                                   input_length=1)
Embedding_Item = Embedding(input_dim = num_items, output_dim = int(layers[0]/2), name = 'item_em',
                                  embeddings_initializer = RandomNormal(), embeddings_regularizer = l2(0),
                                   input_length=1)   
    
# Each embedding lookup returns a (batch, 1, dim) tensor; Flatten reduces it to (batch, dim)
user_latent = Flatten()(Embedding_User(user_in))
item_latent = Flatten()(Embedding_Item(item_in))

if not GMF:
    vector = Concatenate()([user_latent, item_latent])
else:
    vector = Multiply()([user_latent, item_latent])

# hidden layers (MLP only)
if not GMF:
    for n_units in layers:
        layer = Dense(n_units, kernel_regularizer = l2(0), activation='relu')
        vector = layer(vector)
        
# output layer
prediction = Dense(1, activation='sigmoid', name = 'pred')(vector)
    
model = Model(inputs =[user_in, item_in], outputs = prediction)

model.compile(optimizer='adam', loss='binary_crossentropy')

model.summary()

Model: "functional_25"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
user_in (InputLayer)            [(None, 1)]          0                                            
__________________________________________________________________________________________________
item_in (InputLayer)            [(None, 1)]          0                                            
__________________________________________________________________________________________________
user_em (Embedding)             (None, 1, 32)        193280      user_in[0][0]                    
__________________________________________________________________________________________________
item_em (Embedding)             (None, 1, 32)        118592      item_in[0][0]                    
______________________________________________________________________________________
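
The difference between the two models can be sketched in plain NumPy. This is only an illustration of the tensor shapes, under assumed toy sizes and random weights, with biases omitted; it is not the Keras model trained in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
num_users, num_items, dim = 5, 7, 4          # toy sizes, not the MovieLens ones

user_table = rng.normal(scale=0.05, size=(num_users, dim))
item_table = rng.normal(scale=0.05, size=(num_items, dim))

users = np.array([0, 3])                     # a batch of two (user, item) pairs
items = np.array([2, 6])
user_latent = user_table[users]              # embedding lookup, shape (2, dim)
item_latent = item_table[items]

mlp_input = np.concatenate([user_latent, item_latent], axis=1)  # MLP: (2, 2 * dim)
gmf_input = user_latent * item_latent                           # GMF: (2, dim)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# MLP: one ReLU hidden layer followed by the sigmoid output, random weights
w_hidden = rng.normal(size=(2 * dim, 3))
w_out = rng.normal(size=(3, 1))
hidden = np.maximum(mlp_input @ w_hidden, 0.0)
mlp_score = sigmoid(hidden @ w_out)          # shape (2, 1), values in (0, 1)

# GMF: the element-wise product goes straight to the sigmoid output
gmf_score = sigmoid(gmf_input @ rng.normal(size=(dim, 1)))

print(mlp_input.shape, gmf_input.shape)  # (2, 8) (2, 4)
print(mlp_score.shape, gmf_score.shape)  # (2, 1) (2, 1)
```

Concatenation doubles the width and leaves it to the hidden layers to learn how users and items interact, whereas the element-wise product fixes that interaction and keeps the width unchanged.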

## Training
We train the models on all the positive interactions between users and items (the $1$ entries of the `ratings` matrix), plus a sample of negative interactions (4 negatives for each positive one) that is re-drawn at every epoch.
In doing so we largely follow the training process of the Neural Collaborative Filtering article in the bibliography.

Once all the training instances are obtained, i.e. triples $(userid, itemid, rating)$ where $rating$ is binary, we use the `fit` function of Keras to train the model.

In [59]:
# These parameters are fixed.
n_epochs = 10
batch_size = 256
t = time.time()

for epoch in range(n_epochs):
    
    user_train, item_train, labels_train = [], [], []
    num_neg = 4

    for (u, i) in ratings.keys():
    
        # Training uses all the ratings (i.e. 1 in the rating matrix)
        user_train.append(u)
        item_train.append(i)
        labels_train.append(1)
        
        # For each positive, num_neg random (user, item) pairs with label 0 (item not rated) are added
        for k in range(num_neg):
            j = np.random.randint(num_items)
            while (u, j) in ratings:
                j = np.random.randint(num_items)
            user_train.append(u)
            item_train.append(j)
            labels_train.append(0)

    train_history = model.fit([np.array(user_train), np.array(item_train)], np.array(labels_train),
                             batch_size=batch_size, epochs = 1, shuffle=True)

print('Elapsed time:', time.time() - t)

Elapsed time: 66.05591201782227
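
The negative sampling step can be isolated in a small, dependency-free sketch. The function name `sample_training_instances`, the toy positives and the fixed seed are assumptions made for the example; the notebook itself samples with `np.random.randint` inside the training loop.

```python
import random

def sample_training_instances(positives, num_items, num_neg=4, seed=0):
    """Return parallel lists (users, items, labels) with num_neg negatives per positive."""
    rng = random.Random(seed)
    users, items, labels = [], [], []
    for user, item in sorted(positives):
        users.append(user)
        items.append(item)
        labels.append(1)
        for _ in range(num_neg):
            candidate = rng.randrange(num_items)
            # resample until the pair is not a known positive interaction
            while (user, candidate) in positives:
                candidate = rng.randrange(num_items)
            users.append(user)
            items.append(candidate)
            labels.append(0)
    return users, items, labels

positives = {(0, 1), (0, 2), (1, 3)}   # toy training interactions
users, items, labels = sample_training_instances(positives, num_items=10)
print(len(labels), sum(labels))  # 15 3
```

Each positive contributes one instance with label 1 and four with label 0, so the training set is five times the number of positives.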


## Evaluation

Evaluation is performed for each user. For a more detailed explanation of this part, see the report and the other notebook involving CF.

The idea is to evaluate a single user by picking 1 positive interaction (the test interaction) and 99 negative ones. We then predict the score of each of these 100 interactions and check whether the positive one is in the top 10: if so, we call it a *hit*.

Then the *hit ratio* is the number of hits divided by the total number of users.
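
As a worked example of the definition, the sketch below computes the hit ratio for three users from made-up scores, using the top 2 instead of the top 10 to keep the numbers small. The `is_hit` helper and all the scores are hypothetical.

```python
import heapq

def is_hit(scores, positive_item, k=10):
    """scores maps item id -> predicted score; True if positive_item is in the top k."""
    top_k = heapq.nlargest(k, scores, key=scores.get)
    return positive_item in top_k

# (scores, positive item) for three users, with hypothetical numbers
per_user = [
    ({7: 0.9, 1: 0.2, 4: 0.5}, 7),   # positive ranked 1st -> hit
    ({3: 0.1, 8: 0.6, 2: 0.7}, 3),   # positive ranked 3rd -> miss
    ({5: 0.4, 9: 0.8, 6: 0.3}, 5),   # positive ranked 2nd -> hit
]
hits = sum(is_hit(scores, positive, k=2) for scores, positive in per_user)
print(hits / len(per_user))  # 0.6666666666666666
```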

In [35]:
def evaluate_user_k(k, ratings, test, model):
    num_neg = 99

    # The k-th key of the test matrix is the (user, item) test pair of user k,
    # since test entries were inserted in order of user index
    positive = list(test.keys())[k]

    # 99 distinct items that user k has not interacted with are sampled;
    # the test item itself is excluded so that it cannot appear as a negative
    negatives = []
    while len(negatives) < num_neg:
        j = np.random.randint(num_items)
        if (k, j) not in ratings.keys() and j != positive[1] and j not in negatives:
            negatives.append(j)

    # two vectors are created
    # users contains 100 times the userid
    # items contains the itemid of the positive followed by the negatives
    users = np.full(num_neg + 1, k, dtype='int32')
    items = [positive[1]] + negatives

    # scores are predicted using keras
    scores = model.predict([users, np.array(items)], batch_size=100, verbose=0)

    # each item is mapped to its scalar score
    item_score_dict = {item: float(score) for item, score in zip(items, scores.ravel())}

    # heapq builds the top-10 ranking from the dictionary
    ranklist = heapq.nlargest(10, item_score_dict, key=item_score_dict.get)

    # returns 1 if the positive is in the top-10, 0 otherwise
    return int(positive[1] in ranklist)

In [36]:
# Finally the HR (Hit Ratio) is calculated dividing the number of hits by the total number of users.
s = 0
t = time.time()

for i in range(num_users):
    s += evaluate_user_k(i, ratings, test, model)
print('Hit Ratio:', s/num_users)
print('Elapsed time:', time.time() - t)

Hit Ratio: 0.7021666666666667
Elapsed time: 150.94761633872986


Note that this notebook has been used to run the tests and collect the data needed to write the report.

Parameters such as the .txt file and the `GMF` boolean variable have been changed according to each model.
For a complete overview of all the models and parameters please see the report.