# Matrix Factorization Recommendation Engine: predicting user preferences in the MovieLens dataset

We use collaborative filtering to build a recommendation engine that predicts user preferences from similarities between users and between movies.

The MovieLens 1M dataset contains 1 million ratings collected from 6000 users on 4000 movies, and it is organized into three tables:
1. Ratings
2. Users
3. Movie information

Each table is available as a separate file of delimited rows in the archive at http://files.grouplens.org/datasets/movielens/ml-1m.zip. Note that the code below reads the smaller `ml-latest-small` release instead and uses only its ratings and movies files; the outputs show about 100,000 ratings from 610 users.

This example illustrates a series of interesting things that we can learn from this dataset. Most operations will be performed using the pandas library.

# Pre-Processing

Let's begin by importing pandas. It is conventional to use pd to denote pandas.

In [1]:
import pandas as pd

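Next we load the ratings and movies tables. The CSV files already include a header row, so pandas picks up the column names automatically and the lists below serve only as a reference:

In [2]:
# column names of each table, for reference (read_csv takes them from the header row)
rnames = ['userId', 'movieId', 'rating', 'timestamp']
ratings = pd.read_csv('ml-latest-small/ratings.csv')

mnames = ['movieId', 'title', 'genres']
movies = pd.read_csv('ml-latest-small/movies.csv')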
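
The ml-1m archive linked above is laid out differently from the CSV files read here. As a minimal sketch, assuming its files have no header row and use `::` as the column separator, the names in `rnames` could be passed to `read_csv` explicitly. The in-memory sample and the name `ratings_1m` are illustrative only.

```python
import io
import pandas as pd

# illustrative sample in the assumed ml-1m layout: no header row, '::' separators
sample = io.StringIO("1::1193::5::978300760\n1::661::3::978302109\n")

rnames = ['userId', 'movieId', 'rating', 'timestamp']
# a multi-character separator requires the python parsing engine
ratings_1m = pd.read_csv(sample, sep='::', header=None, names=rnames, engine='python')
print(ratings_1m)
```

With `header=None` and `names=rnames`, the first line of the file is treated as data rather than as column labels.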

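Let's take a look at each table. A notebook cell displays only its last expression, so the output below is the ratings table with the timestamp column dropped:

In [3]:
ratings[:5]  # first 5 rows; not displayed because it is not the last expression
ratings.drop(columns='timestamp')  # returns a copy; ratings itself keeps its timestamp column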

Unnamed: 0,userId,movieId,rating
0,1,1,4.0
1,1,3,4.0
2,1,6,4.0
3,1,47,5.0
4,1,50,5.0
...,...,...,...
100831,610,166534,4.0
100832,610,168248,5.0
100833,610,168250,5.0
100834,610,168252,5.0


In [4]:
movies[:5]

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


Having information spread across different tables makes it much more difficult to analyse the data. Using pandas's merge function, we merge the ratings with the movies data. pandas infers which columns to use as the merge (or join) keys based on overlapping names, which here is movieId:

In [5]:
data = pd.merge(ratings, movies)
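
To make the key inference concrete, here is a small sketch on toy tables (the frames `toy_ratings` and `toy_movies` are made up for illustration). It checks that the implicit call gives the same result as an inner join with the key spelled out.

```python
import pandas as pd

# toy stand-ins for the ratings and movies tables
toy_ratings = pd.DataFrame({'userId': [1, 1, 2],
                            'movieId': [10, 20, 10],
                            'rating': [4.0, 3.5, 5.0]})
toy_movies = pd.DataFrame({'movieId': [10, 20], 'title': ['A', 'B']})

# no key given: pandas joins on the only shared column name, movieId
implicit = pd.merge(toy_ratings, toy_movies)
# the same join with the key and the join type spelled out
explicit = pd.merge(toy_ratings, toy_movies, on='movieId', how='inner')

print(implicit.equals(explicit))
```

Naming the key explicitly is a little more verbose, but it guards against an accidental join on a second column that happens to share a name.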

Below is the merged table: the columns of ratings followed by the title and genres of each movie.


In [6]:
data

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,5,1,4.0,847434962,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,7,1,4.5,1106635946,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
3,15,1,2.5,1510577970,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
4,17,1,4.5,1305696483,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
...,...,...,...,...,...,...
100831,610,160341,2.5,1479545749,Bloodmoon (1997),Action|Thriller
100832,610,160527,4.5,1479544998,Sympathy for the Underdog (1971),Action|Crime|Drama
100833,610,160836,3.0,1493844794,Hazard (2005),Action|Drama|Thriller
100834,610,163937,3.5,1493848789,Blair Witch (2016),Horror|Thriller


# Encoding

In this section, we encode two sets of data: one for training and one for validation. Each user ID and movie ID is replaced by a contiguous integer index starting at 0, assigned in order of first appearance in the training set. Because the ratings are already grouped by user, user 0 appears first with all their ratings, then user 1, and so on. Validation rows whose user or movie never appears in the training set receive the index -1 and are dropped.

In [7]:
import numpy as np

# split train and validation before encoding
np.random.seed(9)
# boolean mask over the rows of ratings: roughly 80% True (train), 20% False (validation)
mask = np.random.rand(len(ratings)) < 0.8
# train and val sets created & separated
train = ratings[mask].copy()
val = ratings[~mask].copy()

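# encode dataset columns with contiguous ids 0..n-1; ids missing from the mapping become -1
def process_column(col, tcol=None):
    # build the mapping from the training column when given, else from col itself
    if tcol is not None:
        uni = tcol.unique()
        
    else:
        uni = col.unique()
    name = {o:i for i,o in enumerate(uni)}
    
    # returns the mapping, the encoded column and the number of distinct ids
    return name, np.array([name.get(x, -1) for x in col]), len(uni)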

# encode rating data with contiguous ids
def encode_data(df, train=None):
    df = df.copy()
    
    for cname in ["userId", "movieId"]:
        tcol = None
        
        if train is not None:
            tcol = train[cname]
            
        _,col,_ = process_column(df[cname], tcol)
        df[cname] = col
        df = df[df[cname] >= 0]  # drop rows whose id was not seen in train
        
    return df

# we are left with the encoded training and validation sets
trainset = encode_data(train)
valueset = encode_data(val, train)
trainset

Unnamed: 0,userId,movieId,rating,timestamp
0,0,0,4.0,964982703
1,0,1,4.0,964981247
2,0,2,4.0,964982224
3,0,3,5.0,964983815
4,0,4,5.0,964982931
...,...,...,...,...
100831,609,2785,4.0,1493848402
100832,609,2786,5.0,1493850091
100833,609,2787,5.0,1494273047
100834,609,1199,5.0,1493846352


In [26]:
valueset

Unnamed: 0,userId,movieId,rating,timestamp
11,0,466,5.0,964981208
12,0,702,3.0,964980985
16,0,251,3.0,964982967
22,0,392,4.0,964981710
24,0,258,4.0,964980868
...,...,...,...,...
100807,609,924,4.0,1493845817
100808,609,925,4.0,1493846503
100811,609,4339,5.0,1479542831
100813,609,3563,4.0,1493846563
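
The encoding can be traced on a handful of ids. The sketch below uses a hypothetical helper, `encode_ids`, that follows the same idea as `process_column` but returns only the encoded array. The sample ids are made up.

```python
import numpy as np
import pandas as pd

def encode_ids(col, reference=None):
    # hypothetical helper: same idea as process_column, returning only the codes
    uni = (col if reference is None else reference).unique()
    lookup = {raw: idx for idx, raw in enumerate(uni)}
    return np.array([lookup.get(x, -1) for x in col])

train_ids = pd.Series([47, 3, 47, 50])
val_ids = pd.Series([50, 3, 999])   # 999 never occurs in train_ids

train_codes = encode_ids(train_ids)           # ids numbered by first appearance
val_codes = encode_ids(val_ids, train_ids)    # unseen ids become -1
print(train_codes.tolist(), val_codes.tolist())   # [0, 1, 0, 2] [2, 1, -1]
```

Reusing the training mapping for the validation ids keeps both sets in the same index space, which matters because the embedding tables are sized from the training set.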


# Embedding

Now that the data is conveniently encoded, we can add embedding layers that map each user and each movie to a dense vector. Instead of storing the full, mostly empty user-by-movie rating matrix, the model stores two much smaller matrices of user and item embeddings, whose entries act as learned latent features.

In [8]:
# these imports will be used for embedding
import torch
import torch.nn as nn
import torch.nn.functional as F
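
An embedding layer is essentially a trainable lookup table: passing an index returns the corresponding row of its weight matrix. The sketch below uses arbitrary toy sizes and names (`emb`, `ids`) to check this.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
emb = nn.Embedding(4, 3)            # 4 ids, 3 features per id
ids = torch.LongTensor([0, 0, 2])   # the same id may appear many times

vectors = emb(ids)                  # one row of emb.weight per requested id
print(vectors.shape)
print(torch.equal(vectors, emb.weight[ids]))   # the lookup is plain row indexing
```

This is why the encoded ids have to be contiguous and start at 0: they are used directly as row numbers.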

# Creating Matrix Factorization Model

Now that the data is prepared for user and item embeddings, we define a class that performs the calculations needed to factorize the rating matrix.

In [9]:
class MatrixFactorizer(nn.Module):
    '''
    Matrix factorization model: one embedding vector per user and per item
    '''
    def __init__(self, nusers, nitems, emsize=100):
        '''
        Initializer: creates the user and item embeddings with small positive weights
        '''
        super(MatrixFactorizer, self).__init__()
        self.emitem = nn.Embedding(nitems, emsize)
        self.emitem.weight.data.uniform_(0, 0.05)
        self.emuser = nn.Embedding(nusers, emsize)
        self.emuser.weight.data.uniform_(0, 0.05)

    def forward(self, x, y):
        '''
        Forward pass: x holds user indices and y holds item indices; returns the
        dot product of each user embedding with the matching item embedding
        '''
        x = self.emuser(x)
        y = self.emitem(y)
        return (x*y).sum(1)
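
The computation in `forward` can be written out as follows. The symbols are notation introduced here rather than names from the code: $p_u$ is the embedding of user $u$, $q_i$ the embedding of item $i$, $k$ the embedding size (`emsize`), and $D$ the set of observed (user, item) pairs. The second expression is the mean-squared error that the training loop further below minimizes.

$$
\hat{r}_{ui} = p_u^\top q_i = \sum_{f=1}^{k} p_{uf}\, q_{if},
\qquad
\mathcal{L} = \frac{1}{|D|} \sum_{(u,i) \in D} \left( r_{ui} - \hat{r}_{ui} \right)^2
$$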

# Setting Parameters

We define some parameters for MatrixFactorizer, such as the number of users and the number of items. We first test the embedding with an embedding size of 3 to see the user factors, the item factors, and their products.

Testing embedding

In [11]:
nusers = max(trainset['userId']) + 1
nitems = max(trainset['movieId']) + 1
emsize = 3

emuser = nn.Embedding(nusers, emsize)
emitem = nn.Embedding(nitems, emsize)
users = torch.LongTensor(trainset.userId.values)
items = torch.LongTensor(trainset.movieId.values)

U = emuser(users)
V = emitem(items)

In [12]:
U

tensor([[ 2.4423, -0.8308, -0.4551],
        [ 2.4423, -0.8308, -0.4551],
        [ 2.4423, -0.8308, -0.4551],
        ...,
        [-0.9240, -0.6846,  1.9353],
        [-0.9240, -0.6846,  1.9353],
        [-0.9240, -0.6846,  1.9353]], grad_fn=<EmbeddingBackward>)

In [13]:
V

tensor([[ 0.1270, -0.2932, -0.1946],
        [-0.1914, -0.2229, -1.1384],
        [ 1.1558,  0.4126,  0.3474],
        ...,
        [ 0.6329,  0.2270,  0.5894],
        [ 0.7306,  0.9266, -0.1110],
        [ 0.3473,  0.7348,  0.3783]], grad_fn=<EmbeddingBackward>)

In [14]:
U * V

tensor([[ 0.3101,  0.2435,  0.0886],
        [-0.4675,  0.1852,  0.5181],
        [ 2.8228, -0.3428, -0.1581],
        ...,
        [-0.5848, -0.1554,  1.1407],
        [-0.6751, -0.6343, -0.2149],
        [-0.3209, -0.5031,  0.7322]], grad_fn=<MulBackward0>)

In [15]:
(U*V).sum(1)

tensor([ 0.6422,  0.2359,  2.3219,  ...,  0.4004, -1.5243, -0.0917],
       grad_fn=<SumBackward1>)
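
The pattern `(U*V).sum(1)` computes one dot product per row, that is, per (user, item) pair. As a small NumPy sketch with random toy matrices, the result matches both an explicit per-row dot product and the diagonal of the full matrix product:

```python
import numpy as np

rng = np.random.default_rng(0)
U = rng.normal(size=(5, 3))   # one user vector per rating row
V = rng.normal(size=(5, 3))   # the matching item vector per rating row

rowwise = (U * V).sum(axis=1)            # same pattern as (U*V).sum(1)
pairwise = np.einsum('ij,ij->i', U, V)   # explicit per-row dot product
diagonal = np.diag(U @ V.T)              # diagonal of the full product

print(np.allclose(rowwise, pairwise), np.allclose(rowwise, diagonal))
```

The row-wise form avoids building the full product, whose off-diagonal entries pair each user with items from other rows and are not needed during training.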

# Training Our Model

Now that we have verified the embedding, it is time to train.

In [16]:
nusers = max(trainset['userId']) + 1
nitems = max(trainset['movieId']) + 1
print(nusers, nitems)

610 8966


In [17]:
MFmodel = MatrixFactorizer(nusers, nitems, emsize=100)

In [18]:
def EpochTrainer(MFmodel, epochs=10, learningrate=0.01, weightdecay=0.0):
    optimizer = torch.optim.Adam(MFmodel.parameters(), lr=learningrate, weight_decay=weightdecay)
    MFmodel.train()
    # the whole training set is used as a single batch, so build the tensors once
    users = torch.LongTensor(trainset.userId.values)
    items = torch.LongTensor(trainset.movieId.values)
    ratings = torch.FloatTensor(trainset.rating.values)
    
    for i in range(epochs):
        y = MFmodel(users, items)      # predicted ratings
        loss = F.mse_loss(y, ratings)  # training MSE
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        print(loss.item())
        
    # report the validation loss (EvaluateLoss is defined in the next section)
    EvaluateLoss(MFmodel)
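
To see the loop in isolation, here is a minimal sketch of the same full-batch Adam update on synthetic data. The ratings are generated from hidden low-rank factors, which is an assumption made only for this demo. All names and sizes are hypothetical and independent of the notebook's variables.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
n_users, n_items, k = 20, 30, 4

# synthetic ratings generated from hidden low-rank factors (demo assumption)
true_u, true_v = torch.rand(n_users, k), torch.rand(n_items, k)
users = torch.randint(0, n_users, (500,))
items = torch.randint(0, n_items, (500,))
ratings = (true_u[users] * true_v[items]).sum(1)

emuser, emitem = nn.Embedding(n_users, k), nn.Embedding(n_items, k)
params = list(emuser.parameters()) + list(emitem.parameters())
optimizer = torch.optim.Adam(params, lr=0.05)

losses = []
for epoch in range(200):
    pred = (emuser(users) * emitem(items)).sum(1)   # row-wise dot products
    loss = F.mse_loss(pred, ratings)
    optimizer.zero_grad()   # clear gradients from the previous step
    loss.backward()         # gradients of the MSE w.r.t. both embeddings
    optimizer.step()        # Adam update
    losses.append(loss.item())

print(losses[0], losses[-1])
```

The training loss falls over the run, but it need not fall at every step, which is consistent with the fluctuating losses printed in this notebook.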

# Evaluating Loss (Mean-Squared Error)

Now that we have defined a function for training, we need to measure how well the model generalizes. We do this by computing the mean-squared error (MSE) on the validation set with PyTorch's built-in mse_loss.

In [19]:
def EvaluateLoss(MFmodel):
    MFmodel.eval()
    users = torch.LongTensor(valueset.userId.values) 
    items = torch.LongTensor(valueset.movieId.values)
    ratings = torch.FloatTensor(valueset.rating.values)
    # no gradients are needed for evaluation
    with torch.no_grad():
        y = MFmodel(users, items)
        loss = F.mse_loss(y, ratings)
    print("test loss %.3f " % loss.item())

In [20]:
EpochTrainer(MFmodel, epochs=10, learningrate=0.1)

12.925970077514648
4.859316825866699
2.5830864906311035
3.105408191680908
0.8451338410377502
1.8163740634918213
2.6548707485198975
2.1342058181762695
1.0872224569320679
0.9718915224075317
test loss 1.893 


In [21]:
EpochTrainer(MFmodel, epochs=15, learningrate=0.01)

1.6408931016921997
1.001889944076538
0.7083783745765686
0.6572748422622681
0.7224433422088623
0.8011520504951477
0.8411889672279358
0.8332117199897766
0.7907929420471191
0.7348751425743103
0.6846339702606201
0.6524309515953064
0.6416072249412537
0.6471261382102966
0.6590015888214111
test loss 0.848 


In [22]:
EpochTrainer(MFmodel, epochs=15, learningrate=0.01)

0.6672634482383728
0.6285321712493896
0.6365256309509277
0.612358808517456
0.6033403277397156
0.6115406155586243
0.60948246717453
0.5949985980987549
0.5831587314605713
0.5813949108123779
0.5827144384384155
0.5781899094581604
0.5678359866142273
0.5576778650283813
0.5514805316925049
test loss 0.777 
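
MSE is measured in squared rating units, so the square root (RMSE) is often easier to interpret. The short sketch below converts the three validation losses reported above. The final value of 0.777 corresponds to an RMSE of roughly 0.88 on the rating scale.

```python
import math

# validation MSE values reported by the three training runs above
val_mse = [1.893, 0.848, 0.777]

# RMSE is in the same units as the ratings themselves
val_rmse = [math.sqrt(m) for m in val_mse]
print([round(r, 3) for r in val_rmse])
```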


# Visualizing Final Matrix

After testing the loss, we display the encoded validation set, whose user and movie pairs are the ones the trained model predicts ratings for.

In [23]:
valueset

Unnamed: 0,userId,movieId,rating,timestamp
11,0,466,5.0,964981208
12,0,702,3.0,964980985
16,0,251,3.0,964982967
22,0,392,4.0,964981710
24,0,258,4.0,964980868
...,...,...,...,...
100807,609,924,4.0,1493845817
100808,609,925,4.0,1493846503
100811,609,4339,5.0,1479542831
100813,609,3563,4.0,1493846563
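
The filled-in rating matrix itself is the product of the user factors and the transposed item factors. The sketch below shows the idea with small random stand-ins for the trained weights. With the notebook's model, the factors would come from `MFmodel.emuser.weight` and `MFmodel.emitem.weight`. The sizes and the top-2 selection are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# random stand-ins for trained factors (assumption for this demo)
user_factors = rng.uniform(0, 1, size=(4, 3))   # 4 users, 3 features
item_factors = rng.uniform(0, 1, size=(6, 3))   # 6 items, 3 features

# dense matrix of predicted ratings: one row per user, one column per item
predicted = user_factors @ item_factors.T

# indices of the 2 highest-scoring items for user 0, best first
top2 = np.argsort(predicted[0])[::-1][:2]
print(predicted.shape, top2)
```

For a real recommender, items the user has already rated would normally be masked out before taking the top entries.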
