# Recommender System - Matrix Factorization

# Table of Contents

- [1. Import Package](#package)
- [2. Import Data](#data)
- [3. Data Analysis](#analysis)
    - [3.1. Shape of Data](#analysis_shape)
    - [3.2. The Number of User and Movie](#analysis_numbof)
    - [3.3. How Much Rating Given by User](#analysis_ratinguser)
    - [3.4. How Much Rater of Each Movie](#analysis_rater)
- [4. Data Preprocessing](#prep)
    - [4.1. Create Data Train and Data Test](#prep_train_test)
    - [4.2. Convert Dataframe to Surprise Data Autofold Type](#prep_daf)
- [5. Modeling](#model)
    - [5.1. Training](#model_train)
    - [5.2. Evaluation](#model_eval)
- [6. Generate Recommendation](#generate)

# Import Package <a class='anchor' id='package'></a>

We use <a href = 'https://surprise.readthedocs.io/en/stable/getting_started.html' target = '_blank'>surprise</a> library to calculate matrix factorization

In [2]:
import pymysql
from sqlalchemy import create_engine

import pandas as pd
import numpy as np
from tqdm import tqdm
from collections import defaultdict

from surprise.prediction_algorithms.matrix_factorization import SVD
from surprise import Dataset
from surprise import Reader
from surprise.model_selection import KFold
from surprise import accuracy

import pickle
import random

# Import Data <a class='anchor' id='data'></a>

We use Movie Recommender System Dataset from <a href = 'https://www.kaggle.com/datasets/gargmanas/movierecommenderdataset' target = '_blank'>Kaggle</a>

In [3]:
data_mv = pd.read_csv('Data/movies.csv')
data_rating = pd.read_csv('Data/ratings.csv')

In [4]:
data_mv.head(10)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
5,6,Heat (1995),Action|Crime|Thriller
6,7,Sabrina (1995),Comedy|Romance
7,8,Tom and Huck (1995),Adventure|Children
8,9,Sudden Death (1995),Action
9,10,GoldenEye (1995),Action|Adventure|Thriller


In [5]:
data_rating.head(10)

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
5,1,70,3.0,964982400
6,1,101,5.0,964980868
7,1,110,4.0,964982176
8,1,151,5.0,964984041
9,1,157,5.0,964984100


# Data Analysis <a class='anchor' id='analysis'></a>

## Shape of Data <a class='anchor' id='analysis_shape'></a>

In [6]:
print("shape of data movie : ", data_mv.shape)
print("shape of data rating : ", data_rating.shape)

shape of data movie :  (9742, 3)
shape of data rating :  (100836, 4)


## Number of User and Movie <a class='anchor' id='analysis_numbof'></a>

In [7]:
print("number of user : ", data_rating['userId'].nunique())
print("number of movie : ", data_mv['movieId'].nunique())

number of user :  610
number of movie :  9742


## How Much Rating Given by User <a class='anchor' id='analysis_ratinguser'></a>

In [8]:
ratingbyuser = data_rating.groupby('userId')[['rating']].count().sort_values('rating').reset_index()

In [9]:
ratingbyuser.head(10)

Unnamed: 0,userId,rating
0,442,20
1,406,20
2,147,20
3,194,20
4,569,20
5,576,20
6,431,20
7,207,20
8,278,20
9,320,20


In [10]:
ratingbyuser.tail(10)

Unnamed: 0,userId,rating
600,288,1055
601,606,1115
602,380,1218
603,68,1260
604,610,1302
605,274,1346
606,448,1864
607,474,2108
608,599,2478
609,414,2698


In [11]:
ratingbyuser[['rating']].describe()

Unnamed: 0,rating
count,610.0
mean,165.304918
std,269.480584
min,20.0
25%,35.0
50%,70.5
75%,168.0
max,2698.0


## How Much Rater of Each Movie <a class='anchor' id='analysis_rater'></a>

In [12]:
rater = data_rating.groupby('movieId')[['rating']].count().sort_values('rating').reset_index()

In [13]:
rater.head(10)

Unnamed: 0,movieId,rating
0,193609,1
1,4032,1
2,57526,1
3,57522,1
4,57502,1
5,57499,1
6,57421,1
7,57326,1
8,57147,1
9,4046,1


In [14]:
rater.tail(10)

Unnamed: 0,movieId,rating
9714,527,220
9715,589,224
9716,110,237
9717,480,238
9718,260,251
9719,2571,278
9720,593,279
9721,296,307
9722,318,317
9723,356,329


In [15]:
rater[['rating']].describe()

Unnamed: 0,rating
count,9724.0
mean,10.369807
std,22.401005
min,1.0
25%,1.0
50%,3.0
75%,9.0
max,329.0


# Data Preprocessing <a class='anchor' id='prep'></a>

Matrix factorization is a part of collaborative filtering recommender system so we only need the user-item rating data for training process. We will create two kinds of data, namely data train and data test

- data train : all historical user-item rating data that we have
- data test  : all unrated user-item pairs (we will predict the rating of those pairs)

we have 610 users and 9742 movies so to create the data test we need to create all user-item pair data. If it is too large then we can divide the data into several batches in the predict process later, while now we can create a function to generate the data test.

Surprise library has its own data type for its input, so we need to convert our data from dataframe to the surprise data autofold type using <a href = 'https://surprise.readthedocs.io/en/stable/reader.html' target = '_blank'>reader method</a>.

## Create Data Train and Data Test <a class='anchor' id='prep_train_test'></a>

In [16]:
# CREATE DATA TRAIN

data_train = data_rating[['userId','movieId','rating']].copy()

In [17]:
# CREATE DATA TEST
def generate_data_test(user_df, mv_df, rated_pair_df):
    temp_user = user_df[['userId']].copy()
    temp_user['key'] = 1

    temp_mv = mv_df[['movieId']].copy()
    temp_mv['key'] = 1

    #cross join temp_user and temp_mv
    data_test = temp_user.merge(temp_mv,on='key').drop('key',axis=1)

    #join data test and a data that contains rated user-item pairs to get the rating
    data_test = data_test.merge(rated_pair_df, how='left', on=['userId','movieId'])

    #exclude rated data on data test
    data_test = data_test[data_test['rating'].isna()]
    
    return data_test

## Convert Dataframe to Surprise Data Autofold Type <a class='anchor' id='prep_daf'></a>

In [18]:
def read_DAF(df, rating_scale = (1,5)):
    reader = Reader(rating_scale = rating_scale)
    daf = Dataset.load_from_df(df, reader)
    return daf

In [19]:
daf_train = read_DAF(data_train)
trainset = daf_train.build_full_trainset()

In [20]:
daf_train

<surprise.dataset.DatasetAutoFolds at 0x7fc256a16280>

In [21]:
trainset

<surprise.trainset.Trainset at 0x7fc256a63eb0>

# Modeling <a class='anchor' id='model'></a>

## Training <a class='anchor' id='model_train'></a>

We will use <a href = 'https://surprise.readthedocs.io/en/stable/matrix_factorization.html' target = '_blank'>SVD</a> model from surprise library.

In [22]:
model = SVD(n_factors= 50, n_epochs= 30, lr_all= 0.01, reg_all= 0.1)

In [23]:
model.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7fc256a16e20>

## Evaluation <a class='anchor' id='model_eval'></a>

There are two types of evaluation in recommender system, namely online evaluation and offline evaluation. We will use offline evaluation and calculate the <a href = 'https://surprise.readthedocs.io/en/stable/FAQ.html?highlight=precision#how-to-compute-precision-k-and-recall-k' target = '_blank'>precision, recall</a>, and <a href = 'https://surprise.readthedocs.io/en/stable/accuracy.html#surprise.accuracy.rmse' target = '_blank'>RMSE</a>. To evaluate our model we will use K-fold cross validation with <a href = 'https://surprise.readthedocs.io/en/stable/model_selection.html' target = '_blank'>model_selection package</a> in surprise library.

In [24]:
def precision_recall_at_k(predictions, k=10, threshold=3.5):
    # First map the predictions to each user.
    user_est_true = defaultdict(list)
    for uid, _, true_r, est, _ in predictions:
        user_est_true[uid].append((est, true_r))

    precisions = dict()
    recalls = dict()
    for uid, user_ratings in user_est_true.items():

        # Sort user ratings by estimated value
        user_ratings.sort(key=lambda x: x[0], reverse=True)

        # Number of relevant items
        n_rel = sum((true_r >= threshold) for (_, true_r) in user_ratings)

        # Number of recommended items in top k
        n_rec_k = sum((est >= threshold) for (est, _) in user_ratings[:k])

        # Number of relevant and recommended items in top k
        n_rel_and_rec_k = sum(((true_r >= threshold) and (est >= threshold))
                              for (est, true_r) in user_ratings[:k])

        # Precision@K: Proportion of recommended items that are relevant
        # When n_rec_k is 0, Precision is undefined. We here set it to 0.

        precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 0

        # Recall@K: Proportion of relevant items that are recommended
        # When n_rel is 0, Recall is undefined. We here set it to 0.

        recalls[uid] = n_rel_and_rec_k / n_rel if n_rel != 0 else 0

    return precisions, recalls

In [25]:
def model_evaluation(daf_train, model, n_fold=5, k=10, rating_threshold = 4):
    evaluation = {"rmse":[],
                  "precision_k":[],
                  "recall_k":[]}
    kf = KFold(n_fold,random_state=42)
    i = 1

    for trainset, testset in kf.split(daf_train):
        model.fit(trainset)
        predictions = model.test(testset)

        rmse = accuracy.rmse(predictions)
        precisions, recalls = precision_recall_at_k(predictions, k, threshold=rating_threshold)

        print("--------Fold {}--------".format(i))
        #compute RMSE
        print("RMSE : ",rmse)

#             Precision and recall can then be averaged over all users
        print("Precision : ", sum(prec for prec in precisions.values()) / len(precisions))
        print("Recall :",sum(rec for rec in recalls.values()) / len(recalls))

        evaluation['rmse'].append(rmse)
        evaluation['precision_k'].append(sum(prec for prec in precisions.values()) / len(precisions))
        evaluation['recall_k'].append(sum(rec for rec in recalls.values()) / len(recalls))

        i += 1

    return evaluation

In [36]:
# EVALUATE USING 5 FOLD CROSS VALIDATION

model_evaluation(daf_train,SVD(n_factors= 50, n_epochs= 30, lr_all= 0.01, reg_all= 0.1))

RMSE: 0.8667
--------Fold 1--------
RMSE :  0.8667433037398852
Precision :  0.5615755919854287
Recall : 0.28656963730142954
RMSE: 0.8517
--------Fold 2--------
RMSE :  0.8516753553171849
Precision :  0.5786184984702288
Recall : 0.2975460372380142
RMSE: 0.8478
--------Fold 3--------
RMSE :  0.8478187945932568
Precision :  0.5848189850652906
Recall : 0.2874227828704104
RMSE: 0.8534
--------Fold 4--------
RMSE :  0.8534323844152234
Precision :  0.5631115014311738
Recall : 0.2845462021563749
RMSE: 0.8631
--------Fold 5--------
RMSE :  0.8631255510011435
Precision :  0.5624108769190739
Recall : 0.28245406832428116


{'rmse': [0.8667433037398852,
  0.8516753553171849,
  0.8478187945932568,
  0.8534323844152234,
  0.8631255510011435],
 'precision_k': [0.5615755919854287,
  0.5786184984702288,
  0.5848189850652906,
  0.5631115014311738,
  0.5624108769190739],
 'recall_k': [0.28656963730142954,
  0.2975460372380142,
  0.2874227828704104,
  0.2845462021563749,
  0.28245406832428116]}

The evaluation results of our model are not so good, we can use hyperparameter tuning to fix that but for now let's continue to the next part :)

In [26]:
model.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7fc256a16e20>

# Generate Recommendation <a class='anchor' id='generate'></a>

We will generate top n recommendation for every user by predict the rating of the unrated user-item pairs (data test).

In [46]:
# create user data
user_df = data_rating[['userId']].copy()
user_df = user_df.drop_duplicates().reset_index(drop=True)

# create movie data
mv_df = data_mv[['movieId']].copy()
mv_df = mv_df.drop_duplicates().reset_index(drop=True)

In [52]:
# create batch (we will predict 100 by 100 users)

batch = list(range(0,len(user_df['userId']),100)) + [len(user_df['userId'])]
batch

[0, 100, 200, 300, 400, 500, 600, 610]

In [61]:
n_recommendation = 10 #get top 10 recommendation
res_recommendation = pd.DataFrame()

for i in tqdm(range(1,len(batch))):
    #generate data test
    data_test = generate_data_test(user_df.loc[batch[i-1]:batch[i]],mv_df,data_train)
    
    #read to data auto fold
    daf_user_test = read_DAF(data_test)
    testset = daf_user_test.build_full_trainset().build_testset()
    
    #predict
    user_predicted = model.test(testset)
    df_user_predicted = pd.DataFrame(user_predicted)
    
    #sort to get top n
    df_user_predicted.sort_values(['est'],ascending=False,ignore_index=True)
    df_user_predicted = df_user_predicted.groupby('uid').head(n_recommendation).reset_index(drop=True)
    
    res_recommendation = res_recommendation.append(df_user_predicted)

100%|██████████| 7/7 [01:21<00:00, 11.62s/it]


In [63]:
res_recommendation

Unnamed: 0,uid,iid,r_ui,est,details
0,1,2,,4.072420,{'was_impossible': False}
1,1,4,,3.474170,{'was_impossible': False}
2,1,5,,3.747182,{'was_impossible': False}
3,1,7,,3.794301,{'was_impossible': False}
4,1,8,,3.941982,{'was_impossible': False}
...,...,...,...,...,...
95,610,8,,3.278242,{'was_impossible': False}
96,610,9,,3.020849,{'was_impossible': False}
97,610,10,,3.557651,{'was_impossible': False}
98,610,11,,3.696350,{'was_impossible': False}


Here it is! in our results we have uid, iid, r_ui, est, and details. uid means user id, iid means item id, r_ui means real rating, est means estimated rating, and details contains the details of calculation results