In [34]:
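import numpy as np
import pandas as pd
import surprise  # needed for surprise.AlgoBase, surprise.KNNBasic and surprise.SVD below
from surprise import SVD, Reader
from surprise import Dataset
from surprise.model_selection import cross_validate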

import warnings
# Silence library warnings in the notebook output.
warnings.filterwarnings('ignore')

# TL;DR

- Implement an algorithm for predicting ratings, based on matrix factorization.
- Compare it with other models. Our model, the class MatrixFactorization, is not perfect, but it predicts ratings with respectable accuracy: its RMSE of about 0.92 sits between KNNBasic (0.94) and Surprise's SVD (0.90).

# Introduction

There are many ways to make recommendations from data. In this report, we implement an algorithm that predicts ratings using matrix factorization, then compare it with other models to evaluate how well our approach works.

# Analysis

Matrix factorization works best when many ratings are observed, but missing data is inevitable: the missing ratings are exactly what we want to predict. We use Stochastic Gradient Descent (SGD) to learn a factor vector p_u for each user and q_i for each item from the observed ratings, and we predict a missing rating as the dot product of p_u and q_i. We measure predictive accuracy with Root Mean Square Error (RMSE) and Mean Absolute Error (MAE); lower is better for both. The number of factors also affects the results, so part of our analysis tunes it.
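To make the SGD step concrete, here is a minimal sketch on a single hypothetical rating. The function name `sgd_step`, the toy values, and the use of plain Python lists (chosen so the sketch has no dependencies) are assumptions for illustration, not part of the model implemented below.

```python
import random


def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def sgd_step(p_u, q_i, r_ui, lr):
    """One SGD step on the squared error (r_ui - p_u . q_i)^2 for one rating."""
    err = r_ui - dot(p_u, q_i)
    # Both updates use the factor values from before the step.
    new_p_u = [p + lr * err * q for p, q in zip(p_u, q_i)]
    new_q_i = [q + lr * err * p for p, q in zip(p_u, q_i)]
    return new_p_u, new_q_i


# Hypothetical toy example: one user, one item, one observed rating of 4.0.
random.seed(0)
p_u = [random.gauss(0, 0.1) for _ in range(3)]
q_i = [random.gauss(0, 0.1) for _ in range(3)]
r_ui = 4.0

initial_error = abs(r_ui - dot(p_u, q_i))
for _ in range(500):
    p_u, q_i = sgd_step(p_u, q_i, r_ui, lr=0.05)
final_error = abs(r_ui - dot(p_u, q_i))

print('error before:', initial_error)
print('error after: ', final_error)
```

With a single rating the factors can fit it almost exactly. On a real dataset each user and item vector is shared across many ratings, so the error on held-out ratings stays well above zero.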

In [59]:
class MatrixFactorization(surprise.AlgoBase):
    '''A basic rating prediction algorithm based on matrix factorization.'''
    
    def __init__(self, learning_rate, n_epochs, n_factors):
        
        # Let AlgoBase set up its own state before storing our hyperparameters.
        surprise.AlgoBase.__init__(self)
        
        self.lr = learning_rate  # learning rate for SGD
        self.n_epochs = n_epochs  # number of iterations of SGD
        self.n_factors = n_factors  # number of factors
        
    def fit(self, trainset):
        '''Learn the vectors p_u and q_i with SGD'''
        
        # AlgoBase.fit stores the trainset on self.trainset.
        surprise.AlgoBase.fit(self, trainset)
        
        print('Start fitting the data')
        
        # Randomly initialize the user and item factors.
        p = np.random.normal(0, .1, (trainset.n_users, self.n_factors))
        q = np.random.normal(0, .1, (trainset.n_items, self.n_factors))
        
        # SGD procedure
        for _ in range(self.n_epochs):
            for u, i, r_ui in trainset.all_ratings():
                err = r_ui - np.dot(p[u], q[i])
                # Update vectors p_u and q_i.
                # Note: q_i is updated with the already updated p_u
                # (a sequential variant of the gradient step).
                p[u] += self.lr * err * q[i]
                q[i] += self.lr * err * p[u]
        
        self.p, self.q = p, q
        
        # Surprise's convention is for fit() to return the fitted algorithm.
        return self

    def estimate(self, u, i):
        '''Return the estimated rating of user u for item i.'''
        
        # Fall back to the global mean for users or items unseen in training.
        if self.trainset.knows_user(u) and self.trainset.knows_item(i):
            return np.dot(self.p[u], self.q[i])
        else:
            return self.trainset.global_mean

In [46]:
reader = Reader()

ratingDf = pd.read_csv('~/Desktop/Work/movielens_recommendation/ratings.csv')

# Keep the first three columns; load_from_df expects user id, item id, rating, in that order.
ratingDf = ratingDf.iloc[:, :3]
data = Dataset.load_from_df(ratingDf, reader)

Fitting data with SGD...
Fitting data with SGD...
Evaluating RMSE, MAE of algorithm MatrixFacto on 2 split(s).

                  Fold 1  Fold 2  Mean    Std     
RMSE (testset)    0.9202  0.9238  0.9220  0.0018  
MAE (testset)     0.7225  0.7259  0.7242  0.0017  
Fit time          12.95   12.81   12.88   0.07    
Test time         1.31    1.30    1.30    0.00    


{'test_rmse': array([0.92024193, 0.92378354]),
 'test_mae': array([0.72245294, 0.7258521 ]),
 'fit_time': (12.950942277908325, 12.808483123779297),
 'test_time': (1.3050119876861572, 1.302353858947754)}

In [60]:
n_factors_range = [5, 10, 15, 20, 25]
results = {}

for n_factors in n_factors_range:

    algo = MatrixFactorization(learning_rate=.01, n_epochs=10, n_factors=n_factors)
    print('')
    print("number of factors is:", n_factors)
    cvResults = cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=2, verbose=True)
    
    results[n_factors] = {
        'RMSE': cvResults['test_rmse'],
        'MAE': cvResults['test_mae'],
        'fit_time': cvResults['fit_time'],
        'test_time': cvResults['test_time']
    }

number of factors is: 5
Start fitting the data
Start fitting the data
Evaluating RMSE, MAE of algorithm MatrixFactorization on 2 split(s).

                  Fold 1  Fold 2  Mean    Std     
RMSE (testset)    0.9172  0.9245  0.9208  0.0036  
MAE (testset)     0.7221  0.7273  0.7247  0.0026  
Fit time          12.71   13.07   12.89   0.18    
Test time         1.14    1.19    1.17    0.02    
number of factors is: 10
Start fitting the data
Start fitting the data
Evaluating RMSE, MAE of algorithm MatrixFactorization on 2 split(s).

                  Fold 1  Fold 2  Mean    Std     
RMSE (testset)    0.9198  0.9238  0.9218  0.0020  
MAE (testset)     0.7223  0.7247  0.7235  0.0012  
Fit time          13.01   12.95   12.98   0.03    
Test time         1.31    1.23    1.27    0.04    
number of factors is: 15
Start fitting the data
Start fitting the data
Evaluating RMSE, MAE of algorithm MatrixFactorization on 2 split(s).

                  Fold 1  Fold 2  Mean    Std     
RMSE (testset)   

Based on these results, 5 factors give the best outcome for our model, though the margin is small: with 10 factors the mean RMSE is 0.9218 against 0.9208, and the mean MAE is even slightly lower (0.7235 against 0.7247). With a Root Mean Square Error (RMSE) of about 0.92 and a Mean Absolute Error (MAE) of about 0.72, is this the best we can do, or is there room for improvement?
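Reading the per-fold tables by eye is error-prone, so the comparison can be sketched as a small summary step. The helper name `summarize` is hypothetical, and the scores are the rounded per-fold values copied from the output above for the 5-factor and 10-factor runs only. In the notebook the same function could be applied to the `results` dictionary directly, since it averages whatever sequences it finds under each key.

```python
from statistics import mean

# Per-fold scores copied from the cross-validation output above
# (only the runs with 5 and 10 factors).
results = {
    5: {'RMSE': [0.9172, 0.9245], 'MAE': [0.7221, 0.7273]},
    10: {'RMSE': [0.9198, 0.9238], 'MAE': [0.7223, 0.7247]},
}


def summarize(results):
    """Return {n_factors: {metric: mean over folds}}."""
    return {
        n_factors: {metric: mean(scores) for metric, scores in metrics.items()}
        for n_factors, metrics in results.items()
    }


summary = summarize(results)

# Pick the setting with the lowest mean error for each metric.
best_rmse = min(summary, key=lambda n: summary[n]['RMSE'])
best_mae = min(summary, key=lambda n: summary[n]['MAE'])

for n_factors, metrics in summary.items():
    print(n_factors, round(metrics['RMSE'], 4), round(metrics['MAE'], 4))
print('lowest mean RMSE:', best_rmse, '| lowest mean MAE:', best_mae)
```

Because the two metrics can disagree on the best setting, and the gaps are of the same order as the fold-to-fold standard deviation, the choice of 5 factors is best read as a reasonable default rather than a clear winner.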

In [47]:
algo = surprise.KNNBasic()
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=2, verbose=True)

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNBasic on 2 split(s).

                  Fold 1  Fold 2  Mean    Std     
RMSE (testset)    0.9437  0.9421  0.9429  0.0008  
MAE (testset)     0.7461  0.7439  0.7450  0.0011  
Fit time          3.99    3.95    3.97    0.02    
Test time         68.52   70.45   69.48   0.96    


{'test_rmse': array([0.94373928, 0.94205207]),
 'test_mae': array([0.74610588, 0.74393816]),
 'fit_time': (3.991666078567505, 3.9493441581726074),
 'test_time': (68.52197599411011, 70.44561815261841)}

With KNNBasic, we obtain a Root Mean Square Error (RMSE) of 0.94 and a Mean Absolute Error (MAE) of 0.745. Both are slightly worse than the 0.92 and 0.72 of our matrix factorization model, which suggests that our approach holds up well against this baseline. KNNBasic is also much slower at prediction time: about 69 seconds of test time per fold, against just over 1 second for our model.

In [48]:
algo = surprise.SVD()
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=2, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 2 split(s).

                  Fold 1  Fold 2  Mean    Std     
RMSE (testset)    0.9022  0.9034  0.9028  0.0006  
MAE (testset)     0.7105  0.7118  0.7112  0.0006  
Fit time          2.88    3.00    2.94    0.06    
Test time         1.29    1.26    1.27    0.01    


{'test_rmse': array([0.90222826, 0.90337181]),
 'test_mae': array([0.71054359, 0.71178012]),
 'fit_time': (2.8758890628814697, 3.0029540061950684),
 'test_time': (1.286635160446167, 1.2597870826721191)}

With the SVD algorithm from the Surprise library, we observe a Root Mean Square Error (RMSE) of 0.90 and a Mean Absolute Error (MAE) of 0.711. Both are slightly better than the 0.92 and 0.72 of our model, and fitting is faster (about 3 seconds per fold against about 13). So while our model may not be perfect, it demonstrates respectable performance in predicting ratings, close to that of the library implementation.
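One documented difference between the two models is that Surprise's SVD predicts mu + b_u + b_i + q_i . p_u, with a global mean and user and item biases, and minimizes a regularized squared error, whereas our class uses the plain dot product with no regularization. The sketch below shows what one such SGD step could look like. The function names and toy values are hypothetical, and whether this difference accounts for the gap seen here is an assumption, not something measured in this report.

```python
def predict(mu, b_u, b_i, p_u, q_i):
    """Biased prediction: global mean + user bias + item bias + p_u . q_i."""
    return mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i))


def biased_sgd_step(mu, b_u, b_i, p_u, q_i, r_ui, lr, reg):
    """One regularized SGD step for a single observed rating r_ui."""
    err = r_ui - predict(mu, b_u, b_i, p_u, q_i)
    # Each parameter moves along the error gradient and is shrunk by reg.
    new_b_u = b_u + lr * (err - reg * b_u)
    new_b_i = b_i + lr * (err - reg * b_i)
    new_p_u = [p + lr * (err * q - reg * p) for p, q in zip(p_u, q_i)]
    new_q_i = [q + lr * (err * p - reg * q) for p, q in zip(p_u, q_i)]
    return new_b_u, new_b_i, new_p_u, new_q_i


# Hypothetical toy values, chosen for illustration only.
mu, r_ui = 3.5, 5.0
b_u, b_i = 0.0, 0.0
p_u, q_i = [0.1, -0.05], [0.05, 0.1]

error_before = abs(r_ui - predict(mu, b_u, b_i, p_u, q_i))
b_u, b_i, p_u, q_i = biased_sgd_step(mu, b_u, b_i, p_u, q_i, r_ui, lr=0.005, reg=0.02)
error_after = abs(r_ui - predict(mu, b_u, b_i, p_u, q_i))

print('error before:', error_before)
print('error after: ', error_after)
```

Adding these terms to MatrixFactorization would mean storing two extra arrays (one bias per user, one per item) and a regularization hyperparameter, which would then need tuning alongside the number of factors.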