# Non-Negative Matrix Factorisation

- Matrix Factorization is a general technique used in collaborative filtering and other applications where a matrix is decomposed into the product of two lower-rank matrices.
- Stochastic Gradient Descent is an optimization algorithm commonly used to minimize the error in the factorization process.

- The key idea is to iteratively update the elements of the factorized matrices using the gradient of the error with respect to the elements.


## Algorithm Summary

Matrix factorization is a class of collaborative filtering algorithms used in recommender systems. Matrix factorization algorithms work by decomposing the user-item interaction matrix. 

1. **Load the data**
- data is provided in a dataframe where each row is a review

2. **Create a user-item matrix**
- convert dataframe into user-item matrix where each row is a user and each column is an item

3. **Create test and train set**
- hide $N$ ratings for each user in the training set and use them to test the performance of the model
- Typically, a certain percentage of ratings for each user are masked in the training set and used for testing the model's performance.

4. **Apply Non-negative Matrix Factorization (NMF)**

    1.  Decompose the user-item interaction matrix into two non-negative matrices: a user matrix and an item matrix.
    2. Minimize the reconstruction error between the original matrix and the product of the decomposed matrices using optimization techniques like gradient descent.

5. **Make predictions**
- For each user-item pair in the test set, predict the rating by reconstructing the original rating matrix using the decomposed user and item matrices.
- The predicted rating is obtained by taking the dot product of the corresponding user and item latent factor vectors.

6. **Evaluate the model**
- Calculate the predictive accuracy of the model using various evaluation metrics such as Root Mean Squared Error (RMSE), Mean Squared Error (MSE), and Mean Absolute Error (MAE).
- Additionally, assess the Top-N recommendation performance of the model using metrics like Normalized Discounted Cumulative Gain (NDCG) and Hit Rate.


## (1) Manaul / From Fundamentals

In [222]:
%reset -f

# load libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# ignore runtime warnings
import warnings
warnings.filterwarnings('ignore')

In [224]:
# load data
# amz_data = pd.read_csv(r'C:\Users\e1002902\Documents\GitHub Repository\Masters-Dissertation\Code\Data\set2_data_modelling.csv')
amz_data = pd.read_csv("/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/set2_data_modelling.csv")
display(amz_data.head())

# print details
print('Number of Rows: ', amz_data.shape[0])
print('Number of Columns: ', amz_data.shape[1])
print('Number of Unique Users: ', len(amz_data['reviewerID'].unique()))
print('Number of Unique Products: ', len(amz_data['asin'].unique()))
print('Fewest reviews by a reviewer:', amz_data.groupby('reviewerID')['asin'].count().min())
print('Most reviews by a reviewer:', amz_data.groupby('reviewerID')['asin'].count().max())
print("Fewest reviews per product:", amz_data.groupby('asin')['reviewerID'].count().min())
print("Most reviews per product:", amz_data.groupby('asin')['reviewerID'].count().max())

# Creating User Item Matrix =====================================================
# create user-item matrix
x = amz_data.pivot_table(index='reviewerID', columns='asin', values='overall')
x = x.fillna(0)
print("\n\n\nUser-Item Matrix")
display(x.head())
print('Shape: ', x.shape)

# scale the data
np.random.seed(2207)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x_scaled = scaler.fit_transform(x)
x_scaled = pd.DataFrame(x_scaled, index=x.index, columns=x.columns)
display(x_scaled.head())
x = x_scaled

Unnamed: 0,reviewerID,reviewTime,asin,overall,reviewText,stemmed_words_revText,lemmatised_reviewText,filtered_tokens_revText,sentiments_afinn,sentiments_bing,sentiments_vader
0,A14638TGYH7GD9,2010-10-28,321719816,5.0,even though i use dreamweaver a great deal and...,even though use dreamweav great deal sever boo...,even though use dreamweaver great deal several...,even though use dreamweaver great deal several...,20,11,0.99
1,A2JMJVNTBL7K7E,2011-04-07,321719816,5.0,i spent several hours on the lesson and i love...,spent sever hour lesson love detail clear inst...,spent several hour lesson love detailed clear ...,spent several hours lesson love detailed clear...,19,8,0.9766
2,A2BVNVJOFXGZUB,2010-09-26,321719816,5.0,the video is wellpaced and delivered in an und...,video wellpac deliv understand manner allow wo...,video wellpaced delivered understandable manne...,video wellpaced delivered understandable manne...,3,3,0.4939
3,A14JBDSWKPKTZA,2011-01-08,321719816,5.0,i have had dreamweaver mx2004 since it came ou...,dreamweav mx2004 sinc came back spent year fee...,dreamweaver mx2004 since came back spent year ...,dreamweaver mx2004 since came back spent years...,12,13,0.989
4,ACJT8MUC0LRF0,2010-10-16,321719816,5.0,if youve been wanting to learn how to create y...,youv want learn creat websit either lack confi...,youve wanting learn create website either lack...,youve wanting learn create website either lack...,39,18,0.9995


Number of Rows:  256725
Number of Columns:  11
Number of Unique Users:  11675
Number of Unique Products:  10487
Fewest reviews by a reviewer: 12
Most reviews by a reviewer: 365
Fewest reviews per product: 12
Most reviews per product: 266



User-Item Matrix


asin,0321719816,0763855553,076780192X,0767824571,0767827759,0767834739,0768881714,0782010792,0783239408,0788857746,...,B01HD8OXO0,B01HD8OYSK,B01HDW58I6,B01HE0W2WC,B01HGBAFNC,B01HGD8OYM,B01HGSJPMW,B01HHVVLGQ,B01HHVWWMI,B01HIZF7XE
reviewerID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
A0380485C177Q6QQNJIX,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A0685888WB02Q69S553P,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A1004703RC79J9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A100JCBNALJFAW,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A100RH4M1W1DF0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Shape:  (11675, 10487)


asin,0321719816,0763855553,076780192X,0767824571,0767827759,0767834739,0768881714,0782010792,0783239408,0788857746,...,B01HD8OXO0,B01HD8OYSK,B01HDW58I6,B01HE0W2WC,B01HGBAFNC,B01HGD8OYM,B01HGSJPMW,B01HHVVLGQ,B01HHVWWMI,B01HIZF7XE
reviewerID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
A0380485C177Q6QQNJIX,-0.031768,-0.04016,-0.032036,-0.036761,-0.03221,-0.043036,-0.037939,-0.040333,-0.040791,-0.032207,...,-0.035657,-0.033146,-0.033116,-0.036737,-0.051773,-0.033258,-0.033082,-0.054567,-0.038109,-0.035904
A0685888WB02Q69S553P,-0.031768,-0.04016,-0.032036,-0.036761,-0.03221,-0.043036,-0.037939,-0.040333,-0.040791,-0.032207,...,-0.035657,-0.033146,-0.033116,-0.036737,-0.051773,-0.033258,-0.033082,-0.054567,-0.038109,-0.035904
A1004703RC79J9,-0.031768,-0.04016,-0.032036,-0.036761,-0.03221,-0.043036,-0.037939,-0.040333,-0.040791,-0.032207,...,-0.035657,-0.033146,-0.033116,-0.036737,-0.051773,-0.033258,-0.033082,-0.054567,-0.038109,-0.035904
A100JCBNALJFAW,-0.031768,-0.04016,-0.032036,-0.036761,-0.03221,-0.043036,-0.037939,-0.040333,-0.040791,-0.032207,...,-0.035657,-0.033146,-0.033116,-0.036737,-0.051773,-0.033258,-0.033082,-0.054567,-0.038109,-0.035904
A100RH4M1W1DF0,-0.031768,-0.04016,-0.032036,-0.036761,-0.03221,-0.043036,-0.037939,-0.040333,-0.040791,-0.032207,...,-0.035657,-0.033146,-0.033116,-0.036737,-0.051773,-0.033258,-0.033082,-0.054567,-0.038109,-0.035904


### Train and Test Split

In [226]:
# create a copy of the original matrix to store hidden ratings
x_hidden = x.copy()
indices_tracker = []

# number of products to hide for each user
N = 3

# identifies rated items and randomly selects N products to hide ratings for each user
np.random.seed(2207)  # You can use any integer value as the seed
for user_id in range(x_hidden.shape[0]):
    rated_products = np.where(x_hidden.iloc[user_id, :] > 0)[0]
    # print("User:", user_id)
    # print("Indices of Rated Products:", rated_products)
    hidden_indices = np.random.choice(rated_products, N, replace=False)
    indices_tracker.append(hidden_indices)
    # print("Indices to Hide:", hidden_indices, "\n")
    x_hidden.iloc[user_id, hidden_indices] = 0

In [227]:
# check tracker - all hidden ratings 
indices_tracker = pd.DataFrame(indices_tracker).to_numpy()

# flattened
indices_tracker_flat = indices_tracker.flatten()

### Decomposition, Optimisation and Prediction

In [228]:
class NMF:
    def __init__(self, n_factors, max_iter=100, learning_rate=0.01, reg_param=0.1, reg_param_bias=0.1, seed=2207):
        """
        Non-negative Matrix Factorization (NMF) with Stochastic Gradient Descent (SGD) and L2 regularization.

        Parameters:
        - n_factors: int, number of latent factors (dimensionality of the latent space).
        - max_iter: int, maximum number of iterations for optimization.
        - learning_rate: float, learning rate for SGD.
        - reg_param: float, regularization parameter to avoid overfitting.
        """
        self.n_factors = n_factors
        self.max_iter = max_iter
        self.learning_rate = learning_rate
        self.reg_param = reg_param
        self.reg_param_bias = reg_param_bias
        self.seed = seed
    
    def fit(self, X):
        """
        Fit the NMF model to the input matrix X using SGD.

        Parameters:
        - X: 2D array, input matrix where each row represents a user and each column represents an item.
        """
        # Initialize user and item factors and biases randomly
        np.random.seed(self.seed)
        self.n_users, self.n_items = X.shape
        self.user_factors = np.random.rand(self.n_users, self.n_factors)
        self.item_factors = np.random.rand(self.n_items, self.n_factors)
        self.user_bias = np.random.rand(self.n_users, 1)
        self.item_bias = np.random.rand(self.n_items, 1)
        
        # SGD optimization
        for epoch in range(self.max_iter):
            for i in range(self.n_users):
                for j in range(self.n_items):
                    if X[i][j] > 0:
                        # Calculate error
                        pred = np.dot(self.user_factors[i], self.item_factors[j]) + self.user_bias[i] + self.item_bias[j]
                        error = X[i][j] - pred
                        # Compute gradients for user and item factors and biases
                        grad_u = -2 * error * self.item_factors[j] + 2 * self.reg_param * self.user_factors[i]
                        grad_v = -2 * error * self.user_factors[i] + 2 * self.reg_param * self.item_factors[j]
                        grad_ub = -2 * error + 2 * self.reg_param_bias * self.user_bias[i]
                        grad_vb = -2 * error + 2 * self.reg_param_bias * self.item_bias[j]
                        # Update user and item factors and biases
                        self.user_factors[i] -= self.learning_rate * grad_u
                        self.item_factors[j] -= self.learning_rate * grad_v
                        self.user_bias[i] -= self.learning_rate * grad_ub
                        self.item_bias[j] -= self.learning_rate * grad_vb

    def predict(self):
        """
        Predict the ratings for user-item pairs.

        Returns:
        - pred_matrix: 2D array, predicted ratings matrix.
        """
        return np.dot(self.user_factors, self.item_factors.T)

    def get_user_factors(self):
        """
        Get the learned user factors.

        Returns:
        - user_factors: 2D array, learned user factors.
        """
        return self.user_factors

    def get_item_factors(self):
        """
        Get the learned item factors.

        Returns:
        - item_factors: 2D array, learned item factors.
        """
        return self.item_factors


In [229]:
# Fit the NMF model to the user-item matrix
nmf = NMF(n_factors=2, max_iter=100, learning_rate=0.01, reg_param=0.1, reg_param_bias=0.1, seed=2207)

# Fit the model to the user-item matrix with hidden ratings
nmf.fit(x_hidden.to_numpy())

# Predict the ratings for user-item pairs
pred_matrix = nmf.predict()
pred_matrix = pd.DataFrame(pred_matrix, index=x_hidden.index, columns=x_hidden.columns)
display(pred_matrix.head(15))

asin,0321719816,0763855553,076780192X,0767824571,0767827759,0767834739,0768881714,0782010792,0783239408,0788857746,...,B01HD8OXO0,B01HD8OYSK,B01HDW58I6,B01HE0W2WC,B01HGBAFNC,B01HGD8OYM,B01HGSJPMW,B01HHVVLGQ,B01HHVWWMI,B01HIZF7XE
reviewerID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
A0380485C177Q6QQNJIX,20.25884,16.59452,25.095586,10.4837,6.879355,24.028779,18.425988,20.985803,19.766092,17.771771,...,19.536111,29.181921,24.391538,20.881396,11.633248,21.99416,25.631639,10.463764,22.434826,22.551192
A0685888WB02Q69S553P,32.150804,23.706779,44.212074,21.564571,8.567304,40.95352,30.161937,32.360235,27.932287,27.205046,...,21.095512,49.269109,44.17764,35.011349,19.110324,36.250404,39.947688,18.84685,29.677291,38.722778
A1004703RC79J9,19.785235,17.675438,22.058514,7.485602,8.031773,21.891415,17.481235,21.022827,21.224229,17.914393,...,24.615862,26.847233,20.765839,19.346915,10.999015,20.728083,25.440195,8.967032,25.222066,20.384565
A100JCBNALJFAW,13.080437,8.792527,19.409668,10.371246,2.723404,17.576244,12.569584,12.859424,10.249699,10.744375,...,5.36941,21.004016,19.746836,14.851498,7.98522,15.184731,16.015915,8.394467,10.152062,16.705706
A100RH4M1W1DF0,21.145159,16.019127,28.364542,13.381506,6.016804,26.476024,19.687516,21.436457,18.929551,18.054822,...,15.485552,31.922709,28.165794,22.721984,12.463172,23.62258,26.391741,12.030913,20.482192,24.990324
A100UD67AHFODS,10.610515,13.476218,5.161468,-3.47726,7.88099,7.452332,7.976198,12.710018,16.607658,11.125908,...,28.267271,9.900914,2.821628,7.528133,4.912756,9.070019,14.752816,1.401551,22.538249,6.470599
A100V5QEICGPDA,18.971696,14.930181,24.518773,10.960926,5.896896,23.156459,17.468753,19.433354,17.712799,16.410874,...,15.995632,28.014116,24.110745,19.989228,11.044581,20.909024,23.833794,10.318944,19.634099,21.799249
A100WO06OQR8BQ,12.659361,5.247464,24.226644,16.151227,-0.28069,20.509567,13.306425,11.273726,5.655349,9.159113,...,-7.098718,23.997676,25.896723,16.697064,8.5327,16.365732,14.594786,10.904911,2.470654,19.808752
A100ZQDV7L8PVV,12.339668,10.672163,14.344126,5.327734,4.694853,14.030478,11.025765,12.985209,12.777413,11.039247,...,14.026932,17.139726,13.682775,12.3168,6.946606,13.10772,15.768932,5.892337,14.937644,13.10596
A1017UZIPW58F4,14.904414,15.299557,13.306339,1.919575,7.824643,14.362301,12.474364,16.549539,18.582702,14.24908,...,26.023299,17.991733,11.515065,13.16061,7.796223,14.598825,19.715253,5.063312,23.474259,13.140972


### Grid Search

In [231]:
# create params to tune
n_factors = [2, 25, 55]
learning_rate = [0.01, 0.05, 0.1]
reg_param = [0.1, 0.5, 1]
reg_param_bias= [0.01, 0.1, 1]

# create a list to store the results
results = []

# loop through the parameters
for n in n_factors:
    for l in learning_rate:
        for r in reg_param:
            for rb in reg_param_bias:
                # Fit the NMF model to the user-item matrix
                nmf = NMF(n_factors=n, max_iter=100, learning_rate=l, reg_param=r, seed=2207, reg_param_bias=rb)
                nmf.fit(x_hidden.to_numpy())
                # Predict the ratings for user-item pairs
                pred_matrix = nmf.predict()
                pred_matrix = pd.DataFrame(pred_matrix, index=x_hidden.index, columns=x_hidden.columns)
                # calculate RMSE
                rmse = np.sqrt(np.mean((x_hidde
                n.to_numpy() - pred_matrix.to_numpy())**2))
                results.append([n, l, r, rb, rmse])

# create a dataframe to store the results
results = pd.DataFrame(results, columns=['n_factors', 'learning_rate', 'reg_param', 'reg_param_bias', 'rmse'])

# find the best parameters
best_params = results.loc[results['rmse'].idxmin()]
print("Best Parameters:\n", best_params.round(2))

In [None]:
# re-fit the model with the best parameters
nmf = NMF(n_factors=int(best_params['n_factors']), max_iter=100, learning_rate=best_params['learning_rate'], reg_param=best_params['reg_param'], seed=2207, reg_param_bias=best_params['reg_param_bias'])
nmf.fit(x_hidden.to_numpy())

# Predict the ratings for user-item pairs
pred_matrix = nmf.predict()
pred_matrix = pd.DataFrame(pred_matrix, index=x_hidden.index, columns=x_hidden.columns)
display(pred_matrix.head(15))

Unnamed: 0_level_0,item1,item2,item3,item4,item5,item6,item7,item8,item9,item10,...,item21,item22,item23,item24,item25,item26,item27,item28,item29,item30
user,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.625112,0.009861,0.316083,-0.2134,-0.165897,-0.315148,-0.217095,-0.394489,-0.244454,-0.703293,...,-0.15852,-0.044007,-0.392846,0.148956,0.547623,-0.153336,-0.078864,-0.075624,0.164361,0.095196
2,-0.04769,0.040538,0.479887,0.736737,0.011362,-0.389337,-0.159143,0.064065,-1.158809,-0.095251,...,-0.862588,-0.56887,-0.303537,-0.23026,-0.144245,-0.389,1.224393,1.304839,0.395963,0.19072
3,0.59603,-0.03128,0.403804,0.240734,0.190859,-0.122587,-0.292307,-0.867386,-0.370926,-0.13963,...,-0.759229,0.012704,-0.905726,-0.422056,0.523618,-0.001736,-0.066654,0.344451,-0.105188,0.194731
4,-0.369195,0.188511,-0.068377,-0.653124,-0.186243,-0.019262,-0.02954,0.123114,0.916939,0.076453,...,1.829128,-0.497116,0.190787,1.043101,-0.078348,0.010888,-0.210572,-0.400024,-0.067231,-0.276248
5,-0.546848,0.630229,-0.141107,-0.117145,0.591202,0.035245,0.147768,0.268756,-0.126203,-0.457155,...,0.063682,0.245712,0.582064,-0.203379,-0.408262,0.125858,-0.270432,-0.507987,-0.068962,0.075087
6,0.126681,-0.018863,0.031723,-0.122559,-0.121679,0.093595,-0.234619,0.153773,0.019119,-0.139028,...,-0.07994,0.17999,-0.33775,-0.128705,0.289251,-0.089321,-0.069139,0.096313,-0.014191,-0.010336
7,-0.168611,-0.012561,-0.055855,0.037247,0.011611,-0.011057,0.230726,0.115563,-0.169624,0.011657,...,0.031557,-0.110647,0.476628,-0.043937,-0.252005,0.134317,0.090027,-0.152478,0.035613,-0.045713
8,0.159636,-0.241015,0.031382,-0.065447,-0.350379,0.026628,-0.043825,0.12549,0.17282,-0.030925,...,-0.070137,0.226028,-0.210459,-0.006168,0.255822,0.026588,-0.093481,-0.004839,-0.038409,-0.249603
9,0.254722,-0.277106,0.595996,0.516743,0.123691,-0.897677,-0.13968,0.05299,-0.604986,-0.032058,...,-0.552628,-0.309406,-0.628303,-0.093919,-0.058692,-0.178327,0.752858,1.343118,0.363823,0.408362
10,-0.018798,0.192662,0.144962,0.749469,0.012956,-0.067644,-0.016769,-0.09767,-0.439068,-0.003532,...,-0.705431,-0.173525,0.008853,-0.063951,-0.101852,-0.211493,0.396686,0.363593,0.243213,-0.038785


### Evaluation (Predictive Accuracy)

Now evaluate how good the predictions are vs the hidden ratings
- ***step 1***: identify the hidden ratings indices
- ***step 2***: extract hidden ratings indices and corresponding predicted ratings indices
- ***step 3***: calculate MAE, MSE and RMSE (take the hidden ratings as the true values and the predicted ratings as the predicted values

In [None]:
# step 1: identify the hidden ratings indices = indices_tracker and get the hidden ratings ==========================================================================
hidden_ratings_ind = indices_tracker.copy()

# Loop through users to append hidden ratings
hidden_ratings_arrays = []

# Loop through users to append hidden ratings arrays
for user in range(x.shape[0]):
    user_hidden_ratings = x.iloc[user, hidden_ratings_ind[user, :]].reset_index(drop=True).values
    hidden_ratings_arrays.append(user_hidden_ratings)


hidden_ratings_array = pd.DataFrame(hidden_ratings_arrays).to_numpy().flatten()

# step 2: extract corresponding predicted ratings indices ==========================================================================

# Create an empty list to store predicted ratings arrays
predicted_ratings_arrays = []

# Loop through users to append predicted ratings arrays
for user in range(pred_matrix.shape[0]):
    user_predicted_ratings = pred_matrix.iloc[user, hidden_ratings_ind[user, :]].reset_index(drop=True).values
    predicted_ratings_arrays.append(user_predicted_ratings)

predicted_ratings_array = pd.DataFrame(predicted_ratings_arrays).to_numpy().flatten()

# step 3: calculate MAE, MSE and RMSE (take the hidden ratings as the true values and the predicted ratings as the predicted values) ==========================================================================

from sklearn.metrics import mean_absolute_error, mean_squared_error

# calculate MAE, MSE and RMSE
print("Using sklearn")
mae = mean_absolute_error(hidden_ratings_array, predicted_ratings_array)
mse = mean_squared_error(hidden_ratings_array, predicted_ratings_array)
rmse = np.sqrt(mse)

print(f"Mean Absolute Error (MAE): {mae}")
print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")


# Manually
print("\n\nManually")
mae = np.mean(np.abs(hidden_ratings_array - predicted_ratings_array)) # Calculate Mean Absolute Error (MAE)
mse = np.mean((hidden_ratings_array - predicted_ratings_array) ** 2) # Calculate Mean Squared Error (MSE)
rmse = np.sqrt(mse) # Calculate Root Mean Squared Error (RMSE)


print(f"Mean Absolute Error (MAE): {mae}")
print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")

Using sklearn
Mean Absolute Error (MAE): 1.5809247696644346
Mean Squared Error (MSE): 3.46668957423759
Root Mean Squared Error (RMSE): 1.861904824161963


Manually
Mean Absolute Error (MAE): 1.5809247696644346
Mean Squared Error (MSE): 3.46668957423759
Root Mean Squared Error (RMSE): 1.861904824161963


In [None]:
# round to 2 decimal places
mae = round(mae, 3)
mse = round(mse, 3)
rmse = round(rmse, 3)

# Save the results to a csv file
results = pd.DataFrame({'MAE': [mae], 'MSE': [mse], 'RMSE': [rmse]})
# results.to_csv(r'C:\Users\e1002902\Documents\GitHub Repository\Masters-Dissertation\Code\Data\results_NMF.csv', index=False)
results.to_csv('/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/Results/MF_results_1.csv', index=False)

## (2) Using Packages

In [None]:
# load libraries
%reset -f
import numpy as np
import pandas as pd

# import NMF from scikit-learn
from sklearn.decomposition import NMF


In [None]:
#  use the NMF model from scikit learn to fit the user-item matrix
n_components = 5  # Number of components (we can adjust this)
model = NMF(n_components=n_components, init='random', random_state=2207)

# Fit the model to your data
W = model.fit_transform(X)  # Transformed data (basis vectors)
H = model.components_  # Components matrix
        
# reconstruct the matrix
R_pred = np.dot(W, H)
R_pred = pd.DataFrame(R_pred, index=x.index, columns=x.columns)
display(R_pred.head())

In [None]:
# GRID SEARCH
# create params to tune
n_factors = [2, 10, 25, 55, 95, 125, 250]
learning_rate = [0.01, 0.05, 0.1]
reg_param = [0.1, 0.5, 1]

# create a list to store the results
results = []

# loop through the parameters
for n in n_factors:
    for l in learning_rate:
        for r in reg_param:
            # Fit the NMF model to the user-item matrix
            nmf = NMF(n_factors=n, alpha=l, l1_ratio=r, random_state=2207, max_iter=100)
            W = model.fit_transform(x_hidden.to_numpy)  # Transformed data (basis vectors)
            H = model.components_  # Components matrix
            R_pred = np.dot(W, H)
            R_pred = pd.DataFrame(R_pred, index=x_hidden.index, columns=x_hidden.columns)
            # calculate RMSE
            rmse = np.sqrt(np.mean((x_hidden.to_numpy() - pred_matrix.to_numpy())**2))
            results.append([n, l, r, rmse])

# create a dataframe to store the results
results = pd.DataFrame(results, columns=['n_factors', 'learning_rate', 'reg_param', 'rmse'])

# find the best parameters
best_params = results.loc[results['rmse'].idxmin()]
print("Best Parameters:\n", best_params.round(2))


In [None]:
# step 1: identify the hidden ratings indices = indices_tracker and get the hidden ratings ==========================================================================
hidden_ratings_ind = indices_tracker.copy()

# Loop through users to append hidden ratings
hidden_ratings_arrays = []

# Loop through users to append hidden ratings arrays
for user in range(x.shape[0]):
    user_hidden_ratings = x.iloc[user, hidden_ratings_ind[user, :]].reset_index(drop=True).values
    hidden_ratings_arrays.append(user_hidden_ratings)


hidden_ratings_array = pd.DataFrame(hidden_ratings_arrays).to_numpy().flatten()
print("Hidden Ratings:", hidden_ratings_array)

# step 2: extract corresponding predicted ratings indices ==========================================================================

# Create an empty list to store predicted ratings arrays
predicted_ratings_arrays = []

# Loop through users to append predicted ratings arrays
for user in range(nR.shape[0]):
    user_predicted_ratings = nR.iloc[user, hidden_ratings_ind[user, :]].reset_index(drop=True).values
    predicted_ratings_arrays.append(user_predicted_ratings)

predicted_ratings_array = pd.DataFrame(predicted_ratings_arrays).to_numpy().flatten()
print("Corresponding Predicted Ratings:", predicted_ratings_array)

# step 3: calculate MAE, MSE and RMSE (take the hidden ratings as the true values and the predicted ratings as the predicted values) ==========================================================================

from sklearn.metrics import mean_absolute_error, mean_squared_error

# calculate MAE, MSE and RMSE
print("Using sklearn")
mae = mean_absolute_error(hidden_ratings_array, predicted_ratings_array)
mse = mean_squared_error(hidden_ratings_array, predicted_ratings_array)
rmse = np.sqrt(mse)

print(f"Mean Absolute Error (MAE): {mae}")
print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")


# Manually
print("\n\nManually")
mae = np.mean(np.abs(hidden_ratings_array - predicted_ratings_array)) # Calculate Mean Absolute Error (MAE)
mse = np.mean((hidden_ratings_array - predicted_ratings_array) ** 2) # Calculate Mean Squared Error (MSE)
rmse = np.sqrt(mse) # Calculate Root Mean Squared Error (RMSE)


print(f"Mean Absolute Error (MAE): {mae}")
print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")

In [None]:
# save results to csv
results = pd.DataFrame({'MAE': [mae.round(3)], 'MSE': [mse.round(3)], 'RMSE': [rmse.round(3)]})
results.to_csv("Data/Results/MF_results_3.csv", index=False)

## (3) Using Suprise Test and Train Split

In [36]:
# load libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# load data
# amz_data = pd.read_csv(r'C:\Users\e1002902\Documents\GitHub Repository\Masters-Dissertation\Code\Data\set2_data_modelling.csv')
amz_data = pd.read_csv("/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/set2_data_modelling.csv", index_col=0)

# print details
print('Number of Rows: ', amz_data.shape[0])
print('Number of Columns: ', amz_data.shape[1])
print('Number of Unique Users: ', len(amz_data['reviewerID'].unique()))
print('Number of Unique Products: ', len(amz_data['asin'].unique()))
print('Fewest reviews by a reviewer:', amz_data.groupby('reviewerID')['asin'].count().min())
print('Most reviews by a reviewer:', amz_data.groupby('reviewerID')['asin'].count().max())
print("Fewest reviews per product:", amz_data.groupby('asin')['reviewerID'].count().min())
print("Most reviews per product:", amz_data.groupby('asin')['reviewerID'].count().max())

# Creating User Item Matrix =====================================================
# create user-item matrix
x = amz_data.pivot_table(index='reviewerID', columns='asin', values='overall')
x = x.fillna(0)
print("\n\n\nUser-Item Matrix")
display(x.head())
print('Shape: ', x.shape)

Unnamed: 0,reviewerID,reviewTime,asin,overall,reviewText,stemmed_words_revText,lemmatised_reviewText,filtered_tokens_revText,sentiments_afinn,sentiments_bing,sentiments_vader
76,AQ8OO59DJFJNZ,2018-01-05,767834739,5.0,wonderful movie,wonder movi,wonderful movie,wonderful movie,4,1,0.5719
78,A244CRJ2QSVLZ4,2008-01-29,767834739,5.0,resident evil is a great science fictionhorror...,resid evil great scienc fictionhorror hybrid p...,resident evil great science fictionhorror hybr...,resident evil great science fictionhorror hybr...,-12,-5,-0.9455
81,A1VCLTAGM5RLND,2005-07-23,767834739,5.0,i this movie has people living and working und...,movi peopl live work underground place call hi...,movie people living working underground place ...,movie people living working underground place ...,-1,0,-0.1806
82,A119Q9NFGVOEJZ,2016-02-13,767834739,5.0,every single video game based movie from the s...,everi singl video game base movi super mario b...,every single video game based movie super mari...,every single video game based movie super mari...,18,6,0.9846
83,A1RP6YCOS5VJ5I,2006-09-26,767834739,5.0,i think that i like this movie more than the o...,think like movi origin origin still great real...,think like movie original original still great...,think like movie original original still great...,29,10,0.9951


Number of Rows:  83139
Number of Columns:  11
Number of Unique Users:  3668
Number of Unique Products:  3249
Fewest reviews by a reviewer: 13
Most reviews by a reviewer: 193
Fewest reviews per product: 13
Most reviews per product: 189



User-Item Matrix


asin,0767834739,7799146915,B00000DMAT,B00000DMAX,B00000DMB3,B00000F1GM,B00000I1BJ,B00000I1BY,B00000ID61,B00000INR2,...,B01H353FLA,B01H353HUY,B01H3VFR6U,B01H5GB8ZW,B01H6OXQFS,B01H9SH2LU,B01HGBAFNC,B01HHVVLGQ,B01HHVWWMI,B01HIZF7XE
reviewerID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
A100RH4M1W1DF0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A100WO06OQR8BQ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0
A1027EV8A9PV1O,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A103KKI1Y4TFNQ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A1047P9FLHTDZJ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Shape:  (3668, 3249)


### Train and Test Split

In [None]:
# using created testset from packages chapter
ratings = x.stack().reset_index()
ratings.columns = ['user', 'item', 'rating']
ratings = ratings[ratings['rating'] != 0]
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(ratings, reader)
trainset, testset = train_test_split(data, test_size=.25, random_state=2207)
testset_df = pd.DataFrame(testset)
testset_df = testset_df


# convert each row of the testset to a tuple
testset_tuples = [tuple(x) for x in testset_df[[0, 1]].to_numpy()]

# find indices of the testset in the original matrix
testset_indices = []
for i in range(len(testset_tuples)):
    user = testset_tuples[i][0]
    item = testset_tuples[i][1]
    user_index = x.index.get_loc(user)
    item_index = x.columns.get_loc(item)
    testset_indices.append((user_index, item_index))

# shorten the testset_indices to 100
testset_indices = testset_indices
print("Testset Indices: ")
testset_indices[0:5]

In [None]:
# load hidden ratings matrix
x_hidden = pd.read_csv('/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/suprise_hidden_ratings_matrix.csv')

In [None]:
#  get predicted ratings for the testset
predicted_ratings = []
for i in range(len(testset_indices)):
    user_id = testset_indices[i][0]
    item_id = testset_indices[i][1]
    predicted_ratings.append(predic_matrix.iloc[user_id, item_id])

print("Predicted Ratings:")
print(predicted_ratings)

# get actual ratings for the testset
print("\nActual Ratings:")
actual_ratings = testset_df[2].to_list()
print(actual_ratings)

In [None]:
# calculate MAE, MSE and RMSE
from sklearn.metrics import mean_absolute_error, mean_squared_error
print("Using sklearn")
mae = mean_absolute_error(actual_ratings, predicted_ratings)
mse = mean_squared_error(actual_ratings, predicted_ratings)
rmse = np.sqrt(mse)

print(f"Mean Absolute Error (MAE): {mae.round(2)}")
print(f"Mean Squared Error (MSE): {mse.round(2)}")
print(f"Root Mean Squared Error (RMSE): {rmse.round(2)}")


# Manually
print("\n\nManually")

# calculate MAE, MSE and RMSE using actual and predicted ratings
mae = np.mean(np.abs(np.array(actual_ratings) - np.array(predicted_ratings))) # Calculate Mean Absolute Error (MAE)
mse = np.mean((np.array(actual_ratings) - np.array(predicted_ratings)) ** 2) # Calculate Mean Squared Error (MSE)
rmse = np.sqrt(mse) # Calculate Root Mean Squared Error (RMSE)

print(f"Mean Absolute Error (MAE): {mae.round(2)}")
print(f"Mean Squared Error (MSE): {mse.round(2)}")
print(f"Root Mean Squared Error (RMSE): {rmse.round(2)}")

In [None]:
# save results to csv
results = pd.DataFrame({'MAE': [mae.round(3)], 'MSE': [mse.round(3)], 'RMSE': [rmse.round(3)]})
results.to_csv("Data/Results/MF_results_3.csv", index=False)


# Sandbox

In [35]:
# Creating Dummy User-Item Matrix =====================================================
import pandas as pd
import numpy as np

# Number of users and items
num_users = 50
num_items = 30
min_ratings_per_user = 5
max_ratings_per_user = 10

# Generate random ratings for the user-item matrix - each user has rated only between 5 and 10 items
ratings = np.zeros((num_users, num_items))

for user_index in range(num_users):
    num_ratings = np.random.randint(min_ratings_per_user, max_ratings_per_user + 1)
    item_indices = np.random.choice(num_items, num_ratings, replace=False)
    ratings[user_index, item_indices] = np.random.randint(1, 6, size=num_ratings)

# Create a DataFrame for the user-item ratings
user_item_ratings_df = pd.DataFrame(ratings, columns=[f"item{i+1}" for i in range(num_items)])
user_item_ratings_df.index += 1
user_item_ratings_df.index.name = 'user'

# Save the user-item ratings matrix to a CSV file
user_item_ratings_df.to_csv('../Code/Data/user_item_ratings_matrix.csv')

# Print out the first few rows of the user-item ratings matrix
user_item_ratings_df

Unnamed: 0_level_0,item1,item2,item3,item4,item5,item6,item7,item8,item9,item10,...,item21,item22,item23,item24,item25,item26,item27,item28,item29,item30
user,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,5.0,4.0,0.0,0.0,0.0,0.0,0.0
2,0.0,3.0,0.0,0.0,3.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,4.0,4.0,5.0,0.0
3,5.0,0.0,3.0,0.0,0.0,0.0,5.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,2.0
4,0.0,2.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,...,5.0,0.0,0.0,4.0,4.0,0.0,0.0,0.0,3.0,0.0
5,0.0,0.0,0.0,5.0,0.0,3.0,0.0,0.0,1.0,0.0,...,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
6,0.0,0.0,0.0,0.0,0.0,1.0,0.0,2.0,0.0,0.0,...,0.0,0.0,1.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,2.0,0.0,0.0,5.0,0.0,0.0,1.0,0.0,...,4.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,3.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,2.0,0.0,0.0,3.0,0.0,0.0,0.0,1.0,0.0
9,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,5.0,2.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,3.0
10,4.0,3.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0


In [21]:
import numpy as np

# Data with some missing ratings represented by 0s
ratings = np.array([
    [5, 0, 3, 0],
    [4, 4, 0, 2],
    [0, 3, 0, 0],
    [2, 0, 0, 4],
    [0, 1, 5, 0]
])

class FunkSVD:
    def __init__(self, num_factors, learning_rate, regularization):
        self.num_factors = num_factors
        self.learning_rate = learning_rate
        self.regularization = regularization

    def fit(self, X, num_iterations):
        self.num_users, self.num_items = X.shape

        # Initialize user and item matrices randomly
        self.user_vectors = np.random.randn(self.num_users, self.num_factors)
        self.item_vectors = np.random.randn(self.num_items, self.num_factors)

        for i in range(num_iterations):
            # Update user and item matrices using gradient descent
            for u in range(self.num_users):
                for i in range(self.num_items):
                    if X[u, i] != 0:
                        prediction = np.dot(self.user_vectors[u], self.item_vectors[i])
                        error = X[u, i] - prediction

                        self.user_vectors[u] += self.learning_rate * (error * self.item_vectors[i] - self.regularization * self.user_vectors[u])
                        self.item_vectors[i] += self.learning_rate * (error * self.user_vectors[u] - self.regularization * self.item_vectors[i])

    def predict(self, X):
        # Predict ratings for all users and items
        return np.dot(self.user_vectors, self.item_vectors.T)

# Initialize the FunkSVD model with parameters
num_factors = 2
learning_rate = 0.01
regularization = 0.1
num_iterations = 1000

model = FunkSVD(num_factors, learning_rate, regularization)

# Train the model using the observed ratings in the matrix
model.fit(ratings, num_iterations)

# Get predicted ratings
predicted_ratings = model.predict(ratings)

# Print the predicted ratings
print("Predicted Ratings Matrix:\n", predicted_ratings)


Predicted Ratings Matrix:
 [[ 4.78496146  4.56381781  2.96756047  2.6099398 ]
 [ 3.94057431  3.74974341  2.61609387  2.00440787]
 [ 3.12548658  2.91554045  3.23309482  0.61490819]
 [ 2.00552186  2.07570967 -1.97584102  3.80415768]
 [ 1.31727363  1.05442479  4.8093647  -2.64226551]]


In [None]:
def matrix_factorization_sgd(R, K, steps=50, alpha=0.001, beta=0.02, use_regularization=True, use_bias=True):
    # R = user-item ratings matrix
    # K = number of latent features
    # steps = number of iterations
    # alpha = learning rate
    # beta = bias term

    N, M = R.shape
    P = np.abs(np.random.randn(N, K))  # Initialize with non-negative values
    Q = np.abs(np.random.randn(M, K))
    counter = 0

    # Initialize bias terms
    if use_bias:
        b_u = np.zeros(N)
        b_i = np.zeros(M)
        b = np.mean(R[np.where(R != 0)])  # global bias

    for step in range(steps):
        for i in range(N):
            for j in range(M):
                if R[i][j] > 0:
                    eij = R[i][j] - np.dot(P[i, :], Q[j, :])

                    # Update P and Q
                    for k in range(K):
                        if use_regularization:
                            P[i][k] += alpha * (2 * eij * Q[j][k] - beta * P[i][k])
                            Q[j][k] += alpha * (2 * eij * P[i][k] - beta * Q[j][k])
                        else:
                            P[i][k] += alpha * (2 * eij * Q[j][k])
                            Q[j][k] += alpha * (2 * eij * P[i][k])

                    # Update bias terms
                    if use_bias:
                        b_u[i] += alpha * (eij - beta * b_u[i])
                        b_i[j] += alpha * (eij - beta * b_i[j])

        # Check for convergence within the loop
        if np.sqrt(np.sum((R - np.dot(P, Q.T))**2)) < 0.001:
            break

    # Add bias terms to the prediction
    if use_bias:
        R_pred = np.dot(P, Q.T) + b + b_u[:, np.newaxis] + b_i[np.newaxis:,]  
    else:
        R_pred = np.dot(P, Q.T)

    return P, Q, R_pred


# Use the function to reconstruct the original matrix
np.random.seed(42)
R = x_hidden.values
nP, nQ, nR_pred = matrix_factorization_sgd(R, K=2, alpha=0.001, beta=0.02, use_regularization=False, use_bias=False, steps=1000)

#  convert the reconstructed matrix to a dataframe
nR_pred = pd.DataFrame(nR_pred, columns=x_hidden.columns, index=x_hidden.index)
print("\nReconstructed Matrix as a DataFrame")
display(nR_pred.head(15))


Reconstructed Matrix as a DataFrame


Unnamed: 0_level_0,item1,item2,item3,item4,item5,item6,item7,item8,item9,item10,...,item21,item22,item23,item24,item25,item26,item27,item28,item29,item30
user,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,4.598194,2.466711,2.635027,4.427982,3.18805,2.483468,3.544901,2.603285,3.761474,2.887082,...,5.078196,3.903188,4.014551,4.160609,3.878936,3.298762,1.769712,2.534731,3.568962,2.922307
2,4.139108,-0.185499,1.661719,4.865041,-0.757941,5.462929,1.299585,1.815326,1.661141,2.604765,...,1.652384,4.965607,4.907197,2.138033,5.199372,1.307965,4.457433,2.818679,3.530592,3.072232
3,5.014395,2.328183,2.766732,4.960982,2.931087,3.193589,3.58134,2.759512,3.84257,3.149295,...,5.098919,4.474848,4.572432,4.295518,4.486836,3.347501,2.36064,2.844915,3.939817,3.253237
4,3.993806,2.103633,2.277209,3.860164,2.710429,2.209159,3.048415,2.252581,3.239212,2.507699,...,4.363582,3.413602,3.507765,3.587784,3.396664,2.838342,1.583356,2.210237,3.104991,2.545331
5,2.03123,-1.442101,0.416641,2.881169,-2.409111,4.49326,-0.424365,0.594327,-0.153377,1.281596,...,-0.828183,3.252273,3.134515,0.146697,3.510523,-0.291125,3.795976,1.684807,1.911157,1.755704
6,2.47474,1.40269,1.44034,2.355686,1.829055,1.235843,1.966906,1.417569,2.078265,1.553637,...,2.824197,2.055356,2.120244,2.289407,2.034324,1.827256,0.863032,1.347423,1.910883,1.558991
7,3.634576,2.49993,2.245223,3.299002,3.349477,1.225023,3.234512,2.178475,3.367602,2.280695,...,4.681416,2.753169,2.877473,3.6562,2.675553,2.987374,0.74385,1.880743,2.748328,2.208895
8,2.676654,1.023384,1.412102,2.728307,1.233804,1.999011,1.739231,1.424861,1.893866,1.681615,...,2.45562,2.521058,2.558681,2.146371,2.550764,1.635376,1.521287,1.567566,2.132043,1.776834
9,5.616027,4.281036,3.592705,4.944686,5.806108,1.33184,5.326645,3.457899,5.50333,3.523025,...,7.740949,4.001684,4.221333,5.928818,3.837323,4.904804,0.651449,2.812713,4.191356,3.336333
10,4.535939,2.458297,2.606727,4.358903,3.182556,2.416331,3.516547,2.573522,3.728458,2.847932,...,5.03975,3.835264,3.946767,4.120967,3.808686,3.271352,1.716008,2.494837,3.51734,2.878155


### Grid Search for Tuning

In [None]:
hidden_ratings_ind = indices_tracker.copy()
hidden_ratings_arrays = []
for user in range(x.shape[0]):
    user_hidden_ratings = x.iloc[user, hidden_ratings_ind[user, :]].reset_index(drop=True).values
    hidden_ratings_arrays.append(user_hidden_ratings)

hidden_ratings_array = pd.DataFrame(hidden_ratings_arrays).to_numpy().flatten()

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error
import itertools

# Define the hyperparameters to tune
param_grid = {
    'K': [2,5, 10, 20],         # Number of latent features
    'alpha': [0.001, 0.0001], # Learning rate
    'beta': [0.1, 0.5, 1, 2, 4, 5]    # Regularization parameter
}

# Create all possible combinations of hyperparameters
param_combinations = list(itertools.product(*param_grid.values()))

# Initialize variables to keep track of the best parameters and the best RMSE
best_params = None
best_rmse = float('inf')  # initialize with a large value
counter = 0

# Loop over each parameter combination
for params in param_combinations:
    
    # Unpack the parameters
    K, alpha, beta = params
    
    # counter
    counter += 1

    # Run matrix factorization with the current hyperparameters
    np.random.seed(42)
    print(f"Iteration {counter} of {len(param_combinations)}")
    print(f'K={K}, alpha={alpha}, beta={beta}')
    nP, nQ, nR_pred = matrix_factorization_sgd(
        R, K=K, alpha=alpha, beta=beta, use_regularization=True, use_bias=True)
    
    # Compute RMSE
    nR_pred = pd.DataFrame(nR_pred, columns=x_hidden.columns, index=x_hidden.index)
    predicted_ratings_arrays = []
    for user in range(nR_pred.shape[0]):
        user_predicted_ratings = nR_pred.iloc[user, hidden_ratings_ind[user, :]].reset_index(drop=True).values
        predicted_ratings_arrays.append(user_predicted_ratings)

    predicted_ratings_array = pd.DataFrame(predicted_ratings_arrays).to_numpy().flatten()
    rmse = np.sqrt(mean_squared_error(hidden_ratings_array, predicted_ratings_array))

    # Check if this is the best RMSE so far
    print(f"Checking RMSE: {rmse}")
    if rmse < best_rmse:
        print(f'New best RMSE: {rmse}')
        best_rmse = rmse
        best_params = params
    else :
        print("RMSE not improved")
    print("\n")

# Print the best parameters and the best RMSE
print(f'Best Parameters: {best_params}')
print(f'Best RMSE: {best_rmse}')


Iteration 1 of 18
K=2, alpha=0.001, beta=0.1
Step: 0
Step: 1
Step: 2
Step: 3
Step: 4
Step: 5
Step: 6
Step: 7
Step: 8
Step: 9
Step: 10
Step: 11
Step: 12
Step: 13
Step: 14
Step: 15
Step: 16
Step: 17
Step: 18
Step: 19
Step: 20
Step: 21
Step: 22
Step: 23
Step: 24
Step: 25
Step: 26
Step: 27
Step: 28
Step: 29
Step: 30
Step: 31
Step: 32
Step: 33
Step: 34
Step: 35
Step: 36
Step: 37
Step: 38
Step: 39
Step: 40
Step: 41
Step: 42
Step: 43
Step: 44
Step: 45
Step: 46
Step: 47
Step: 48
Step: 49
Checking RMSE: 3.3765536948832966
New best RMSE: 3.3765536948832966


Iteration 2 of 18
K=2, alpha=0.001, beta=0.5
Step: 0
Step: 1
Step: 2
Step: 3
Step: 4
Step: 5
Step: 6
Step: 7
Step: 8
Step: 9
Step: 10
Step: 11
Step: 12
Step: 13
Step: 14
Step: 15
Step: 16
Step: 17
Step: 18
Step: 19
Step: 20
Step: 21
Step: 22
Step: 23
Step: 24
Step: 25
Step: 26
Step: 27
Step: 28
Step: 29
Step: 30
Step: 31
Step: 32
Step: 33
Step: 34
Step: 35
Step: 36
Step: 37
Step: 38
Step: 39
Step: 40
Step: 41
Step: 42
Step: 43
Step: 44
Step:

In [None]:
# step 1: identify the hidden ratings indices = indices_tracker and get the hidden ratings ==========================================================================
hidden_ratings_ind = indices_tracker.copy()

# Loop through users to append hidden ratings
hidden_ratings_arrays = []

# Loop through users to append hidden ratings arrays
for user in range(x.shape[0]):
    user_hidden_ratings = x.iloc[user, hidden_ratings_ind[user, :]].reset_index(drop=True).values
    hidden_ratings_arrays.append(user_hidden_ratings)


hidden_ratings_array = pd.DataFrame(hidden_ratings_arrays).to_numpy().flatten()
print("Hidden Ratings:", hidden_ratings_array)

# step 2: extract corresponding predicted ratings indices ==========================================================================

# Create an empty list to store predicted ratings arrays
predicted_ratings_arrays = []

# Loop through users to append predicted ratings arrays
for user in range(nR_pred.shape[0]):
    user_predicted_ratings = nR_pred.iloc[user, hidden_ratings_ind[user, :]].reset_index(drop=True).values
    predicted_ratings_arrays.append(user_predicted_ratings)

predicted_ratings_array = pd.DataFrame(predicted_ratings_arrays).to_numpy().flatten()
print("Corresponding Predicted Ratings:", predicted_ratings_array)

# step 3: calculate MAE, MSE and RMSE (take the hidden ratings as the true values and the predicted ratings as the predicted values) ==========================================================================

from sklearn.metrics import mean_absolute_error, mean_squared_error

# calculate MAE, MSE and RMSE
print("Using sklearn")
mae = mean_absolute_error(hidden_ratings_array, predicted_ratings_array)
mse = mean_squared_error(hidden_ratings_array, predicted_ratings_array)
rmse = np.sqrt(mse)

print(f"Mean Absolute Error (MAE): {mae}")
print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")


# Manually
print("\n\nManually")
mae = np.mean(np.abs(hidden_ratings_array - predicted_ratings_array)) # Calculate Mean Absolute Error (MAE)
mse = np.mean((hidden_ratings_array - predicted_ratings_array) ** 2) # Calculate Mean Squared Error (MSE)
rmse = np.sqrt(mse) # Calculate Root Mean Squared Error (RMSE)


print(f"Mean Absolute Error (MAE): {mae}")
print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")

Hidden Ratings: [4. 5. 1. 3. 5. 3. 5. 2. 2. 2. 5. 4. 1. 3. 5. 1. 5. 1. 5. 4. 1. 1. 5. 2.
 5. 2. 4. 1. 4. 4. 3. 3. 2. 1. 4. 2. 5. 4. 4. 4. 3. 1. 4. 2. 2. 3. 1. 1.
 5. 4. 2. 4. 1. 2. 3. 3. 1. 2. 5. 4. 1. 1. 3. 3. 3. 3. 1. 1. 4. 5. 4. 1.
 3. 2. 1. 5. 2. 4. 2. 4. 4. 3. 3. 3. 4. 5. 5. 4. 4. 3. 1. 5. 1. 3. 3. 4.
 2. 4. 1. 2. 1. 5. 1. 3. 5. 4. 5. 5. 2. 2. 2. 4. 4. 3. 5. 5. 5. 3. 3. 2.
 4. 4. 5. 1. 2. 5. 4. 5. 5. 2. 4. 5. 1. 5. 1. 3. 1. 3. 4. 2. 3. 5. 1. 5.
 5. 5. 2. 1. 2. 5.]
Corresponding Predicted Ratings: [ 5.69516466  3.62928464  6.80235818  6.13097254  6.1438771   6.52080282
  7.73146906  5.96760307  4.5673075   6.00192727  6.66521877  6.90728708
  4.84114081  4.104792    5.66964478  4.40403173  4.24338643  3.7007963
  4.12988495  6.04134152  5.25942869  8.04830793  4.89894487  5.02895848
  3.71120641  5.15918216  5.51956847  4.78393186  4.94356061  4.81916789
  5.91814148  6.53642162  4.73827151  5.73846864  4.66894962  6.08089578
  7.35159247  6.31237508  4.79823516  5.82489225  4.1649

In [None]:
# round to 2 decimal places
mae = round(mae, 3)
mse = round(mse, 3)
rmse = round(rmse, 3)

# Save the results to a csv file
results = pd.DataFrame({'MAE': [mae], 'MSE': [mse], 'RMSE': [rmse]})
# results.to_csv(r'C:\Users\e1002902\Documents\GitHub Repository\Masters-Dissertation\Code\Data\results_NMF.csv', index=False)
results.to_csv('/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/Results/MF_results.csv', index=False)