# Non-Negative Matrix Factorisation

- Matrix Factorization is a general technique used in collaborative filtering and other applications where a matrix is decomposed into the product of two lower-rank matrices.
- Stochastic Gradient Descent is an optimization algorithm commonly used to minimize the error in the factorization process.

- The key idea is to iteratively update the elements of the factorized matrices using the gradient of the error with respect to the elements.
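
For reference, the objective and the per-rating update that this idea leads to can be written as below. This is a sketch in standard notation, assuming a squared-error loss over observed ratings with an optional L2 penalty; the symbols $p_u$, $q_i$, $\alpha$ and $\beta$ are chosen here to line up with the `P`, `Q`, `alpha` and `beta` arguments of the SGD function in this notebook.

```latex
\min_{P,\,Q}\; \sum_{(u,i):\; r_{ui} > 0}
  \Big[ \big(r_{ui} - p_u^{\top} q_i\big)^2
  + \tfrac{\beta}{2}\big(\lVert p_u \rVert^2 + \lVert q_i \rVert^2\big) \Big]

e_{ui} = r_{ui} - p_u^{\top} q_i, \qquad
p_u \leftarrow p_u + \alpha\,\big(2\,e_{ui}\,q_i - \beta\,p_u\big), \qquad
q_i \leftarrow q_i + \alpha\,\big(2\,e_{ui}\,p_u - \beta\,q_i\big)
```

Setting $\beta = 0$ gives the unregularized update.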


## Algorithm Summary

Matrix factorization is a class of collaborative filtering algorithms used in recommender systems. Matrix factorization algorithms work by decomposing the user-item interaction matrix. 

1. **Load the data**
- The data is provided in a DataFrame where each row is a review.

2. **Create a user-item matrix**
- Convert the DataFrame into a user-item matrix where each row is a user and each column is an item.

3. **Create test and train set**
- Hide $N$ ratings for each user in the training set and use them to test the performance of the model.
- More generally, a fixed percentage of each user's ratings can be masked instead; this notebook hides a fixed number, $N = 3$, per user.

4. **Apply Non-negative Matrix Factorization (NMF)**

    1.  Decompose the user-item interaction matrix into two non-negative matrices: a user matrix and an item matrix.
    2. Minimize the reconstruction error between the original matrix and the product of the decomposed matrices using optimization techniques like gradient descent.

5. **Make predictions**
- For each user-item pair in the test set, predict the rating by reconstructing the original rating matrix using the decomposed user and item matrices.
- The predicted rating is obtained by taking the dot product of the corresponding user and item latent factor vectors.

6. **Evaluate the model**
- Calculate the predictive accuracy of the model using various evaluation metrics such as Root Mean Squared Error (RMSE), Mean Squared Error (MSE), and Mean Absolute Error (MAE).
- Additionally, assess the Top-N recommendation performance of the model using metrics like Normalized Discounted Cumulative Gain (NDCG) and Hit Rate.
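
The Top-N metrics in step 6 are not computed later in this notebook, so the following is a minimal sketch of how they could be calculated. It assumes binary relevance (a held-out item counts as relevant), and that scores for items already seen in training have been masked beforehand. The function names `hit_rate_at_n` and `ndcg_at_n` and the toy arrays are illustrative, not part of the original code.

```python
import numpy as np


def hit_rate_at_n(scores, held_out, n=10):
    """Fraction of users with at least one held-out item in their top-n list."""
    hits = 0
    for user_scores, user_items in zip(scores, held_out):
        top_n = np.argsort(-user_scores)[:n]          # highest scores first
        hits += int(np.isin(top_n, user_items).any())
    return hits / len(scores)


def ndcg_at_n(scores, held_out, n=10):
    """Mean NDCG@n with binary relevance."""
    discounts = 1.0 / np.log2(np.arange(2, n + 2))    # 1 / log2(rank + 1)
    values = []
    for user_scores, user_items in zip(scores, held_out):
        top_n = np.argsort(-user_scores)[:n]
        gains = np.isin(top_n, user_items).astype(float)
        dcg = np.sum(gains * discounts[: len(top_n)])
        idcg = np.sum(discounts[: min(len(user_items), n)])  # best possible ordering
        values.append(dcg / idcg if idcg > 0 else 0.0)
    return float(np.mean(values))


# Toy example (made-up scores): 2 users, 4 items, 1 held-out item each
scores = np.array([[0.9, 0.1, 0.8, 0.3],
                   [0.2, 0.7, 0.1, 0.6]])
held_out = [np.array([2]), np.array([0])]
hr = hit_rate_at_n(scores, held_out, n=2)
ndcg = ndcg_at_n(scores, held_out, n=2)
print(hr, ndcg)
```

In the toy data the first user's held-out item is ranked second and the second user's is outside the top 2, so only the first user contributes to either metric.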


## Manual / From Fundamentals

In [39]:
%reset -f

# load libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [31]:
# load data
# amz_data = pd.read_csv(r'C:\Users\e1002902\Documents\GitHub Repository\Masters-Dissertation\Code\Data\set2_data_modelling.csv')
amz_data = pd.read_csv("/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/set2_data_modelling.csv", index_col=0)
display(amz_data.head())

# print details
print('Number of Rows: ', amz_data.shape[0])
print('Number of Columns: ', amz_data.shape[1])
print('Number of Unique Users: ', len(amz_data['reviewerID'].unique()))
print('Number of Unique Products: ', len(amz_data['asin'].unique()))
print('Fewest reviews by a reviewer:', amz_data.groupby('reviewerID')['asin'].count().min())
print('Most reviews by a reviewer:', amz_data.groupby('reviewerID')['asin'].count().max())
print("Fewest reviews per product:", amz_data.groupby('asin')['reviewerID'].count().min())
print("Most reviews per product:", amz_data.groupby('asin')['reviewerID'].count().max())

# Creating User Item Matrix =====================================================
# create user-item matrix
x = amz_data.pivot_table(index='reviewerID', columns='asin', values='overall')
x = x.fillna(0)
print("\n\n\nUser-Item Matrix")
display(x.head())
print('Shape: ', x.shape)

Unnamed: 0,reviewerID,reviewTime,asin,overall,reviewText,stemmed_words_revText,lemmatised_reviewText,filtered_tokens_revText,sentiments_afinn,sentiments_bing,sentiments_vader
76,AQ8OO59DJFJNZ,2018-01-05,767834739,5.0,wonderful movie,wonder movi,wonderful movie,wonderful movie,4,1,0.5719
78,A244CRJ2QSVLZ4,2008-01-29,767834739,5.0,resident evil is a great science fictionhorror...,resid evil great scienc fictionhorror hybrid p...,resident evil great science fictionhorror hybr...,resident evil great science fictionhorror hybr...,-12,-5,-0.9455
81,A1VCLTAGM5RLND,2005-07-23,767834739,5.0,i this movie has people living and working und...,movi peopl live work underground place call hi...,movie people living working underground place ...,movie people living working underground place ...,-1,0,-0.1806
82,A119Q9NFGVOEJZ,2016-02-13,767834739,5.0,every single video game based movie from the s...,everi singl video game base movi super mario b...,every single video game based movie super mari...,every single video game based movie super mari...,18,6,0.9846
83,A1RP6YCOS5VJ5I,2006-09-26,767834739,5.0,i think that i like this movie more than the o...,think like movi origin origin still great real...,think like movie original original still great...,think like movie original original still great...,29,10,0.9951


Number of Rows:  83139
Number of Columns:  11
Number of Unique Users:  3668
Number of Unique Products:  3249
Fewest reviews by a reviewer: 13
Most reviews by a reviewer: 193
Fewest reviews per product: 13
Most reviews per product: 189



User-Item Matrix


reviewerID,0767834739,7799146915,B00000DMAT,B00000DMAX,B00000DMB3,B00000F1GM,B00000I1BJ,B00000I1BY,B00000ID61,B00000INR2,...,B01H353FLA,B01H353HUY,B01H3VFR6U,B01H5GB8ZW,B01H6OXQFS,B01H9SH2LU,B01HGBAFNC,B01HHVVLGQ,B01HHVWWMI,B01HIZF7XE
A100RH4M1W1DF0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A100WO06OQR8BQ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0
A1027EV8A9PV1O,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A103KKI1Y4TFNQ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A1047P9FLHTDZJ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Shape:  (3668, 3249)


### Train and Test Split

In [40]:
x = pd.read_csv("/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/user_item_ratings_matrix.csv", index_col=0)
x

user,item1,item2,item3,item4,item5,item6,item7,item8,item9,item10,...,item21,item22,item23,item24,item25,item26,item27,item28,item29,item30
1,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,5.0,4.0,0.0,0.0,0.0,0.0,0.0
2,0.0,3.0,0.0,0.0,3.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,4.0,4.0,5.0,0.0
3,5.0,0.0,3.0,0.0,0.0,0.0,5.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,2.0
4,0.0,2.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,...,5.0,0.0,0.0,4.0,4.0,0.0,0.0,0.0,3.0,0.0
5,0.0,0.0,0.0,5.0,0.0,3.0,0.0,0.0,1.0,0.0,...,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
6,0.0,0.0,0.0,0.0,0.0,1.0,0.0,2.0,0.0,0.0,...,0.0,0.0,1.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,2.0,0.0,0.0,5.0,0.0,0.0,1.0,0.0,...,4.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,3.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,2.0,0.0,0.0,3.0,0.0,0.0,0.0,1.0,0.0
9,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,5.0,2.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,3.0
10,4.0,3.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0


In [41]:
# create a copy of the original matrix to store hidden ratings
x_hidden = x.copy()
indices_tracker = []

# number of products to hide for each user
N = 3

# identifies rated items and randomly selects N products to hide ratings for each user
np.random.seed(2207)  # You can use any integer value as the seed
for user_id in range(x_hidden.shape[0]):
    rated_products = np.where(x_hidden.iloc[user_id, :] > 0)[0]
    # print("User:", user_id)
    # print("Indices of Rated Products:", rated_products)
    hidden_indices = np.random.choice(rated_products, N, replace=False)
    indices_tracker.append(hidden_indices)
    # print("Indices to Hide:", hidden_indices, "\n")
    x_hidden.iloc[user_id, hidden_indices] = 0

In [42]:
# check tracker - all hidden ratings 
indices_tracker = pd.DataFrame(indices_tracker).to_numpy()
print("Indices of Ratings per user \n", indices_tracker)

# flattened
indices_tracker_flat = indices_tracker.flatten()
print("Indices of Ratings per User joined", indices_tracker_flat)

# see updated matrix with hidden ratings
print("\n\nUpdated Matrix with Hidden Ratings")
display(x_hidden)

# see original matrix
print("Original Matrix")
display(x)

Indices of Ratings per user 
 [[24 23 14]
 [ 1 28  4]
 [ 6 29 13]
 [ 1 12 24]
 [ 8  5  3]
 [10 13  5]
 [ 5 20  8]
 [28 11 21]
 [20 21 12]
 [18 10 23]
 [29 24  8]
 [12 17  3]
 [ 2 25 21]
 [24 23 15]
 [ 0 11 22]
 [26  6  5]
 [ 4 27 20]
 [22  2  0]
 [12  6 29]
 [16 13 18]
 [20 13  1]
 [ 5 15 28]
 [ 2 16 28]
 [12 29  1]
 [25 11  1]
 [13 15 28]
 [13  6  3]
 [22  9 26]
 [28  9 14]
 [21 12 25]
 [29 23 22]
 [26  1  3]
 [28 17 25]
 [ 7 15 13]
 [ 0 11  7]
 [16 10  7]
 [24  7  9]
 [23  9 18]
 [ 5 19  0]
 [14  6 16]
 [ 8 13 24]
 [28 24 29]
 [12  7 28]
 [10 14 17]
 [ 2 27 10]
 [12  4 11]
 [ 2 22 12]
 [ 5 11  1]
 [ 9 26  0]
 [29 21 18]]
Indices of Ratings per User joined [24 23 14  1 28  4  6 29 13  1 12 24  8  5  3 10 13  5  5 20  8 28 11 21
 20 21 12 18 10 23 29 24  8 12 17  3  2 25 21 24 23 15  0 11 22 26  6  5
  4 27 20 22  2  0 12  6 29 16 13 18 20 13  1  5 15 28  2 16 28 12 29  1
 25 11  1 13 15 28 13  6  3 22  9 26 28  9 14 21 12 25 29 23 22 26  1  3
 28 17 25  7 15 13  0 11  7 16 10  7 24  7

user,item1,item2,item3,item4,item5,item6,item7,item8,item9,item10,...,item21,item22,item23,item24,item25,item26,item27,item28,item29,item30
1,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,4.0,4.0,0.0,0.0
3,5.0,0.0,3.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,...,5.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,3.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,...,0.0,0.0,1.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,3.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,3.0
10,4.0,3.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Original Matrix


user,item1,item2,item3,item4,item5,item6,item7,item8,item9,item10,...,item21,item22,item23,item24,item25,item26,item27,item28,item29,item30
1,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,5.0,4.0,0.0,0.0,0.0,0.0,0.0
2,0.0,3.0,0.0,0.0,3.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,4.0,4.0,5.0,0.0
3,5.0,0.0,3.0,0.0,0.0,0.0,5.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,2.0
4,0.0,2.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,...,5.0,0.0,0.0,4.0,4.0,0.0,0.0,0.0,3.0,0.0
5,0.0,0.0,0.0,5.0,0.0,3.0,0.0,0.0,1.0,0.0,...,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
6,0.0,0.0,0.0,0.0,0.0,1.0,0.0,2.0,0.0,0.0,...,0.0,0.0,1.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,2.0,0.0,0.0,5.0,0.0,0.0,1.0,0.0,...,4.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,3.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,2.0,0.0,0.0,3.0,0.0,0.0,0.0,1.0,0.0
9,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,5.0,2.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,3.0
10,4.0,3.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0


### Decomposition, Optimisation and Prediction

This section implements matrix factorization with stochastic gradient descent (SGD) as an approach to non-negative matrix factorization (NMF). Let's break down the key components:

1. Initialization:
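- We initialize matrices P and Q with non-negative random values (absolute values of standard normal draws). Only the initialization is non-negative: the SGD updates are unconstrained, so entries of P and Q, and therefore predictions, can become negative, as the reconstructed matrix in the output shows.
- If use_bias is enabled, the bias terms b_u and b_i are initialized to zeros, and the global bias b is calculated as the mean of the non-zero elements in the input matrix R.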



2. Update Rules:
- We use SGD to update P and Q iteratively based on the error $e_{ij}$ (the difference between the actual rating and the predicted rating, the dot product of P[i] and Q[j]).
- If use_regularization is enabled, we apply L2 regularization by subtracting a penalty term proportional to the current values of P and Q, scaled by beta.
- The bias terms b_u and b_i are also updated based on the error.
- Main loop: the code runs for a fixed number of iterations, each looping over all entries $(i,j)$ of the input matrix $R$. If R[i][j] is a non-zero rating, it computes the prediction error $e_{ij}$ and then updates P[i] and Q[j] by SGD, with the regularization terms when enabled.

3. Convergence Check:
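- After each pass, we compute the Frobenius norm of the difference between the matrix R and the reconstructed matrix PQ^T.
- If it falls below a threshold (0.001 in our case), the algorithm stops iterating. Because the norm is taken over every entry of R, including the zeros that stand for missing ratings, this threshold is unlikely to be reached in practice, and the loop normally runs for all `steps` iterations.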

4. Bias Terms:
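- If use_bias is enabled, the global, user and item bias terms are added to the final prediction. The error $e_{ij}$ used during training does not include them, so the biased predictions can overshoot the rating scale.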
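
The components above keep P and Q non-negative only at initialization. One way to enforce non-negativity at every step is to clip the factors at zero after each update (projected SGD). The following is a minimal sketch under that assumption, not the implementation used in this notebook; the name `projected_sgd_nmf`, its `tol` and `seed` parameters, and the toy matrix are hypothetical. It also measures convergence on observed ratings only.

```python
import numpy as np


def projected_sgd_nmf(R, K, steps=200, alpha=0.01, beta=0.02, tol=1e-4, seed=0):
    """SGD matrix factorization with factors clipped at zero after each update.

    Hypothetical sketch: zeros in R are treated as missing, and training stops
    once the RMSE on observed entries improves by less than `tol`.
    """
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    P = rng.random((n_users, K))            # non-negative initial factors
    Q = rng.random((n_items, K))
    users, items = np.nonzero(R)            # coordinates of observed ratings
    prev_rmse = np.inf

    for _ in range(steps):
        for idx in rng.permutation(len(users)):   # visit ratings in random order
            u, i = users[idx], items[idx]
            e = R[u, i] - P[u] @ Q[i]
            p_old = P[u].copy()                   # both gradients use pre-update values
            P[u] = np.maximum(P[u] + alpha * (2 * e * Q[i] - beta * P[u]), 0.0)
            Q[i] = np.maximum(Q[i] + alpha * (2 * e * p_old - beta * Q[i]), 0.0)

        # RMSE over observed entries only (missing entries are ignored)
        errors = R[users, items] - np.sum(P[users] * Q[items], axis=1)
        rmse = np.sqrt(np.mean(errors ** 2))
        if prev_rmse - rmse < tol:                # stop when improvement stalls
            break
        prev_rmse = rmse

    return P, Q, rmse


# Toy usage (made-up ratings; 0 means "not rated")
R_toy = np.array([[5, 3, 0],
                  [4, 0, 1],
                  [0, 2, 5]], dtype=float)
P_toy, Q_toy, rmse_toy = projected_sgd_nmf(R_toy, K=2)
print(np.round(P_toy @ Q_toy.T, 2))
```

Clipping keeps every predicted rating non-negative, at the cost of the updates no longer being a pure gradient step whenever a factor hits zero.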

In [53]:
def matrix_factorization_sgd(R, K, steps=50, alpha=0.001, beta=0.02, use_regularization=True, use_bias=True):
    # R = user-item ratings matrix
    # K = number of latent features
    # steps = number of iterations
    # alpha = learning rate
    # beta = regularization parameter

    N, M = R.shape
    P = np.abs(np.random.randn(N, K))  # Initialize with non-negative values
    Q = np.abs(np.random.randn(M, K))
    counter = 0

    # Initialize bias terms
    if use_bias:
        b_u = np.zeros(N)
        b_i = np.zeros(M)
        b = np.mean(R[np.where(R != 0)])  # global bias

    for step in range(steps):
        for i in range(N):
            for j in range(M):
                if R[i][j] > 0:
                    eij = R[i][j] - np.dot(P[i, :], Q[j, :])

                    # Update P and Q
                    for k in range(K):
                        if use_regularization:
                            P[i][k] += alpha * (2 * eij * Q[j][k] - beta * P[i][k])
                            Q[j][k] += alpha * (2 * eij * P[i][k] - beta * Q[j][k])
                        else:
                            P[i][k] += alpha * (2 * eij * Q[j][k])
                            Q[j][k] += alpha * (2 * eij * P[i][k])

                    # Update bias terms
                    if use_bias:
                        b_u[i] += alpha * (eij - beta * b_u[i])
                        b_i[j] += alpha * (eij - beta * b_i[j])

        # Check for convergence within the loop
        if np.sqrt(np.sum((R - np.dot(P, Q.T))**2)) < 0.001:
            break

    # Add bias terms to the prediction
    if use_bias:
        R_pred = np.dot(P, Q.T) + b + b_u[:, np.newaxis] + b_i[np.newaxis, :]
    else:
        R_pred = np.dot(P, Q.T)

    return P, Q, R_pred


# Use the function to reconstruct the original matrix
np.random.seed(42)
R = x_hidden.values
nP, nQ, nR_pred = matrix_factorization_sgd(R, K=2, alpha=0.001, beta=0.02, use_regularization=False, use_bias=False, steps=1000)
print("Original Matrix:")
print(R)
print("\nReconstructed Matrix:")
print(nR_pred)

#  convert the reconstructed matrix to a dataframe
nR_pred = pd.DataFrame(nR_pred, columns=x_hidden.columns, index=x_hidden.index)
print("\nReconstructed Matrix as a DataFrame")
display(nR_pred)

Original Matrix:
[[5. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 4. 0. 0.]
 [5. 0. 3. ... 0. 0. 0.]
 ...
 [1. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 1.]
 [3. 3. 0. ... 0. 0. 0.]]

Reconstructed Matrix:
[[ 4.59819445  2.46671067  2.63502737 ...  2.53473085  3.56896197
   2.92230676]
 [ 4.13910797 -0.18549882  1.66171862 ...  2.8186792   3.53059182
   3.07223151]
 [ 5.01439531  2.32818297  2.76673171 ...  2.84491508  3.9398167
   3.25323681]
 ...
 [ 1.00861777  3.19482251  1.36137622 ... -0.03633597  0.43214739
   0.15382587]
 [ 1.77005512 -1.87676289  0.1800208  ...  1.60658248  1.74736932
   1.64379404]
 [ 3.68297795  2.09823118  2.14671505 ...  2.00288199  2.84241432
   2.31816878]]

Reconstructed Matrix as a DataFrame


user,item1,item2,item3,item4,item5,item6,item7,item8,item9,item10,...,item21,item22,item23,item24,item25,item26,item27,item28,item29,item30
1,4.598194,2.466711,2.635027,4.427982,3.18805,2.483468,3.544901,2.603285,3.761474,2.887082,...,5.078196,3.903188,4.014551,4.160609,3.878936,3.298762,1.769712,2.534731,3.568962,2.922307
2,4.139108,-0.185499,1.661719,4.865041,-0.757941,5.462929,1.299585,1.815326,1.661141,2.604765,...,1.652384,4.965607,4.907197,2.138033,5.199372,1.307965,4.457433,2.818679,3.530592,3.072232
3,5.014395,2.328183,2.766732,4.960982,2.931087,3.193589,3.58134,2.759512,3.84257,3.149295,...,5.098919,4.474848,4.572432,4.295518,4.486836,3.347501,2.36064,2.844915,3.939817,3.253237
4,3.993806,2.103633,2.277209,3.860164,2.710429,2.209159,3.048415,2.252581,3.239212,2.507699,...,4.363582,3.413602,3.507765,3.587784,3.396664,2.838342,1.583356,2.210237,3.104991,2.545331
5,2.03123,-1.442101,0.416641,2.881169,-2.409111,4.49326,-0.424365,0.594327,-0.153377,1.281596,...,-0.828183,3.252273,3.134515,0.146697,3.510523,-0.291125,3.795976,1.684807,1.911157,1.755704
6,2.47474,1.40269,1.44034,2.355686,1.829055,1.235843,1.966906,1.417569,2.078265,1.553637,...,2.824197,2.055356,2.120244,2.289407,2.034324,1.827256,0.863032,1.347423,1.910883,1.558991
7,3.634576,2.49993,2.245223,3.299002,3.349477,1.225023,3.234512,2.178475,3.367602,2.280695,...,4.681416,2.753169,2.877473,3.6562,2.675553,2.987374,0.74385,1.880743,2.748328,2.208895
8,2.676654,1.023384,1.412102,2.728307,1.233804,1.999011,1.739231,1.424861,1.893866,1.681615,...,2.45562,2.521058,2.558681,2.146371,2.550764,1.635376,1.521287,1.567566,2.132043,1.776834
9,5.616027,4.281036,3.592705,4.944686,5.806108,1.33184,5.326645,3.457899,5.50333,3.523025,...,7.740949,4.001684,4.221333,5.928818,3.837323,4.904804,0.651449,2.812713,4.191356,3.336333
10,4.535939,2.458297,2.606727,4.358903,3.182556,2.416331,3.516547,2.573522,3.728458,2.847932,...,5.03975,3.835264,3.946767,4.120967,3.808686,3.271352,1.716008,2.494837,3.51734,2.878155


### Grid Search for Tuning

In [45]:
hidden_ratings_ind = indices_tracker.copy()
hidden_ratings_arrays = []
for user in range(x.shape[0]):
    user_hidden_ratings = x.iloc[user, hidden_ratings_ind[user, :]].reset_index(drop=True).values
    hidden_ratings_arrays.append(user_hidden_ratings)

hidden_ratings_array = pd.DataFrame(hidden_ratings_arrays).to_numpy().flatten()

In [46]:
from sklearn.metrics import mean_absolute_error, mean_squared_error
import itertools

# Define the hyperparameters to tune
param_grid = {
    'K': [2, 5, 10, 20],        # Number of latent features
    'alpha': [0.001, 0.0001], # Learning rate
    'beta': [0.1, 0.5, 1, 2, 4, 5]    # Regularization parameter
}

# Create all possible combinations of hyperparameters
param_combinations = list(itertools.product(*param_grid.values()))

# Initialize variables to keep track of the best parameters and the best RMSE
best_params = None
best_rmse = float('inf')  # initialize with a large value
counter = 0

# Loop over each parameter combination
for params in param_combinations:
    
    # Unpack the parameters
    K, alpha, beta = params
    
    # counter
    counter += 1

    # Run matrix factorization with the current hyperparameters
    np.random.seed(42)
    print(f"Iteration {counter} of {len(param_combinations)}")
    print(f'K={K}, alpha={alpha}, beta={beta}')
    nP, nQ, nR_pred = matrix_factorization_sgd(
        R, K=K, alpha=alpha, beta=beta, use_regularization=True, use_bias=True)
    
    # Compute RMSE
    nR_pred = pd.DataFrame(nR_pred, columns=x_hidden.columns, index=x_hidden.index)
    predicted_ratings_arrays = []
    for user in range(nR_pred.shape[0]):
        user_predicted_ratings = nR_pred.iloc[user, hidden_ratings_ind[user, :]].reset_index(drop=True).values
        predicted_ratings_arrays.append(user_predicted_ratings)

    predicted_ratings_array = pd.DataFrame(predicted_ratings_arrays).to_numpy().flatten()
    rmse = np.sqrt(mean_squared_error(hidden_ratings_array, predicted_ratings_array))

    # Check if this is the best RMSE so far
    print(f"Checking RMSE: {rmse}")
    if rmse < best_rmse:
        print(f'New best RMSE: {rmse}')
        best_rmse = rmse
        best_params = params
    else:
        print("RMSE not improved")
    print("\n")

# Print the best parameters and the best RMSE
print(f'Best Parameters: {best_params}')
print(f'Best RMSE: {best_rmse}')


Iteration 1 of 18
K=2, alpha=0.001, beta=0.1
Step: 0
...
Step: 49
Checking RMSE: 3.3765536948832966
New best RMSE: 3.3765536948832966


Iteration 2 of 18
K=2, alpha=0.001, beta=0.5
Step: 0
...
Step: 44
Step:

### Evaluation (Predictive Accuracy)

Now evaluate how good the predictions are vs the hidden ratings
- ***step 1***: identify the hidden ratings indices
- ***step 2***: extract hidden ratings indices and corresponding predicted ratings indices
- ***step 3***: calculate MAE, MSE and RMSE (take the hidden ratings as the true values and the predicted ratings as the predicted values)
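
The three steps above can also be sketched compactly with NumPy integer indexing instead of per-user loops. This is an illustrative sketch on made-up toy arrays; the helper names `gather_hidden` and `error_metrics` are hypothetical, and `hidden_idx` plays the role of `indices_tracker` (one row of hidden column positions per user).

```python
import numpy as np


def gather_hidden(matrix, hidden_idx):
    """Pick matrix[u, hidden_idx[u, k]] for every user u and hidden slot k."""
    rows = np.arange(matrix.shape[0])[:, np.newaxis]   # shape (n_users, 1)
    return matrix[rows, hidden_idx].ravel()


def error_metrics(y_true, y_pred):
    """Return MAE, MSE and RMSE between true and predicted ratings."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    mse = float(np.mean(diff ** 2))
    return float(np.mean(np.abs(diff))), mse, float(np.sqrt(mse))


# Toy data (made up for illustration): 2 users, 4 items, 2 hidden ratings each
ratings = np.array([[5.0, 0.0, 3.0, 1.0],
                    [0.0, 4.0, 2.0, 0.0]])
predictions = np.array([[4.0, 2.0, 3.0, 2.0],
                        [1.0, 5.0, 2.0, 3.0]])
hidden_idx = np.array([[0, 3],
                       [1, 2]])

y_true = gather_hidden(ratings, hidden_idx)      # [5, 1, 4, 2]
y_pred = gather_hidden(predictions, hidden_idx)  # [4, 2, 5, 2]
mae, mse, rmse = error_metrics(y_true, y_pred)
print(mae, mse, rmse)
```

The row index array broadcasts against `hidden_idx`, so each user's row is paired with that user's own hidden columns, in the same order as the flattened loop version.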

In [47]:
# step 1: identify the hidden ratings indices = indices_tracker and get the hidden ratings ==========================================================================
hidden_ratings_ind = indices_tracker.copy()

# Collect each user's hidden ratings from the original matrix
hidden_ratings_arrays = []
for user in range(x.shape[0]):
    user_hidden_ratings = x.iloc[user, hidden_ratings_ind[user, :]].reset_index(drop=True).values
    hidden_ratings_arrays.append(user_hidden_ratings)


hidden_ratings_array = pd.DataFrame(hidden_ratings_arrays).to_numpy().flatten()
print("Hidden Ratings:", hidden_ratings_array)

# step 2: extract corresponding predicted ratings indices ==========================================================================

# Create an empty list to store predicted ratings arrays
predicted_ratings_arrays = []

# Loop through users to append predicted ratings arrays
for user in range(nR_pred.shape[0]):
    user_predicted_ratings = nR_pred.iloc[user, hidden_ratings_ind[user, :]].reset_index(drop=True).values
    predicted_ratings_arrays.append(user_predicted_ratings)

predicted_ratings_array = pd.DataFrame(predicted_ratings_arrays).to_numpy().flatten()
print("Corresponding Predicted Ratings:", predicted_ratings_array)

# step 3: calculate MAE, MSE and RMSE (take the hidden ratings as the true values and the predicted ratings as the predicted values) ==========================================================================

from sklearn.metrics import mean_absolute_error, mean_squared_error

# calculate MAE, MSE and RMSE
print("Using sklearn")
mae = mean_absolute_error(hidden_ratings_array, predicted_ratings_array)
mse = mean_squared_error(hidden_ratings_array, predicted_ratings_array)
rmse = np.sqrt(mse)

print(f"Mean Absolute Error (MAE): {mae}")
print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")


# Manually
print("\n\nManually")
mae = np.mean(np.abs(hidden_ratings_array - predicted_ratings_array)) # Calculate Mean Absolute Error (MAE)
mse = np.mean((hidden_ratings_array - predicted_ratings_array) ** 2) # Calculate Mean Squared Error (MSE)
rmse = np.sqrt(mse) # Calculate Root Mean Squared Error (RMSE)


print(f"Mean Absolute Error (MAE): {mae}")
print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")

Hidden Ratings: [4. 5. 1. 3. 5. 3. 5. 2. 2. 2. 5. 4. 1. 3. 5. 1. 5. 1. 5. 4. 1. 1. 5. 2.
 5. 2. 4. 1. 4. 4. 3. 3. 2. 1. 4. 2. 5. 4. 4. 4. 3. 1. 4. 2. 2. 3. 1. 1.
 5. 4. 2. 4. 1. 2. 3. 3. 1. 2. 5. 4. 1. 1. 3. 3. 3. 3. 1. 1. 4. 5. 4. 1.
 3. 2. 1. 5. 2. 4. 2. 4. 4. 3. 3. 3. 4. 5. 5. 4. 4. 3. 1. 5. 1. 3. 3. 4.
 2. 4. 1. 2. 1. 5. 1. 3. 5. 4. 5. 5. 2. 2. 2. 4. 4. 3. 5. 5. 5. 3. 3. 2.
 4. 4. 5. 1. 2. 5. 4. 5. 5. 2. 4. 5. 1. 5. 1. 3. 1. 3. 4. 2. 3. 5. 1. 5.
 5. 5. 2. 1. 2. 5.]
Corresponding Predicted Ratings: [ 5.69516466  3.62928464  6.80235818  6.13097254  6.1438771   6.52080282
  7.73146906  5.96760307  4.5673075   6.00192727  6.66521877  6.90728708
  4.84114081  4.104792    5.66964478  4.40403173  4.24338643  3.7007963
  4.12988495  6.04134152  5.25942869  8.04830793  4.89894487  5.02895848
  3.71120641  5.15918216  5.51956847  4.78393186  4.94356061  4.81916789
  5.91814148  6.53642162  4.73827151  5.73846864  4.66894962  6.08089578
  7.35159247  6.31237508  4.79823516  5.82489225  4.1649

In [None]:
# round to 3 decimal places
mae = round(mae, 3)
mse = round(mse, 3)
rmse = round(rmse, 3)

# Save the results to a csv file
results = pd.DataFrame({'MAE': [mae], 'MSE': [mse], 'RMSE': [rmse]})
# results.to_csv(r'C:\Users\e1002902\Documents\GitHub Repository\Masters-Dissertation\Code\Data\results_NMF.csv', index=False)
results.to_csv('/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/Results/MF_results.csv', index=False)

## Using Packages

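- The `use_regularization` parameter controls whether regularization is applied. In this wrapper, `alpha` is passed to scikit-learn as the regularization strength and `beta` as `l1_ratio` (the L1/L2 mixing ratio), so they do not mean the same thing as in the manual SGD version.
- The `use_bias` parameter controls whether bias terms are included. When it is enabled, the factors returned by scikit-learn are refined further with SGD updates that reuse `alpha` as the step size and `beta` as the penalty.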
- We use the `NMF` class from `Scikit-learn`, which handles the optimization process for us.
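
Before the wrapper function, it may help to see the `NMF` class on its own. The following is a minimal sketch on a made-up toy matrix, using only `n_components`, `init`, `random_state` and `max_iter`. Note that scikit-learn's `NMF` fits every entry of the matrix, so zeros are treated as actual zero values rather than as missing ratings.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy non-negative matrix (made up for illustration)
R_toy = np.array([[5.0, 3.0, 0.0, 1.0],
                  [4.0, 0.0, 0.0, 1.0],
                  [1.0, 1.0, 0.0, 5.0],
                  [0.0, 1.0, 5.0, 4.0]])

model = NMF(n_components=2, init="random", random_state=0, max_iter=1000)
W = model.fit_transform(R_toy)   # user factors, shape (4, 2)
H = model.components_            # item factors, shape (2, 4)
R_hat = W @ H                    # rank-2, non-negative reconstruction

print(np.round(R_hat, 2))
```

Both factor matrices are non-negative by construction, which is the property the manual SGD version only guarantees at initialization.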

In [26]:
# import NMF from scikit-learn
from sklearn.decomposition import NMF


def matrix_factorization_nmf(R, K, steps=500, alpha=0.001, beta=0.02, use_regularization=True, use_bias=True):
    """
    Perform non-negative matrix factorization using scikit-learn's NMF.

    Args:
        R (numpy.ndarray): The input rating matrix.
        K (int): Number of latent features.
        steps (int): Maximum number of iterations.
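        alpha (float): Regularization strength for NMF; also the SGD step size in the bias refinement.
        beta (float): L1/L2 mixing ratio (l1_ratio) for NMF; also the SGD penalty in the bias refinement.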
        use_regularization (bool): Whether to use regularization.
        use_bias (bool): Whether to use bias terms.
    Returns:
        numpy.ndarray, numpy.ndarray, numpy.ndarray: Factorized matrices P, Q, and the reconstructed matrix R_pred.
    """

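    # Initialize NMF model. NMF has no learning rate: `alpha` is the regularization
    # strength and `beta` is passed as the L1/L2 mixing ratio (l1_ratio).
    # scikit-learn >= 1.0 takes alpha_W (alpha_H defaults to 'same'); the older
    # `alpha` argument was removed in 1.2.
    nmf_model = NMF(n_components=K, init='random', solver='cd', beta_loss='frobenius', max_iter=steps,
                    alpha_W=alpha if use_regularization else 0.0, l1_ratio=beta if use_regularization else 0.0)

    # Fit the model and get the factor matrices
    P = nmf_model.fit_transform(R)
    Q = nmf_model.components_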

    if use_bias:
        b_u = np.zeros(P.shape[0])
        b_i = np.zeros(Q.shape[1])
        b = np.mean(R[np.where(R != 0)])

        for _ in range(steps):
            for i in range(P.shape[0]):
                for j in range(Q.shape[1]):
                    if R[i][j] > 0:
                        eij = R[i][j] - np.dot(P[i, :], Q[:, j])

                        P[i, :] += alpha * (2 * eij * Q[:, j] - beta * P[i, :])
                        Q[:, j] += alpha * (2 * eij * P[i, :] - beta * Q[:, j])

                        b_u[i] += alpha * (eij - beta * b_u[i])
                        b_i[j] += alpha * (eij - beta * b_i[j])

        R_pred = np.dot(P, Q) + b + b_u[:, np.newaxis] + b_i[np.newaxis, :]
    else:
        R_pred = np.dot(P, Q)

    return P, Q, R_pred

# Example usage
np.random.seed(42)
nP, nQ, nR_pred = matrix_factorization_nmf(R, K=2, alpha=0.001, beta=0.02, use_regularization=True, use_bias=True)
print("Original Matrix:")
print(R)
print("\nReconstructed Matrix:")
print(nR_pred)



Original Matrix:
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]

Reconstructed Matrix:
[[ 9.47914175  4.52876106  4.70464802 ...  4.52876106  4.52876106
   4.88160201]
 [ 9.47914175  4.52876106  4.70464802 ...  4.52876106  4.52876106
   4.88160201]
 [ 9.47914175  4.52876106  4.70464802 ...  4.52876106  4.52876106
   4.88160201]
 ...
 [11.96670806  7.01632738  7.19221434 ...  7.01632738  7.01632738
   7.36916833]
 [ 9.47914175  4.52876106  4.70464802 ...  4.52876106  4.52876106
   4.88160201]
 [ 9.47914175  4.52876106  4.70464802 ...  4.52876106  4.52876106
   4.88160201]]


In [27]:
import itertools

# Define the parameter grid
param_grid = {
    'K': [2, 3, 4],         # Number of latent features
    'alpha': [0.001, 0.01], # Regularization strength / SGD step size
    'beta': [0.01, 0.02]    # l1_ratio / SGD penalty
}

# GridSearchCV expects an estimator object implementing 'fit', so it cannot wrap
# a plain function. Loop over the combinations manually and score each one on
# the hidden ratings instead.
rows = np.arange(R.shape[0])[:, np.newaxis]
best_params, best_rmse = None, float('inf')
for K, alpha, beta in itertools.product(*param_grid.values()):
    np.random.seed(42)
    _, _, R_pred = matrix_factorization_nmf(R, K=K, alpha=alpha, beta=beta,
                                            use_regularization=True, use_bias=True)
    predicted = R_pred[rows, hidden_ratings_ind].flatten()
    rmse = np.sqrt(np.mean((hidden_ratings_array - predicted) ** 2))
    if rmse < best_rmse:
        best_params, best_rmse = {'K': K, 'alpha': alpha, 'beta': beta}, rmse

# get params
best_K = best_params['K']
best_beta = best_params['beta']
best_alpha = best_params['alpha']

# Print the best parameters and best score
print("Best parameters:", best_params)
print("Best RMSE:", best_rmse)
print(f"Best K: {best_K}")
print(f"Best beta: {best_beta}")
print(f"Best alpha: {best_alpha}")

# Re-train the model with best hyperparameters
best_model = matrix_factorization_nmf(R=x_hidden.values, K=best_K, alpha=best_alpha, beta=best_beta,
                                      use_regularization=True, use_bias=True)

In [None]:
# step 1: identify the hidden ratings indices = indices_tracker and get the hidden ratings ==========================================================================
hidden_ratings_ind = indices_tracker.copy()

# Create an empty list to store hidden ratings arrays
hidden_ratings_arrays = []

# Loop through users to append hidden ratings arrays
for user in range(x.shape[0]):
    user_hidden_ratings = x.iloc[user, hidden_ratings_ind[user, :]].reset_index(drop=True).values
    hidden_ratings_arrays.append(user_hidden_ratings)


hidden_ratings_array = pd.DataFrame(hidden_ratings_arrays).to_numpy().flatten()
print("Hidden Ratings:", hidden_ratings_array)

# step 2: extract corresponding predicted ratings indices ==========================================================================

# Create an empty list to store predicted ratings arrays
predicted_ratings_arrays = []

# Loop through users to append predicted ratings arrays
for user in range(nR.shape[0]):
    user_predicted_ratings = nR.iloc[user, hidden_ratings_ind[user, :]].reset_index(drop=True).values
    predicted_ratings_arrays.append(user_predicted_ratings)

predicted_ratings_array = pd.DataFrame(predicted_ratings_arrays).to_numpy().flatten()
print("Corresponding Predicted Ratings:", predicted_ratings_array)

# step 3: calculate MAE, MSE and RMSE (take the hidden ratings as the true values and the predicted ratings as the predicted values) ==========================================================================

from sklearn.metrics import mean_absolute_error, mean_squared_error

# calculate MAE, MSE and RMSE
print("Using sklearn")
mae = mean_absolute_error(hidden_ratings_array, predicted_ratings_array)
mse = mean_squared_error(hidden_ratings_array, predicted_ratings_array)
rmse = np.sqrt(mse)

print(f"Mean Absolute Error (MAE): {mae}")
print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")


# Manually
print("\n\nManually")
mae = np.mean(np.abs(hidden_ratings_array - predicted_ratings_array)) # Calculate Mean Absolute Error (MAE)
mse = np.mean((hidden_ratings_array - predicted_ratings_array) ** 2) # Calculate Mean Squared Error (MSE)
rmse = np.sqrt(mse) # Calculate Root Mean Squared Error (RMSE)


print(f"Mean Absolute Error (MAE): {mae}")
print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")

## Using Packages without Bias and Regularisation

In [84]:
import numpy as np
from sklearn.decomposition import NMF

# Create a copy of the original ratings matrix (you should use a copy so as not to modify the original matrix)
R_copy = R.copy()

# Replace original zeros in R_copy with NaNs
R_copy[R_copy == 0] = np.nan

# Specify the number of components (you can experiment with different values)
# NOTE: scikit-learn documents l1_ratio as a value in [0, 1], so 1.5 is out of range
n_components = 10
model = NMF(n_components=n_components, init='random', random_state=2207, max_iter=1000, alpha=0.01, l1_ratio=1.5, verbose=False)
P = model.fit_transform(R_copy)  # User-feature matrix
Q = model.components_            # Feature-item matrix

# Multiply P and Q to get the estimated ratings
R_estimated = np.dot(P, Q)

# Create a mask for the missing values
mask = np.isnan(R_copy)

# Replace the original missing values with the predicted ratings
R_predicted = R.copy()  # Create a copy to ensure that the original matrix is not modified
R_predicted[mask] = R_estimated[mask]

# Print the original and predicted ratings
print("Original Ratings:")
print(R)

print("\nPredicted Ratings:")
print(R_predicted)

ValueError: Input X contains NaN.
NMF does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values
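
Because scikit-learn's `NMF` rejects NaNs, one alternative to filling the gaps with zeros is to let only the observed entries contribute to the loss. The sketch below is a minimal illustration of that idea using masked multiplicative updates; `masked_nmf` is a hypothetical helper (not part of scikit-learn), and the toy matrix reuses the small ratings example from the Sandbox with its zeros written as NaN.

```python
import numpy as np

def masked_nmf(R, n_components, n_iter=500, eps=1e-9, seed=0):
    """Multiplicative-update NMF that ignores NaN entries of R (hypothetical helper)."""
    mask = ~np.isnan(R)                  # True where a rating is observed
    R_filled = np.where(mask, R, 0.0)    # zeros here are excluded by the mask below
    rng = np.random.default_rng(seed)
    P = rng.random((R.shape[0], n_components)) + 0.1   # user-feature matrix
    Q = rng.random((n_components, R.shape[1])) + 0.1   # feature-item matrix
    for _ in range(n_iter):
        PQ = mask * (P @ Q)              # reconstruction restricted to observed cells
        P *= (R_filled @ Q.T) / (PQ @ Q.T + eps)
        PQ = mask * (P @ Q)
        Q *= (P.T @ R_filled) / (P.T @ PQ + eps)
    return P, Q

R_toy = np.array([
    [5, np.nan, 3, np.nan],
    [4, 4, np.nan, 2],
    [np.nan, 3, np.nan, np.nan],
    [2, np.nan, np.nan, 4],
    [np.nan, 1, 5, np.nan],
])

P, Q = masked_nmf(R_toy, n_components=2)
R_hat = P @ Q

observed = ~np.isnan(R_toy)
rmse_observed = np.sqrt(np.mean((R_toy[observed] - R_hat[observed]) ** 2))
print("Estimated ratings:\n", R_hat)
print("RMSE on observed entries:", rmse_observed)
```

Since the factors start positive and are only ever multiplied by non-negative ratios, `P` and `Q` stay non-negative without any explicit clipping.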

In [82]:
import numpy as np
from sklearn.decomposition import NMF

# Specify the number of components (you can experiment with different values)
# NOTE: scikit-learn documents l1_ratio as a value in [0, 1], so 1.5 is out of range
n_components = 10
model = NMF(n_components=n_components, init='random', random_state=2207, max_iter=1000, alpha=0.01, l1_ratio=1.5, verbose=False)
P = model.fit_transform(R)  # User-feature matrix
Q = model.components_       # Feature-item matrix


# Multiply P and Q to get the estimated ratings
R_estimated = np.dot(P, Q)

# Replace original zeros in R with predicted ratings
R_predicted = np.where(R == 0, R_estimated, R)

# Print the original and predicted ratings
print("Original Ratings:")
print(R)

print("\nPredicted Ratings:")
print(R_predicted)
R_predicted = pd.DataFrame(R_predicted, columns=x_hidden.columns, index=x_hidden.index)



Original Ratings:
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]

Predicted Ratings:
[[1.13756107e-05 1.97293066e-04 0.00000000e+00 ... 1.75846414e-01
  1.80200509e-01 3.25668829e-03]
 [5.13774224e-03 1.35576910e-04 1.39125748e-04 ... 1.80595700e-02
  1.84806670e-02 8.05029405e-03]
 [3.44309594e-03 3.36838948e-03 1.52714421e-02 ... 1.62260490e-03
  0.00000000e+00 5.91030722e-03]
 ...
 [3.90118221e-07 2.51222987e-04 6.20551495e-04 ... 1.46281673e-04
  1.16376849e-05 1.55002994e-03]
 [0.00000000e+00 2.96481398e-04 0.00000000e+00 ... 7.20876672e-02
  7.38726141e-02 1.73037274e-03]
 [6.46961783e-02 2.17507721e-02 1.33368959e-01 ... 7.24586383e-04
  2.20241253e-04 1.88555567e-03]]


In [83]:
# step 1: identify the hidden ratings indices = indices_tracker and get the hidden ratings ==========================================================================
hidden_ratings_ind = indices_tracker.copy()

# Create an empty list to store hidden ratings arrays
hidden_ratings_arrays = []

# Loop through users to append hidden ratings arrays
for user in range(x.shape[0]):
    user_hidden_ratings = x.iloc[user, hidden_ratings_ind[user, :]].reset_index(drop=True).values
    hidden_ratings_arrays.append(user_hidden_ratings)
hidden_ratings_array = pd.DataFrame(hidden_ratings_arrays).to_numpy().flatten()

# step 2: extract corresponding predicted ratings indices ==========================================================================

# Create an empty list to store predicted ratings arrays
predicted_ratings_arrays = []

# Loop through users to append predicted ratings arrays
for user in range(R_predicted.shape[0]):
    user_predicted_ratings = R_predicted.iloc[user, hidden_ratings_ind[user, :]].reset_index(drop=True).values
    predicted_ratings_arrays.append(user_predicted_ratings)
predicted_ratings_array = pd.DataFrame(predicted_ratings_arrays).to_numpy().flatten()

# step 3: calculate MAE, MSE and RMSE (take the hidden ratings as the true values and the predicted ratings as the predicted values) ==========================================================================

from sklearn.metrics import mean_absolute_error, mean_squared_error

# calculate MAE, MSE and RMSE
print("Using sklearn")
mae = mean_absolute_error(hidden_ratings_array, predicted_ratings_array)
mse = mean_squared_error(hidden_ratings_array, predicted_ratings_array)
rmse = np.sqrt(mse)

print(f"Mean Absolute Error (MAE): {mae}")
print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")


# Manually
print("\n\nManually")
mae = np.mean(np.abs(hidden_ratings_array - predicted_ratings_array)) # Calculate Mean Absolute Error (MAE)
mse = np.mean((hidden_ratings_array - predicted_ratings_array) ** 2) # Calculate Mean Squared Error (MSE)
rmse = np.sqrt(mse) # Calculate Root Mean Squared Error (RMSE)


print(f"Mean Absolute Error (MAE): {mae}")
print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")

Using sklearn
Mean Absolute Error (MAE): 4.241243101668249
Mean Squared Error (MSE): 18.985869875308374
Root Mean Squared Error (RMSE): 4.35727780561538


Manually
Mean Absolute Error (MAE): 4.241243101668249
Mean Squared Error (MSE): 18.985869875308374
Root Mean Squared Error (RMSE): 4.35727780561538
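
The per-user loops in the evaluation cell can also be written as a single vectorised lookup. The sketch below is a minimal illustration on toy arrays; it assumes the hidden-rating indices form a 2-D integer array of column positions with one row per user, as the `hidden_ratings_ind[user, :]` indexing suggests. The names `x_true`, `x_pred` and `hidden_idx` are hypothetical stand-ins for `x`, `R_predicted` and `indices_tracker`.

```python
import numpy as np

# Toy stand-ins (hypothetical): true ratings, predictions, hidden column positions per user
x_true = np.array([[5.0, 0.0, 3.0, 4.0],
                   [4.0, 4.0, 0.0, 2.0],
                   [1.0, 3.0, 5.0, 0.0]])
x_pred = np.array([[4.5, 2.0, 3.5, 3.0],
                   [3.0, 4.0, 1.0, 2.5],
                   [2.0, 3.0, 4.0, 1.0]])
hidden_idx = np.array([[0, 3],
                       [1, 3],
                       [0, 2]])

# Pick, for every user (row), the columns listed in hidden_idx, then flatten
hidden = np.take_along_axis(x_true, hidden_idx, axis=1).ravel()
predicted = np.take_along_axis(x_pred, hidden_idx, axis=1).ravel()

errors = hidden - predicted
mae = np.mean(np.abs(errors))
mse = np.mean(errors ** 2)
rmse = np.sqrt(mse)

print(f"Mean Absolute Error (MAE): {mae}")
print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")
```

With pandas DataFrames, calling `.to_numpy()` first gives arrays that `np.take_along_axis` can index directly.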


## Using Surprise Test and Train Split

In [36]:
# load libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# load data
# amz_data = pd.read_csv(r'C:\Users\e1002902\Documents\GitHub Repository\Masters-Dissertation\Code\Data\set2_data_modelling.csv')
amz_data = pd.read_csv("/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/set4_data_modelling.csv", index_col=0)

# print details
print('Number of Rows: ', amz_data.shape[0])
print('Number of Columns: ', amz_data.shape[1])
print('Number of Unique Users: ', len(amz_data['reviewerID'].unique()))
print('Number of Unique Products: ', len(amz_data['asin'].unique()))
print('Fewest reviews by a reviewer:', amz_data.groupby('reviewerID')['asin'].count().min())
print('Most reviews by a reviewer:', amz_data.groupby('reviewerID')['asin'].count().max())
print("Fewest reviews per product:", amz_data.groupby('asin')['reviewerID'].count().min())
print("Most reviews per product:", amz_data.groupby('asin')['reviewerID'].count().max())

# Creating User Item Matrix =====================================================
# create user-item matrix
x = amz_data.pivot_table(index='reviewerID', columns='asin', values='overall')
x = x.fillna(0)
print("\n\n\nUser-Item Matrix")
display(x.head())
print('Shape: ', x.shape)

Unnamed: 0,reviewerID,reviewTime,asin,overall,reviewText,stemmed_words_revText,lemmatised_reviewText,filtered_tokens_revText,sentiments_afinn,sentiments_bing,sentiments_vader
76,AQ8OO59DJFJNZ,2018-01-05,767834739,5.0,wonderful movie,wonder movi,wonderful movie,wonderful movie,4,1,0.5719
78,A244CRJ2QSVLZ4,2008-01-29,767834739,5.0,resident evil is a great science fictionhorror...,resid evil great scienc fictionhorror hybrid p...,resident evil great science fictionhorror hybr...,resident evil great science fictionhorror hybr...,-12,-5,-0.9455
81,A1VCLTAGM5RLND,2005-07-23,767834739,5.0,i this movie has people living and working und...,movi peopl live work underground place call hi...,movie people living working underground place ...,movie people living working underground place ...,-1,0,-0.1806
82,A119Q9NFGVOEJZ,2016-02-13,767834739,5.0,every single video game based movie from the s...,everi singl video game base movi super mario b...,every single video game based movie super mari...,every single video game based movie super mari...,18,6,0.9846
83,A1RP6YCOS5VJ5I,2006-09-26,767834739,5.0,i think that i like this movie more than the o...,think like movi origin origin still great real...,think like movie original original still great...,think like movie original original still great...,29,10,0.9951


Number of Rows:  83139
Number of Columns:  11
Number of Unique Users:  3668
Number of Unique Products:  3249
Fewest reviews by a reviewer: 13
Most reviews by a reviewer: 193
Fewest reviews per product: 13
Most reviews per product: 189



User-Item Matrix


reviewerID,0767834739,7799146915,B00000DMAT,B00000DMAX,B00000DMB3,B00000F1GM,B00000I1BJ,B00000I1BY,B00000ID61,B00000INR2,...,B01H353FLA,B01H353HUY,B01H3VFR6U,B01H5GB8ZW,B01H6OXQFS,B01H9SH2LU,B01HGBAFNC,B01HHVVLGQ,B01HHVWWMI,B01HIZF7XE
A100RH4M1W1DF0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A100WO06OQR8BQ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0
A1027EV8A9PV1O,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A103KKI1Y4TFNQ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A1047P9FLHTDZJ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Shape:  (3668, 3249)


### Train and Test Split

In [None]:
# load hidden ratings matrix
x_hidden = pd.read_csv('/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/suprise_hidden_ratings_matrix.csv')

# load testset indices
testset_indices = pd.read_csv('/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/suprise_testset_indices.csv')

# convert to numpy
testset_indices = testset_indices.to_numpy()

In [None]:
#  get predicted ratings for the testset
predicted_ratings = []
for i in range(len(testset_indices)):
    user_id = testset_indices[i][0]
    item_id = testset_indices[i][1]
    predicted_ratings.append(predic_matrix.iloc[user_id, item_id])

print("Predicted Ratings:")
print(predicted_ratings)

# get actual ratings for the testset
print("\nActual Ratings:")
actual_ratings = testset_df[2].to_list()
print(actual_ratings)

In [None]:
# calculate MAE, MSE and RMSE
from sklearn.metrics import mean_absolute_error, mean_squared_error
print("Using sklearn")
mae = mean_absolute_error(actual_ratings, predicted_ratings)
mse = mean_squared_error(actual_ratings, predicted_ratings)
rmse = np.sqrt(mse)

# round() works whether the metric is returned as a Python float or a NumPy scalar
print(f"Mean Absolute Error (MAE): {round(mae, 2)}")
print(f"Mean Squared Error (MSE): {round(mse, 2)}")
print(f"Root Mean Squared Error (RMSE): {round(rmse, 2)}")


# Manually
print("\n\nManually")

# calculate MAE, MSE and RMSE using actual and predicted ratings
mae = np.mean(np.abs(np.array(actual_ratings) - np.array(predicted_ratings))) # Calculate Mean Absolute Error (MAE)
mse = np.mean((np.array(actual_ratings) - np.array(predicted_ratings)) ** 2) # Calculate Mean Squared Error (MSE)
rmse = np.sqrt(mse) # Calculate Root Mean Squared Error (RMSE)

print(f"Mean Absolute Error (MAE): {mae.round(2)}")
print(f"Mean Squared Error (MSE): {mse.round(2)}")
print(f"Root Mean Squared Error (RMSE): {rmse.round(2)}")

In [None]:
# save results to csv
results = pd.DataFrame({'MAE': [mae.round(3)], 'MSE': [mse.round(3)], 'RMSE': [rmse.round(3)]})
results.to_csv("Data/Results/NMF_results.csv", index=False)


# Sandbox

In [35]:
# Creating Dummy User-Item Matrix =====================================================
import pandas as pd
import numpy as np

# Number of users and items
num_users = 50
num_items = 30
min_ratings_per_user = 5
max_ratings_per_user = 10

# Generate random ratings for the user-item matrix - each user has rated only between 5 and 10 items
ratings = np.zeros((num_users, num_items))

for user_index in range(num_users):
    num_ratings = np.random.randint(min_ratings_per_user, max_ratings_per_user + 1)
    item_indices = np.random.choice(num_items, num_ratings, replace=False)
    ratings[user_index, item_indices] = np.random.randint(1, 6, size=num_ratings)

# Create a DataFrame for the user-item ratings
user_item_ratings_df = pd.DataFrame(ratings, columns=[f"item{i+1}" for i in range(num_items)])
user_item_ratings_df.index += 1
user_item_ratings_df.index.name = 'user'

# Save the user-item ratings matrix to a CSV file
user_item_ratings_df.to_csv('../Code/Data/user_item_ratings_matrix.csv')

# Display the user-item ratings matrix
user_item_ratings_df

user,item1,item2,item3,item4,item5,item6,item7,item8,item9,item10,...,item21,item22,item23,item24,item25,item26,item27,item28,item29,item30
1,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,5.0,4.0,0.0,0.0,0.0,0.0,0.0
2,0.0,3.0,0.0,0.0,3.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,4.0,4.0,5.0,0.0
3,5.0,0.0,3.0,0.0,0.0,0.0,5.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,2.0
4,0.0,2.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,...,5.0,0.0,0.0,4.0,4.0,0.0,0.0,0.0,3.0,0.0
5,0.0,0.0,0.0,5.0,0.0,3.0,0.0,0.0,1.0,0.0,...,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
6,0.0,0.0,0.0,0.0,0.0,1.0,0.0,2.0,0.0,0.0,...,0.0,0.0,1.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,2.0,0.0,0.0,5.0,0.0,0.0,1.0,0.0,...,4.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,3.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,2.0,0.0,0.0,3.0,0.0,0.0,0.0,1.0,0.0
9,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,5.0,2.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,3.0
10,4.0,3.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0


In [21]:
import numpy as np

# Data with some missing ratings represented by 0s
ratings = np.array([
    [5, 0, 3, 0],
    [4, 4, 0, 2],
    [0, 3, 0, 0],
    [2, 0, 0, 4],
    [0, 1, 5, 0]
])

class FunkSVD:
    def __init__(self, num_factors, learning_rate, regularization):
        self.num_factors = num_factors
        self.learning_rate = learning_rate
        self.regularization = regularization

    def fit(self, X, num_iterations):
        self.num_users, self.num_items = X.shape

        # Initialize user and item matrices randomly
        self.user_vectors = np.random.randn(self.num_users, self.num_factors)
        self.item_vectors = np.random.randn(self.num_items, self.num_factors)

        for _ in range(num_iterations):
            # Stochastic gradient descent over the observed (non-zero) ratings only
            for u in range(self.num_users):
                for i in range(self.num_items):
                    if X[u, i] != 0:
                        prediction = np.dot(self.user_vectors[u], self.item_vectors[i])
                        error = X[u, i] - prediction

                        # Copy the user vector so the item update uses its pre-update value
                        user_vec = self.user_vectors[u].copy()
                        self.user_vectors[u] += self.learning_rate * (error * self.item_vectors[i] - self.regularization * user_vec)
                        self.item_vectors[i] += self.learning_rate * (error * user_vec - self.regularization * self.item_vectors[i])

    def predict(self, X):
        # Predict ratings for all users and items
        return np.dot(self.user_vectors, self.item_vectors.T)

# Initialize the FunkSVD model with parameters
num_factors = 2
learning_rate = 0.01
regularization = 0.1
num_iterations = 1000

model = FunkSVD(num_factors, learning_rate, regularization)

# Train the model using the observed ratings in the matrix
model.fit(ratings, num_iterations)

# Get predicted ratings
predicted_ratings = model.predict(ratings)

# Print the predicted ratings
print("Predicted Ratings Matrix:\n", predicted_ratings)


Predicted Ratings Matrix:
 [[ 4.78496146  4.56381781  2.96756047  2.6099398 ]
 [ 3.94057431  3.74974341  2.61609387  2.00440787]
 [ 3.12548658  2.91554045  3.23309482  0.61490819]
 [ 2.00552186  2.07570967 -1.97584102  3.80415768]
 [ 1.31727363  1.05442479  4.8093647  -2.64226551]]
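
The predicted matrix above contains negative entries because the FunkSVD factors are unconstrained. One simple way to keep the factors non-negative, shown here as a sketch rather than the notebook's own method, is projected SGD: start from non-negative factors and clip each updated vector at zero. `nonneg_sgd_mf` is a hypothetical helper, and the hyperparameters mirror the FunkSVD cell.

```python
import numpy as np

def nonneg_sgd_mf(X, num_factors=2, learning_rate=0.01, regularization=0.1,
                  num_iterations=1000, seed=0):
    """SGD matrix factorisation with factors clipped at zero (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    P = rng.random((X.shape[0], num_factors))  # non-negative user factors
    Q = rng.random((X.shape[1], num_factors))  # non-negative item factors
    users, items = np.nonzero(X)               # train on observed ratings only
    for _ in range(num_iterations):
        for u, i in zip(users, items):
            error = X[u, i] - P[u] @ Q[i]
            p_old = P[u].copy()
            # Gradient step followed by projection onto the non-negative orthant
            P[u] = np.maximum(P[u] + learning_rate * (error * Q[i] - regularization * p_old), 0.0)
            Q[i] = np.maximum(Q[i] + learning_rate * (error * p_old - regularization * Q[i]), 0.0)
    return P, Q

# Same toy data as the FunkSVD cell: 0 marks a missing rating
ratings = np.array([
    [5, 0, 3, 0],
    [4, 4, 0, 2],
    [0, 3, 0, 0],
    [2, 0, 0, 4],
    [0, 1, 5, 0],
])

P, Q = nonneg_sgd_mf(ratings)
predicted = P @ Q.T
print("Predicted Ratings Matrix (non-negative factors):\n", predicted)
```

Clipping guarantees non-negative predictions, though the fit on the observed ratings may be somewhat looser than with unconstrained factors.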
