# CSCA5632 Part 2: Limitation(s) of sklearn’s non-negative matrix factorization library.

#### 1. Load the movie ratings data (as in the HW3-recommender-system) and use matrix factorization technique(s) and predict the missing ratings from the test data. Measure the RMSE. You should use sklearn library. [10 pts]

#### 2. Discuss the results and why sklearn's non-negative matrix factorization library did not work well compared to simple baseline or similarity-based methods we’ve done in Module 3. Can you suggest a way(s) to fix it? [10 pts]



### 1. Load movie ratings data

In [6]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import time
from sklearn.model_selection import train_test_split
from scipy.sparse import coo_matrix, csr_matrix
from scipy.spatial.distance import jaccard, cosine 
from pytest import approx

In [7]:
MV_users = pd.read_csv('data/users.csv')
MV_movies = pd.read_csv('data/movies.csv')
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

In [8]:
from collections import namedtuple
Data = namedtuple('Data', ['users','movies','train','test'])
data = Data(MV_users, MV_movies, train, test)

#### 1.1 Prepare data for NMF

In [9]:
# Create a rating matrix
user_list = list(data.users['uID'])
movie_list = list(data.movies['mID'])
genres = list(data.movies.columns.drop(['mID', 'title', 'year']))
# Create a dictionary to map user/movie ID to index
mid2idx = dict(zip(data.movies.mID, list(range(len(data.movies)))))
uid2idx = dict(zip(data.users.uID, list(range(len(data.users)))))

# Create a rating matrix from the train data: build it in sparse (COO) format, then densify
# Note: .toarray() fills every unobserved (user, movie) cell with 0
ind_movie = [mid2idx[x] for x in data.train.mID]
ind_user = [uid2idx[x] for x in data.train.uID]
train_rating = list(data.train.rating)
rating_matrix = coo_matrix((train_rating, (ind_user, ind_movie)), shape=(len(user_list), len(movie_list))).toarray()
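
Before fitting a model, it can help to check how much of the dense matrix consists of real ratings rather than zero placeholders. The sketch below uses a small made-up example rather than the course data, builds the matrix with plain NumPy indexing instead of `coo_matrix`, and assumes that every real rating is at least 1, so a 0 can only mean "unobserved".

```python
import numpy as np

# Hypothetical toy data: 3 users, 4 movies, 5 observed ratings.
user_idx = np.array([0, 0, 1, 2, 2])
movie_idx = np.array([0, 2, 1, 0, 3])
ratings = np.array([5, 3, 4, 2, 1])

# Dense matrix: every unobserved (user, movie) cell stays at 0.
toy_matrix = np.zeros((3, 4))
toy_matrix[user_idx, movie_idx] = ratings

# Assumption: real ratings are >= 1, so "> 0" identifies observed cells.
observed_mask = toy_matrix > 0
density = observed_mask.mean()  # share of cells holding a real rating

print(f"observed cells: {observed_mask.sum()} of {toy_matrix.size} ({density:.1%})")
```

The same two lines (`observed_mask` and `density`) could be applied to `rating_matrix` to see how sparse the training data is.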

#### 1.2 Evaluate NMF model performance

In [10]:
from sklearn.decomposition import NMF
from sklearn.metrics import mean_squared_error

# train NMF model
model = NMF(n_components=20, init='random', random_state=0, max_iter=1000)
W = model.fit_transform(rating_matrix)
H = model.components_

# reconstruct the full user x movie rating matrix
rating_pred = np.dot(W, H)

# look up the predicted ratings for the (user, movie) pairs in the test data
ind_movie_test = [mid2idx[x] for x in data.test.mID]
ind_user_test = [uid2idx[x] for x in data.test.uID]
test_rating = list(data.test.rating)
rating_pred_test = rating_pred[ind_user_test, ind_movie_test]

# calculate RMSE
rmse = np.sqrt(mean_squared_error(test_rating, rating_pred_test))
print(f"RMSE: {rmse}")

RMSE: 2.861970909347182
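
An RMSE this large suggests that the predictions are biased in one direction rather than merely noisy. A quick diagnostic is to look at the mean signed error and at how many predictions fall below the lowest possible rating. The sketch below uses made-up numbers, not the notebook's results, and assumes a rating scale whose minimum is 1. The helper `rmse` and the `toy_*` arrays are hypothetical names.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical values for illustration only (not the notebook's output).
toy_true = np.array([4, 5, 3, 2, 4])
toy_pred = np.array([0.3, 1.1, 0.0, 0.6, 2.5])

# Assumption: the lowest valid rating is 1.
share_below_min = float(np.mean(toy_pred < 1.0))
# A negative mean error means the model under-predicts on average.
mean_error = float(np.mean(toy_pred - toy_true))

print(f"RMSE: {rmse(toy_true, toy_pred):.3f}")
print(f"share of predictions below 1: {share_below_min:.0%}")
print(f"mean signed error: {mean_error:.2f}")
```

Running the same checks on `rating_pred_test` and `test_rating` would show whether the NMF predictions are systematically too low.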


### 2. Discuss the results and why sklearn's non-negative matrix factorization library did not work well compared to simple baseline or similarity-based methods we’ve done in Module 3. Can you suggest a way(s) to fix it?



| Method                                          |        RMSE        |
|:------------------------------------------------|:------------------:|
| Baseline, $Y_p$=3                               | 1.2585510334053043 |
| Baseline, $Y_p=\mu_u$                           | 1.0352910334228647 |
| Content based, item-item                        | 1.0128116783754684 |
| Collaborative, cosine                           | 1.0263081874204125 |
| Collaborative, jaccard, $M_r\geq 3$             | 0.9819058692126349 |
| Collaborative, jaccard, $M_r\geq 1$             | 0.991363571262366  |
| Collaborative, jaccard, $M_r$                   | 0.9509126236828654 |
| Non-Negative Matrix Factorization (sklearn NMF) | 2.861970909347182  |

The comparison shows that sklearn's non-negative matrix factorization (NMF) performed far worse (RMSE = 2.862) than every baseline and similarity-based method. The best result came from collaborative filtering with the Jaccard similarity metric (RMSE = 0.951). Several limitations likely contribute to the poor NMF performance:
- Sparsity and missing values: sklearn's NMF has no notion of a missing entry. The dense rating matrix stores every unrated (user, movie) pair as 0, and the model fits those zeros as if they were real ratings. Because the movie ratings dataset is sparse, the reconstruction is pulled toward 0, so predictions for the test pairs tend to fall well below the true ratings. This is likely the dominant cause of the high RMSE.
- Initialization: We set `init='random'`, which can lead to slow convergence and a suboptimal local solution.
- Rank selection: The number of latent factors (rank) can significantly affect the model's performance. We used 20 components without tuning, which may not be optimal for this dataset.
- Non-negativity constraint: Because the input and both factors must be non-negative, the model cannot work with mean-centered ratings or negative bias terms, which limits its expressiveness compared to an unconstrained factorization.
- Hyperparameters: Other hyperparameters, such as the regularization strength, were left at their defaults. Tuning them can be challenging and may require extensive experimentation.
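
The sparsity point can be made concrete by splitting the objective. Assuming the default squared Frobenius loss with no regularization, and writing $\Omega$ for the set of observed (user, movie) pairs, $r_{ui}$ for an observed rating, and $w_u$, $h_i$ for the corresponding row of $W$ and column of $H$ (notation introduced here for illustration), the quantity being minimized is, up to a constant factor:

$$
\lVert X - WH \rVert_F^2 \;=\; \sum_{(u,i)\in\Omega} \left(r_{ui} - w_u^\top h_i\right)^2 \;+\; \sum_{(u,i)\notin\Omega} \left(0 - w_u^\top h_i\right)^2
$$

The second sum penalizes any non-zero prediction for an unrated pair. When unobserved pairs greatly outnumber observed ones, that sum dominates and pushes $w_u^\top h_i$ toward 0, including for the held-out test pairs, which are also unobserved during training.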

To improve the performance of NMF on the movie ratings dataset, the following strategies can be considered:
- Handle missing entries explicitly: Either impute unrated entries with a sensible value (for example, the user's mean rating) before fitting, or use a factorization whose loss is computed over the observed ratings only.
- Better initialization: Instead of random initialization, a more informed strategy such as sklearn's SVD-based options (`init='nndsvd'` or `init='nndsvda'`) can help NMF converge faster and reach a better solution.
- Hyperparameter tuning: Systematically tuning the number of latent factors and the regularization strength, for example on a validation split, can help optimize the model's performance.
- Regularization: Adding L1 or L2 regularization can help prevent overfitting and improve generalization.
- Hybrid approaches: Combining NMF with other collaborative filtering or content-based methods can leverage the strengths of each technique and improve overall performance.
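
One way to address the sparsity problem is to fit the factorization on observed entries only. The sketch below is a minimal masked (weighted) NMF using multiplicative updates in plain NumPy. It is not part of sklearn's API and not the method used in this notebook. The function names `masked_nmf` and `masked_rmse`, the toy matrix, and all parameter values are hypothetical, and the code assumes that a 0 marks a missing rating.

```python
import numpy as np

def masked_nmf(X, mask, n_components=2, n_iter=200, seed=0, eps=1e-9):
    """Factor X ~ W @ H with W, H >= 0, using only entries where mask == 1."""
    rng = np.random.default_rng(seed)
    n_users, n_items = X.shape
    # Strictly positive random start so the multiplicative updates can move.
    W = rng.random((n_users, n_components)) + 0.1
    H = rng.random((n_components, n_items)) + 0.1
    MX = mask * X
    for _ in range(n_iter):
        # Masked multiplicative updates: unobserved cells contribute nothing.
        WH = mask * (W @ H)
        W *= (MX @ H.T) / (WH @ H.T + eps)
        WH = mask * (W @ H)
        H *= (W.T @ MX) / (W.T @ WH + eps)
    return W, H

def masked_rmse(X, mask, W, H):
    """RMSE over observed entries only."""
    diff = mask * (X - W @ H)
    return float(np.sqrt((diff ** 2).sum() / mask.sum()))

# Hypothetical toy ratings; 0 marks a missing entry.
X = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)
mask = (X > 0).astype(float)

W0, H0 = masked_nmf(X, mask, n_iter=0)    # random start, no updates
W, H = masked_nmf(X, mask, n_iter=200)    # same start, 200 updates

print("observed-entry RMSE before:", masked_rmse(X, mask, W0, H0))
print("observed-entry RMSE after: ", masked_rmse(X, mask, W, H))
```

Because the mask removes the unrated cells from the loss, the factors are no longer rewarded for predicting 0 there. On real data the rank, iteration count, and some form of regularization would still need to be tuned against held-out ratings.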