Link: https://colab.research.google.com/drive/1gRwlTisVyPubmpGJwRewtciSTIx7QQZb?usp=sharing

# 1.

In [1]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [2]:
import pandas as pd
from sklearn.decomposition import NMF
from sklearn.metrics import mean_squared_error
import numpy as np

In [3]:
# Load data
MV_users = pd.read_csv('/content/gdrive/MyDrive/users.csv')
MV_movies = pd.read_csv('/content/gdrive/MyDrive/movies.csv')
train = pd.read_csv('/content/gdrive/MyDrive/train.csv')
test = pd.read_csv('/content/gdrive/MyDrive/test.csv')

In [4]:
# Pivot train and test DataFrames to matrices
train_matrix = train.pivot(index='uID', columns='mID', values='rating')
test_matrix = test.pivot(index='uID', columns='mID', values='rating')

In [5]:
# Handle missing values: fill NaNs with zeros.
# Caveat: NMF then treats every unrated movie as an observed rating of 0.
train_matrix_filled = train_matrix.fillna(0)

In [6]:
# Define model
model = NMF(n_components=20, init='random', random_state=0)

In [7]:
# Fit the model to the train data and obtain W and H matrices
W = model.fit_transform(train_matrix_filled)
H = model.components_



In [8]:
# Compute the predicted ratings
predicted_ratings = np.dot(W, H)

# Convert the predicted ratings to a DataFrame with matching index and columns
predicted_ratings_df = pd.DataFrame(predicted_ratings, index=train_matrix.index, columns=train_matrix.columns)

In [9]:
# Now, you can use `predicted_ratings_df` to find the predicted rating for any user-item pair,
# including those that are missing in the original data.

# For evaluation, let's calculate RMSE for the known ratings in the test set.

# Align the shape of the predicted_ratings_df with test_matrix
predicted_ratings_aligned = predicted_ratings_df.reindex_like(test_matrix)

In [10]:
# Identify indices where test matrix has actual ratings
indices = np.where(~np.isnan(test_matrix))

# Extract the corresponding predictions and actual values
predicted_ratings_masked = predicted_ratings_aligned.values[indices]
actual_ratings = test_matrix.values[indices]

# Drop any pairs where either the prediction or the actual value is NaN
non_nan_indices = np.where(~(np.isnan(predicted_ratings_masked) | np.isnan(actual_ratings)))

predicted_ratings_masked = predicted_ratings_masked[non_nan_indices]
actual_ratings = actual_ratings[non_nan_indices]

# Calculate RMSE
rmse = np.sqrt(mean_squared_error(actual_ratings, predicted_ratings_masked))

print('RMSE: ', rmse)


RMSE:  2.865942566224596


The RMSE above measures how far the model's predicted ratings are from the actual ratings, with larger errors weighted more heavily. An RMSE of 2.87 means the predictions are typically off by almost 3 points. Given that the ratings range from 1 to 5, this is a very high error.
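
To make the metric concrete, here is a minimal sketch of the RMSE calculation on a few made-up ratings. The numbers are illustrative assumptions and are not taken from the dataset; the sketch compares predictions that sit near zero with a constant guess of 3.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Hypothetical ratings on a 1-5 scale (not from the notebook's data)
actual = [4, 5, 3, 4]
near_zero_predictions = [0.5, 1.0, 0.2, 0.8]  # predictions pulled toward 0
constant_predictions = [3, 3, 3, 3]           # always guess the midpoint

print('RMSE, near-zero predictions:', rmse(actual, near_zero_predictions))
print('RMSE, constant prediction:  ', rmse(actual, constant_predictions))
```

In this toy case the near-zero predictions give a much larger RMSE than the constant guess.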

# 2.

|Method|RMSE|
|:----|:--------:|
|Baseline, $Y_p$=3|1.35 |
|Baseline, $Y_p=\mu_u$|1.20 |
|Content based, item-item|0.98 |
|Collaborative, cosine|0.90 |
|Collaborative, jaccard, $M_r\geq 3$|0.95  |
|Collaborative, jaccard, $M_r\geq 1$|0.92  |
|Collaborative, jaccard, $M_r$|1.05  |
|Sklearn NMF |2.87  |

We can observe that NMF performs the worst of all the methods, worse even than the constant baseline. The most likely reason is the way the missing data was handled. The ratings matrix in a typical movie ratings dataset is sparse, because most users have rated only a small fraction of the movies, and here the missing entries were filled with zeros. sklearn's NMF has no notion of a missing entry, so it fits those zeros as if they were real ratings, which pulls the predictions toward 0 and away from the 1 to 5 range of the test ratings. Similarity-based methods and collaborative filtering can handle this sparsity more effectively.
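
The effect of zero-filling a sparse matrix can be sketched on a small example. The matrix below is hypothetical (3 users, 4 movies, made-up ratings) and only illustrates how far zero-filling shifts the values a model is asked to reproduce.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings matrix; NaN marks a movie the user has not rated
toy = pd.DataFrame(
    [[5, np.nan, np.nan, 4],
     [np.nan, 3, np.nan, np.nan],
     [4, np.nan, 5, np.nan]],
    index=[1, 2, 3],           # user IDs (assumed)
    columns=[10, 20, 30, 40],  # movie IDs (assumed)
    dtype=float,
)

# Fraction of entries that are missing
sparsity = toy.isna().to_numpy().mean()

# Mean of the ratings that were actually given
observed_mean = np.nanmean(toy.to_numpy())

# Mean of the matrix after filling the gaps with zeros
zero_filled_mean = toy.fillna(0).to_numpy().mean()

print('Sparsity:        ', sparsity)
print('Observed mean:   ', observed_mean)
print('Zero-filled mean:', zero_filled_mean)
```

Here the observed ratings average 4.2, but the zero-filled matrix averages 1.75, so a factorization fitted to the filled matrix is pulled toward much lower values than any real rating.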

One possible solution is to introduce regularization. Regularization can help mitigate overfitting, which is a common problem when dealing with sparse data, and adding an L1 or L2 penalty to the NMF cost function is one way to apply it. Another option is to use a library built specifically for recommendation systems, such as Surprise (https://surpriselib.com), whose matrix factorization models are trained on the observed ratings only.