# Movie Ratings Prediction using Non-Negative Matrix Factorization

## Project Overview

### Objective
This project explores the application of **Non-Negative Matrix Factorization (NMF)** for predicting missing movie ratings and investigates the limitations of sklearn's NMF implementation compared to traditional recommender system approaches.

### Problem Statement
**Limitation(s) of sklearn's non-negative matrix factorization library** 

#### Part 1: NMF Implementation 
- Load the movie ratings dataset (from HW3-recommender-system)
- Apply sklearn's Non-Negative Matrix Factorization (NMF) technique
- Predict missing ratings from the test data
- Measure prediction performance using **RMSE**

#### Part 2: Critical Analysis 
- Analyze and discuss the results obtained from sklearn's NMF
- Compare performance with baseline and similarity-based methods from Module 3
- Identify why NMF may underperform compared to simpler approaches
- Propose potential solutions and improvements to enhance NMF performance

### Dataset
- [MovieLens](https://grouplens.org/datasets/movielens/) currently offers a dataset with 25 million user-movie ratings. Because the full dataset is too large for this exercise, we use the 1-million-rating subset [MovieLens 1M](https://www.kaggle.com/odedgolden/movielens-1m-dataset), reformatted to make it more convenient to use.




## 1. Load the movie ratings dataset

In [21]:

import pandas as pd
from collections import namedtuple
import numpy as np
from scipy.sparse import coo_matrix
from sklearn.decomposition import NMF
import matplotlib.pyplot as plt
import seaborn as sns

MV_users = pd.read_csv('../data/users.csv')
MV_movies = pd.read_csv('../data/movies.csv')
train = pd.read_csv('../data/train.csv')
test = pd.read_csv('../data/test.csv')


Data = namedtuple('Data', ['users','movies','train','test'])
data = Data(MV_users, MV_movies, train, test)
# print(data.movies.info())
# print(data.users.info())
print(data.train.head())    
print(data.test.head())

    uID   mID  rating
0   744  1210       5
1  3040  1584       4
2  1451  1293       5
3  5455  3176       2
4  2507  3074       5
    uID   mID  rating
0  2233   440       4
1  4274   587       5
2  2498   454       3
3  2868  2336       5
4  1636  2686       5


## 2. Part 1: NMF Implementation [10 pts]

### 2.1 Create Utility Matrix

First, we need to convert the training data into a user-item rating matrix that NMF can process.


In [22]:

# Create mapping dictionaries
allusers = list(data.users['uID'])
allmovies = list(data.movies['mID'])

mid2idx = dict(zip(data.movies.mID, list(range(len(data.movies)))))
uid2idx = dict(zip(data.users.uID, list(range(len(data.users)))))

# Convert train data to utility matrix
ind_movie = [mid2idx[x] for x in data.train.mID]
ind_user = [uid2idx[x] for x in data.train.uID]
rating_train = list(data.train.rating)

# Create rating matrix as a dense numpy array (unrated entries are 0)
Mr = coo_matrix((rating_train, (ind_user, ind_movie)), shape=(len(allusers), len(allmovies))).toarray()

print(f"Utility Matrix Shape: {Mr.shape}")
print(f"Matrix sparsity: {(Mr == 0).sum() / Mr.size * 100:.2f}%")


Utility Matrix Shape: (6040, 3883)
Matrix sparsity: 97.01%


### 2.2 Apply NMF for Matrix Factorization

NMF decomposes the rating matrix R into two non-negative matrices:
- **W (User-Factor matrix)**: Shape (n_users, n_components)
- **H (Factor-Movie matrix)**: Shape (n_components, n_movies)

The predicted ratings are obtained as **R_pred = W × H**, here with `n_components = 20`.



In [23]:
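# Unrated entries are stored as 0. NMF accepts zeros (they are non-negative), so the
# fill below is a heuristic, not a requirement: the intent is to mark "unrated" with a
# small positive value. Either way, NMF fits these entries as if they were observed.
# Note: Mr appears to have an integer dtype (ratings are ints), in which case assigning
# 0.01 is truncated to 0 and the fill has no effect; cast with Mr.astype(float) first
# if a non-zero placeholder is really wanted.
Mr_nmf = Mr.copy()
Mr_nmf[Mr_nmf == 0] = 0.01  # Placeholder for unrated entries

# Fit NMF with a fixed number of latent factors
n_components = 20  # Number of latent factors
print(f"Number of components (latent factors): {n_components}")
print()

# Initialize and fit the NMF model (random init, up to 500 iterations)
print("Training NMF model...")
nmf_model = NMF(n_components=n_components, init='random', random_state=42, max_iter=500, tol=1e-4, verbose=0)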

W = nmf_model.fit_transform(Mr_nmf)
H = nmf_model.components_

print(f"\nW (User-Factor) matrix shape: {W.shape}")
print(f"H (Factor-Movie) matrix shape: {H.shape}")


Number of components (latent factors): 20

Training NMF model...

W (User-Factor) matrix shape: (6040, 20)
H (Factor-Movie) matrix shape: (20, 3883)


### 2.3 Predict Missing Ratings on Test Set


In [24]:
# Reconstruct the full rating matrix
R_pred = np.dot(W, H)

# Predict ratings for test set
predictions = []
for idx, row in data.test.iterrows():
    uid = row['uID']
    mid = row['mID']
    
    user_idx = uid2idx[uid]
    movie_idx = mid2idx[mid]
    
    predicted_rating = R_pred[user_idx, movie_idx]
    predictions.append(predicted_rating)

predictions = np.array(predictions)

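# Guard against NaN predictions by falling back to the scale midpoint (3.0);
# this is a fixed constant, not the mean rating computed from the training data
predictions[np.isnan(predictions)] = 3.0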

# Clip predictions to valid rating range [1, 5]
predictions = np.clip(predictions, 1, 5)
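
The loop above materializes the full `W × H` matrix and then reads one entry per test row. As a minimal sketch of an alternative, the same predictions can be computed only for the requested (user, movie) pairs with a row-wise dot product. The helper name `predict_pairs` and the toy factors are hypothetical and used only for illustration; they are not part of the original notebook.

```python
import numpy as np

def predict_pairs(W, H, user_idx, movie_idx, lo=1.0, hi=5.0):
    """Predict ratings only for the requested (user, movie) index pairs."""
    user_idx = np.asarray(user_idx)
    movie_idx = np.asarray(movie_idx)
    # Row-wise dot product of W[u] and H[:, m]; avoids building the full W @ H.
    preds = np.einsum("ij,ij->i", W[user_idx], H[:, movie_idx].T)
    # Clip to the valid rating range, as in the notebook cell.
    return np.clip(preds, lo, hi)

# Toy factors standing in for the fitted W and H (hypothetical values).
rng = np.random.default_rng(0)
W_toy = rng.random((4, 2))
H_toy = rng.random((2, 3))
pairs_u = [0, 3, 1]
pairs_m = [2, 0, 1]

fast = predict_pairs(W_toy, H_toy, pairs_u, pairs_m)
# Reference: build the full matrix, then index it (what the loop does).
slow = np.clip((W_toy @ H_toy)[pairs_u, pairs_m], 1.0, 5.0)
print(np.allclose(fast, slow))
```

With the notebook's names, the index arrays would be built as `[uid2idx[u] for u in data.test.uID]` and `[mid2idx[m] for m in data.test.mID]`.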


### 2.4 Calculate RMSE


In [25]:
# Calculate RMSE
actual_ratings = np.array(data.test.rating)
rmse_nmf = np.sqrt(((actual_ratings - predictions)**2).mean())

print(f"NMF RMSE: {rmse_nmf:.4f}")

NMF RMSE: 2.5321


## 3. Part 2: Critical Analysis and Comparison 




| Rank | Method | RMSE | RMSE increase vs. best |
|------|--------|------|-------------|
| 1 | Collaborative Filtering (Jaccard, Mr) | **0.952** | Best |
| 2 | Collaborative Filtering (Jaccard, Mr≥3) | 0.982 | +3.2% |
| 3 | Collaborative Filtering (Jaccard, Mr≥1) | 0.991 | +4.1% |
| 4 | Content-Based (Jaccard) | 1.012 | +6.3% |
| 5 | Collaborative Filtering (Cosine) | 1.024 | +7.6% |
| 6 | Baseline: User average | 1.035 | +8.7% |
| 7 | Baseline: Predict all to 3 | 1.258 | +32.1% |
| 8 | **NMF (sklearn)** | **2.532** | **+166.0%** |

**Key Findings:**
- Best HW3 method: **0.952** (Collaborative Filtering with Jaccard on Mr)
- NMF (sklearn): **2.532** 
- NMF performs **worse than even the simplest baseline** (predict 3 for every rating, RMSE 1.258), and by a wide margin

**Conclusion:** sklearn's NMF significantly underperforms compared to all HW3 methods, including basic baselines. This indicates fundamental issues with applying generic NMF to sparse rating matrices.


### Why sklearn's NMF Underperforms


sklearn's NMF performed poorly (RMSE: 2.5321) compared to similarity-based methods and even the simplest baselines for two primary reasons:

**Different Objective:** NMF is an unsupervised dimensionality reduction technique. sklearn's implementation minimizes reconstruction error over every entry of the input matrix, including the unrated ones, rather than prediction error (RMSE) on held-out ratings. In contrast, similarity-based methods predict an unknown rating directly from the ratings of similar users or items, which aligns better with the task's goal.
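
The mismatch can be written out explicitly. As a sketch, assuming sklearn's default Frobenius loss and ignoring any regularization terms, the fitted objective sums over all user-movie cells, while the evaluation metric averages only over the test pairs (the symbol $\mathcal{T}$ for the set of test pairs is introduced here for illustration):

```latex
\underbrace{\min_{W \ge 0,\; H \ge 0} \; \tfrac{1}{2} \sum_{u=1}^{n_\text{users}} \sum_{i=1}^{n_\text{movies}} \bigl( R_{ui} - (WH)_{ui} \bigr)^2}_{\text{what NMF fits: every cell, rated or not}}
\qquad \text{vs.} \qquad
\underbrace{\mathrm{RMSE} = \sqrt{ \frac{1}{|\mathcal{T}|} \sum_{(u,i) \in \mathcal{T}} \bigl( r_{ui} - (WH)_{ui} \bigr)^2 }}_{\text{what is evaluated: held-out rated cells only}}
```

Since about 97% of the cells in the utility matrix are unrated placeholders, the left-hand sum is dominated by terms that pull $(WH)_{ui}$ toward values near zero, not toward the 1 to 5 rating range.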

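**Sparsity Handling:** sklearn's NMF has no notion of missing entries, so it is not designed for the high sparsity of recommendation datasets (about 97% of this utility matrix is unrated). Whether unrated entries are left at 0 or replaced with a small value such as 0.01, the model treats them as observed near-zero ratings. This injects inaccurate information into the fit and pulls predictions for unseen user-movie pairs toward the bottom of the rating scale.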
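
One way to keep NMF but stop it from fitting the unrated cells is to weight the loss with a 0/1 mask. The following is a minimal sketch of weighted multiplicative updates on a toy matrix; it is not sklearn's algorithm, and the function names `masked_nmf` and `masked_rmse` as well as the toy ratings are hypothetical. A dense mask is used for readability; a real 6040 × 3883 run would likely want a sparse formulation.

```python
import numpy as np

def masked_nmf(R, mask, n_components=2, n_iter=200, eps=1e-9, seed=0):
    """NMF that minimizes squared error on observed entries only (mask == 1)."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    # Strictly positive random initialization keeps the updates well defined.
    W = rng.random((n_users, n_components)) + 0.1
    H = rng.random((n_components, n_items)) + 0.1
    R_obs = R * mask
    for _ in range(n_iter):
        # Multiplicative updates with the reconstruction masked to observed cells.
        WH = (W @ H) * mask
        W *= (R_obs @ H.T) / (WH @ H.T + eps)
        WH = (W @ H) * mask
        H *= (W.T @ R_obs) / (W.T @ WH + eps)
    return W, H

def masked_rmse(R, mask, W, H):
    """RMSE computed over observed entries only."""
    err = (R - W @ H) * mask
    return np.sqrt((err ** 2).sum() / mask.sum())

# Toy ratings; 0 marks "unrated" (hypothetical data).
R_toy = np.array([[5, 4, 0, 1],
                  [4, 0, 1, 1],
                  [1, 1, 0, 5],
                  [0, 1, 5, 4]], dtype=float)
mask_toy = (R_toy > 0).astype(float)

W0, H0 = masked_nmf(R_toy, mask_toy, n_iter=0)   # initialization only
W_fit, H_fit = masked_nmf(R_toy, mask_toy)       # after 200 updates
rmse_init = masked_rmse(R_toy, mask_toy, W0, H0)
rmse_fit = masked_rmse(R_toy, mask_toy, W_fit, H_fit)
print(f"observed-entry RMSE: {rmse_init:.3f} -> {rmse_fit:.3f}")
```

Because the mask zeroes out unrated cells in both the numerator and the denominator of each update, those cells contribute nothing to the fit, and `W_fit @ H_fit` at an unrated position is a genuine prediction, not a reconstruction of a placeholder.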

**How to Fix It**
- Use specialized algorithms: instead of sklearn's NMF, use matrix factorization algorithms designed for recommendation, such as SVD-style latent factor models (SVD, Singular Value Decomposition, as the term is used in recommender systems) or ALS (Alternating Least Squares). These fit only the observed ratings, so they are suited to predicting missing entries in sparse matrices. Libraries like Surprise provide efficient implementations of such models.
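
To illustrate the ALS idea mentioned above, here is a minimal NumPy sketch that alternates ridge-regression solves for user and item factors using only the observed ratings. It is not the Surprise implementation; the function name `als_observed`, the regularization value, and the toy data are assumptions made for this example.

```python
import numpy as np

def als_observed(R, mask, n_factors=2, reg=0.1, n_iter=20, seed=0):
    """Alternating least squares on observed entries, with L2 regularization."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, n_factors))
    V = rng.normal(scale=0.1, size=(n_items, n_factors))
    ridge = reg * np.eye(n_factors)
    for _ in range(n_iter):
        # Fix item factors; solve a small ridge regression per user.
        for u in range(n_users):
            rated = mask[u] > 0
            if rated.any():
                Vu = V[rated]
                U[u] = np.linalg.solve(Vu.T @ Vu + ridge, Vu.T @ R[u, rated])
        # Fix user factors; solve a small ridge regression per item.
        for i in range(n_items):
            raters = mask[:, i] > 0
            if raters.any():
                Ui = U[raters]
                V[i] = np.linalg.solve(Ui.T @ Ui + ridge, Ui.T @ R[raters, i])
    return U, V

def observed_rmse(R, mask, U, V):
    """RMSE over observed entries only."""
    err = (R - U @ V.T) * mask
    return np.sqrt((err ** 2).sum() / mask.sum())

# Toy ratings; 0 marks "unrated" (hypothetical data).
R_toy = np.array([[5, 4, 0, 1],
                  [4, 0, 1, 1],
                  [1, 1, 0, 5],
                  [0, 1, 5, 4]], dtype=float)
mask_toy = (R_toy > 0).astype(float)

U0, V0 = als_observed(R_toy, mask_toy, n_iter=0)   # initialization only
U_fit, V_fit = als_observed(R_toy, mask_toy)
rmse_init = observed_rmse(R_toy, mask_toy, U0, V0)
rmse_fit = observed_rmse(R_toy, mask_toy, U_fit, V_fit)
print(f"observed-entry RMSE: {rmse_init:.3f} -> {rmse_fit:.3f}")
```

Unlike NMF, this sketch places no sign constraint on the factors; in practice, user and item bias terms and a held-out split for tuning `reg` and `n_factors` would usually be added.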
