---

# Part 2: Limitation of sklearn's Non-negative Matrix Factorization Library

In [None]:
import pandas as pd
import numpy as np
from sklearn.decomposition import NMF
from sklearn.metrics import mean_squared_error
from scipy.sparse import coo_matrix

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
users = pd.read_csv('users.csv')
movies = pd.read_csv('movies.csv')

train.head()

| | uID | mID | rating |
|--:|:---:|:---:|:------:|
| 0 | 744 | 1210 | 5 |
| 1 | 3040 | 1584 | 4 |
| 2 | 1451 | 1293 | 5 |
| 3 | 5455 | 3176 | 2 |
| 4 | 2507 | 3074 | 5 |


In [None]:
all_user_ids = list(users['uID'])
all_movie_ids = list(movies['mID'])

uid2idx = {uid: idx for idx, uid in enumerate(all_user_ids)}
mid2idx = {mid: idx for idx, mid in enumerate(all_movie_ids)}

n_users = len(all_user_ids)
n_movies = len(all_movie_ids)

print(f"Users: {n_users}, Movies: {n_movies}")
print(f"Train ratings: {len(train)}, Test ratings: {len(test)}")
print(f"Sparsity: {(1 - len(train) / (n_users * n_movies)) * 100:.2f}%")

Users: 6040, Movies: 3883
Train ratings: 700146, Test ratings: 300063
Sparsity: 97.01%


In [None]:
def build_rating_matrix(train_df, uid2idx, mid2idx, n_users, n_movies):
    row_indices = [uid2idx[uid] for uid in train_df['uID']]
    col_indices = [mid2idx[mid] for mid in train_df['mID']]
    ratings = list(train_df['rating'])
    matrix = coo_matrix((ratings, (row_indices, col_indices)),
                        shape=(n_users, n_movies)).toarray()
    return matrix

R = build_rating_matrix(train, uid2idx, mid2idx, n_users, n_movies)
R.shape

(6040, 3883)

In [None]:
def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

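def predict_test(test_df, R_pred, uid2idx, mid2idx, clip=True):
    # Map raw user/movie IDs to row/column indices of the prediction matrix
    u_idx = test_df['uID'].map(uid2idx).to_numpy()
    m_idx = test_df['mID'].map(mid2idx).to_numpy()
    # Look up all (user, movie) predictions at once instead of row by row
    predictions = R_pred[u_idx, m_idx]
    if clip:
        # Ratings are on a 1-5 scale, so force predictions into that range
        predictions = np.clip(predictions, 1, 5)
    return predictions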

test_actual = test['rating'].values

## 1. Apply sklearn NMF and Measure RMSE

In [None]:
results = []

for n_comp in [5, 10, 20, 50, 100]:
    nmf = NMF(n_components=n_comp, init='random', random_state=20250523, max_iter=200)
    W = nmf.fit_transform(R)
    H = nmf.components_
    R_pred = np.dot(W, H)

    test_pred = predict_test(test, R_pred, uid2idx, mid2idx)
    test_rmse = rmse(test_actual, test_pred)

    results.append({'n_components': n_comp, 'rmse': test_rmse})

results_df = pd.DataFrame(results)
results_df



| | n_components | rmse |
|--:|:------------:|:----:|
| 0 | 5 | 2.608636 |
| 1 | 10 | 2.56681 |
| 2 | 20 | 2.534859 |
| 3 | 50 | 2.55517 |
| 4 | 100 | 2.603914 |


In [None]:
best_n_comp = int(results_df.loc[results_df['rmse'].idxmin(), 'n_components'])

nmf_best = NMF(n_components=best_n_comp, init='random', random_state=20250523, max_iter=200)
W_best = nmf_best.fit_transform(R)
H_best = nmf_best.components_
R_pred_best = np.dot(W_best, H_best)

test_pred_best = predict_test(test, R_pred_best, uid2idx, mid2idx)
nmf_rmse = rmse(test_actual, test_pred_best)

print(f"Best n_components: {best_n_comp}")
print(f"NMF Test RMSE: {nmf_rmse:.4f}")



Best n_components: 20
NMF Test RMSE: 2.5349


In [None]:
# Check prediction distribution for a sample user
sample_user_idx = 0
sample_ratings = R[sample_user_idx, :]
sample_preds = R_pred_best[sample_user_idx, :]

rated_mask = sample_ratings > 0

print("Predictions for rated movies:")
print(f"  Mean: {sample_preds[rated_mask].mean():.3f}, Range: [{sample_preds[rated_mask].min():.3f}, {sample_preds[rated_mask].max():.3f}]")
print("Predictions for unrated movies:")
print(f"  Mean: {sample_preds[~rated_mask].mean():.3f}, Range: [{sample_preds[~rated_mask].min():.3f}, {sample_preds[~rated_mask].max():.3f}]")

Predictions for rated movies:
  Mean: 0.864, Range: [0.000, 2.142]
Predictions for unrated movies:
  Mean: 0.065, Range: [0.000, 2.091]


## 2. Discussion

### Comparison with HW3 Methods

| Method | RMSE |
|:-------|:----:|
| Baseline, Yp=3 | 1.26 |
| Baseline, Yp=user mean | 1.04 |
| Content based, Jaccard | 1.01 |
| Collaborative, cosine | 1.03 |
| Collaborative, Jaccard | <0.96 |
| **sklearn NMF** | **2.53** |

sklearn NMF performed far worse than every other method, including the constant baseline (Yp=3): its best RMSE of about 2.53 is roughly twice the baseline's 1.26.

### Why sklearn NMF did not work

The main reason is that sklearn NMF treats zeros as actual ratings rather than as missing values. In our rating matrix, 0 means the user did not rate the movie, but NMF still tries to reconstruct these zeros. Because about 97% of the matrix is zero, the reconstruction is pulled toward 0 almost everywhere.

As shown above for the sample user, predictions for unrated movies have a mean close to 0 (0.065), and even the predictions for rated movies are far too low (mean 0.864, maximum 2.142). The test pairs are entries that are unrated (0) in the training matrix, so many test predictions are near 0 and get clipped up to 1, which gives a very poor RMSE.
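To make the effect of near-zero predictions concrete, here is a minimal sketch. The five target ratings are borrowed from the `train.head()` output purely as toy values, and `raw_preds` holds made-up near-zero predictions (an assumption for illustration, not output from the model above). After applying the same 1-5 clipping as `predict_test`, every prediction becomes 1, so the error is just the gap between 1 and the true rating.

```python
import numpy as np

# Five ratings borrowed from the train.head() output, used only as toy targets
actual = np.array([5, 4, 5, 2, 5], dtype=float)

# Hypothetical near-zero predictions, similar in scale to what NMF
# produced for unrated entries of the sample user
raw_preds = np.array([0.07, 0.00, 0.12, 0.03, 0.30])

# Same clipping rule as predict_test: force predictions into the 1-5 range
clipped = np.clip(raw_preds, 1, 5)

toy_rmse = np.sqrt(np.mean((actual - clipped) ** 2))
print(clipped)
print(f"Toy RMSE: {toy_rmse:.3f}")
```

In this toy case the RMSE is about 3.41. The measured test RMSE is lower than that because some real predictions land above 1, but the mechanism is the same.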

In proper recommender system matrix factorization, the loss function should only consider observed ratings:

$$\min_{W,H} \sum_{(u,i) \in \text{observed}} (R_{ui} - W_u \cdot H_i)^2 + \lambda(\|W\|^2 + \|H\|^2)$$

But sklearn NMF, with its default Frobenius loss, minimizes the reconstruction error over all entries, including the zeros:

$$\min_{W,H} \|R - WH\|_F^2$$

This fundamental difference makes sklearn NMF unsuitable for rating prediction.
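The difference between the two objectives can be seen on a tiny example. The sketch below uses a made-up 2x3 matrix `R_toy` and a hypothetical reconstruction `R_hat` (both are illustrative assumptions) and evaluates the observed-only squared error, without the regularization term, next to the squared error over all cells.

```python
import numpy as np

# Toy rating matrix; 0 marks a missing rating, as in R above
R_toy = np.array([[5.0, 0.0, 4.0],
                  [0.0, 3.0, 0.0]])
observed = R_toy > 0

# Hypothetical reconstruction: matches every observed rating exactly and
# predicts plausible ratings for the missing cells
R_hat = np.array([[5.0, 4.0, 4.0],
                  [3.0, 3.0, 3.0]])

# Observed-only squared error (recommender objective without the regularizer)
masked_loss = np.sum((R_toy - R_hat)[observed] ** 2)

# Squared error over all cells (the Frobenius form, up to a constant factor)
full_loss = np.sum((R_toy - R_hat) ** 2)

print(masked_loss, full_loss)
```

The reconstruction is perfect under the observed-only loss (0) but is penalized heavily under the full loss (34), because each plausible prediction for a missing cell is counted as an error against 0. Minimizing the full loss therefore pushes predictions for unrated cells toward 0.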

### Possible fixes

1. Use recommender system libraries like Surprise or implicit that properly handle missing values

2. Fill missing values with user mean before applying NMF (shown below)

3. Implement weighted NMF that only considers observed entries
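As a rough illustration of the third option, the sketch below applies a masked variant of the multiplicative update rules commonly used for NMF, so that only observed entries contribute to the fit. The function `masked_nmf`, its parameters, and the 4x4 toy matrix are all hypothetical and are not part of this notebook's pipeline. The sketch omits the regularization term from the observed-only objective and has not been tuned or run on the MovieLens data.

```python
import numpy as np

def masked_nmf(R, n_components=2, n_iter=200, eps=1e-9, seed=0):
    """Factorize R ~ W @ H using only entries where R > 0 (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    M = (R > 0).astype(float)  # 1 for observed ratings, 0 for missing ones
    # Strictly positive random initialization keeps the updates well defined
    W = rng.random((R.shape[0], n_components)) + 0.1
    H = rng.random((n_components, R.shape[1])) + 0.1
    for _ in range(n_iter):
        # The mask is applied to both the target and the reconstruction,
        # so missing cells never pull the factors toward 0
        W *= ((M * R) @ H.T) / ((M * (W @ H)) @ H.T + eps)
        H *= (W.T @ (M * R)) / (W.T @ (M * (W @ H)) + eps)
    return W, H

# Made-up ratings; 0 marks a missing entry
R_toy = np.array([[5.0, 4.0, 0.0, 1.0],
                  [4.0, 0.0, 1.0, 1.0],
                  [1.0, 1.0, 0.0, 5.0],
                  [0.0, 1.0, 5.0, 4.0]])

W, H = masked_nmf(R_toy)
R_hat = W @ H
observed = R_toy > 0
masked_rmse = np.sqrt(np.mean((R_toy - R_hat)[observed] ** 2))

print(np.round(R_hat, 2))
print(f"RMSE on observed entries: {masked_rmse:.3f}")
```

Unlike the plain factorization, the values this produces for the missing cells are not forced toward 0, so they can be read as rating predictions. On the full rating matrix one would also want regularization and a held-out check for overfitting.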

In [None]:
# Try filling missing values with user mean
R_filled = R.copy().astype(float)

for i in range(n_users):
    user_ratings = R[i, :]
    rated = user_ratings > 0
    if rated.sum() > 0:
        user_mean = user_ratings[rated].mean()
        R_filled[i, ~rated] = user_mean
    else:
        R_filled[i, :] = 3

nmf_filled = NMF(n_components=20, init='random', random_state=42, max_iter=200)
W_filled = nmf_filled.fit_transform(R_filled)
H_filled = nmf_filled.components_
R_pred_filled = np.dot(W_filled, H_filled)

test_pred_filled = predict_test(test, R_pred_filled, uid2idx, mid2idx)
nmf_filled_rmse = rmse(test_actual, test_pred_filled)

print(f"NMF with user mean filling RMSE: {nmf_filled_rmse:.4f}")



NMF with user mean filling RMSE: 0.9746


In [None]:
# Summary
summary_df = pd.DataFrame({
    'Method': ['Baseline (Yp=3)', 'Baseline (user mean)', 'Content based',
               'Collaborative cosine', 'Collaborative Jaccard',
               'sklearn NMF', 'sklearn NMF (filled)'],
    'RMSE': [1.26, 1.04, 1.01, 1.03, 0.96, nmf_rmse, nmf_filled_rmse]
})
summary_df

| | Method | RMSE |
|--:|:-------|:----:|
| 0 | Baseline (Yp=3) | 1.26 |
| 1 | Baseline (user mean) | 1.04 |
| 2 | Content based | 1.01 |
| 3 | Collaborative cosine | 1.03 |
| 4 | Collaborative Jaccard | 0.96 |
| 5 | sklearn NMF | 2.534859 |
| 6 | sklearn NMF (filled) | 0.974572 |


## 3. Conclusion

- sklearn NMF is a general-purpose matrix factorization tool that is not designed for recommender systems: it cannot distinguish a "zero rating" from a "missing rating". Filling missing entries with the user mean lowers the RMSE from 2.53 to 0.97, which beats both baselines, the content-based method, and cosine collaborative filtering, but is still slightly worse than the best similarity-based method (Collaborative Jaccard, 0.96). For rating prediction tasks, specialized libraries that properly handle missing values should be used.

## 4. Reference

- OpenAI. (2025). ChatGPT (Version GPT-5) [Large language model for partial translation]. OpenAI. https://chat.openai.com/

- Lee, D. D., & Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. *Nature*, 401(6755), 788-791.

- Koren, Y., Bell, R., & Volinsky, C. (2009). Matrix factorization techniques for recommender systems. *Computer*, 42(8), 30-37.

- Luo, X., Zhou, M., Xia, Y., & Zhu, Q. (2014). An efficient non-negative matrix-factorization-based approach to collaborative filtering for recommender systems. *IEEE Transactions on Industrial Informatics*, 10(2), 1273-1284.

- Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... & Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. *Journal of Machine Learning Research*, 12, 2825-2830.

- Harper, F. M., & Konstan, J. A. (2015). The MovieLens datasets: History and context. *ACM Transactions on Interactive Intelligent Systems*, 5(4), 1-19.