# Week 4 Matrix Factorization
Movie Ratings Data
Download the movie data and place it in the `data_movies` folder before running the notebook.

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import time
import seaborn as sns
from collections import Counter

from itertools import permutations

from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.decomposition import NMF
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import StackingClassifier



## Step 1: Load Data
Make sure your CSV files are loaded correctly:

In [4]:
MV_users = pd.read_csv('data_movies/users.csv')
MV_movies = pd.read_csv('data_movies/movies.csv')
train = pd.read_csv('data_movies/train.csv')
test = pd.read_csv('data_movies/test.csv')

print(test.info())


## Step 2: Prepare the Data
We'll create a user-item matrix for training data.

In [None]:
# Create a user-item matrix
train_matrix = train.pivot(index='uID', columns='mID', values='rating')

# Fill missing values with 0 so NMF receives a complete matrix
# (note: NMF will then treat these zeros as real ratings, not as missing data)
train_matrix_filled = train_matrix.fillna(0)


## Step 3: Apply Matrix Factorization
We'll use Non-negative Matrix Factorization (NMF) as an example.

In [None]:
from sklearn.decomposition import NMF

# Apply NMF
nmf_model = NMF(n_components=20, init='random', random_state=42, max_iter=500)
W = nmf_model.fit_transform(train_matrix_filled)
H = nmf_model.components_

# Reconstruct the ratings matrix
predicted_ratings = np.dot(W, H)


## Step 4: Predict Missing Ratings
Now, predict the ratings for the test set using the reconstructed matrix.

In [None]:
# Convert the predicted ratings matrix to DataFrame for easy access
predicted_ratings_df = pd.DataFrame(predicted_ratings, index=train_matrix.index, columns=train_matrix.columns)

# Prepare the test data
# test['predicted_rating'] = test.apply(lambda row: predicted_ratings_df.loc[row['uID'], row['mID']], axis=1)

print(predicted_ratings_df.index)  # Check the user IDs in the predicted ratings
print(predicted_ratings_df.columns)  # Check the movie IDs in the predicted ratings
print(test['uID'].unique())  # Check the unique user IDs in the test set
print(test['mID'].unique())  # Check the unique movie IDs in the test set

test['uID'] = test['uID'].astype(predicted_ratings_df.index.dtype)
test['mID'] = test['mID'].astype(predicted_ratings_df.columns.dtype)

def get_predicted_rating(row):
    try:
        return predicted_ratings_df.loc[row['uID'], row['mID']]
    except KeyError:
        return np.nan  # Or some default value

test['predicted_rating'] = test.apply(get_predicted_rating, axis=1)



## Step 5: Measure RMSE
Finally, calculate the RMSE between the actual and predicted ratings.

In [None]:
# Calculate RMSE
from sklearn.metrics import mean_squared_error

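# Drop test rows whose user or movie was not seen in training (no prediction available),
# so the RMSE below covers only the rows that received a prediction
test = test.dropna(subset=['predicted_rating'])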

rmse = np.sqrt(mean_squared_error(test['rating'], test['predicted_rating']))
print(f"RMSE: {rmse:.4f}")



## Discussion
### Discussion of Results

The RMSE of 2.8538 means the typical prediction error is close to three rating points, so the Non-Negative Matrix Factorization (NMF) model performed poorly. Here are some reasons why this might be the case:

### 1. **Complexity of NMF:**
   - **NMF Limitations:** NMF is a linear model that assumes non-negativity in both the factors and the original matrix. While it can be effective for certain types of data, its linear nature might not capture the more complex interactions between users and items (movies) as well as other methods.
   - **Overfitting:** NMF might also overfit the training data, especially if the number of components is not appropriately tuned. This overfitting can lead to poorer performance on the test set.
   - **Sensitivity to Sparsity:** Movie ratings datasets are typically sparse, with many missing ratings. Because the missing entries were filled with 0 in Step 2, NMF treats them as real zero ratings and pulls its reconstruction toward 0, which inflates the error on the ratings that users actually gave.
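
The effect of the zero fill from Step 2 can be seen on a small example. The sketch below uses a made-up toy table (the `toy_train` values are illustrative assumptions; only the column names `uID`, `mID`, and `rating` come from the notebook) and compares the mean of the observed ratings with the mean of the zero-filled matrix that NMF would actually be asked to reconstruct.

```python
import pandas as pd

# Toy ratings standing in for train.csv (values are made up for illustration)
toy_train = pd.DataFrame({
    "uID": [1, 1, 2, 2, 3, 3],
    "mID": [10, 20, 10, 30, 20, 30],
    "rating": [5, 3, 4, 2, 3, 1],
})

# Same pivot as in Step 2: users as rows, movies as columns
toy_matrix = toy_train.pivot(index="uID", columns="mID", values="rating")

# Fraction of user-movie cells that actually hold a rating
observed_fraction = toy_matrix.notna().to_numpy().mean()

# Mean of the ratings users gave vs. mean of the zero-filled matrix
mean_observed = toy_train["rating"].mean()
mean_zero_filled = toy_matrix.fillna(0).to_numpy().mean()

print(f"Observed fraction:     {observed_fraction:.2f}")
print(f"Mean observed rating:  {mean_observed:.2f}")
print(f"Mean after zero fill:  {mean_zero_filled:.2f}")
```

The sparser the matrix, the further the zero-filled mean falls below the mean of the real ratings, and a model fitted to the zero-filled matrix is likely to inherit that downward bias.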

### 2. **Comparison to Baseline or Similarity-Based Methods:**
   - **Baseline Models:** Simple baseline methods, such as predicting the mean rating for a movie or the average rating a user gives, can sometimes outperform more complex models when the data is sparse or the underlying patterns are simple.
   - **Similarity-Based Methods:** Methods like user-user or item-item collaborative filtering directly leverage similarities between users or items. These methods can be more effective when the relationships between users and items are straightforward or when there are clear clusters of similar users/items.
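
As a point of comparison, a movie-mean baseline takes only a few lines. This is a minimal sketch on made-up toy data; the helper names `movie_mean_baseline` and `rmse` are hypothetical, and the fallback to the global mean for unseen movies is one possible choice rather than the only one.

```python
import numpy as np
import pandas as pd

# Toy data standing in for train/test (values are made up; column names follow the notebook)
toy_train = pd.DataFrame({
    "uID": [1, 1, 2, 2, 3, 3],
    "mID": [10, 20, 10, 30, 20, 30],
    "rating": [5, 3, 4, 2, 3, 1],
})
toy_test = pd.DataFrame({
    "uID": [1, 2, 3],
    "mID": [30, 20, 40],  # movie 40 never appears in the training data
    "rating": [2, 3, 4],
})

def movie_mean_baseline(train_df, test_df):
    """Predict each test rating as the movie's mean training rating.

    Movies unseen in training fall back to the global mean rating.
    """
    global_mean = train_df["rating"].mean()
    movie_means = train_df.groupby("mID")["rating"].mean()
    return test_df["mID"].map(movie_means).fillna(global_mean)

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

baseline_pred = movie_mean_baseline(toy_train, toy_test)
baseline_rmse = rmse(toy_test["rating"], baseline_pred)
print(f"Baseline RMSE: {baseline_rmse:.4f}")
```

Running the same helper on the real `train` and `test` frames would give a reference RMSE that any factorization model should be expected to beat.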

### Ways to Improve the NMF Model

Here are some suggestions to improve the performance of NMF or alternative methods you might consider:

### 1. **Hyperparameter Tuning:**
   - **Number of Components:** Experiment with different numbers of latent components in the NMF model. Too few components might underfit, while too many might overfit.
   - **Regularization:** Introduce regularization to prevent overfitting. NMF in `sklearn` supports L1 and L2 penalties on both factor matrices (`W` and `H`); in recent versions these are controlled by the `alpha_W`, `alpha_H`, and `l1_ratio` parameters.

### 2. **Hybrid Methods:**
   - **Combine NMF with Similarity-Based Methods:** Consider combining NMF with user-based or item-based collaborative filtering. For example, you can use NMF to get initial estimates of ratings and then refine those estimates using similarity-based adjustments.
   - **Blending Models:** Blend predictions from NMF with those from simpler models like baseline predictors or similarity-based methods. This ensemble approach can leverage the strengths of multiple methods.
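
Blending can be as simple as a weighted average whose weight is chosen on a validation set. The sketch below uses made-up prediction vectors (`pred_nmf` and `pred_baseline` are hypothetical placeholders for the outputs of two models) and picks the weight from a coarse grid.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two arrays."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up validation ratings and predictions from two hypothetical models
y_val = np.array([4.0, 3.0, 5.0, 2.0, 1.0])
pred_nmf = np.array([3.5, 3.5, 4.0, 2.5, 2.0])
pred_baseline = np.array([4.5, 2.5, 4.5, 3.0, 1.0])

# Try blend weights from 0 (baseline only) to 1 (NMF only)
weights = np.linspace(0.0, 1.0, 11)
blend_rmses = [
    rmse(y_val, w * pred_nmf + (1.0 - w) * pred_baseline) for w in weights
]

best_index = int(np.argmin(blend_rmses))
best_weight = float(weights[best_index])
best_blend_rmse = blend_rmses[best_index]

print(f"NMF only:      {rmse(y_val, pred_nmf):.4f}")
print(f"Baseline only: {rmse(y_val, pred_baseline):.4f}")
print(f"Best blend:    {best_blend_rmse:.4f} at weight {best_weight:.1f}")
```

Because the grid includes both endpoints, the best blend on the validation set can never be worse than either model alone; whether the gain carries over to the test set still has to be checked.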

### 3. **Data Preprocessing:**
   - **Imputation:** Before applying NMF, experiment with different imputation strategies for missing data. For example, filling in missing values with a global mean, item-specific mean, or user-specific mean might help NMF perform better.
   - **Feature Engineering:** Introduce additional features, such as user demographics or movie genres, that can be incorporated into the model, either directly or as part of a hybrid approach.
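
The imputation options above can be sketched with plain pandas. The toy table is made up for illustration; on the real data the same lines would apply to `train_matrix` from Step 2.

```python
import pandas as pd

# Toy ratings (values are made up; column names follow the notebook)
toy_train = pd.DataFrame({
    "uID": [1, 1, 2, 2, 3, 3],
    "mID": [10, 20, 10, 30, 20, 30],
    "rating": [5, 3, 4, 2, 3, 1],
})
toy_matrix = toy_train.pivot(index="uID", columns="mID", values="rating")

# Global mean: every missing cell gets the overall average rating
global_filled = toy_matrix.fillna(toy_train["rating"].mean())

# Item mean: each missing cell gets that movie's average rating
# (fillna with a Series matches on column labels)
item_means = toy_matrix.mean(axis=0)
item_filled = toy_matrix.fillna(item_means)

# User mean: each missing cell gets that user's average rating
# (transpose so that users become columns, fill, then transpose back)
user_means = toy_matrix.mean(axis=1)
user_filled = toy_matrix.T.fillna(user_means).T

print(item_filled)
print(user_filled)
```

Any of these filled matrices is non-negative, so it can be passed to `NMF` in place of the zero-filled one.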

### 4. **Try Alternative Matrix Factorization Techniques:**
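   - **SVD (Singular Value Decomposition):** SVD is another matrix factorization technique that has no non-negativity constraint, so it can be applied to mean-centered ratings and might perform better on certain datasets. Plain SVD still needs a complete matrix, so missing entries must be handled first.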
   - **ALS (Alternating Least Squares):** ALS is commonly used in collaborative filtering, especially in large-scale recommendation systems, and might be more robust to sparsity.
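
One way to try an SVD-style factorization with scikit-learn is `TruncatedSVD` on user-mean-centered ratings. The sketch below runs on synthetic data and assumes a 1 to 5 rating scale, 5 components, and zero as the fill value after centering; all of these are illustrative assumptions rather than settings from the notebook.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)

# Synthetic integer ratings on an assumed 1-5 scale, 30 users x 20 movies
ratings = rng.integers(1, 6, size=(30, 20)).astype(float)

# Mark roughly 60% of the cells as observed; guarantee one rating per user
observed = rng.random(ratings.shape) < 0.6
observed[:, 0] = True

# Per-user mean over observed ratings only
masked = np.where(observed, ratings, np.nan)
user_means = np.nanmean(masked, axis=1, keepdims=True)

# Center observed ratings; unobserved cells become 0, i.e. "equal to the user's mean"
centered = np.where(observed, ratings - user_means, 0.0)

# Low-rank approximation of the centered matrix
svd = TruncatedSVD(n_components=5, random_state=42)
user_factors = svd.fit_transform(centered)   # shape (30, 5)
reconstructed = user_factors @ svd.components_ + user_means

# Keep predictions inside the assumed rating scale
predicted = np.clip(reconstructed, 1, 5)
print(predicted.shape)
```

After centering, a missing cell filled with 0 means "this user's average rating", which is a much milder assumption than a raw rating of 0.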

### 5. **Model Evaluation and Cross-Validation:**
   - **Cross-Validation:** Ensure that you are using proper cross-validation techniques to evaluate the performance of your model. This will help in obtaining a more reliable estimate of the model's true performance.
   - **Learning Curve Analysis:** Perform a learning curve analysis to understand how the model's performance scales with more data. This can help identify whether the model is underfitting or overfitting.
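
For rating data, cross-validation is usually done over individual rating rows rather than over users. The sketch below applies `KFold` to a synthetic ratings table and scores a movie-mean predictor on each fold; the data, the number of folds, and the helper name `predict_movie_mean` are assumptions chosen for illustration, and any other model could be substituted for the predictor.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)

# Synthetic ratings table (made-up values; column names follow the notebook)
ratings_df = pd.DataFrame({
    "uID": rng.integers(0, 20, size=200),
    "mID": rng.integers(0, 15, size=200),
    "rating": rng.integers(1, 6, size=200),
})

def predict_movie_mean(train_df, val_df):
    """Movie-mean predictor with a global-mean fallback for unseen movies."""
    global_mean = train_df["rating"].mean()
    movie_means = train_df.groupby("mID")["rating"].mean()
    return val_df["mID"].map(movie_means).fillna(global_mean)

fold_rmses = []
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in kfold.split(ratings_df):
    fold_train = ratings_df.iloc[train_idx]
    fold_val = ratings_df.iloc[val_idx]
    preds = predict_movie_mean(fold_train, fold_val)
    fold_rmse = float(np.sqrt(np.mean((fold_val["rating"] - preds) ** 2)))
    fold_rmses.append(fold_rmse)

print(f"RMSE per fold: {np.round(fold_rmses, 4)}")
print(f"Mean RMSE: {np.mean(fold_rmses):.4f} (std {np.std(fold_rmses):.4f})")
```

The spread across folds gives a rough sense of how much a single train/test split can move the reported RMSE.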

### Conclusion

While NMF can be a powerful tool, it is not the best choice for every dataset, particularly when the data is sparse or the relationships between users and items are complex. By tuning hyperparameters, handling missing ratings more carefully, combining models, and trying alternative techniques, you can improve performance and potentially achieve lower RMSE values.