The repo can be found here: https://github.com/havik2323/DTSA5510

### 1. Load the movie ratings data (as in the HW3-recommender-system) and use matrix factorization technique(s) and predict the missing ratings from the test data. Measure the RMSE. You should use sklearn library. [10 pts]


The script below predicts movie ratings using Non-Negative Matrix Factorization (NMF) and evaluates prediction accuracy with the Root Mean Squared Error (RMSE). It first loads the user, movie, training, and test datasets from CSV files. The column names are stored in variables and used to pivot the training data into a user-item matrix, where rows represent users, columns represent movies, and the values are the respective ratings. Missing values are filled with zeros because sklearn's NMF does not accept missing values.
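As a minimal sketch of the pivot step, the toy example below builds a user-item matrix from a few made-up ratings. The column names `uID`, `mID`, and `rating` mirror the script; the data and the `density` variable are illustrative assumptions, not part of the assignment data.

```python
import pandas as pd

# Toy ratings in long format (hypothetical data, same column names as the script).
toy_train = pd.DataFrame({
    "uID": [1, 1, 2, 3],
    "mID": [10, 20, 10, 30],
    "rating": [5, 3, 4, 2],
})

# Rows are users, columns are movies; unrated pairs become 0.
toy_matrix = toy_train.pivot(index="uID", columns="mID", values="rating").fillna(0)

# Fraction of cells that hold a real rating (4 of 9 here).
density = (toy_matrix > 0).to_numpy().mean()

print(toy_matrix)
print(f"density: {density:.2f}")
```

Checking the density on the real training matrix is a quick way to see how many of the values passed to NMF are filler zeros rather than ratings.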

The NMF model is then initialized with a specified number of components (20 in this case) and fitted to the user-item matrix, resulting in two matrices: user features and item features. These matrices are multiplied to obtain the predicted ratings for the entire user-item matrix. The predictions are converted into a DataFrame for easier access and manipulation.
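The factorization step can be sketched on a small matrix as follows. The matrix `R` is made up for illustration, and two components are used instead of 20 only because the toy matrix is tiny.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical 4-user x 5-movie matrix; 0 marks an unrated pair.
R = np.array([
    [5, 4, 0, 1, 0],
    [4, 0, 0, 1, 1],
    [1, 1, 0, 5, 4],
    [0, 1, 5, 4, 0],
], dtype=float)

model = NMF(n_components=2, init="random", random_state=42, max_iter=1000)
W = model.fit_transform(R)   # user features, shape (4, 2)
H = model.components_        # item features, shape (2, 5)
R_hat = W @ H                # reconstructed ratings, shape (4, 5)

print(np.round(R_hat, 2))
```

Both factors are non-negative, so every entry of `R_hat` is non-negative as well.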

To evaluate the model's performance, the script takes the true ratings from the test dataset and looks up the corresponding predicted ratings in the predictions DataFrame. If a user or movie in the test set does not appear in the training matrix, no prediction is available and that row is excluded, so the RMSE is computed only over the test rows that have a prediction. The RMSE compares the true ratings with the predicted ratings: a lower value means the predicted ratings are close to the actual ratings, and a higher value means a larger discrepancy between them. In this case, the RMSE was ~2.85, which indicates poor performance on the test set.
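The lookup-and-score step can be illustrated on a toy example. The prediction table `pred`, the test rows, and the `coverage` figure are hypothetical; the sketch only shows how rows without a prediction drop out before the RMSE is computed.

```python
import numpy as np
import pandas as pd

# Hypothetical prediction table: users as rows, movies as columns.
pred = pd.DataFrame([[4.5, 2.0], [3.0, 1.0]], index=[1, 2], columns=[10, 20])

# Hypothetical test set; movie 30 never appears in the prediction table.
toy_test = pd.DataFrame({"uID": [1, 2, 2], "mID": [10, 20, 30], "rating": [5, 3, 4]})

def lookup(row):
    """Return the predicted rating, or NaN if the user or movie is unknown."""
    try:
        return pred.loc[row["uID"], row["mID"]]
    except KeyError:
        return np.nan

toy_test["predicted_rating"] = toy_test.apply(lookup, axis=1)
scored = toy_test.dropna(subset=["predicted_rating"])

# RMSE = sqrt(mean((true - predicted)^2)) over the rows that have a prediction.
errors = scored["rating"] - scored["predicted_rating"]
rmse = float(np.sqrt((errors ** 2).mean()))

# Share of test rows that were actually scored.
coverage = len(scored) / len(toy_test)
print(f"RMSE: {rmse:.4f}, coverage: {coverage:.2f}")
```

Reporting the coverage next to the RMSE makes it clear how many test rows were left out of the score.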


### 2. Discuss the results and why sklearn's non-negative matrix factorization library did not work well compared to simple baseline or similarity-based methods we’ve done in Module 3. Can you suggest a way(s) to fix it? [10 pts]

Using sklearn's non-negative matrix factorization (NMF) model to predict movie ratings yielded an RMSE of 2.85, significantly worse than the similarity-based methods used in the previous project. Several factors contribute to this outcome. The most direct one is the zero-filling step: sklearn's NMF cannot skip missing entries, so every unrated user-movie pair is fitted as if it were an actual rating of 0, which pulls the reconstructed ratings toward zero. Movie rating datasets are also sparse, meaning most users have rated only a few movies, so these filler zeros far outnumber the real ratings and leave little signal from which to learn user and item features. Additionally, NMF may not handle new users or items well, particularly if they were not well represented in the training data, which leads to a cold start problem (here, a situation where we have no data on a user's movie preferences, so there is nothing to compare against when making recommendations). The performance of NMF is also sensitive to hyperparameter selection, and the chosen number of components (20) might not be optimal for this dataset. Furthermore, the model might overfit the training data, capturing noise rather than underlying patterns and generalizing poorly to the test data. Unlike simpler models, NMF's added complexity does not necessarily translate to better performance.
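The effect of zero-filling can be checked on synthetic data. The sketch below is an illustration under assumptions: the ratings are random integers from 1 to 5 (not the course data), about 10% of cells are kept as training ratings, and the global-mean baseline is a simple stand-in rather than the Module 3 baseline.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)

# Synthetic ratings 1-5 for 60 users x 40 movies (not the course data).
full = rng.integers(1, 6, size=(60, 40)).astype(float)

# Keep about 10% of the cells as training ratings; zero-fill the rest.
observed = rng.random(full.shape) < 0.10
train_matrix = np.where(observed, full, 0.0)

model = NMF(n_components=5, init="random", random_state=42, max_iter=500)
recon = model.fit_transform(train_matrix) @ model.components_

# Score only the cells that were hidden from training.
held_out = ~observed
nmf_rmse = float(np.sqrt(np.mean((full[held_out] - recon[held_out]) ** 2)))

# Baseline: predict the mean of the observed training ratings everywhere.
global_mean = full[observed].mean()
baseline_rmse = float(np.sqrt(np.mean((full[held_out] - global_mean) ** 2)))

print(f"mean NMF prediction on held-out cells: {recon[held_out].mean():.2f}")
print(f"mean observed rating: {global_mean:.2f}")
print(f"NMF RMSE: {nmf_rmse:.2f}, global-mean RMSE: {baseline_rmse:.2f}")
```

Because the held-out cells are zeros in the training matrix, the model is trained to reproduce those zeros, so its predictions there sit far below the real ratings.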

In contrast, similarity-based methods (e.g., user-user or item-item collaborative filtering) are often more effective on sparse datasets because they predict a rating from the ratings that similar users or items actually have, which keeps predictions on the rating scale and can be more robust with limited data points.
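For comparison, a minimal item-item sketch is shown below. It is one common formulation (cosine similarity between movie columns, then a similarity-weighted average of the user's own ratings) and is not necessarily the exact method from Module 3. The matrix `R` and the function name `predict_item_item` are hypothetical.

```python
import numpy as np

# Hypothetical 4-user x 5-movie matrix; 0 marks an unrated pair.
R = np.array([
    [5, 4, 0, 1, 0],
    [4, 0, 0, 1, 1],
    [1, 1, 0, 5, 4],
    [0, 1, 5, 4, 0],
], dtype=float)

def predict_item_item(R, user, item):
    """Similarity-weighted average of the ratings this user has already given."""
    rated = R[user] > 0                       # movies the user has rated
    norms = np.linalg.norm(R, axis=0)         # one norm per movie column
    sims = (R.T @ R[:, item]) / (norms * norms[item] + 1e-12)
    weights = sims[rated]
    if weights.sum() == 0:
        return float(R[R > 0].mean())         # fall back to the global mean
    return float(weights @ R[user, rated] / weights.sum())

# User 0 has not rated movie 2.
pred = predict_item_item(R, user=0, item=2)
print(f"predicted rating: {pred:.2f}")
```

Since the prediction is a weighted average over rated movies only, it always falls between the user's lowest and highest rating, whereas the zero-filled NMF reconstruction can fall far below the rating scale.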


To improve the NMF model's performance or achieve better results overall, we could leverage several strategies. Because zero-filling treats unrated movies as ratings of 0, a first step would be to fill missing entries with a user or movie mean (or to use a factorization method that fits only the observed ratings) and to clip predictions to the valid rating range. Experimenting with different numbers of components and other hyperparameters through cross-validation could help identify better settings. Combining NMF with other approaches, such as collaborative or content-based filtering, could leverage the strengths of different methods for improved accuracy (an ensemble technique). Introducing regularization can prevent overfitting and help the model generalize better to unseen data. Incorporating additional features like user demographics, movie genres, or temporal information could provide more context and improve predictions. Increasing the training data by including more ratings or using synthetic data generation techniques could also enhance model performance. Finally, exploring advanced models like neural collaborative filtering (NCF) or autoencoders (a personal favorite), and combining their predictions in an ensemble, could lead to more robust prediction performance.
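One of these fixes, replacing the zero fill with a per-user mean and clipping predictions, can be sketched on synthetic data. This is an illustration under assumptions: random ratings from 1 to 5 stand in for the course data, the rating range [1, 5] is assumed, and `heldout_rmse` is a hypothetical helper.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)

# Synthetic ratings 1-5 (not the course data); about 10% are kept for training.
full = rng.integers(1, 6, size=(60, 40)).astype(float)
observed = rng.random(full.shape) < 0.10
held_out = ~observed

def heldout_rmse(filled):
    """Fit NMF on a filled matrix and score the cells hidden from training."""
    model = NMF(n_components=5, init="random", random_state=42, max_iter=500)
    recon = model.fit_transform(filled) @ model.components_
    recon = np.clip(recon, 1.0, 5.0)  # assumed valid rating range
    return float(np.sqrt(np.mean((full[held_out] - recon[held_out]) ** 2)))

# Option A: zero fill, as in the original script.
zero_filled = np.where(observed, full, 0.0)

# Option B: fill each user's missing cells with that user's mean rating,
# falling back to the global mean for users with no observed ratings.
global_mean = full[observed].mean()
counts = observed.sum(axis=1)
sums = zero_filled.sum(axis=1)
user_means = np.where(counts > 0, sums / np.maximum(counts, 1), global_mean)
mean_filled = np.where(observed, full, user_means[:, None])

rmse_zero = heldout_rmse(zero_filled)
rmse_mean = heldout_rmse(mean_filled)
print(f"zero fill: {rmse_zero:.2f}, user-mean fill: {rmse_mean:.2f}")
```

The same `heldout_rmse` helper could be called in a loop over several values of `n_components` to compare settings on held-out data.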



In [2]:
import pandas as pd
import numpy as np
from sklearn.decomposition import NMF
from sklearn.metrics import mean_squared_error
from math import sqrt

# Load data
MV_users = pd.read_csv('/home/bbehe/Desktop/Coursera/bbc_data/learn-ai-bbc/Files/users.csv')
MV_movies = pd.read_csv('/home/bbehe/Desktop/Coursera/bbc_data/learn-ai-bbc/Files/movies.csv')
train = pd.read_csv('/home/bbehe/Desktop/Coursera/bbc_data/learn-ai-bbc/Files/train.csv')
test = pd.read_csv('/home/bbehe/Desktop/Coursera/bbc_data/learn-ai-bbc/Files/test.csv')

# Column names used for pivoting and lookups
user_col = 'uID'
movie_col = 'mID'
rating_col = 'rating'

# Pivot the training data into a user-item matrix (unrated pairs become 0)
train_pivot = train.pivot(index=user_col, columns=movie_col, values=rating_col).fillna(0)

# Initialize the NMF model
n_components = 20
nmf_model = NMF(n_components=n_components, init='random', random_state=42)

# Fit the NMF model
user_features = nmf_model.fit_transform(train_pivot)
item_features = nmf_model.components_

# Predict ratings for every user-movie pair
predicted_ratings = np.dot(user_features, item_features)

# Convert the predictions to a DataFrame
predicted_ratings_df = pd.DataFrame(predicted_ratings, index=train_pivot.index, columns=train_pivot.columns)

# Function to calculate RMSE
def calculate_rmse(true_ratings, predicted_ratings):
    return sqrt(mean_squared_error(true_ratings, predicted_ratings))

# Look up the predicted rating for a test row; return NaN if the user or movie
# does not appear in the training matrix
def get_predicted_rating(row):
    try:
        return predicted_ratings_df.loc[row[user_col], row[movie_col]]
    except KeyError:
        return np.nan

test['predicted_rating'] = test.apply(get_predicted_rating, axis=1)

# Drop rows with missing predicted ratings
test = test.dropna(subset=['predicted_rating'])

# Calculate RMSE on the remaining test rows
rmse = calculate_rmse(test[rating_col], test['predicted_rating'])
print(f"RMSE: {rmse}")


RMSE: 2.853698227091948
