### Note 

This is the second part of the assignment for Module 4 of Unsupervised Algorithms in Machine Learning.

More information can be found at: https://github.com/minhleathvn/machine-learnin-theory-and-hands-on-practice-with-pythong-cu

# Movie Recommendation System using NMF

This notebook builds a movie recommendation system using Non-negative Matrix Factorization (NMF). The process involves loading user, movie, training, and testing data, mapping user and movie IDs, creating a rating matrix, training the NMF model, and evaluating the model's performance using Root Mean Squared Error (RMSE).

## Data Loading

This section loads the `users.csv`, `movies.csv`, `train.csv`, and `test.csv` files into pandas DataFrames.

In [12]:
import pandas as pd
from IPython.display import display

In [13]:
# Load data
users = pd.read_csv('users.csv')
movies = pd.read_csv('movies.csv')
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

# Preview dataframes
dfs = {"users": users, "movies": movies, "train_data": train, "test_data": test}

for key, df in dfs.items():
    print(f"Sample of {key}")
    display(df.head())
    print("\n")

Sample of users


Unnamed: 0,uID,gender,age,accupation,zip
0,1,F,1,10,48067
1,2,M,56,16,70072
2,3,M,25,15,55117
3,4,M,45,7,2460
4,5,M,25,20,55455




Sample of movies


Unnamed: 0,mID,title,year,Doc,Com,Hor,Adv,Wes,Dra,Ani,...,Chi,Cri,Thr,Sci,Mys,Rom,Fil,Fan,Act,Mus
0,1,Toy Story,1995,0,1,0,0,0,0,1,...,1,0,0,0,0,0,0,0,0,0
1,2,Jumanji,1995,0,0,0,1,0,0,0,...,1,0,0,0,0,0,0,1,0,0
2,3,Grumpier Old Men,1995,0,1,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
3,4,Waiting to Exhale,1995,0,1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
4,5,Father of the Bride Part II,1995,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0




Sample of train_data


Unnamed: 0,uID,mID,rating
0,744,1210,5
1,3040,1584,4
2,1451,1293,5
3,5455,3176,2
4,2507,3074,5




Sample of test_data


Unnamed: 0,uID,mID,rating
0,2233,440,4
1,4274,587,5
2,2498,454,3
3,2868,2336,5
4,1636,2686,5






## Data Preprocessing and Mapping

Here, we identify unique users and movies from the training data and create a mapping from their original IDs to a contiguous index range. This mapping is then applied to both the training and testing datasets. Rows in the test set corresponding to users or movies not present in the training set are removed to ensure consistency. Finally, a rating matrix is created using the training data, where rows represent users, columns represent movies, and the values are the ratings. Missing ratings are filled with zeros.

In [14]:
import numpy as np

# Find the unique user and movie IDs in the training data
unique_user_ids = train['uID'].unique()
unique_movie_ids = train['mID'].unique()

# Map original user and movie IDs to contiguous 0-based indices
user_id_to_index = {uid: i for i, uid in enumerate(unique_user_ids)}
movie_id_to_index = {mid: i for i, mid in enumerate(unique_movie_ids)}

# Apply the mapping to both datasets; IDs not seen in training become NaN in the test set
train['uID_mapped'] = train['uID'].map(user_id_to_index)
train['mID_mapped'] = train['mID'].map(movie_id_to_index)
test['uID_mapped'] = test['uID'].map(user_id_to_index)
test['mID_mapped'] = test['mID'].map(movie_id_to_index)

# Drop rows in test set where uID or mID are NaN (users/movies not in training set)
test.dropna(subset=['uID_mapped', 'mID_mapped'], inplace=True)

# Convert mapped uID and mID in test set to integers
test['uID_mapped'] = test['uID_mapped'].astype(int)
test['mID_mapped'] = test['mID_mapped'].astype(int)


# Create rating matrix using pivot
rating_matrix = train.pivot(index='uID_mapped', columns='mID_mapped', values='rating').fillna(0)

display(rating_matrix.iloc[:10, :10])

uID_mapped / mID_mapped,0,1,2,3,4,5,6,7,8,9
0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0
1,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,4.0,0.0,5.0,4.0,0.0,0.0,0.0,0.0,3.0,0.0
3,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0
4,3.0,3.0,0.0,3.0,5.0,0.0,0.0,4.0,4.0,0.0
5,5.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,1.0
6,0.0,3.0,0.0,0.0,0.0,4.0,0.0,4.0,5.0,0.0
7,5.0,4.0,0.0,4.0,0.0,0.0,2.0,5.0,0.0,4.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0
9,0.0,2.0,0.0,0.0,5.0,0.0,0.0,3.0,3.0,0.0
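
The mapping and pivot steps can be checked on a tiny example. The sketch below uses made-up frames (`toy_train`, `toy_test`) and hypothetical column names `u` and `m`; it is only meant to show how IDs become contiguous indices and how test rows with unseen IDs drop out.

```python
import pandas as pd

# Toy stand-ins for the notebook's train/test frames (values are made up).
toy_train = pd.DataFrame({
    "uID": [10, 10, 42, 7],
    "mID": [100, 300, 100, 200],
    "rating": [5, 3, 4, 2],
})
toy_test = pd.DataFrame({"uID": [42, 99], "mID": [300, 100], "rating": [4, 1]})

# Map original IDs to 0..n-1 in order of first appearance in the training data.
user_index = {uid: i for i, uid in enumerate(toy_train["uID"].unique())}
movie_index = {mid: i for i, mid in enumerate(toy_train["mID"].unique())}

toy_train["u"] = toy_train["uID"].map(user_index)
toy_train["m"] = toy_train["mID"].map(movie_index)

# Series.map leaves NaN for IDs missing from the dict (user 99 never appears in training).
toy_test["u"] = toy_test["uID"].map(user_index)
toy_test["m"] = toy_test["mID"].map(movie_index)
toy_test = toy_test.dropna(subset=["u", "m"]).astype({"u": int, "m": int})

# Users as rows, movies as columns, missing ratings filled with 0.
toy_matrix = toy_train.pivot(index="u", columns="m", values="rating").fillna(0)
print(toy_matrix)
print(toy_test)
```

Only the test row for user 42 survives, because user 99 has no index in the training mapping.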


## Training the NMF Model

This section uses the Non-negative Matrix Factorization (`NMF`) class from the `sklearn.decomposition` module to decompose the rating matrix into two lower-rank, non-negative matrices: user factors and movie factors. These factors represent latent features that capture user preferences and movie characteristics, and their product approximates the original rating matrix. The number of latent factors (`n_components`) is a hyperparameter that can be tuned.

In [15]:
# Train NMF (non-negative matrix factorization)
from sklearn.decomposition import NMF

# Define the number of latent factors
n_components = 20  # This is a hyperparameter that can be tuned

# Initialize NMF model
model = NMF(n_components=n_components, init='random', random_state=0, max_iter=1000)

# Fit the model to the rating matrix.
# scikit-learn's NMF expects an array of shape (n_samples, n_features); here the
# rows of rating_matrix are users and the columns are movies, so no transpose is needed.
user_factors = model.fit_transform(rating_matrix)
movie_factors = model.components_

print("Shape of user factors matrix:", user_factors.shape)
print("Shape of movie factors matrix:", movie_factors.shape)

Shape of user factors matrix: (6040, 20)
Shape of movie factors matrix: (20, 3664)
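
The shapes above follow directly from the factorization. As a minimal sketch on a made-up 4 x 5 matrix (the names `R`, `W`, `H`, and `toy_model` are illustrative and not part of the notebook), the same calls produce a user-factor matrix with one row per user and a movie-factor matrix with one column per movie:

```python
import numpy as np
from sklearn.decomposition import NMF

# Made-up 4 users x 5 movies matrix; 0 marks "no rating", as in the notebook.
R = np.array([
    [5, 4, 0, 1, 0],
    [4, 5, 0, 0, 1],
    [0, 1, 5, 4, 0],
    [1, 0, 4, 5, 0],
], dtype=float)

toy_model = NMF(n_components=2, init="random", random_state=0, max_iter=1000)
W = toy_model.fit_transform(R)   # user factors, shape (4, 2)
H = toy_model.components_        # movie factors, shape (2, 5)
R_hat = W @ H                    # rank-2 reconstruction, shape (4, 5)

print("W shape:", W.shape, "H shape:", H.shape)
print("All factors non-negative:", bool((W >= 0).all() and (H >= 0).all()))
print(np.round(R_hat, 2))
```

Every entry of `R_hat` is a prediction, including the cells that were 0 in `R`, which is what makes the reconstruction usable for recommendations.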


In [16]:
# Calculate the predicted ratings
pred_rating_matrix = np.dot(user_factors, movie_factors)
pd.DataFrame(pred_rating_matrix).iloc[:10, :10]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,4.961131,2.257379,0.175019,0.299125,0.339634,0.078639,0.205387,0.539468,0.419132,0.317446
1,0.812536,0.678687,0.020554,0.106916,0.000134,0.0,0.016212,0.148794,0.023751,0.016762
2,3.809488,3.285759,2.717312,2.102809,1.25025,0.866808,0.19291,1.909298,1.158725,0.406189
3,2.780142,0.850619,0.278176,1.556893,0.001678,0.378317,0.058002,0.199906,1.199094,0.036553
4,2.374397,2.184107,2.209716,2.499339,0.662466,1.361286,0.15955,2.498196,1.667162,0.445662
5,2.617965,1.864294,2.94665,0.927399,1.431078,0.34524,0.135911,0.383244,0.217437,0.186471
6,1.073212,2.158384,0.346365,2.082899,0.041962,0.378928,0.569372,1.481489,1.379609,1.285579
7,1.828926,0.764179,1.520939,1.387096,0.253823,0.729455,0.265085,1.708199,1.479696,0.876888
8,3.037044,0.974179,0.329381,1.339951,0.208775,0.103213,0.381997,1.874477,1.460102,0.157476
9,1.177941,1.834533,1.07633,1.999263,0.520708,0.396758,0.474961,1.828662,1.801428,0.852055


## Predicting Ratings and Evaluating the Model

The predicted rating matrix is the dot product of the user factors and movie factors matrices, computed above as `pred_rating_matrix`. To evaluate the recommendation system, we look up the predicted rating for each (user, movie) pair in the test set and compute the Root Mean Squared Error (RMSE) between the actual and predicted ratings.

In [17]:
from sklearn.metrics import mean_squared_error
import numpy as np

y_true = test['rating'].values

# Look up the predicted rating for every (user, movie) pair in the test set.
# The mapped IDs are already integers due to the astype(int) conversion, so they
# can be used directly as row and column indices into the prediction matrix.
y_pred = pred_rating_matrix[test['uID_mapped'].values, test['mID_mapped'].values]

# Calculate the MSE and then take the square root for RMSE
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
print(f"RMSE: {round(rmse,4)}")

RMSE: 2.8614
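
To make the metric concrete, here is a small sketch that computes RMSE by hand and compares a set of predictions against a constant baseline. All numbers are made up for illustration, and `rmse` is a hypothetical helper, not a function used elsewhere in this notebook.

```python
import numpy as np

# Made-up ratings and predictions, only to illustrate the metric.
y_true_toy = np.array([4.0, 5.0, 3.0, 5.0, 2.0])
y_pred_toy = np.array([3.5, 4.0, 3.0, 4.5, 3.0])

def rmse(actual, predicted):
    """Root mean squared error: sqrt(mean((actual - predicted)^2))."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Baseline: predict one constant, here the mean of some made-up training ratings.
train_ratings_toy = np.array([5.0, 4.0, 5.0, 2.0, 5.0, 3.0])
baseline_pred = np.full_like(y_true_toy, train_ratings_toy.mean())

model_rmse = rmse(y_true_toy, y_pred_toy)
baseline_rmse = rmse(y_true_toy, baseline_pred)
print("model RMSE:   ", round(model_rmse, 4))
print("baseline RMSE:", round(baseline_rmse, 4))
```

Comparing the model against a constant-prediction baseline on the real test set is a cheap way to tell whether a given RMSE is good or bad for the data at hand.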


## Discussion

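An RMSE of 2.8614 is high for ratings that range from 1 to 5 in the samples shown above: the typical error exceeds half the width of the scale. A likely contributor is the zero-filling step. NMF treats every unrated movie as a rating of 0, so the reconstruction is pulled toward zero, and many entries in the predicted matrix above are below 1. Other possible factors are data sparsity, the choice of latent factors (`n_components`), and not accounting for user and movie biases.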
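
One way to probe how the treatment of missing ratings affects the error is to compare two fill strategies on synthetic data. The sketch below is an illustration under stated assumptions, not a result from this notebook: the ratings are random integers from 1 to 5, half of the cells are hidden, and `heldout_rmse` is a hypothetical helper. Filling with the global mean is just one possible alternative to filling with zeros.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)

# Synthetic 1-5 ratings for 30 users x 20 movies (made up, not the notebook's data).
true_ratings = rng.integers(1, 6, size=(30, 20)).astype(float)
observed = rng.random(true_ratings.shape) < 0.5   # True where a rating is "known"

def heldout_rmse(filled):
    """Fit NMF on a filled matrix and score only the hidden (unobserved) cells."""
    nmf = NMF(n_components=5, init="nndsvda", random_state=0, max_iter=1000)
    W = nmf.fit_transform(filled)
    pred = W @ nmf.components_
    err = true_ratings[~observed] - pred[~observed]
    return float(np.sqrt(np.mean(err ** 2)))

# Strategy 1 (as in the notebook): unknown cells become 0.
zero_filled = np.where(observed, true_ratings, 0.0)

# Strategy 2 (an assumption, one of several options): unknown cells become
# the mean of the observed ratings.
global_mean = true_ratings[observed].mean()
mean_filled = np.where(observed, true_ratings, global_mean)

rmse_zero = heldout_rmse(zero_filled)
rmse_mean = heldout_rmse(mean_filled)
print(f"zero-filled RMSE: {rmse_zero:.3f}")
print(f"mean-filled RMSE: {rmse_mean:.3f}")
```

With zero-filling, the model is fitted to reproduce zeros in exactly the cells that are later scored against ratings of 1 to 5, so a large held-out error is expected regardless of how well the factors capture taste.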

To improve the model, consider:

- Hyperparameter Tuning: Experiment with n_components and initialization methods.
- Regularization: Add regularization to prevent overfitting.
- Bias Terms: Include user and movie biases in the model.
- Different Techniques: Explore other matrix factorization methods or collaborative filtering algorithms.
- Additional Data: Incorporate movie genres, user demographics, etc.

These steps can help improve the accuracy of the recommendation system.
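
As a starting point for the hyperparameter tuning suggested above, the following sketch scores several values of `n_components` on a held-out validation split. It runs on synthetic data, and the names `validation_rmse`, `is_val`, and `fit_matrix` are hypothetical; the candidate values and the mean-fill choice are assumptions that would need revisiting on the real rating matrix.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(1)

# Synthetic ratings with some low-rank structure (made up): 40 users x 25 movies, values 1..5.
U = rng.random((40, 3))
V = rng.random((3, 25))
ratings = np.clip(np.rint(1 + 4 * (U @ V) / (U @ V).max()), 1, 5)

# Hide about 20% of the cells as a validation set; fill them with the mean of the rest.
is_val = rng.random(ratings.shape) < 0.2
fit_matrix = np.where(is_val, ratings[~is_val].mean(), ratings)

def validation_rmse(n_components):
    """Validation RMSE for one candidate value of n_components."""
    nmf = NMF(n_components=n_components, init="nndsvda", random_state=0, max_iter=1000)
    pred = nmf.fit_transform(fit_matrix) @ nmf.components_
    return float(np.sqrt(np.mean((ratings[is_val] - pred[is_val]) ** 2)))

# Try a few candidate ranks and keep the one with the lowest validation error.
scores = {k: validation_rmse(k) for k in (2, 5, 10, 20)}
best_k = min(scores, key=scores.get)
for k, score in scores.items():
    print(f"n_components={k:>2}: validation RMSE = {score:.3f}")
print("best n_components on this toy data:", best_k)
```

On the real data, the validation cells should be carved out of `train` so that `test` stays untouched until the final evaluation.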