# Movie Ratings NMF

**Limitation(s) of sklearn’s non-negative matrix factorization library. [20 pts]**

1. Load the movie ratings data (as in the HW3-recommender-system) and use matrix factorization technique(s) to predict the missing ratings in the test data. Measure the RMSE. You should use the sklearn library. [10 pts]

2. Discuss the results and why sklearn's non-negative matrix factorization library did not work well compared to the simple baseline or similarity-based methods we’ve done in Module 3. Can you suggest a way (or ways) to fix it? [10 pts]

In [1]:
# load packages
import pandas as pd
import numpy as np
from sklearn.decomposition import NMF
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

In [2]:
# load data
path = '../Recommender_System_Assignment/'
MV_users = pd.read_csv(path + 'users.csv')
MV_movies = pd.read_csv(path + 'movies.csv')
train = pd.read_csv(path + 'train.csv')
test = pd.read_csv(path + 'test.csv')

In [3]:
# clean movie and user data
# drop the ID columns, since each ID is just the row index + 1

# change gender to 1/0 for M/F in user data
MV_users['gender'] = (MV_users['gender'] == 'M').astype(int)
# remove user ID column
MV_users.drop(columns = ['uID'], inplace = True)
# get zip code as a number
MV_users['zip'] = MV_users['zip'].apply(lambda x: x.split('-')[0])
MV_users['zip'] = MV_users['zip'].astype(int)

# drop movie ID, title, and year columns from movie data
# year excluded because all movies are from a 5-year time period, so no real change
MV_movies.drop(columns = ['mID', 'title', 'year'], inplace = True)

# get both sets into np matrix form
users = MV_users.to_numpy()
movies = MV_movies.to_numpy()

In the BBC News project, NMF on its own could not classify data that it had not been trained on. For this project, we use NMF only to reduce dimensionality and then fit a logistic regression model to predict the ratings.
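
The two-stage idea can be sketched on toy data. This is a minimal illustration, not the notebook's actual pipeline: `X_toy`, `y_toy`, and `toy_model` are hypothetical names, and the random features stand in for the real user and movie attributes.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 6 samples with 4 non-negative features each.
rng = np.random.default_rng(0)
X_toy = rng.random((6, 4))
y_toy = np.array([0, 1, 0, 1, 0, 1])

# Stage 1: NMF compresses the 4 features into 2 components.
# Stage 2: logistic regression is trained on those 2 components.
toy_model = make_pipeline(
    NMF(n_components=2, init='random', random_state=2021, max_iter=1000),
    LogisticRegression(max_iter=1000),
)
toy_model.fit(X_toy, y_toy)

# Predict labels for the rows the pipeline was fitted on.
toy_preds = toy_model.predict(X_toy)
print(toy_preds.shape)
```

NMF requires non-negative input, which is why the toy features are drawn from the interval [0, 1).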

In [5]:
# organize data into usable formats for NMF
# X is the combination of user data and movie data
# first run NMF separately on the user data and on the movie data

# initialize model
nmf_model = NMF(n_components = 1, init = 'random', random_state = 2021)

# transform user data
users_reduced = nmf_model.fit_transform(users)
users_reduced = users_reduced[:, 0].tolist()

# transform movie data
movies_reduced = nmf_model.fit_transform(movies)
movies_reduced = movies_reduced[:, 0].tolist()

# make copies of original datasets for replacing ids with transformed information
train_df = train.copy()
test_df = test.copy()

# remove rating column
train_df.drop(columns = 'rating', inplace = True)
test_df.drop(columns = 'rating', inplace = True)

# replace user IDs with transformed data in training and testing data
# IDs are the row index + 1, so build a lookup from ID to transformed value
user_map = {ii + 1: val for ii, val in enumerate(users_reduced)}
# map() swaps every ID in a single pass, so a transformed value that happens
# to equal a later ID cannot be replaced a second time
# (any ID missing from the lookup would become NaN)
train_df['uID'] = train_df['uID'].map(user_map)
test_df['uID'] = test_df['uID'].map(user_map)

# now repeat the previous procedure but for the movie data
movie_map = {ii + 1: val for ii, val in enumerate(movies_reduced)}
train_df['mID'] = train_df['mID'].map(movie_map)
test_df['mID'] = test_df['mID'].map(movie_map)

X_train = train_df.to_numpy()
X_test = test_df.to_numpy()

In [6]:
# save y data for logistic regression later
# y is ratings
y_train = train['rating'].tolist()
y_test = test['rating'].tolist()

## Implement Logistic Regression Model

In [7]:
# create model
lr_model = LogisticRegression(max_iter = 100000)
lr_model.fit(X_train, y_train)

LogisticRegression(max_iter=100000)


In [8]:
# evaluate model on training data
train_preds = lr_model.predict(X_train)

acc = accuracy_score(y_train, train_preds)
cm = confusion_matrix(y_train, train_preds)

print(acc)
print(cm)

0.3489972091535194
[[     0      0      0  39414     22]
 [     0      0      0  75150     24]
 [     0      0      0 182716     86]
 [     0      0      0 244058    167]
 [     0      0      0 158218    291]]
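
To put the training accuracy in context, it helps to compare it with always predicting the most common rating. The sketch below recomputes both numbers from the confusion matrix printed above; the variable names are illustrative and not part of the original notebook.

```python
import numpy as np

# Training confusion matrix copied from the output above
# (rows: true ratings 1-5, columns: predicted ratings 1-5).
cm_train = np.array([
    [0, 0, 0,  39414,  22],
    [0, 0, 0,  75150,  24],
    [0, 0, 0, 182716,  86],
    [0, 0, 0, 244058, 167],
    [0, 0, 0, 158218, 291],
])

n_total = cm_train.sum()
# Share of each true rating = row sums / total.
class_share = cm_train.sum(axis=1) / n_total
# Accuracy of always predicting the most common rating.
majority_acc = class_share.max()
# Accuracy of the fitted model = diagonal / total.
model_acc = np.trace(cm_train) / n_total

print(f"majority-class accuracy: {majority_acc:.4f}")
print(f"model accuracy:          {model_acc:.4f}")
```

The two values are nearly identical (roughly 0.3488 versus 0.3490), which indicates that the model gains almost nothing over predicting a rating of 4 for everyone.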


In [9]:
# evaluate model on testing data
test_preds = lr_model.predict(X_test)

acc = accuracy_score(y_test, test_preds)
cm = confusion_matrix(y_test, test_preds)

print(acc)
print(cm)

0.34927331926962
[[     0      0      0  16735      3]
 [     0      0      0  32374      9]
 [     0      0      0  78352     43]
 [     0      0      0 104671     75]
 [     0      0      0  67668    133]]
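
The assignment asks for RMSE, which the cells above do not report. Because the confusion matrix counts every (true, predicted) pair, the RMSE can be recovered from it directly. The following is a sketch that assumes the rows and columns are ordered as ratings 1 through 5, which is how `confusion_matrix` sorts labels by default.

```python
import numpy as np

# Test confusion matrix copied from the output above
# (rows: true ratings 1-5, columns: predicted ratings 1-5).
cm_test = np.array([
    [0, 0, 0,  16735,   3],
    [0, 0, 0,  32374,   9],
    [0, 0, 0,  78352,  43],
    [0, 0, 0, 104671,  75],
    [0, 0, 0,  67668, 133],
])

ratings = np.arange(1, 6)
# sq_err[i, j] = (true rating i+1 minus predicted rating j+1) squared.
sq_err = (ratings[:, None] - ratings[None, :]) ** 2

# Weight each squared error by how often that pair occurs, then average.
rmse = np.sqrt((cm_test * sq_err).sum() / cm_test.sum())
print(f"test RMSE: {rmse:.3f}")
```

This gives a test RMSE of roughly 1.19. With the prediction arrays still in memory, the same value should come from `np.sqrt(np.mean((np.array(y_test) - test_preds) ** 2))`.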


## Conclusions

These results suggest that NMF is better suited to dimensionality reduction than to classification. The logistic regression model predicts a rating of 4 for almost every example, so its accuracy (about 0.349 on both the training and test sets) is close to the share of 4-star ratings in the data. This tendency to collapse toward a central rating is not surprising given what we found in the Recommender Systems Project. One possible fix is to normalize the rating data while still treating it as discrete rather than continuous.
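
Another possible fix, not tried in this notebook, is to factorize the user-by-movie ratings matrix itself. sklearn's `NMF` has no way to mark entries as missing, so unrated cells must be filled with something before fitting. The sketch below assumes filling each unrated cell with that user's mean rating; the toy data and all variable names are hypothetical, and this is one option among several rather than a definitive remedy.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import NMF

# Hypothetical toy ratings in a long (uID, mID, rating) layout.
toy_train = pd.DataFrame({
    'uID':    [1, 1, 1, 2, 2, 3, 3, 3, 4, 4],
    'mID':    [1, 2, 3, 1, 3, 2, 3, 4, 1, 4],
    'rating': [5, 4, 1, 4, 2, 5, 1, 2, 5, 3],
})
toy_test = pd.DataFrame({'uID': [2, 4], 'mID': [2, 2], 'rating': [4, 4]})

# Wide user-by-movie matrix; unrated cells are NaN.
R = toy_train.pivot(index='uID', columns='mID', values='rating')

# Assumption: fill each unrated cell with that user's mean rating,
# so that the factorization does not treat "unrated" as a rating of 0.
R_filled = R.apply(lambda row: row.fillna(row.mean()), axis=1)

nmf = NMF(n_components=2, init='nndsvda', random_state=2021, max_iter=2000)
W = nmf.fit_transform(R_filled.to_numpy())
H = nmf.components_

# Reconstructed ratings, clipped to the valid 1-5 range.
R_hat = pd.DataFrame(np.clip(W @ H, 1, 5), index=R.index, columns=R.columns)

# Look up the predicted rating for each (uID, mID) pair in the test set.
toy_pred = np.array([R_hat.loc[u, m]
                     for u, m in zip(toy_test['uID'], toy_test['mID'])])
toy_rmse = np.sqrt(np.mean((toy_test['rating'].to_numpy() - toy_pred) ** 2))
print(f"toy RMSE: {toy_rmse:.3f}")
```

On the real data, the number of components and the fill strategy would need tuning, and the resulting RMSE should be compared against the baseline and similarity-based methods from Module 3.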