# Non-Negative Matrix Factorization in Rating Movies

NMF will be applied to an additional dataset in order to determine its suitability. The data used will be movie rating data from:

"MovieLens 1M Dataset." Kaggle, https://www.kaggle.com/datasets/odedgolden/movielens-1m-dataset.

The data consists of four different datasets which can be observed below. 

In [2]:
import pandas as pd
import numpy as np
from sklearn.decomposition import NMF

In [3]:
dfm_train = pd.read_csv('train.csv')
dfm_test = pd.read_csv('test.csv')
dfm_users = pd.read_csv('users.csv')
dfm_movies = pd.read_csv('movies.csv')
dfm_movies = dfm_movies.drop(columns=['title', 'year'])

display(dfm_movies.head())
display(dfm_users.head())
display(dfm_train.head())
display(dfm_test.head())

Unnamed: 0,mID,Doc,Com,Hor,Adv,Wes,Dra,Ani,War,Chi,Cri,Thr,Sci,Mys,Rom,Fil,Fan,Act,Mus
0,1,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0
1,2,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0
2,3,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
3,4,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
4,5,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


Unnamed: 0,uID,gender,age,accupation,zip
0,1,F,1,10,48067
1,2,M,56,16,70072
2,3,M,25,15,55117
3,4,M,45,7,2460
4,5,M,25,20,55455


Unnamed: 0,uID,mID,rating
0,744,1210,5
1,3040,1584,4
2,1451,1293,5
3,5455,3176,2
4,2507,3074,5


Unnamed: 0,uID,mID,rating
0,2233,440,4
1,4274,587,5
2,2498,454,3
3,2868,2336,5
4,1636,2686,5


The first step will be to process these datasets into one for testing and one for training. They will be joined to the movie and user databases on their respective IDs in order to bring all the movie and user attributes together and linked to their ratings. Next, the model will be fit, and the train dataset will be fed into it to predict ratings. 

In [4]:
# Combine the movie and user datasets on the training data. 
dfm_combo = pd.merge(dfm_train, dfm_movies, on='mID', how='left')
dfm_combo = pd.merge(dfm_combo, dfm_users, on='uID', how='left')

# Drop the unnecessary IDs as well as zip code.
dfm_combo = dfm_combo.drop(['mID', 'uID', 'zip'], axis=1)

# Create dummies to normalize and differentiate categories.
dfm_combo = pd.get_dummies(dfm_combo, columns=['gender', 'age', 'accupation'])

# Move the ratings column to the end.
new_cols = [col for col in dfm_combo.columns if col != 'rating'] + ['rating']
dfm_combo = dfm_combo[new_cols]

dfm_combo.head()

Unnamed: 0,Doc,Com,Hor,Adv,Wes,Dra,Ani,War,Chi,Cri,...,accupation_12,accupation_13,accupation_14,accupation_15,accupation_16,accupation_17,accupation_18,accupation_19,accupation_20,rating
0,0,0,0,1,0,0,0,1,0,0,...,0,0,0,0,0,1,0,0,0,5
1,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,4
2,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,1,5
3,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,1,0,0,0,2
4,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,5


In [5]:
# Fitting the NMF model on everything but the rating column.
nmf_model_m = NMF(n_components=5, init='random', random_state=10)
WM = nmf_model_m.fit_transform(dfm_combo.iloc[:, :-1])
HM = nmf_model_m.components_

print(WM.shape, HM.shape)

(700146, 5) (5, 48)


In [6]:
# Combine the movie and user datasets on the testing data. 
dfm_combo_T = pd.merge(dfm_train, dfm_movies, on='mID', how='left')
dfm_combo_T = pd.merge(dfm_combo_T, dfm_users, on='uID', how='left')

# Drop the unnecessary IDs as well as zip code.
dfm_combo_T = dfm_combo_T.drop(['mID', 'uID', 'zip'], axis=1)

# Create dummies to normalize and differentiate categories.
dfm_combo_T = pd.get_dummies(dfm_combo_T, columns=['gender', 'age', 'accupation'])

# Move the ratings column to the end.
new_cols_T = [col for col in dfm_combo_T.columns if col != 'rating'] + ['rating']
dfm_combo_T = dfm_combo_T[new_cols_T]

dfm_combo_T.head()

Unnamed: 0,Doc,Com,Hor,Adv,Wes,Dra,Ani,War,Chi,Cri,...,accupation_12,accupation_13,accupation_14,accupation_15,accupation_16,accupation_17,accupation_18,accupation_19,accupation_20,rating
0,0,0,0,1,0,0,0,1,0,0,...,0,0,0,0,0,1,0,0,0,5
1,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,4
2,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,1,5
3,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,1,0,0,0,2
4,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,5


In [7]:
# Setting up the index to match with the predictions.
#idx_m = dfm_combo_T.iloc[:, :-1].index

# Choosing the best category to use as the prediction.
dfm_calc = dfm_combo_T.iloc[:, :-1]
calcs_m = np.dot(HM, np.transpose(dfm_calc))
best_m = calcs_m.argmax(axis=0)

# Creating the dataframe of predictions. 
#data_m = {'idx': idx_m, 'rating': best_m}
data_m = {'rating': best_m}
dfm_out = pd.DataFrame(data_m)

dfm_out.head()

Unnamed: 0,rating
0,2
1,1
2,4
3,3
4,1


Using NMF in this approach seems to have a couple of significant drawbacks. The first is that interpretability is rather poor as there is no way to assume classes based on the top associated keywords or similar. Additionally, as these ratings are assessed on a scale rather than being distinct categories, there will be quite a bit more overlap in what constitutes each rating. That is to say, there are not clear clusterings of each rating, which will lead to more ambiguity in fitting the model. 

One method of alleviating these problems could be to convert the rating system to a simple binary like/dislike rating, which could allow for more purity in each class.


## Useful Resources:
- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
- https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html