In [1]:
import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix

from sklearn.decomposition import NMF, TruncatedSVD
from sklearn.preprocessing import MinMaxScaler

## Non-Negative Matrix Factorization for Movie Recommendations

This notebook extends a previous week's assignment on recommender systems.

We have four files:
- rec_movies: one row per movie, its release year, and a number of columns for genre
- rec_users: one row per user, their gender, their age, their occupation, and their zip code
- rec_train: one row per user, movie, and rating
- rec_test: same format as rec_train, held out for evaluation


## 1. Load Movie Ratings Data and Predict with Matrix Factorization


In [2]:
users = pd.read_csv('data/rec_users.csv')
movies = pd.read_csv('data/rec_movies.csv')
train = pd.read_csv('data/rec_train.csv')
test = pd.read_csv('data/rec_test.csv')

In [3]:
allusers = list(users['uID'])
allmovies = list(movies['mID'])
genres = list(movies.columns.drop(['mID', 'title', 'year']))
mid2idx = dict(zip(movies.mID,list(range(len(movies)))))
uid2idx = dict(zip(users.uID,list(range(len(users)))))

# Turns the train set into a utility matrix with one row per user, one 
# column per movie, and each cell as that user's rating for that movie
movie_ratings_utility_matrix = np.array(
        coo_matrix(
            (
                list(train.rating)
                , (
                    [uid2idx[x] for x in train.uID]
                    , [mid2idx[x] for x in train.mID]
                )
            ), shape=(
                len(allusers)
                , len(allmovies)
            )
        ).toarray()
    )
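
To make the construction above concrete, here is a minimal sketch on hypothetical toy data (3 users, 4 movies; the index lists and ratings are invented for illustration). It shows that every user-movie pair without a rating ends up as 0 in the dense matrix, which matters later if ratings are assumed to start at 1.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical toy data: 3 users, 4 movies, 4 known ratings.
user_idx = [0, 0, 1, 2]
movie_idx = [0, 2, 1, 3]
ratings = [5, 3, 4, 1]

# Same pattern as the cell above: (values, (row indices, column indices)).
toy_utility = coo_matrix(
    (ratings, (user_idx, movie_idx)), shape=(3, 4)
).toarray()

print(toy_utility)

# Cells with no rating are filled with 0, not NaN.
n_unrated = int((toy_utility == 0).sum())
print(f'{n_unrated} of {toy_utility.size} cells are unrated')
```

Any model fit to this dense matrix sees those zeros as if they were actual ratings.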

In [4]:
def rmse(y_preds):
    # Impute any NaN predictions with 3; np.where returns a new array,
    # so the caller's predictions are not modified in place.
    y_preds = np.where(np.isnan(y_preds), 3, y_preds)
    y_true = np.array(test.rating)
    return np.sqrt(((y_true - y_preds)**2).mean())

The solutions are not included here, but I tested and confirmed that 
- `predict_everything_to_3()`
- `predict_to_user_average()`

still work, just to make sure my refactoring here did not break anything.

As a reminder, here is how each of the methods performed in the week three assignment.

| Method                              | RMSE  |
|:------------------------------------|:-----:|
| Baseline, $Y_p=3$                   | 1.259 |
| Baseline, $Y_p=\mu_u$               | 1.035 |
| Content based, item-item            | 1.013 |
| Collaborative, cosine               | 1.026 |
| Collaborative, jaccard, $M_r\geq 3$ | 0.982 |
| Collaborative, jaccard, $M_r\geq 1$ | 0.991 |
| Collaborative, jaccard, $M_r$       | 0.952 |

In [5]:
nmf = NMF(
    n_components=5
    , random_state=42
)
W = nmf.fit_transform(movie_ratings_utility_matrix)
H = nmf.components_

preds_nmf = W.dot(H)

In [6]:
print(f'W matrix shape: {W.shape}')
print(f'H matrix shape: {H.shape}')
print(f'Utility matrix shape: {movie_ratings_utility_matrix.shape}')

W matrix shape: (6040, 5)
H matrix shape: (5, 3883)
Utility matrix shape: (6040, 3883)


In [7]:
svd = TruncatedSVD(
    n_components=5
    , random_state=42
)

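U = svd.fit_transform(movie_ratings_utility_matrix)  # returns U·Σ (already scaled)
S = np.diag(svd.singular_values_)
V = svd.components_

# Caution: U above already includes Σ, so this product applies Σ twice.
# The results reported below come from this version.
preds_svd = U.dot(S.dot(V))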


In [8]:
print(f'U matrix shape: {U.shape}')
print(f'S matrix shape: {S.shape}')
print(f'V matrix shape: {V.shape}')
print(f'Utility matrix shape: {movie_ratings_utility_matrix.shape}')

U matrix shape: (6040, 5)
S matrix shape: (5, 5)
V matrix shape: (5, 3883)
Utility matrix shape: (6040, 3883)
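
One thing worth checking in the SVD cell: `TruncatedSVD.fit_transform` returns the data projected onto the components, which equals $U\Sigma$ rather than $U$ alone. The sketch below uses a hypothetical rank-2 toy matrix (not the ratings data) to compare a reconstruction that applies the singular values once against one that applies them twice, as `U.dot(S.dot(V))` does.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
# Hypothetical toy matrix of exact rank 2, so 2 components can reproduce it.
X = rng.random((8, 2)) @ rng.random((2, 6))

toy_svd = TruncatedSVD(n_components=2, random_state=42)
U_scaled = toy_svd.fit_transform(X)      # shape (8, 2); equals U·Σ
S_toy = np.diag(toy_svd.singular_values_)
V_toy = toy_svd.components_              # shape (2, 6)

recon_once = U_scaled @ V_toy            # Σ applied once
recon_twice = U_scaled @ S_toy @ V_toy   # Σ applied twice

print('max error, U_scaled @ V:    ', np.abs(X - recon_once).max())
print('max error, U_scaled @ S @ V:', np.abs(X - recon_twice).max())
```

If the same holds on the ratings matrix, the extra factor of the singular values would inflate the SVD predictions by a large amount.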


In [9]:
mms_nmf = MinMaxScaler((0, 5))
preds_test_nmf = np.array([
    preds_nmf[uid2idx[uid], mid2idx[mid]]
    for uid, mid in np.array(test.drop('rating', axis=1))
])
preds_test_nmf_scaled = mms_nmf.fit_transform(preds_test_nmf.reshape(-1,1)).flatten()

mms_svd = MinMaxScaler((0, 5))
preds_test_svd = np.array([
    preds_svd[uid2idx[uid], mid2idx[mid]]
    for uid, mid in np.array(test.drop('rating', axis=1))
])
preds_test_svd_scaled = mms_svd.fit_transform(preds_test_svd.reshape(-1,1)).flatten()
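
A note on the scaling step: min-max scaling maps the smallest prediction to 0 and the largest to 5, whatever the predictions mean, so a single extreme value can squeeze everything else toward one end. The sketch below uses hypothetical numbers and writes out the formula that `MinMaxScaler((0, 5))` applies to a single column. The clipping line is an alternative shown only for comparison, assuming a 1 to 5 rating scale; it is not what this notebook does.

```python
import numpy as np

# Hypothetical raw predictions: most are small, one is a large outlier.
raw_preds = np.array([0.1, 0.2, 0.3, 10.0])

# Min-max formula for a target range of (0, 5).
lo, hi = 0.0, 5.0
span = raw_preds.max() - raw_preds.min()
scaled = (raw_preds - raw_preds.min()) / span * (hi - lo) + lo

# Alternative (an assumption for illustration): clip to the rating range.
clipped = np.clip(raw_preds, 1.0, 5.0)

print('min-max scaled:', scaled.round(3))
print('clipped:       ', clipped)
```

Because the scaler is fit on the test predictions themselves, the result depends on the most extreme predictions in that set rather than on the rating scale.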

In [10]:
print(f'RMSE for NMF: {rmse(preds_test_nmf)}')
print(f'RMSE for NMF scaled: {rmse(preds_test_nmf_scaled)}')
print(f'RMSE for SVD: {rmse(preds_test_svd)}')
print(f'RMSE for SVD scaled: {rmse(preds_test_svd_scaled)}')


RMSE for NMF: 2.9914309681594413
RMSE for NMF scaled: 3.2785882531138237
RMSE for SVD: 1296.9826413080698
RMSE for SVD scaled: 3.2626762882028038


## 2. Why the Predictions Did Not Work
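
One plausible contributor is that unrated cells in the utility matrix are stored as 0, so the factorization tries to reproduce those zeros as if they were real ratings. The sketch below illustrates the effect on synthetic data only. It uses a plain truncated SVD from NumPy as a stand-in for the `NMF` and `TruncatedSVD` calls, and the data, the rank `k`, and the user-mean fill are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_movies, k = 30, 20, 2

# Synthetic "true" ratings on a 1-5 scale: a per-user level plus a
# small per-movie offset.
user_level = rng.uniform(2.0, 4.5, size=(n_users, 1))
movie_offset = rng.uniform(-0.5, 0.5, size=(1, n_movies))
true_ratings = np.clip(user_level + movie_offset, 1.0, 5.0)

# Roughly half of the cells are observed; the rest are held out.
observed = rng.random((n_users, n_movies)) < 0.5


def rank_k_reconstruction(matrix, k):
    """Best rank-k approximation of a dense matrix via SVD."""
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]


def heldout_rmse(filled):
    """RMSE of the rank-k reconstruction on the held-out cells."""
    preds = rank_k_reconstruction(filled, k)
    err = preds[~observed] - true_ratings[~observed]
    return np.sqrt(np.mean(err ** 2))


# Option 1: unrated cells stored as 0 (as in the utility matrix above).
zero_filled = np.where(observed, true_ratings, 0.0)

# Option 2: unrated cells filled with that user's mean observed rating,
# falling back to the global mean for a user with no observed ratings.
counts = observed.sum(axis=1, keepdims=True)
global_mean = true_ratings[observed].mean()
user_means = np.where(
    counts > 0,
    zero_filled.sum(axis=1, keepdims=True) / np.maximum(counts, 1),
    global_mean,
)
mean_filled = np.where(observed, true_ratings, user_means)

rmse_zero = heldout_rmse(zero_filled)
rmse_mean = heldout_rmse(mean_filled)
print(f'Held-out RMSE, zero-filled: {rmse_zero:.3f}')
print(f'Held-out RMSE, mean-filled: {rmse_mean:.3f}')
```

On this toy data the zero-filled version pulls held-out predictions toward 0, while the mean-filled version keeps them near each user's typical rating. Whether the same gap appears on the real ratings would need to be checked.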

## References

I needed some help figuring this out. Here are some resources I used when getting up to speed on using matrix factorization for rating predictions.

- https://medium.com/beek-tech/predicting-ratings-with-matrix-factorization-methods-cf6c68da775
- https://medium.com/analytics-vidhya/matrix-factorization-made-easy-recommender-systems-7e4f50504477