# Part 2

The movie data is loaded into the Jupyter notebook, and the collection is stored as "data".

In [9]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import time
from sklearn.model_selection import train_test_split
from scipy.sparse import coo_matrix, csr_matrix
from scipy.spatial.distance import jaccard, cosine 
from pytest import approx

#Load Data
MV_users = pd.read_csv('data/users.csv')
MV_movies = pd.read_csv('data/movies.csv')
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

print (train.head(10))

#Structure Data in collections.namedtuple
from collections import namedtuple
Data = namedtuple('Data', ['users','movies','train','test'])
data = Data(MV_users, MV_movies, train, test)

    uID   mID  rating
0   744  1210       5
1  3040  1584       4
2  1451  1293       5
3  5455  3176       2
4  2507  3074       5
5  1465  1210       5
6  1050  2390       4
7  1587  1866       2
8  1611   551       5
9  3507  2580       3


- Matrix factorization (here, scikit-learn's NMF) is used to factorize the rating matrix (users x movies).

- The predicted rating for a given uID and mID is the dot product of the user's latent feature vector and the movie's latent feature vector.
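
As a minimal sketch of that dot-product prediction, the snippet below uses made-up latent vectors. The names `user_vec` and `movie_vec` and their values are illustrative assumptions, not taken from the fitted model.

```python
import numpy as np

# Hypothetical 3-dimensional latent features (values invented for illustration)
user_vec = np.array([1.0, 0.5, 2.0])   # latent features of one user
movie_vec = np.array([2.0, 1.0, 0.5])  # latent features of one movie

# Predicted rating = dot product of the two latent vectors
raw_pred = float(np.dot(user_vec, movie_vec))  # 1*2 + 0.5*1 + 2*0.5 = 3.5

# Keep the prediction inside the valid 1-5 rating scale
pred = float(np.clip(raw_pred, 1, 5))
print(raw_pred, pred)
```

Clipping matters mostly for out-of-range products: a raw value near 0 would be reported as 1.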


In [10]:
from sklearn.decomposition import NMF

class RecSys():
    def __init__(self, data, latent_dim=20, average_rating = False):
        self.data = data
        self.allusers = list(self.data.users['uID'])
        self.allmovies = list(self.data.movies['mID'])
        self.genres = list(self.data.movies.columns.drop(['mID', 'title', 'year']))
        self.mid2idx = dict(zip(self.data.movies.mID, list(range(len(self.data.movies)))))
        self.uid2idx = dict(zip(self.data.users.uID, list(range(len(self.data.users)))))
        self.rating_matrix()
        if average_rating:
            self.predict_to_user_average()
        self.latent_dim = latent_dim
        self.user_factors = None
        self.movie_factors = None
        self.SVD_factorize()

    #Create Rating Matrix (users x movies)
    def rating_matrix(self):
        ind_movie = [self.mid2idx[x] for x in self.data.train.mID]
        ind_user = [self.uid2idx[x] for x in self.data.train.uID]
        rating_train = list(self.data.train.rating)
        self.Mr = np.array(coo_matrix((rating_train, (ind_user, ind_movie)), shape=(len(self.allusers), len(self.allmovies))).toarray())
    
    #Get the average of the observed (non-zero) ratings for each user
    def predict_to_user_average(self):
        user_sum = self.Mr.sum(axis=1)
        user_count = (self.Mr > 0).sum(axis=1)
        average_ratings = np.divide(user_sum, user_count, out=np.full_like(user_sum, 3, dtype=float), where=user_count != 0)

        #Replace 0 entries with average rating for that user
        self.Mr = np.where(self.Mr == 0, average_ratings.reshape(-1,1), self.Mr)
        
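    # Factorize the rating matrix with NMF (non-negative matrix factorization).
    # Note: despite the method name, this is not an SVD.
    def SVD_factorize(self):
        model = NMF(n_components=self.latent_dim, init='random', random_state=42, max_iter=200)
        U = model.fit_transform(self.Mr)   # user factors, shape (n_users, latent_dim)
        V = model.components_.T            # movie factors, shape (n_movies, latent_dim)
        self.user_factors = U
        self.movie_factors = V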
    
    # Predict the rating for a given user and movie using the dot product of user and movie latent features
    def predict_matrix_factorized(self, uid, mid):
        uidx = self.uid2idx.get(uid, None)
        midx = self.mid2idx.get(mid, None)
        if uidx is None or midx is None:
            return 3
        else:
            pred = np.dot(self.user_factors[uidx], self.movie_factors[midx])
            pred = np.clip(pred, 1, 5)
            return pred

    # Predict ratings for all user-movie pairs in the test set
    def predict(self):
        test_uids = self.data.test.uID
        test_mids = self.data.test.mID
        predictions = np.array([self.predict_matrix_factorized(uid, mid) for uid, mid in zip(test_uids, test_mids)])
        return predictions

    # Calculate RMSE between predicted and actual ratings
    def rmse(self, yp):
        yp[np.isnan(yp)] = 3
        yt = np.array(self.data.test.rating)
        return np.sqrt(((yt - yp) ** 2).mean())

# Create a smaller sample of the train and test data
np.random.seed(42)
sample_train = train[:30000]
sample_test = test[:30000]
sample_MV_users = MV_users[(MV_users.uID.isin(sample_train.uID)) | (MV_users.uID.isin(sample_test.uID))]
sample_MV_movies = MV_movies[(MV_movies.mID.isin(sample_train.mID)) | (MV_movies.mID.isin(sample_test.mID))]
sample_data = Data(sample_MV_users, sample_MV_movies, sample_train, sample_test)
print (sample_data.movies.shape)
print (sample_data.users.shape)

#Fit Model A on the raw rating matrix (missing ratings left as 0) and evaluate it
model_A = RecSys(sample_data,latent_dim = 10)
yp = model_A.predict()
print(f'Model_A RMSE: {model_A.rmse(yp)}')

(3152, 21)
(5769, 5)
Model_A RMSE: 2.8168254488518953


Observation

- The RMSE is very high with this matrix factorization technique: about 2.82, far above 1.

- The rating matrix contains many missing ratings, stored as 0, and these distort the factorization.

- In the similarity comparison of the Week 3 assignment, missing ratings can be filtered out and disregarded when computing the similarity matrix.

---

Suggestion

- We can replace each 0 (missing) rating with the average rating of that user.
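
To see what this imputation does on a small scale, here is a sketch on a made-up 2 x 3 matrix. It follows the same idea as `predict_to_user_average`, but the data and variable names are assumptions chosen for illustration.

```python
import numpy as np

# Toy rating matrix (2 users x 3 movies); 0 marks a missing rating
ratings = np.array([[5, 0, 3],
                    [0, 2, 0]], dtype=float)

observed = ratings > 0
user_sum = ratings.sum(axis=1)
user_count = observed.sum(axis=1)

# Per-user mean over observed ratings only; fall back to 3 for users with no ratings
user_mean = np.divide(user_sum, user_count,
                      out=np.full(ratings.shape[0], 3.0),
                      where=user_count != 0)

# Fill only the missing cells with that user's mean
filled = np.where(observed, ratings, user_mean.reshape(-1, 1))
print(filled)
```

The first user's missing cell becomes 4 (the mean of 5 and 3), and the second user's missing cells become 2.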

In [11]:
model_B = RecSys(sample_data,latent_dim = 10,average_rating = True)
yp = model_B.predict()
print(f'Model_B RMSE: {model_B.rmse(yp)}')



Model_B RMSE: 1.1423770092265007


# Discussion

#### Experiment Results

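RMSE (root mean squared error) summarizes the typical size of the prediction error, in rating points. An RMSE of 1 means predictions are off by roughly one rating point in a typical case. It is not a guaranteed bound: errors are squared before averaging, so large errors weigh more heavily, and an individual error can be larger or smaller than the RMSE.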
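
A small worked example may help. The ratings below are invented for illustration and are unrelated to the notebook's test set.

```python
import numpy as np

# Invented true and predicted ratings for four user-movie pairs
y_true = np.array([4, 2, 5, 3], dtype=float)
y_pred = np.array([3, 2, 3, 3], dtype=float)

errors = y_true - y_pred               # [1, 0, 2, 0]
rmse = np.sqrt(np.mean(errors ** 2))   # sqrt((1 + 0 + 4 + 0) / 4) = sqrt(1.25)
max_error = np.abs(errors).max()       # the largest single error is 2

print(f'RMSE: {rmse:.3f}, largest single error: {max_error}')
```

Here the RMSE is about 1.118, yet one prediction is off by 2 points and two are exact.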

With the given result:

    Model A (raw sparse rating matrix) → RMSE ≈ 2.817

    Model B (with user-average imputation) → RMSE ≈ 1.142

- Model A performs poorly because it treats all 0s as real ratings, even though they represent missing values.

- Model B improves significantly by filling missing values with user averages, which reduces noise and allows the model to learn real patterns.


- In recommender systems, a zero in the matrix usually means "no rating" (missing data).
However, sklearn.decomposition.NMF treats it as an observed value of 0, as if the user had rated the movie 0, which is not true.
This introduces a large bias and noise into the model, causing the huge RMSE.

- It can be corrected by filling the "no rating" cells with the user's average rating.
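
The size of this bias can be sketched on a toy matrix. The numbers below are made up and do not reproduce the notebook's results. They only show how far zero-filling pulls the values the factorization is asked to reconstruct away from the observed ratings.

```python
import numpy as np

# Toy sparse rating matrix (3 users x 4 movies); 0 marks a missing rating
ratings = np.array([[5, 0, 0, 4],
                    [0, 3, 0, 0],
                    [0, 0, 4, 0]], dtype=float)

observed = ratings > 0

fill_fraction = observed.mean()            # share of cells with a real rating: 4/12
mean_with_zeros = ratings.mean()           # what a zero-filled fit sees: 16/12
mean_observed = ratings[observed].mean()   # mean of the real ratings: 16/4 = 4.0

print(f'observed cells: {fill_fraction:.0%}')
print(f'mean incl. zeros: {mean_with_zeros:.2f}, mean of observed: {mean_observed:.2f}')
```

With most cells equal to 0, a model fitted to the zero-filled matrix is rewarded for predicting values near 0, which then get clipped to 1 at prediction time.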

--- 

Similarity model (in Week 3) works better

- Similarity is computed based on observed ratings only
- The system was simple, interpretable, and robust against data sparsity
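
The Week 3 code is not shown in this notebook, so the following is only an assumed illustration of what "computed on observed ratings only" could look like. The helper `cosine_on_corated` is hypothetical and uses cosine similarity restricted to the positions where both vectors hold a rating.

```python
import numpy as np

def cosine_on_corated(a, b):
    """Cosine similarity over positions where both vectors are rated (non-zero)."""
    mask = (a > 0) & (b > 0)
    if not mask.any():
        return 0.0  # assumption: no co-rated entries means no similarity
    a_obs, b_obs = a[mask], b[mask]
    return float(np.dot(a_obs, b_obs) /
                 (np.linalg.norm(a_obs) * np.linalg.norm(b_obs)))

# Two movies' rating vectors over the same four users; 0 marks a missing rating
movie_x = np.array([5, 0, 4, 0], dtype=float)
movie_y = np.array([5, 3, 4, 0], dtype=float)

sim = cosine_on_corated(movie_x, movie_y)
print(f'similarity on co-rated users: {sim:.3f}')
```

Because the missing entries are masked out, they never enter the calculation as zeros, unlike in the zero-filled factorization above.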

In this case of movie rating predictions, the Similarity Model performs much better than the Matrix Factorization Model tested in this notebook.


