### Grading
The final score you receive for this programming assignment is scaled to the total points set in the programming assignment item, not the total point value in the nbgrader notebook.<br>
When calculating the final score shown to learners, the programming assignment takes the percentage of points earned out of the total points provided by nbgrader and returns the same percentage of the point value for the programming assignment. <br>
**DO NOT CHANGE VARIABLE OR METHOD SIGNATURES.** The autograder will not work properly if you change the variable or method signatures. 

### Validate Button
Please note that this assignment uses nbgrader to facilitate grading. You will see a **validate button** at the top of your Jupyter notebook. If you hit this button, it will run the test cases for the lab that aren't hidden. It is a good idea to use the validate button before submitting the lab. Note that the labs in this course contain hidden test cases; the validate button will not tell you whether they pass. After submitting your lab, you can see more information about these hidden test cases in the Grader Output. <br>
***Cells with longer execution times will cause the validate button to time out and freeze. Please know that if you run into Validate time-outs, it will not affect the final submission grading.*** <br>

# Building Recommender Systems for Movie Rating Prediction

In this assignment, we will build recommender systems that predict movie ratings. [MovieLens](https://grouplens.org/datasets/movielens/) currently has 25 million user-movie ratings. Since the entire dataset is too big, we use a 1 million ratings subset, [MovieLens 1M](https://www.kaggle.com/odedgolden/movielens-1m-dataset), and we have reformatted the data to make it more convenient to use.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import time
from sklearn.model_selection import train_test_split
from scipy.sparse import coo_matrix, csr_matrix
from scipy.spatial.distance import jaccard, cosine 
from pytest import approx

In [2]:
MV_users = pd.read_csv('data/users.csv')
MV_movies = pd.read_csv('data/movies.csv')
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

In [3]:
from collections import namedtuple
Data = namedtuple('Data', ['users','movies','train','test'])
data = Data(MV_users, MV_movies, train, test)

### Starter code
Now, we will build a recommender system that uses various techniques to predict ratings. 
The `class RecSys` has baseline prediction methods (such as predicting everything as 3 or as each user's average rating) and other utility functions. `class ContentBased` and `class Collaborative` inherit from `class RecSys` and add methods that calculate the item-item similarity matrix. You will complete those functions using what we learned about content-based filtering and collaborative filtering.

`RecSys`'s `rating_matrix` method converts the (user id, movie id, rating) triplet from the train data (train data's ratings are known) into a utility matrix for 6040 users and 3883 movies.    
Here, we create the utility matrix in a dense matrix (numpy.array) format for convenience. But with real-world data, where hundreds of millions of users and items may exist, we won't be able to create the utility matrix in a dense format (for those who are curious why, try measuring the size of the dense matrix with the `self.Mr.nbytes` attribute). In that case, we would use sparse matrix operations as much as possible, and distributed file systems and distributed computing would be needed. Fortunately, our data is small enough to fit in laptop/PC memory. Also, we will use numpy and scipy.sparse, which allow significantly faster calculations than operating on a pandas.DataFrame object.    
In the `rating_matrix` method, pay attention to the index mapping as user IDs and movie IDs are not the same as array index.
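
To make the index mapping concrete, here is a minimal sketch on made-up data (the IDs, ratings, and variable names below are hypothetical and not taken from the MovieLens files). It maps non-contiguous IDs to array positions, builds the utility matrix through `coo_matrix`, and compares the dense memory footprint with the number of ratings actually stored.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical toy data: IDs are not contiguous and do not start at 0.
user_ids = [10, 20, 40]
movie_ids = [101, 205, 310, 999]
triplets = [(10, 101, 5), (10, 310, 3), (20, 205, 4), (40, 999, 1)]

# Map each ID to its row/column position in the matrix.
uid2idx = {uid: i for i, uid in enumerate(user_ids)}
mid2idx = {mid: j for j, mid in enumerate(movie_ids)}

rows = [uid2idx[u] for u, _, _ in triplets]
cols = [mid2idx[m] for _, m, _ in triplets]
vals = [r for _, _, r in triplets]

# Sparse (coordinate) format first, then dense for convenience.
sparse = coo_matrix((vals, (rows, cols)), shape=(len(user_ids), len(movie_ids)))
dense = sparse.toarray()

print(dense)
print("dense bytes:", dense.nbytes)      # grows with users x movies
print("stored ratings:", sparse.nnz)     # grows with the number of ratings only
```

Looking up a rating then goes through the maps, e.g. `dense[uid2idx[10], mid2idx[310]]`, rather than using the raw IDs as indices.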

In [4]:
class RecSys():
    def __init__(self,data):
        self.data=data
        self.allusers = list(self.data.users['uID'])
        self.allmovies = list(self.data.movies['mID'])
        self.genres = list(self.data.movies.columns.drop(['mID', 'title', 'year']))
        self.mid2idx = dict(zip(self.data.movies.mID,list(range(len(self.data.movies)))))
        self.uid2idx = dict(zip(self.data.users.uID,list(range(len(self.data.users)))))
        self.Mr=self.rating_matrix()
        self.Mm=None 
        self.sim=np.zeros((len(self.allmovies),len(self.allmovies)))
        
    def rating_matrix(self):
        """
        Convert the rating matrix to numpy array of shape (#allusers,#allmovies)
        """
        ind_movie = [self.mid2idx[x] for x in self.data.train.mID] 
        ind_user = [self.uid2idx[x] for x in self.data.train.uID]
        rating_train = list(self.data.train.rating)
        
        return np.array(coo_matrix((rating_train, (ind_user, ind_movie)), shape=(len(self.allusers), len(self.allmovies))).toarray())


    def predict_everything_to_3(self):
        """
        Predict everything to 3 for the test data
        """
        # Generate an array with 3s against all entries in test dataset
        # your code here
        #np.array([3] * len(self.data.test))
        #np.ones(shape=(len(self.data.test,))) * 3
        return np.ones_like(self.data.test.rating) * 3
        
        
    def predict_to_user_average(self):
        """
        Predict to average rating for the user.
        Returns numpy array of shape (#users,)
        """
        # Generate an array as follows:
        # 1. Calculate all avg user rating as sum of ratings of user across all movies/number of movies whose rating > 0
        # 2. Return the average rating of users in test data
        # your code here
        mean_ratings = dict()
        test_IDs = self.data.test.uID.unique()
        for id in test_IDs:
            id_idx = self.uid2idx[id]
            ratings = self.Mr[id_idx]
            mean_ratings[id] = ratings.sum() / np.count_nonzero(ratings)
        yp = []
        for i in range(len(self.data.test)):
            yp.append(mean_ratings[self.data.test.uID[i]])
        yp = np.array(yp)
        return yp
    
    def predict_from_sim(self,uid,mid):
        """
        Predict a user rating on a movie given userID and movieID
        """
        # Predict user rating as follows:
        # 1. Get entry of user id in rating matrix
        # 2. Get entry of movie id in sim matrix
        # 3. Employ 1 and 2 to predict user rating of the movie
        # your code here
        index_userID = self.uid2idx[uid]
        ratings_index_userID = self.Mr[index_userID]
        index_movieID = self.mid2idx[mid]
        movie_sims = self.sim[index_movieID]
        sum_of_sims = np.dot(movie_sims, ratings_index_userID !=0)  # sum of sims where rating != 0
        rating = np.dot(ratings_index_userID, movie_sims) / sum_of_sims
        
        # if there are no similar movies, ie all sims=0 then the rating will be 0
        # if rating=0 then predict to user average
        if rating == 0:
            return self.Mr[index_userID].sum() / np.count_nonzero(self.Mr[index_userID])
        else:
            return rating
        # return rating
    
    def predict(self):
        """
        Predict ratings in the test data. Returns predicted rating in a numpy array of size (# of rows in testdata,)
        """
        # your code here
        yp = np.array([])
        for i in range(len(self.data.test)):
            uID = self.data.test.iloc[i]['uID']
            mID = self.data.test.iloc[i]['mID']
            rating = self.predict_from_sim(uID, mID)
            yp = np.append(yp, rating)
        return yp
    
    def rmse(self,yp):
        yp[np.isnan(yp)]=3 #In case there is nan values in prediction, it will impute to 3.
        yt=np.array(self.data.test.rating)
        return np.sqrt(((yt-yp)**2).mean())

    
class ContentBased(RecSys):
    def __init__(self,data):
        super().__init__(data)
        self.data=data
        self.Mm = self.calc_movie_feature_matrix()  
        
    def calc_movie_feature_matrix(self):
        """
        Create movie feature matrix in a numpy array of shape (#allmovies, #genres) 
        """
        # your code here
        m = self.data.movies.drop(['mID', 'title', 'year'], axis=1)
        return np.asmatrix(m)
    
    def calc_item_item_similarity(self):
        """
        Create item-item similarity using Jaccard similarity
        """
        # Update the sim matrix by calculating item-item similarity using Jaccard similarity
        # Jaccard Similarity: J(A, B) = |A∩B| / |A∪B| 
        # your code here
        from sklearn.metrics import pairwise_distances
        self.sim = (1 - pairwise_distances(self.Mm, metric='jaccard'))
        return
        
                
class Collaborative(RecSys):    
    def __init__(self,data):
        super().__init__(data)
        
    def calc_item_item_similarity(self, simfunction, *X):     #simfunction:  'cossim', 'jacsim'
        """
        Create item-item similarity using similarity function. 
        X is an optional transformed matrix of Mr
        """    
        # General function that calculates item-item similarity based on the sim function and data inputed
        if len(X)==0:
            self.sim = simfunction()            
        else:
            self.sim = simfunction(X[0]) # *X passes in a tuple format of (X,), to X[0] will be the actual transformed matrix
            
    def cossim(self):    
        """
        Calculates item-item similarity for all pairs of items using cosine similarity (values from 0 to 1) on utility matrix
        Returns a cosine similarity matrix of size (#all movies, #all movies)
        """
        # Return a sim matrix by calculating item-item similarity for all pairs of items using cosine similarity
        # Cosine Similarity: C(A, B) = (A.B) / (||A||.||B||) 
        # your code here
        
        # create X - a transformed/normalized matrix of .Mr (impute user mean rating for 0 and subtract user mean rating from all)
        X = self.Mr.copy().astype(float)
        for i in range(len(X)):
            user_avg = X[i].sum() / np.count_nonzero(X[i])
            user_avg = np.nan_to_num(user_avg)  # a user with no ratings gives 0/0 = NaN; use 0 as the mean
            X[i] = np.where(X[i]==0, X[i], X[i] - user_avg)  # ie  where val=0, val stays 0, else where !=0 set val = val - mean

        from sklearn.metrics import pairwise_distances
        X_sim = (1 - pairwise_distances(X.T, metric='cosine'))
        X_sim = 0.5 + (0.5 * X_sim)
        return X_sim
        # this solution PASSED all cells
        
#         # Compute **averaged** movie ratings for all users (movie_ratings_allUsers)
#         movie_ratings_allUsers = self.Mr.sum(axis=1) / np.count_nonzero(self.Mr, axis=1)
#         np.nan_to_num(movie_ratings_allUsers, copy=False)  # default copy=True

#         # Create a sparse matrix for operating cosine on its values
#         movie_ratings_array = np.repeat(np.expand_dims(movie_ratings_allUsers, axis=1), self.Mr.shape[1], axis=1)

#         # Take care of all the zero ratings (missing value/itentionally we don't know)
#         movie_ratings_array_adjusted = self.Mr + (self.Mr==0)*movie_ratings_array - movie_ratings_array

#         # Average all the ratings: divide by its magnitude!
#         MR_avg = movie_ratings_array_adjusted / (np.sqrt((movie_ratings_array_adjusted**2).sum(axis=0)))

#         # Put a Boundary check # 1: since dividing by magnitude may produce inf, zeros, etc. Set nans to 0.
#         MR_avg = np.nan_to_num(MR_avg)  # or np.nan_to_num(MR_avg, copy=False)

#         # Perform an item-item cosine similarity using: np.dot(matrix.T, matrix)
#         sim_mat = np.dot(MR_avg.T, MR_avg)

#         # Note that the 289 movies with all zero rating will have cosine sim = 0  -  all same-same movie ratings along diagonal should be 1
#         #a = np.argwhere(np.diag(sim_mat) == 0)
#         #sim_mat[a, a] = 1  
#         # still 42 (other vals along diagonal slightly > 1) - use alt method
#         idx = range(sim_mat.shape[0])
#         sim_mat[idx, idx] = 1

#         # Normalized Cosine Formula:
#         sim_mat = 0.5 + (0.5 * sim_mat)

#         return sim_mat

    
    def jacsim(self,Xr):
        """
        Calculates item-item similarity for all pairs of items using jaccard similarity (values from 0 to 1)
        Xr is the transformed rating matrix.
        """    
        # Return a sim matrix by calculating item-item similarity for all pairs of items using Jaccard similarity
        # Jaccard Similarity: J(A, B) = |A∩B| / |A∪B| 
        # your code here
        
#         # Convert Xr into a CSR format
#         # csr0 = csr_matrix((Xr>0).astype(int))
#         X = csr_matrix(Xr)

#         # get the intersection
#         # produces a n x n matrix of the intersection values, ie [0,1] is the intersection between col 0 and col 1 which is value 3
#         a = (X > 0).astype(int)
#         nz_intersect = np.dot(a.T, a)
        
        
        n = Xr.shape[1]
        max_val = int(Xr.max())
        nz_intersect = np.zeros((n,n)).astype(int)
        for i in range(1, max_val + 1):
                csrm = csr_matrix((Xr == i)).astype(int)
                nz_intersect = nz_intersect + np.array(np.dot(csrm.T, csrm).toarray()).astype(int)

        # get the union

        # get the nonzero counts of each column
        #colsums = A.sum(axis=0)  # alternatively
        colsums = np.count_nonzero(Xr, axis=0)  # alternatively

        # get matrix of sum of colsums between columns
        # start with matrix of n x n where row vals = sum for corresponding column eg col 1 = 4, all row[0] vals = 4
        n = Xr.shape[1] # how many movies / columns
        colsums_mat = np.repeat(colsums.reshape(n,1), n, axis=1)
        # add the colsum matrix to its transpose to get the pairs
        colsums_pairs = colsums_mat + colsums_mat.T

        # to get the union: subtract the intersection of a pair from the column sums of the two columns, e.g. col 1 = 4, col 2 = 3; total = 7, intersection = 3 ---> union = 4
        union = colsums_pairs - nz_intersect

        # calculate jaccard similarity
        sim = nz_intersect / union
        np.nan_to_num(sim, copy=False)  # NaNs potentially generated when union is zero

        d = np.argwhere(np.diag(sim) != 1)
        sim[d, d] = 1
        
        return np.array(sim)
    
    

# Q1. Baseline models [15 pts]

### 1a. Complete the function `predict_everything_to_3` in the class `RecSys`  [5 pts]

In [5]:
# Creating Sample test data
np.random.seed(42)
sample_train = train[:30000]
sample_test = test[:30000]


sample_MV_users = MV_users[(MV_users.uID.isin(sample_train.uID)) | (MV_users.uID.isin(sample_test.uID))]
sample_MV_movies = MV_movies[(MV_movies.mID.isin(sample_train.mID)) | (MV_movies.mID.isin(sample_test.mID))]


sample_data = Data(sample_MV_users, sample_MV_movies, sample_train, sample_test)

In [6]:
# Sample tests predict_everything_to_3 in class RecSys

sample_rs = RecSys(sample_data)
sample_yp = sample_rs.predict_everything_to_3()
print(sample_rs.rmse(sample_yp))
assert sample_rs.rmse(sample_yp)==approx(1.2642784503423288, abs=1e-3), "Did you predict everything to 3 for the test data?"

1.2642784503423288


In [7]:
# passed

In [8]:
# Hidden tests predict_everything_to_3 in class RecSys
rs = RecSys(data)
yp = rs.predict_everything_to_3()
print(rs.rmse(yp))

1.2585510334053043


In [9]:
method_yp3 = rs.rmse(yp)
print(method_yp3)

1.2585510334053043


In [143]:
# passed

### 1b. Complete the function predict_to_user_average in the class RecSys [10 pts]
Hint: Include rated items only when averaging
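
As an illustration of the hint, the sketch below compares the mean over rated items with a naive mean over all columns. The small matrix is hypothetical; leaving users with no ratings as NaN is an assumption made here for illustration (the `rmse` method in the starter code imputes NaN predictions to 3).

```python
import numpy as np

# Hypothetical utility matrix: 3 users x 4 movies, 0 means "unrated".
Mr = np.array([[5, 0, 3, 0],
               [0, 4, 0, 0],
               [0, 0, 0, 0]], dtype=float)

rated_counts = np.count_nonzero(Mr, axis=1)   # number of rated movies per user
rating_sums = Mr.sum(axis=1)

# Divide only where a user has at least one rating; others stay NaN.
user_mean = np.divide(rating_sums, rated_counts,
                      out=np.full(len(Mr), np.nan),
                      where=rated_counts > 0)

# Naive mean treats unrated entries as ratings of 0 and is biased downward.
naive_mean = Mr.mean(axis=1)

print(user_mean)
print(naive_mean)
```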

In [144]:
#     def predict_to_user_average(self):
#         all_ratings = pd.concat((self.data.train, self.data.test))
#         uIDs = all_ratings.uID.unique()
#         avg_user_ratings = dict()
#         for uID in uIDs:
#             avg_rating = self.data.train[(self.data.train.uID == uID) & (self.data.train.rating > 0)].rating.mean()
#             avg_user_ratings[uID] = avg_rating
#         yp = np.array([])
#         for index, row in self.data.test.iterrows():
#             yp = np.append(yp, avg_user_ratings[row['uID']])
#         return yp

# passes both tests below

In [145]:
# mean_ratings = dict()
# test_IDs = sample_rs.data.test.uID.unique()
# for id in test_IDs:
#     id_idx = sample_rs.uid2idx[id]
#     ratings = sample_rs.Mr[id_idx]
#     mean_ratings[id] = ratings.sum() / np.count_nonzero(ratings)
    
# yp = np.array([])
# for index, row in sample_rs.data.test.iterrows():
#     yp = np.append(yp, mean_ratings[row['uID']])

# yp = []
# for i in range(len(sample_rs.data.test)):
#     yp.append(mean_ratings[sample_rs.data.test.uID[i]])
# yp = np.array(yp)

In [12]:
# Sample tests predict_to_user_average in the class RecSys
sample_yp = sample_rs.predict_to_user_average()
print(sample_rs.rmse(sample_yp))
assert sample_rs.rmse(sample_yp)==approx(1.1429596846619763, abs=1e-3), "Check predict_to_user_average in the RecSys class. Did you predict to average rating for the user?" 

1.1429596846619763


In [13]:
# passed

In [14]:
# Hidden tests predict_to_user_average in the class RecSys
yp = rs.predict_to_user_average()
print(rs.rmse(yp))

1.0352910334228647


In [15]:
method2_ypavg = rs.rmse(yp)
print(method2_ypavg)

1.0352910334228647


In [149]:
# passed

# Q2. Content-Based model [25 pts]

### 2a. Complete the function calc_movie_feature_matrix in the class ContentBased [5 pts]

In [16]:
cb = ContentBased(data)

In [17]:
# tests calc_movie_feature_matrix in the class ContentBased 
assert(cb.Mm.shape==(3883, 18))

In [18]:
# passed

### 2b. Complete the function calc_item_item_similarity in the class ContentBased [10 pts]
This function updates `self.sim` and does not return a value.    
Some factors to think about:     
1. The movie feature matrix has binary elements. Which similarity metric should be used?
2. What is the computational complexity (time complexity) of the similarity calculation?      
Hint: You may use functions in the `scipy.spatial.distance` module on the dense matrix, but it is quite slow (think about the time complexity). If you want to speed up, you may try using functions in the `scipy.sparse` module. 
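
One possible way to follow the hint, sketched on a hypothetical 4-movie, 3-genre matrix: for binary rows, a single sparse matrix product gives every pairwise intersection size, and the union follows from the row counts. Setting the diagonal to 1 for a movie with no genres is an assumption made here, not something the assignment text specifies.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical binary genre matrix: 4 movies x 3 genres.
Mm = np.array([[1, 1, 0],
               [1, 0, 0],
               [0, 0, 1],
               [0, 0, 0]])

A = csr_matrix(Mm)
intersection = (A @ A.T).toarray()        # |A ∩ B|: genres shared by each pair
counts = Mm.sum(axis=1)                   # number of genres per movie
union = counts[:, None] + counts[None, :] - intersection   # |A ∪ B|

# Divide only where the union is non-empty to avoid 0/0.
sim = np.zeros(union.shape, dtype=float)
np.divide(intersection, union, out=sim, where=union > 0)

# Assumption: every movie is fully similar to itself, even with no genres.
np.fill_diagonal(sim, 1.0)

print(sim)
```

All pairs are computed in one product instead of a Python loop over pairs, which is where the speedup over calling a vector-vector distance function repeatedly comes from.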

In [13]:
from sklearn.metrics import pairwise_distances
my_cb = ContentBased(sample_data)
sim_mat = (1 - pairwise_distances(my_cb.Mm, metric='jaccard'))
np.trace(sim_mat)

3152.0

In [14]:
from scipy.spatial.distance import pdist, squareform
sim_mat = (1 - squareform(pdist(my_cb.Mm, metric='jaccard')))
np.trace(sim_mat)

3152.0

In [20]:
cb.calc_item_item_similarity()

In [21]:
# Sample tests calc_item_item_similarity in ContentBased class 

sample_cb = ContentBased(sample_data)
sample_cb.calc_item_item_similarity() 

# print(np.trace(sample_cb.sim))
# print(sample_cb.sim[10:13,10:13])
assert(sample_cb.sim.sum() > 0), "Check calc_item_item_similarity."
assert(np.trace(sample_cb.sim) == 3152), "Check calc_item_item_similarity. What do you think np.trace(cb.sim) should be?"


ans = np.array([[1, 0.25, 0.],[0.25, 1, 0.],[0., 0., 1]])
for pred, true in zip(sample_cb.sim[10:13, 10:13], ans):
    assert approx(pred, 0.01) == true, "Check calc_item_item_similarity. Look at cb.sim"

In [23]:
# tests calc_item_item_similarity in ContentBased class 

In [24]:
# additional tests for calc_item_item_similarity in ContentBased class 

In [25]:
# additional tests for calc_item_item_similarity in ContentBased class

In [26]:
# additional tests for calc_item_item_similarity in ContentBased class

In [27]:
# additional tests for calc_item_item_similarity in ContentBased class

In [28]:
# passed all above

### 2c. Complete the function predict_from_sim in the class RecSys [5 pts]
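
A minimal sketch of a similarity-weighted prediction for a single (user, movie) pair, using hypothetical numbers. The weights are the target movie's similarities to the movies the user has rated; falling back to the user's mean when all of those weights are zero is an assumption made for this illustration.

```python
import numpy as np

# Hypothetical data: one user's ratings over 4 movies (0 = unrated).
user_ratings = np.array([5.0, 0.0, 3.0, 0.0])
# Hypothetical similarities of the target movie to each of the 4 movies.
movie_sims = np.array([0.5, 0.9, 0.25, 0.0])

rated = user_ratings != 0
weight_sum = movie_sims[rated].sum()      # only rated movies contribute weight

if weight_sum > 0:
    # Unrated entries are 0, so they add nothing to the numerator.
    prediction = np.dot(user_ratings, movie_sims) / weight_sum
else:
    # Assumption: no similar rated movie, so use the user's average rating.
    prediction = user_ratings[rated].mean()

print(prediction)
```

Because the weights are non-negative and sum to the denominator, the prediction stays between the user's lowest and highest rating among the movies that carry weight.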

In [22]:
# for a, b in zip(sample_MV_users.uID, sample_MV_movies.mID):
#     print(a, b, sample_cb.predict_from_sim(a,b))

# Sample tests for predict_from_sim in RecSys class 
assert(sample_cb.predict_from_sim(245,276)==approx(2.5128205128205128,abs=1e-2)), "Check predict_from_sim. Look at how you predicted a user rating on a movie given UserID and movieID."
assert(sample_cb.predict_from_sim(2026,2436)==approx(2.785714285714286,abs=1e-2)), "Check predict_from_sim. Look at how you predicted a user rating on a movie given UserID and movieID."

In [30]:
# passed

In [31]:
# index_userID = sample_cb.uid2idx[245]
# ratings_index_userID = sample_cb.Mr[index_userID]
# index_movieID = sample_cb.mid2idx[276]
# movie_sims = sample_cb.sim[index_movieID]

# sum_of_sims = np.dot(movie_sims, ratings_index_userID !=0)
# np.dot(ratings_index_userID, movie_sims) / sum_of_sims

# # idx = np.nonzero(np.array(ratings_index_userID))
# # x = np.array(movie_sims)[idx].sum()
# # np.dot(ratings_index_userID, movie_sims) / x

# # if the rating is zero, ie all similarity scores ar zero, then compute user average
# sample_cb.Mr[index_userID].sum() / np.count_nonzero(sample_cb.Mr[index_userID])

In [32]:
# tests for predict_from_sim in RecSys class 

In [33]:
# passed - only after multiple attempts and finally kernel restart!

### 2d. Complete the function predict in the class RecSys [5 pts]
After completing the predict method in the RecSys class, run the cell below to calculate rating prediction and RMSE. How much does the performance increase compared to the baseline results from above? 

In [23]:
# Sample tests method predict in the RecSys class 

sample_yp = sample_cb.predict()
sample_rmse = sample_cb.rmse(sample_yp)
print(sample_rmse)

assert(sample_rmse==approx(1.1962537249116723, abs=1e-2)), "Check method predict in the RecSys class."

1.1962537249116723


In [35]:
# passed

In [36]:
# yp = np.array([])
# for i in range(len(sample_cb.data.test)):
#     uID = sample_cb.data.test.iloc[i]['uID']
#     mID = sample_cb.data.test.iloc[i]['mID']
#     rating = sample_cb.predict_from_sim(uID, mID)
#     yp = np.append(yp, rating)

# sample_cb.rmse(yp)

In [24]:
# Hidden tests method predict in the RecSys class 

yp = cb.predict()
rmse = cb.rmse(yp)
print(rmse)

1.0128116783754684


In [38]:
# tests method predict in the RecSys class 

In [25]:
method3_cb_ii = rmse
print(method3_cb_ii)

1.0128116783754684


In [None]:
# passed

# Q3. Collaborative Filtering

### 3a. Complete the function cossim in the class Collaborative [10 pts]
**To Do:**    
1. Impute the unrated entries in self.Mr with the user's average rating, then subtract the user mean. Call this matrix X.   
2. Calculate cosine similarity for all item-item pairs. Don't forget to rescale the cosine similarity to the range 0~1.    
You might encounter a divide-by-zero warning (numpy will fill that entry with nan). In that case, you can fill those entries with appropriate values.    

Hint: Suppose a movie has not been rated by anyone, so its vector is $\vec{0}$=[0,0,0,....,0]. When you normalize this vector, you'll get a divide-by-zero warning, which produces nan values in the self.sim matrix. Theoretically, what should the similarity value be for $\vec{x}_i \cdot \vec{x}_i$ when $\vec{x}_i = \vec{0}$? What about $\vec{x}_i \cdot \vec{x}_j$ when $\vec{x}_i = \vec{0}$ and $\vec{x}_j$ is an arbitrary vector?     

Hint: You may use `scipy.spatial.distance.cosine`, but it will be slow because it performs a vector-vector operation, whereas you can implement a matrix-matrix operation in numpy to calculate all cosines at once (it can be 100 times faster than the vector-vector operation on our data). Also pay attention to the definition: scipy.spatial.distance provides a distance, not a similarity. 

3. Run the cell below, which calculates yp and RMSE. 
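
The steps above can be sketched on a tiny hypothetical matrix as follows. This is an illustration under stated assumptions, not the graded solution: zero-norm columns are given a cosine of 0 against everything else, and the diagonal is forced to 1.

```python
import numpy as np

# Hypothetical utility matrix: 3 users x 4 movies, 0 means "unrated".
Mr = np.array([[5, 0, 1, 0],
               [4, 2, 0, 0],
               [0, 0, 0, 0]], dtype=float)

rated = Mr != 0
counts = rated.sum(axis=1)
user_mean = np.divide(Mr.sum(axis=1), counts,
                      out=np.zeros(len(Mr)), where=counts > 0)

# Imputing unrated entries with the user mean and then subtracting the mean
# leaves those entries at 0, so only rated entries are shifted.
X = np.where(rated, Mr - user_mean[:, None], 0.0)

# Normalize each movie column; columns with zero norm are left as zeros.
norms = np.linalg.norm(X, axis=0)
safe_norms = np.where(norms > 0, norms, 1.0)
Xn = X / safe_norms

cos = Xn.T @ Xn                 # all item-item cosines in one product
np.fill_diagonal(cos, 1.0)      # assumption: each item matches itself exactly
sim = 0.5 + 0.5 * cos           # rescale from [-1, 1] to [0, 1]

print(np.round(sim, 3))
```

With this choice, a movie nobody rated ends up with similarity 0.5 to every other movie after rescaling, and 1 to itself.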

In [26]:
# Sample tests cossim method in the Collaborative class

sample_cf = Collaborative(sample_data)
sample_cf.calc_item_item_similarity(sample_cf.cossim)
sample_yp = sample_cf.predict()
sample_rmse = sample_cf.rmse(sample_yp)

assert(np.trace(sample_cf.sim)==3152), "Check cossim method in the Collaborative class. What should np.trace(cf.sim) equal?"
assert(sample_rmse==approx(1.1429596846619763, abs=5e-3)), "Check cossim method in the Collaborative class. rmse result is not as expected."
assert(sample_cf.sim[0,:3]==approx([1., 0.5, 0.5],abs=1e-2)), "Check cossim method in the Collaborative class. cf.sim isn't giving the expected results."

In [None]:
# passed

In [154]:
# X = sample_cf.Mr.copy()
# for i in range(len(X)):
#     user_avg = X[i].sum() / np.count_nonzero(X[i])
#     X[i] = np.where(X[i]==0, X[i], X[i] - user_avg)  # ie  where val=0, val stays 0, else where !=0 set val = val - mean

# from sklearn.metrics import pairwise_distances
# X_sim = (1 - pairwise_distances(X.T, metric='cosine'))
# X_sim = 0.5 + (0.5 * X_sim)

In [176]:
print(np.count_nonzero(sample_cf.Mr==0))  # how many values are zero?
print(np.count_nonzero(sample_cf.Mr))     # how many values are non zero?
print(np.any(np.isnan(sample_cf.Mr)))     # are any values NaN?
print('total rows:', sample_cf.Mr.shape[0])
print('rows all zero:', np.all(sample_cf.Mr==0, axis=1).sum())  # how many rows are all zero?
print('rows not all zero:', np.any(sample_cf.Mr, axis=1).sum())     # how many rows have at least one rating?

np.nonzero(sample_cf.Mr)  # returns tuple of indices of nonzero values (array of dim_1, array of dim_2)
np.where(sample_cf.Mr!=0) # equivalent to np.nonzero - returns indices where val!=0

18153888
30000
False
total rows: 5769
rows all zero: 585
rows not all zero: 5184


(array([   0,    0,    1, ..., 5768, 5768, 5768]),
 array([ 463,  518,  694, ..., 1121, 1793, 2819]))

In [167]:
# cosine calculation examples from lecture
from scipy.spatial.distance import cosine

a = np.array([5, 0, 1, 4])
b = np.array([2, 3, 5, 0])
c = np.array([4, 4, 0, 4])
(1 - cosine(a, b))  # 0.375

# impute Nan/0 to neutral value 3
a = np.array([5, 3, 1, 4])
b = np.array([2, 3, 5, 3])
c = np.array([4, 4, 3, 4])
(1 - cosine(a, b))  # 0.74

# impute to 3 (and subtract 3)
a = np.array([5, 3, 1, 4]) - 3
b = np.array([2, 3, 5, 3]) - 3
(1 - cosine(a, b))  # -0.89

# normalize by user mean:  impute NaN/0 vals to mean (and subtract mean from all)
a = np.array([5, 3.25, 1, 4, 3]) - 3.25
b = np.array([2, 3, 5, 3.25, 3]) - 3.25
(1 - cosine(a, b))  # -0.94

-0.9403746653744962

In [8]:
# example of np.expand_dims
# np.expand_dims(a, axis)
x = np.array([1, 2])
x.shape # (2,)
y = np.expand_dims(x, axis=0)  # equivalent to x[np.newaxis, :] or x[np.newaxis]
y.shape # (1, 2)
y = np.expand_dims(x, axis=1)  # equivalent to x[:, np.newaxis]
y.shape # (2, 1)

(2, 1)

In [9]:
# example of np.repeat
# np.repeat(a, repeats, axis=None)   Output array which has the same shape as a, except along the given axis
np.repeat(3, 4)
x = np.array([[1,2],[3,4]])
np.repeat(x, 2)
np.repeat(x, 3, axis=1)    #np.repeat(x, 3, axis=0)
# np.repeat(x, [1, 2], axis=0)

array([[1, 1, 1, 2, 2, 2],
       [3, 3, 3, 4, 4, 4]])

In [28]:
test = sample_cf

In [232]:
# Compute **averaged** movie ratings for all users (movie_ratings_allUsers)
movie_ratings_allUsers = test.Mr.sum(axis=1) / np.count_nonzero(test.Mr, axis=1)
np.isnan(movie_ratings_allUsers).sum() #585 NaNs

# replace the NaNs with 0
np.nan_to_num(movie_ratings_allUsers, copy=False)  # default copy=True
np.isnan(movie_ratings_allUsers).sum()  # 0

0

In [181]:
# Repeat each user's mean rating across all movie columns (a dense matrix with the same shape as Mr)
movie_ratings_array = np.repeat(np.expand_dims(movie_ratings_allUsers, axis=1), test.Mr.shape[1], axis=1)
movie_ratings_array.shape # (5769, 3152)  # no NaNs

(5769, 3152)

In [197]:
# Take care of all the zero ratings (missing values: ratings we do not know)
movie_ratings_array_adjusted = test.Mr + (test.Mr==0)*movie_ratings_array - movie_ratings_array
# takes the original Mr, uses a bool matrix (where vals==0) * ratings to impute all zero vals to the mean for user
# then subtracts the mean from all values
movie_ratings_array_adjusted.shape  # (5769, 3152)  # no NaNs

(5769, 3152)

In [210]:
# are there columns that are all zero?  that will produce a norm=zero and div/0 --> NaN
np.all(movie_ratings_array_adjusted == 0, axis=0).sum() #289

289

In [228]:
# Average all the ratings: divide by its magnitude!
MR_avg = movie_ratings_array_adjusted / (np.sqrt((movie_ratings_array_adjusted**2).sum(axis=0)))
# note that np.sqrt((movie_ratings_array_adjusted**2).sum(axis=0)) should be equivalent to 
#    np.linalg.norm(movie_ratings_array_adjusted, ord=2, axis=0)

In [229]:
np.isnan(MR_avg).sum()  # 5769 users x 289 movies with all zero ratings = 1667241
np.all(np.isnan(MR_avg), axis=0).sum()  # 289 columns with all NaNs

289

In [243]:
# Put a Boundary check # 1: since dividing by magnitude may produce inf, zeros, etc. Set nans to 0.

# numpy.nan_to_num(x, copy=True, nan=0.0, posinf=None, neginf=None)
MR_avg = np.nan_to_num(MR_avg)  # or np.nan_to_num(MR_avg, copy=False)
MR_avg.shape  # (5769, 3152)
np.isnan(MR_avg).sum() # 0
np.count_nonzero(np.all(MR_avg == 0, axis=0))  # 289 columns are all zero (as expected)

289

In [244]:
# Perform an item-item cosine similarity using: np.dot(matrix.T, matrix)
sim_mat = np.dot(MR_avg.T, MR_avg)

In [248]:
# Note that the 289 movies with all zero rating will have cosine sim = 0
sim_mat.shape     # (3152, 3152)
sim_mat.trace()   # trace is 2863 (not 3152)
np.count_nonzero(np.diag(sim_mat) == 0)  # 289

2863.0

In [296]:
a = np.argwhere(np.diag(sim_mat) == 0)
b = np.argwhere(np.all(movie_ratings_array_adjusted == 0, axis=0))
np.all(np.equal(a, b))  # True

sim_mat[a, b].sum()
sim_mat[a, a].sum()  # 0

sim_mat[a, a] = 1
sim_mat.trace()  # 3152

3152.0

In [44]:
# But note!
np.count_nonzero(np.diag(sim_mat) > 1) # 42
# must be corrected to 1 - autograder checks max == 1
sim_mat[range(sim_mat.shape[0]), range(sim_mat.shape[0])] = 1
np.count_nonzero(np.diag(sim_mat) > 1)  # 0

0

In [45]:
# Normalized Cosine Formula:
sim_mat = 0.5 + (0.5 * sim_mat)
sim_mat.trace()

3152.0

In [27]:
# Hidden tests cossim method in the Collaborative class

cf = Collaborative(data)
cf.calc_item_item_similarity(cf.cossim)
yp = cf.predict()
rmse = cf.rmse(yp)
print(rmse)

1.0263081874204125


In [28]:
# method "B" rmse = 1.0263081874204125

In [29]:
method4_cb_cosine = rmse
print(rmse)

1.0263081874204125


In [None]:
# tests cossim method in the Collaborative class 

In [None]:
# additional tests for cossim method in the Collaborative class

In [None]:
# additional tests for cossim method in the Collaborative class

In [None]:
# additional tests for cossim method in the Collaborative class

In [None]:
# additional tests for cossim method in the Collaborative class

In [None]:
# additional tests for cossim method in the Collaborative class

In [None]:
# all cells passed

### 3b. Complete the function jacsim in the class Collaborative [15 pts]
**3b [15 pts] = 3b-i) [5 pts]+3b-ii) [5 pts]+ 3b-iii) [5 pts]**

Function `jacsim` calculates Jaccard similarity between items using the collaborative filtering method. In the rating matrix `self.Mr`, entries range from 0 to 5 (0: unrated, 1-5: rating). We are interested in which thresholding method works better when we use Jaccard similarity in collaborative filtering.    
We may treat any rating of 3 or above as 1, and negative ratings (below 3) and no rating as 0. Or, we may treat movies with any rating as 1 and ones that have no rating as 0. In this question, we will complete a function `jacsim` that takes a transformed rating matrix X and calculates and returns a Jaccard similarity matrix.     
Let's consider these input cases for the utility matrix $M_r$ with ratings 1-5 and 0s for no rating.    
1. $M_r \geq 3$ 
2. $M_r \geq 1$ 
3. $M_r$, no transform.

Things to think about: 
- Cases 1 and 2 are straightforward for calculating Jaccard, but what does Jaccard mean for multicategory data?
- Time complexity: The matrix $M_r$ is much bigger than the item feature matrix $M_m$, so the calculation will take a very long time on a dense matrix.     
Hint: Use a sparse matrix.
- Which method will give the best performance?
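
One possible reading of Jaccard for multicategory data, sketched on a hypothetical matrix: two movies "intersect" on a user only when that user gave both the same rating, and their union is every user who rated at least one of them. The per-rating indicator matrices are sparse, so each product stays cheap. Setting the diagonal to 1 for unrated movies is an assumption of this sketch.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical rating matrix: 4 users x 3 movies, 0 means "unrated".
Xr = np.array([[5, 5, 0],
               [3, 4, 0],
               [0, 2, 2],
               [1, 1, 0]])

n = Xr.shape[1]
intersection = np.zeros((n, n), dtype=int)

# For each rating value, count users who gave that same value to both movies.
for value in range(1, int(Xr.max()) + 1):
    indicator = csr_matrix((Xr == value).astype(int))
    intersection += (indicator.T @ indicator).toarray()

counts = np.count_nonzero(Xr, axis=0)                  # raters per movie
union = counts[:, None] + counts[None, :] - intersection

sim = np.zeros((n, n))
np.divide(intersection, union, out=sim, where=union > 0)
np.fill_diagonal(sim, 1.0)   # assumption: a movie is fully similar to itself

print(sim)
```

For a boolean input such as `Mr >= 3`, the loop runs once (for the value 1) and this reduces to ordinary binary Jaccard.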

### 3b-i)  When $M_r\geq3$ [5 pts]
After you've implemented the jacsim function, run the code below. If implemented correctly, you'll have RMSE below 0.99. 

In [7]:
cf = Collaborative(data)
Xr = cf.Mr>=3
t0=time.perf_counter()
cf.calc_item_item_similarity(cf.jacsim,Xr)
t1=time.perf_counter()
time_sim = t1-t0
print('similarity calculation time',time_sim)
yp = cf.predict()
rmse = cf.rmse(yp)
print(rmse)
assert(rmse<0.99)

similarity calculation time 1.5387999191880226
0.9819058692126349


In [None]:
# tests RMSE for jacsim implementation

In [None]:
# additional tests for RMSE for jacsim implementation

In [None]:
# additional tests for jacsim implementation

In [None]:
# *** did not pass single cell above ***

In [None]:
# additional tests for jacsim implementation

In [None]:
# passed

### 3b-ii)  When $M_r\geq1$ [5 pts]
After you've implemented the jacsim function, run the code below. If implemented correctly, you'll have RMSE below 1.0. 

In [8]:
cf = Collaborative(data)
Xr = cf.Mr>=1
t0=time.perf_counter()
cf.calc_item_item_similarity(cf.jacsim,Xr)
t1=time.perf_counter()
time_sim = t1-t0
print('similarity calculation time',time_sim)
yp = cf.predict()
rmse = cf.rmse(yp)
print(rmse)
assert(rmse<1.0)

similarity calculation time 1.7192324921488762
0.991363571262366


In [None]:
# tests RMSE for jacsim implementation 

In [None]:
# tests RMSE for jacsim implementation

In [None]:
# tests jacsim implementation

In [None]:
# tests performance of jacsim implementation

In [None]:
# all cells passed

### 3b-iii)  When $M_r$; no transform [5 pts]
After you've implemented the jacsim function, run the code below. If implemented correctly, you'll have RMSE below 0.96

In [9]:
cf = Collaborative(data)
Xr = cf.Mr.astype(int)
t0=time.perf_counter()
cf.calc_item_item_similarity(cf.jacsim,Xr)
t1=time.perf_counter()
time_sim = t1-t0
print('similarity calculation time',time_sim)
yp = cf.predict()
rmse = cf.rmse(yp)
print(rmse)
assert(rmse<0.96)

similarity calculation time 2.6094931066036224
0.9516534264490534


In [None]:
# tests jacsim implementation RMSE

In [None]:
# tests jacsim implementation RMSE

In [None]:
# tests jacsim implementation

In [None]:
# tests jacsim implementation performance

### 3.C Discussion [Peer Review]
Answer the questions below in this week's Peer Review assignment. <br>
1. Summarize the methods and performances: Below is a template/example.

|Method|RMSE|
|:----|:--------:|
|Baseline, $Y_p$=3| |
|Baseline, $Y_p=\mu_u$| |
|Content based, item-item| |
|Collaborative, cosine| |
|Collaborative, jaccard, $M_r\geq 3$|  |
|Collaborative, jaccard, $M_r\geq 1$|  |
|Collaborative, jaccard, $M_r$|  |

2. Discuss which method(s) work better than others and why.

In [34]:
print(method_yp3)
print(method2_ypavg)
print(method3_cb_ii)
print(method4_cb_cosine)

1.2585510334053043
1.0352910334228647
1.0128116783754684
1.0263081874204125


|Method|RMSE|
|:----|:--------:|
|Baseline, $Y_p$=3| 1.2586 |
|Baseline, $Y_p=\mu_u$| 1.0353 |
|Content based, item-item| 1.0128 |
|Collaborative, cosine| 1.0263 |
|Collaborative, jaccard, $M_r\geq 3$| 0.9819 |
|Collaborative, jaccard, $M_r\geq 1$| 0.9914 |
|Collaborative, jaccard, $M_r$| 0.9517 |

**Discussion:**  

The collaborative method using the Jaccard similarity measure gave the best results as measured by RMSE, in particular with no transformation of the original data, i.e. when the similarity matrix is defined on all rating levels.  

Performance generally improved from the baseline estimates to the content-based method to the collaborative methods; the exception is collaborative cosine (1.0263), which was slightly worse than content-based (1.0128).  Methods that capture more information are expected to perform better.  For example, predicting all ratings as 3 does not use any of the underlying information or preferences.  As more information is used, as in setting predicted ratings to the individual user average, performance improves.  

Content-based and collaborative methods are qualitatively different from one another.  Compared with the baseline methods, both capture more information.  However, the amount of relevant information captured, and in turn the relative improvement in performance, depends on the underlying data.  In this example, content-based similarity was based solely on genre classifications.  However, variations within genres and other factors also affect user preferences.  More detailed information could be used to improve similarity ratings and predictions, for example:  actors, directors, run time (short/long), themes not captured by genre (dystopian, futuristic, mystery, slow-burn, feel-good, etc.), specific elements (smoking, nudity, violence), and numerous other content-based elements.  With richer information about the content, a content-based measure could potentially perform better than a collaborative measure.  Similarly, collaborative measures are affected by the underlying data, for instance by how sparse the ratings are.  So, depending on the available information and the prediction goal, one measure or the other may perform better.

Within the collaborative method, capturing more information produced improved performance.  As the rating threshold was adjusted (or not used), different and more granular information about users' past ratings was captured, resulting in improved performance.
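
The comparison can be made concrete by expressing each RMSE as a percentage reduction relative to the constant baseline. The sketch below only reuses the rounded values from the results table; the dictionary keys are labels chosen for this example.

```python
# Rounded RMSE values copied from the results table
rmse = {
    "Baseline, Yp=3": 1.2586,
    "Baseline, Yp=user mean": 1.0353,
    "Content based, item-item": 1.0128,
    "Collaborative, cosine": 1.0263,
    "Collaborative, jaccard, Mr>=3": 0.9819,
    "Collaborative, jaccard, Mr>=1": 0.9914,
    "Collaborative, jaccard, Mr": 0.9517,
}

baseline = rmse["Baseline, Yp=3"]
# Percentage by which each method lowers RMSE relative to predicting 3 everywhere
reduction = {name: 100 * (baseline - value) / baseline
             for name, value in rmse.items()}

for name, pct in sorted(reduction.items(), key=lambda kv: kv[1]):
    print(f"{name:32s} {pct:5.1f}% lower RMSE than the constant baseline")

best = min(rmse, key=rmse.get)
print("best method:", best)
```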

# Building Recommender Systems for Movie Rating Prediction

In this assignment, we will build recommender systems that predict movie ratings. [MovieLens](https://grouplens.org/datasets/movielens/) currently has 25 million user-movie ratings. Since the entire dataset is too big, we use a 1 million ratings subset, [MovieLens 1M](https://www.kaggle.com/odedgolden/movielens-1m-dataset), which we reformatted to make it more convenient to use.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import time
from sklearn.model_selection import train_test_split
from scipy.sparse import coo_matrix, csr_matrix
from scipy.spatial.distance import jaccard, cosine 
from pytest import approx

In [2]:
MV_users = pd.read_csv('data/users.csv')
MV_movies = pd.read_csv('data/movies.csv')
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

In [3]:
from collections import namedtuple
Data = namedtuple('Data', ['users','movies','train','test'])
data = Data(MV_users, MV_movies, train, test)

### Starter code
Now, we will build a recommender system that uses various techniques to predict ratings. 
The `class RecSys` has baseline prediction methods (such as predicting everything to 3 or to the average rating of each user) and other utility functions. `class ContentBased` and `class Collaborative` inherit from `class RecSys` and add methods that calculate an item-item similarity matrix. You will complete those functions using what we learned about content-based filtering and collaborative filtering.

`RecSys`'s `rating_matrix` method converts the (user id, movie id, rating) triplets from the train data (whose ratings are known) into a utility matrix for 6040 users and 3883 movies.    
Here, we create the utility matrix in a dense format (numpy.array) for convenience. But with real-world data, where hundreds of millions of users and items may exist, we won't be able to create the utility matrix in a dense format (for those who are curious why, try measuring the dense matrix `self.Mr` using `.nbytes`). In that case, we would use sparse matrix operations as much as possible, and distributed file systems and distributed computing would be needed. Fortunately, our data is small enough to fit in laptop/PC memory. Also, we will use numpy and scipy.sparse, which allow significantly faster calculations than operating on pandas.DataFrame objects.    
In the `rating_matrix` method, pay attention to the index mapping, as user IDs and movie IDs are not the same as array indices.
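
The index mapping and the memory argument can be illustrated on a toy example. Everything below is hypothetical data made up for the sketch (the IDs are deliberately non-contiguous); only the `uID`, `mID` and `rating` column names and the 6040 x 3883 shape come from the assignment, and 8 bytes per entry is an assumption (int64).

```python
import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix

# Hypothetical toy tables; IDs do not start at 0 and have gaps
users = pd.DataFrame({"uID": [10, 20, 35]})
movies = pd.DataFrame({"mID": [7, 8, 100, 101]})
train = pd.DataFrame({"uID": [10, 10, 35],
                      "mID": [7, 101, 8],
                      "rating": [4, 5, 2]})

# Map each ID to its row/column position in the utility matrix
uid2idx = {uid: i for i, uid in enumerate(users.uID)}
mid2idx = {mid: j for j, mid in enumerate(movies.mID)}

rows = train.uID.map(uid2idx).to_numpy()
cols = train.mID.map(mid2idx).to_numpy()
sparse_Mr = coo_matrix((train.rating.to_numpy(), (rows, cols)),
                       shape=(len(users), len(movies)))
Mr = sparse_Mr.toarray()  # dense copy, one cell per (user, movie) pair

print(Mr)
print(Mr.nbytes, "bytes dense;", sparse_Mr.nnz, "stored ratings")

# Dense size for 6040 users x 3883 movies, assuming 8 bytes per entry
full_bytes = 6040 * 3883 * 8
print(full_bytes / 1e6, "MB")
```

The dense size grows with users times movies, whereas the sparse format grows only with the number of known ratings.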

In [4]:
class RecSys():
    def __init__(self,data):
        self.data=data
        self.allusers = list(self.data.users['uID'])
        self.allmovies = list(self.data.movies['mID'])
        self.genres = list(self.data.movies.columns.drop(['mID', 'title', 'year']))
        self.mid2idx = dict(zip(self.data.movies.mID,list(range(len(self.data.movies)))))
        self.uid2idx = dict(zip(self.data.users.uID,list(range(len(self.data.users)))))
        self.Mr=self.rating_matrix()
        self.Mm=None 
        self.sim=np.zeros((len(self.allmovies),len(self.allmovies)))
        
    def rating_matrix(self):
        """
        Convert the rating matrix to numpy array of shape (#allusers,#allmovies)
        """
        ind_movie = [self.mid2idx[x] for x in self.data.train.mID] 
        ind_user = [self.uid2idx[x] for x in self.data.train.uID]
        rating_train = list(self.data.train.rating)
        
        return np.array(coo_matrix((rating_train, (ind_user, ind_movie)), shape=(len(self.allusers), len(self.allmovies))).toarray())


    def predict_everything_to_3(self):
        """
        Predict everything to 3 for the test data
        """
        # Generate an array with 3s against all entries in test dataset
        # your code here
        #np.array([3] * len(self.data.test))
        #np.ones(shape=(len(self.data.test,))) * 3
        return np.ones_like(self.data.test.rating) * 3
        
        
    def predict_to_user_average(self):
        """
        Predict to average rating for the user.
        Returns numpy array of shape (# of rows in test data,)
        """
        # Generate an array as follows:
        # 1. Calculate all avg user rating as sum of ratings of user across all movies/number of movies whose rating > 0
        # 2. Return the average rating of users in test data
        # your code here
        mean_ratings = dict()
        test_IDs = self.data.test.uID.unique()
        for id in test_IDs:
            id_idx = self.uid2idx[id]
            ratings = self.Mr[id_idx]
            mean_ratings[id] = ratings.sum() / np.count_nonzero(ratings)
        yp = []
        for i in range(len(self.data.test)):
            yp.append(mean_ratings[self.data.test.uID[i]])
        yp = np.array(yp)
        return yp
    
    def predict_from_sim(self,uid,mid):
        """
        Predict a user rating on a movie given userID and movieID
        """
        # Predict user rating as follows:
        # 1. Get entry of user id in rating matrix
        # 2. Get entry of movie id in sim matrix
        # 3. Employ 1 and 2 to predict user rating of the movie
        # your code here
        index_userID = self.uid2idx[uid]
        ratings_index_userID = self.Mr[index_userID]
        index_movieID = self.mid2idx[mid]
        movie_sims = self.sim[index_movieID]
        sum_of_sims = np.dot(movie_sims, ratings_index_userID !=0)  # sum of sims where rating != 0
        rating = np.dot(ratings_index_userID, movie_sims) / sum_of_sims
        
        # if there are no similar movies, ie all sims=0 then the rating will be 0
        # if rating=0 then predict to user average
        if rating == 0:
            return self.Mr[index_userID].sum() / np.count_nonzero(self.Mr[index_userID])
        else:
            return rating
        # return rating
    
    def predict(self):
        """
        Predict ratings in the test data. Returns predicted rating in a numpy array of size (# of rows in testdata,)
        """
        # your code here
        yp = np.array([])
        for i in range(len(self.data.test)):
            uID = self.data.test.iloc[i]['uID']
            mID = self.data.test.iloc[i]['mID']
            rating = self.predict_from_sim(uID, mID)
            yp = np.append(yp, rating)
        return yp
    
    def rmse(self,yp):
        yp[np.isnan(yp)]=3 #In case there is nan values in prediction, it will impute to 3.
        yt=np.array(self.data.test.rating)
        return np.sqrt(((yt-yp)**2).mean())

    
class ContentBased(RecSys):
    def __init__(self,data):
        super().__init__(data)
        self.data=data
        self.Mm = self.calc_movie_feature_matrix()  
        
    def calc_movie_feature_matrix(self):
        """
        Create movie feature matrix in a numpy array of shape (#allmovies, #genres) 
        """
        # your code here
        m = self.data.movies.drop(['mID', 'title', 'year'], axis=1)
        return m.to_numpy()  # plain ndarray, matching the docstring
    
    def calc_item_item_similarity(self):
        """
        Create item-item similarity using Jaccard similarity
        """
        # Update the sim matrix by calculating item-item similarity using Jaccard similarity
        # Jaccard Similarity: J(A, B) = |A∩B| / |A∪B| 
        # your code here
        from sklearn.metrics import pairwise_distances
        self.sim = (1 - pairwise_distances(self.Mm, metric='jaccard'))
        return
        
                
class Collaborative(RecSys):    
    def __init__(self,data):
        super().__init__(data)
        
    def calc_item_item_similarity(self, simfunction, *X):     #simfunction:  'cossim', 'jacsim'
        """
        Create item-item similarity using similarity function. 
        X is an optional transformed matrix of Mr
        """    
        # General function that calculates item-item similarity based on the sim function and input data
        if len(X)==0:
            self.sim = simfunction()            
        else:
            self.sim = simfunction(X[0]) # *X arrives as a tuple (X,), so X[0] is the actual transformed matrix
            
    def cossim(self):    
        """
        Calculates item-item similarity for all pairs of items using cosine similarity (values from 0 to 1) on utility matrix
        Returns a cosine similarity matrix of size (#all movies, #all movies)
        """
        # Return a sim matrix by calculating item-item similarity for all pairs of items using cosine similarity
        # Cosine Similarity: C(A, B) = (A.B) / (||A||.||B||) 
        # your code here
        
        # create X - a transformed/normalized matrix of .Mr (impute user mean rating for 0 and subtract user mean rating from all)
        X = self.Mr.copy().astype(float)
        for i in range(len(X)):
            user_avg = X[i].sum() / np.count_nonzero(X[i])
            np.nan_to_num(user_avg, copy=False)
            X[i] = np.where(X[i]==0, X[i], X[i] - user_avg)  # ie  where val=0, val stays 0, else where !=0 set val = val - mean

        from sklearn.metrics import pairwise_distances
        X_sim = (1 - pairwise_distances(X.T, metric='cosine'))
        X_sim = 0.5 + (0.5 * X_sim)
        return X_sim
        # this solution PASSED all cells
        
#         # Compute **averaged** movie ratings for all users (movie_ratings_allUsers)
#         movie_ratings_allUsers = self.Mr.sum(axis=1) / np.count_nonzero(self.Mr, axis=1)
#         np.nan_to_num(movie_ratings_allUsers, copy=False)  # default copy=True

#         # Create a sparse matrix for operating cosine on its values
#         movie_ratings_array = np.repeat(np.expand_dims(movie_ratings_allUsers, axis=1), self.Mr.shape[1], axis=1)

#         # Take care of all the zero ratings (missing value/itentionally we don't know)
#         movie_ratings_array_adjusted = self.Mr + (self.Mr==0)*movie_ratings_array - movie_ratings_array

#         # Average all the ratings: divide by its magnitude!
#         MR_avg = movie_ratings_array_adjusted / (np.sqrt((movie_ratings_array_adjusted**2).sum(axis=0)))

#         # Put a Boundary check # 1: since dividing by magnitude may produce inf, zeros, etc. Set nans to 0.
#         MR_avg = np.nan_to_num(MR_avg)  # or np.nan_to_num(MR_avg, copy=False)

#         # Perform an item-item cosine similarity using: np.dot(matrix.T, matrix)
#         sim_mat = np.dot(MR_avg.T, MR_avg)

#         # Note that the 289 movies with all zero rating will have cosine sim = 0  -  all same-same movie ratings along diagonal should be 1
#         #a = np.argwhere(np.diag(sim_mat) == 0)
#         #sim_mat[a, a] = 1  
#         # still 42 (other vals along diagonal slightly > 1) - use alt method
#         idx = range(sim_mat.shape[0])
#         sim_mat[idx, idx] = 1

#         # Normalized Cosine Formula:
#         sim_mat = 0.5 + (0.5 * sim_mat)

#         return sim_mat

    
    def jacsim(self,Xr):
        """
        Calculates item-item similarity for all pairs of items using jaccard similarity (values from 0 to 1)
        Xr is the transformed rating matrix.
        """    
        # Return a sim matrix by calculating item-item similarity for all pairs of items using Jaccard similarity
        # Jaccard Similarity: J(A, B) = |A∩B| / |A∪B| 
        # your code here
        
#         # Convert Xr into a CSR format
#         # csr0 = csr_matrix((Xr>0).astype(int))
#         X = csr_matrix(Xr)

#         # get the intersection
#         # produces a n x n matrix of the intersection values, ie [0,1] is the intersection between col 0 and col 1 which is value 3
#         a = (X > 0).astype(int)
#         nz_intersect = np.dot(a.T, a)
        
        
        n = Xr.shape[1]
        max_val = int(Xr.max())
        nz_intersect = np.zeros((n,n)).astype(int)
        for i in range(1, max_val + 1):
            csrm = csr_matrix((Xr == i)).astype(int)
            nz_intersect = nz_intersect + (csrm.T @ csrm).toarray().astype(int)

        # get the union

        # get the nonzero counts of each column
        #colsums = A.sum(axis=0)  # alternatively
        colsums = np.count_nonzero(Xr, axis=0)  # alternatively

        # get matrix of sum of colsums between columns
        # start with matrix of n x n where row vals = sum for corresponding column eg col 1 = 4, all row[0] vals = 4
        n = Xr.shape[1] # how many movies / columns
        colsums_mat = np.repeat(colsums.reshape(n,1), n, axis=1)
        # add the colsum matrix to its transpose to get the pairs
        colsums_pairs = colsums_mat + colsums_mat.T

        # to get the union:  subtract the intersection of a pair from the column sums of the two columns  eg col 1 = 4, col 2 = 3; total = 7, int = 3 ---> union = 4
        union = colsums_pairs - nz_intersect

        # calculate jaccard similarity
        sim = nz_intersect / union
        np.nan_to_num(sim, copy=False)  # NaNs potentially generated when union is zero

        d = np.argwhere(np.diag(sim) != 1)
        sim[d, d] = 1
        
        return np.array(sim)
    
    

# Q1. Baseline models [15 pts]

### 1a. Complete the function `predict_everything_to_3` in the class `RecSys`  [5 pts]

In [5]:
# Creating Sample test data
np.random.seed(42)
sample_train = train[:30000]
sample_test = test[:30000]


sample_MV_users = MV_users[(MV_users.uID.isin(sample_train.uID)) | (MV_users.uID.isin(sample_test.uID))]
sample_MV_movies = MV_movies[(MV_movies.mID.isin(sample_train.mID)) | (MV_movies.mID.isin(sample_test.mID))]


sample_data = Data(sample_MV_users, sample_MV_movies, sample_train, sample_test)

In [6]:
# Sample tests predict_everything_to_3 in class RecSys

sample_rs = RecSys(sample_data)
sample_yp = sample_rs.predict_everything_to_3()
print(sample_rs.rmse(sample_yp))
assert sample_rs.rmse(sample_yp)==approx(1.2642784503423288, abs=1e-3), "Did you predict everything to 3 for the test data?"

1.2642784503423288


In [7]:
# passed

In [8]:
# Hidden tests predict_everything_to_3 in class RecSys
rs = RecSys(data)
yp = rs.predict_everything_to_3()
print(rs.rmse(yp))

1.2585510334053043


In [9]:
method_yp3 = rs.rmse(yp)
print(method_yp3)

1.2585510334053043


In [143]:
# passed

### 1b. Complete the function predict_to_user_average in the class RecSys [10 pts]
Hint: Include rated items only when averaging
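
A minimal sketch of what the hint means, on a made-up utility matrix: dividing by the number of nonzero entries gives the mean over rated items, whereas a plain row mean is pulled down by the zeros that stand for "unrated". Leaving users without ratings as NaN is an assumption of this sketch.

```python
import numpy as np

# Hypothetical utility matrix: 3 users x 4 movies, 0 = unrated
Mr = np.array([[5, 0, 3, 0],
               [0, 0, 0, 0],   # a user with no training ratings
               [4, 4, 0, 1]], dtype=float)

rated_counts = np.count_nonzero(Mr, axis=1)
sums = Mr.sum(axis=1)

# Mean over rated items only; users without ratings stay NaN
user_mean = np.divide(sums, rated_counts,
                      out=np.full(len(Mr), np.nan),
                      where=rated_counts > 0)

naive_mean = Mr.mean(axis=1)  # counts unrated zeros as ratings

print("rated-only mean:", user_mean)
print("naive mean:     ", naive_mean)
```

NaN predictions are harmless downstream in this notebook because the `rmse` method imputes them to 3.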

In [144]:
#     def predict_to_user_average(self):
#         all_ratings = pd.concat((self.data.train, self.data.test))
#         uIDs = all_ratings.uID.unique()
#         avg_user_ratings = dict()
#         for uID in uIDs:
#             avg_rating = self.data.train[(self.data.train.uID == uID) & (self.data.train.rating > 0)].rating.mean()
#             avg_user_ratings[uID] = avg_rating
#         yp = np.array([])
#         for index, row in self.data.test.iterrows():
#             yp = np.append(yp, avg_user_ratings[row['uID']])
#         return yp

# passes both tests below

In [145]:
# mean_ratings = dict()
# test_IDs = sample_rs.data.test.uID.unique()
# for id in test_IDs:
#     id_idx = sample_rs.uid2idx[id]
#     ratings = sample_rs.Mr[id_idx]
#     mean_ratings[id] = ratings.sum() / np.count_nonzero(ratings)
    
# yp = np.array([])
# for index, row in sample_rs.data.test.iterrows():
#     yp = np.append(yp, mean_ratings[row['uID']])

# yp = []
# for i in range(len(sample_rs.data.test)):
#     yp.append(mean_ratings[sample_rs.data.test.uID[i]])
# yp = np.array(yp)

In [12]:
# Sample tests predict_to_user_average in the class RecSys
sample_yp = sample_rs.predict_to_user_average()
print(sample_rs.rmse(sample_yp))
assert sample_rs.rmse(sample_yp)==approx(1.1429596846619763, abs=1e-3), "Check predict_to_user_average in the RecSys class. Did you predict to average rating for the user?" 

1.1429596846619763


In [13]:
# passed

In [14]:
# Hidden tests predict_to_user_average in the class RecSys
yp = rs.predict_to_user_average()
print(rs.rmse(yp))

1.0352910334228647


In [15]:
method2_ypavg = rs.rmse(yp)
print(method2_ypavg)

1.0352910334228647


In [149]:
# passed

# Q2. Content-Based model [25 pts]

### 2a. Complete the function calc_movie_feature_matrix in the class ContentBased [5 pts]

In [16]:
cb = ContentBased(data)

In [17]:
# tests calc_movie_feature_matrix in the class ContentBased 
assert(cb.Mm.shape==(3883, 18))

In [18]:
# passed

### 2b. Complete the function calc_item_item_similarity in the class ContentBased [10 pts]
This function updates `self.sim` and does not return a value.    
Some factors to think about:     
1. The movie feature matrix has binary elements. Which similarity metric should be used?
2. What is the computational complexity (time complexity) of the similarity calculation?      
Hint: You may use functions in the `scipy.spatial.distance` module on the dense matrix, but it is quite slow (think about the time complexity). If you want to speed up, you may try using functions in the `scipy.sparse` module. 
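
As a hedged illustration of the speed-up idea, binary Jaccard for all item pairs can be obtained from one sparse matrix product instead of a pairwise loop. The feature matrix below is a made-up stand-in for the genre matrix, and the result is checked against `scipy.spatial.distance.pdist`.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.spatial.distance import pdist, squareform

# Hypothetical binary item-feature matrix: 4 movies x 3 genres
Mm = np.array([[1, 1, 0],
               [1, 0, 0],
               [0, 0, 1],
               [1, 1, 1]])

F = csr_matrix(Mm)
intersection = (F @ F.T).toarray()   # shared genres for every movie pair
sizes = Mm.sum(axis=1)               # number of genres per movie
union = sizes[:, None] + sizes[None, :] - intersection

# Safe here because every toy movie has at least one genre;
# rows that are all zero would need a guard against 0/0
sim_fast = intersection / union

# Reference: pairwise Jaccard distance converted to similarity
sim_ref = 1 - squareform(pdist(Mm.astype(bool), metric="jaccard"))
print(np.allclose(sim_fast, sim_ref))
```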

In [13]:
from sklearn.metrics import pairwise_distances
my_cb = ContentBased(sample_data)
sim_mat = (1 - pairwise_distances(my_cb.Mm, metric='jaccard'))
np.trace(sim_mat)

3152.0

In [14]:
from scipy.spatial.distance import pdist, squareform
sim_mat = (1 - squareform(pdist(my_cb.Mm, metric='jaccard')))
np.trace(sim_mat)

3152.0

In [20]:
cb.calc_item_item_similarity()

In [21]:
# Sample tests calc_item_item_similarity in ContentBased class 

sample_cb = ContentBased(sample_data)
sample_cb.calc_item_item_similarity() 

# print(np.trace(sample_cb.sim))
# print(sample_cb.sim[10:13,10:13])
assert(sample_cb.sim.sum() > 0), "Check calc_item_item_similarity."
assert(np.trace(sample_cb.sim) == 3152), "Check calc_item_item_similarity. What do you think np.trace(cb.sim) should be?"


ans = np.array([[1, 0.25, 0.],[0.25, 1, 0.],[0., 0., 1]])
for pred, true in zip(sample_cb.sim[10:13, 10:13], ans):
    assert approx(pred, 0.01) == true, "Check calc_item_item_similarity. Look at cb.sim"

In [23]:
# tests calc_item_item_similarity in ContentBased class 

In [24]:
# additional tests for calc_item_item_similarity in ContentBased class 

In [25]:
# additional tests for calc_item_item_similarity in ContentBased class

In [26]:
# additional tests for calc_item_item_similarity in ContentBased class

In [27]:
# additional tests for calc_item_item_similarity in ContentBased class

In [28]:
# passed all above

### 2c. Complete the function predict_from_sim in the class RecSys [5 pts]
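
A common way to turn item-item similarities into a rating is a similarity-weighted average over the movies the user has already rated. The sketch below is one such formulation on made-up numbers; the name `predict_weighted` and the explicit zero-weight fallback are assumptions of this example, not the required implementation.

```python
import numpy as np

def predict_weighted(user_ratings, item_sims):
    """Similarity-weighted average of a user's known ratings (0 = unrated)."""
    rated = user_ratings != 0
    weight_total = item_sims[rated].sum()
    if weight_total == 0:
        # No rated movie is similar to the target: fall back to the user mean
        return float(user_ratings[rated].mean()) if rated.any() else np.nan
    # Unrated entries are 0, so they add nothing to the numerator
    return float(np.dot(user_ratings, item_sims) / weight_total)

# Hypothetical user row and similarity row for the target movie
user_ratings = np.array([5.0, 0.0, 3.0, 1.0])
item_sims = np.array([0.5, 1.0, 0.25, 0.0])

pred = predict_weighted(user_ratings, item_sims)
fallback = predict_weighted(user_ratings, np.zeros(4))
print(pred, fallback)
```

The denominator sums similarities over rated movies only; otherwise highly similar but unrated movies would shrink the prediction toward zero.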

In [22]:
# for a, b in zip(sample_MV_users.uID, sample_MV_movies.mID):
#     print(a, b, sample_cb.predict_from_sim(a,b))

# Sample tests for predict_from_sim in RecSys class 
assert(sample_cb.predict_from_sim(245,276)==approx(2.5128205128205128,abs=1e-2)), "Check predict_from_sim. Look at how you predicted a user rating on a movie given UserID and movieID."
assert(sample_cb.predict_from_sim(2026,2436)==approx(2.785714285714286,abs=1e-2)), "Check predict_from_sim. Look at how you predicted a user rating on a movie given UserID and movieID."

In [30]:
# passed

In [31]:
# index_userID = sample_cb.uid2idx[245]
# ratings_index_userID = sample_cb.Mr[index_userID]
# index_movieID = sample_cb.mid2idx[276]
# movie_sims = sample_cb.sim[index_movieID]

# sum_of_sims = np.dot(movie_sims, ratings_index_userID !=0)
# np.dot(ratings_index_userID, movie_sims) / sum_of_sims

# # idx = np.nonzero(np.array(ratings_index_userID))
# # x = np.array(movie_sims)[idx].sum()
# # np.dot(ratings_index_userID, movie_sims) / x

# # if the rating is zero, ie all similarity scores ar zero, then compute user average
# sample_cb.Mr[index_userID].sum() / np.count_nonzero(sample_cb.Mr[index_userID])

In [32]:
# tests for predict_from_sim in RecSys class 

In [33]:
# passed - only after multiple attempts and finally kernel restart!

### 2d. Complete the function predict in the class RecSys [5 pts]
After completing the predict method in the RecSys class, run the cell below to calculate rating prediction and RMSE. How much does the performance increase compared to the baseline results from above? 

In [23]:
# Sample tests method predict in the RecSys class 

sample_yp = sample_cb.predict()
sample_rmse = sample_cb.rmse(sample_yp)
print(sample_rmse)

assert(sample_rmse==approx(1.1962537249116723, abs=1e-2)), "Check method predict in the RecSys class."

1.1962537249116723


In [35]:
# passed

In [36]:
# yp = np.array([])
# for i in range(len(sample_cb.data.test)):
#     uID = sample_cb.data.test.iloc[i]['uID']
#     mID = sample_cb.data.test.iloc[i]['mID']
#     rating = sample_cb.predict_from_sim(uID, mID)
#     yp = np.append(yp, rating)

# sample_cb.rmse(yp)

In [24]:
# Hidden tests method predict in the RecSys class 

yp = cb.predict()
rmse = cb.rmse(yp)
print(rmse)

1.0128116783754684


In [38]:
# tests method predict in the RecSys class 

In [25]:
method3_cb_ii = rmse
print(method3_cb_ii)

1.0128116783754684


In [None]:
# passed

# Q3. Collaborative Filtering

### 3a. Complete the function cossim in the class Collaborative [10 pts]
**To Do:**    
1. Impute the unrated entries in `self.Mr` to the user's average rating, then subtract the user mean; call this matrix X.   
2. Calculate cosine similarity for all item-item pairs. Don't forget to rescale the cosine similarity to be in 0~1.    
You might encounter a divide-by-zero warning (numpy will fill that entry with a nan value). In that case, you can fill those entries with appropriate values.    

Hint: Let's say a movie item has not been rated by anyone. When you calculate the similarity of this vector to another, you will be working with $\vec{0}$=[0,0,0,....,0]. When you normalize this vector, you'll get a divide-by-zero warning, and it will produce nan values in the self.sim matrix. Theoretically, what should the similarity value be for $\vec{x}_i \cdot \vec{x}_i$ when $\vec{x}_i = \vec{0}$? What about $\vec{x}_i \cdot \vec{x}_j$ when $\vec{x}_i = \vec{0}$ and $\vec{x}_j$ is any vector?     

Hint: You may use `scipy.spatial.distance.cosine`, but it will be slow because it does vector-vector operations, whereas you can implement a matrix-matrix operation using numpy to calculate all cosines at once (it can be 100 times faster than the vector-vector operation on our data). Also pay attention to the definition: `scipy.spatial.distance` provides distance, not similarity. 

3. Run the cell below, which calculates yp and RMSE. 
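
The steps above can be sketched with matrix operations as follows. This is an illustration on a toy matrix, not the reference solution: the function name is hypothetical, and treating an all-zero item vector as having cosine 0 with every other item (hence 0.5 after rescaling) and similarity 1 with itself is one reasonable answer to the hint's question.

```python
import numpy as np

def centered_cosine_similarity(Mr):
    """Item-item cosine on user-mean-centered ratings, rescaled to [0, 1]."""
    X = Mr.astype(float)
    rated = X != 0
    counts = rated.sum(axis=1, keepdims=True)
    user_mean = np.divide(X.sum(axis=1, keepdims=True), counts,
                          out=np.zeros((len(X), 1)), where=counts > 0)
    # Imputing the mean and then subtracting it leaves unrated entries at 0
    X = np.where(rated, X - user_mean, 0.0)
    norms = np.sqrt((X ** 2).sum(axis=0))
    safe_norms = np.where(norms > 0, norms, 1.0)  # avoid dividing by zero
    Xn = X / safe_norms
    cos = Xn.T @ Xn              # all item pairs in one matrix product
    sim = 0.5 + 0.5 * cos        # map [-1, 1] onto [0, 1]
    np.fill_diagonal(sim, 1.0)   # an item is fully similar to itself
    return sim

# Hypothetical ratings: 3 users x 3 movies, the last movie is unrated
toy = np.array([[5, 3, 0],
                [4, 0, 0],
                [1, 5, 0]])
sim = centered_cosine_similarity(toy)
print(sim)
```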

In [26]:
# Sample tests cossim method in the Collaborative class

sample_cf = Collaborative(sample_data)
sample_cf.calc_item_item_similarity(sample_cf.cossim)
sample_yp = sample_cf.predict()
sample_rmse = sample_cf.rmse(sample_yp)

assert(np.trace(sample_cf.sim)==3152), "Check cossim method in the Collaborative class. What should np.trace(cf.sim) equal?"
assert(sample_rmse==approx(1.1429596846619763, abs=5e-3)), "Check cossim method in the Collaborative class. rmse result is not as expected."
assert(sample_cf.sim[0,:3]==approx([1., 0.5, 0.5],abs=1e-2)), "Check cossim method in the Collaborative class. cf.sim isn't giving the expected results."

In [None]:
# passed

In [154]:
# X = sample_cf.Mr.copy()
# for i in range(len(X)):
#     user_avg = X[i].sum() / np.count_nonzero(X[i])
#     X[i] = np.where(X[i]==0, X[i], X[i] - user_avg)  # ie  where val=0, val stays 0, else where !=0 set val = val - mean

# from sklearn.metrics import pairwise_distances
# X_sim = (1 - pairwise_distances(X.T, metric='cosine'))
# X_sim = 0.5 + (0.5 * X_sim)

In [176]:
print(np.count_nonzero(sample_cf.Mr==0))  # how many values are zero?
print(np.count_nonzero(sample_cf.Mr))     # how many values are non zero?
print(np.any(np.isnan(sample_cf.Mr)))     # are any values NaN?
print('total rows:', sample_cf.Mr.shape[0])
print('rows all zero:', np.all(sample_cf.Mr==0, axis=1).sum())  # how many rows are all zero?
print('rows not all zero:', np.any(sample_cf.Mr, axis=1).sum())     # how many rows have at least one rating?

np.nonzero(sample_cf.Mr)  # returns tuple of indices of nonzero values (array of dim_1, array of dim_2)
np.where(sample_cf.Mr!=0) # equivalent to np.nonzero - returns indices where val!=0

18153888
30000
False
total rows: 5769
rows all zero: 585
rows not all zero: 5184


(array([   0,    0,    1, ..., 5768, 5768, 5768]),
 array([ 463,  518,  694, ..., 1121, 1793, 2819]))

In [167]:
# cosine calculation examples from lecture
from scipy.spatial.distance import cosine

a = np.array([5, 0, 1, 4])
b = np.array([2, 3, 5, 0])
c = np.array([4, 4, 0, 4])
(1 - cosine(a, b))  # 0.375

# impute Nan/0 to neutral value 3
a = np.array([5, 3, 1, 4])
b = np.array([2, 3, 5, 3])
c = np.array([4, 4, 3, 4])
(1 - cosine(a, b))  # 0.74

# impute to 3 (and subtract 3)
a = np.array([5, 3, 1, 4]) - 3
b = np.array([2, 3, 5, 3]) - 3
(1 - cosine(a, b))  # -0.89

# normalize by user mean:  impute NaN/0 vals to mean (and subtract mean from all)
a = np.array([5, 3.25, 1, 4, 3]) - 3.25
b = np.array([2, 3, 5, 3.25, 3]) - 3.25
(1 - cosine(a, b))  # -0.94

-0.9403746653744962
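
The first three lecture values above can be reproduced without SciPy. The following is a minimal sketch; the helper name `cosine_sim` is hypothetical, and it returns 0 for a zero-length vector as an assumed convention.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity of two 1-D vectors; 0.0 if either vector has zero length."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

a = np.array([5, 0, 1, 4])
b = np.array([2, 3, 5, 0])

# 1) Zeros treated as if they were real ratings
raw = cosine_sim(a, b)

# 2) Unrated entries imputed with the neutral value 3
filled_a = np.where(a == 0, 3, a)
filled_b = np.where(b == 0, 3, b)
imputed = cosine_sim(filled_a, filled_b)

# 3) Imputed with 3, then centered on 3 (unrated entries become 0)
centered = cosine_sim(filled_a - 3, filled_b - 3)

print(round(raw, 3), round(imputed, 3), round(centered, 3))
```

Centering is what allows the similarity to become negative: after subtracting 3, users who rated in opposite directions contribute negative products.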

In [8]:
# example of np.expand_dims
# np.expand_dims(a, axis)
x = np.array([1, 2])
x.shape # (2,)
y = np.expand_dims(x, axis=0)  # equivalent to x[np.newaxis, :] or x[np.newaxis]
y.shape # (1, 2)
y = np.expand_dims(x, axis=1)  # equivalent to x[:, np.newaxis]
y.shape # (2, 1)

(2, 1)

In [9]:
# example of np.repeat
# np.repeat(a, repeats, axis=None)   Output array which has the same shape as a, except along the given axis
np.repeat(3, 4)
x = np.array([[1,2],[3,4]])
np.repeat(x, 2)
np.repeat(x, 3, axis=1)    #np.repeat(x, 3, axis=0)
# np.repeat(x, [1, 2], axis=0)

array([[1, 1, 1, 2, 2, 2],
       [3, 3, 3, 4, 4, 4]])

In [28]:
test = sample_cf

In [232]:
# Compute each user's average rating over the movies they rated (movie_ratings_allUsers)
movie_ratings_allUsers = test.Mr.sum(axis=1) / np.count_nonzero(test.Mr, axis=1)
np.isnan(movie_ratings_allUsers).sum()  # 585 NaNs: users with no ratings give 0/0

# replace the NaNs with 0
np.nan_to_num(movie_ratings_allUsers, copy=False)  # default copy=True
np.isnan(movie_ratings_allUsers).sum()  # 0

0
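
The 0/0 division and the follow-up `np.nan_to_num` call can be avoided by dividing only where a user has at least one rating. This is a small sketch on a hypothetical toy matrix, not the notebook's `sample_cf.Mr`.

```python
import numpy as np

# Toy rating matrix (hypothetical): rows are users, 0 means "not rated".
Mr = np.array([[5., 0., 1., 4.],
               [0., 0., 0., 0.],   # a user with no ratings at all
               [2., 3., 5., 0.]])

rating_sum = Mr.sum(axis=1)
rating_cnt = np.count_nonzero(Mr, axis=1)

# Divide only where the count is positive; elsewhere keep the 0 already in `out`.
user_mean = np.divide(rating_sum, rating_cnt,
                      out=np.zeros_like(rating_sum),
                      where=rating_cnt > 0)
print(user_mean)
```

Because the division is skipped for empty rows, no `RuntimeWarning` is raised and no NaN ever appears in the result.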

In [181]:
# Expand the per-user means to a dense (users x movies) array so they can be combined with Mr element-wise
movie_ratings_array = np.repeat(np.expand_dims(movie_ratings_allUsers, axis=1), test.Mr.shape[1], axis=1)
movie_ratings_array.shape # (5769, 3152)  # no NaNs

(5769, 3152)

In [197]:
# Take care of all the zero ratings (missing values: the rating is unknown, not an actual 0)
movie_ratings_array_adjusted = test.Mr + (test.Mr==0)*movie_ratings_array - movie_ratings_array
# takes the original Mr, uses a bool matrix (where vals==0) * ratings to impute all zero vals to the mean for user
# then subtracts the mean from all values
movie_ratings_array_adjusted.shape  # (5769, 3152)  # no NaNs

(5769, 3152)
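
The repeated mean matrix is not strictly needed: NumPy broadcasting can apply a `(users, 1)` column of means across all movie columns. The sketch below uses a hypothetical toy matrix to check that the broadcast form gives the same result as the `np.repeat` form used above.

```python
import numpy as np

Mr = np.array([[5., 0., 1., 4.],
               [2., 3., 5., 0.]])

# Per-user mean over rated entries, kept as a column vector of shape (2, 1)
user_mean = Mr.sum(axis=1, keepdims=True) / np.count_nonzero(Mr, axis=1)[:, None]

# Form used above: materialize the mean for every cell, impute, then subtract
mean_full = np.repeat(user_mean, Mr.shape[1], axis=1)
adjusted_repeat = Mr + (Mr == 0) * mean_full - mean_full

# Broadcast form: rated entries become (rating - mean), unrated entries stay 0
adjusted_bcast = np.where(Mr != 0, Mr - user_mean, 0.0)

print(np.allclose(adjusted_repeat, adjusted_bcast))
```

Imputing an unrated entry with the user's mean and then subtracting that mean always yields 0, which is why the `np.where` form can write 0 directly.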

In [210]:
# are there columns that are all zero?  that will produce a norm=zero and div/0 --> NaN
np.all(movie_ratings_array_adjusted == 0, axis=0).sum() #289

289

In [228]:
# Normalize each movie column to unit length: divide by its magnitude (L2 norm)
MR_avg = movie_ratings_array_adjusted / (np.sqrt((movie_ratings_array_adjusted**2).sum(axis=0)))
# note that np.sqrt((movie_ratings_array_adjusted**2).sum(axis=0)) should be equivalent to 
#    np.linalg.norm(movie_ratings_array_adjusted, ord=2, axis=0)

In [229]:
np.isnan(MR_avg).sum()  # 5769 users x 289 movies with all zero ratings = 1667241
np.all(np.isnan(MR_avg), axis=0).sum()  # 289 columns with all NaNs

289

In [243]:
# Boundary check #1: dividing an all-zero column by its zero magnitude produces NaN (0/0). Set NaNs to 0.

# numpy.nan_to_num(x, copy=True, nan=0.0, posinf=None, neginf=None)
MR_avg = np.nan_to_num(MR_avg)  # or np.nan_to_num(MR_avg, copy=False)
MR_avg.shape  # (5769, 3152)
np.isnan(MR_avg).sum() # 0
np.count_nonzero(np.all(MR_avg == 0, axis=0))  # 289 columns are all zero (as expected)

289
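
An alternative to cleaning up NaNs afterwards is to guard the norms before dividing. This is a sketch with a hypothetical 3x3 array whose middle column is all zero, standing in for a movie with no usable ratings.

```python
import numpy as np

X = np.array([[ 1.0, 0.0, -2.0],
              [-1.0, 0.0,  2.0],
              [ 0.5, 0.0,  1.0]])   # middle column is all zero

norms = np.linalg.norm(X, axis=0)               # one L2 norm per column
safe_norms = np.where(norms == 0, 1.0, norms)   # replace zero norms so 0/0 never happens
X_unit = X / safe_norms                         # zero columns simply stay zero

print(np.linalg.norm(X_unit, axis=0))
```

The nonzero columns end up with unit length and the all-zero column stays zero, matching what `np.nan_to_num` produces in the cell above.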

In [244]:
# Perform an item-item cosine similarity using: np.dot(matrix.T, matrix)
sim_mat = np.dot(MR_avg.T, MR_avg)

In [248]:
# Note that the 289 movies with all zero rating will have cosine sim = 0
sim_mat.shape     # (3152, 3152)
sim_mat.trace()   # trace is 2863 (not 3152)
np.count_nonzero(np.diag(sim_mat) == 0)  # 289

2863.0

In [296]:
a = np.argwhere(np.diag(sim_mat) == 0)
b = np.argwhere(np.all(movie_ratings_array_adjusted == 0, axis=0))
np.all(np.equal(a, b))  # True

sim_mat[a, b].sum()
sim_mat[a, a].sum()  # 0

sim_mat[a, a] = 1
sim_mat.trace()  # 3152

3152.0

In [44]:
# But note!
np.count_nonzero(np.diag(sim_mat) > 1) # 42
# must be corrected to 1 - autograder checks max == 1
sim_mat[range(sim_mat.shape[0]), range(sim_mat.shape[0])] = 1
np.count_nonzero(np.diag(sim_mat) > 1)  # 0

0
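
Diagonal entries slightly above 1 are presumably floating-point round-off: the dot product of a unit-length column with itself equals 1 only up to rounding. The sketch below reproduces the situation on hypothetical random data and resets the diagonal with `np.fill_diagonal`, which has the same effect as the range indexing used above.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))        # hypothetical data, not the rating matrix
X /= np.linalg.norm(X, axis=0)      # unit-length columns
sim = X.T @ X

# Largest deviation of the diagonal from exactly 1 (tiny, but possibly nonzero)
print(np.abs(np.diag(sim) - 1).max())

np.fill_diagonal(sim, 1.0)          # in place: every self-similarity is exactly 1
np.clip(sim, -1.0, 1.0, out=sim)    # keep off-diagonal entries inside [-1, 1] as well
```

Like the indexing version, `np.fill_diagonal` overwrites every diagonal entry, so items whose column was all zero also get a self-similarity of 1.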

In [45]:
# Normalized Cosine Formula:
sim_mat = 0.5 + (0.5 * sim_mat)
sim_mat.trace()

3152.0

In [27]:
# Hidden tests cossim method in the Collaborative class

cf = Collaborative(data)
cf.calc_item_item_similarity(cf.cossim)
yp = cf.predict()
rmse = cf.rmse(yp)
print(rmse)

1.0263081874204125


In [28]:
# method "B" rmse = 1.0263081874204125

In [29]:
method4_cb_cosine = rmse
print(rmse)

1.0263081874204125


In [None]:
# tests cossim method in the Collaborative class 

In [None]:
# additional tests for cossim method in the Collaborative class

In [None]:
# additional tests for cossim method in the Collaborative class

In [None]:
# additional tests for cossim method in the Collaborative class

In [None]:
# additional tests for cossim method in the Collaborative class

In [None]:
# additional tests for cossim method in the Collaborative class

In [None]:
# all cells passed

### 3b. Complete the function jacsim in the class Collaborative [15 pts]
**3b [15 pts] = 3b-i) [5 pts]+3b-ii) [5 pts]+ 3b-iii) [5 pts]**

Function `jacsim` calculates Jaccard similarity between items using the collaborative filtering method. In the rating matrix `self.Mr`, the entries range from 0 to 5 (0: unrated, 1-5: rating). We want to see which thresholding method works better when we use Jaccard similarity in collaborative filtering.    
We may treat any rating of 3 or above as 1, and negative ratings (below 3) and missing ratings as 0. Or, we may treat movies with any rating as 1 and movies with no rating as 0. In this question, we will complete a function `jacsim` that takes a transformed rating matrix X and calculates and returns a Jaccard similarity matrix.     
Let's consider these input cases for the utility matrix $M_r$ with ratings 1-5 and 0s for no-rating.    
1. $M_r \geq 3$ 
2. $M_r \geq 1$ 
3. $M_r$, no transform.

Things to think about: 
- Jaccard is straightforward to calculate for cases 1 and 2, but what does Jaccard mean for multicategory data?
- Time complexity: The matrix $M_r$ is much bigger than the item feature matrix $M_m$, so the calculation will take a very long time on a dense matrix.     
Hint: Use a sparse matrix.
- Which method will give the best performance?
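
For the two binary cases, the sparse-matrix hint can be sketched as follows. This is an illustration under stated assumptions, not the assignment's `jacsim`: the helper name `jaccard_binary` is hypothetical, rows are assumed to be users and columns items, and items with no users are given similarity 0.

```python
import numpy as np
from scipy.sparse import csr_matrix

def jaccard_binary(X):
    """Item-item Jaccard similarity for a 0/1 (or boolean) user-by-item matrix."""
    Xs = csr_matrix(X, dtype=np.int64)
    inter = (Xs.T @ Xs).toarray()            # inter[i, j] = users who have both items
    counts = inter.diagonal()                # users per item
    union = counts[:, None] + counts[None, :] - inter
    # |A ∩ B| / |A ∪ B|, leaving 0 where the union is empty
    return np.divide(inter, union, out=np.zeros(inter.shape), where=union > 0)

# Toy ratings (hypothetical): 3 users x 3 items, 0 means unrated
Mr = np.array([[5, 0, 1],
               [2, 3, 0],
               [0, 4, 4]])

sim_rated = jaccard_binary(Mr >= 1)   # any rating counts as 1
sim_liked = jaccard_binary(Mr >= 3)   # only ratings of 3 or above count as 1
print(sim_rated)
print(sim_liked)
```

The intersection counts come from a single sparse product, so the cost scales with the number of stored ratings rather than with the full users-by-items grid.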

### 3b-i)  When $M_r\geq3$ [5 pts]
After you've implemented the jacsim function, run the code below. If implemented correctly, you'll have RMSE below 0.99. 

In [7]:
cf = Collaborative(data)
Xr = cf.Mr>=3
t0=time.perf_counter()
cf.calc_item_item_similarity(cf.jacsim,Xr)
t1=time.perf_counter()
time_sim = t1-t0
print('similarity calculation time',time_sim)
yp = cf.predict()
rmse = cf.rmse(yp)
print(rmse)
assert(rmse<0.99)

similarity calculation time 1.5387999191880226
0.9819058692126349


In [None]:
# tests RMSE for jacsim implementation

In [None]:
# additional tests for RMSE for jacsim implementation

In [None]:
# additional tests for jacsim implementation

In [None]:
# *** did not pass single cell above ***

In [None]:
# additional tests for jacsim implementation

In [None]:
# passed

### 3b-ii)  When $M_r\geq1$ [5 pts]
After you've implemented the jacsim function, run the code below. If implemented correctly, you'll have RMSE below 1.0. 

In [8]:
cf = Collaborative(data)
Xr = cf.Mr>=1
t0=time.perf_counter()
cf.calc_item_item_similarity(cf.jacsim,Xr)
t1=time.perf_counter()
time_sim = t1-t0
print('similarity calculation time',time_sim)
yp = cf.predict()
rmse = cf.rmse(yp)
print(rmse)
assert(rmse<1.0)

similarity calculation time 1.7192324921488762
0.991363571262366


In [None]:
# tests RMSE for jacsim implementation 

In [None]:
# tests RMSE for jacsim implementation

In [None]:
# tests jacsim implementation

In [None]:
# tests performance of jacsim implementation

In [None]:
# all cells passed

### 3b-iii)  When $M_r$; no transform [5 pts]
After you've implemented the jacsim function, run the code below. If implemented correctly, you'll have RMSE below 0.96.

In [9]:
cf = Collaborative(data)
Xr = cf.Mr.astype(int)
t0=time.perf_counter()
cf.calc_item_item_similarity(cf.jacsim,Xr)
t1=time.perf_counter()
time_sim = t1-t0
print('similarity calculation time',time_sim)
yp = cf.predict()
rmse = cf.rmse(yp)
print(rmse)
assert(rmse<0.96)

similarity calculation time 2.6094931066036224
0.9516534264490534


In [None]:
# tests jacsim implementation RMSE

In [None]:
# tests jacsim implementation RMSE

In [None]:
# tests jacsim implementation

In [None]:
# tests jacsim implementation performance
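
One possible reading of Jaccard for the untransformed, multicategory matrix is to treat each item as the set of (user, rating) pairs it received. Two items then intersect on users who gave both the same rating, and the union is the total number of distinct pairs. The sketch below implements that reading only; it is an assumption, not necessarily the definition the autograder expects, and `jaccard_multicategory` is a hypothetical name.

```python
import numpy as np
from scipy.sparse import csr_matrix

def jaccard_multicategory(Mr, levels=(1, 2, 3, 4, 5)):
    """Jaccard on (user, rating) pairs; assumes every nonzero entry is one of `levels`."""
    Mr = np.asarray(Mr)
    n_items = Mr.shape[1]
    inter = np.zeros((n_items, n_items))
    for k in levels:
        Bk = csr_matrix(Mr == k, dtype=np.int64)   # indicator of "rated exactly k"
        inter += (Bk.T @ Bk).toarray()             # users who gave both items rating k
    counts = np.count_nonzero(Mr, axis=0)          # ratings received per item
    union = counts[:, None] + counts[None, :] - inter
    return np.divide(inter, union, out=np.zeros_like(inter), where=union > 0)

# Toy ratings (hypothetical): 3 users x 3 items, 0 means unrated
Mr = np.array([[5, 0, 1],
               [2, 3, 0],
               [0, 4, 4]])
sim_multi = jaccard_multicategory(Mr)
print(sim_multi)
```

In the toy matrix only the third user gives two items the same rating (4 and 4), so only that item pair gets a nonzero off-diagonal similarity.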

### 3.C Discussion [Peer Review]
Answer the questions below in this week's Peer Review assignment. <br>
1. Summarize the methods and performances: Below is a template/example.

|Method|RMSE|
|:----|:--------:|
|Baseline, $Y_p$=3| |
|Baseline, $Y_p=\mu_u$| |
|Content based, item-item| |
|Collaborative, cosine| |
|Collaborative, jaccard, $M_r\geq 3$|  |
|Collaborative, jaccard, $M_r\geq 1$|  |
|Collaborative, jaccard, $M_r$|  |

2. Discuss which method(s) work better than others and why.

In [34]:
print(method_yp3)
print(method2_ypavg)
print(method3_cb_ii)
print(method4_cb_cosine)

1.2585510334053043
1.0352910334228647
1.0128116783754684
1.0263081874204125


|Method|RMSE|
|:----|:--------:|
|Baseline, $Y_p$=3| 1.2586 |
|Baseline, $Y_p=\mu_u$| 1.0353 |
|Content based, item-item| 1.0128 |
|Collaborative, cosine| 1.0263 |
|Collaborative, jaccard, $M_r\geq 3$| 0.9819 |
|Collaborative, jaccard, $M_r\geq 1$| 0.9914 |
|Collaborative, jaccard, $M_r$| 0.9517 |

**Discussion:**  

The collaborative method with Jaccard similarity produced the lowest RMSE of all methods, in particular with no transformation of the original data (RMSE 0.9517), i.e., when the similarity matrix is defined on all rating levels.  

RMSE generally improved from the baseline estimates to the content-based and collaborative methods, although collaborative filtering with cosine similarity (1.0263) did slightly worse than the content-based method (1.0128).  Methods that capture more information are expected to perform better.  For example, predicting every rating as 3 uses none of the underlying information about preferences.  Using more information, as in setting each prediction to the individual user's average rating, improves performance (RMSE 1.2586 versus 1.0353).  

Content-based and collaborative methods are qualitatively different from one another.  Compared with the baseline methods, both capture more information.  However, the amount of relevant information captured, and in turn the relative improvement in performance, depends on the underlying data.  In this example, content-based similarity was based solely on genre classifications, yet there are obviously variations within genres, and other factors, that affect user preferences.  More detailed information could be used to improve similarity measures and predictions, for example: actors, directors, run time (short/long), themes not captured by genre (dystopian, futuristic, mystery, slow burn, feel good, etc.), specific elements (smoking, nudity, violence), and numerous other content-based attributes.  With richer information about the content, a content-based measure could potentially outperform a collaborative measure.  Similarly, collaborative measures are affected by the underlying data, for example by how sparse the ratings are.  So, depending on the available information and the prediction goal, one approach or the other may perform better.

Within the collaborative method, capturing more information improved performance.  As the rating threshold was adjusted (or dropped entirely), different and more granular information about users' past ratings was captured, and the RMSE fell to its lowest value (0.9517) when all rating levels were used.