# Objective
The goal of this personalization project is to maximize the accuracy of our recommendation system, measured on the top-5 movies recommended to a given user.

In short, we aim to maximize the accuracy or the coverage of the recommendations.

MovieLens is a web-based recommender system and virtual community that recommends movies for its users to watch based on their film preferences, using collaborative filtering (CF) of members' movie ratings and reviews. To address the cold-start problem for new users, MovieLens uses preference elicitation: it asks new users to rate how much they enjoy watching different genres of movies.

The dataset contains 10,000,054 ratings. Users were selected at random, but every selected user had rated at least 20 movies. The data consists of three files:

#### 1. Movies.dat
Each line of this file represents one movie, and has the following format:

MovieID::Title::Genres

MovieID is the real MovieLens id.

Movie titles, by policy, should be entered identically to those found in IMDB, including year of release. However, they are entered manually, so errors and inconsistencies may exist.

Genres are a pipe-separated list, and are selected from the following:

Action
Adventure
Animation
Children's
Comedy
Crime
Documentary
Drama
Fantasy
Film-Noir
Horror
Musical
Mystery
Romance
Sci-Fi
Thriller
War
Western


#### 2. Ratings.dat
All ratings are contained in the file ratings.dat. Each line of this file represents one rating of one movie by one user, and has the following format:

UserID::MovieID::Rating::Timestamp

The lines within this file are ordered first by UserID, then, within user, by MovieID.

Ratings are made on a 5-star scale, with half-star increments (0.5 to 5.0).

Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.


#### 3. Tags.dat
All tags are contained in the file tags.dat. Each line of this file represents one tag applied to one movie by one user, and has the following format:

UserID::MovieID::Tag::Timestamp

The lines within this file are ordered first by UserID, then, within user, by MovieID.

Tags are user generated metadata about movies. Each tag is typically a single word, or short phrase. The meaning, value and purpose of a particular tag is determined by each user.

Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.



# 1. Item-Based Nearest Neighbors

For each user, we'd like to accurately recommend a set of top 5 movies that they have not yet seen (that is, their rating is 0) and are likely to enjoy. To do this we will use an approach that is similar to weighted KNN.

First we will import the libraries needed to build our item-based nearest neighbors CF model and read in the datasets described earlier.


In [3]:
import pandas as pd 
import numpy as np
import random
from surprise import Dataset, evaluate
from surprise import KNNBasic
from sklearn import model_selection as cv  # cross_validation was removed in scikit-learn 0.20


#Define the format of each of the data files
#Movies; MovieID::Title::Genres
# The file has exactly three fields, so only three column names are needed
moviescol = ['MovieId', 'Title', 'Genres']
movies = pd.read_csv('./movies.dat', sep='::', names = moviescol, engine='python')

# Generate a new matrix with the movie ID and indicator columns for the genre(s) of that movie.
# Step 1: collect one (MovieId, Genre) pair per genre of each movie.
movie_genre = []
for (idx, row) in movies.iterrows():
    genres = row.loc['Genres'].split("|")
    movieid = row.loc['MovieId']
    for g in genres:
        movie_genre.append({'MovieId': movieid, 'Genre': g})


moviegenrecol = ['Action', 'Adventure', 'Animation', 'Children', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'IMAX', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']

test = pd.DataFrame(0, index = np.arange(len(movies)), columns = moviegenrecol)
MovieGenres = pd.concat([movies['MovieId'], test], axis = 1)
MovieGenres.columns= ['MovieId','Action', 'Adventure', 'Animation', 'Children', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'IMAX', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']
        
for row in movie_genre: 
    movieID = row['MovieId']
    genre = row['Genre']
    MovieGenres.loc[MovieGenres.MovieId==movieID,genre]=1

###########################################
# Reads in User, Item Ratings raw data file  (ratings.dat)
###########################################

#Ratings:
#UserID::MovieID::Rating::Timestamp
ratingscol = ['UserID', 'MovieID', 'Rating', 'Timestamp']
ratings = pd.read_csv('./ml-10M100K/ratings.dat', sep='::', names = ratingscol, engine='python')



To determine the most appropriate way to sample our data, we first need a better understanding of the sparsity level of the user-item matrix. As in the paper that introduced item-based CF, the sparsity level of a matrix is calculated as 1 - (nonzero entries / total entries).

In [43]:
# Compute the sparsity of the matrix by first finding the number of unique
# users and items present in the ratings df. The MovieLens site already
# reports these counts, but we derive them here regardless.

n_users = ratings['UserID'].nunique()
n_items = ratings['MovieID'].nunique()

print('Number of users = ' + str(n_users) + ' | Number of movies = ' + str(n_items))

sparsity = round(1.0-len(ratings)/float(n_users*n_items),3)
print('The sparsity level of MovieLens10M is {:.1f}%'.format(sparsity*100))

Number of users = 69878 | Number of movies = 10677
The sparsity level of MovieLens10M is 98.7%


Given that we want to start with a sample of 10000 users and 100 items, we cannot sample at random because of the high sparsity level calculated above. Instead, we take the 10000 users with the most ratings and the 100 most-rated movies.

In [5]:
user_sample = ratings['UserID'].value_counts().head(10000).index
movie_sample = ratings['MovieID'].value_counts().head(100).index

# Keep only the rows whose user AND movie are both in the samples (single combined mask)
ratings_subset = ratings.loc[ratings['UserID'].isin(user_sample) & ratings['MovieID'].isin(movie_sample)]

And then we can divide this subset into a training and test set for our item based CF model.

In [6]:
train_data, test_data = cv.train_test_split(ratings_subset, test_size=0.20)

We want to format the ratings matrices to be one row per user and one column per movie -- we also substitute all missing values with 0. Therefore we have a 10000 x 100 matrix where each element of the matrix (i,j) represents how user i rated movie j. 

In [14]:
#Ratings Matrix of Subset
subset_pivot = ratings_subset.pivot(index = 'UserID', columns = 'MovieID', values = 'Rating').fillna(0)

#Training and Test Data Ratings Matrices
ratings_pivot_train = train_data.pivot(index = 'UserID', columns = 'MovieID', values = 'Rating').fillna(0)
ratings_pivot_test = test_data.pivot(index = 'UserID', columns = 'MovieID', values = 'Rating').fillna(0)


In [15]:
nonzero = np.count_nonzero(subset_pivot)
print('The total number of nonzero entries in the ratings subset matrix is ' +  str(nonzero))

# 100 movies x 10000 users
totalentries = 100*10000.0
sparsity = 1- nonzero/totalentries

print('The sparsity of the ratings subset matrix is '+ str(sparsity))

The total number of nonzero entries in the ratings subset matrix is 622885
The sparsity of the ratings subset matrix is 0.377115


Now we obtain the two user-item matrices for the train and test data, each centered by subtracting the corresponding user's row mean.

In [16]:
# .values replaces the deprecated DataFrame.as_matrix()
train_matrix = ratings_pivot_train.values
mean_train_matrix = np.mean(train_matrix, axis =1)
Norm_train = train_matrix - mean_train_matrix.reshape(-1,1)

test_matrix = ratings_pivot_test.values
mean_test_matrix = np.mean(test_matrix, axis =1)
Norm_test = test_matrix - mean_test_matrix.reshape(-1,1)
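
One caveat with the cell above: because missing ratings were filled with 0, `np.mean(..., axis=1)` also averages over unrated movies, which pulls each user's mean toward 0 and turns unrated entries into negative values. The following is a minimal sketch of an alternative, assuming that 0 always means "not rated". The helper name `center_observed` is hypothetical and is not used elsewhere in this notebook.

```python
import numpy as np

def center_observed(matrix):
    """Subtract each user's mean over rated entries only; unrated entries stay 0."""
    matrix = np.asarray(matrix, dtype=float)
    rated = matrix > 0                          # assumption: 0 means "not rated"
    counts = rated.sum(axis=1)
    sums = matrix.sum(axis=1)                   # zeros contribute nothing to the sum
    # Users with no ratings get a mean of 0 so we never divide by zero
    means = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
    centered = np.where(rated, matrix - means.reshape(-1, 1), 0.0)
    return centered, means

# Toy 3 users x 3 movies matrix
toy = np.array([[5.0, 3.0, 0.0],
                [0.0, 0.0, 0.0],
                [4.0, 0.0, 2.0]])
centered, means = center_observed(toy)
print(means)
print(centered)
```

With this variant an unrated movie contributes nothing to later dot products, instead of acting like a strongly negative rating.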

In [13]:
# .values replaces the deprecated DataFrame.as_matrix()
subset_matrix = subset_pivot.values
mean_subset_matrix = np.mean(subset_matrix, axis=1)

#Normalized Ratings Subset Matrix -- normalized by each user's average rating
Norm_RM = subset_matrix - mean_subset_matrix.reshape(-1,1)

In [102]:
#Full Ratings matrix for Model Based CF 
ratings_pivot = ratings.pivot(index = 'UserID', columns = 'MovieID', values = 'Rating').fillna(0)

# .values replaces the deprecated DataFrame.as_matrix()
ratingsmatrix = ratings_pivot.values
mean_ratingsmatrix = np.mean(ratingsmatrix, axis=1)

#Full Normalized Ratings Matrix
Norm_RM = ratingsmatrix - mean_ratingsmatrix.reshape(-1,1)

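Now that we have Norm_train and Norm_test, we can compute pairwise distances between the movie columns of Norm_train to approximate the adjusted cosine similarity between items. Strictly, similarity(i,j) is measured only over users who have rated both items i and j (co-rated users); the computation below uses every row of the zero-filled matrix, so it is an approximation.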

In [48]:
from sklearn.metrics.pairwise import pairwise_distances
movie_similarity = pairwise_distances(Norm_train.T, metric = 'cosine')

print(movie_similarity.shape)

(100, 100)


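Note that `pairwise_distances` with `metric = 'cosine'` returns cosine distances (1 minus the cosine similarity), not similarities. A value of 0 means the two movies are most similar, and because the centered ratings contain negative values, the distances can range from 0 up to 2. To use these values as neighbor weights, they should first be converted back to similarities with 1 - distance.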
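
To make the difference between cosine distance and cosine similarity concrete, here is a small sketch on toy data. It computes the similarity directly with NumPy rather than through scikit-learn; the helper `cosine_similarity_matrix` is a hypothetical name introduced only for this illustration.

```python
import numpy as np

def cosine_similarity_matrix(columns):
    """Cosine similarity between the columns of a (users x items) matrix."""
    columns = np.asarray(columns, dtype=float)
    norms = np.linalg.norm(columns, axis=0)
    norms[norms == 0] = 1.0                  # avoid dividing by zero for empty items
    unit = columns / norms
    return unit.T.dot(unit)

# Toy mean-centered ratings: 3 users x 3 items
toy = np.array([[ 1.0, -1.0,  1.0],
                [-1.0,  1.0, -1.0],
                [ 0.0,  0.0,  0.5]])
similarity = cosine_similarity_matrix(toy)
distance = 1.0 - similarity                  # the corresponding cosine distance
print(np.round(similarity, 3))
print(np.round(distance, 3))
```

Items 0 and 1 point in opposite directions, so their similarity is -1 and their distance is 2, which is why a distance matrix cannot be used directly as a set of weights.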

Lastly, we define a prediction function that takes the ratings matrix and the similarity matrix as input and returns, for each user and movie, a weighted average of that user's ratings.

In [52]:
def predict(ratings_matrix, movie_sim):
    # Weighted sum of each user's ratings, normalized by the total absolute weight per movie
    prediction = ratings_matrix.dot(movie_sim) / np.array([np.abs(movie_sim).sum(axis=1)])
    return prediction

item_prediction = predict(train_matrix, movie_similarity)
print(item_prediction)

[[ 2.24582853  2.27919893  2.33387239 ...,  2.20001163  2.17035863
   2.21447905]
 [ 2.26690356  2.39011729  2.38873142 ...,  2.29656582  2.29948749
   2.32359231]
 [ 1.77970189  1.82176662  1.8198291  ...,  1.82184241  1.80737437
   1.81232816]
 ..., 
 [ 2.08775262  2.20442111  2.16729756 ...,  2.10405981  2.04054022
   2.06465039]
 [ 1.93244354  1.96529185  2.024831   ...,  1.95627726  1.93399987
   1.9427243 ]
 [ 1.54446714  1.66020773  1.65827868 ...,  1.52943019  1.48862508
   1.50991189]]


1. Define the set of users who have rated both movie i and movie j (the co-rated users).
2. Compute the similarity between movie i and movie j using Pearson's correlation.
3. Find the top k items most similar to movie i that user u has rated. The weighted average of these ratings is reported as the predicted rating of that item.
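
The three steps above can be sketched as follows. This is a minimal illustration on a toy matrix and not the implementation used in the cells above: the function names are hypothetical, 0 is assumed to mean "not rated", and neighbors with non-positive similarity are dropped, which is one possible design choice among several.

```python
import numpy as np

def pearson_item_similarity(ratings, i, j):
    """Pearson correlation between items i and j over users who rated both."""
    both = (ratings[:, i] > 0) & (ratings[:, j] > 0)   # step 1: co-rated users
    if both.sum() < 2:
        return 0.0
    x = ratings[both, i] - ratings[both, i].mean()
    y = ratings[both, j] - ratings[both, j].mean()
    denom = np.sqrt((x ** 2).sum() * (y ** 2).sum())
    return float((x * y).sum() / denom) if denom > 0 else 0.0   # step 2

def predict_rating(ratings, user, item, k=2):
    """Step 3: weighted average of the user's ratings on the k most similar items."""
    rated = [j for j in np.flatnonzero(ratings[user] > 0) if j != item]
    sims = [(pearson_item_similarity(ratings, item, j), j) for j in rated]
    # Keep the k most similar items, ignoring non-positive similarities
    top = sorted([s for s in sims if s[0] > 0], reverse=True)[:k]
    if not top:
        return 0.0
    weights = np.array([s for s, _ in top])
    values = np.array([ratings[user, j] for _, j in top])
    return float(weights.dot(values) / weights.sum())

# Toy 4 users x 4 movies matrix; user 0 has not rated movie 3
toy = np.array([[5.0, 4.0, 1.0, 0.0],
                [4.0, 5.0, 2.0, 4.0],
                [1.0, 2.0, 5.0, 2.0],
                [2.0, 1.0, 4.0, 1.0]])
print(predict_rating(toy, user=0, item=3, k=2))
```

In the toy data movie 3 correlates positively with movies 0 and 1 and negatively with movie 2, so the prediction for user 0 falls between that user's ratings of movies 0 and 1.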

# 2. Model Based CF Algorithm
Matrix Factorization

# 3. Evaluation Methods

## (i) Cross Validation Setup
To withhold 10% of the user-item pairs, draw a random number uniformly from [0,1] for each user-item pair and withhold the pairs whose draw is greater than 0.90.
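
A minimal sketch of this holdout scheme is shown below, assuming a dense ratings matrix in which 0 means "not rated". The function name `split_observed` and its parameters are hypothetical, and only observed ratings are eligible to be withheld.

```python
import numpy as np

def split_observed(ratings, holdout_fraction=0.10, seed=0):
    """Withhold roughly `holdout_fraction` of the observed (nonzero) ratings."""
    rng = np.random.RandomState(seed)
    observed = ratings > 0
    draws = rng.uniform(0.0, 1.0, size=ratings.shape)   # one draw per user-item pair
    held_out = observed & (draws > 1.0 - holdout_fraction)
    train = np.where(held_out, 0.0, ratings)             # withheld pairs look unrated
    test = np.where(held_out, ratings, 0.0)              # only withheld pairs remain
    return train, test

# Toy fully observed matrix: 200 users x 50 movies with ratings from 1 to 5
toy = np.random.RandomState(1).randint(1, 6, size=(200, 50)).astype(float)
train, test = split_observed(toy)
print(np.count_nonzero(test) / float(toy.size))
```

Because every pair gets its own independent draw, the withheld share is only approximately 10%, and some users may end up with no test ratings at all.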



## (ii) Accuracy on Training/Test Data
- Percent correct / hit rate: great for KPI correlation, but it needs a baseline for model accuracy, otherwise it is not helpful.
- ROC-based metrics (precision, recall, F1-score, AUC, the area under the curve): use when building and comparing models.
- ROC, percent-correct, and hit-rate metrics: use when communicating with key stakeholders.
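
As an illustration of the precision, recall, and F1 metrics listed above, here is a small sketch for a single user's top-k list. The function name is hypothetical, and what counts as a "relevant" item is an assumption to be fixed by the evaluation design (for example, withheld movies that the user rated 4 stars or higher).

```python
def precision_recall_at_k(recommended, relevant, k=5):
    """Precision, recall and F1 of the first k recommended items for one user."""
    top_k = list(recommended)[:k]
    relevant = set(relevant)
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / float(k) if k else 0.0
    recall = hits / float(len(relevant)) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical movie ids: 5 recommendations, 3 relevant withheld movies
precision, recall, f1 = precision_recall_at_k([10, 20, 30, 40, 50], [20, 50, 60], k=5)
print(precision, recall, f1)
```

Averaging these values over all users gives one number per model, which can then be compared against a baseline.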


(a) Histogram of MAE for the different similarity measures
(b) Sensitivity of the parameter x: MAE vs Train/Test ratio x 
(c) Sensitivity of the neighborhood size: MAE vs # of neighbors
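
The plots above all rely on an error measure computed on withheld ratings. A minimal sketch of MAE and RMSE is given below, assuming that both matrices have the same shape and that 0 in the actual matrix means "no rating to compare against". The function name `mae_rmse` is hypothetical.

```python
import numpy as np

def mae_rmse(predicted, actual):
    """MAE and RMSE over the entries where an actual rating exists."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    mask = actual > 0                        # assumption: 0 means "not rated"
    errors = predicted[mask] - actual[mask]
    mae = float(np.abs(errors).mean())
    rmse = float(np.sqrt((errors ** 2).mean()))
    return mae, rmse

actual = np.array([[4.0, 0.0],
                   [0.0, 2.0]])
predicted = np.array([[3.0, 5.0],
                      [1.0, 4.0]])
mae, rmse = mae_rmse(predicted, actual)
print(mae, rmse)
```

Calling this once per parameter setting (similarity measure, train/test ratio, or number of neighbors) yields the points for each of the plots.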


## (iii) Coverage on training and test data
1. Define, using accuracy metrics, what a good recommendation is.
2. Then measure:
	a. User Coverage: the fraction of users for which at least k items can be recommended well
	b. Item Coverage: the fraction of items that can be recommended well to at least k users
	c. Catalog Coverage: the fraction of items that are in the top-k for at least 1 user
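
The three coverage definitions above can be sketched on a matrix of predicted ratings. This is a hedged illustration: the function names are hypothetical, and "recommended well" is assumed here to mean a predicted rating at or above a chosen `threshold`, which is only one possible reading of step 1.

```python
import numpy as np

def user_coverage(predictions, k=5, threshold=3.5):
    """Fraction of users with at least k items predicted at or above threshold."""
    good = (predictions >= threshold).sum(axis=1)
    return float((good >= k).mean())

def item_coverage(predictions, k=5, threshold=3.5):
    """Fraction of items predicted at or above threshold for at least k users."""
    good = (predictions >= threshold).sum(axis=0)
    return float((good >= k).mean())

def catalog_coverage(predictions, k=5):
    """Fraction of items that appear in the top-k list of at least one user."""
    top_k = np.argsort(-predictions, axis=1)[:, :k]
    return len(np.unique(top_k)) / float(predictions.shape[1])

# Toy predictions: 3 users x 4 movies
toy = np.array([[4.5, 4.0, 1.0, 2.0],
                [4.2, 3.9, 1.5, 2.5],
                [3.0, 4.8, 3.6, 1.0]])
print(user_coverage(toy, k=2), item_coverage(toy, k=2), catalog_coverage(toy, k=2))
```

In the toy example movie 3 never reaches any user's top 2, so catalog coverage is below 1 even though every user can be served.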
    
What are the tradeoffs between accuracy and coverage?

Baseline = mean rating + user bias + item bias
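
One simple way to compute such a baseline is sketched below. It is unregularized, assumes 0 means "not rated", and estimates the item biases first and the user biases from what remains; the function name `baseline_predictions` is hypothetical, and other estimation orders or regularized variants are equally valid.

```python
import numpy as np

def baseline_predictions(ratings):
    """Baseline b_ui = mu + b_u + b_i estimated from the observed ratings."""
    ratings = np.asarray(ratings, dtype=float)
    observed = ratings > 0                              # assumption: 0 means "not rated"
    mu = ratings[observed].mean()                       # global mean rating
    item_counts = np.maximum(observed.sum(axis=0), 1)   # avoid dividing by zero
    item_bias = np.where(observed, ratings - mu, 0.0).sum(axis=0) / item_counts
    user_counts = np.maximum(observed.sum(axis=1), 1)
    residual = np.where(observed, ratings - mu - item_bias, 0.0)
    user_bias = residual.sum(axis=1) / user_counts
    return mu + user_bias.reshape(-1, 1) + item_bias.reshape(1, -1)

# Toy 2 users x 2 movies matrix; user 1 has not rated movie 1
toy = np.array([[5.0, 3.0],
                [4.0, 0.0]])
baseline = baseline_predictions(toy)
print(baseline)
```

Any CF model should beat this baseline on the withheld ratings; if it does not, the neighborhood or latent-factor component is adding nothing.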

How do your evaluation metrics change as a function of parameters such as the neighborhood size and the number of latent dimensions?

_Personal Note:_ In general, when the neighborhood size K is small, we force our model to be "more blind" to the overall distribution: a small K has low bias but high variance. A larger K, on the other hand, averages more neighbors in each prediction and is more resilient to outliers, which lowers the variance but increases the bias.

### Model Size Variation
How does overall accuracy change when you systematically sample your data from a small to large size? How does runtime scale? 

a) Error vs Sample Size

(i) Graph of accuracy measures (RMSE, MAE) vs sample size
(ii) Runtime Graph: Runtime vs Model Size 