# Objective
*What is your objective? What are you willing to sacrifice?*

Maximize the accuracy or the coverage of recommendations.

MovieLens is a web-based recommender system and virtual community that recommends movies for its users to watch based on their film preferences, using collaborative filtering (CF) of members' movie ratings and reviews. To address the cold-start problem for new users, MovieLens uses preference elicitation: it asks new users to rate how much they enjoy watching different genres of movies.

Develop a small dataset (< 10000 users / < 100 items), and be thoughtful about how you sample the data.

The dataset contains 10000054 ratings and 95580 tags applied to 10681 movies by 71567 users. Users were selected at random but all users selected had rated at least 20 movies. The data contains three files: 

#### 1. movies.dat
Each line of this file represents one movie, and has the following format:

MovieID::Title::Genres

MovieID is the real MovieLens id.

Movie titles, by policy, should be entered identically to those found in IMDB, including year of release. However, they are entered manually, so errors and inconsistencies may exist.

Genres are a pipe-separated list, and are selected from the following:

- Action
- Adventure
- Animation
- Children's
- Comedy
- Crime
- Documentary
- Drama
- Fantasy
- Film-Noir
- Horror
- Musical
- Mystery
- Romance
- Sci-Fi
- Thriller
- War
- Western


#### 2. ratings.dat
All ratings are contained in the file ratings.dat. Each line of this file represents one rating of one movie by one user, and has the following format:

UserID::MovieID::Rating::Timestamp

The lines within this file are ordered first by UserID, then, within user, by MovieID.

Ratings are made on a 5-star scale, with half-star increments.

Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.
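As a minimal sketch of the format above, the snippet below parses one `UserID::MovieID::Rating::Timestamp` line and converts the epoch timestamp to a UTC datetime. The function name `parse_rating_line` and the sample line are assumptions made for illustration, not part of the dataset documentation.

```python
from datetime import datetime, timezone

def parse_rating_line(line):
    """Split one 'UserID::MovieID::Rating::Timestamp' line into typed fields."""
    user_id, movie_id, rating, timestamp = line.strip().split("::")
    return {
        "UserID": int(user_id),
        "MovieID": int(movie_id),
        "Rating": float(rating),  # half-star increments, so a float rather than an int
        # Seconds since 1970-01-01 00:00 UTC -> timezone-aware datetime
        "Time": datetime.fromtimestamp(int(timestamp), tz=timezone.utc),
    }

# Hypothetical sample line, made up for illustration
example = parse_rating_line("1::122::3.5::838985046")
```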


#### 3. tags.dat
All tags are contained in the file tags.dat. Each line of this file represents one tag applied to one movie by one user, and has the following format:

UserID::MovieID::Tag::Timestamp

The lines within this file are ordered first by UserID, then, within user, by MovieID.

Tags are user generated metadata about movies. Each tag is typically a single word, or short phrase. The meaning, value and purpose of a particular tag is determined by each user.

Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.



# 1. Item-Based Nearest Neighbors

First we will import the necessary libraries for building our item-based nearest neighbors CF model as well as read in our datasets described earlier. 

1) Check your sparsity 
2) If not too sparse, sample randomly. If too sparse, choose top n items / top n users. State concerns / assumptions. 
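The two steps above can be sketched roughly as follows, assuming a `ratings` DataFrame with `UserID`, `MovieID`, and `Rating` columns. The helper names `sparsity` and `top_n_subset`, the toy data, and the cutoffs are hypothetical placeholders.

```python
import pandas as pd

def sparsity(ratings):
    """Fraction of the user x item matrix that has no rating."""
    n_users = ratings["UserID"].nunique()
    n_items = ratings["MovieID"].nunique()
    return 1.0 - len(ratings) / (n_users * n_items)

def top_n_subset(ratings, n_users, n_items):
    """Keep the n_items most-rated movies, then the n_users most active users among them."""
    top_items = ratings["MovieID"].value_counts().nlargest(n_items).index
    subset = ratings[ratings["MovieID"].isin(top_items)]
    top_users = subset["UserID"].value_counts().nlargest(n_users).index
    return subset[subset["UserID"].isin(top_users)]

# Toy data for illustration only
toy = pd.DataFrame({
    "UserID":  [1, 1, 1, 2, 2, 3],
    "MovieID": [10, 20, 30, 10, 20, 10],
    "Rating":  [4.0, 3.5, 5.0, 2.0, 4.5, 3.0],
})
small = top_n_subset(toy, n_users=2, n_items=2)
print(sparsity(toy), sparsity(small))
```

One concern worth stating with this kind of sampling: keeping only the most-rated items and most active users makes the matrix denser, but it biases the sample toward popular movies and heavy raters.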

In [3]:
import pandas as pd
import numpy as np

# Define the format of each of the data files
# Movies: MovieID::Title::Genres (three fields per line)
moviescol = ['MovieId', 'Title', 'Genres']
movies = pd.read_csv('./ml-10M100K/movies.dat', sep='::', names=moviescol, engine='python')

# Long format: one row per (movie, genre) pair
movie_genre = (
    movies.assign(Genre=movies['Genres'].str.split('|'))
          .explode('Genre')[['MovieId', 'Genre']]
          .reset_index(drop=True)
)

# Wide format: one 0/1 indicator column per genre.
# The genre columns are derived from the data rather than from a hard-coded list.
MovieGenres = pd.concat([movies['MovieId'], movies['Genres'].str.get_dummies(sep='|')], axis=1)

# # Ratings:
# # UserID::MovieID::Rating::Timestamp
# ratingscol = ['UserID', 'MovieID', 'Rating', 'Timestamp']
# ratings = pd.read_csv('./ml-10M100K/ratings.dat', sep='::', names=ratingscol, engine='python')

# # Tags:
# # UserID::MovieID::Tag::Timestamp
# tagscol = ['UserID', 'MovieID', 'Tag', 'Timestamp']
# tags = pd.read_csv('./ml-10M100K/tags.dat', sep='::', names=tagscol, engine='python')
# tags.head()



IOError: [Errno 2] No such file or directory: './ml-10M100K/movies.dat'
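The cell above only loads data, so here is a minimal sketch of what an item-based nearest-neighbors predictor could look like. It assumes cosine similarity between item columns and a small dense user x item matrix in which 0 means "unrated". Both are illustrative choices, and the function names are hypothetical.

```python
import numpy as np

def item_cosine_similarity(R):
    """Cosine similarity between item columns of a user x item matrix (0 = unrated)."""
    norms = np.linalg.norm(R, axis=0)
    norms[norms == 0] = 1.0  # avoid dividing by zero for items with no ratings
    sim = (R.T @ R) / np.outer(norms, norms)
    np.fill_diagonal(sim, 0.0)  # an item is not its own neighbor
    return sim

def predict_item_knn(R, sim, user, item, k=2):
    """Similarity-weighted average of the user's ratings on the k rated items most similar to `item`."""
    rated = np.flatnonzero(R[user])
    rated = rated[rated != item]
    if rated.size == 0:
        return np.nan
    neighbors = rated[np.argsort(sim[item, rated])[::-1][:k]]
    weights = sim[item, neighbors]
    if weights.sum() == 0:
        return np.nan
    return float(weights @ R[user, neighbors] / weights.sum())

# Toy ratings: rows are users, columns are items
R = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [1.0, 0.0, 5.0, 4.0],
    [0.0, 1.0, 4.0, 5.0],
])
sim = item_cosine_similarity(R)
pred = predict_item_knn(R, sim, user=0, item=2, k=2)
```

In the toy matrix, item 2 is most similar to item 3, which user 0 rated low, so the prediction for user 0 on item 2 is pulled toward the low end of the scale.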

# 2. Model-Based CF Algorithm
Matrix Factorization
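A minimal sketch of matrix factorization trained with stochastic gradient descent is given here, assuming ratings arrive as `(user_index, item_index, rating)` triples with zero-based indices. The function names and every hyperparameter value (`n_factors`, `lr`, `reg`, `n_epochs`) are placeholder assumptions, not tuned settings.

```python
import numpy as np

def factorize(triples, n_users, n_items, n_factors=2, lr=0.02, reg=0.05, n_epochs=300, seed=0):
    """Learn user factors P and item factors Q so that P[u] @ Q[i] approximates each observed rating."""
    rng = np.random.default_rng(seed)
    P = rng.normal(scale=0.1, size=(n_users, n_factors))
    Q = rng.normal(scale=0.1, size=(n_items, n_factors))
    for _ in range(n_epochs):
        for u, i, r in triples:
            err = r - P[u] @ Q[i]
            p_old = P[u].copy()  # use the pre-update user vector for the item update
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * p_old - reg * Q[i])
    return P, Q

def rmse(triples, P, Q):
    """Root mean squared error of the factor model on the given triples."""
    return float(np.sqrt(np.mean([(r - P[u] @ Q[i]) ** 2 for u, i, r in triples])))

# Toy observations for illustration only
triples = [(0, 0, 5.0), (0, 1, 4.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 1.0), (2, 2, 5.0)]
P, Q = factorize(triples, n_users=3, n_items=3)
print(rmse(triples, P, Q))
```

The number of latent dimensions (`n_factors`) is the main knob to vary when studying how accuracy changes with model capacity.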

# 3. Evaluation Methods

## (i) Cross Validation Setup
To withhold 10% of user-item pairs, draw a random number uniformly from [0, 1] for each pair and withhold the pairs whose draw is greater than 0.90.
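The holdout procedure can be sketched as below. The function name `holdout_split`, the seed, and the toy list of pairs are assumptions for illustration.

```python
import numpy as np

def holdout_split(pairs, test_fraction=0.10, seed=0):
    """Draw one uniform number per pair and withhold pairs whose draw exceeds 1 - test_fraction."""
    rng = np.random.default_rng(seed)
    draws = rng.random(len(pairs))
    is_test = draws > 1.0 - test_fraction
    train = [p for p, held in zip(pairs, is_test) if not held]
    test = [p for p, held in zip(pairs, is_test) if held]
    return train, test

# Toy (user, item) pairs for illustration only
pairs = [(u, i) for u in range(100) for i in range(20)]
train, test = holdout_split(pairs)
print(len(train), len(test))
```

Because each pair is held out independently, the test set is only approximately 10% of the pairs, not exactly 10%.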


## (ii) Accuracy on Training/Test Data
- Percent correct / hit rate: great for KPI correlation, but it needs a baseline for model accuracy; otherwise it is not helpful
- Classification and ROC-based metrics (precision, recall, F1-score, AUC, i.e. area under the ROC curve): use these when building and comparing models
- ROC / percent-correct / hit-rate metrics: use these when communicating to key stakeholders
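As a rough sketch of the metrics listed above, the snippet below computes precision, recall, and F1 for one user's recommendation list, plus a hit rate across users. It assumes a "hit" means at least one relevant item appears in a user's recommendations. That definition, the function names, and the toy data are assumptions.

```python
def precision_recall_f1(recommended, relevant):
    """Precision, recall, and F1 for one user's recommended list against that user's relevant items."""
    recommended, relevant = set(recommended), set(relevant)
    hits = len(recommended & relevant)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

def hit_rate(recs_by_user, relevant_by_user):
    """Fraction of users with at least one relevant item among their recommendations."""
    hits = sum(
        1 for user, recs in recs_by_user.items()
        if set(recs) & set(relevant_by_user.get(user, ()))
    )
    return hits / len(recs_by_user)

# Toy data for illustration only
recs = {"u1": [1, 2, 3, 4], "u2": [5, 6, 7, 8]}
relevant = {"u1": [2, 4, 9], "u2": [1]}
p, r, f1 = precision_recall_f1(recs["u1"], relevant["u1"])
overall_hit_rate = hit_rate(recs, relevant)
```

To make the hit rate interpretable, the same function can be run on a simple baseline recommender (for example, most-popular items) and the two numbers compared.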

## (iii) Coverage on training and test data
1. Define, using accuracy metrics, what a good recommendation is
2. Then compute:
    - User coverage: the fraction of users for whom at least k items can be recommended well
    - Item coverage: the fraction of items that can be recommended well to at least k users
    - Catalog coverage: the fraction of items that are in the top-k for at least 1 user
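The three coverage definitions can be sketched as follows, assuming the "good" recommendations per user have already been determined by whatever accuracy criterion was chosen in step 1. The function names and toy data are hypothetical.

```python
def user_coverage(good_recs_by_user, k):
    """Fraction of users who have at least k good recommendations."""
    covered = sum(len(set(items)) >= k for items in good_recs_by_user.values())
    return covered / len(good_recs_by_user)

def item_coverage(good_recs_by_user, all_items, k):
    """Fraction of catalog items that are a good recommendation for at least k users."""
    counts = {item: 0 for item in all_items}
    for items in good_recs_by_user.values():
        for item in set(items):
            if item in counts:
                counts[item] += 1
    return sum(c >= k for c in counts.values()) / len(counts)

def catalog_coverage(top_k_by_user, all_items):
    """Fraction of catalog items that appear in at least one user's top-k list."""
    recommended = set().union(*top_k_by_user.values())
    return len(recommended & set(all_items)) / len(all_items)

# Toy data for illustration only
all_items = ["a", "b", "c", "d"]
good = {"u1": ["a", "b"], "u2": ["a"], "u3": []}
print(user_coverage(good, k=1), item_coverage(good, all_items, k=2), catalog_coverage(good, all_items))
```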
    
What are the tradeoffs between accuracy and coverage?

User bias + item bias + mean rating = baseline 
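One simple way to estimate such a baseline is sketched below: take the global mean, then each item's average deviation from it, then each user's average remaining deviation. This estimation order, the function names, and the toy data are assumptions; other estimators (for example, regularized ones) are possible.

```python
import pandas as pd

def fit_baseline(ratings):
    """Estimate the global mean, per-item bias, and per-user bias from observed ratings."""
    mu = ratings["Rating"].mean()
    item_bias = (ratings["Rating"] - mu).groupby(ratings["MovieID"]).mean()
    residual = ratings["Rating"] - mu - ratings["MovieID"].map(item_bias)
    user_bias = residual.groupby(ratings["UserID"]).mean()
    return mu, user_bias, item_bias

def predict_baseline(mu, user_bias, item_bias, user, item):
    """Baseline = mean rating + user bias + item bias; unseen users or items get a bias of 0."""
    return mu + user_bias.get(user, 0.0) + item_bias.get(item, 0.0)

# Toy data for illustration only
toy = pd.DataFrame({
    "UserID":  [1, 1, 2, 2],
    "MovieID": [10, 20, 10, 20],
    "Rating":  [5.0, 3.0, 4.0, 2.0],
})
mu, user_bias, item_bias = fit_baseline(toy)
print(predict_baseline(mu, user_bias, item_bias, user=1, item=10))
```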

How do your evaluation metrics change as a function of parameters such as neighborhood size and the number of latent dimensions?

_Personal Note:_ In general, when the neighborhood size K is small, we force the model to be "more blind" to the overall distribution, so a small K has low bias but high variance. A larger K averages more neighbors in each prediction and is more resilient to outliers, which lowers variance but increases bias.

### Model Size Variation
How does overall accuracy change when you systematically sample your data from a small to large size? How does runtime scale? 

a) Error vs Sample Size
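A possible harness for this experiment is sketched below: for each sample size it fits a model on a random subset of the training triples, then records test RMSE and fit time. The function name, the `fit`/`predict` callables, and the global-mean stand-in model are all hypothetical, and the data is randomly generated for illustration.

```python
import time
import numpy as np

def error_vs_sample_size(train, test, sizes, fit, predict, seed=0):
    """For each sample size, fit on a random subset of `train` and record test RMSE and fit time."""
    rng = np.random.default_rng(seed)
    results = []
    for n in sizes:
        idx = rng.choice(len(train), size=n, replace=False)
        sample = [train[j] for j in idx]
        start = time.perf_counter()
        model = fit(sample)
        elapsed = time.perf_counter() - start
        errors = [(r - predict(model, u, i)) ** 2 for u, i, r in test]
        results.append({"n": n, "rmse": float(np.sqrt(np.mean(errors))), "fit_seconds": elapsed})
    return results

def fit_mean(sample):
    """Stand-in model: the global mean rating of the sample."""
    return float(np.mean([r for _, _, r in sample]))

def predict_mean(model, user, item):
    """Stand-in prediction: the same global mean for every user and item."""
    return model

# Randomly generated (user, item, rating) triples for illustration only
rng = np.random.default_rng(1)
data = [(int(u), int(i), float(r)) for u, i, r in zip(
    rng.integers(0, 50, 1000), rng.integers(0, 20, 1000), rng.integers(1, 6, 1000))]
curve = error_vs_sample_size(data[:800], data[800:], sizes=[50, 200, 800],
                             fit=fit_mean, predict=predict_mean)
```

Swapping `fit_mean` and `predict_mean` for a real model lets the same loop produce both the error-versus-size curve and the runtime-versus-size curve.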