# Capstone Project Technical Report

By Kevin

The title of this project is "What should I watch next?" with the subtitle "(mis)Adventures in building a Recommendation Engine." Let's get started with the dataset.

https://grouplens.org/datasets/movielens/
    
    MovieLens 20M dataset.
    "Stable benchmark dataset. 20 million ratings and 465,000 tag applications applied to 27,000 movies by 138,000 users. Includes tag genome data with 12 million relevance scores across 1,100 tags. Released 4/2015; updated 10/2016 to update links.csv and add tag genome data."
    
It's the first link at the time of this writing (9/11/17). It's big (about 800 MB unzipped), so I'm not going to upload it to GitHub. The code works (your mileage may vary), so whoever is reading this can just unzip the files into the folder where this notebook is located.
    

In [None]:
# more will be needed later
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## So we have 3 main csv files (there are others, but they are less important at this time).
### genome-scores.csv
This is in long format, so pivot it so that each row is an individual movie with 1,128 columns, one per tag. Each column holds that tag's relevance score for the movie on a 0-1 scale. This is mostly how the content filtering will be done.

In [None]:
genome = pd.read_csv('genome-scores.csv')
genome = genome.pivot(index='movieId', columns='tagId', values='relevance')

### ratings.csv
This is the ratings matrix. 138,493 users rated 26,744 movies 20,000,263 times. It's in long format, where each row is one rating. I don't recommend pivoting this unless you have a lot of RAM (there also isn't a reason to).

In [None]:
ratings = pd.read_csv('ratings.csv')

### movies.csv
This has the movies: the movieId, title, and genre information.

In [None]:
movies = pd.read_csv('movies.csv')

## We are missing data

Specifically, genome data. There are only 10,381 movies in the genome, but there are ~27,000 movies that are rated. Getting that genome data would be difficult, so we are just going to drop anything that doesn't tie into the genome.

    (Note: the movies we are dropping only have 1-2 ratings each in the ratings dataframe, so after dropping them we will still have close to 20M ratings. Also, missing genome data for these movies wouldn't matter for collaborative filtering, but for simplicity (and my sanity) we are just going to drop them so we can combine content and collaborative filtering.)
    
The fastest way to do this is to:
1. Group the genome dataframe by movieId (if you haven't pivoted yet)
2. Left join it with the ratings dataframe: ratings (left) and genome (right)
3. Drop the null rows

I did this on my laptop without problems in about 30 seconds (8 GB RAM, 4 cores at 2.7 GHz).

In [None]:
# I DID NOT PIVOT THIS GENOME BEFORE I DID THIS. so we'll just read it in again
new_df = pd.read_csv('genome-scores.csv').groupby('movieId')[['movieId']].count()
new_df.columns = ['dropmelater']
new_df.reset_index(inplace=True)

#now join them
ratings_updated = ratings.join(new_df.set_index('movieId'), on='movieId', how='left', rsuffix='_drop_me_too')
ratings_updated.dropna(axis=0, inplace=True)
ratings_updated.drop('dropmelater', axis=1, inplace=True)

The end result is 19,800,443 rows in the ratings dataframe. It helps to save this, too.

In [None]:
ratings_updated.to_csv('ratings_updated.csv', index=False)

### Now to fix the movies Dataframe

The genres column is a STRINGED up mess. So let's separate them into dummy variables.

In [None]:
# read into movies_updated, since that is the name the rest of this cell uses
movies_updated = pd.read_csv('movies.csv')

### https://stackoverflow.com/questions/18889588/create-dummies-from-column-with-multiple-values-in-pandas

## This splits/dummies the strings in the genre column
dummies = pd.get_dummies(movies_updated['genres'])

atom_col = [c for c in dummies.columns if '|' not in c]

for col in atom_col:
    movies_updated[col] = dummies[[c for c in dummies.columns if col in c]].sum(axis=1)
    
movies_updated.drop('genres',axis=1, inplace=True)

There was one movie with "no genres listed" so I went to IMDB and manually entered the genre info.
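As an aside, pandas can also split a delimited column directly. The sketch below is not the method used above. It assumes a tiny made-up frame (`toy_movies`, hypothetical) in the same `genres` format, uses `Series.str.get_dummies`, and shows one way to find the rows that still say "(no genres listed)".

```python
import pandas as pd

# Hypothetical three-row frame in the same format as movies.csv
toy_movies = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['Movie A', 'Movie B', 'Movie C'],
    'genres': ['Comedy|Romance', 'Horror', '(no genres listed)'],
})

# One 0/1 column per genre, splitting on the '|' separator
genre_dummies = toy_movies['genres'].str.get_dummies(sep='|')
toy_updated = pd.concat([toy_movies.drop('genres', axis=1), genre_dummies], axis=1)

# Rows that still need a genre entered by hand
missing = toy_updated.loc[toy_updated['(no genres listed)'] == 1, 'movieId'].tolist()
print(missing)
```

On the real data, the same filter would point at the movie that needed a manual fix.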

In [None]:
# don't forget to save
movies_updated.to_csv('movies_updated.csv', index=False)

## When reading this into pandas again, the encoding changed. Don't ask me why; anyway, this fixes it.
movies = pd.read_csv('movies_updated.csv', encoding = "ISO-8859-1")

## If you are still following along, make sure the movies and ratings dataframes are the updated ones we just made

### Remember how I said there were 3 csv files?

I lied. We are going to make a 4th 'users' dataframe and save that to a .csv



In [None]:
users = ratings_updated.groupby('userId')[['userId']].count()
users.columns = ['rated_count']
users = users.join(ratings_updated.groupby('userId')[['rating']].mean())
users.sort_values('rated_count', inplace=True)
users.columns = ['rated_count', 'avg_rating']
users.reset_index(inplace=True)

# And save that to a CSV
users.to_csv('users.csv', index=False)

## Recap

We just preprocessed all of our data for modeling. We have 4 dataframes that we are constantly querying (perfectly set up for SQL).

1. Ratings
2. Genome
3. Movies
4. Users


I'm aware there are no outputs in the notebook. If you really want to see what they look like, you can find them here:
https://github.com/kevinperlas/my_workspace/blob/master/MovieLense20M/MovieLens20M_EDA.ipynb


Let's get modeling!

# Content Filtering

Our data is already set up for modeling. We have a genome with 1000+ predictors. It's just a matter of putting the right data together to fit a model.

## Evaluation
The idea is to see if we can make a model that could predict the "next" thing a user would like to see. To evaluate that model in an offline setting, we will take the latest movie a user rated as the test set, and all the previous movies as the training set. 
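To make that split concrete, here's a minimal sketch on a made-up ratings table (`toy_ratings` is hypothetical, not part of the dataset). It assumes the same long format as ratings.csv and simply holds out each user's most recent rating.

```python
import pandas as pd

# Hypothetical ratings in the same long format as ratings.csv
toy_ratings = pd.DataFrame({
    'userId':    [1, 1, 1, 2, 2],
    'movieId':   [10, 20, 30, 10, 40],
    'rating':    [4.0, 3.5, 5.0, 2.0, 4.5],
    'timestamp': [100, 300, 200, 50, 60],
})

# Sort chronologically, then take each user's last row as the test set
ordered = toy_ratings.sort_values(['userId', 'timestamp'])
test = ordered.groupby('userId').tail(1)

# Everything earlier is the training set
train = ordered.drop(test.index)

print(test[['userId', 'movieId']].values.tolist())
```

The function further down does the same thing one user at a time, which keeps memory use low.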

Over a bunch of users, we can compute the RMSE.

We have 130,000+ users and fitting that many models is going to take a long time. Luckily, we already sorted our users dataframe by how often they rated movies so we can take a subset of that and get an equal distribution of active users.


In [None]:
eval_list = []
for num in range(0, 138493, 100):
    eval_list.append(num)
evaluation_users = users.loc[eval_list]

Since we are running regression models, we need to round the predicted values to the nearest 0.5.

In [None]:
def myround(x, prec=2, base=0.5):
    return round(base * round(float(x)/base),prec)

Now we can evaluate a model over a subset of our userbase. It takes some joining of our dataframes to fit the model, so the following function will take a 'userId', get the appropriate data about the user, fit the model, and predict the latest rating.

In [None]:
import numpy as np
from sklearn.linear_model import LinearRegression

def get_yhat(n): #Linear Regression, no Regularization
    
    # first step: get the movies that the user rated
    user_n_df = ratings.loc[ratings['userId'] == n]
    user_n_df = user_n_df.sort_values('timestamp')
    
    # Now join the movies to the genome
    user_n_df = user_n_df.join(genome, on='movieId', how='left')
    
    # prep the df for modeling
    user_n_df = user_n_df.set_index('movieId')
    user_n_df.drop(['userId', 'timestamp'], axis=1, inplace=True)
                   
    #train test split
    X= user_n_df.drop('rating', axis=1)
    y= user_n_df[['rating']]
                   
    y_test = np.array(y.iloc[-1:]['rating'])
    y_train = np.array(y.iloc[:-1]['rating'])
    X_test = X.iloc[-1:]
    X_train = X.iloc[:-1]
    
    #fit model
    LR = LinearRegression()
    LR.fit(X_train, y_train)
    
    #fix y_hat
    y_hat = myround(LR.predict(X_test))
    
    if y_hat > 5:
        y_hat = 5
    
    if y_hat < 0:
        y_hat = 0
    
    #return adjusted prediction and y_test
    return y_hat, y_test[0]

The next cell builds a list of the predictions (y_hat) and a list of the actual ratings (y_true).

Note: I kept the index from the users dataframe in my evaluation_users dataframe. It's every 100th user starting at index 0, so it's 1,385 users, or 1% of the users.

In [None]:
y_hat_list = []
y_true_list = []
for num in range(0, 138500, 100): #it's 138500 so it will run the last one
    y_hat, y_true = get_yhat(evaluation_users.loc[num]['userId'])
    y_hat_list.append(y_hat)
    y_true_list.append(y_true)
    print(y_hat, y_true)
    
#my computer finished this in 3 min 45 sec

This simple Linear Regression was pretty fast and had an RMSE of: 1.0128239108395056
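The RMSE calculation itself isn't shown in this notebook, so here's a minimal sketch of it. The `rmse` helper and the toy lists are made up for illustration. In practice you would pass in `y_true_list` and `y_hat_list` from the cell above.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up values standing in for y_true_list and y_hat_list
toy_true = [4.0, 3.0, 5.0, 2.5]
toy_hat = [3.5, 3.0, 4.0, 3.0]

print(rmse(toy_true, toy_hat))
```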

If an app has to fit and predict ratings when a user clicks "watch next", speed and adaptability also become very important (along with accuracy).

A Bayesian Ridge Regression is the option that makes the most sense, since it 'learns' what regularization coefficient to use from the priors (i.e. the ratings history) of the specific user. It runs with the code above. Just instantiate `BayesianRidge` instead of `LinearRegression`.

This had an RMSE of: 0.9086428170370735

An interesting point about using BayesianRidge is that the RMSE got smaller when a user had rated more things (this makes sense intuitively since it has more priors). Segmenting the evaluation_users DF into thirds (still sorted by movies rated), the RMSE is as follows:

    First:  0.9880235200593537
    Second: 0.9279931931832044
    Third:  0.7997564564355761

The Linear Regression did not follow this pattern and stayed at ~1 throughout the same sub-subsets. 
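Here's a rough sketch of how a per-third breakdown like that could be computed. It assumes the prediction lists are still in the same order as the sorted evaluation users. The helper names and toy numbers are made up.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def rmse_by_segment(y_true, y_pred, n_segments=3):
    """RMSE for consecutive, near-equal slices of the evaluation lists.

    Assumes there are at least n_segments values.
    """
    bounds = [len(y_true) * k // n_segments for k in range(n_segments + 1)]
    return [rmse(y_true[a:b], y_pred[a:b]) for a, b in zip(bounds, bounds[1:])]

# Made-up values, ordered from least active to most active user
toy_true = [4.0, 3.0, 5.0, 2.0, 4.0, 3.0]
toy_hat = [3.0, 3.0, 4.0, 2.0, 4.0, 3.5]

print(rmse_by_segment(toy_true, toy_hat))
```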

The only downside to `BayesianRidge` is the computation time and resource requirement. The evaluation took 3x longer. But in the pipeline (coming soon), a user with a very active history took about 3 seconds to fit and predict, using a much larger test set in order to find the next highest rating they would like.
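To show what swapping the estimator looks like, here's a small sketch on synthetic data. The data is random and only stands in for one user's history; it is not the genome. After fitting, scikit-learn's `BayesianRidge` exposes `alpha_` and `lambda_`, the noise and weight precisions it estimated from the data.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge, LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for one user's history: 40 rated movies, 5 "tag" features
X_train = rng.random((40, 5))
true_weights = np.array([2.0, 0.0, 1.0, 0.0, 0.5])
y_train = 1.0 + X_train @ true_weights + rng.normal(0, 0.1, size=40)
X_test = rng.random((1, 5))

# Same fit/predict calls, different estimator
for model in (LinearRegression(), BayesianRidge()):
    model.fit(X_train, y_train)
    print(type(model).__name__, model.predict(X_test))

# The precisions BayesianRidge estimated from this "user's" data
br = BayesianRidge().fit(X_train, y_train)
print(br.alpha_, br.lambda_)
```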

## So how do we pipeline this?

The idea is that a user will choose a genre and the engine picks a movie for them. So this function takes a userId and a genre and outputs the 5 movies with the highest predicted ratings.

It's in "dev mode" right now, looking at the top 5. For the app, the bottom line will change to:
```
return movies_subset.iloc[[0]].merge(movies, left_on='movieId', right_on='movieId', how='left')
```

In [None]:
from sklearn.linear_model import BayesianRidge

def get_recommendations(user_number, genre='Romance'):

    #get user data
    user_profile = ratings.loc[ratings['userId'] == user_number]
    
    #subset the movies by genre, take out the movies the user has seen
    movies_subset =  movies.loc[movies[genre] == 1][['movieId']]
    movies_subset = movies_subset.merge(user_profile, how='left', left_on='movieId', right_on='movieId')
    movies_subset = movies_subset.loc[movies_subset['userId'].isnull()][['movieId']]
    
    ## now join with the genome
    movies_subset = movies_subset.join(genome, on='movieId', how='left')
    movies_subset.set_index('movieId', inplace=True)
    
    ## join the user profile with the genome and prepare for model fit
    user_profile = user_profile.join(genome, on='movieId', how='left')
    user_profile.set_index('movieId', inplace=True)
    user_profile.drop(['userId', 'timestamp'], axis=1, inplace=True)
    
    #train the model
    BR = BayesianRidge()
    BR.fit(user_profile.drop('rating', axis=1), user_profile['rating'])
    
    #predict and get the y_hat
    y_hat = BR.predict(movies_subset)
    movies_subset['yhat'] = y_hat
    movies_subset= movies_subset[['yhat']]
    movies_subset.sort_values('yhat', ascending=False, inplace=True)
    movies_subset.reset_index(inplace=True)
    
    #return the top 5, with the movie info joined back on
    return movies_subset.head().merge(movies, left_on='movieId', right_on='movieId', how='left')
    

## Recap

This is the content-filtering model. It works! But its limitations are those of content filtering: once a user builds a certain profile of the movies they like, the model doesn't really leave that space.

For example, userId 3584 has rated a lot of specific comedy movies favorably. So when I ran that user through the pipeline with "Romance" as the genre, the output was rom-coms that make too much sense (There's Something About Mary, The 40-Year-Old Virgin, The Naked Gun (it counts)). I ran the same user through with "Horror" and horror comedies came up (Scary Movie, Shaun of the Dead, Club Dread).

It's not as serendipitous as I'd like it to be, so now we need to add collaborative filtering.

If you really want to see outputs, check out:
https://github.com/kevinperlas/my_workspace/blob/master/MovieLense20M/Content_Modeling.ipynb

# Collaborative Filtering

Things are about to get expensive, both mentally and computationally. You are going to need AT LEAST 10 GB of RAM. While fitting and evaluating this model, I was running on machines with between 64 and 256 GB of RAM.

And you're also going to need Surprise, a Python package for collaborative filtering: https://github.com/NicolasHug/Surprise

We are going to start by evaluating our model using SVD with the default settings from the Surprise package.
That is...

    100 latent factors for the dimensionality reduction
    20 epochs of stochastic gradient descent to reduce the error
    predictions built from user and item factor vectors plus bias terms (SVD itself does not use a similarity measure)
    
### Evaluation

In [None]:
from surprise import SVD
from surprise import Dataset
from surprise import dump
from surprise import evaluate, print_perf
from surprise.dataset import Reader
from collections import defaultdict

In [None]:
#### WARNING: This took over 3 hours to run on an m4.4xlarge - 16 cores, 64 GB of RAM

# Surprise reads directly from the CSV
filepath = 'ratings_updated.csv'
reader = Reader(line_format='user item rating timestamp', sep=',', skip_lines=1)

# Dataset is a python object specific to Surprise
data = Dataset.load_from_file(filepath, reader=reader)
data.split(n_folds=5)

# SVD with the default settings
algo = SVD()

# Note: evaluate(), print_perf() and data.split() were removed in later Surprise
# releases, which use surprise.model_selection.cross_validate instead
perf = evaluate(algo, data, measures=['RMSE', 'MAE'])
print_perf(perf)

The output was:
```
        Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    
RMSE    0.7851  0.7847  0.7844  0.7846  0.7847  0.7847  
MAE     0.5974  0.5970  0.5968  0.5970  0.5970  0.5970  
```

The results are pretty good: better than the content filtering, and consistent across all 5 folds.



### About here is where distributed computing would help 
Now we need to train the model on the whole set.

Pickle the model and run predictions on chunks of the dataset.

In [None]:
### WARNING: NEED about 16 GB of RAM to run in about 30 min
data = Dataset.load_from_file(filepath, reader=reader)
trainset = data.build_full_trainset()

# fit on the full trainset (this method is called algo.fit() in newer Surprise versions)
algo.train(trainset)

# then pickle the model using dump from Surprise
dump.dump('CF_full_fit.pkl', algo=algo)

#### RUNNING PREDICTIONS ON THE FULL SET WILL CRASH AN M4.16XLARGE INSTANCE
after about 7 hours and $21

The workaround is to run the predictions for 1% of the users at a time. A system with 16 GB of RAM can predict a chunk in about 4 minutes, so this will take ~400 minutes on a single computer/instance once the data is preprocessed.

So, here's some code that splits the ratings matrix by users into 1% chunks, or 100 .csv files for Surprise to read.

In [None]:
### Surprise can work with timestamps but dropping it made things run faster

def get_ratings_subset(block):
    
    block.set_index('userId', inplace=True)
    block = block[['rated_count']]
    
    new_df = ratings.join(block, on='userId', how='left')
    new_df.dropna(inplace=True)
    new_df.drop('rated_count', axis=1, inplace=True)
    new_df.drop('timestamp', axis=1, inplace=True) 
    return new_df

In [None]:
for i in range(100):
    
    temp_list = []
    for num in range(i, 138493, 100):
        temp_list.append(num)
    
    df = get_ratings_subset(users.loc[temp_list])
    
    filepath = 'Batches/block' + str(i) + '.csv' # same folder the prediction loop reads from
    df.to_csv(filepath, index=False)

The model will predict ALL the ratings for EVERY USER and EVERY MOVIE. But what we are interested in is the top predictions.

After deployment, the full set of predictions would be important for evaluating the RMSE in a live setting. I'm running out of hard drive space on my EC2 instance, so I'm just saving the top 10.

Users have rated ~2,000 movies at most, so there are predictions for the other ~8,000. These predictions as a whole can easily be saved in a pickled file for later evaluation and compiled to save hard drive space.
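For a sense of scale, here's some back-of-the-envelope arithmetic using the figures quoted earlier in this report. It's a rough upper estimate: it assumes no users were lost when the movies without genome data were dropped, and that every block sees every movie.

```python
# Figures quoted earlier in this report; the user count assumes no user
# was lost when movies without genome data were dropped
n_users = 138_493
n_movies = 10_381
n_known_ratings = 19_800_443

# Every (user, movie) pair that does not already have a rating
n_predictions = n_users * n_movies - n_known_ratings
per_block = n_predictions / 100

print(f"{n_predictions:,} predictions in total")
print(f"~{per_block:,.0f} predictions per 1% block")
```

That is roughly 1.4 billion predictions in total, which goes a long way toward explaining why predicting everything in one go ran out of memory.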

In [None]:
def get_top_n(predictions, n=10):
    '''Return the top-N recommendation for each user from a set of predictions.

    Args:
        predictions(list of Prediction objects): The list of predictions, as
            returned by the test method of an algorithm.
        n(int): The number of recommendation to output for each user. Default
            is 10.

    Returns:
    A dict where keys are user (raw) ids and values are lists of tuples:
        [(raw item id, rating estimation), ...] of size n.
    '''

    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the k highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n

In [None]:
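# This line reads the fitted model from the .pkl
__, loaded_algo = dump.load('CF_full_fit.pkl')

# The block files have no timestamp column, so they need their own Reader
block_reader = Reader(line_format='user item rating', sep=',', skip_lines=1)

for iteration in range(100):
    #get the data
    filename = 'Batches/' + 'block' + str(iteration) + '.csv' #I saved it in a folder called 'Batches/'
    data = Dataset.load_from_file(filename, reader=block_reader)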
    
    #temporary step to build the testset
    _ = data.build_full_trainset()
    
    #this is the actual test set: every (user, movie) pair without a rating
    testset = _.build_anti_testset()
    
    # Predictions are happening now
    predictions = loaded_algo.test(testset)
    
    # filters predictions to the top 10
    print(iteration, ": Getting top 10")
    top_n = get_top_n(predictions, n=10)
    
    #convert to dataframe and save to CSV
    ### THIS IS VERY MESSY AND NEEDS CLEANING
    df2 = pd.DataFrame.from_dict(top_n, orient='index')
    outputname = 'Batches/' +'output_' + 'block' + str(iteration) + '.csv'
    df2.to_csv(outputname)


# That's it!

Now all that's left is cleaning the output from collaborative filtering and some logic steps to determine which prediction (collaborative or content) takes priority. With that, this is a hybrid model of collaborative and content filtering.

If you really want to see the outputs for these cells:

Evaluation: https://github.com/kevinperlas/my_workspace/blob/master/MovieLense20M/Suprise_exploration.ipynb

Splitting the test set into chunks: https://github.com/kevinperlas/my_workspace/blob/master/MovieLense20M/Collaborative%20test%20sets.ipynb

Fitting, Predicting, pickle :  https://github.com/kevinperlas/my_workspace/blob/master/MovieLense20M/CF_100_testsets.ipynb

---
___


### Everything that follows is still a WIP and will probably have notes that I've set up for myself to read. 

This block of code cleans the output of collaborative filtering so it can be used in the Recommender engine notebook. The output is a new CSV of the top 10 recommendations using Collaborative-Filtering


In [None]:

# Read every csv from the chunks and combine them into one big DF
block_frames = []
for n in range(100):
    filename = 'output_block' + str(n) + '.csv'
    block_frames.append(pd.read_csv(filename))
combinedDF = pd.concat(block_frames, ignore_index=True)

#rename the columns
combinedDF.columns = ['userId', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9']

#the ratings look like tuples but they are strings now. let's just grab the movieId since it's already in top 10 order
## we aren't going to actually use this function but it helps to remember just what I'm actually trying to do.
### for me anyways

def just_movieID(long_string):
    return int(long_string.split(',')[0][2:-1])

#This actually splits the string (all ten columns, '0' through '9')
for n in range(10):
    combinedDF[str(n)] = combinedDF[str(n)].apply(lambda x: int(x.split(',')[0][2:-1]))

#and save it to a CSV
combinedDF.to_csv('CF_top10.csv', index=False)
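The slicing above depends on the exact layout of the string. As a sketch of an alternative, `ast.literal_eval` from the standard library can turn the text back into a real tuple. This assumes each cell looks like `"('1234', 4.5)"`. The `parse_prediction` helper is made up for illustration.

```python
import ast

def parse_prediction(cell):
    """Turn a string such as "('1234', 4.5)" back into (movieId, estimate)."""
    raw_id, estimate = ast.literal_eval(cell)
    return int(raw_id), float(estimate)

print(parse_prediction("('1234', 4.5)"))
```

This also keeps the estimated rating around, in case it's needed when combining with the content model.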

# Hybrid Combiner

This is where things just turn into a logic puzzle, and there really isn't a good/easy way to test its accuracy in an offline setting.

So we have two different sets of predicted ratings on movies that may or may not be the same. The content model will provide things that are in line with what the user has already seen. The collaborative results have greater potential to be serendipitous. The collaborative model also scored better on RMSE (offline, which is important to remember, since optimizing for an offline RMSE will lead to overfitting).

_This block is reserved for the Hybrid combiner._
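Until that block is written, here's one hypothetical way the two lists could be merged. This is purely illustrative and not the final logic: it interleaves the two ranked lists, collaborative first, and skips duplicates. The function name and its parameters are made up.

```python
from itertools import zip_longest

def combine_recommendations(collab_ids, content_ids, n=5):
    """Interleave two ranked lists of movieIds, collaborative first, skipping duplicates."""
    combined, seen = [], set()
    for pair in zip_longest(collab_ids, content_ids):
        for movie_id in pair:
            # zip_longest pads the shorter list with None
            if movie_id is not None and movie_id not in seen:
                seen.add(movie_id)
                combined.append(movie_id)
    return combined[:n]

# Made-up movieIds standing in for the two models' outputs
print(combine_recommendations([10, 20, 30], [20, 40, 50, 60]))
```

Putting the collaborative pick first favors serendipity. Flipping the argument order would favor the "safe" content picks instead.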

### Test out how fast a single userId can get predictions

Keep in mind that even if the predictions could be made in real time, the model we trained will be outdated. It could still be useful/relevant... plus we have the content filtering, which can be trained in real time to keep the collaborative results in line.