# Movie Recommendation System: Memory-Based Collaborative Filtering Model

**Background**:

An accurate, personalized recommendation system can improve business and sales and build customer satisfaction.

**Types of Recommendation Engines**:
1. **Popularity Model (see "popularity_movie_recommendation" notebook)**
2. **Recommendation algorithms**
    - **Content-based filtering (see "content_based_movie_recommendation" notebook)**
    - **Memory-based collaborative filtering (see "memory_based_collaborative_filtering_movie_recommendation" notebook)**
        - **Find similar items based on a similarity or correlation coefficient and take the weighted average of ratings**
        - Idea: if person A likes items 1, 2, 3 and B likes 2, 3, 4 then they have similar interests and A should like item 4 and B should like item 1
        - Based on the past behavior and similarity in preferences, tastes, and choices of two users and not on the context
        - One of the most commonly used algorithms because it needs no information beyond similarity in preferences
        - `Item-item matrix` keeps a record of the pair of items that were rated together
        - **Advantages**:
            - Easy to create because no training or optimization is involved
            - Easy to interpret and explain results
        - **Disadvantages**:
            - Performance degrades when data is sparse
            - Does not scale well to most real-world problems
        - Types of memory-based collaborative filtering algorithms:
            1. **User-user collaborative filtering**
                - Find look-alike customers based on similarity and offer each customer the products their look-alikes have chosen in the past
                - Effective but time and resource consuming
                - Requires computing similarity for every pair of customers
                - Hard to implement without a strong parallelizable system
            2. **Item-item collaborative filtering**
                - Similar to user-user collaborative filtering but finds look-alike items instead of look-alike customers
                - Recommend similar items to customers who have purchased any item from the store
                - Less resource and time consuming
                - The item-item matrix stays fixed over time when the number of products is fixed
         - Types of similarity metrics supported by GraphLab
             1. **Jaccard Similarity**
                 - The number of users who have rated both items A and B divided by the number of users who have rated either A or B
                 - Typically used when the data is a boolean value, like a product being bought or an ad being clicked, instead of a numeric rating
             2. **Cosine Similarity**
                 - Similarity is the cosine of the angle between the rating vectors of items A and B
                 - The closer the vectors, the smaller the angle, and the larger the cosine
             3. **Pearson Similarity**
                 - Uses Pearson coefficient as the similarity between two vectors
                
    - **Model-based collaborative filtering**
        1. **Matrix Factorization**
            1. **Singular Value Decomposition (see "model_based_cf_matrix_factorization_movie_recommendation" notebook)**
            2. Probabilistic Matrix Factorization
            3. Non-negative Matrix Factorization
        2. **Deep learning/neural network (see "model_based_deep_learning_movie_recommendation" notebook)**
                
3. **Using a classifier to make recommendation**
    - Classifiers are parametric solutions that require some parameters of the user and item to be defined first
    - Pros:
        - Incorporates personalization
        - Works even if the user's past history is short or not available
    - Cons:
        - Features might not be available or sufficient to create a good classifier
        - Making a good classifier becomes exponentially more difficult as the number of users and items grows
        
(https://medium.com/@james_aka_yale/the-4-recommendation-engines-that-can-predict-your-movie-tastes-bbec857b8223)
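The three similarity metrics listed above can be illustrated on toy data. The sketch below is only an illustration in plain NumPy, not GraphLab's implementation. The sets `liked_a` and `liked_b` reuse the "A likes 1, 2, 3 and B likes 2, 3, 4" example, while the rating vectors `movie_a` and `movie_b` and all function names are made up for this example.

```python
import numpy as np

# Items liked by persons A and B, as in the example above.
liked_a = {1, 2, 3}
liked_b = {2, 3, 4}

def jaccard(set_x, set_y):
    """Size of the intersection divided by the size of the union."""
    union = set_x | set_y
    return len(set_x & set_y) / float(len(union)) if union else 0.0

def cosine(x, y):
    """Cosine of the angle between two vectors."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def pearson(x, y):
    """Pearson correlation, i.e. cosine similarity of the mean-centered vectors."""
    return cosine(x - x.mean(), y - y.mean())

# Hypothetical ratings given to two movies by the same five users.
movie_a = np.array([5.0, 4.0, 1.0, 2.0, 3.0])
movie_b = np.array([4.0, 5.0, 2.0, 1.0, 3.0])

jaccard_ab = jaccard(liked_a, liked_b)  # 2 shared items out of 4 distinct items
cosine_ab = cosine(movie_a, movie_b)
pearson_ab = pearson(movie_a, movie_b)
print('jaccard=%.3f cosine=%.3f pearson=%.3f' % (jaccard_ab, cosine_ab, pearson_ab))
```

Jaccard only looks at which items were rated, so it ignores the rating values. Cosine and Pearson both use the values, and Pearson additionally removes each vector's mean before comparing.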

**Data**:

We will use a dataset collected from the MovieLens website, an online movie recommender service. The datasets were collected over several periods of time.
Users were selected at random to be included in the data, and every user has rated at least 20 movies.

The data includes:
- 100K ratings (1-5) from 1000 users on 1700 movies
- Each user has rated 20+ movies
- Simple demographic information for the users, such as gender, age, occupation, zip, etc.
- Genre information of movies

(https://grouplens.org/datasets/movielens/10m/)

In [None]:
!pip install numpy==1.10.4

In [1]:
import pandas as pd
import numpy as np
import scipy as sc
import graphlab as gl
import pickle

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('fivethirtyeight')

import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)

This non-commercial license of GraphLab Create for academic use is assigned to ngapui.leung@berkeley.edu and will expire on December 21, 2019.


[INFO] graphlab.cython.cy_server: GraphLab Create v2.1 started. Logging: /tmp/graphlab_server_1548547144.log


# Data

## Users Data

In [2]:
users = pd.read_pickle('users.pickle')
users.head()

Unnamed: 0,user_id,age,sex,occupation,zip_code
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213


## Ratings Data

In [3]:
ratings = pd.read_pickle('ratings.pickle')
ratings.head()

Unnamed: 0,user_id,movie_id,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


## Movies Data

In [4]:
movies = pd.read_pickle('movies.pickle')
movies.head()

Unnamed: 0,movie_id,movie_title,release_date,video_release_date,imdb_url,unknown,action,adventure,animation,childrens,comedy,crime,documentary,drama,fantasy,film_noir,horror,musical,mystery,romance,sci_fi,thriller,war,western,genres
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,animation|childrens|comedy
1,2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,action|adventure|thriller
2,3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,thriller
3,4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,action|comedy|drama
4,5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0,crime|drama|thriller


## Train Data

In [5]:
train_data = pd.read_pickle('train_data.pickle')
train_data_gl = gl.SFrame(train_data)

## Test Data

In [6]:
test_data = pd.read_pickle('test_data.pickle')
test_data_gl = gl.SFrame(test_data)

# Memory-Based Collaborative Filtering Movie Recommendation Model

(https://medium.com/@james_aka_yale/the-4-recommendation-engines-that-can-predict-your-movie-tastes-bbec857b8223)

## Recommender System Using GraphLab

### Item Similarity Using Jaccard Similarity Metric

**Let's train a model using the Jaccard similarity metric.**

In [7]:
item_sim_jaccard = gl.item_similarity_recommender.create(train_data_gl, user_id='user_id', item_id='movie_id', similarity_type='jaccard')
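To give a feel for what an item-similarity recommender does conceptually, here is a minimal NumPy sketch. It is an assumption-laden illustration and not GraphLab's internal algorithm: the toy `interactions` matrix, the helper names, and the choice to score an unseen item by its mean similarity to the user's seen items are all hypothetical.

```python
import numpy as np

# Hypothetical binary interactions: rows are users, columns are items (1 = rated).
interactions = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
])

def jaccard_item_similarity(matrix):
    """Item-item Jaccard similarity from a binary users x items matrix."""
    co_counts = matrix.T.dot(matrix)   # users who rated both items
    item_counts = np.diag(co_counts)   # users who rated each item
    union = item_counts[:, None] + item_counts[None, :] - co_counts
    sim = np.where(union > 0, co_counts / np.maximum(union, 1).astype(float), 0.0)
    np.fill_diagonal(sim, 0.0)         # an item should not recommend itself
    return sim

def recommend(matrix, sim, user_index, k=2):
    """Rank unseen items by their mean similarity to the user's seen items."""
    seen = matrix[user_index] > 0
    scores = sim[:, seen].mean(axis=1)
    scores[seen] = -np.inf             # never recommend an already seen item
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order if np.isfinite(scores[i])]

similarity = jaccard_item_similarity(interactions)
top_items = recommend(interactions, similarity, user_index=0)
print(top_items)
```

The first user has rated items 0 and 1, so the unseen item that shares the most raters with those two is ranked first.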

**Let's make recommendations.**

In [8]:
jaccard_sim_rec = item_sim_jaccard.recommend(users=range(1, 6), k=5)
jaccard_sim_rec.print_rows(num_rows=25)

+---------+----------+-----------------+------+
| user_id | movie_id |      score      | rank |
+---------+----------+-----------------+------+
|    1    |   423    |  0.165949816695 |  1   |
|    1    |   568    |  0.157034008103 |  2   |
|    1    |   202    |  0.153320827102 |  3   |
|    1    |   403    |  0.149106345331 |  4   |
|    1    |   385    |  0.146414006712 |  5   |
|    2    |    7     |  0.16170423306  |  1   |
|    2    |   121    |  0.159973624807 |  2   |
|    2    |   181    |  0.157847735744 |  3   |
|    2    |    50    |  0.156254000389 |  4   |
|    2    |   117    |  0.154636738392 |  5   |
|    3    |   328    |  0.124686544592 |  1   |
|    3    |   313    |  0.101335969838 |  2   |
|    3    |   332    |  0.101279365745 |  3   |
|    3    |    12    |  0.10076934912  |  4   |
|    3    |   245    | 0.0975412225181 |  5   |
|    4    |    56    |  0.16524706568  |  1   |
|    4    |   121    |  0.156088250024 |  2   |
|    4    |   181    |  0.149083150285 |

### Item Similarity Using Cosine Similarity Metric

**Let's train a model using the Cosine similarity metric.**

In [9]:
item_sim_cosine = gl.item_similarity_recommender.create(train_data_gl, user_id='user_id', item_id='movie_id', similarity_type='cosine')

**Let's make recommendations.**

In [10]:
cosine_sim_rec = item_sim_cosine.recommend(users=range(1, 6), k=5)
cosine_sim_rec.print_rows(num_rows=25)

+---------+----------+----------------+------+
| user_id | movie_id |     score      | rank |
+---------+----------+----------------+------+
|    1    |   423    | 0.287442327228 |  1   |
|    1    |   202    | 0.263145389448 |  2   |
|    1    |   385    | 0.238017175716 |  3   |
|    1    |   568    | 0.234459889299 |  4   |
|    1    |   403    | 0.229076276526 |  5   |
|    2    |   181    | 0.309000855455 |  1   |
|    2    |    50    | 0.29930659097  |  2   |
|    2    |   121    | 0.288683969241 |  3   |
|    2    |    7     | 0.283500842177 |  4   |
|    2    |   117    | 0.258024985974 |  5   |
|    3    |   328    | 0.260885329409 |  1   |
|    3    |   313    | 0.220487566157 |  2   |
|    3    |   331    | 0.195539631627 |  3   |
|    3    |    50    | 0.18477635492  |  4   |
|    3    |   332    | 0.181983083487 |  5   |
|    4    |    50    | 0.294683899198 |  1   |
|    4    |   121    | 0.266690645899 |  2   |
|    4    |    56    | 0.258224593742 |  3   |
|    4    |  

### Item Similarity Using Pearson Similarity Metric

**Let's train a model using the Pearson similarity metric.**

In [11]:
item_sim_pearson = gl.item_similarity_recommender.create(train_data_gl, user_id='user_id', item_id='movie_id', similarity_type='pearson')

**Let's make recommendations.**

In [12]:
pearson_sim_rec = item_sim_pearson.recommend(users=range(1, 6), k=5)
pearson_sim_rec.print_rows(num_rows=25)

+---------+----------+----------------+------+
| user_id | movie_id |     score      | rank |
+---------+----------+----------------+------+
|    1    |   286    | 0.807692307692 |  1   |
|    1    |   294    | 0.803643724696 |  2   |
|    1    |   288    | 0.779352226721 |  3   |
|    1    |   300    |  0.7004048583  |  4   |
|    1    |   117    | 0.645748987854 |  5   |
|    2    |    50    |      1.0       |  1   |
|    2    |   181    | 0.886639676113 |  2   |
|    2    |   121    | 0.775303643725 |  3   |
|    2    |   174    | 0.765182186235 |  4   |
|    2    |    98    | 0.706477732794 |  5   |
|    3    |    50    |      1.0       |  1   |
|    3    |   100    | 0.894736842105 |  2   |
|    3    |   286    | 0.807692307692 |  3   |
|    3    |   294    | 0.803643724696 |  4   |
|    3    |    1     | 0.791497975709 |  5   |
|    4    |    50    |      1.0       |  1   |
|    4    |   100    | 0.894736842105 |  2   |
|    4    |   181    | 0.886639676113 |  3   |
|    4    |  

## Recommender System Using scikit-learn

**Because the computing power of my laptop is very limited, we will only take a subset of the data.**

In [41]:
data_sub = ratings.sample(frac=0.3)

**Let's create train and test sets from the subset of data using an 80-20 split.**

In [42]:
from sklearn import cross_validation as cv
train_data_sub, test_data_sub = cv.train_test_split(data_sub, test_size=0.2)

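**Let's convert the train and test data to NumPy arrays. Note that selecting the `user_id`, `movie_id`, and `rating` columns gives one row per rating (a long-format array), not a pivoted users × movies matrix, as the shapes below show.**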

In [43]:
train_data_matrix = train_data_sub.as_matrix(columns=['user_id', 'movie_id', 'rating'])
test_data_matrix = test_data_sub.as_matrix(columns=['user_id', 'movie_id', 'rating'])

print(train_data_matrix.shape)
print(test_data_matrix.shape)

(24000, 3)
(6000, 3)
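The shapes printed above, `(24000, 3)` and `(6000, 3)`, mean that each row is a single (user, movie, rating) record. Similarity-based prediction is usually run on a users × movies matrix instead. The sketch below shows one way such a matrix could be built, assuming IDs are 1-based and contiguous. The `triples` data and the `to_user_item_matrix` helper are hypothetical and are not part of the notebook.

```python
import numpy as np

# Hypothetical (user_id, movie_id, rating) rows with 1-based IDs,
# in the same column layout as the arrays printed above.
triples = np.array([
    [1, 1, 5],
    [1, 3, 3],
    [2, 2, 4],
    [3, 1, 2],
    [3, 3, 1],
])

def to_user_item_matrix(rows, n_users, n_items):
    """Scatter rating triples into a dense users x items matrix (0 = unrated)."""
    matrix = np.zeros((n_users, n_items))
    for user_id, movie_id, rating in rows:
        matrix[int(user_id) - 1, int(movie_id) - 1] = rating
    return matrix

user_item = to_user_item_matrix(triples, n_users=3, n_items=3)
print(user_item.shape)
print(user_item)
```

For train and test matrices to be comparable, both would need the same `n_users` and `n_items`, taken from the full dataset rather than from each split.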


**We use the `pairwise_distances` function from sklearn with `metric='correlation'`, which returns the correlation distance (1 minus the Pearson correlation coefficient) between each pair of rows. Subtracting the result from 1 therefore gives the Pearson correlation. Undefined (NaN) values, which arise when a row has zero variance, are set to 0.**

In [44]:
from sklearn.metrics.pairwise import pairwise_distances

# User similarity matrix
user_correlation = 1 - pairwise_distances(train_data_matrix, metric='correlation')
user_correlation[np.isnan(user_correlation)] = 0
print(user_correlation[:4, :4])

[[ 1.          0.99866643  0.99708255  0.99222253]
 [ 0.99866643  1.          0.99969361  0.98447297]
 [ 0.99708255  0.99969361  1.          0.97982637]
 [ 0.99222253  0.98447297  0.97982637  1.        ]]


In [36]:
# Item similarity matrix
item_correlation = 1 - pairwise_distances(train_data_matrix.T, metric='correlation')
item_correlation[np.isnan(item_correlation)] = 0
print(item_correlation[:4, :4])

[[ 1.          0.00966396 -0.01328655]
 [ 0.00966396  1.         -0.19766291]
 [-0.01328655 -0.19766291  1.        ]]
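The NaN replacement in the cells above matters because a Pearson correlation is undefined for a row with zero variance. The sketch below illustrates this with `np.corrcoef`, which is used here only as a stand-in for `1 - pairwise_distances(..., metric='correlation')`. The toy `ratings_matrix` is hypothetical.

```python
import numpy as np

# Hypothetical users x items ratings; the last user gave every item the same rating.
ratings_matrix = np.array([
    [5.0, 3.0, 1.0],
    [4.0, 2.0, 1.0],
    [3.0, 3.0, 3.0],
])

# The constant row has zero variance, so its correlations are 0/0 = NaN.
with np.errstate(divide='ignore', invalid='ignore'):
    user_corr = np.corrcoef(ratings_matrix)  # each row is treated as one variable

nan_count = int(np.isnan(user_corr).sum())
user_corr[np.isnan(user_corr)] = 0  # treat undefined correlations as "no similarity"
print(nan_count)
print(user_corr)
```

Setting these entries to 0 means a user (or item) with constant ratings contributes nothing to the weighted averages, including through its own diagonal entry.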


**We can now use the similarity matrix to predict the ratings that are missing from the data, and evaluate the recommender by comparing those predictions with the test data. Because some users tend to give high or low ratings to all movies, the user-based prediction works with deviations from each user's mean rating: we subtract each user's mean, take a similarity-weighted average of the deviations, divide by the sum of absolute similarities so that predictions stay on the same 1 to 5 scale as the ratings, and finally add the target user's mean rating back.**
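Written out, the user-based prediction described here is commonly expressed as follows. The notation is introduced for illustration and does not appear in the original notebook: $r_{v,i}$ is user $v$'s rating of item $i$, $\bar{r}_u$ is user $u$'s mean rating, and $s_{u,v}$ is the similarity between users $u$ and $v$.

$$\hat{r}_{u,i} = \bar{r}_u + \frac{\sum_{v} s_{u,v}\,(r_{v,i} - \bar{r}_v)}{\sum_{v} |s_{u,v}|}$$

The item-based variant skips the mean-centering and averages the user's own ratings over similar items, with $s_{i,j}$ the similarity between items $i$ and $j$:

$$\hat{r}_{u,i} = \frac{\sum_{j} s_{j,i}\, r_{u,j}}{\sum_{j} |s_{i,j}|}$$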

In [37]:
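def predict(ratings, similarity, type='user'):
    """Predict ratings as a similarity-weighted average of known ratings."""
    if type == 'user':
        # Center each user's ratings on that user's mean to offset rating bias.
        mean_user_rating = ratings.mean(axis=1)
        ratings_diff = ratings - mean_user_rating[:, np.newaxis]
        # Weighted average of the deviations, then add the user's mean back.
        pred = mean_user_rating[:, np.newaxis] + similarity.dot(ratings_diff) / np.array([np.abs(similarity).sum(axis=1)]).T
    elif type == 'item':
        # Weighted average of each user's ratings across similar items.
        pred = ratings.dot(similarity) / np.array([np.abs(similarity).sum(axis=1)])
    return pred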
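As a small worked example, the sketch below runs the user-based branch on a toy users × movies matrix. The function body repeats the logic of the cell above so that the snippet runs on its own. The `toy_ratings` and `toy_user_sim` values are hand-picked and hypothetical.

```python
import numpy as np

def predict(ratings, similarity, type='user'):
    # Same logic as the cell above, repeated so this sketch is self-contained.
    if type == 'user':
        mean_user_rating = ratings.mean(axis=1)
        ratings_diff = ratings - mean_user_rating[:, np.newaxis]
        pred = mean_user_rating[:, np.newaxis] + similarity.dot(ratings_diff) / np.array([np.abs(similarity).sum(axis=1)]).T
    elif type == 'item':
        pred = ratings.dot(similarity) / np.array([np.abs(similarity).sum(axis=1)])
    return pred

# Hypothetical 3 users x 3 movies ratings; every user's mean rating is 3.
toy_ratings = np.array([
    [5.0, 3.0, 1.0],
    [4.0, 2.0, 3.0],
    [1.0, 5.0, 3.0],
])

# Hypothetical user-user similarities: users 0 and 1 are alike, user 2 is unrelated.
toy_user_sim = np.array([
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

toy_pred = predict(toy_ratings, toy_user_sim, type='user')
print(toy_pred)
```

User 2 is similar only to itself, so its predictions equal its own ratings, while user 0's predictions are pulled toward user 1's deviations from the mean.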

# Evaluating Memory Based Collaborative Filtering Recommendation Engines

We can use precision-recall to evaluate the performance of the above recommendation engines.
- **Recall**
    - The fraction of items a user likes that were actually recommended
    - If a user likes 5 items and the recommendation shows 3 of them, then the recall is 0.6
    
    
- **Precision**
    - The fraction of recommended items that the user actually liked
    - If 5 items were recommended and the user liked 4, then the precision is 0.8

To improve performance, we need to balance the tradeoff between precision and recall, ideally keeping both high. An ideal recommender has a precision and recall of 1.
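The two definitions above can be checked with a few lines of plain Python. This is a minimal sketch for a single user. The `precision_recall` helper and the item IDs are hypothetical, chosen to reproduce the 0.6 and 0.8 examples from the text.

```python
def precision_recall(recommended, liked):
    """Return (precision, recall) for one user's recommendation list."""
    recommended = list(recommended)
    liked = set(liked)
    hits = len([item for item in recommended if item in liked])
    precision = hits / float(len(recommended)) if recommended else 0.0
    recall = hits / float(len(liked)) if liked else 0.0
    return precision, recall

# Recall example: the user likes 5 items and 3 of them are recommended.
_, recall_example = precision_recall(recommended=[1, 2, 3], liked=[1, 2, 3, 4, 5])

# Precision example: 5 items are recommended and the user likes 4 of them.
precision_example, _ = precision_recall(recommended=[1, 2, 3, 4, 9], liked=[1, 2, 3, 4])

print('precision=%.1f recall=%.1f' % (precision_example, recall_example))
```

Recommending more items tends to raise recall but lower precision, which is the tradeoff described above.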

- **Root Mean Square Error (RMSE)**
    - Square root of the mean squared difference between predicted and actual ratings; it is in the same units as the ratings, and lower values mean more accurate predictions

## GraphLab: Jaccard vs Cosine vs Pearson

In [46]:
model_performance = gl.compare(test_data_gl, [item_sim_jaccard, item_sim_cosine, item_sim_pearson])
gl.show_comparison(model_performance, [item_sim_jaccard, item_sim_cosine, item_sim_pearson])

PROGRESS: Evaluate model M0

Precision and recall summary statistics by cutoff
+--------+----------------+-----------------+
| cutoff | mean_precision |   mean_recall   |
+--------+----------------+-----------------+
|   1    | 0.332979851538 | 0.0332979851538 |
|   2    | 0.296394485684 | 0.0592788971368 |
|   3    | 0.266878755744 | 0.0800636267232 |
|   4    | 0.244962884411 | 0.0979851537646 |
|   5    | 0.228419936373 |  0.114209968187 |
|   6    | 0.217568045246 |  0.130540827147 |
|   7    | 0.205423420694 |  0.143796394486 |
|   8    | 0.196182396607 |  0.156945917285 |
|   9    | 0.189112760693 |  0.170201484624 |
|   10   | 0.181866383881 |  0.181866383881 |
+--------+----------------+-----------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M1

Precision and recall summary statistics by cutoff
+--------+----------------+-----------------+
| cutoff | mean_precision |   mean_recall   |
+--------+----------------+-----------------+
|   1    | 0.391304347826 | 0.0391304

**Although the cosine item similarity model is much better than the one using Pearson similarity metric and slightly better than the one using Jaccard similarity metric, it is still far from being a useful recommendation system. Let's try a more sophisticated model-based collaborative filtering algorithm, like matrix factorization, next.**

## scikit-learn: RMSE

In [45]:
from sklearn.metrics import mean_squared_error as mse
from math import sqrt

def rmse(pred, actual):
    pred = pred[actual.nonzero()].flatten()
    actual = actual[actual.nonzero()].flatten()
    return(sqrt(mse(pred, actual)))
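The function above scores only the positions where `actual` is nonzero, that is, where a rating exists. The sketch below shows the same idea on a tiny example using NumPy alone. The `rmse_observed` name and the toy arrays are hypothetical, and the snippet assumes that 0 marks an unrated entry.

```python
import numpy as np

def rmse_observed(pred, actual):
    """RMSE over the entries where `actual` is nonzero (assumed to mean rated)."""
    mask = actual != 0
    return float(np.sqrt(np.mean((pred[mask] - actual[mask]) ** 2)))

# Hypothetical 2 x 2 example: only two entries have an actual rating.
toy_actual = np.array([[5.0, 0.0],
                       [0.0, 3.0]])
toy_pred = np.array([[4.0, 2.0],
                     [1.0, 5.0]])

# Errors on the rated entries are -1 and 2, so the RMSE is sqrt((1 + 4) / 2).
toy_rmse = rmse_observed(toy_pred, toy_actual)
print(toy_rmse)
```

On a 1 to 5 rating scale, an RMSE computed this way cannot exceed 4 as long as predictions also stay within the scale, which gives a quick sanity check on reported values.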

**Let's predict ratings on the training data with both similarity scores.**

In [46]:
user_pred = predict(train_data_matrix, user_correlation, type='user')
item_pred = predict(train_data_matrix, item_correlation, type='item')

**Here are the RMSEs on the train and test data for both user-based and item-based CF.**

In [47]:
print 'User-Based CF Train RMSE: ', rmse(user_pred, train_data_matrix)
print 'User-Based CF Test RMSE: ', rmse(user_pred, test_data_matrix)

print 'Item-Based CF Train RMSE: ', rmse(item_pred, train_data_matrix)
print 'Item-Based CF Test RMSE: ', rmse(item_pred, test_data_matrix)

User-Based CF Train RMSE:  139.807932009
User-Based CF Test RMSE:  297.446532805
Item-Based CF Train RMSE:  75.0622001867
Item-Based CF Test RMSE:  335.72877136


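**The RMSEs, especially for the test data, are far larger than the 1-5 rating scale allows for sensible predictions, which suggests a problem with the inputs rather than ordinary overfitting. The arrays passed to `predict` have one row per rating with `user_id`, `movie_id`, and `rating` columns (shapes `(24000, 3)` and `(6000, 3)`), not a users × movies ratings matrix, so the ID columns likely dominate the error. Using only a small subset (30%) of the data may also limit how well the model generalizes to a test set it has not previously seen.**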