# Personal Video Ranker
### Collaborative Filtering Model

Collaborative filtering models have long been a core part of recommender systems. Personal video rankers, collaborative filtering models that predict which content a user will rate highly, feature prominently in some form on most content delivery platforms.

In this notebook, you'll find my baseline model, built with the biased baseline algorithm (`BaselineOnly`) from the Surprise package.

It is followed by my best Singular Value Decomposition model, which uses SVD++ (`SVDpp`) from the same package.


In [1]:
import pandas as pd
import numpy as np

import os
from surprise import Dataset
from surprise import Reader
from surprise import accuracy

from surprise import NormalPredictor

from surprise import BaselineOnly
from surprise import SVD
from surprise import SVDpp
from surprise.model_selection import cross_validate
from surprise.model_selection import GridSearchCV
from surprise.model_selection import RandomizedSearchCV

In [2]:
#notify me when a long running cell is complete
%load_ext jupyternotify


# Prepare data using Surprise format

In [24]:
train = pd.read_csv('data/train_1M.csv')
test = pd.read_csv('data/test_1M.csv')
holdout = pd.read_csv('data/ho_1M.csv')
data = pd.read_csv('data/1m_useratt_minreq.csv')

In [5]:
print('holdout shape: ', holdout.shape)
print('test shape: ', test.shape)
print('train shape: ', train.shape)

holdout shape:  (100000, 11)
test shape:  (200000, 11)
train shape:  (700000, 11)


In [25]:
reader = Reader(rating_scale=(1,5))

train_data = Dataset.load_from_df(train[['cust_id','mid','rating']], reader)
test_data = Dataset.load_from_df(test[['cust_id','mid','rating']], reader)
ho_data = Dataset.load_from_df(holdout[['cust_id','mid','rating']], reader)
all_data = Dataset.load_from_df(data[['cust_id','mid','rating']], reader)

#correct surprise dataset format
train_sr = train_data.build_full_trainset()

test_sr1 = test_data.build_full_trainset()
test_sr = test_sr1.build_testset()

ho_sr1 = ho_data.build_full_trainset()
ho_sr = ho_sr1.build_testset()

all_sr = all_data.build_full_trainset()

# Bias Baseline

In this project, I initially drew a lot of inspiration from the [BellKor](http://snap.stanford.edu/class/cs246-2015/slides/08-recsys2.pdf) solution for the Netflix Prize, since this is an academic exercise and I want to understand recommendation systems at a granular level.

This was the baseline model they used. Predictions are calculated using the following formula:
        
        rᵤᵢ=μ + bᵤ + bᵢ

Essentially, this model assumes that a rating can be predicted from the natural biases of the user and of the content. In layman's terms:

    User's Rating = (mean rating across the entire sample) + (how much higher or lower this user tends to rate videos) + (how much higher or lower this content tends to be rated)
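
To make the formula concrete, here is a minimal sketch that estimates the two biases as plain average offsets on a hypothetical toy sample. It is a simplification for illustration only: Surprise's `BaselineOnly` fits regularized biases with SGD or ALS rather than taking raw averages, and the ratings and the `mean_offsets` and `baseline_estimate` helpers below are made up for this example.

```python
from collections import defaultdict

# Hypothetical toy ratings as (cust_id, mid, rating) triples
ratings = [(1, 10, 5), (1, 20, 3), (2, 10, 4), (2, 30, 2), (3, 20, 1)]

# mu: mean rating across the whole sample
mu = sum(r for _, _, r in ratings) / len(ratings)

def mean_offsets(key_index):
    """Average rating per user or per title, expressed as an offset from mu."""
    grouped = defaultdict(list)
    for row in ratings:
        grouped[row[key_index]].append(row[2])
    return {key: sum(vals) / len(vals) - mu for key, vals in grouped.items()}

b_u = mean_offsets(0)  # user biases: how each user deviates from mu
b_i = mean_offsets(1)  # title biases: how each title deviates from mu

def baseline_estimate(cust_id, mid):
    """mu + b_u + b_i, using a bias of 0 for unseen users or titles."""
    return mu + b_u.get(cust_id, 0.0) + b_i.get(mid, 0.0)

# User 3 rates low (bias -2.0) and title 10 is rated high (bias +1.5)
print(baseline_estimate(3, 10))
```

With a global mean of 3.0, a harsh rater and a well-liked title partly cancel out, so the estimate lands between the two.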

The model is below:

In [7]:
#using stochastic gradient descent bc it performed the best
bsl_options = {'method': 'sgd'}
bias_baseline = BaselineOnly(bsl_options)
bias_baseline.fit(train_sr)
predictions = bias_baseline.test(test_sr)

accuracy.mae(predictions)

Estimating biases using sgd...
MAE:  0.7989


0.7988551435037793

In [8]:
accuracy.rmse(predictions, verbose=True)

RMSE: 0.9993


0.9992895107682512

In [9]:
preds_bsl_ho = bias_baseline.test(ho_sr)
accuracy.rmse(preds_bsl_ho)

RMSE: 0.9798


0.9798110140695234

In [10]:
accuracy.mae(preds_bsl_ho)

MAE:  0.7815


0.781468222150723

# Baseline Recommender Results

At every step, I examine the diversity of each model's recommendations. The following is the recommender built from the baseline model.

Exploring diversity on a social level was added to the project later in the process, so those values are imported from a .csv file and merged with the results.

In [12]:
#retrain Baseline on all data
bsl_options = {'method': 'sgd'}
bias_baseline = BaselineOnly(bsl_options)
bias_baseline.fit(all_sr)

Estimating biases using sgd...


<surprise.prediction_algorithms.baseline_only.BaselineOnly at 0x7fc37df2a490>

In [15]:
#list of all users
all_users = data['cust_id'].unique()
len(all_users)

290022

In [21]:
def get_recs(model, user_list, n=10):
    """Return a DataFrame with the top-n unrated titles for every user in user_list."""
    # The catalog is the same for every user, so build it once
    all_content = set(data['mid'].unique())
    all_recommendations = []

    def rec_content(cust_id):
        # Titles the user has not rated yet
        user_content = data.loc[data['cust_id'] == cust_id, 'mid'].unique()
        new_content = all_content - set(user_content)

        # Predict a rating for each unrated title
        predictions = [model.predict(cust_id, mid) for mid in new_content]

        # Sort the predictions by estimated rating, highest first
        predictions.sort(key=lambda x: x.est, reverse=True)

        # Keep the top n recommendations as (cust_id, mid) pairs
        return [(cust_id, prediction.iid) for prediction in predictions[:n]]

    # Collect recommendations for every user
    for cust_id in user_list:
        all_recommendations.extend(rec_content(cust_id))

    # New DataFrame of recommendations for analysis
    return pd.DataFrame(all_recommendations, columns=["cust_id", "mid"])

In [22]:
top10_allusers = get_recs(bias_baseline, all_users)
top10_allusers

Unnamed: 0,cust_id,mid
0,510180,7230
1,510180,7057
2,510180,7833
3,510180,14961
4,510180,14550
...,...,...
2900215,883348,16587
2900216,883348,12834
2900217,883348,2102
2900218,883348,12891


In [61]:
top10_allusers.to_csv('data/bslrecs.csv')

In [23]:
top10_allusers['mid'].value_counts()

mid
2102     289391
7833     287834
7230     287574
7057     287545
11662    287086
          ...  
1291          2
88            2
10            1
152           1
79            1
Name: count, Length: 93, dtype: int64

In [11]:
top15_allusers = top15_allusers.merge(dataminorityrec)
top15_allusers.head()

Unnamed: 0,cust_id,mid,m_minreq
0,510180,7230,0.0
1,1589382,7230,0.0
2,1878798,7230,0.0
3,1259176,7230,0.0
4,873369,7230,0.0


In [41]:
top15_allusers['mid'].value_counts()

mid
2102     289705
4427     289509
7833     289035
8571     288958
8535     288894
          ...  
113           2
14240         1
57            1
180           1
203           1
Name: count, Length: 122, dtype: int64

In [15]:
top15_allusers['m_minreq'].value_counts()

m_minreq
0.0    4346858
1.0       3472
Name: count, dtype: int64

In [16]:
top15_allusers['m_minreq'].value_counts(normalize = True)

m_minreq
0.0    0.999202
1.0    0.000798
Name: proportion, dtype: float64

Though this model is accurate, it offers very little diversity. For over 290k users and a library of almost 18k videos, it recommended only 122 different videos.

Minority representation in the dataset was low to begin with, so it is not surprising that it shrank substantially in the recommendations, to under 0.1% of them.
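
One way to put a number on this lack of diversity is catalog coverage: the fraction of the library that appears in at least one recommendation list. The sketch below is a minimal illustration, not part of the original analysis. The `catalog_coverage` helper is hypothetical, and the catalog size of 18,000 is a rounded assumption taken from the "almost 18k" figure quoted above.

```python
def catalog_coverage(recommended_ids, catalog_size):
    """Fraction of the catalog that shows up in at least one recommendation."""
    return len(set(recommended_ids)) / catalog_size

# Toy check: 3 distinct titles recommended out of a 6-title catalog
toy_recs = [7230, 7230, 2102, 7833, 2102]
toy_coverage = catalog_coverage(toy_recs, catalog_size=6)

# Figures from this notebook: 122 distinct titles were recommended.
# ASSUMPTION: 18_000 is a rounded stand-in for the real catalog size.
baseline_coverage = 122 / 18_000

print(f"toy coverage: {toy_coverage:.2f}")
print(f"baseline coverage: {baseline_coverage:.4f}")
```

Under that assumption the baseline covers well under 1% of the library. In the notebook itself, the same ratio could be computed from the real data as `top15_allusers['mid'].nunique() / data['mid'].nunique()`.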

# SVD++

In [11]:
svdpp = SVDpp(n_factors= 150, n_epochs= 20)
svdpp.fit(train_sr)
preds_svdpp = svdpp.test(test_sr)

accuracy.rmse(preds_svdpp)
accuracy.mae(preds_svdpp)

RMSE: 1.0302
MAE:  0.8274


0.8273689246712609

In [12]:
preds_svdpp_ho = svdpp.test(ho_sr)
accuracy.rmse(preds_svdpp_ho)
accuracy.mae(preds_svdpp_ho)

RMSE: 0.9091
MAE:  0.7009


0.7009034106130455

In [17]:
top15_svdpp_allusers = get_recs(svdpp, all_users)

In [19]:
top15_svdpp_allusers.head()

Unnamed: 0,cust_id,mid
0,510180,5760
1,510180,2057
2,510180,8116
3,510180,5837
4,510180,7749


In [35]:
top15_svdpp_allusers.to_csv('data/top15_svdpp_allusers.csv', index=False)

In [26]:
top15_svdpp_allusers = top15_svdpp_allusers.merge(minorityrec)
top15_svdpp_allusers.head()

Unnamed: 0,cust_id,mid,m_minreq
0,510180,5760,0.0
1,398661,5760,0.0
2,200684,5760,0.0
3,1136678,5760,0.0
4,712568,5760,0.0


In [27]:
top15_svdpp_allusers['mid'].value_counts()

mid
7230     145231
14961    138438
7057     135047
16587    109922
14302     95981
          ...  
862           1
1689          1
545           1
2315          1
16931         1
Name: count, Length: 2304, dtype: int64

In [28]:
top15_svdpp_allusers['m_minreq'].value_counts()

m_minreq
0.0    3938424
1.0     411906
Name: count, dtype: int64

In [29]:
top15_svdpp_allusers['m_minreq'].value_counts(normalize=True)

m_minreq
0.0    0.905316
1.0    0.094684
Name: proportion, dtype: float64

In [30]:
0.094684 - 0.000798

0.09388600000000001

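This model performed better overall. Its holdout error was lower than the baseline's (RMSE 0.9091 vs. 0.9798), although its test-set error was slightly higher (RMSE 1.0302 vs. 0.9993). For over 290k users and a library of almost 18k videos, it recommended 2,304 different videos.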

Minority representation is still low, but far higher than with the baseline model: about 0.095 of all recommendations, up from 0.0008, which is more than a hundredfold increase.
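
The comparison can be reproduced from the `m_minreq` counts printed earlier. This is a minimal sketch in plain Python, with the counts copied from those `value_counts()` outputs; the variable names and the `minority_share` helper are illustrative only.

```python
# Counts copied from the m_minreq value_counts() outputs in this notebook
baseline_counts = {0.0: 4346858, 1.0: 3472}
svdpp_counts = {0.0: 3938424, 1.0: 411906}

def minority_share(counts):
    """Share of recommendations flagged with m_minreq == 1.0."""
    return counts[1.0] / sum(counts.values())

baseline_share = minority_share(baseline_counts)
svdpp_share = minority_share(svdpp_counts)

print(f"baseline share: {baseline_share:.6f}")
print(f"SVD++ share:    {svdpp_share:.6f}")
print(f"absolute gain:  {svdpp_share - baseline_share:.6f}")
print(f"relative gain:  {svdpp_share / baseline_share:.1f}x")
```

Reporting both the absolute and the relative gain avoids overstating the result: the share rises by about 9.4 percentage points, which is a large multiple of a very small starting value.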

# User #2407458

Our example user's rating history and the recommendations they received from each model are shown below:

In [34]:
example = data[data['cust_id'] == 2407458]
example

Unnamed: 0,mid,cust_id,rating,r_date,m_decade,m_avg_rating,user_engagement,adopters
309189,16128,2407458,4.0,2005-11-05,4,3.964478,4,5
309190,15342,2407458,3.0,2005-11-17,4,3.476331,4,5
309191,4157,2407458,3.0,2005-11-17,4,3.357143,4,5
309192,14606,2407458,3.0,2005-11-05,5,3.124744,4,5


In [32]:
baseline_example = top15_allusers[(top15_allusers['cust_id'] == 2407458)]
baseline_example

Unnamed: 0,cust_id,mid,m_minreq
49507,2407458,7230,0.0
338184,2407458,7833,0.0
627153,2407458,7057,0.0
916081,2407458,2102,0.0
1205222,2407458,12834,0.0
1493488,2407458,16587,0.0
1781951,2407458,8535,0.0
2070457,2407458,14961,0.0
2358792,2407458,4427,0.0
2647768,2407458,15861,0.0


In [33]:
svdpp_example = top15_svdpp_allusers[(top15_svdpp_allusers['cust_id'] == 2407458)]
svdpp_example

Unnamed: 0,cust_id,mid,m_minreq
267998,2407458,12834,0.0
502332,2407458,7230,0.0
864390,2407458,10418,0.0
997165,2407458,16587,0.0
1151834,2407458,14621,0.0
1558123,2407458,12293,0.0
1713263,2407458,2102,0.0
2183369,2407458,5103,0.0
2553679,2407458,14691,0.0
2824970,2407458,10080,0.0


In [16]:
def pvr(model, cust_id):
    all_recs = []

    def rec_pvr(cust_id):

        # list of all content user has not rated
        all_content = data['mid'].unique()
        user_content = data[data['cust_id'] == cust_id]['mid'].unique()
        new_content = list(set(all_content) - set(user_content))

        # predict the ratings for new content
        preds = [model.predict(cust_id, mid) for mid in new_content]

        # sort preds by estimated rating
        preds.sort(key=lambda pred: pred.est, reverse=True)

        # top 10 recommendations
        top_10 = [prediction.iid for prediction in preds[:10]]
        
        #return list of cust,mid pairs
        return [(cust_id, movie_id) for movie_id in top_10]
    
    # apply rec_pvr and add its results to the list
    user_recs = rec_pvr(cust_id)
    all_recs.extend(user_recs)

    #new df of recs for analysis including minority requirement
    recs = pd.DataFrame(all_recs, columns=["cust_id", "mid"])
    recs = recs.merge(minorityrec)
    
    return recs