# Video to Video Ranker
### A content-filtering kNN model

To promote content diversity, content delivery platforms often use models that connect users with content similar to what they have already seen.

These models look only at the similarities between the available content items.
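To make "similarity between items" concrete, here is a minimal sketch of a cosine similarity between two items, assuming it is computed only over the users who rated both. The function name `item_cosine` and the toy ratings are illustrative and do not come from this notebook's data.

```python
import math

def item_cosine(ratings_i, ratings_j):
    """Cosine similarity between two items.

    Each argument maps user id -> rating for one item. Only users who
    rated both items contribute (an assumption of this sketch).
    """
    common = ratings_i.keys() & ratings_j.keys()
    if not common:
        return 0.0  # no shared raters: treat the items as unrelated
    dot = sum(ratings_i[u] * ratings_j[u] for u in common)
    norm_i = math.sqrt(sum(ratings_i[u] ** 2 for u in common))
    norm_j = math.sqrt(sum(ratings_j[u] ** 2 for u in common))
    return dot / (norm_i * norm_j)

# Toy data: user id -> rating on a 1-5 scale
item_a = {1: 4, 2: 5, 3: 1}
item_b = {1: 4, 2: 5, 4: 2}
item_c = {1: 5, 2: 1}

sim_ab = item_cosine(item_a, item_b)  # identical ratings from the shared users
sim_ac = item_cosine(item_a, item_c)  # shared users disagree, so lower
```

Because only co-raters count, two items with a single shared user and any two ratings get a similarity of 1.0, which is one reason a minimum-support rule can matter for this kind of model.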

In this notebook, I start from the `KNNBaseline` model in the Python Surprise package and keep only the best-performing iteration that followed.

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

import os
from surprise import Dataset
from surprise import Reader
from surprise import accuracy
from surprise import KNNBasic
from surprise import KNNBaseline
from surprise import KNNWithMeans

In [2]:
#notify me when a long running cell is complete
%load_ext jupyternotify

# Import Data and Data Split

If you haven't already, you can see how I split the data in the 1M_PVR Notebook.

In [5]:
data = pd.read_csv('data/1m_useratt_minreq.csv')
minorityrec = pd.read_csv('data/minreq.csv')

train = pd.read_csv('data/train_1M.csv')
test = pd.read_csv('data/test_1M.csv')
holdout = pd.read_csv('data/ho_1M.csv')

print('holdout shape: ', holdout.shape)
print('test shape: ', test.shape)
print('train shape: ', train.shape)

holdout shape:  (100000, 9)
test shape:  (200000, 9)
train shape:  (700000, 9)


## Preparing Data in Surprise Format

In [9]:
reader = Reader(rating_scale=(1,5))

train_data = Dataset.load_from_df(train[['cust_id','mid','rating']], reader)
test_data = Dataset.load_from_df(test[['cust_id','mid','rating']], reader)
ho_data = Dataset.load_from_df(holdout[['cust_id','mid','rating']], reader)
all_data = Dataset.load_from_df(data[['cust_id','mid','rating']], reader)

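# Convert each Dataset into the objects Surprise works with:
# a Trainset for fit() and a list of (user, item, rating) tuples for test()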
train_sr = train_data.build_full_trainset()

test_sr1 = test_data.build_full_trainset()
test_sr = test_sr1.build_testset()

ho_sr1 = ho_data.build_full_trainset()
ho_sr = ho_sr1.build_testset()

all_sr = all_data.build_full_trainset()

# kNN Baseline Model

In [31]:
sim_dict = {'name': 'cosine', 'user_based': False}
knn_bsl1 = KNNBaseline(sim_options=sim_dict)
knn_bsl1.fit(train_sr)
knn_bsl1_preds = knn_bsl1.test(test_sr)

accuracy.rmse(knn_bsl1_preds)
accuracy.mae(knn_bsl1_preds)

Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 1.1191
MAE:  0.8707


0.8706840387794991
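For reference, the two metrics reported above can be sketched in a few lines. This is a plain-Python illustration on made-up (true rating, estimate) pairs, not the code Surprise runs internally.

```python
import math

def rmse(pairs):
    """Root mean squared error over (true_rating, estimate) pairs."""
    return math.sqrt(sum((true - est) ** 2 for true, est in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error over (true_rating, estimate) pairs."""
    return sum(abs(true - est) for true, est in pairs) / len(pairs)

# Toy predictions: errors of 0.5, 0.0 and 1.0
toy_pairs = [(4.0, 3.5), (3.0, 3.0), (5.0, 4.0)]
toy_rmse = rmse(toy_pairs)
toy_mae = mae(toy_pairs)  # 0.5
```

RMSE penalizes large misses more heavily than MAE, which is why it is always at least as large as MAE on the same predictions.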

# Best Performing kNN Model

In [10]:
sim_dict = {'name': 'cosine', 'user_based': False}
knn_bsl = KNNBaseline(min_k=5, sim_options=sim_dict)
knn_bsl.fit(train_sr)
knn_bsl_preds = knn_bsl.test(test_sr)

accuracy.rmse(knn_bsl_preds)
accuracy.mae(knn_bsl_preds)

Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 0.9972
MAE:  0.8008


0.8008295119623806
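The only change from the first model is `min_k=5`. The sketch below illustrates the idea as I understand it from the Surprise documentation: the estimate is a baseline plus a similarity-weighted average of the neighbors' deviations from their own baselines, and when fewer than `min_k` neighbors with positive similarity are available, the estimate falls back to the baseline alone. The function name and toy numbers are hypothetical. This is not the library's source.

```python
import heapq

def knn_baseline_estimate(baseline_ui, neighbors, k=40, min_k=1):
    """Sketch of an item-based kNN-with-baseline estimate.

    neighbors: (similarity, rating, neighbor_baseline) tuples, one per
    item the user has already rated.
    """
    top_k = heapq.nlargest(k, neighbors, key=lambda t: t[0])
    weighted_sum = 0.0
    sim_sum = 0.0
    used = 0
    for sim, rating, baseline_uj in top_k:
        if sim > 0:  # only positively similar neighbors contribute
            weighted_sum += sim * (rating - baseline_uj)
            sim_sum += sim
            used += 1
    if used < min_k or sim_sum == 0:
        return baseline_ui  # too little evidence: use the baseline only
    return baseline_ui + weighted_sum / sim_sum

# Toy case: two usable neighbors
toy_neighbors = [(0.9, 4.0, 3.5), (0.5, 3.0, 3.0)]
est_min_k_1 = knn_baseline_estimate(3.2, toy_neighbors, min_k=1)
est_min_k_5 = knn_baseline_estimate(3.2, toy_neighbors, min_k=5)  # 3.2
```

One plausible reading of the improvement is that predictions resting on only a handful of neighbors were noisy, and falling back to the baseline for those cases reduced the error.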

In [13]:
knn_bsl_preds_ho = knn_bsl.test(ho_sr)

accuracy.rmse(knn_bsl_preds_ho)
accuracy.mae(knn_bsl_preds_ho)

RMSE: 0.9621
MAE:  0.7708


0.7708208990370433

# V2V Ranker

I retrain the best model on all the data, then build the V2V ranker from the nearest neighbors of the last video each user rated.

In [11]:
#retrain on all data
knn_bsl.fit(all_sr)

Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBaseline at 0x7fb41e520820>
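Once the model is fitted, a neighbor lookup amounts to picking the `k` most similar items from one row of the item-item similarity matrix. The sketch below shows that idea on a small hand-written matrix. `nearest_neighbors` and `toy_sim` are illustrative names, and this is not Surprise's implementation.

```python
import heapq

def nearest_neighbors(sim, item, k):
    """Return the indices of the k items most similar to `item`.

    sim is a square list-of-lists similarity matrix; the item itself
    is excluded from its own neighbor list.
    """
    others = [(j, sim[item][j]) for j in range(len(sim)) if j != item]
    top = heapq.nlargest(k, others, key=lambda pair: pair[1])
    return [j for j, _ in top]

toy_sim = [
    [1.0, 0.2, 0.9, 0.5],
    [0.2, 1.0, 0.4, 0.1],
    [0.9, 0.4, 1.0, 0.3],
    [0.5, 0.1, 0.3, 1.0],
]

neighbors_of_0 = nearest_neighbors(toy_sim, 0, k=2)  # [2, 3]
neighbors_of_1 = nearest_neighbors(toy_sim, 1, k=2)  # [2, 0]
```

Note that the indices here play the role of Surprise's inner ids, which must be mapped back to raw item ids before they are shown to anyone.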

In [22]:
def v2v_recs(user_list):
    
    '''
    this is an unpersonalized item based recommendation
    returns a movie that is similar to one that has been watched 
    but doesn't take into account any other user behavior
    '''
    
    v2v = []
    
    def v2v_mod(cust_id):
        
        # last movie in this user's rows of `data` (row order, not sorted by r_date)
        rated = data[data['cust_id'] == cust_id]['mid'].to_list()
        mid = rated[-1]
        
        # inner ids of the 10 nearest neighbors of that movie
        neighbors = knn_bsl.get_neighbors(all_sr.to_inner_iid(mid), k=10)
        
        # convert inner ids back to raw item ids
        item_ids = [all_sr.to_raw_iid(inner_id) for inner_id in neighbors]

        # one (cust_id, seed movie, recommended movie) tuple per neighbor
        return [(cust_id, mid, rec_id) for rec_id in item_ids]
    
    #for all users get the nearest neighbors to the last movie they rated
    for cust_id in user_list:
        user_recs = v2v_mod(cust_id)
        v2v.extend(user_recs)
    
    #new df of recs for analysis
    neighbors_df = pd.DataFrame(v2v, columns=["cust_id", "mid", 'recs'])
    return neighbors_df
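The function above takes the last row for each user in `data`, which matches the most recent rating only if the rows are already in date order. As a hedged alternative, the sketch below picks each user's seed movie by `r_date` in a single pass. The helper `latest_rated_item` is hypothetical, and the four rows are the example user's ratings shown later in this notebook.

```python
def latest_rated_item(rows):
    """Map each cust_id to the mid of its most recent rating.

    rows: iterable of (cust_id, mid, r_date) with ISO 'YYYY-MM-DD' dates.
    """
    latest = {}
    for cust_id, mid, r_date in rows:
        # ISO dates sort correctly as strings; >= lets a later row win ties
        if cust_id not in latest or r_date >= latest[cust_id][1]:
            latest[cust_id] = (mid, r_date)
    return {cust_id: mid for cust_id, (mid, _) in latest.items()}

rows = [
    (2407458, 16128, "2005-11-05"),
    (2407458, 15342, "2005-11-17"),
    (2407458, 4157, "2005-11-17"),
    (2407458, 14606, "2005-11-05"),
]

seed_by_date = latest_rated_item(rows)  # {2407458: 4157}
seed_by_row_order = rows[-1][1]         # 14606
```

In pandas, roughly the same result could be obtained with `data.sort_values('r_date', kind='stable').groupby('cust_id')['mid'].last()`, which would also avoid filtering the full DataFrame once per user.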

# V2V Ranker Results

In [26]:
#list of all users
all_users = data['cust_id'].unique()
len(all_users)

290022

In [27]:
v2v_df = v2v_recs(all_users)
v2v_df.head()

| | cust_id | mid | recs |
|---|---|---|---|
| 0 | 510180 | 1428 | 2 |
| 1 | 510180 | 1428 | 5 |
| 2 | 510180 | 1428 | 9 |
| 3 | 510180 | 1428 | 10 |
| 4 | 510180 | 1428 | 11 |


In [9]:
v2v_df.to_csv('data/v2vrecs.csv', index=False)

In [4]:
v2v_df = v2v_df.merge(minorityrec)
v2v_df.head()

| | cust_id | mid | recs | m_minreq |
|---|---|---|---|---|
| 0 | 510180 | 1428 | 2 | 0.0 |
| 1 | 510180 | 1428 | 5 | 0.0 |
| 2 | 510180 | 1428 | 9 | 0.0 |
| 3 | 510180 | 1428 | 10 | 0.0 |
| 4 | 510180 | 1428 | 11 | 0.0 |


In [28]:
v2v_df['mid'].value_counts()

mid
15205    13440
17169    11680
5496     11040
14550    10270
10947    10220
         ...  
45          10
6271        10
17037       10
2842        10
13403       10
Name: count, Length: 13906, dtype: int64

In [5]:
v2v_df['m_minreq'].value_counts()

m_minreq
0.0    2388630
1.0     511590
Name: count, dtype: int64

In [6]:
v2v_df['m_minreq'].value_counts(normalize=True)

m_minreq
0.0    0.823603
1.0    0.176397
Name: proportion, dtype: float64

Unsurprisingly, this model performed best in terms of content diversity. The results cover 13,906 distinct videos, and about 17.6% of the rows are flagged as minority-driven content.
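As a sketch of how these two diversity figures could be computed directly over the recommended items, assuming a lookup from item id to a 0/1 minority flag is available, consider the helper below. `diversity_summary`, `toy_recs`, and `toy_flags` are hypothetical names with made-up values.

```python
def diversity_summary(recs, minority_flags):
    """Return (number of distinct recommended items, share of flagged rows).

    recs: list of recommended item ids, one per recommendation row.
    minority_flags: dict mapping item id -> 1 if flagged, else 0;
    items missing from the dict are treated as not flagged.
    """
    distinct_items = len(set(recs))
    flagged_rows = sum(1 for item in recs if minority_flags.get(item, 0) == 1)
    return distinct_items, flagged_rows / len(recs)

toy_recs = [2, 5, 9, 2, 11]
toy_flags = {5: 1, 11: 1}

distinct_items, flagged_share = diversity_summary(toy_recs, toy_flags)
# distinct_items == 4, flagged_share == 0.4
```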

# User #2407458

Below are our example user's rating history and the results this ranker produced for them:

In [7]:
example = data[data['cust_id'] == 2407458]
example

| | mid | cust_id | rating | r_date | m_decade | m_avg_rating | user_engagement | adopters |
|---|---|---|---|---|---|---|---|---|
| 309189 | 16128 | 2407458 | 4.0 | 2005-11-05 | 4 | 3.964478 | 4 | 5 |
| 309190 | 15342 | 2407458 | 3.0 | 2005-11-17 | 4 | 3.476331 | 4 | 5 |
| 309191 | 4157 | 2407458 | 3.0 | 2005-11-17 | 4 | 3.357143 | 4 | 5 |
| 309192 | 14606 | 2407458 | 3.0 | 2005-11-05 | 5 | 3.124744 | 4 | 5 |


In [8]:
v2v_example = v2v_df[(v2v_df['cust_id'] == 2407458)]
v2v_example

| | cust_id | mid | recs | m_minreq |
|---|---|---|---|---|
| 88110 | 2407458 | 14606 | 0 | 0.0 |
| 88111 | 2407458 | 14606 | 4 | 0.0 |
| 88112 | 2407458 | 14606 | 7 | 0.0 |
| 88113 | 2407458 | 14606 | 8 | 0.0 |
| 88114 | 2407458 | 14606 | 11 | 0.0 |
| 88115 | 2407458 | 14606 | 15 | 0.0 |
| 88116 | 2407458 | 14606 | 22 | 0.0 |
| 88117 | 2407458 | 14606 | 39 | 0.0 |
| 88118 | 2407458 | 14606 | 45 | 0.0 |
| 88119 | 2407458 | 14606 | 48 | 0.0 |
