**Authors:** John Leraas

**Date:** 2/19/2022

**Purpose:** This notebook provides an example of collaborative filtering on a well-known, non-proprietary dataset.

# Collaborative Filtering Recommendation System

Collaborative filtering is a highly effective approach to recommendation and was popularized by the Netflix Prize competition, which ran from 2006 to 2009. The basic intuition behind collaborative filtering is that if User1 and User2 both like items A, B, and C, and User1 also likes item D, then we would expect User2 to like item D as well. Bilinear prediction of the target variable (e.g. a like or a rating) is a simple parametric method that has been highly successful. Notably, this approach does not depend on any features of the items themselves; it relies only on observed user-item interactions.

While the Netflix data provides an example of explicit ratings (input directly by the user), it is also possible to use implicit feedback. Examples of implicit ratings could be coding a "1" for movies that were stopped within the first quarter of the movie and never finished, or perhaps a "4" or "5" could be coded if a movie was watched until the end and the user then watched a similar movie. 
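
As a rough illustration of implicit feedback, the sketch below maps viewing behavior to a rating-like score. The function name `implicit_rating` and its parameters are hypothetical, as is the choice to split "4" versus "5" on whether a similar movie was watched next; only the "1" for abandoning a movie in its first quarter and the "4"/"5" for finishing it come from the examples above.

```python
from typing import Optional


def implicit_rating(fraction_watched: float,
                    finished: bool,
                    watched_similar_next: bool) -> Optional[int]:
    """Map viewing behavior to a 1-5 implicit rating (hypothetical rules)."""
    if not 0.0 <= fraction_watched <= 1.0:
        raise ValueError("fraction_watched must be between 0 and 1")
    if finished:
        # Finished, then sought out something similar: strongest signal.
        return 5 if watched_similar_next else 4
    if fraction_watched < 0.25:
        # Stopped within the first quarter and never finished.
        return 1
    # Anything in between is ambiguous; leave it unrated in this sketch.
    return None


print(implicit_rating(0.10, finished=False, watched_similar_next=False))
print(implicit_rating(1.00, finished=True, watched_similar_next=True))
```

Returning `None` for ambiguous cases keeps the resulting matrix sparse rather than filling it with guesses, which matches how missing explicit ratings are treated.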

Collaborative filtering recommendation systems can suffer from a "cold start" problem: new users and new products/content have no pre-existing data from which to draw, so there is no way to calculate their similarity to other users or items. This can be mitigated through (i) soliciting initial ratings from new users, and/or (ii) supplementing with content-based recommender systems.

# Data Setup and Import Libraries

In [1]:
# Netflix Dataset
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
import pandas as pd
import numpy as np
import gc # Garbage Collection
from surprise import Reader, Dataset, SVD, KNNBasic
from surprise.model_selection import cross_validate

/kaggle/input/netflix-prize-data/combined_data_3.txt
/kaggle/input/netflix-prize-data/movie_titles.csv
/kaggle/input/netflix-prize-data/combined_data_4.txt
/kaggle/input/netflix-prize-data/combined_data_1.txt
/kaggle/input/netflix-prize-data/README
/kaggle/input/netflix-prize-data/probe.txt
/kaggle/input/netflix-prize-data/combined_data_2.txt
/kaggle/input/netflix-prize-data/qualifying.txt


# Load and Structure Data 

In [2]:
# files = ['../input/netflix-prize-data/combined_data_1.txt',
#         '../input/netflix-prize-data/combined_data_2.txt',
#         '../input/netflix-prize-data/combined_data_3.txt',
#         '../input/netflix-prize-data/combined_data_4.txt']

# Single file to reduce runtime
files = ['../input/netflix-prize-data/combined_data_1.txt']

__Note__ that data is structured as:

1:

1488844,3,2005-09-06

822109,5,2005-05-13

...

Specifically, a line such as `1:` gives the movie_id, and each line that follows gives a user_id, rating, and rating date for that movie.

In [3]:
data = [] #Empty List

for file in files:
    with open(file) as f:
        for line in f:
            line = line.strip()
            if line.endswith(':'):
                movie_id = line.replace(':','')
            else:
                w_line = str(movie_id) +','+ line
                w_tup = tuple(w_line.split(','))
                data.append(w_tup)

df=pd.DataFrame(data, columns=['movie_id', 'user_id', 'rating', 'date'])                
df.shape                

(24053764, 4)

In [4]:
# Clean
df = df[['user_id', 'movie_id', 'rating']]
df = df.astype({'rating': 'uint8'})

In [5]:
df.head()

Unnamed: 0,user_id,movie_id,rating
0,1488844,1,3
1,822109,1,5
2,885013,1,4
3,30878,1,4
4,823519,1,3


## Data Structure for Collaborative Filtering

For collaborative filtering, we would typically seek to structure our data (at least conceptually) into a single matrix with rows corresponding to user_id, columns corresponding to movie_id, and elements corresponding to the rating value. The size of the Netflix dataset makes this difficult to achieve with a pandas DataFrame; however, the structure is illustrated below on a subset of the data. This is the data structure to keep in mind for new collaborative filtering projects.

It should be noted that recommender systems frequently deal with sparse datasets, similar to the one below, due to the fact that a single user is likely to only interact with a fraction of the potential products/content. Similarly any given product/content is likely to only receive ratings (explicit or implicit) from a fraction of the entire userbase.

In [6]:
df_sub = df.iloc[0:10000, ]
pd.pivot_table(data=df_sub, values = 'rating', index='user_id', columns='movie_id')

movie_id,1,2,3,4,5,6,7,8
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
100006,,,3.0,,,,,
100029,,,5.0,,,,,
1000597,,,,,,,,5.0
1000721,,,,,,,,5.0
1000868,,,,,,,,1.0
...,...,...,...,...,...,...,...,...
999312,,,,,4.0,,,
999362,,,,,,,,2.0
999444,,,4.0,,,,,
999463,,,,,,,,3.0


In [7]:
# Movie Titles
df_titles = pd.read_csv('../input/netflix-prize-data/movie_titles.csv', names=['movie_id', 'year', 'title'], header=None, encoding = "ISO-8859-1")
df_titles['movie_id'] = df_titles['movie_id'].astype(str)
df_titles.head()

Unnamed: 0,movie_id,year,title
0,1,2003.0,Dinosaur Planet
1,2,2004.0,Isle of Man TT 2004 Review
2,3,1997.0,Character
3,4,1994.0,Paula Abdul's Get Up & Dance
4,5,2004.0,The Rise and Fall of ECW


# EDA
Obtain a preliminary sense of:
- Unique Customer Count
- Unique Movie Count
- Rating Distribution

In [8]:
# Verify no null values
# isnull().sum() counts missing values (count() skips them and would always report 0)
null_movie = df['movie_id'].isnull().sum()
null_user = df['user_id'].isnull().sum()
null_rating = df['rating'].isnull().sum()

print(f'Movie ID - Nulls: {null_movie}')
print(f'User ID - Nulls: {null_user}')
print(f'Rating - Nulls: {null_rating}')

Movie ID - Nulls: 0
User ID - Nulls: 0
Rating - Nulls: 0


In [9]:
# Unique counts
unique_movie = df['movie_id'].nunique()
unique_user = df['user_id'].nunique()

print(f'Unique Movie IDs: {unique_movie}')
print(f'Unique User IDs: {unique_user}')

Unique Movie IDs: 4499
Unique User IDs: 470758
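
These counts make the sparsity of the user-by-movie matrix concrete. The short sketch below computes the share of matrix cells that hold a rating, using only the row count and unique counts reported in this notebook; the helper name `matrix_density` is illustrative.

```python
# Counts reported in this notebook for combined_data_1.txt
n_ratings = 24_053_764   # rows in df
n_users = 470_758        # unique user_id values
n_movies = 4_499         # unique movie_id values


def matrix_density(n_ratings: int, n_users: int, n_movies: int) -> float:
    """Fraction of user x movie cells that actually hold a rating."""
    return n_ratings / (n_users * n_movies)


density = matrix_density(n_ratings, n_users, n_movies)
print(f"Filled cells: {density:.2%}")
```

Only a little over 1% of the cells are filled, which is why a dense pivot table of the full dataset is impractical.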


In [10]:
# Rating Distribution
dist_rating = df.groupby('rating')['rating'].agg(['count'])
dist_rating / dist_rating.sum() #Display percentage

Unnamed: 0_level_0,count
rating,Unnamed: 1_level_1
1,0.046487
2,0.101401
3,0.287031
4,0.336153
5,0.228928


In [11]:
# List of all movie_ids
movie_list = df['movie_id'].unique()
movie_list[0:10]

array(['1', '2', '3', '4', '5', '6', '7', '8', '9', '10'], dtype=object)

# Additional Evaluation & Analysis

* Get list of past "likes" (or other specified rating) for a given user

In [12]:
def map_names(df):
    ## This function maps the movie_id from the primary dataframe to the movie name and release year in the movie titles dataframe
    
    res_df = pd.merge(df, df_titles, on='movie_id', how='left')
    return res_df

def get_past_likes(df, user_id, score = 5):
    ## This function returns all movies that a given user has scored a particular value (likes, or dislikes as specified)
    
    # Get 'likes'
    df_user_likes = df[(df['user_id']==user_id) & (df['rating']==score)]
    # Map Title
    df_user_likes = map_names(df_user_likes)
    return df_user_likes

In [13]:
# Example of past likes (score = 4)
get_past_likes(df, user_id = '1488844', score = 4)

Unnamed: 0,user_id,movie_id,rating,year,title
0,1488844,8,4,2004.0,What the #$*! Do We Know!?
1,1488844,195,4,2004.0,Chasing Freedom
2,1488844,268,4,1980.0,The Final Countdown
3,1488844,270,4,2001.0,Sex and the City: Season 4
4,1488844,285,4,1997.0,The Devil's Own
...,...,...,...,...,...
111,1488844,4330,4,1995.0,While You Were Sleeping
112,1488844,4341,4,2002.0,The Scorpion King
113,1488844,4364,4,1976.0,Network
114,1488844,4389,4,2003.0,A Man Apart


# Singular Value Decomposition / Collaborative Filtering


As previously discussed, the basic intuition behind collaborative filtering is that if User1 and User2 both like items A, B, and C, and User1 also likes item D, then we would expect User2 to like item D as well. Bilinear prediction of the target variable (e.g. "like", "rating") is a simple parametric method that has been highly successful. Additionally, we seek to organize the data into a matrix $R$ such that rows correspond to users, columns correspond to products/content, and elements correspond to ratings.

Let $\hat{R}$ be the matrix containing our predictions, $A$ a matrix with user embeddings in its rows, and $B$ a matrix with item embeddings in its columns. With $b_u$ and $c_i$ denoting bias terms for user $u$ and item $i$, our predicted rating for a given user and item is:

$$
\hat{R}_{u,i} = b_u + c_i + \sum_{j}A_{u,j} B_{j,i}
$$

These embeddings can be obtained through singular value decomposition (SVD) on the matrix $R$ (actual ratings). As above, the rows of $R$ correspond to users and the columns correspond to items (i.e. movies). SVD is the factorization of one matrix into three matrices:

$$
R = U \Sigma V^T
$$

From this factorization of our rating matrix, $R$, we can define:

$$
A = U\Sigma
$$
$$
B = V^T
$$
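
Because most entries of $R$ are missing, an exact decomposition cannot be computed directly. Surprise's `SVD` class, per its documentation, instead learns the biases and embeddings by stochastic gradient descent over the observed ratings, and adds the global mean $\mu$ to the prediction. The sketch below is a minimal pure-Python version of that idea, not Surprise's implementation; the function names, toy data, and hyperparameter values are assumptions chosen for illustration.

```python
import random


def train_mf(ratings, n_factors=2, n_epochs=200, lr=0.01, reg=0.02, seed=0):
    """Fit biases and embeddings to (user, item, rating) triples by SGD."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    mu = sum(r for _, _, r in ratings) / len(ratings)   # global mean
    b = {u: 0.0 for u in users}                          # user biases b_u
    c = {i: 0.0 for i in items}                          # item biases c_i
    # Small random embeddings: A[u] is a row of A, B[i] is a column of B
    A = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    B = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}

    def predict(u, i):
        # mu + b_u + c_i + sum_j A[u][j] * B[j][i]
        return mu + b[u] + c[i] + sum(a * q for a, q in zip(A[u], B[i]))

    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - predict(u, i)
            b[u] += lr * (err - reg * b[u])
            c[i] += lr * (err - reg * c[i])
            for j in range(n_factors):
                a, q = A[u][j], B[i][j]
                A[u][j] += lr * (err * q - reg * a)
                B[i][j] += lr * (err * a - reg * q)
    return predict


def training_rmse(predict, ratings):
    """Root mean squared error of predict() on the given triples."""
    squared = sum((r - predict(u, i)) ** 2 for u, i, r in ratings)
    return (squared / len(ratings)) ** 0.5


# Made-up toy ratings: (user, item, rating)
toy = [("u1", "A", 5), ("u1", "B", 4), ("u1", "C", 1), ("u1", "D", 2),
       ("u2", "A", 5), ("u2", "B", 5), ("u2", "C", 2),
       ("u3", "A", 1), ("u3", "C", 5), ("u3", "D", 4)]

before = training_rmse(train_mf(toy, n_epochs=0), toy)
after = training_rmse(train_mf(toy), toy)
print(f"training RMSE: {before:.3f} -> {after:.3f}")
```

The returned `predict` function only handles users and items seen during training; unknown ids would need a fallback such as the global mean.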



In [14]:
# from surprise import Reader, Dataset, SVD
# from surprise.model_selection import cross_validate

In [15]:
# Load Data
reader = Reader(rating_scale=(1,5))
data = Dataset.load_from_df(df[['user_id', 'movie_id', 'rating']], reader)

train = data.build_full_trainset()

In [16]:
# Define Model
svd = SVD(n_epochs=10, n_factors=50) # Defaults: n_epochs = 20, n_factors = 100

In [17]:
# Cross Validation to Estimate Error -- Optional (time consuming)
# Note: cross_validate() works on data, not the full trainset - they are different datastructures

#cross_validate(svd, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)
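
For reference, the two error measures requested above can be computed by hand as follows. This is a small sketch on made-up numbers, not output from the model.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: penalizes large misses more heavily."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


def mae(actual, predicted):
    """Mean absolute error: average size of a miss, in rating points."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


actual = [5, 4, 1, 3]              # made-up true ratings
predicted = [4.0, 4.0, 2.0, 5.0]   # made-up estimates

print(f"RMSE: {rmse(actual, predicted):.4f}")
print(f"MAE:  {mae(actual, predicted):.4f}")
```

Here the single two-point miss pulls RMSE above MAE, which is the usual reason to report both.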

In [18]:
# Fit Model
svd.fit(train)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7f48f9f19950>

In [19]:
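# Example Prediction (user_id, movie_id)
# Caution: raw ids in df are strings, so the integer ids below are unknown to the
# model and the estimate falls back to the global mean rating.
# Use svd.predict('1488844', '270') to score the trained user and movie.
svd.predict(1488844, 270)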

Prediction(uid=1488844, iid=270, r_ui=None, est=3.5996343025565563, details={'was_impossible': False})
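
The estimate above (about 3.5996) coincides with the global mean rating of the dataset. That is the value Surprise's `SVD` returns when neither the user nor the item is known to the trainset, which is the case here because integers were passed while the raw ids in `df` are strings. The check below recomputes the mean from the rating shares shown in the EDA section.

```python
# Rating shares from the rating distribution computed in the EDA section
rating_share = {1: 0.046487, 2: 0.101401, 3: 0.287031, 4: 0.336153, 5: 0.228928}

# Weighted average of the rating values
global_mean = sum(rating * share for rating, share in rating_share.items())
print(round(global_mean, 4))   # prints 3.5996
```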

### SVD Estimates for Past Ratings

In [20]:
def score_past_ratings(user_id):
    ## This function provides ratings estimates for all movies previously rated by a given user
    
    # Get subset of single user movie ratings 
    df_user = map_names(df[df['user_id']==user_id])
    # Get Estimates
    df_user['estimate'] = df_user['movie_id'].apply(lambda x: svd.predict(user_id, x).est)
    df_user = df_user.sort_values(by='estimate', ascending = False)
    
    return df_user

In [21]:
# Example Results
score_past_ratings(user_id = '785314')

Unnamed: 0,user_id,movie_id,rating,year,title,estimate
6,785314,175,5,1992.0,Reservoir Dogs,4.249013
95,785314,2803,4,1995.0,Pride and Prejudice,4.242454
83,785314,2452,5,2001.0,Lord of the Rings: The Fellowship of the Ring,4.232521
69,785314,2122,5,1999.0,Being John Malkovich,4.221058
50,785314,1625,4,1986.0,Aliens: Collector's Edition,4.215230
...,...,...,...,...,...,...
149,785314,4216,1,2001.0,Jurassic Park III,2.649102
146,785314,4123,1,1998.0,Patch Adams,2.636572
143,785314,4056,2,2001.0,Planet of the Apes,2.584696
74,785314,2200,4,2002.0,Collateral Damage,2.479519


### SVD Estimates for New Movies - Collaborative Recommendations

Calculate new movie recommendations for a given user. Note that the sorted list can also be used to infer which movies the user is not expected to rate highly.

In [22]:
def user_new_movie_recommendations(user_id):
    ## This function returns an ordered list of movie recommendations for a given user.
    
    # Get List of Movies User has not Rated
    user_list = df[df['user_id']==user_id]
    user_list = user_list['movie_id'].unique()
    user_list = np.setdiff1d(movie_list, user_list)
    
    # User DF
    df_movlist = pd.DataFrame(user_list, columns=['movie_id'])
    # Predictions
    df_movlist['estimate'] = df_movlist['movie_id'].apply(lambda x: svd.predict(user_id, x).est)
    
    # Sort & Map Names
    df_movlist = df_movlist.sort_values(by='estimate', ascending = False)
    df_movlist = map_names(df_movlist)
    
    return df_movlist


In [23]:
# Example Output
user_new_movie_recommendations(user_id = '785314')

Unnamed: 0,movie_id,estimate,year,title
0,2019,4.571664,2004.0,Samurai Champloo
1,1499,4.484786,2000.0,FLCL
2,722,4.475898,2003.0,The Wire: Season 1
3,2114,4.470789,2002.0,Firefly
4,3456,4.470002,2004.0,Lost: Season 1
...,...,...,...,...
4329,3575,1.304073,2005.0,The Worst Horror Movie Ever Made
4330,4202,1.279676,2004.0,Half-Caste
4331,2999,1.278149,2003.0,Bad Bizness
4332,1725,1.154746,2003.0,Ben & Arthur
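
When only the top few recommendations are needed, the full sort can be replaced by a partial selection. The sketch below shows the idea with the standard library; `top_n_unseen` and the scores are made up for illustration.

```python
import heapq


def top_n_unseen(all_items, seen_items, estimate, n=5):
    """Return the n unseen items with the highest estimated rating."""
    seen = set(seen_items)
    candidates = (item for item in all_items if item not in seen)
    return heapq.nlargest(n, candidates, key=estimate)


# Made-up estimates keyed by movie_id
scores = {'1': 3.2, '2': 4.6, '3': 2.1, '4': 4.9, '5': 3.8}

# Movie '4' has already been rated, so it is excluded
print(top_n_unseen(scores, seen_items=['4'], estimate=scores.get, n=2))
# prints ['2', '5']
```

In this notebook, `estimate` could be a function such as `lambda m: svd.predict(user_id, m).est`, with `movie_list` as `all_items`.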


# Recommend Movies Based on Individual Movie

As opposed to the bilinear prediction estimated with SVD, we can also perform a K-Nearest Neighbors calculation to make recommendations based upon a single movie. Specifically, for a given movie, we find its nearest neighbors in "user space": each movie is described by the ratings it received from users, and two movies are close when the same users rate them similarly. For example, if 'Movie A' is characterized by a certain group of users rating it a "5", its nearest neighbors are the movies that those same users also rated highly.
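
To make "nearest in user space" concrete, the sketch below computes a Pearson correlation between two movies over the users who rated both, which is the kind of similarity the `'pearson'` option refers to. It is an illustrative pure-Python version with made-up ratings, not Surprise's implementation.

```python
import math


def pearson_item_similarity(ratings_a, ratings_b):
    """Pearson correlation between two items over their common users.

    ratings_a and ratings_b map user_id -> rating for each item.
    Returns 0.0 when the correlation is undefined.
    """
    common = sorted(set(ratings_a) & set(ratings_b))
    if len(common) < 2:
        return 0.0
    xs = [ratings_a[u] for u in common]
    ys = [ratings_b[u] for u in common]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


# Made-up ratings: user_id -> rating
movie_a = {'u1': 5, 'u2': 3, 'u3': 1}
movie_b = {'u1': 4, 'u2': 3, 'u3': 2, 'u4': 5}   # same ordering as movie_a
movie_c = {'u1': 1, 'u2': 3, 'u3': 5}            # opposite ordering

print(pearson_item_similarity(movie_a, movie_b))
print(pearson_item_similarity(movie_a, movie_c))
```

Note that two movies sharing only a couple of raters can reach a correlation of exactly 1 or -1, so many ties at the top of a neighbor list are possible for rarely rated titles.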




In [24]:
sim_options = {
    'name': 'pearson', 
    'user_based': False
}

clf = KNNBasic(k=40, k_min=1, sim_options = sim_options)

In [25]:
# Cross Validation to Estimate Error -- Optional (time consuming)
# Note: cross_validate() works on data, not the full trainset - they are different datastructures

#cross_validate(clf, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

In [26]:
# Fit Model to Training Data
clf.fit(train)

Computing the pearson similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBasic at 0x7f490bc06e10>

In [27]:
# Sample neighbors - get_neighbors(inner_item_id, num_nearest_neighbors)
# Note: Surprise takes and returns inner ids here, not raw movie_id values
clf.get_neighbors(1, 5)

[10, 19, 33, 40, 41]

In [28]:
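def movies_return_similar(movie_name, num_results=10):
    ## This function returns the movies most similar to a given movie title (item-based KNN)
    
    # Get movie_id from name
    movie_id = df_titles[df_titles['title']==movie_name].iloc[0]['movie_id']
    # Calculate KNN
    # Caution: get_neighbors() takes and returns Surprise inner ids; int(movie_id) is a raw id,
    # so the returned titles may not be the true neighbors unless ids are converted
    # via train.to_inner_iid() and train.to_raw_iid()
    results = clf.get_neighbors(int(movie_id), num_results)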
    # Results into DataFrame format and map movie names
    results = pd.DataFrame(results, columns=['movie_id']).astype({'movie_id': 'str'})
    results = map_names(results)
    
    return results

In [29]:
movies_return_similar('Dinosaur Planet', num_results=10)

Unnamed: 0,movie_id,year,title
0,10,2001.0,Fighter
1,19,2000.0,By Dawn's Early Light
2,33,2000.0,Aqua Teen Hunger Force: Vol. 1
3,40,2004.0,Pitcher and the Pin-Up
4,41,2000.0,Horror Vision
5,52,2002.0,The Weather Underground
6,62,1991.0,Ken Burns' America: Empire of the Air
7,92,2002.0,ECW: Cyberslam '99
8,122,2002.0,Cube 2: Hypercube
9,129,2003.0,Darkwolf
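
Surprise's `get_neighbors` works with inner ids, while the `movie_id` values in this notebook are raw string ids. The helper below is a sketch of a lookup that converts in both directions using the trainset's `to_inner_iid` and `to_raw_iid` methods. The name `similar_raw_ids` is hypothetical, and the function has not been run against the data in this notebook.

```python
def similar_raw_ids(algo, trainset, raw_item_id, k=10):
    """Return the raw ids of the k nearest neighbors of raw_item_id.

    Assumes `algo` is a fitted item-based Surprise KNN algorithm and
    `trainset` is the Trainset it was fitted on.
    """
    inner_id = trainset.to_inner_iid(raw_item_id)            # raw -> inner
    neighbor_inner_ids = algo.get_neighbors(inner_id, k=k)   # inner ids
    return [trainset.to_raw_iid(i) for i in neighbor_inner_ids]  # inner -> raw
```

With the objects defined above, a call would look like `similar_raw_ids(clf, train, '1', k=10)`, and the resulting ids could then be passed to `map_names` after being placed in a DataFrame.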


# Conclusions

* Collaborative filtering models are an effective way to put user behavior data to use in recommending content
* Explicit data can be utilized (user ratings), though features can also be calculated with implicit data (e.g. did someone stop watching a movie after 5 minutes?)
* Collaborative filtering can be used to make recommendations for a given user or based on a single product
* The user-level collaborative filtering model used in this notebook is based upon singular value decomposition, while the movie-to-movie recommendations use K-Nearest Neighbors
* "Surprise", a Python scikit, is a very useful and efficient tool for building recommender systems, even with particularly large, sparse matrices