# Recommender Systems part 2

* Welcome to my kernel. This is the second notebook in my final project series, written after finishing a three-month Data Science class at Purwadhika Jakarta.
* I have kept it simple for presentation purposes.

# Collaborative Filtering

Collaborative filtering is one of the most common ways to recommend products online. It is “collaborative” because it predicts a given customer's preferences on the basis of other customers' preferences.

* Collaborative filtering works by building a database (a user-item matrix) of users' preferences for items.
* It then matches users with similar interests and preferences by calculating similarities between their profiles.
* A user is recommended items that they have not rated before but that users in their neighborhood have already rated positively.

![medium](https://miro.medium.com/max/1400/1*7uW5hLXztSu_FOmZOWpB6g.png)
source: [medium](https://towardsdatascience.com/various-implementations-of-collaborative-filtering-100385c6dfe0)
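
The three steps above can be sketched on a toy example. The `ratings` dictionary, the user names, and the `recommend` helper are made up for illustration and are not part of this notebook's dataset. Similarity here is a plain cosine over co-rated items, which is only one of several possible choices.

```python
import math

# Hypothetical toy ratings: user -> {item: rating}
ratings = {
    "ana":  {"Toy Story": 5, "Jumanji": 4, "Heat": 1},
    "budi": {"Toy Story": 5, "Jumanji": 5, "Babe": 4},
    "cici": {"Heat": 5, "Casino": 4, "Toy Story": 1},
}

def cosine(u, v):
    """Cosine similarity between two users over the items both have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in common))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)

def recommend(user, k=1, min_rating=4):
    """Items the user has not rated that the k most similar users rated >= min_rating."""
    others = [(cosine(ratings[user], ratings[o]), o) for o in ratings if o != user]
    neighbors = [o for _, o in sorted(others, reverse=True)[:k]]
    seen = set(ratings[user])
    recs = []
    for n in neighbors:
        for item, r in ratings[n].items():
            if item not in seen and r >= min_rating and item not in recs:
                recs.append(item)
    return recs

print(recommend("ana"))  # ['Babe']
```

Here "budi" is the nearest neighbor of "ana" (they agree on two films), so the one film "budi" liked that "ana" has not seen is recommended.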

# Model-Based Collaborative Filtering

Model-based collaborative filtering is commonly built on matrix factorization (MF), an unsupervised learning method for latent variable decomposition and dimensionality reduction. Matrix factorization is widely used in recommender systems because it handles scalability and sparsity better than memory-based CF:

* The goal of MF is to learn the latent preferences of users and the latent attributes of items from the known ratings, and then to predict the unknown ratings as the dot product of the user and item latent features.
* When the user-item matrix is very sparse and high-dimensional, matrix factorization approximates it by the product of two low-rank matrices whose rows are the latent vectors of the users and of the items.
* These low-rank matrices are fitted so that their product approximates the original matrix as closely as possible on the known entries; the same product then fills in the entries missing from the original matrix.
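
A minimal sketch of this idea, assuming a tiny made-up rating matrix `R` (0 marks an unknown rating) and plain stochastic gradient descent with L2 regularization. The function names and the hyperparameters (`k`, `epochs`, `lr`, `reg`) are illustrative choices, not something used later in this notebook.

```python
import math
import random

# Hypothetical toy user-item matrix; 0 marks an unknown rating
R = [
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
]

def factorize(R, k=2, epochs=3000, lr=0.01, reg=0.02, seed=0):
    """Learn user factors P and item factors Q from the known (non-zero) entries of R."""
    rng = random.Random(seed)
    n_users, n_items = len(R), len(R[0])
    P = [[rng.random() for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.random() for _ in range(k)] for _ in range(n_items)]
    known = [(u, i) for u in range(n_users) for i in range(n_items) if R[u][i] > 0]
    for _ in range(epochs):
        for u, i in known:
            # Error of the current dot-product prediction on a known rating
            err = R[u][i] - sum(P[u][f] * Q[i][f] for f in range(k))
            for f in range(k):
                p, q = P[u][f], Q[i][f]
                P[u][f] += lr * (err * q - reg * p)
                Q[i][f] += lr * (err * p - reg * q)
    return P, Q

def predict(P, Q, u, i):
    """Predicted rating: dot product of the user's and the item's latent vectors."""
    return sum(pf * qf for pf, qf in zip(P[u], Q[i]))

P, Q = factorize(R)
known = [(u, i) for u in range(len(R)) for i in range(len(R[0])) if R[u][i] > 0]
rmse = math.sqrt(sum((R[u][i] - predict(P, Q, u, i)) ** 2 for u, i in known) / len(known))
print("RMSE on known ratings:", round(rmse, 3))
print("Predicted rating for user 0, item 2:", round(predict(P, Q, 0, 2), 2))
```

Only the known entries enter the loss, so the zeros are treated as missing values rather than as ratings of 0; the product of `P` and `Q` then supplies a prediction for every empty cell.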

> # Because the data is big and sparse, I prefer the model-based approach for this dataset

In [27]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import os
import re

from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objs as go
import plotly.offline as py
py.init_notebook_mode(connected=True)

import warnings
warnings.filterwarnings('ignore')

plt.style.use('fivethirtyeight')
plt.rcParams['figure.figsize'] = [18, 8]

# Load Dataset & Data Cleaning

In [28]:
reviews = pd.read_csv('ml-1m/ratings.dat', names=['userId', 'movieId', 'rating', 'time'], delimiter='::', engine='python', encoding='ISO-8859-1')
movies = pd.read_csv('ml-1m/movies.dat', names=['movieId', 'movie_names', 'genres'], delimiter='::', engine='python', encoding='ISO-8859-1')
users = pd.read_csv('ml-1m/users.dat', names=['userId', 'gender', 'age', 'occupation', 'zip'], delimiter='::', engine='python', encoding='ISO-8859-1')

print('Reviews shape:', reviews.shape)
print('Users shape:', users.shape)
print('Movies shape:', movies.shape)

Reviews shape: (1000209, 4)
Users shape: (6040, 5)
Movies shape: (3883, 3)


In [29]:
reviews.drop(['time'], axis=1, inplace=True)
users.drop(['zip'], axis=1, inplace=True)

In [30]:
movies['release_year'] = movies['movie_names'].str.extract(r'(?:\((\d{4})\))?\s*$', expand=False)

In [31]:
movies.head()

,movieId,movie_names,genres,release_year
0,1,Toy Story (1995),Animation|Children's|Comedy,1995
1,2,Jumanji (1995),Adventure|Children's|Fantasy,1995
2,3,Grumpier Old Men (1995),Comedy|Romance,1995
3,4,Waiting to Exhale (1995),Comedy|Drama,1995
4,5,Father of the Bride Part II (1995),Comedy,1995
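
To make the year extraction easier to follow, the same regular expression can be tried on single titles with the standard `re` module. The helper name `extract_year` and the title "Untitled Movie" are illustrative assumptions; the pattern itself is the one used in the cell above.

```python
import re

# Same pattern as in the cell above: an optional "(YYYY)" group,
# followed by optional whitespace, anchored at the end of the title
YEAR_PATTERN = re.compile(r'(?:\((\d{4})\))?\s*$')

def extract_year(title):
    """Return the trailing four-digit year as a string, or None if the title has none."""
    return YEAR_PATTERN.search(title).group(1)

titles = [
    "Toy Story (1995)",
    "Zero Kelvin (Kjærlighetens kjøtere) (1995)",  # earlier parentheses are skipped
    "Untitled Movie",                               # hypothetical title without a year
]
for title in titles:
    print(title, "->", extract_year(title))
```

Because the year group is optional, the pattern always matches; a title without a trailing year yields `None` here, which `str.extract` reports as `NaN`.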


### Since we won't use age and occupation for prediction in this kernel, I mapped their coded values to the labels in the original dataset's README so that visualizations are easier to read

In [32]:
ages_map = {1: 'Under 18',
            18: '18 - 24',
            25: '25 - 34',
            35: '35 - 44',
            45: '45 - 49',
            50: '50 - 55',
            56: '56+'}

occupations_map = {0: 'Not specified',
                   1: 'Academic / Educator',
                   2: 'Artist',
                   3: 'Clerical / Admin',
                   4: 'College / Grad Student',
                   5: 'Customer Service',
                   6: 'Doctor / Health Care',
                   7: 'Executive / Managerial',
                   8: 'Farmer',
                   9: 'Homemaker',
                   10: 'K-12 student',
                   11: 'Lawyer',
                   12: 'Programmer',
                   13: 'Retired',
                   14: 'Sales / Marketing',
                   15: 'Scientist',
                   16: 'Self-Employed',
                   17: 'Technician / Engineer',
                   18: 'Tradesman / Craftsman',
                   19: 'Unemployed',
                   20: 'Writer'}

gender_map = {'M': 'Male', 'F': 'Female'}

users['age'] = users['age'].map(ages_map)
users['occupation'] = users['occupation'].map(occupations_map)
users['gender'] = users['gender'].map(gender_map)

### Merge all datasets

In [33]:
final_df = reviews.merge(movies, on='movieId', how='left').merge(users, on='userId', how='left')

print('final_df shape:', final_df.shape)

final_df shape: (1000209, 9)


In [34]:
final_df.head()

,userId,movieId,rating,movie_names,genres,release_year,gender,age,occupation
0,1,1193,5,One Flew Over the Cuckoo's Nest (1975),Drama,1975,Female,Under 18,K-12 student
1,1,661,3,James and the Giant Peach (1996),Animation|Children's|Musical,1996,Female,Under 18,K-12 student
2,1,914,3,My Fair Lady (1964),Musical|Romance,1964,Female,Under 18,K-12 student
3,1,3408,4,Erin Brockovich (2000),Drama,2000,Female,Under 18,K-12 student
4,1,2355,5,"Bug's Life, A (1998)",Animation|Children's|Comedy,1998,Female,Under 18,K-12 student


In [35]:
n_users = final_df['userId'].nunique()
n_movies = final_df['movieId'].nunique()

print('Number of users:', n_users)
print('Number of movies:', n_movies)

Number of users: 6040
Number of movies: 3706
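
These counts also show how sparse the data is. The short calculation below uses only the numbers printed above and assumes each user rated a given movie at most once.

```python
# Counts taken from the outputs above
n_ratings = 1_000_209   # rows in final_df
n_users = 6040
n_movies = 3706

n_cells = n_users * n_movies      # size of the full user-item matrix
density = n_ratings / n_cells     # share of cells that hold a rating
sparsity = 1 - density            # share of cells that are empty

print(f"cells: {n_cells:,}")                                  # cells: 22,384,240
print(f"density: {density:.2%}, sparsity: {sparsity:.2%}")    # density: 4.47%, sparsity: 95.53%
```

Roughly 95.5% of the user-movie pairs have no rating.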


In [36]:
final_df_matrix = final_df.pivot(index='userId',
                                 columns='movie_names',
                                 values='rating').fillna(0)

In [37]:
final_df_matrix

userId,"$1,000,000 Duck (1971)",'Night Mother (1986),'Til There Was You (1997),"'burbs, The (1989)",...And Justice for All (1979),1-900 (1994),10 Things I Hate About You (1999),101 Dalmatians (1961),101 Dalmatians (1996),12 Angry Men (1957),...,"Young Poisoner's Handbook, The (1995)",Young Sherlock Holmes (1985),Young and Innocent (1937),Your Friends and Neighbors (1998),Zachariah (1971),"Zed & Two Noughts, A (1985)",Zero Effect (1998),Zero Kelvin (Kjærlighetens kjøtere) (1995),Zeus and Roxanne (1997),eXistenZ (1999)
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6036,0.0,3.0,0.0,0.0,0.0,0.0,2.0,4.0,0.0,0.0,...,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0
6037,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6038,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6039,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [38]:
final_df_matrix = final_df_matrix.fillna(0)
final_df_matrix


In [40]:
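# Mean-normalize each movie column: subtract the column mean and divide by the column range.
# DataFrame.apply works column-wise by default (axis=0), so each call receives one movie's ratings.
def standardize(col):
    return (col - col.mean()) / (col.max() - col.min())

final_df_matrix = final_df_matrix.apply(standardize)
final_df_matrix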


userId,"$1,000,000 Duck (1971)",'Night Mother (1986),'Til There Was You (1997),"'burbs, The (1989)",...And Justice for All (1979),1-900 (1994),10 Things I Hate About You (1999),101 Dalmatians (1961),101 Dalmatians (1996),12 Angry Men (1957),...,"Young Poisoner's Handbook, The (1995)",Young Sherlock Holmes (1985),Young and Innocent (1937),Your Friends and Neighbors (1998),Zachariah (1971),"Zed & Two Noughts, A (1985)",Zero Effect (1998),Zero Kelvin (Kjærlighetens kjøtere) (1995),Zeus and Roxanne (1997),eXistenZ (1999)
1,-0.003709,-0.007815,-0.004636,-0.029205,-0.02447,-0.000276,-0.079338,-0.067285,-0.036722,-0.087616,...,-0.009503,-0.04255,-0.001093,-0.012185,-0.000232,-0.003278,-0.037384,-0.00029,-0.001921,-0.044205
2,-0.003709,-0.007815,-0.004636,-0.029205,-0.02447,-0.000276,-0.079338,-0.067285,-0.036722,-0.087616,...,-0.009503,-0.04255,-0.001093,-0.012185,-0.000232,-0.003278,-0.037384,-0.00029,-0.001921,-0.044205
3,-0.003709,-0.007815,-0.004636,-0.029205,-0.02447,-0.000276,-0.079338,-0.067285,-0.036722,-0.087616,...,-0.009503,-0.04255,-0.001093,-0.012185,-0.000232,-0.003278,-0.037384,-0.00029,-0.001921,-0.044205
4,-0.003709,-0.007815,-0.004636,-0.029205,-0.02447,-0.000276,-0.079338,-0.067285,-0.036722,-0.087616,...,-0.009503,-0.04255,-0.001093,-0.012185,-0.000232,-0.003278,-0.037384,-0.00029,-0.001921,-0.044205
5,-0.003709,-0.007815,-0.004636,-0.029205,-0.02447,-0.000276,-0.079338,-0.067285,-0.036722,-0.087616,...,-0.009503,-0.04255,-0.001093,-0.012185,-0.000232,-0.003278,-0.037384,-0.00029,-0.001921,-0.044205
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6036,-0.003709,0.592185,-0.004636,-0.029205,-0.02447,-0.000276,0.320662,0.732715,-0.036722,-0.087616,...,-0.009503,0.55745,-0.001093,-0.012185,-0.000232,-0.003278,-0.037384,-0.00029,-0.001921,0.355795
6037,-0.003709,-0.007815,-0.004636,-0.029205,-0.02447,-0.000276,-0.079338,-0.067285,-0.036722,0.712384,...,-0.009503,-0.04255,-0.001093,-0.012185,-0.000232,-0.003278,-0.037384,-0.00029,-0.001921,-0.044205
6038,-0.003709,-0.007815,-0.004636,-0.029205,-0.02447,-0.000276,-0.079338,-0.067285,-0.036722,-0.087616,...,-0.009503,-0.04255,-0.001093,-0.012185,-0.000232,-0.003278,-0.037384,-0.00029,-0.001921,-0.044205
6039,-0.003709,-0.007815,-0.004636,-0.029205,-0.02447,-0.000276,-0.079338,-0.067285,-0.036722,-0.087616,...,-0.009503,0.55745,-0.001093,-0.012185,-0.000232,-0.003278,-0.037384,-0.00029,-0.001921,-0.044205
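
The normalization can be read back from this output. The worked example below uses the 'Night Mother (1986) column; the column range and mean are inferred from the printed values, since the notebook does not print them directly.

```latex
% Normalization applied to each movie column c (u indexes users)
\tilde{r}_{u,c} = \frac{r_{u,c} - \bar{r}_c}{\max_u r_{u,c} - \min_u r_{u,c}}

% A rating of 3 (user 6036) and an unrated 0 differ by 3 / range:
\frac{3}{\max_u r_{u,c} - \min_u r_{u,c}} = 0.592185 - (-0.007815) = 0.6
\quad\Longrightarrow\quad \max_u r_{u,c} - \min_u r_{u,c} = 5

% An unrated cell (stored as 0) gives the column mean:
\frac{0 - \bar{r}_c}{5} = -0.007815
\quad\Longrightarrow\quad \bar{r}_c = 0.039075
```

A column mean of about 0.04 on a 1 to 5 scale shows how strongly the zeros that stand in for missing ratings dominate each column. This is why every unrated cell ends up slightly negative rather than exactly zero.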


In [41]:
from sklearn.metrics.pairwise import cosine_similarity

item_similarity = cosine_similarity(final_df_matrix.T)
item_similarity_df = pd.DataFrame(item_similarity, index=final_df_matrix.columns, columns=final_df_matrix.columns)
item_similarity_df

movie_names,"$1,000,000 Duck (1971)",'Night Mother (1986),'Til There Was You (1997),"'burbs, The (1989)",...And Justice for All (1979),1-900 (1994),10 Things I Hate About You (1999),101 Dalmatians (1961),101 Dalmatians (1996),12 Angry Men (1957),...,"Young Poisoner's Handbook, The (1995)",Young Sherlock Holmes (1985),Young and Innocent (1937),Your Friends and Neighbors (1998),Zachariah (1971),"Zed & Two Noughts, A (1985)",Zero Effect (1998),Zero Kelvin (Kjærlighetens kjøtere) (1995),Zeus and Roxanne (1997),eXistenZ (1999)
"$1,000,000 Duck (1971)",1.000000,0.065338,0.030805,0.065478,0.048708,-0.001319,0.036612,0.176528,0.159973,0.075665,...,0.030758,0.060574,-0.002833,0.035056,-0.001237,0.040590,0.024165,-0.001332,0.116574,0.009243
'Night Mother (1986),0.065338,1.000000,0.107374,0.096778,0.144480,-0.001834,0.046122,0.123378,0.074706,0.083988,...,0.042061,0.065333,0.060202,0.124593,-0.001719,0.085001,0.054343,-0.001852,-0.005825,0.054699
'Til There Was You (1997),0.030805,0.107374,1.000000,0.082706,0.051965,0.079011,0.105672,0.091422,0.108952,0.054821,...,0.019685,0.043294,-0.003341,0.068935,-0.001459,0.016935,0.062262,-0.001571,0.042842,0.043483
"'burbs, The (1989)",0.065478,0.096778,0.082706,1.000000,0.110790,-0.003821,0.133881,0.198168,0.134041,0.113112,...,0.092597,0.165666,0.012226,0.114837,-0.003582,0.042862,0.121617,-0.003858,0.022249,0.062471
...And Justice for All (1979),0.048708,0.144480,0.051965,0.110790,1.000000,-0.003203,0.018614,0.151020,0.078917,0.160556,...,0.071819,0.115402,0.061253,0.088616,-0.003002,0.075720,0.075804,0.072283,-0.010171,0.071000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
"Zed & Two Noughts, A (1985)",0.040590,0.085001,0.016935,0.042862,0.075720,-0.001186,-0.009530,0.030622,0.003318,0.019564,...,0.040208,0.059839,0.068069,0.134474,-0.001112,1.000000,0.071585,0.124038,-0.003766,0.125207
Zero Effect (1998),0.024165,0.054343,0.062262,0.121617,0.075804,-0.003931,0.114230,0.088855,0.047345,0.070705,...,0.159991,0.124075,0.013451,0.185265,-0.003684,0.071585,1.000000,0.056690,0.004788,0.199973
Zero Kelvin (Kjærlighetens kjøtere) (1995),-0.001332,-0.001852,-0.001571,-0.003858,0.072283,-0.000322,-0.006235,0.032238,-0.004277,0.032880,...,0.046727,0.043840,-0.000690,-0.002315,-0.000301,0.124038,0.056690,1.000000,-0.001021,0.042534
Zeus and Roxanne (1997),0.116574,-0.005825,0.042842,0.022249,-0.010171,-0.001011,0.042612,0.084739,0.059812,0.043186,...,-0.006375,0.034016,-0.002171,0.017447,-0.000948,-0.003766,0.004788,-0.001021,1.000000,0.031481
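
For reference, each entry of this matrix is the cosine of the angle between two movie columns. The sketch below computes it by hand for made-up vectors; the helper name `cosine_similarity_vec` and the three example columns are illustrative and are not taken from the dataset.

```python
import math

def cosine_similarity_vec(a, b):
    """cos(a, b) = (a . b) / (||a|| * ||b||) for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical normalized rating columns for three movies (one value per user)
movie_a = [0.6, -0.1, -0.1, 0.4]
movie_b = [0.5, -0.2, -0.1, 0.3]   # liked by the same users as movie_a
movie_c = [-0.1, 0.7, 0.5, -0.1]   # liked by the other users

print(round(cosine_similarity_vec(movie_a, movie_b), 3))  # close to 1: similar audiences
print(round(cosine_similarity_vec(movie_a, movie_c), 3))  # negative: opposite audiences
```

`cosine_similarity(final_df_matrix.T)` applies the same formula to every pair of movies at once, which is why the diagonal of the result is 1.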


In [46]:
def get_similar_item(name, rating):
    # Scale the movie's similarity column by (rating - 2.5): a rating above 2.5
    # ranks similar movies first, a rating below 2.5 ranks dissimilar movies first
    similar_score = item_similarity_df[name] * (rating - 2.5)
    return similar_score.sort_values(ascending=False)

rated_movie = "One Flew Over the Cuckoo's Nest (1975)"
result = get_similar_item(rated_movie, 1)

# Convert the scores to a two-column DataFrame for display
temp_result = pd.DataFrame(list(result.items()), columns=["movie", "predicted score"])

# Remove the movies the user has already rated
temp_result[~temp_result['movie'].isin([rated_movie])].head(20)

,movie,predicted score
0,Runaway Bride (1999),0.032272
1,"Saltmen of Tibet, The (1997)",0.032177
2,"Bachelor, The (1999)",0.031594
3,Quest for Camelot (1998),0.026879
4,Alien Escape (1995),0.026677
5,Center Stage (2000),0.026367
6,Wing Commander (1999),0.024915
7,"Skulls, The (2000)",0.02422
8,Man of the House (1995),0.023045
9,"First Love, Last Rites (1997)",0.022747
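
The same scoring can be extended to a user who has rated several movies by summing the scaled similarity columns and then dropping the movies already rated. The sketch below is a hypothetical extension, not part of the notebook: the `similarity` table, the movie names, and `recommend_for_user` are made up, and only the `rating - 2.5` offset is taken from `get_similar_item`.

```python
# Hypothetical item-item similarities (symmetric, 1.0 on the diagonal)
similarity = {
    "Movie A": {"Movie A": 1.0, "Movie B": 0.6, "Movie C": 0.1, "Movie D": 0.3},
    "Movie B": {"Movie A": 0.6, "Movie B": 1.0, "Movie C": 0.2, "Movie D": 0.5},
    "Movie C": {"Movie A": 0.1, "Movie B": 0.2, "Movie C": 1.0, "Movie D": 0.7},
    "Movie D": {"Movie A": 0.3, "Movie B": 0.5, "Movie C": 0.7, "Movie D": 1.0},
}

def recommend_for_user(user_ratings, offset=2.5, top_n=5):
    """Sum similarity * (rating - offset) over all rated movies, then drop the rated ones."""
    scores = {}
    for name, rating in user_ratings.items():
        for other, sim in similarity[name].items():
            scores[other] = scores.get(other, 0.0) + sim * (rating - offset)
    for name in user_ratings:
        scores.pop(name, None)  # do not recommend what the user has already rated
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

# The user loved Movie A and disliked Movie C
for movie, score in recommend_for_user({"Movie A": 5, "Movie C": 1}):
    print(movie, round(score, 2))
```

"Movie B" ranks first because it resembles the liked movie far more than the disliked one, while "Movie D" is pulled down by its similarity to the disliked movie.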
