# Recommender Systems part 2

* Welcome to my kernel. This is the second notebook in my final project series, written after finishing a 3-month Data Science class at Purwadhika Jakarta.
* I keep it simple for presentation purposes.

# Collaborative Filtering

Collaborative filtering is one of the most common ways to do product recommendation online. It’s “collaborative” because it predicts a given customer's preferences on the basis of other customers' preferences.

* Collaborative filtering works by building a database (a user-item matrix) of users' preferences for items.
* It then matches users with similar interests and preferences by calculating similarities between their profiles.
* A user gets recommendations for items that they have not rated before but that were rated positively by users in their neighborhood.
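
To make the three steps above concrete, here is a minimal sketch of user-based (neighborhood) collaborative filtering on a toy rating table. The users, the ratings, and the helper names `cosine_similarity` and `recommend` are all made up for illustration; this is not the method used later in this kernel.

```python
from math import sqrt

# Toy user-item "matrix": user -> {item: rating}. All values are invented.
ratings = {
    'alice': {'Toy Story': 5, 'Jumanji': 4, 'Heat': 1},
    'bob':   {'Toy Story': 5, 'Jumanji': 5, 'Casino': 2, 'Babe': 5},
    'carol': {'Heat': 5, 'Casino': 5, 'Toy Story': 1},
}

def cosine_similarity(a, b):
    """Cosine similarity between two users over the items both have rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = sqrt(sum(a[i] ** 2 for i in common))
    norm_b = sqrt(sum(b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b)

def recommend(user, ratings, min_rating=4):
    """Items the most similar neighbor rated positively and `user` has not rated."""
    neighbors = sorted(
        ((cosine_similarity(ratings[user], ratings[other]), other)
         for other in ratings if other != user),
        reverse=True,
    )
    _, nearest = neighbors[0]
    return sorted(item for item, r in ratings[nearest].items()
                  if r >= min_rating and item not in ratings[user])

print(recommend('alice', ratings))  # ['Babe']
```

Here `bob` is the closest neighbor of `alice`, so the only suggestion is the one movie he rated highly that she has not rated yet.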

![medium](https://miro.medium.com/max/1400/1*7uW5hLXztSu_FOmZOWpB6g.png)
source: [medium](https://towardsdatascience.com/various-implementations-of-collaborative-filtering-100385c6dfe0)

# Model-Based Collaborative Filtering

Model-based collaborative filtering is based on matrix factorization (MF), which is widely used as an unsupervised learning method for latent variable decomposition and dimensionality reduction. In recommender systems, MF typically handles scalability and sparsity better than memory-based CF:

* The goal of MF is to learn the latent preferences of users and the latent attributes of items from the known ratings, and then to predict the unknown ratings through the dot product of the users' and items' latent features.
* When the user-item matrix is very sparse and high-dimensional, matrix factorization restructures it into a low-rank form: the matrix is represented as the product of two low-rank matrices whose rows contain the latent vectors.
* You fit these low-rank matrices so that their product approximates the original matrix as closely as possible; the product then fills in the entries that are missing in the original matrix.
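
The idea in the bullets above can be sketched in a few lines of plain Python. This is only an illustration under assumed settings: the toy ratings, the number of latent factors `k`, the learning rate `lr`, and the regularization strength `reg` are all invented, and the fit uses simple stochastic gradient descent rather than the SVD routines used later in this kernel.

```python
import random

random.seed(0)

# Toy ratings; None marks an unknown rating. All values are invented.
R = [
    [5, 4, None, 1],
    [4, None, 1, 1],
    [1, 1, 5, None],
    [None, 1, 4, 5],
]
n_users, n_items, k = len(R), len(R[0]), 2
known = [(u, i, R[u][i]) for u in range(n_users) for i in range(n_items)
         if R[u][i] is not None]

# Latent factors: one k-dimensional vector per user (P) and per item (Q)
P = [[random.uniform(0, 0.5) for _ in range(k)] for _ in range(n_users)]
Q = [[random.uniform(0, 0.5) for _ in range(k)] for _ in range(n_items)]

def predict(u, i):
    """Predicted rating = dot product of the user and item latent vectors."""
    return sum(P[u][f] * Q[i][f] for f in range(k))

def rmse():
    """Root mean squared error over the known ratings only."""
    return (sum((r - predict(u, i)) ** 2 for u, i, r in known) / len(known)) ** 0.5

initial_rmse = rmse()
lr, reg = 0.01, 0.02  # assumed learning rate and L2 regularization strength
for epoch in range(2000):
    for u, i, r in known:
        err = r - predict(u, i)
        for f in range(k):
            p_uf, q_if = P[u][f], Q[i][f]
            P[u][f] += lr * (err * q_if - reg * p_uf)
            Q[i][f] += lr * (err * p_uf - reg * q_if)
final_rmse = rmse()

print('RMSE on known ratings: {:.3f} -> {:.3f}'.format(initial_rmse, final_rmse))
print('Filled-in rating for user 0, item 2: {:.2f}'.format(predict(0, 2)))
```

The last line shows the "fill in the missing entries" step: user 0 never rated item 2, yet the product of the two factor matrices yields an estimate for that cell.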

> # Because the data is big & sparse, I prefer the model-based approach for this dataset

In [2]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import os
import re

from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objs as go
import plotly.offline as py
py.init_notebook_mode(connected=True)

import warnings
warnings.filterwarnings('ignore')

plt.style.use('fivethirtyeight')
plt.rcParams['figure.figsize'] = [18, 8]

# Load Dataset & Data Cleaning

In [6]:
reviews = pd.read_csv('ml-1m/ratings.dat', names=['userId', 'movieId', 'rating', 'time'], delimiter='::', engine='python', encoding='ISO-8859-1')
movies = pd.read_csv('ml-1m/movies.dat', names=['movieId', 'movie_names', 'genres'], delimiter='::', engine='python', encoding='ISO-8859-1')
users = pd.read_csv('ml-1m/users.dat', names=['userId', 'gender', 'age', 'occupation', 'zip'], delimiter='::', engine='python', encoding='ISO-8859-1')

print('Reviews shape:', reviews.shape)
print('Users shape:', users.shape)
print('Movies shape:', movies.shape)

Reviews shape: (1000209, 4)
Users shape: (6040, 5)
Movies shape: (3883, 3)


In [7]:
reviews.drop(['time'], axis=1, inplace=True)
users.drop(['zip'], axis=1, inplace=True)

In [8]:
movies['release_year'] = movies['movie_names'].str.extract(r'(?:\((\d{4})\))?\s*$', expand=False)
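
A quick check of what this pattern captures, sketched with the standard `re` module on plain strings. The helper name `extract_year` and the last title are hypothetical; the last one is there to show the no-year case, which is the case that pandas would report as a missing value.

```python
import re

# Same pattern as above: an optional "(YYYY)" group, then optional whitespace, at the end of the string
pattern = re.compile(r'(?:\((\d{4})\))?\s*$')

def extract_year(title):
    """Return the trailing 4-digit year as a string, or None if the title has none."""
    return pattern.search(title).group(1)

print(extract_year('Toy Story (1995)'))              # 1995
print(extract_year("Bug's Life, A (1998)  "))        # 1998 (trailing spaces are tolerated)
print(extract_year('2001: A Space Odyssey (1968)'))  # 1968 (only the year in parentheses counts)
print(extract_year('Untitled Movie'))                # None
```

Note that the extracted year is a string, not an integer, so later sorting by `release_year` is a string sort; that works for four-digit years.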

In [9]:
movies.head()

Unnamed: 0,movieId,movie_names,genres,release_year
0,1,Toy Story (1995),Animation|Children's|Comedy,1995
1,2,Jumanji (1995),Adventure|Children's|Fantasy,1995
2,3,Grumpier Old Men (1995),Comedy|Romance,1995
3,4,Waiting to Exhale (1995),Comedy|Drama,1995
4,5,Father of the Bride Part II (1995),Comedy,1995


### Since we won't use age & occupation for prediction in this kernel, I replaced the values of these features with the labels from the original dataset's README, which makes the visualizations easier to understand

In [10]:
ages_map = {1: 'Under 18',
            18: '18 - 24',
            25: '25 - 34',
            35: '35 - 44',
            45: '45 - 49',
            50: '50 - 55',
            56: '56+'}

occupations_map = {0: 'Not specified',
                   1: 'Academic / Educator',
                   2: 'Artist',
                   3: 'Clerical / Admin',
                   4: 'College / Grad Student',
                   5: 'Customer Service',
                   6: 'Doctor / Health Care',
                   7: 'Executive / Managerial',
                   8: 'Farmer',
                   9: 'Homemaker',
                   10: 'K-12 student',
                   11: 'Lawyer',
                   12: 'Programmer',
                   13: 'Retired',
                   14: 'Sales / Marketing',
                   15: 'Scientist',
                   16: 'Self-Employed',
                   17: 'Technician / Engineer',
                   18: 'Tradesman / Craftsman',
                   19: 'Unemployed',
                   20: 'Writer'}

gender_map = {'M': 'Male', 'F': 'Female'}

users['age'] = users['age'].map(ages_map)
users['occupation'] = users['occupation'].map(occupations_map)
users['gender'] = users['gender'].map(gender_map)

### Merge all datasets

In [11]:
final_df = reviews.merge(movies, on='movieId', how='left').merge(users, on='userId', how='left')

print('final_df shape:', final_df.shape)

final_df shape: (1000209, 9)


In [12]:
final_df.head()

Unnamed: 0,userId,movieId,rating,movie_names,genres,release_year,gender,age,occupation
0,1,1193,5,One Flew Over the Cuckoo's Nest (1975),Drama,1975,Female,Under 18,K-12 student
1,1,661,3,James and the Giant Peach (1996),Animation|Children's|Musical,1996,Female,Under 18,K-12 student
2,1,914,3,My Fair Lady (1964),Musical|Romance,1964,Female,Under 18,K-12 student
3,1,3408,4,Erin Brockovich (2000),Drama,2000,Female,Under 18,K-12 student
4,1,2355,5,"Bug's Life, A (1998)",Animation|Children's|Comedy,1998,Female,Under 18,K-12 student


# Visualization

In [None]:
gender_counts = users['gender'].value_counts()

colors1 = ['dodgerblue', 'pink']

pie = go.Pie(labels=gender_counts.index,
             values=gender_counts.values,
             marker=dict(colors=colors1),
             hole=0.5)

layout = go.Layout(title='Male & Female users', font=dict(size=18), legend=dict(orientation='h'))

fig = go.Figure(data=[pie], layout=layout)
py.iplot(fig)

In [None]:
age_reindex = ['Under 18', '18 - 24', '25 - 34', '35 - 44', '45 - 49', '50 - 55', '56+']

age_counts = users['age'].value_counts().reindex(age_reindex)

sns.barplot(x=age_counts.values,
            y=age_counts.index,
            palette='magma').set_title(
                'Users age', fontsize=24)

plt.show()

* Wow, the majority of users are aged 25 - 34
* Let's check the top-7 movies they rated most often

In [None]:
final_df[final_df['age'] == '25 - 34']['movie_names'].value_counts()[:7]

In [None]:
occupation_counts = users['occupation'].value_counts().sort_values(ascending=False)

sns.barplot(x=occupation_counts.values,
            y=occupation_counts.index,
            palette='dark').set_title(
                'Occupation list', fontsize=14)

plt.show()

# Netflix 1 million prize

![awards](https://static01.nyt.com/images/2009/09/21/technology/netflixawards.480.jpg)
*source:* [The New York Times](https://bits.blogs.nytimes.com/2009/09/21/netflix-awards-1-million-prize-and-starts-a-new-contest/)

* In 2006, Netflix announced a competition to produce a better movie rating prediction system than their current one, at the time called Cinematch.
* The idea is that Netflix would like to predict how well its users might like individual movies, so that it could recommend movies to them.
* The more accurately they could predict users' ratings of movies, the better for their business.
* The grand prize was $1 million.

The winning entry for the famed Netflix Prize had a number of SVD models (including SVD++ blended with Restricted Boltzmann Machines).

> ### Using these methods they achieved a 10 percent improvement in prediction accuracy (RMSE) over Netflix’s existing algorithm.

# Singular Value Decomposition (SVD)

* A recommendation technique that is efficient on a small dataset may be unable to generate a satisfactory number of recommendations as the volume of data grows.
* Thus, it is crucial to apply recommendation techniques that scale well as the amount of data in the database increases.
* Methods for solving the scalability problem and speeding up recommendation generation are based on dimensionality reduction techniques, such as Singular Value Decomposition (SVD), which can produce reliable and efficient recommendations.
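
In symbols (the notation here is introduced only for illustration): for an $m \times n$ user-item rating matrix $R$, a rank-$k$ truncated SVD keeps the $k$ largest singular values and approximates

$$R \approx U_k \Sigma_k V_k^{T}, \qquad U_k \in \mathbb{R}^{m \times k}, \quad \Sigma_k = \mathrm{diag}(\sigma_1, \dots, \sigma_k), \quad V_k \in \mathbb{R}^{n \times k}$$

so a single predicted rating is $\hat{r}_{ui} = \sum_{f=1}^{k} (U_k)_{uf}\, \sigma_f\, (V_k)_{if}$. Storing the factors takes $k(m + n + 1)$ numbers instead of $mn$, which is where the scalability gain comes from when $k \ll \min(m, n)$.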

> ### Option 1: calculate SVD manually

In [None]:
n_users = final_df['userId'].nunique()
n_movies = final_df['movieId'].nunique()

print('Number of users:', n_users)
print('Number of movies:', n_movies)

In [None]:
final_df_matrix = final_df.pivot(index='userId',
                                 columns='movieId',
                                 values='rating').fillna(0)

In [None]:
final_df_matrix.head()

In [None]:
user_ratings_mean = np.mean(final_df_matrix.values, axis=1)
ratings_demeaned = final_df_matrix.values - user_ratings_mean.reshape(-1, 1)

In [None]:
# Check data sparsity

sparsity = 1.0 - final_df.shape[0] / float(n_users * n_movies)
print('The sparsity level of the MovieLens 1M dataset is {:.1f}%'.format(sparsity * 100))
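
As a worked example of the same formula on numbers small enough to check by hand (the helper name `sparsity_level` and the toy counts are assumptions for illustration): sparsity is the share of user-item cells that have no rating.

```python
def sparsity_level(n_ratings, n_users, n_items):
    """Share of the user-item matrix that is empty: 1 - known ratings / total cells."""
    return 1.0 - n_ratings / float(n_users * n_items)

# 4 users x 5 items = 20 cells; 6 known ratings leave 14 cells empty
toy_sparsity = sparsity_level(6, 4, 5)
print('{:.1f}%'.format(toy_sparsity * 100))  # 70.0%
```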

In [None]:
from scipy.sparse.linalg import svds

U, sigma, Vt = svds(ratings_demeaned, k=50)  # Number of singular values and vectors to compute

#### As I'm going to leverage matrix multiplication to get predictions, I'll convert $\Sigma$ (currently a 1-D array of singular values) to diagonal matrix form.

In [None]:
sigma = np.diag(sigma)

In [None]:
all_user_predicted_ratings = np.dot(np.dot(U, sigma), Vt) + user_ratings_mean.reshape(-1, 1)
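
The same demean, decompose, and reconstruct steps can be followed on a matrix small enough to print. This is a sketch under assumptions: the toy ratings are invented, and it swaps in NumPy's dense `np.linalg.svd` (keeping only the first `k` components) instead of `svds`, which is more convenient for a 4 x 4 example.

```python
import numpy as np

# Toy user-item matrix (rows = users, columns = movies); 0 means "not rated". Values are invented.
R = np.array([[5, 4, 0, 1],
              [4, 0, 1, 1],
              [1, 1, 5, 0],
              [0, 1, 4, 5]], dtype=float)

user_mean = R.mean(axis=1)                  # same demeaning step as above
R_demeaned = R - user_mean.reshape(-1, 1)

# np.linalg.svd returns the singular values in descending order
U, s, Vt = np.linalg.svd(R_demeaned, full_matrices=False)

def reconstruct(k):
    """Rank-k approximation of R: keep only the k largest singular values."""
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :] + user_mean.reshape(-1, 1)

# Frobenius-norm reconstruction error for k = 1..4
errors = [np.linalg.norm(R - reconstruct(k)) for k in range(1, 5)]
print(np.round(errors, 3))           # the error shrinks as k grows
print(np.round(reconstruct(2), 2))   # rank-2 "predicted ratings", including the unrated cells
```

A small `k` gives a smoother, more generalized matrix, while `k` equal to the full rank simply reproduces the input, zeros included. That is why a truncated `k` (50 in this kernel) is used for prediction.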

In [None]:
preds = pd.DataFrame(all_user_predicted_ratings, columns = final_df_matrix.columns)

preds.head()

Now I write a function to return the movies with the highest predicted rating that the specified user hasn't already rated. Though I didn't use any explicit movie content features (such as genre or title), I'll merge in that information to get a more complete picture of the recommendations.

In [None]:
def recommend_movies(predictions, userID, movies, reviews, num_recommendations):
    
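    # Get and sort the user's predictions (use the `predictions` argument, not the global `preds`)
    user_row_number = userID - 1 # User IDs start at 1 and are assumed contiguous, so row = userID - 1
    sorted_user_predictions = predictions.iloc[user_row_number].sort_values(ascending=False)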
    
    # Get the user's data and merge in the movie information.
    user_data = reviews[reviews.userId == (userID)]
    user_full = (user_data.merge(movies, how = 'left', on = 'movieId').
                     sort_values(['rating'], ascending=False)
                 )

    print('User {0} has already rated {1} movies.'.format(userID, user_full.shape[0]))
    print('Recommending highest {0} predicted ratings movies not already rated.'.format(num_recommendations))
    
    # Recommend the highest predicted rating movies that the user hasn't seen yet.
    recommendations = (movies[~movies['movieId'].isin(user_full['movieId'])].
         merge(pd.DataFrame(sorted_user_predictions).reset_index(), how = 'left',
               left_on = 'movieId',
               right_on = 'movieId').
         rename(columns = {user_row_number: 'Predictions'}).
         sort_values('Predictions', ascending = False).
                       iloc[:num_recommendations, :-1]
                      )

    return user_full.head(10), recommendations.sort_values('release_year', ascending=False)  # then sort by newest release year

In [None]:
user_already_rated, for_recommend = recommend_movies(preds, 1920, movies, reviews, 10)

### Example: 10 movies that user 1920 has already rated

In [None]:
user_already_rated

### Top-10 movies that user 1920 will hopefully enjoy

In [None]:
for_recommend

> ### Option 2 (faster): use the Surprise library

# SurPRISE - Simple Python RecommendatIon System Engine.


In [None]:
from surprise import Reader, Dataset, SVD, SVDpp
from surprise import accuracy

In [None]:
reader = Reader(rating_scale=(1, 5))

dataset = Dataset.load_from_df(final_df[['userId', 'movieId', 'rating']], reader=reader)

svd = SVD(n_factors=50)
svd_plusplus = SVDpp(n_factors=50)

In [None]:
trainset = dataset.build_full_trainset()

svd.fit(trainset)  # older Surprise versions used svd.train

In [None]:
### SVD++ takes a LONG time to fit, but it usually gives better RMSE & MAE scores

# svd_plusplus.fit(trainset)

In [None]:
# Lookup table: movieId -> movie title
id_2_names = dict(zip(movies['movieId'], movies['movie_names']))

In [None]:
def Build_Anti_Testset4User(user_id):
    
    fill = trainset.global_mean
    anti_testset = list()
    u = trainset.to_inner_uid(user_id)
    
    # ur == users ratings
    user_items = set([item_inner_id for (item_inner_id, rating) in trainset.ur[u]])
    
    anti_testset += [(trainset.to_raw_uid(u), trainset.to_raw_iid(i), fill) for
                            i in trainset.all_items() if i not in user_items]
    
    return anti_testset
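
The idea behind the function above, sketched without Surprise: an "anti-testset" for one user is every item that exists in the data but that this user has not rated, paired with a placeholder rating (here the global mean, as above). The toy data and the helper name `anti_testset_for_user` are invented for illustration.

```python
# Toy data: user -> {item: rating}. All values are invented.
user_ratings = {
    1: {10: 5.0, 20: 3.0},
    2: {20: 4.0, 30: 2.0},
    3: {40: 4.0},
}

# Only items that received at least one rating are known to the training set
all_items = {item for items in user_ratings.values() for item in items}
all_values = [r for items in user_ratings.values() for r in items.values()]
global_mean = sum(all_values) / len(all_values)

def anti_testset_for_user(user_id):
    """(user, item, placeholder) triples for every known item the user has NOT rated."""
    rated = set(user_ratings[user_id])
    return [(user_id, item, global_mean) for item in sorted(all_items - rated)]

print(anti_testset_for_user(1))  # [(1, 30, 3.6), (1, 40, 3.6)]
```

The third element of each triple is only a filler so that the tuple has the shape a prediction routine expects; it is not a real rating.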

### First, let's try SVD to make Top-N recommendations

In [None]:
def TopNRecs_SVD(user_id, num_recommender=10, latest=False):
    
    testSet = Build_Anti_Testset4User(user_id)
    predict = svd.test(testSet)  # we can change to SVD++ later
    
    recommendation = list()
    
    for userID, movieID, actualRating, estimatedRating, _ in predict:
        intMovieID = int(movieID)
        recommendation.append((intMovieID, estimatedRating))
        
    recommendation.sort(key=lambda x: x[1], reverse=True)
    
    movie_names = []
    movie_ratings = []
    
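    # Keep a candidate pool of at least 20 top-rated movies, so num_recommender > 20 is not silently truncated
    for name, ratings in recommendation[:max(20, num_recommender)]: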
        movie_names.append(id_2_names[name])
        movie_ratings.append(ratings)
        
    movie_dataframe =  pd.DataFrame({'movie_names': movie_names,
                                     'rating': movie_ratings}).merge(movies[['movie_names', 'release_year']],
                                            on='movie_names', how='left')
    
    if latest:
        return movie_dataframe.sort_values('release_year', ascending=False)[['movie_names', 'rating']].head(num_recommender)
    
    else:
        return movie_dataframe.drop('release_year', axis=1).head(num_recommender)

### First option:

* Sort by predicted rating
* Sometimes the system will recommend an old movie
* If the user doesn't like old movies, that becomes a problem

In [None]:
TopNRecs_SVD(1920, num_recommender=10)

### Second option:

* Sort by release year
* The system will recommend the latest movies among the top candidates
* But the movies with the best predicted ratings may no longer be at the top

In [None]:
TopNRecs_SVD(1920, num_recommender=10, latest=True)

## Model Evaluation

In [None]:
# Then predict ratings for all pairs (u, i) that are NOT in the training set.
testset = trainset.build_anti_testset()

predictions_svd = svd.test(testset)

In [None]:
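# Caution: the anti-testset contains no real ratings. Surprise fills the "actual" rating with the
# global mean, so these numbers measure the distance from the global mean, not held-out accuracy.
# For a real accuracy estimate, evaluate on held-out ratings instead
# (e.g. with train_test_split or cross_validate from surprise.model_selection).
print('SVD - RMSE:', accuracy.rmse(predictions_svd, verbose=False))
print('SVD - MAE:', accuracy.mae(predictions_svd, verbose=False))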
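
RMSE and MAE both compare predicted ratings with known ratings, so they are most informative on ratings that were held out from training. A minimal sketch of the two formulas on invented (actual, predicted) pairs; the function names and numbers are assumptions for illustration, not Surprise's implementation.

```python
from math import sqrt

# (actual rating, predicted rating) pairs for ratings held out from training; invented numbers
held_out = [(5, 4.5), (3, 3.5), (4, 4.0), (1, 2.0)]

def rmse(pairs):
    """Root mean squared error: penalizes large misses more heavily."""
    return sqrt(sum((a - p) ** 2 for a, p in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error: the average size of a miss."""
    return sum(abs(a - p) for a, p in pairs) / len(pairs)

print('RMSE: {:.3f}'.format(rmse(held_out)))  # RMSE: 0.612
print('MAE: {:.3f}'.format(mae(held_out)))    # MAE: 0.500
```

The single miss of 1.0 pulls RMSE above MAE, which is the usual pattern when a few predictions are far off.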

> ### Remember: in recommendation, what matters most is the Top-N list (the products actually recommended), not RMSE or MAE

It's also important to consider other performance metrics, such as:
* diversity
* coverage
* serendipity
* novelty

I'll try to explain these metrics in another kernel
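
As a small preview of one of them, catalog coverage can be read as the share of the catalog that appears in at least one user's top-N list. That definition, the helper name `catalog_coverage`, and the toy lists are assumptions for illustration; other definitions of coverage exist.

```python
def catalog_coverage(top_n, catalog_size):
    """Fraction of catalog items that appear in at least one user's top-N list."""
    recommended = {item for recs in top_n.values() for item, _ in recs}
    return len(recommended) / catalog_size

# user id -> list of (movie id, estimated rating); invented values
toy_top_n = {
    1: [(10, 4.8), (20, 4.5)],
    2: [(10, 4.9), (30, 4.1)],
}

coverage = catalog_coverage(toy_top_n, catalog_size=10)
print('{:.0f}% of the catalog is ever recommended'.format(coverage * 100))  # 30%
```

A recommender can score well on RMSE and still have low coverage if it keeps suggesting the same few popular movies to everyone.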

## Optional

You can use the function below to generate recommendations for all users.

In [None]:
from collections import defaultdict

def GetTopN(predictions, n=10, minimumRating=4.0):
        topN = defaultdict(list)

        for userID, movieID, actualRating, estimatedRating, _ in predictions:
            if (estimatedRating >= minimumRating):
                topN[int(userID)].append((int(movieID), estimatedRating))

        for userID, ratings in topN.items():
            ratings.sort(key=lambda x: x[1], reverse=True)
            topN[int(userID)] = ratings[:n]

        return topN

In [None]:
top_n = GetTopN(predictions_svd, n=10)

ii = 0
for uid, predict_ratings in top_n.items():
    print(uid, [iid for (iid, _) in predict_ratings])
    ii += 1
    
    if ii > 5:
        break

* The first value on each line is the user id
* The list that follows contains the ids of the movies recommended to that user

### Last quote from me:

> ## Recommendation systems take us out of *the age of information* and bring us into *the age of recommendation*

# Reference

"SVD for Movie Recommendations", https://github.com/khanhnamle1994/movielens/blob/master/SVD_Model.ipynb

Surprise Library, http://surpriselib.com/