# The Age of the Recommendation

## How to use this Notebook

You can open it at https://colab.research.google.com

You need two files from IMDb (Internet Movie Database), *movies.csv* and *ratings.csv*, both uploaded to Google Colab.

## Import libraries

In [None]:
import numpy as np
import pandas as pd 

from matplotlib import pyplot as plt 
import seaborn as sns

import scipy as sp
from sklearn.metrics.pairwise import cosine_similarity

%matplotlib inline
plt.style.use('ggplot')
pd.options.display.max_rows = 999

## Import data

In [None]:
file_dir = '/content/'

In [None]:
ratings_ = pd.read_csv(file_dir + 'ratings.csv')

In [None]:
ratings_.info()

In [None]:
ratings_.head(10)

In [None]:
ratings_.describe()

In [None]:
sns.countplot(x='rating', data=ratings_)

In [None]:
movies_ = pd.read_csv(file_dir + 'movies.csv')

In [None]:
movies_.info()

In [None]:
movies_.head().T

## Merge of ratings data with movies metadata
Add each movie's title to the ratings data.

In [None]:
rated_movies = ratings_.merge(movies_, on='movieId')[['userId', 'movieId', 'title', 'rating']]
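
The merge above can be sanity-checked on toy data. The sketch below uses made-up rows in place of *ratings.csv* and *movies.csv*, and adds pandas' `validate='many_to_one'` option as an optional safeguard (an assumption, not part of the original notebook): it raises an error if a `movieId` appears more than once in the movies table, which would otherwise silently duplicate rating rows.

```python
import pandas as pd

# Toy stand-ins for ratings.csv and movies.csv (all values are made up).
ratings = pd.DataFrame({
    'userId': [1, 1, 2],
    'movieId': [10, 20, 10],
    'rating': [4.0, 3.5, 5.0],
})
movies = pd.DataFrame({
    'movieId': [10, 20],
    'title': ['Movie A (2000)', 'Movie B (2001)'],
})

# Many ratings map to one movie; validate checks that movieId is unique in movies.
rated = ratings.merge(movies, on='movieId', how='inner', validate='many_to_one')
rated = rated[['userId', 'movieId', 'title', 'rating']]
print(rated)
```

Each rating row keeps its user and score and gains the matching title, so the row count stays equal to the number of ratings.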

In [None]:
rated_movies.head()

## EDA - Exploratory Data Analysis

In [None]:
# String aggregator names avoid the deprecation warning newer pandas raises for np.mean
mean_ratings = rated_movies.pivot_table(index='title', values='rating', aggfunc=['mean', 'count'])

In [None]:
mean_ratings.columns = ['mean_rating', 'ratings']

In [None]:
mean_ratings.head()

In [None]:
mean_ratings.describe()

In [None]:
mean_ratings[mean_ratings['ratings'] > 50].sort_values(by='mean_rating', ascending=False).head(10)

In [None]:
mean_ratings[mean_ratings.index.str.contains('Indiana Jones')]

In [None]:
# Pivot into a user x movie matrix: one row per user, one column per title.
# Movies a user has not rated become 0.
ratings_pivot = rated_movies.pivot_table(
    index='userId',
    columns='title',
    aggfunc='mean',
    values='rating'
).fillna(0)
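
To make the shape of this matrix concrete, here is a minimal sketch on made-up ratings (the titles and values are hypothetical). It also counts how many cells were empty before `fillna(0)`, since in the filled matrix a 0 means "not rated" rather than "rated zero".

```python
import pandas as pd

# Made-up long-format ratings: one row per (user, movie) pair.
rated = pd.DataFrame({
    'userId': [1, 1, 2, 3],
    'title': ['Movie A', 'Movie B', 'Movie A', 'Movie C'],
    'rating': [4.0, 3.0, 5.0, 2.0],
})

# Wide format: users as rows, titles as columns, NaN where no rating exists.
wide = rated.pivot_table(index='userId', columns='title', values='rating', aggfunc='mean')
n_missing = int(wide.isna().sum().sum())

# Replace missing ratings with 0 so the matrix can feed similarity functions.
filled = wide.fillna(0)
print(filled)
print('{} of {} cells had no rating'.format(n_missing, wide.size))
```

On real rating data most cells are empty, which is why the notebook later converts the matrix to a sparse format.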

In [None]:
ratings_pivot.info()

In [None]:
ratings_pivot.head(10).T.head(10)

## Calculate cosine similarity between users and between movies

In [None]:
sparse_ratings = sp.sparse.csr_matrix(ratings_pivot.values)

In [None]:
user_similarity = cosine_similarity(sparse_ratings)
item_similarity = cosine_similarity(sparse_ratings.T)
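
Cosine similarity between two rating vectors is their dot product divided by the product of their lengths. The NumPy sketch below spells that out on a tiny made-up matrix; `cosine_sim_matrix` is a hypothetical helper written for illustration, and treating all-zero rows as having similarity 0 is an assumption of this sketch.

```python
import numpy as np

def cosine_sim_matrix(X):
    """Pairwise cosine similarity between the rows of X."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # assumption: all-zero rows yield 0 instead of NaN
    unit = X / norms         # scale every row to length 1
    return unit @ unit.T     # dot products of unit vectors are cosines

# Rows are users, columns are movies; 0 means "not rated" (made-up values).
R = np.array([
    [5, 0, 3],
    [4, 0, 0],
    [0, 2, 0],
])
user_sim = cosine_sim_matrix(R)    # user x user
item_sim = cosine_sim_matrix(R.T)  # movie x movie, via the transpose
print(np.round(user_sim, 3))
```

Users 0 and 1 both rated the first movie, so their similarity is high, while user 2 shares no rated movie with either and gets 0. Transposing the matrix gives the movie-to-movie version, as in the cell above.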

In [None]:
user_similarity_df = pd.DataFrame(user_similarity, index = ratings_pivot.index, columns = ratings_pivot.index)
item_similarity_df = pd.DataFrame(item_similarity, index = ratings_pivot.columns, columns = ratings_pivot.columns)

In [None]:
item_similarity_df.head()

In [None]:
user_similarity_df.head()

## Explore results of cosine similarity filters

### First with movies (items)

This function prints the 10 movies with the highest cosine similarity to a given movie.

In [None]:
def top_movies(name):
    print('Similar movies to {} include:\n'.format(name))
    # Position 0 is the movie itself (similarity 1), so take positions 1 to 10
    similar = item_similarity_df.sort_values(by=name, ascending=False).index[1:11]
    for count, item in enumerate(similar, start=1):
        print('No. {}: {}'.format(count, item))
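
The function above skips the first sorted row on the assumption that it is the queried movie. If another movie has an identical rating pattern (similarity 1.0), that positional slice may drop the wrong row. A possible variant, sketched here with a hypothetical helper `most_similar` and a made-up 3 x 3 similarity table, removes the query by name instead:

```python
import pandas as pd

def most_similar(sim_df, name, n=10):
    """Return the n entries most similar to `name`, excluding `name` itself."""
    if name not in sim_df.columns:
        raise KeyError('{!r} not found in similarity matrix'.format(name))
    # Drop the query by label, then keep the n largest similarity values.
    return sim_df[name].drop(labels=name).nlargest(n)

# Made-up symmetric similarity table for three items.
toy_sim = pd.DataFrame(
    [[1.0, 0.9, 0.1],
     [0.9, 1.0, 0.4],
     [0.1, 0.4, 1.0]],
    index=['A', 'B', 'C'], columns=['A', 'B', 'C'],
)
print(most_similar(toy_sim, 'C', n=2))
```

Returning a Series rather than printing also keeps the similarity values available for later steps.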

Call top_movies with a movie title to list its 10 most similar movies.

In [None]:
top_movies('Star Wars: Episode V - The Empire Strikes Back (1980)')

In [None]:
top_movies('[REC] (2007)')

In [None]:
top_movies('xXx (2002)')

### Same approach but now with users, instead of movies

Find the 10 users with the highest cosine similarity to a given user.

In [None]:
# This function prints the 10 users with the highest similarity to the given user

def top_users(user):

    if user not in ratings_pivot.index:
        return 'No data available on user {}'.format(user)

    print('Most Similar Users:\n')
    # Sort once; position 0 is the user itself, so take positions 1 to 10
    similar = user_similarity_df[user].sort_values(ascending=False).iloc[1:11]
    for other_user, sim in similar.items():
        print('User #{0}, Similarity value: {1:.2f}'.format(other_user, sim))

In [None]:
top_users(7)

In [None]:
ratings_[ratings_['userId'] == 7].head(10)

In [None]:
ratings_[ratings_['userId'] == 403].head(15)

## Another approach for recommenders - clustering models

### Dimension reduction with PCA

The data is too wide (one column per movie), so we reduce the number of attributes (predictors) with a well-known technique called Principal Component Analysis (PCA).

In [None]:
from sklearn.decomposition import PCA

pca = PCA()
pca.fit(ratings_pivot)
pca_samples = pca.transform(ratings_pivot)

PCA trades the number of attributes kept against the share of variance they explain.

Keeping fewer components usually means less explained variance, so we lose some "resolution".

In [None]:
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')
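
The curve can also be read numerically. The sketch below, which uses a hypothetical helper `components_for_variance` and made-up variance ratios, finds the smallest number of components whose cumulative explained variance reaches a chosen target; the 0.90 target is an arbitrary assumption, not a value taken from this notebook.

```python
import numpy as np

def components_for_variance(explained_variance_ratio, target=0.90):
    """Smallest number of leading components whose cumulative ratio reaches target."""
    cumulative = np.cumsum(explained_variance_ratio)
    # Index of the first cumulative value >= target, converted to a count.
    n = int(np.searchsorted(cumulative, target)) + 1
    return min(n, len(cumulative))  # cap in case the target is never reached

# Made-up ratios; in the notebook this would be pca.explained_variance_ratio_.
ratios = np.array([0.50, 0.20, 0.15, 0.10, 0.05])
n_components = components_for_variance(ratios, target=0.90)
print(n_components)
```

With these ratios the cumulative sums are roughly 0.50, 0.70, 0.85, 0.95 and 1.00, so four components are needed to pass 0.90.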

In [None]:
pca_df = pd.DataFrame(pca_samples[:,0:50])

In [None]:
pca_df.head(20)

### Creation of the clustering model
K-means segments users or movies into groups by similarity, measured here as the Euclidean distance between them in the reduced PCA space.

In [None]:
from sklearn.cluster import KMeans

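# Sum of squared distances to the nearest centroid (inertia) for each k
sse = {}

for k in range(1, 20):
    # Fixed seed (arbitrary value) so the curve is reproducible between runs
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(pca_df)
    sse[k] = kmeans.inertia_

plt.figure(figsize=(15, 5))
plt.plot(list(sse.keys()), list(sse.values()))
plt.xlabel('number of clusters (k)')
plt.ylabel('inertia (SSE)')
plt.show()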

After comparing different numbers of clusters, we use 15, based on the chart above.


In [None]:
kmeans = KMeans(n_clusters = 15)
kmeans.fit(pca_df)
pca_df['label'] = kmeans.labels_

In [None]:
sns.pairplot(pca_df, vars=[0,1,2], hue='label')

In [None]:
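# ratings_pivot is indexed by userId while pca_df has a 0-based index,
# so assign by position (.values) to avoid shifting labels between users
ratings_pivot['label'] = pca_df['label'].values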
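
When a pandas Series is assigned to a DataFrame column, values are matched by index label rather than by row position. Since `ratings_pivot` is indexed by `userId` and `pca_df` by a 0-based range, this matters for the label column. The toy sketch below (made-up data, hypothetical variable names) contrasts the two behaviours:

```python
import pandas as pd

# Wide table indexed by userId starting at 1, as a pivot of ratings would be.
wide = pd.DataFrame({'Movie A': [4.0, 0.0, 5.0]},
                    index=pd.Index([1, 2, 3], name='userId'))

# Cluster labels in row order, carrying a default 0-based index.
labels = pd.Series([0, 1, 1])

# Assigning the Series aligns on index: user 1 receives labels[1], user 3 gets NaN.
by_index = wide.copy()
by_index['label'] = labels

# Assigning the raw array goes by position: user 1 receives the first label.
by_position = wide.copy()
by_position['label'] = labels.to_numpy()

print(by_index)
print(by_position)
```

Assigning by position keeps each user's label attached to that user's row, provided both tables are in the same row order.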

In [None]:
cluster_ = ratings_pivot[ratings_pivot['label'] == 1].drop('label', axis=1)

In [None]:
cluster_.replace({0: np.nan}, inplace = True)

In [None]:
cluster_.mean().sort_values(ascending=False).head(10)

A loop that prints the top 10 movies by mean rating for each cluster, counting only movies with more than 5 ratings in that cluster

In [None]:
for i in range(15):
    cluster_ = ratings_pivot[ratings_pivot['label'] == i].drop('label', axis=1)
    cluster_.replace({0: np.nan}, inplace = True)
    print('TOP10 movies for cluster ', i)
    votes = cluster_.count()
    most_voted_movies = votes[votes > 5]
    print(cluster_[most_voted_movies.index].mean().sort_values(ascending=False).head(10))
    print('-------')