<a href="https://colab.research.google.com/github/hugoalexg/Python-for-Data-Science-and-Machine-Learning-Bootcamp/blob/main/19_Recommender_Systems_part_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


**Advanced Recommender Systems with Python**

Welcome to the code notebook for creating Advanced Recommender Systems with Python. This is an optional lecture notebook for you to check out. Currently there is no video for this lecture because of the level of mathematics used and the heavy use of SciPy here.

In [3]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

In [4]:
from google.colab import drive
drive.mount("/content/drive", force_remount=True)

Mounted at /content/drive


In [5]:
sns.set()

**Collaborative Filtering**

In general, collaborative filtering (CF) is more commonly used than content-based systems because it usually gives better results and is relatively easy to understand from an implementation perspective. The algorithm can do feature learning on its own, meaning it can learn for itself which features to use.

We start by reading in the u.data file, which contains the full dataset.

In [6]:
column_names = ['user_id', 'item_id', 'rating', 'timestamp']
df = pd.read_csv('/content/drive/My Drive/Python for Data Science and Machine Learning Bootcamp/Files/u.data', sep='\t', names=column_names)
df.head()

Unnamed: 0,user_id,item_id,rating,timestamp
0,0,50,5,881250949
1,0,172,5,881250949
2,0,133,1,881250949
3,196,242,3,881250949
4,186,302,3,891717742


In [7]:
movie_titles = pd.read_csv("/content/drive/My Drive/Python for Data Science and Machine Learning Bootcamp/Files/Movie_Id_Titles")
movie_titles.head()

Unnamed: 0,item_id,title
0,1,Toy Story (1995)
1,2,GoldenEye (1995)
2,3,Four Rooms (1995)
3,4,Get Shorty (1995)
4,5,Copycat (1995)



Then merge the dataframes:

In [8]:
df = pd.merge(df,movie_titles,on='item_id')
df.head()

Unnamed: 0,user_id,item_id,rating,timestamp,title
0,0,50,5,881250949,Star Wars (1977)
1,290,50,5,880473582,Star Wars (1977)
2,79,50,4,891271545,Star Wars (1977)
3,2,50,5,888552084,Star Wars (1977)
4,8,50,5,879362124,Star Wars (1977)


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100003 entries, 0 to 100002
Data columns (total 5 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   user_id    100003 non-null  int64 
 1   item_id    100003 non-null  int64 
 2   rating     100003 non-null  int64 
 3   timestamp  100003 non-null  int64 
 4   title      100003 non-null  object
dtypes: int64(4), object(1)
memory usage: 4.6+ MB


Now let's take a quick look at the number of unique users and movies.

In [10]:
n_users = df.user_id.nunique()
n_items = df.item_id.nunique()

print('Num. of Users: '+ str(n_users))
print('Num of Movies: '+ str(n_items))

Num. of Users: 944
Num of Movies: 1682



**Memory-Based Collaborative Filtering**

Memory-based collaborative filtering approaches fall into two main groups: user-item filtering and item-item filtering.

User-item filtering takes a particular user, finds users who are similar to that user based on their ratings, and recommends items that those similar users liked.

In contrast, item-item filtering takes an item, finds users who liked that item, and finds other items that those users or similar users also liked. It takes an item as input and outputs other items as recommendations.

Item-Item Collaborative Filtering: “Users who liked this item also liked …”

User-Item Collaborative Filtering: “Users who are similar to you also liked …”
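
To make the two views concrete, here is a minimal sketch on a made-up 3x3 ratings matrix (the numbers and the helper name `cosine_similarity_rows` are illustrative assumptions, not part of the dataset or of any library). Comparing rows gives user-user similarity; comparing columns gives item-item similarity.

```python
import numpy as np

# Toy ratings matrix (rows = users, columns = items); 0 means "not rated".
# These numbers are made up for illustration only.
ratings = np.array([
    [5, 4, 0],
    [4, 5, 1],
    [1, 0, 5],
], dtype=float)

def cosine_similarity_rows(m):
    """Cosine similarity between every pair of rows of m (hypothetical helper)."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    unit = m / norms          # scale each row to unit length
    return unit @ unit.T      # dot products of unit vectors = cosines

user_sim = cosine_similarity_rows(ratings)     # user-user, shape (3, 3)
item_sim = cosine_similarity_rows(ratings.T)   # item-item, shape (3, 3)

# User 0 rates more like user 1 than like user 2.
print(user_sim[0, 1] > user_sim[0, 2])
# Item 0 is rated more like item 1 than like item 2.
print(item_sim[0, 1] > item_sim[0, 2])
```

The same matrix drives both approaches; only the axis along which vectors are compared changes.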

In [11]:
# build the user x item ratings matrix (row = user_id - 1, column = item_id - 1)
data_matrix = np.zeros((n_users, n_items))
for line in df.itertuples():
    data_matrix[line[1]-1, line[2]-1] = line[3]
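
The loop above can probably be replaced by a single NumPy indexing assignment. The sketch below uses a tiny made-up frame (`toy`, with 1-based ids) and checks that both versions agree; it is an illustration under those assumptions, not a drop-in replacement. Note that with the `- 1` offset an id of 0 maps to index -1, which NumPy reads as the last row; the `df.head()` output earlier shows `user_id` 0 in this file, so that may be worth checking.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the ratings frame (made-up values, 1-based ids).
toy = pd.DataFrame({
    'user_id': [1, 1, 2, 3],
    'item_id': [1, 3, 2, 3],
    'rating':  [5, 3, 4, 1],
})
n_u, n_i = toy.user_id.max(), toy.item_id.max()

# Loop version, mirroring the cell above.
loop_matrix = np.zeros((n_u, n_i))
for row in toy.itertuples(index=False):
    loop_matrix[row.user_id - 1, row.item_id - 1] = row.rating

# Vectorized version: index rows and columns with whole arrays at once.
vec_matrix = np.zeros((n_u, n_i))
vec_matrix[toy.user_id.to_numpy() - 1, toy.item_id.to_numpy() - 1] = toy.rating.to_numpy()

print(np.array_equal(loop_matrix, vec_matrix))
```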

In [12]:
# compute pairwise cosine distances between users and between movies
# note: pairwise_distances(metric='cosine') returns 1 - cosine similarity, so 0 means "same direction"
from sklearn.metrics.pairwise import pairwise_distances
user_similarity = pairwise_distances(data_matrix, metric='cosine')
item_similarity = pairwise_distances(data_matrix.T, metric='cosine')

In [13]:
# just to illustrate what the cosine metric returns; not part of the exercise
mat = np.array([[1, 2, 3], [2, 4, 6], [20, 7, 4]])
cos = pairwise_distances(mat, metric='cosine')
print(cos)

[[0.         0.         0.42987861]
 [0.         0.         0.42987861]
 [0.42987861 0.42987861 0.        ]]
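
The zeros between the first two rows show that this metric is a distance: `[1, 2, 3]` and `[2, 4, 6]` point in the same direction, so their cosine similarity is 1 and their distance is 0. As a sanity check, the values can be reproduced by hand with NumPy (the helper name `cosine_distance` is an assumption introduced for this sketch):

```python
import numpy as np

a = np.array([1, 2, 3], dtype=float)
b = np.array([2, 4, 6], dtype=float)   # same direction as a
c = np.array([20, 7, 4], dtype=float)

def cosine_distance(x, y):
    """1 - cos(angle between x and y) (hypothetical helper)."""
    return 1.0 - (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

print(cosine_distance(a, b))  # approximately 0 for parallel vectors
print(cosine_distance(a, c))  # approximately 0.4299, matching the matrix above
```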


In [14]:
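# function that "predicts" the rating of every movie for every user
def predict(ratings, similarity, type='user'):
    if type == 'user':
        # per-user mean over every column, so zeros for unrated items are included
        mean_user_rating = ratings.mean(axis=1)
        ratings_diff = (ratings - mean_user_rating[:, np.newaxis])
        # weighted sum of all users' centered ratings, normalized by the total weight per user
        pred = mean_user_rating[:, np.newaxis] + similarity.dot(ratings_diff) / np.array([np.abs(similarity).sum(axis=1)]).T
    elif type == 'item':
        # weighted sum of each user's ratings across items, normalized by the total weight per item
        pred = ratings.dot(similarity) / np.array([np.abs(similarity).sum(axis=1)])
    else:
        raise ValueError("type must be 'user' or 'item'")
    return pred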
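
Written out, the two branches of `predict` appear to compute the following, where r_{u,i} is the entry of `ratings` for user u and item i, \bar{r}_u is that user's row mean, and s denotes an entry of whichever matrix is passed as `similarity` (in this notebook, the output of `pairwise_distances`). The symbols are notation introduced here for illustration.

```latex
% type == 'user'
\hat{r}_{u,i} = \bar{r}_u + \frac{\sum_{v} s_{u,v}\,\left(r_{v,i} - \bar{r}_v\right)}{\sum_{v} \lvert s_{u,v} \rvert}

% type == 'item'
\hat{r}_{u,i} = \frac{\sum_{j} r_{u,j}\, s_{j,i}}{\sum_{j} \lvert s_{i,j} \rvert}
```

In the user branch the sum runs over all users v; in the item branch it runs over all items j.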

In [15]:
item_prediction = predict(data_matrix, item_similarity, type='item')
user_prediction = predict(data_matrix, user_similarity, type='user')

In [16]:
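# function that returns the top 5 recommended (not yet rated) movies for a given user
# note: `user` is a row index into the matrices, i.e. user_id - 1 under the mapping used to build data_matrix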
def recommend_movies_user_n(user, user_pred, data_mat):
    user_n_dic = {'item_id': [],'rating': []}
    user_n = user_pred[user]
    rating_n = data_mat[user]
    for i in range(0,len(rating_n)):
        if rating_n[i] == 0:
            user_n_dic['item_id'].append(i+1)
            user_n_dic['rating'].append(user_n[i])
    user_n_dataframe = pd.DataFrame(user_n_dic)
    user_n_dataframe = pd.merge(user_n_dataframe, movie_titles,on='item_id')
    return user_n_dataframe.sort_values(by='rating',ascending=False).head(5)
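
The core of this function is "hide what the user already rated, then take the highest predicted scores". A compact sketch of that step with NumPy masking follows; `top_n_unrated` and the sample arrays are hypothetical, and the returned indices are 0-based, so 1 would have to be added to get an `item_id` under the mapping used here.

```python
import numpy as np

def top_n_unrated(pred_row, rated_row, n=5):
    """Return 0-based indices of the n highest-scoring items with no rating yet.

    pred_row  : 1-D array of predicted scores for one user
    rated_row : 1-D array of that user's known ratings (0 = not rated)
    """
    scores = np.where(rated_row == 0, pred_row, -np.inf)  # hide rated items
    order = np.argsort(scores)[::-1]                      # best score first
    n_unrated = int((rated_row == 0).sum())
    return order[:min(n, n_unrated)]                      # never return rated items

# Made-up example: items 1 and 4 are already rated.
pred = np.array([2.1, 4.0, 3.5, 1.0, 3.9])
rated = np.array([0, 5, 0, 0, 4])
print(top_n_unrated(pred, rated, n=2))
```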

In [17]:
# recommended movies for matrix row 550 (this row holds user_id 551 under the row = user_id - 1 mapping used to build data_matrix)
recommend_movies_user_n(550, user_prediction, data_matrix)

Unnamed: 0,item_id,rating,title
0,1,2.264186,Toy Story (1995)
136,269,1.90642,"Full Monty, The (1997)"
91,173,1.793152,"Princess Bride, The (1987)"
79,151,1.683907,Willy Wonka and the Chocolate Factory (1971)
129,257,1.674988,Men in Black (1997)


In [18]:
# favorite movies (5-star ratings) of user number 550
df[(df['user_id'] == 550) & (df['rating'] == 5)]

Unnamed: 0,user_id,item_id,rating,timestamp,title
238,550,50,5,883425283,Star Wars (1977)
9588,550,181,5,883425283,Return of the Jedi (1983)
11187,550,288,5,883425979,Scream (1996)
12478,550,15,5,883426027,Mr. Holland's Opus (1995)
17207,550,328,5,883425652,Conspiracy Theory (1997)
24343,550,323,5,883425465,Dante's Peak (1997)
26628,550,258,5,883425409,Contact (1997)
40916,550,271,5,883425652,Starship Troopers (1997)
55907,550,121,5,883426027,Independence Day (ID4) (1996)
66157,550,310,5,883425627,"Rainmaker, The (1997)"


In [19]:
# inspect the cosine distance between movies (smaller values mean more similar)
dfnew = pd.DataFrame(item_similarity, index=list(movie_titles['title']),columns=list(movie_titles['title']))

In [20]:
dfnew['Mask, The (1994)'].sort_values(ascending=True).head(6)

Mask, The (1994)                     0.000000
Ace Ventura: Pet Detective (1994)    0.389616
Mrs. Doubtfire (1993)                0.411207
Batman (1989)                        0.420240
Clueless (1995)                      0.437902
Interview with the Vampire (1994)    0.438974
Name: Mask, The (1994), dtype: float64

In [21]:
dfnew['Star Wars (1977)'].sort_values(ascending=True).head(6)

Star Wars (1977)                   0.000000
Return of the Jedi (1983)          0.116482
Raiders of the Lost Ark (1981)     0.235943
Empire Strikes Back, The (1980)    0.249550
Toy Story (1995)                   0.266223
Godfather, The (1972)              0.303423
Name: Star Wars (1977), dtype: float64

In [22]:
dfnew['Scream (1996)'].sort_values(ascending=True).head(6)

Scream (1996)            0.000000
Liar Liar (1997)         0.404640
Contact (1997)           0.429226
Twelve Monkeys (1995)    0.472323
Air Force One (1997)     0.476672
Game, The (1997)         0.484658
Name: Scream (1996), dtype: float64
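
Because the frame holds distances, each lookup above sorts ascending and the movie itself comes first with distance 0. The repeated pattern could be wrapped in a small helper. The sketch below is one possible version on made-up data: `most_similar`, the placeholder titles, and the distance values are all assumptions, and it presumes titles are unique in the index.

```python
import pandas as pd

# Made-up cosine distances between three placeholder titles.
titles = ['A', 'B', 'C']
dist = pd.DataFrame(
    [[0.0, 0.2, 0.7],
     [0.2, 0.0, 0.5],
     [0.7, 0.5, 0.0]],
    index=titles, columns=titles,
)

def most_similar(dist_df, title, k=5):
    """Return the k titles closest to `title`, expressed as similarities."""
    # drop the title itself, then keep the k smallest distances
    nearest = dist_df[title].drop(index=title).nsmallest(k)
    return 1.0 - nearest   # cosine similarity = 1 - cosine distance

print(most_similar(dist, 'A', k=2))
```

Converting back to similarity is optional; it only makes "larger is better" when reading the result.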