<h2 style="text-align:center;font-size:200%;">Collaborative Filtering: A Movie Recommendation System</h2>

In [None]:
# importing required libraries
import pandas as pd
import numpy as np

# Collaborative Filtering

**Collaborative filtering is based on the notion of similarity (or distance). For example, if two users A and B have bought the same products and rated them similarly on a common rating scale, A and B can be considered similar in their tastes and purchasing behavior. Consequently, if A buys a new product and rates it highly, that product can be recommended to B, and vice versa.**

**Collaborative filtering comes in two variants:**
1. User-based similarity
2. Item-based similarity
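
As a minimal sketch of how the two variants relate, the toy example below computes cosine similarity between the rows of a matrix. The matrix `toy_ratings` and the helper `cosine_similarity_matrix` are hypothetical and not part of this notebook. Applied to a user × item matrix, the helper gives user-based similarity; applied to the transpose, it gives item-based similarity.

```python
import numpy as np

def cosine_similarity_matrix(mat):
    """Cosine similarity between the rows of `mat` (illustrative helper)."""
    norms = np.linalg.norm(mat, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # rows without any rating: avoid division by zero
    unit = mat / norms
    return unit @ unit.T

# Toy user x item matrix: 3 users, 4 items, 0 means "not rated"
toy_ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 0.0, 1.0],
    [1.0, 0.0, 5.0, 4.0],
])

toy_user_sim = cosine_similarity_matrix(toy_ratings)    # users vs users, shape (3, 3)
toy_item_sim = cosine_similarity_matrix(toy_ratings.T)  # items vs items, shape (4, 4)
print(toy_user_sim.round(3))
```

In this toy data the first two users rate the same items in almost the same way, so their similarity is much higher than the similarity of either one with the third user.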

## 1. User-based similarity

In [None]:
df2 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/DataMining/ratings.csv')
df2.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [None]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


In [None]:
df2.drop('timestamp', axis=1, inplace=True)

In [None]:
len(df2.userId.unique()), len(df2.movieId.unique())

(610, 9724)
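
As a quick sanity check, the counts printed above (100836 ratings, 610 users, 9724 movies) can be combined to estimate how sparse the user × movie matrix will be. This is only a back-of-the-envelope sketch; the variable names are illustrative.

```python
n_ratings = 100836              # rows reported by df2.info()
n_users, n_movies = 610, 9724   # unique userId / movieId counts

# share of user-movie cells that actually contain a rating
density = n_ratings / (n_users * n_movies)
print(f"{density:.2%} of the user-movie cells contain a rating")
```

Roughly 1.7% of the cells are filled, so the pivot table built in the next step consists almost entirely of missing values.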

In [None]:
# create a pivot dataframe with userId as the index and movieId as the columns
um_df = df2.pivot(index = 'userId',
                  columns = 'movieId',
                  values = 'rating').reset_index(drop=True)
# restore the user ids as index (pivot sorts by userId, so this assumes df2 is already sorted by userId)
um_df.index = df2.userId.unique()
um_df.iloc[:5,:15]

movieId,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
1,4.0,,4.0,,,4.0,,,,,,,,,
2,,,,,,,,,,,,,,,
3,,,,,,,,,,,,,,,
4,,,,,,,,,,,,,,,
5,4.0,,,,,,,,,,,,,,


In [None]:
# use fillna method to convert NaN to zeros
um_df.fillna(0, inplace=True)
um_df.iloc[:5,:15]

movieId,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
1,4.0,0.0,4.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
# calculating cosine similarity between users
from sklearn.metrics import pairwise_distances
from scipy.spatial.distance import cosine, correlation

user_sim = 1 - pairwise_distances(um_df.values, 
                                  metric = 'cosine')
# store the results in a dataframe
user_sim_df = pd.DataFrame(user_sim)
# set the index and columns of the dataframe
user_sim_df.index = df2.userId.unique()
user_sim_df.columns = df2.userId.unique()
user_sim_df.head()

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
1,1.0,0.027283,0.05972,0.194395,0.12908,0.128152,0.158744,0.136968,0.064263,0.016875,...,0.080554,0.164455,0.221486,0.070669,0.153625,0.164191,0.269389,0.291097,0.093572,0.145321
2,0.027283,1.0,0.0,0.003726,0.016614,0.025333,0.027585,0.027257,0.0,0.067445,...,0.202671,0.016866,0.011997,0.0,0.0,0.028429,0.012948,0.046211,0.027565,0.102427
3,0.05972,0.0,1.0,0.002251,0.00502,0.003936,0.0,0.004941,0.0,0.0,...,0.005048,0.004892,0.024992,0.0,0.010694,0.012993,0.019247,0.021128,0.0,0.032119
4,0.194395,0.003726,0.002251,1.0,0.128659,0.088491,0.11512,0.062969,0.011361,0.031163,...,0.085938,0.128273,0.307973,0.052985,0.084584,0.200395,0.131746,0.149858,0.032198,0.107683
5,0.12908,0.016614,0.00502,0.128659,1.0,0.300349,0.108342,0.429075,0.0,0.030611,...,0.068048,0.418747,0.110148,0.258773,0.148758,0.106435,0.152866,0.135535,0.261232,0.060792


> This code computes the cosine similarity between the users in the dataset. It imports `pairwise_distances` from sklearn, along with the `cosine` and `correlation` distance functions from scipy; only `pairwise_distances` is actually used here.

> `pairwise_distances` with the "cosine" metric returns the cosine *distance* between every pair of users, so the code subtracts it from 1 to obtain the similarity. The results are stored in a DataFrame called `user_sim_df`, with the user IDs as both index and columns.
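
For reference, the quantity computed for each pair of users can be written out as follows. The notation is introduced here for illustration only: $r_{u,i}$ is the rating of user $u$ for movie $i$, with $r_{u,i} = 0$ for unrated movies after `fillna(0)`.

$$
\mathrm{sim}(u, v) \;=\; 1 - d_{\cos}(u, v) \;=\; \frac{\sum_{i} r_{u,i}\, r_{v,i}}{\sqrt{\sum_{i} r_{u,i}^{2}}\;\sqrt{\sum_{i} r_{v,i}^{2}}}
$$

A movie that either user has not rated contributes nothing to the numerator, and since all ratings are non-negative the similarity lies between 0 and 1.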

In [None]:
user_sim_df.shape

(610, 610)

In [None]:
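# remove the diagonal values (similarity of each user with itself)
np.fill_diagonal(user_sim, 0)
# rebuild the dataframe so the zeroed diagonal does not depend on it sharing memory with user_sim
user_sim_df = pd.DataFrame(user_sim, index=user_sim_df.index, columns=user_sim_df.columns)
user_sim_df.loc[:5, :10]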

Unnamed: 0,1,2,3,4,5,6,7,8,9,10
1,0.0,0.027283,0.05972,0.194395,0.12908,0.128152,0.158744,0.136968,0.064263,0.016875
2,0.027283,0.0,0.0,0.003726,0.016614,0.025333,0.027585,0.027257,0.0,0.067445
3,0.05972,0.0,0.0,0.002251,0.00502,0.003936,0.0,0.004941,0.0,0.0
4,0.194395,0.003726,0.002251,0.0,0.128659,0.088491,0.11512,0.062969,0.011361,0.031163
5,0.12908,0.016614,0.00502,0.128659,0.0,0.300349,0.108342,0.429075,0.0,0.030611


In [None]:
# most similar user for each of the first 5 user ids
user_sim_df.idxmax(axis=1)[:5]

1    266
2    366
3    313
4    391
5    470
dtype: int64

**For user ID 1, the most similar user is user ID 266, and so on.**

In [None]:
user_sim_df.iloc[1:2, 360:370]

Unnamed: 0,361,362,363,364,365,366,367,368,369,370
2,0.012776,0.115081,0.084261,0.0,0.149578,0.300074,0.031699,0.008637,0.016431,0.034816


**As shown above, user 366 has the highest similarity score (0.300074) among the columns displayed for user 2, which is consistent with the `idxmax` result.**
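
`idxmax` keeps only the single best match per user. A recommender often looks at several neighbours instead, which could be sketched as follows. The matrix `toy_sim` and the helper `top_k_similar_users` are hypothetical and only stand in for `user_sim_df`.

```python
import pandas as pd

# Toy similarity matrix for 4 hypothetical users (diagonal already set to 0)
toy_sim = pd.DataFrame(
    [[0.0, 0.2, 0.7, 0.1],
     [0.2, 0.0, 0.3, 0.6],
     [0.7, 0.3, 0.0, 0.4],
     [0.1, 0.6, 0.4, 0.0]],
    index=[1, 2, 3, 4], columns=[1, 2, 3, 4],
)

def top_k_similar_users(sim_df, user_id, k=2):
    """Return the k most similar users to `user_id` together with their scores."""
    return sim_df.loc[user_id].nlargest(k)

print(top_k_similar_users(toy_sim, 1))
```

With `k=1` this gives the same user as `idxmax`; larger values of `k` return the next-best matches as well.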

In [None]:
# load the movies dataset for finding common movies of similar users
movies_df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/DataMining/movies.csv')
movies_df.head()

Unnamed: 0.1,Unnamed: 0,id,title,release_date,overview,popularity,vote_average,vote_count
0,0,19404,Dilwale Dulhania Le Jayenge,1995-10-20,"Raj is a rich, carefree, happy-go-lucky second...",24.153,8.8,3176
1,1,278,The Shawshank Redemption,1994-09-23,Framed in the 1940s for the double murder of h...,61.825,8.7,19877
2,2,238,The Godfather,1972-03-14,"Spanning the years 1945 to 1955, a chronicle o...",55.517,8.7,14911
3,3,724089,Gabriel's Inferno Part II,2020-07-31,Professor Gabriel Emerson finally learns the t...,8.477,8.7,1313
4,4,283566,Evangelion: 3.0+1.0 Thrice Upon a Time,2021-03-08,"In the aftermath of the Fourth Impact, strande...",163.517,8.7,337


In [None]:
movies_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9222 entries, 0 to 9221
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Unnamed: 0    9222 non-null   int64  
 1   id            9222 non-null   int64  
 2   title         9222 non-null   object 
 3   release_date  9222 non-null   object 
 4   overview      9207 non-null   object 
 5   popularity    9222 non-null   float64
 6   vote_average  9222 non-null   float64
 7   vote_count    9222 non-null   int64  
dtypes: float64(2), int64(3), object(3)
memory usage: 576.5+ KB


In [None]:
# find the movies that both user 2 and user 366 have rated
df2[df2.userId == 2].merge(df2[df2.userId == 366], on = 'movieId', how='inner')

Unnamed: 0,userId_x,movieId,rating_x,userId_y,rating_y
0,2,3578,4.0,366,4.5
1,2,6874,4.0,366,4.0
2,2,48516,4.0,366,4.5
3,2,58559,4.5,366,4.0
4,2,68157,4.5,366,4.5
5,2,79132,4.0,366,4.0
6,2,91529,3.5,366,4.0
7,2,109487,3.0,366,5.0
8,2,122882,5.0,366,2.0


The ratings of these two users are close on most of the movies they have in common: seven of the nine differ by at most 0.5 points, while the last two (movies 109487 and 122882) diverge sharply.

We can therefore expect that if user 2 gives a high rating to a new movie, it is reasonable to recommend that movie to user 366, who will probably like it as well.
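
The recommendation step described above can be sketched as follows: take the movies the similar user rated highly and keep those the target user has not rated yet. The data, the user labels "A" and "B", the threshold `min_rating` and the helper `recommend_from_neighbour` are all invented for illustration; only the column layout matches `df2`.

```python
import pandas as pd

# Toy ratings in the same long format as df2 (userId, movieId, rating)
toy_long = pd.DataFrame({
    "userId":  ["A", "A", "A", "B", "B", "B", "B"],
    "movieId": [10, 20, 30, 10, 20, 40, 50],
    "rating":  [4.0, 4.5, 3.0, 4.5, 4.0, 5.0, 2.5],
})

def recommend_from_neighbour(ratings, target, neighbour, min_rating=4.0):
    """Movies the neighbour rated >= min_rating that the target has not rated."""
    seen = set(ratings.loc[ratings.userId == target, "movieId"])
    candidates = ratings[(ratings.userId == neighbour)
                         & (ratings.rating >= min_rating)
                         & (~ratings.movieId.isin(seen))]
    return candidates.sort_values("rating", ascending=False)[["movieId", "rating"]]

print(recommend_from_neighbour(toy_long, target="A", neighbour="B"))
```

Here only movie 40 is returned: movies 10 and 20 were already rated by "A", and movie 50 falls below the threshold.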

## 2. Item-based similarity

In [None]:
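# pivot based on movie rating: one row per movie, one column per user
rating_mat = df2.pivot(index='movieId',
                       columns='userId',
                       values='rating').reset_index(drop=True)
# reset_index drops the movieId labels, so rows are addressed by position from here on
#rating_mat.index = movies_df.movieId
rating_mat.loc[:5, :15]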

userId,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,4.0,,,,4.0,,4.5,,,,,,,,2.5
1,,,,,,4.0,,4.0,,,,,,,
2,4.0,,,,,5.0,,,,,,,,,
3,,,,,,3.0,,,,,,,,3.0,
4,,,,,,5.0,,,,,,,,,
5,4.0,,,,,4.0,,,,,5.0,,,,


In [None]:
rating_mat.shape

(9724, 610)

In [None]:
rating_mat.fillna(0, inplace=True)
# find the correlation between the movies
movie_sim = 1 - pairwise_distances(rating_mat.values,
                                  metric='correlation')
movie_sim_df = pd.DataFrame(movie_sim)
movie_sim_df.loc[:5, :15]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,1.0,0.231327,0.173213,-0.028917,0.192474,0.192686,0.143743,0.085477,0.177245,0.183382,0.172799,0.159352,0.106217,0.099645,0.031566,0.111011
1,0.231327,1.0,0.191945,0.071269,0.200526,0.158341,0.127569,0.14154,-0.021045,0.285086,0.21709,0.11529,0.163556,0.033185,0.191785,0.108676
2,0.173213,0.191945,1.0,0.067143,0.370171,0.196442,0.351513,0.296897,0.275812,0.136916,0.174251,0.168038,0.118157,0.136819,0.111644,0.216929
3,-0.028917,0.071269,0.067143,1.0,0.16791,0.053755,0.258075,0.148726,-0.016025,0.056,0.128247,-0.016306,0.142266,0.095113,0.145606,0.082152
4,0.192474,0.200526,0.370171,0.16791,1.0,0.215503,0.42989,0.265777,0.308085,0.110833,0.201002,0.17363,0.089913,0.220718,0.07017,0.108118
5,0.192686,0.158341,0.196442,0.053755,0.215503,1.0,0.148109,0.114707,0.167909,0.251343,0.182082,0.115893,-0.013484,0.24288,0.091079,0.408483


> This code computes the correlation-based similarity between the movies in the dataset. It first fills the missing values (NaN) in the rating matrix with zeros. It then uses `pairwise_distances` with the "correlation" metric, which returns a distance, and subtracts the result from 1 to obtain the similarity between every pair of movies.
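
One way to read the "correlation" metric: one minus the correlation distance is the Pearson correlation, which is the cosine similarity of the two vectors after each has been mean-centred. The sketch below checks this on two made-up rating vectors; `correlation_similarity`, `movie_a` and `movie_b` are illustrative names, not part of the notebook.

```python
import numpy as np

def correlation_similarity(x, y):
    """Pearson correlation, computed as the cosine of the mean-centred vectors."""
    xc = x - x.mean()
    yc = y - y.mean()
    return float(xc @ yc / (np.linalg.norm(xc) * np.linalg.norm(yc)))

# Ratings of two hypothetical movies by the same 5 users (0 = filled-in missing value)
movie_a = np.array([4.0, 0.0, 5.0, 3.0, 0.0])
movie_b = np.array([5.0, 0.0, 4.0, 2.0, 1.0])

manual = correlation_similarity(movie_a, movie_b)
reference = np.corrcoef(movie_a, movie_b)[0, 1]  # NumPy's Pearson correlation
print(round(manual, 4), round(reference, 4))
```

Because the zeros introduced by `fillna(0)` enter the mean, they are treated like real ratings of 0 rather than as missing values.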

In [None]:
movie_sim_df.shape

(9724, 9724)

In [None]:
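np.fill_diagonal(movie_sim, 0)
# rebuild the dataframe in case it does not share memory with movie_sim
movie_sim_df = pd.DataFrame(movie_sim)
movie_sim_df.loc[:5,:15]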

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,0.0,0.231327,0.173213,-0.028917,0.192474,0.192686,0.143743,0.085477,0.177245,0.183382,0.172799,0.159352,0.106217,0.099645,0.031566,0.111011
1,0.231327,0.0,0.191945,0.071269,0.200526,0.158341,0.127569,0.14154,-0.021045,0.285086,0.21709,0.11529,0.163556,0.033185,0.191785,0.108676
2,0.173213,0.191945,0.0,0.067143,0.370171,0.196442,0.351513,0.296897,0.275812,0.136916,0.174251,0.168038,0.118157,0.136819,0.111644,0.216929
3,-0.028917,0.071269,0.067143,0.0,0.16791,0.053755,0.258075,0.148726,-0.016025,0.056,0.128247,-0.016306,0.142266,0.095113,0.145606,0.082152
4,0.192474,0.200526,0.370171,0.16791,0.0,0.215503,0.42989,0.265777,0.308085,0.110833,0.201002,0.17363,0.089913,0.220718,0.07017,0.108118
5,0.192686,0.158341,0.196442,0.053755,0.215503,0.0,0.148109,0.114707,0.167909,0.251343,0.182082,0.115893,-0.013484,0.24288,0.091079,0.408483


In [None]:
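# finding most similar movies
def get_similar_movies(movieid, topN):
    # row position of the movie in movies_df, matched on its `id` column
    movieidx = movies_df[movies_df.id == movieid].index[0]
    # NOTE: the position in movies_df is used to pick a row of movie_sim_df,
    # so this assumes both tables list the movies in the same order
    result = movies_df.copy()  # avoid adding a column to the global dataframe
    result['similarity'] = movie_sim_df.iloc[movieidx]
    top_n = result.sort_values(['similarity'], ascending=False)[0:topN]
    return top_n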

In [None]:
# recommend 5 movies based on similarity to movie id 858
get_similar_movies(858, 5)

Unnamed: 0.1,Unnamed: 0,id,title,release_date,overview,popularity,vote_average,vote_count,similarity
3162,3162,391757,Never Back Down: No Surrender,2016-06-05,Picking up after the events of Never Back Down...,52.759,7.0,344,0.50943
5290,5290,72387,Safe,2012-04-16,After a former elite agent rescues a 12-year-o...,19.742,6.5,1848,0.465511
3490,3490,11617,Rio Grande,1950-11-15,Lt. Col. Kirby Yorke is posted on the Texas fr...,8.219,6.9,207,0.463518
5915,5915,8689,Cannibal Holocaust,1980-02-07,A New York University professor returns from a...,23.234,6.3,1109,0.462813
4291,4291,790,The Fog,1980-02-08,Strange things begin to occurs as a tiny Calif...,18.433,6.7,993,0.460209


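For a user who gave a high rating to the movie with ID 858, the five movies in this list of similar movies are candidates for recommendation. Note that the function finds the movie by its `id` in `movies_df` and then reads the row of `movie_sim_df` at the same position, so the list is only meaningful if both tables list the movies in the same order. The two tables have different sizes (9,222 rows versus 9,724), so this alignment should be checked.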
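
One possible way to avoid relying on row positions is to keep `movieId` as the index of the pivot table and to label the similarity matrix with it, so that neighbours are looked up by id. The sketch below uses invented ratings and a hypothetical helper `similar_movies_by_id`; it uses `np.corrcoef` instead of `pairwise_distances` to stay self-contained.

```python
import numpy as np
import pandas as pd

# Toy ratings with non-contiguous movie ids (all values are made up)
toy = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "movieId": [10, 858, 2571, 10, 858, 2571, 10, 858, 2571],
    "rating":  [5.0, 4.0, 1.0, 4.0, 5.0, 2.0, 1.0, 2.0, 5.0],
})

# Keep movieId as the index instead of resetting it
mat = toy.pivot(index="movieId", columns="userId", values="rating").fillna(0)
# Pearson correlation between movies, labelled with the movie ids
sim = pd.DataFrame(np.corrcoef(mat.values), index=mat.index, columns=mat.index)

def similar_movies_by_id(sim_df, movie_id, top_n):
    """Most similar movies to `movie_id`, looked up by label, excluding itself."""
    return sim_df.loc[movie_id].drop(movie_id).nlargest(top_n)

print(similar_movies_by_id(sim, 858, 1))
```

With labelled rows, titles can then be attached by merging on the id column, provided the metadata table uses the same id scheme as the ratings.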


In [None]:
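def compute_prediction(user_id, movie_id, similarity_mtx, data):
    """Predict a rating as the similarity-weighted average of the user's ratings.

    `user_id - 1` and `movie_id - 1` are used as positions, which assumes that
    ids map to consecutive rows/columns starting at 1.
    """
    # all ratings given by the user (0 = not rated)
    user_rating = data.iloc[:, user_id-1]
    # similarity of the target movie to every other movie
    item_similarity = similarity_mtx[movie_id-1]
    numerate = np.dot(user_rating, item_similarity)
    # normalise by the similarities of the movies the user actually rated
    denom = item_similarity[user_rating > 0].sum()

    # fall back to the user's mean rating when no weighted average is available
    if denom == 0 or numerate == 0:
        return user_rating[user_rating>0].mean()

    return numerate / denom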

In [None]:
# predict the rating of user 2 on movie 2
#user_to_predict = 2
#movie_to_predict = 2

print(compute_prediction(2,2,movie_sim_df, rating_mat).round(2))

3.87


This function predicts a missing rating for a given user and movie as a similarity-weighted average of the ratings the user has already given. In this example, user 2 has not rated movie 2, and the function predicts a rating of 3.87.
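
The weighted average behind this prediction can be traced by hand on a toy example. The numbers below are invented for illustration and are unrelated to the 3.87 obtained above.

```python
import numpy as np

# Ratings by one user for 4 movies (0 = not rated); the target movie is at index 3
user_ratings = np.array([4.0, 0.0, 5.0, 0.0])
# Similarity of the target movie to each movie (self-similarity already set to 0)
target_sim = np.array([0.5, 0.9, 0.25, 0.0])

rated = user_ratings > 0
numerator = user_ratings @ target_sim    # 4*0.5 + 5*0.25 = 3.25
denominator = target_sim[rated].sum()    # 0.5 + 0.25 = 0.75
prediction = numerator / denominator
print(round(prediction, 2))
```

The movie at index 1 has the highest similarity to the target, but the user has not rated it, so it influences neither the numerator nor the denominator.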