# **Project description:**
If you use Netflix, you will notice a section titled 'Because you watched film x',
which provides film recommendations based on the films you watched most recently.
In this project, the idea is to build a film recommendation system using a technique called *collaborative filtering*.

## **Dataset description**
The dataset consists of two CSV files ('films.csv', 'notes.csv').

The first file, 'films.csv', contains three columns:
1. IdFilm (numeric film identifier): discrete
2. Titre (film title): string
3. Genres (film genres): nominal

The second file, 'notes.csv', contains four columns:
1. IdUtilisateur (numeric identifier of user x): discrete
2. IdFilm (numeric identifier of film y): discrete
3. Note (the rating given by user x to film y): discrete
4. Horodatage (a date and time associated with a film y watched by
user x): continuous

## Importing the libraries

In [323]:
import pandas as pd

## Importing the data

In [324]:
notes = pd.read_csv('notes.csv')
films = pd.read_csv('films.csv', encoding="latin1")

# Data exploration


### Description of the films file

In [325]:
films.head(10)

Unnamed: 0,IdFilm,Titre,Genres
0,1,Toy Story (1995),Aventure|Animation|Enfants|Comédie|Fantaisie
1,2,Jumanji (1995),Aventure|Enfants|Fantaisie
2,3,Grumpier Old Men (1995),Comédie|Romance
3,4,Waiting to Exhale (1995),Comédie|Drame|Romance
4,5,Father of the Bride Part II (1995),Comédie
5,6,Heat (1995),Action|Crime|Thriller
6,7,Sabrina (1995),Comédie|Romance
7,8,Tom and Huck (1995),Aventure|Enfants
8,9,Sudden Death (1995),Action
9,10,GoldenEye (1995),Action|Aventure|Thriller


### Description of the notes file

In [326]:
notes.head(10)

Unnamed: 0,IdUtilisateur,IdFilm,Note,Horodatage
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
5,1,70,3.0,964982400
6,1,101,5.0,964980868
7,1,110,4.0,964982176
8,1,151,5.0,964984041
9,1,157,5.0,964984100
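
The `Horodatage` values shown above look like Unix timestamps counted in seconds. This is an assumption, since the source does not state the unit. Under that assumption, a minimal sketch of converting them to readable dates could look like this (`notes_demo` and the `Date` column are illustrative names, not part of the original notebook):

```python
import pandas as pd

# Toy frame reusing the first three rows printed by notes.head(10)
notes_demo = pd.DataFrame({
    "IdUtilisateur": [1, 1, 1],
    "IdFilm": [1, 3, 6],
    "Note": [4.0, 4.0, 4.0],
    "Horodatage": [964982703, 964981247, 964982224],
})

# Assumption: Horodatage holds Unix timestamps expressed in seconds
notes_demo["Date"] = pd.to_datetime(notes_demo["Horodatage"], unit="s")

print(notes_demo.dtypes)
print(notes_demo[["Horodatage", "Date"]])
```

If the unit were milliseconds instead, `unit="ms"` would be the corresponding argument.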


1. How many ratings were given by all users?

In [327]:
nb_notes = notes['Note'].count()
nb_notes

100836

2. How many users are there? How many films are there?

In [328]:
nb_utilisateurs = len(notes['IdUtilisateur'].unique())
nb_utilisateurs

610

In [329]:
nb_films = films['IdFilm'].count()
nb_films

9742

3. What is the average number of ratings per user?

In [374]:
nb_moy_notes = notes.groupby('IdUtilisateur')['Note'].count().mean()
round(nb_moy_notes)

165

4. What is the average number of ratings per film?

In [373]:
nb_moy_film = notes.groupby('IdFilm')['Note'].count().mean()
round(nb_moy_film)

10

5. Which film has the lowest average rating?

In [332]:
film_basse_moy = notes.groupby('IdFilm').mean()
film = film_basse_moy.nsmallest(1,"Note")["Note"]
film

IdFilm
3604    0.5
Name: Note, dtype: float64

6. Which film has the highest average rating?

In [333]:
film_max_moy = notes.groupby('IdFilm').mean()
film = film_max_moy.nlargest(1,"Note")["Note"]
film

IdFilm
53    5.0
Name: Note, dtype: float64
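
An extreme average such as 0.5 or 5.0 may rest on very few ratings. The following is a minimal sketch, on made-up data, of reporting the number of ratings next to the mean and optionally requiring a minimum count. The names `notes_demo`, `stats`, `reliable` and the threshold `min_ratings` are hypothetical and do not come from the original notebook.

```python
import pandas as pd

# Made-up ratings: film 20 has a single, very low rating
notes_demo = pd.DataFrame({
    "IdFilm": [10, 10, 10, 20, 30, 30],
    "Note":   [4.0, 5.0, 3.0, 0.5, 5.0, 4.0],
})

# Mean rating and number of ratings per film, side by side
stats = notes_demo.groupby("IdFilm")["Note"].agg(mean_rating="mean", n_ratings="count")

# Hypothetical threshold: ignore films with fewer than 2 ratings
min_ratings = 2
reliable = stats[stats["n_ratings"] >= min_ratings]

lowest_overall = stats["mean_rating"].idxmin()      # may be driven by one rating
lowest_reliable = reliable["mean_rating"].idxmin()
highest_reliable = reliable["mean_rating"].idxmax()
print(stats)
print(lowest_overall, lowest_reliable, highest_reliable)
```

Whether such a threshold is appropriate depends on the goal. The questions above ask only for the raw extremes.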

# Data transformation

1. Users who rated fewer than 100 films and films rated by fewer than 10 users

In [334]:
# Users who rated fewer than 100 films
nb_vote_utilisateur = notes.groupby('IdUtilisateur').count()
selected_users = nb_vote_utilisateur[nb_vote_utilisateur["IdFilm"] < 100].index
selected_users

Int64Index([  2,   3,   5,   8,   9,  11,  12,  13,  14,  16,
            ...
            583, 584, 585, 588, 589, 591, 592, 595, 598, 609],
           dtype='int64', name='IdUtilisateur', length=362)

In [335]:
# Films rated by fewer than 10 users
nb_vote_films = notes.groupby('IdFilm').count()
selected_movies = nb_vote_films[nb_vote_films["IdUtilisateur"] < 10].index
selected_movies

Int64Index([     4,      8,     13,     27,     30,     38,     40,     42,
                43,     49,
            ...
            193565, 193567, 193571, 193573, 193579, 193581, 193583, 193585,
            193587, 193609],
           dtype='int64', name='IdFilm', length=7455)

**Filter the matrix**

In [336]:
notes_filtered = notes[(notes["IdUtilisateur"].isin(selected_users) & notes["IdFilm"].isin(selected_movies))]
notes_filtered

Unnamed: 0,IdUtilisateur,IdFilm,Note,Horodatage
249,2,86345,4.0,1445715166
257,2,114060,2.0,1445715276
260,2,131724,5.0,1445714851
264,3,688,0.5,1306464228
282,3,2851,5.0,1306463925
...,...,...,...,...
99527,609,828,3.0,847221054
99528,609,833,3.0,847221080
99530,609,1056,3.0,847221080
99532,609,1150,4.0,847221054


2. Creating the user-film matrix

In [337]:
Muf = notes_filtered.pivot(index='IdUtilisateur',columns=('IdFilm'), values=('Note'))
Muf

IdFilm,4,8,13,27,40,42,43,53,57,61,...,187541,189043,189111,189333,190207,190209,190213,190215,190219,190221
IdUtilisateur,,,,,,,,,,,,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
9,,,,,,,,,,,...,,,,,,,,,,
11,,,,,,,,,,,...,,,,,,,,,,
13,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
591,,,,,,,,,,,...,,,,,,,,,,
592,,,,,,,,,,,...,,,,,,,,,,
595,,,,,,,,,,,...,,,,,,,,,,
598,,,,,,,,,,,...,,,,,,,,,,


3. Computing the sparsity of the user-film matrix

In [338]:
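# B: total number of cells in the matrix
B = Muf.size
# A: number of cells that actually hold a rating
A = B - Muf.isnull().sum().sum()
# S: fraction of filled cells; the fraction of empty cells is 1 - S
S = A / B
S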

0.005182267648437382
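
The value above is the share of cells that contain a rating. A minimal sketch on a made-up matrix shows how that share (often called density) relates to the share of empty cells (sparsity). `Muf_demo`, `density` and `sparsity` are illustrative names.

```python
import numpy as np
import pandas as pd

# Made-up user-film matrix: 3 users, 4 films, 3 observed ratings
Muf_demo = pd.DataFrame(
    [[4.0, np.nan, np.nan, np.nan],
     [np.nan, 2.0, np.nan, np.nan],
     [np.nan, np.nan, np.nan, 5.0]]
)

total = Muf_demo.size                      # number of cells
observed = Muf_demo.notna().sum().sum()    # cells holding a rating
density = observed / total                 # fraction of filled cells
sparsity = 1 - density                     # fraction of empty cells
print(density, sparsity)
```

Applied to the figure printed above, the fraction of empty cells would be about 0.9948.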

In [339]:
Muf = Muf.fillna(0)
Muf

IdFilm,4,8,13,27,40,42,43,53,57,61,...,187541,189043,189111,189333,190207,190209,190213,190215,190219,190221
IdUtilisateur,,,,,,,,,,,,,,,,,,,,,
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
11,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
13,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
591,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
592,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
595,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
598,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


**Collaborative filtering**

1. Applying SVD to the user-film matrix and keeping the k largest singular values, with k between 5 and 20

In [340]:
import numpy as np

In [341]:
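# np.linalg.svd returns the singular values as a 1-D array (in decreasing order)
# and the right singular vectors as rows, so V here is the transposed factor
U,S,V = np.linalg.svd(Muf.values)
S = np.diag(S)
# Dimensionality reduction: keep the k largest singular values
k = 5
U = U[:,:k]
S = S[:k,:k]
V = V[:k,:]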

In [342]:
print("Left singular vectors (users)")
print(U)
print("Singular values")
print(S)
print("Right singular vectors (films)")
print(V)

Left singular vectors (users)
[[ 3.89540049e-06 -6.95103364e-05  1.60391948e-05 -3.77902149e-05
   3.27858541e-05]
 [ 1.28388676e-04 -2.55327149e-03 -7.06945051e-05 -1.29911096e-03
   2.14746773e-03]
 [ 3.25776518e-04 -1.01667712e-02 -3.05177676e-04 -6.82801172e-03
  -6.45425795e-02]
 ...
 [ 0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00]
 [ 0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00]
 [ 9.20052690e-02  2.71852512e-02 -5.49439526e-03 -4.13734374e-02
  -3.68394429e-05]]
Singular values
[[20.74524175  0.          0.          0.          0.        ]
 [ 0.         19.52641491  0.          0.          0.        ]
 [ 0.          0.         19.35865906  0.          0.        ]
 [ 0.          0.          0.         19.14212167  0.        ]
 [ 0.          0.          0.          0.         17.92501693]]
Right singular vectors (films)
[[ 3.51446109e-03  2.14125369e-04  1.36042680e-05 ... -3.52228544e-18
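
To make the truncation step concrete, here is a minimal sketch on a small random matrix. It keeps the `k` largest singular values, rebuilds the rank-`k` approximation, and checks that the approximation error equals the norm of the discarded singular values. The matrix `M`, the seed and `k = 2` are arbitrary choices for illustration, not values from the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy "ratings" matrix: 6 users, 8 films, integer values from 0 to 5
M = rng.integers(0, 6, size=(6, 8)).astype(float)

# full_matrices=False gives the compact factorisation M = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(M, full_matrices=False)

k = 2
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Rank-k approximation of M
M_k = U_k @ np.diag(s_k) @ Vt_k

# Frobenius error of the approximation vs. the discarded singular values
err = np.linalg.norm(M - M_k)
discarded = np.sqrt(np.sum(s[k:] ** 2))
print(err, discarded)
```

Each column of `Vt_k` is then a `k`-dimensional description of one film, which is what the similarity computations below operate on.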
  

2. The 5 films most similar to the first film

In [343]:
from scipy.spatial import distance

In [344]:
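def distance_consinus_similarite(idxFilm,V):
  similarite = {}
  # The loop starts at 1, so column 0 is never compared
  for i in range(1,V.shape[1]):
    # distance.cosine returns the cosine distance, i.e. 1 - cosine similarity
    similarite[i] = distance.cosine(V[:,idxFilm],V[:,i])
  # Sort by decreasing distance and keep at most 5 entries
  similarite = sorted(similarite.items(), key=lambda x:x[1],reverse=True)
  if len(similarite) > 5:
    return similarite[:5]
  else:
    return similarite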

In [345]:
top_5 = distance_consinus_similarite(0,V)
top_5

  dist = 1.0 - uv / np.sqrt(uu * vv)


[(726, 1.982439894860521),
 (746, 1.977416724950595),
 (755, 1.9717615854327826),
 (1011, 1.9717615854327826),
 (1015, 1.9717615854327826)]
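
`scipy.spatial.distance.cosine` returns a distance (1 minus the cosine similarity), so values close to 2 correspond to vectors pointing in nearly opposite directions. If the intent is to list the closest films, one option is to rank by cosine similarity in decreasing order. The sketch below is a NumPy-only illustration under that assumption. `top_k_similar` and `Vt_demo` are hypothetical names, and treating all-zero vectors as "never similar" is a design choice made here, not something stated in the notebook.

```python
import numpy as np

def top_k_similar(idx_film, Vt, k=5):
    """Return the k columns of Vt most similar to column idx_film (cosine similarity)."""
    item_vecs = Vt.T                              # one row per film
    norms = np.linalg.norm(item_vecs, axis=1)
    target = item_vecs[idx_film]
    # Zero vectors produce 0/0; silence the warning and handle them explicitly
    with np.errstate(divide="ignore", invalid="ignore"):
        sims = item_vecs @ target / (norms * norms[idx_film])
    sims[~np.isfinite(sims)] = -np.inf            # undefined similarity: rank last
    sims[idx_film] = -np.inf                      # exclude the film itself
    order = np.argsort(-sims)[:k]                 # highest similarity first
    return [(int(i), float(sims[i])) for i in order]

# Toy factor matrix: 2 latent dimensions, 5 films (film 3 is an all-zero vector)
Vt_demo = np.array([[1.0, 2.0, -1.0, 0.0, 1.0],
                    [0.0, 0.1,  0.0, 0.0, 1.0]])
result = top_k_similar(0, Vt_demo, k=2)
print(result)
```

On this toy input, films 1 and 4 come out first, while the opposite vector (film 2) and the zero vector (film 3) are not selected.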

In [346]:
top_5_index = []
for val in top_5:
  top_5_index.append(val[0])
top_5_index

[726, 746, 755, 1011, 1015]

In [347]:
def afficher_films(index,Muf,films):
  idFilm = []
  for val in index:
    idFilm.append(Muf.columns[val])
  return films[films["IdFilm"].isin(idFilm)]


In [348]:
print("The 5 films most similar to the first one:")
film= afficher_films(top_5_index,Muf,films)
film

The 5 films most similar to the first one:


Unnamed: 0,IdFilm,Titre,Genres
5036,7841,Enfants of Dune (2003),Fantaisie|Sci-Fi
5252,8614,Overboard (1987),Comédie|Romance
5338,8880,Mask (1985),Drame
8132,101577,"Host, The (2013)",Action|Aventure|Romance
8210,103543,"Lifeguard, The (2013)",Comédie|Drame


3. Displaying the films recommended for the first user

In [349]:
premier_utilisateur = Muf.index[0]
premier_utilisateur

2

In [350]:
film_regarder = notes_filtered[notes_filtered["IdUtilisateur"] == premier_utilisateur]
film_regarder

Unnamed: 0,IdUtilisateur,IdFilm,Note,Horodatage
249,2,86345,4.0,1445715166
257,2,114060,2.0,1445715276
260,2,131724,5.0,1445714851


Retrieve the positions of the films watched by the first user and, if there are more than 5 of them, keep the films with the highest ratings.

In [351]:
# Map each column position to the first user's rating (0 means "not rated")
watched_movie = {i: val for i, val in enumerate(Muf.values[0]) if val != 0.0}
# Sort by rating, highest first, and keep at most 5 films
watched_movie = sorted(watched_movie.items(), key=lambda x:x[1],reverse=True)[:5]
watched_movie_index = [idx for idx, _ in watched_movie]
watched_movie_index

[1057, 966, 1032]

Let's check that we have the right indices.

In [352]:
watched_movie = films[films["IdFilm"].isin(Muf.columns[watched_movie_index])]
watched_movie

Unnamed: 0,IdFilm,Titre,Genres
7590,86345,Louis C.K.: Hilarious (2010),Comédie
8509,114060,The Drop (2014),Crime|Drame|Thriller
8828,131724,The Jinx: The Life and Deaths of Robert Durst ...,Documentaire


Finding the films most similar to each of the films this user has watched:

In [353]:
import time

In [359]:
lst_similarities = []
for idxFilm in watched_movie_index:
 lst_similarities.append(distance_consinus_similarite(idxFilm,V))
print(lst_similarities)

[[(1067, 1.977707949097932), (1086, 1.9727312579408038), (695, 1.9058142286220412), (776, 1.9058142286220412), (921, 1.8694288861859238)], [(1067, 1.9777410867068026), (1086, 1.9723804736012474), (695, 1.9119361550580467), (776, 1.9119361550580467), (694, 1.8787228820309982)], [(1067, 1.9777079490979383), (1086, 1.9727312579408063), (695, 1.9058142286220496), (776, 1.9058142286220496), (921, 1.8694288861859256)]]


  dist = 1.0 - uv / np.sqrt(uu * vv)


Remove the films this user has already watched, if any are encountered

In [367]:
movie_idx = {}
for similarities in lst_similarities:
  for idx, sim in similarities:
    # Skip films the user has already watched; keep the first score seen for each film
    if idx not in watched_movie_index and idx not in movie_idx:
      movie_idx[idx] = sim
# Rank the candidates by score (descending) and keep the first 5
movie_idx = sorted(movie_idx.items(), key=lambda x:x[1],reverse=True)
movie_idx_recom = [idx for idx, _ in movie_idx[:5]]
movie_idx_recom

[1067, 1086, 695, 776, 694]

In [368]:
recommended_film = afficher_films(movie_idx_recom,Muf,films)
recommended_film

Unnamed: 0,IdFilm,Titre,Genres
4708,7025,"Midnight Clear, A (1992)",Drame|Guerre
4711,7028,Quick Change (1990),Comédie|Crime
5557,26704,State of Grace (1990),Crime|Drame|Thriller
9058,142056,Iron Man & Hulk: Heroes United (2013),Action|Aventure|Animation
9444,167296,Iron Man (1931),Drame
