MOVIE RECOMMENDATION SYSTEM

We used the MovieLens 100k Small Dataset from Kaggle, which contains 2 files:
    - movies.csv, containing information about the movies (movieId, title and genres)
    - rating.csv, containing the users' ratings of the movies (userId, movieId, rating, timestamp)

The movies.csv file has 3 variables and 9125 observations; the rating.csv file has 4 variables and 100004 observations.

1- IMPORT AND DATA LOADING

In [1]:
import pandas as pd

movies = pd.read_csv("data/movies.csv", sep=";")
ratings = pd.read_csv("data/rating.csv", sep=";")

movies.head()
# ratings.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


2- MINIMAL DATA CLEANING

a- Data structure

In [2]:
movies.shape

(9125, 3)

In [3]:
ratings.shape

(100004, 4)

In [4]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9125 entries, 0 to 9124
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9125 non-null   int64 
 1   title    9125 non-null   object
 2   genres   9125 non-null   object
dtypes: int64(1), object(2)
memory usage: 214.0+ KB


In [5]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100004 entries, 0 to 100003
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100004 non-null  int64  
 1   movieId    100004 non-null  int64  
 2   rating     100004 non-null  float64
 3   timestamp  100004 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


b- Checking for missing values

In [6]:
movies.isnull().sum()

movieId    0
title      0
genres     0
dtype: int64

In [7]:
ratings.isnull().sum()

userId       0
movieId      0
rating       0
timestamp    0
dtype: int64

c- Removing duplicates

In [23]:
movies = movies.drop_duplicates(subset=["movieId"])

In [9]:
ratings = ratings.drop_duplicates(subset=["userId", "movieId"])
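As a minimal sketch of what this step does, the toy example below (hypothetical values, not taken from the dataset) shows that `drop_duplicates` keeps the first row of each (userId, movieId) pair by default. Keeping the most recent rating instead is one possible alternative, shown here as an assumption rather than as what the notebook does.

```python
import pandas as pd

# Toy ratings (hypothetical values): user 1 rated movie 10 twice.
toy_ratings = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [10, 10, 10],
    "rating": [3.0, 5.0, 4.0],
    "timestamp": [100, 200, 150],
})

# Default behaviour: keep the first row of each (userId, movieId) pair.
first_kept = toy_ratings.drop_duplicates(subset=["userId", "movieId"])

# Alternative (assumption): sort by timestamp and keep the latest rating.
latest_kept = (
    toy_ratings.sort_values("timestamp")
    .drop_duplicates(subset=["userId", "movieId"], keep="last")
    .sort_index()
)

print(first_kept["rating"].tolist())
print(latest_kept["rating"].tolist())
```

Which variant is preferable depends on whether a later rating should be read as a correction of an earlier one.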

d- Cleaning the genres column

In [10]:
movies['genres'] = movies['genres'].str.replace('|', ' ', regex=False)

e- Removing movies without a genre

In [11]:
movies = movies[movies['genres'] != '(no genres listed)']

This step is needed because a movie with no genre cannot be compared with the others on the basis of genre.
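One side effect of filtering rows is worth keeping in mind: the surviving rows keep their original index labels, so labels and row positions no longer coincide. This matters later because a similarity matrix built from the DataFrame is indexed by position. The toy sketch below (hypothetical data) illustrates the gap and two ways of handling it, `Index.get_loc` and `reset_index(drop=True)`.

```python
import pandas as pd

# Toy catalogue (hypothetical): the middle movie has no genre.
toy_movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["A", "B", "C"],
    "genres": ["Comedy", "(no genres listed)", "Drama"],
})

filtered = toy_movies[toy_movies["genres"] != "(no genres listed)"]

# Movie "C" keeps label 2 but now sits at row position 1.
label_of_c = filtered[filtered["title"] == "C"].index[0]
position_of_c = filtered.index.get_loc(label_of_c)

# Option: rebuild a clean 0..n-1 index so labels and positions match again.
realigned = filtered.reset_index(drop=True)

print(label_of_c, position_of_c, list(realigned.index))
```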

In [19]:
movies.head(10)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure Animation Children Comedy Fantasy
1,2,Jumanji (1995),Adventure Children Fantasy
2,3,Grumpier Old Men (1995),Comedy Romance
3,4,Waiting to Exhale (1995),Comedy Drama Romance
4,5,Father of the Bride Part II (1995),Comedy
5,6,Heat (1995),Action Crime Thriller
6,7,Sabrina (1995),Comedy Romance
7,8,Tom and Huck (1995),Adventure Children
8,9,Sudden Death (1995),Action
9,10,GoldenEye (1995),Action Adventure Thriller


3- Implementing the core of the recommendation system

a- Case 1: Content-Based: recommendation based only on genres

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer  # turns text into numeric vectors
from sklearn.metrics.pairwise import cosine_similarity  # computes the similarity between vectors

# create the TF-IDF vectorizer; at this point it has not learned anything yet
tfidf = TfidfVectorizer()

# learn the genre vocabulary (fit), then convert each movie's genres into a numeric vector (transform)
tfidf_matrix = tfidf.fit_transform(movies['genres'])

# compute the pairwise similarity between movies
cosine_sim = cosine_similarity(tfidf_matrix)

# takes a movie title and returns 10 recommended movies
def recommend_movies(movie_title, cosine_sim=cosine_sim):

    # check that the movie exists
    if movie_title not in movies['title'].values:
        return "Movie not found"

    # get the row position of the movie: cosine_sim is indexed by position, not by
    # index label, and the two can differ once rows have been filtered out
    idx = movies.index.get_loc(movies[movies['title'] == movie_title].index[0])

    # get the similarities between this movie and every other movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # sort the movies by decreasing similarity to the input movie
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # keep 10 entries after skipping the first one, which is assumed to be the movie itself
    # (movies with identical genres tie at 1.0, so the movie can also appear further down)
    sim_scores = sim_scores[1:11]

    # get the row positions of the recommended movies
    movie_indices = [i[0] for i in sim_scores]

    # return the titles and genres of the recommended movies
    return movies.iloc[movie_indices][['title', 'genres']]

recommend_movies("Mr. Holland's Opus (1995)")

Unnamed: 0,title,genres
25,Othello (1995),Drama
30,Dangerous Minds (1995),Drama
38,"Cry, the Beloved Country (1995)",Drama
41,Restoration (1995),Drama
52,Georgia (1995),Drama
53,Home for the Holidays (1995),Drama
58,Mr. Holland's Opus (1995),Drama
104,Margaret's Museum (1995),Drama
109,"Boys of St. Vincent, The (1992)",Drama
112,"Star Maker, The (Uomo delle stelle, L') (1995)",Drama
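The listing above contains Mr. Holland's Opus (1995) itself. Every movie whose only genre is Drama has a similarity of 1.0 with it, so after sorting the query movie is not necessarily in first place, and skipping the first entry removes a different movie. A possible tie-safe variant, sketched on toy data, drops the query movie by position before ranking. The names `toy_movies`, `toy_sim` and `recommend_excluding_self` are hypothetical, and one-hot genre vectors stand in for the TF-IDF rows.

```python
import numpy as np
import pandas as pd

# Toy catalogue (hypothetical): three movies share exactly the same genre.
toy_movies = pd.DataFrame({
    "title": ["Drama A", "Drama B", "Comedy C", "Drama D"],
    "genres": ["Drama", "Drama", "Comedy", "Drama"],
})

# One-hot genre vectors stand in for TF-IDF rows; identical genres give 1.0.
vectors = pd.get_dummies(toy_movies["genres"]).to_numpy(dtype=float)
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
toy_sim = unit @ unit.T

def recommend_excluding_self(title, n=2):
    matches = np.flatnonzero(toy_movies["title"].to_numpy() == title)
    if matches.size == 0:
        return None
    pos = int(matches[0])
    # Remove the query movie explicitly instead of assuming it sorts first.
    scores = pd.Series(toy_sim[pos]).drop(pos)
    top = scores.sort_values(ascending=False).head(n)
    return toy_movies.iloc[top.index.to_numpy()][["title", "genres"]]

result = recommend_excluding_self("Drama B")
print(result)
```

Because many movies tie at the same score, the order among them is arbitrary with genre information alone; a secondary criterion such as average rating could be used to break ties, which is left as a suggestion here.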


b- Case 2: Collaborative Filtering

Item-based: we compare movies with one another and look for movies that users rate in a similar way. The underlying intuition of collaborative filtering is this: if users A and B like the same movies and B liked a movie that A has not seen yet, that movie is recommended to A.

b1- Build the user-movie matrix

In [20]:
user_movie_matrix = ratings.pivot_table(
    index = "userId",
    columns = "movieId",
    values = "rating"
)

user_movie_matrix.head(10)

userId,1,2,3,4,5,6,7,8,9,10,...,161084,161155,161594,161830,161918,161944,162376,162542,162672,163949
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,4.0,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,4.0,...,,,,,,,,,,
5,,,4.0,,,,,,,,...,,,,,,,,,,
6,,,,,,,,,,,...,,,,,,,,,,
7,3.0,,,,,,,,,3.0,...,,,,,,,,,,
8,,,,,,,,,,,...,,,,,,,,,,
9,4.0,,,,,,,,,,...,,,,,,,,,,
10,,,,,,,,,,,...,,,,,,,,,,


b2- Replace unrated movies with 0

In [21]:
user_movie_matrix = user_movie_matrix.fillna(0)

user_movie_matrix.head(10)

userId,1,2,3,4,5,6,7,8,9,10,...,161084,161155,161594,161830,161918,161944,162376,162542,162672,163949
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


b3- Similarity between movies (we compare the columns)

In [16]:
movie_similarity = cosine_similarity(user_movie_matrix.T)
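To make the computation concrete, here is a small hand-rolled version on a toy matrix (hypothetical ratings). Each movie is a column of user ratings, and the cosine similarity of two columns is their dot product divided by the product of their norms. The helper `cosine` is an illustrative name, not part of the notebook.

```python
import numpy as np

# Rows are users, columns are movies; 0 means "not rated", as after fillna(0).
toy_matrix = np.array([
    [5.0, 4.0, 0.0],
    [4.0, 5.0, 0.0],
    [0.0, 0.0, 3.0],
])

def cosine(u, v):
    # dot(u, v) / (||u|| * ||v||)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Transposing turns movie columns into rows, as in user_movie_matrix.T.
movie_a, movie_b, movie_c = toy_matrix.T

sim_ab = cosine(movie_a, movie_b)  # rated by the same users: 40 / 41
sim_ac = cosine(movie_a, movie_c)  # no user in common: 0.0

print(sim_ab, sim_ac)
```

Note that the zeros are treated as real values in the dot product and the norms, so a missing rating weighs the same as the lowest possible score; this is a known simplification of the fill-with-zero approach.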

b4- Build a similarity DataFrame

In [22]:
movie_similarity_df = pd.DataFrame(
    movie_similarity,
    index = user_movie_matrix.columns,
    columns = user_movie_matrix.columns
)

movie_similarity_df.head(10)

movieId,1,2,3,4,5,6,7,8,9,10,...,161084,161155,161594,161830,161918,161944,162376,162542,162672,163949
1,1.0,0.394511,0.306516,0.133614,0.245102,0.377086,0.278629,0.063031,0.117499,0.310689,...,0.055829,0.031902,0.079755,0.079755,0.079755,0.079755,0.079755,0.0,0.0,0.055829
2,0.394511,1.0,0.217492,0.164651,0.278476,0.222003,0.207299,0.223524,0.113669,0.418124,...,0.0,0.055038,0.068797,0.082557,0.082557,0.137594,0.068797,0.0,0.0,0.0
3,0.306516,0.217492,1.0,0.177012,0.370732,0.247499,0.435648,0.127574,0.306717,0.191255,...,0.0,0.0,0.0,0.116226,0.116226,0.0,0.0,0.0,0.0,0.0
4,0.133614,0.164651,0.177012,1.0,0.179556,0.072518,0.184626,0.501513,0.25463,0.111447,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.245102,0.278476,0.370732,0.179556,1.0,0.272645,0.388476,0.194113,0.367941,0.246846,...,0.0,0.176845,0.0,0.117897,0.117897,0.0,0.0,0.0,0.0,0.0
6,0.377086,0.222003,0.247499,0.072518,0.272645,1.0,0.278855,0.097561,0.248155,0.307948,...,0.061724,0.098758,0.111103,0.0,0.0,0.0,0.111103,0.0,0.0,0.061724
7,0.278629,0.207299,0.435648,0.184626,0.388476,0.278855,1.0,0.196091,0.349827,0.177425,...,0.079399,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.079399
8,0.063031,0.223524,0.127574,0.501513,0.194113,0.097561,0.196091,1.0,0.264477,0.042169,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.117499,0.113669,0.306717,0.25463,0.367941,0.248155,0.349827,0.264477,1.0,0.130475,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10,0.310689,0.418124,0.191255,0.111447,0.246846,0.307948,0.177425,0.042169,0.130475,1.0,...,0.0,0.076835,0.076835,0.102446,0.102446,0.0,0.076835,0.0,0.0,0.0


b5- Item-based recommendation function

In [18]:
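def recommend_movies_collaborative(movie_title, n=10):

    # check that the movie exists in the catalogue
    if movie_title not in movies['title'].values:
        return None

    movie_id = movies[movies['title'] == movie_title]['movieId'].values[0]

    # a movie that nobody rated has no column in the similarity matrix
    if movie_id not in movie_similarity_df.columns:
        return None

    # rank all movies by similarity, then skip the first entry (the movie itself)
    sim_scores = movie_similarity_df[movie_id].sort_values(ascending=False)
    sim_scores = sim_scores.iloc[1:n+1]

    # look up titles and genres (rows come back in catalogue order, not ranked by similarity)
    return movies[movies['movieId'].isin(sim_scores.index)][['title', 'genres']]

recommend_movies_collaborative("Mr. Holland's Opus (1995)")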

Unnamed: 0,title,genres
16,Sense and Sensibility (1995),Drama Romance
34,Dead Man Walking (1995),Crime Drama
122,"Birdcage, The (1996)",Comedy
561,Mission: Impossible (1996),Action Adventure Mystery Thriller
599,"Truth About Cats & Dogs, The (1996)",Comedy Romance
615,"Rock, The (1996)",Action Adventure Thriller
644,Independence Day (a.k.a. ID4) (1996),Action Adventure Sci-Fi Thriller
658,Phenomenon (1996),Drama Romance
661,"Time to Kill, A (1996)",Drama Thriller
866,Willy Wonka & the Chocolate Factory (1971),Children Comedy Fantasy Musical
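The listing above is in catalogue order because the final lookup uses `isin`, which discards the similarity ranking. If the ranking should be preserved, one option is to look titles up in the order of the sorted scores. The sketch below does this on toy data; `toy_movies`, `toy_similarity` and `recommend_ranked` are hypothetical names and the values are made up.

```python
import pandas as pd

# Toy catalogue and item-item similarity matrix (hypothetical values).
toy_movies = pd.DataFrame({
    "movieId": [10, 20, 30, 40],
    "title": ["A", "B", "C", "D"],
})
ids = [10, 20, 30, 40]
toy_similarity = pd.DataFrame(
    [[1.0, 0.2, 0.9, 0.5],
     [0.2, 1.0, 0.1, 0.3],
     [0.9, 0.1, 1.0, 0.4],
     [0.5, 0.3, 0.4, 1.0]],
    index=ids, columns=ids,
)

def recommend_ranked(title, n=2):
    match = toy_movies.loc[toy_movies["title"] == title, "movieId"]
    if match.empty or match.iloc[0] not in toy_similarity.columns:
        return None
    movie_id = match.iloc[0]
    # Drop the movie itself by id, then keep the n highest similarities.
    scores = toy_similarity[movie_id].drop(movie_id)
    scores = scores.sort_values(ascending=False).head(n)
    # Look titles up in score order so the ranking survives.
    titles = toy_movies.set_index("movieId")["title"].reindex(scores.index)
    return pd.DataFrame({
        "movieId": scores.index,
        "title": titles.to_numpy(),
        "similarity": scores.to_numpy(),
    })

ranked = recommend_ranked("A")
print(ranked)
```

Returning the similarity alongside the title also makes it easier to judge how strong each recommendation is.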


c- Case 3: Hybrid Technique: this method combines Content-Based and Collaborative Filtering

The aim is to improve recommendation quality by using both signals, genre similarity and user rating behaviour, at the same time.

Hybrid function

In [35]:
def recommend_movies_hybrid(movie_title, n=10, w_content=0.5, w_collaborative=0.5):

    if movie_title not in movies['title'].values:
        return "Movie not found"

    # index label of the movie, and its row position for the content-based matrix
    # (cosine_sim is positional; labels and positions differ once rows are filtered out)
    movie_label = movies[movies['title'] == movie_title].index[0]
    movie_idx = movies.index.get_loc(movie_label)

    # movieId for the collaborative part
    movie_id = movies.loc[movie_label, 'movieId']

    # CONTENT SCORES, indexed by movieId
    content_scores = pd.Series(
        cosine_sim[movie_idx],
        index=movies['movieId']
    )

    # COLLABORATIVE SCORES (all zeros if the movie has no ratings)
    if movie_id in movie_similarity_df.columns:
        collaborative_scores = movie_similarity_df[movie_id]
    else:
        collaborative_scores = pd.Series(0, index=movies['movieId'])

    # HYBRID SCORE: weighted sum, aligned on movieId
    # (a movie missing from either series gets NaN and is sorted last)
    hybrid_scores = (
        w_content * content_scores +
        w_collaborative * collaborative_scores
    )

    # remove the movie itself
    hybrid_scores = hybrid_scores.drop(movie_id)

    # top N
    top_movies = hybrid_scores.sort_values(ascending=False).head(n)

    # look up titles and genres (rows come back in catalogue order, not ranked by score)
    return movies[movies['movieId'].isin(top_movies.index)][['title', 'genres']]

recommend_movies_hybrid("Toy Story (1995)")


Unnamed: 0,title,genres
521,Aladdin (1992),Adventure Animation Children Comedy Musical
1815,Antz (1998),Adventure Animation Children Comedy Fantasy
1866,"Bug's Life, A (1998)",Adventure Animation Children Comedy
2506,Toy Story 2 (1999),Adventure Animation Children Comedy Fantasy
3217,"Emperor's New Groove, The (2000)",Adventure Animation Children Comedy Fantasy
3419,Shrek (2001),Adventure Animation Children Comedy Fantasy Ro...
3805,"Monsters, Inc. (2001)",Adventure Animation Children Comedy Fantasy
4002,Ice Age (2002),Adventure Animation Children Comedy
4610,Finding Nemo (2003),Adventure Animation Children Comedy
5626,"Incredibles, The (2004)",Action Adventure Animation Children Comedy


Conclusion: The hybrid system combines content similarity and user behaviour through a weighted average in order to improve the relevance of the recommendations.