# Movie data analysis

We propose working on data describing films. The scope is broad, and you are assessed on your proposals and your methodology more than on your results.

The starting data is available at:
https://grouplens.org/datasets/movielens/
in CSV format.

We focus in particular on the **MovieLens 20M Dataset**. Among other things, this dataset provides:
* The film's identifier in IMDb and TMDb (this will matter later)
* The film's category or categories (genres)
* The film's title
* The ratings given to films by users

To make the project more interesting, we add data on the actors and producers associated with the films (retrieved from TMDb). This data is available at the following links:

http://webia.lip6.fr/~guigue/film_v2.pkl <br>
http://webia.lip6.fr/~guigue/act_v2.pkl <br>
http://webia.lip6.fr/~guigue/crew_v2.pkl

These files contain, respectively: a new description of the films (including the TMDb identifier, the average user rating, the release date, ...), a description of the actors of each film, and a description of the crew (screenwriter, producer, director) of each film.

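Each file holds a list of length 26908, with one element per film. For the film file that element is a dictionary; for the cast and crew files it is a list of dictionaries, one per person (see the `actors_pkl[0]` output further down). Examine the dictionary keys to retrieve the useful information.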

**WARNING** Because of constraints on retrieving information online, the MovieLens base contains 27278 films but the files above contain only 26908. The simplest approach is probably to discard the MovieLens films that are not in this second base.
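
The filtering suggested above can be sketched as follows. This is a minimal illustration on toy data, assuming that `links.csv` carries a `tmdbId` column and that each film dictionary exposes its TMDb identifier under the key `id` (both appear in the cells further down); the toy values and the name `links_kept` are illustrative only.

```python
import pandas as pd

# Toy stand-ins for links.csv (MovieLens) and film_v2.pkl (TMDb).
links = pd.DataFrame({"movieId": [1, 2, 3], "tmdbId": [862.0, 8844.0, None]})
movies_pkl = [{"id": 862, "title": "Toy Story"}, {"id": 8844, "title": "Jumanji"}]

# TMDb identifiers actually present in the enriched files.
tmdb_ids = {film["id"] for film in movies_pkl}

# Rows without a TMDb identifier cannot be matched, so drop them first,
# then compare integers with integers.
links = links.dropna(subset=["tmdbId"]).astype({"tmdbId": "int64"})
links_kept = links[links["tmdbId"].isin(list(tmdb_ids))].reset_index(drop=True)

print(len(links_kept), "MovieLens films kept")
```

An inner `pd.merge` on `tmdbId` has the same effect and is what the notebook does later; the explicit filter just makes the number of discarded films easy to inspect.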

## General guidelines for the data analysis

You must propose several analyses of the data, which must at a minimum use the following techniques:

1. Shape the data to identify the actors and the categories, and index them
1. Address at least one supervised regression problem (for example, predicting the average rating users give a film).
1. Address at least one supervised classification problem (for example, predicting a film's category)
1. Use the categorical data (categories, actors, ...) both in a discrete way AND in a continuous way (*dummy coding*) in different approaches
1. Propose at least one unsupervised clustering approach (to group actors, for example)
1. Run an experimental campaign comparing performance on a problem as a function of a parameter's values (and thus, ultimately, find the best value of the parameter)
1. Propose a few illustrations
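
The experimental campaign in the list above (comparing performance as a function of one parameter) could look like the sketch below. It is only an illustration under stated assumptions: the data is synthetic, the model is a hand-written k-nearest-neighbours regressor rather than anything from the `iads` library, and `knn_predict`, `candidate_ks` and `val_mse` are hypothetical names.

```python
import numpy as np

# Synthetic regression data: a noisy sine curve (not the film data).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

# Hold out the last 50 points for validation.
X_train, y_train = X[:150], y[:150]
X_val, y_val = X[150:], y[150:]

def knn_predict(X_train, y_train, X_query, k):
    """Predict each query target as the mean target of its k nearest training points."""
    dists = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

# Sweep the parameter k and record the validation mean squared error for each value.
candidate_ks = [1, 3, 5, 10, 25, 50, 150]
val_mse = {
    k: float(np.mean((knn_predict(X_train, y_train, X_val, k) - y_val) ** 2))
    for k in candidate_ks
}
best_k = min(val_mse, key=val_mse.get)
print(best_k, val_mse[best_k])
```

Plotting `val_mse` against `k` gives one of the requested illustrations. If the selected value is later reported as a final score, it should be measured on data that was used neither for training nor for this selection.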

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import pickle as pkl

import sys
sys.path.append('../')

# Import the iads library
import iads as iads

# Import LabeledSet
from iads import LabeledSet as ls

# Import Classifiers
from iads import Classifiers as cl

# Import utils
from iads import utils as ut

%load_ext autoreload
%autoreload 1
from iads import clustering as ct

## Loading the data (MovieLens base + enrichments)

In [2]:
actors = pd.read_csv("movielens/actors_most_pop_genres.csv", sep=",")

In [3]:
merge_matrix = ct.clustering_hierarchique(actors)

In [5]:
import pickle
with open("movielens/actors_hier_clust.pkl", "wb") as file:
    pickle.dump(merge_matrix, file, protocol=pickle.HIGHEST_PROTOCOL)

# Don't go under this cell

In [4]:
centres, matrix = km.kmoyennes(100, actors, 0.05, 100)

KeyboardInterrupt: 

In [4]:
actors.head()

Unnamed: 0,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,0,3,3,0,0,3,1,14,0,2,0,0,0,0,0,0,1,1,0,0
1,0,12,12,10,4,9,4,3,3,2,0,3,0,0,2,0,13,3,1,0
2,0,19,12,0,0,14,8,3,24,1,0,1,2,0,3,7,7,14,2,3
3,0,6,8,0,0,26,4,5,11,2,0,2,0,2,3,5,4,5,0,0
4,0,4,10,0,1,4,5,0,9,4,0,34,0,0,10,3,13,5,1,0


In [3]:
len(actors)

164833

In [3]:
merge_matrix = ct.clustering_hierarchique(actors)

164833


KeyboardInterrupt: 
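
Both clustering calls above were interrupted. One likely reason is size: with 164833 rows, any method that compares every pair of actors has to handle roughly 1.36e10 pairs. The sketch below shows one way to shrink the table before clustering. It assumes, and this is not in the original notebook, that actors with very small genre counts can be dropped and that a random sample is acceptable. `n_pairs`, `subsample_active`, `min_total` and `n_max` are hypothetical names, and the toy table only stands in for `actors`.

```python
import pandas as pd

def n_pairs(n):
    """Number of unordered pairs among n items."""
    return n * (n - 1) // 2

def subsample_active(counts, min_total, n_max, seed=0):
    """Keep rows whose counts sum to at least min_total, then sample at most n_max rows."""
    active = counts[counts.sum(axis=1) >= min_total]
    if len(active) > n_max:
        active = active.sample(n=n_max, random_state=seed)
    return active

# Toy genre-count table (values are illustrative).
toy = pd.DataFrame({"Action": [3, 0, 12, 1], "Drama": [0, 1, 3, 0]})
small = subsample_active(toy, min_total=2, n_max=2)

print(n_pairs(164833), "pairs for the full table;", len(small), "rows kept in the toy example")
```

Note that the row sum counts genre occurrences, not films, since one film can carry several genres.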

# Don't go under this cell

In [2]:
movies_movielens = pd.read_csv("movielens/movies.csv", sep=",")
movies_worldwide = pd.read_csv("movielens/movies_worldwide.csv", sep=",")
movies_domestic = pd.read_csv("movielens/movies_domestic.csv", sep=",", encoding="ISO-8859-1")
scores = pd.read_csv("movielens/genome-scores.csv", sep=",")
ratings = pd.read_csv("movielens/ratings.csv", sep=",")
links = pd.read_csv("movielens/links.csv", sep=",")
tags = pd.read_csv("movielens/tags.csv", sep=",")
genome_tags = pd.read_csv("movielens/genome-tags.csv", sep=",")



In [3]:
# extract the four-digit release year from titles such as "Toy Story (1995)"
movies_movielens['year'] = movies_movielens.title.str.extract(r"\((\d{4})\)", expand=True)
# disabled alternative: set the release year to 0 for movies without one
#movies_movielens['year'] = pd.to_numeric(movies_movielens['year'], errors='raise').fillna(0).astype(np.int64)
# strip the year from the title (regex=True is required by recent pandas versions)
movies_movielens['title'] = movies_movielens.title.str.replace(r"\s\((\d{4})\)", "", regex=True)
movies_movielens.year = pd.to_numeric(movies_movielens.year, errors='coerce',
                                       downcast='integer')
ratings["timestamp"] = pd.to_datetime(ratings["timestamp"], unit='s')
ratings["timestamp"] = ratings["timestamp"].dt.date
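
The year extraction above can be checked on a handful of titles before running it on the full table. The sketch below is an illustration on toy data; the third title is made up to show what happens when no year is present, and the nullable `Int64` dtype is a suggestion, not what the notebook uses.

```python
import pandas as pd

titles = pd.Series(["Toy Story (1995)", "Jumanji (1995)", "Untitled Project"])

# Capture the four-digit year in parentheses; titles without one give NaN.
year = pd.to_numeric(titles.str.extract(r"\((\d{4})\)", expand=False), errors="coerce")

# A nullable integer dtype keeps missing years as <NA> instead of turning every year into a float.
year = year.astype("Int64")

# Remove the "(YYYY)" suffix and any leftover surrounding whitespace.
clean_title = titles.str.replace(r"\s*\(\d{4}\)", "", regex=True).str.strip()

print(pd.DataFrame({"title": clean_title, "year": year}))
```

Keeping the year as an integer avoids values such as `1995.0` showing up in merged tables, and makes a later merge on `['title', 'year']` compare like with like.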

In [4]:
# keep the full table (with titles) as movies_title; movies_movielens loses the title column
movies_title = movies_movielens
movies_movielens = movies_movielens.drop(['title'], axis=1)
movies_worldwide.drop(['homepage', 'poster_path', 'status', 'tagline', 'adult', \
                  'belongs_to_collection', 'genres', 'original_language', \
                  'original_title', 'overview', 'spoken_languages', 'video', \
                    'budget', 'imdb_id', 'popularity', 'production_companies', \
                       'production_countries', 'release_date', 'runtime', 'title', \
                       'vote_average', 'vote_count'],
                 inplace=True, axis=1)
movies_worldwide.rename(columns={'revenue': 'worldwide_gross', 'id': 'tmdbId'}, inplace=True)
movies_worldwide.tmdbId = pd.to_numeric(movies_worldwide.tmdbId, errors='coerce', \
                                       downcast='integer')
movies_domestic.drop(['genre', 'released', 'rating'], inplace=True, axis=1)
movies_domestic.rename(columns={'gross': 'domestic_gross', 'score': 'score_imdb', \
                              'votes': 'votes_imdb', 'name': 'title'}, inplace=True)

In [5]:
import itertools
with open("movielens/act_v2.pkl", "rb") as f:
    actors_pkl = pkl.load(f)
with open("movielens/film_v2.pkl", "rb") as f:
    movies_pkl = pkl.load(f)
with open("movielens/crew_v2.pkl", "rb") as f:
    crew_pkl = pkl.load(f)

# add the movie id (TMDb) to every actor in the cast
for cast, film in zip(actors_pkl, movies_pkl):
    for actor in cast:
        actor['tmdbId'] = film['id']
        
# add the movie id (TMDb) to every member of the crew
for crew, film in zip(crew_pkl, movies_pkl):
    for worker in crew:
        worker['tmdbId'] = film['id']

merged_actors = list(itertools.chain(*actors_pkl))
merged_crew =  list(itertools.chain(*crew_pkl))
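
The cell above tags every cast and crew entry with its film's TMDb id by modifying the loaded dictionaries in place, then flattens the nested lists. A variant that leaves the loaded data untouched is sketched below on toy data. It relies on the same assumption as the notebook, namely that the cast list and the film list are aligned position by position; `rows` is a hypothetical name.

```python
from itertools import chain

# Toy stand-ins for film_v2.pkl and act_v2.pkl.
movies = [{"id": 862}, {"id": 8844}]
casts = [
    [{"name": "Tom Hanks"}, {"name": "Tim Allen"}],
    [{"name": "Robin Williams"}],
]

# Build new dictionaries instead of mutating the loaded ones,
# so the step can be re-run without side effects.
rows = list(chain.from_iterable(
    ({**member, "tmdbId": film["id"]} for member in cast)
    for cast, film in zip(casts, movies)
))

print(rows)
```

The resulting list of flat dictionaries can be passed directly to `pd.DataFrame`, as the next cell does with `merged_actors`.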

In [6]:
actors_tmdb = pd.DataFrame(merged_actors)
crew_tmdb = pd.DataFrame(merged_crew)
movies_tmdb = pd.DataFrame(movies_pkl)

In [7]:
actors_tmdb.drop(['profile_path', 'credit_id'], inplace=True, axis=1)
crew_tmdb.drop(['profile_path', 'credit_id'], inplace=True, axis=1)
movies_tmdb.drop(['backdrop_path', 'poster_path', 'overview', 'video', 'adult', \
                  'genre_ids'], inplace=True, axis=1)
movies_tmdb.rename(columns={'id': 'tmdbId', 'vote_average': 'score_tmdb', 'vote_count': 'votes_tmdb'}, inplace=True)

In [8]:
movies_ids = pd.merge(movies_tmdb, links, on='tmdbId')
movies_genres = pd.merge(movies_ids, movies_movielens, on='movieId')
movies_ww_gross = pd.merge(movies_genres, movies_worldwide, on='tmdbId')
movies_all_gross = pd.merge(movies_ww_gross, movies_domestic, on=['title', 'year'])

In [9]:
len(movies_ww_gross), len(movies_domestic), len(movies_all_gross)

(26745, 6820, 5563)
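
The counts above show that the merge on `['title', 'year']` keeps 5563 rows. To see which rows fail to match on each side, an outer merge with `indicator=True` labels the origin of every row. The sketch below uses toy tables: the two gross figures for Toy Story and Jumanji come from outputs in this notebook, while "Example Film" and its value are hypothetical.

```python
import pandas as pd

left = pd.DataFrame({
    "title": ["Toy Story", "Jumanji", "Example Film"],
    "year": [1995, 1995, 2000],
})
right = pd.DataFrame({
    "title": ["Toy Story", "Example Film"],
    "year": [1995, 2001],  # the second year deliberately disagrees with `left`
    "domestic_gross": [191796233.0, 1.0],
})

# The outer join keeps every row; the _merge column says where each row was found.
merged = pd.merge(left, right, on=["title", "year"], how="outer", indicator=True)
counts = merged["_merge"].value_counts()

print(counts)
print(merged[merged["_merge"] != "both"])
```

Rows labelled `left_only` or `right_only` are the ones an inner merge silently drops. Inspecting them usually reveals whether the loss comes from spelling differences in titles or from mismatched years.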

In [10]:
movies_ww_gross.head(2)

Unnamed: 0,tmdbId,original_language,original_title,popularity,release_date,title,score_tmdb,votes_tmdb,movieId,imdbId,genres,year,worldwide_gross
0,862,en,Toy Story,22.773,1995-10-30,Toy Story,7.9,9550,1,114709,Adventure|Animation|Children|Comedy|Fantasy,1995.0,373554033.0
1,8844,en,Jumanji,2.947,1995-12-15,Jumanji,7.1,5594,2,113497,Adventure|Children|Fantasy,1995.0,262797249.0


In [11]:
movies_domestic[movies_domestic.title == 'Toy Story']

Unnamed: 0,budget,company,country,director,domestic_gross,title,runtime,score_imdb,star,votes_imdb,writer,year
1988,30000000.0,Pixar Animation Studios,USA,John Lasseter,191796233.0,Toy Story,81,8.3,Tom Hanks,694113,John Lasseter,1995


In [12]:
movies_all_gross.head(2)

Unnamed: 0,tmdbId,original_language,original_title,popularity,release_date,title,score_tmdb,votes_tmdb,movieId,imdbId,...,budget,company,country,director,domestic_gross,runtime,score_imdb,star,votes_imdb,writer
0,862,en,Toy Story,22.773,1995-10-30,Toy Story,7.9,9550,1,114709,...,30000000.0,Pixar Animation Studios,USA,John Lasseter,191796233.0,81,8.3,Tom Hanks,694113,John Lasseter
1,8844,en,Jumanji,2.947,1995-12-15,Jumanji,7.1,5594,2,113497,...,50000000.0,TriStar Pictures,USA,Joe Johnston,100475249.0,104,6.9,Robin Williams,232339,Jonathan Hensleigh


In [13]:
movies_all_gross.columns

Index(['tmdbId', 'original_language', 'original_title', 'popularity',
       'release_date', 'title', 'score_tmdb', 'votes_tmdb', 'movieId',
       'imdbId', 'genres', 'year', 'worldwide_gross', 'budget', 'company',
       'country', 'director', 'domestic_gross', 'runtime', 'score_imdb',
       'star', 'votes_imdb', 'writer'],
      dtype='object')

In [14]:
movies_all_gross.sort_values(by='popularity', inplace=False, ascending=False)

Unnamed: 0,tmdbId,original_language,original_title,popularity,release_date,title,score_tmdb,votes_tmdb,movieId,imdbId,...,budget,company,country,director,domestic_gross,runtime,score_imdb,star,votes_imdb,writer
5396,118340,en,Guardians of the Galaxy,53.156,2014-07-30,Guardians of the Galaxy,7.9,16698,112852,2015381,...,170000000.0,Marvel Studios,USA,James Gunn,333176600.0,121,8.1,Chris Pratt,791340,James Gunn
4798,24428,en,The Avengers,51.700,2012-04-25,The Avengers,7.6,18191,89745,848228,...,220000000.0,Marvel Studios,USA,Joss Whedon,623357910.0,143,8.1,Robert Downey Jr.,1064633,Joss Whedon
2725,22,en,Pirates of the Caribbean: The Curse of the Bla...,41.969,2003-07-09,Pirates of the Caribbean: The Curse of the Bla...,7.7,11677,6539,325980,...,140000000.0,Walt Disney Pictures,USA,Gore Verbinski,305413918.0,143,8.0,Johnny Depp,886092,Ted Elliott
2279,120,en,The Lord of the Rings: The Fellowship of the Ring,41.258,2001-12-18,The Lord of the Rings: The Fellowship of the Ring,8.3,13695,4993,120737,...,93000000.0,New Line Cinema,New Zealand,Peter Jackson,315544750.0,178,8.8,Elijah Wood,1352483,J.R.R. Tolkien
2527,672,en,Harry Potter and the Chamber of Secrets,40.198,2002-11-13,Harry Potter and the Chamber of Secrets,7.7,11011,5816,295297,...,100000000.0,1492 Pictures,UK,Chris Columbus,261988482.0,161,7.4,Daniel Radcliffe,432417,J.K. Rowling
5427,198663,en,The Maze Runner,39.503,2014-09-10,The Maze Runner,7.1,10149,114180,1790864,...,34000000.0,Twentieth Century Fox Film Corporation,USA,Wes Ball,102427862.0,113,6.8,Dylan O'Brien,344991,Noah Oppenheim
5318,157336,en,Interstellar,39.401,2014-11-05,Interstellar,8.2,17521,109487,816692,...,165000000.0,Paramount Pictures,USA,Christopher Nolan,188020017.0,169,8.6,Matthew McConaughey,1095553,Jonathan Nolan
5488,122917,en,The Hobbit: The Battle of the Five Armies,39.140,2014-12-10,The Hobbit: The Battle of the Five Armies,7.2,7908,118696,2310332,...,250000000.0,New Line Cinema,New Zealand,Peter Jackson,255119788.0,144,7.4,Ian McKellen,396797,Fran Walsh
4547,27205,en,Inception,38.838,2010-07-15,Inception,8.3,21060,79132,1375666,...,160000000.0,Warner Bros.,USA,Christopher Nolan,292576195.0,148,8.8,Leonardo DiCaprio,1629342,Christopher Nolan
5273,109445,en,Frozen,33.732,2013-11-27,Frozen,7.3,9175,106696,2294629,...,150000000.0,Walt Disney Animation Studios,USA,Chris Buck,400738009.0,102,7.5,Kristen Bell,464149,Jennifer Lee


In [15]:
movies_reduced = movies_all_gross.drop(['original_title', 'release_date'], axis=1)

In [16]:
genres_unique = pd.DataFrame(movies_movielens.genres.str.split('|').tolist()).stack().unique()
genres_unique = pd.DataFrame(genres_unique, columns=['genre'])
genres_unique

Unnamed: 0,genre
0,Adventure
1,Animation
2,Children
3,Comedy
4,Fantasy
5,Romance
6,Drama
7,Action
8,Crime
9,Thriller


In [17]:
# Build a dictionary of all actors (actor => index)
# plus the inverse dictionary (index => actor)
# and the films each actor appears in (actor => list of TMDb ids)
actor_index = dict()
index_actor = dict()
actor_films = dict()
for film in actors_pkl:
    for act in film:
        # setdefault assigns a value to the key only if the key is not already in use
        res = actor_index.setdefault(act['name'], len(actor_index))
        if res == len(actor_index)-1:
            index_actor[len(actor_index)-1] = act['name']
        actor_films.setdefault(act['name'], []).append(act['tmdbId'])

# Example of a further transformation
# In how many films of the base does Tom Hanks appear? (Answer: 57)
# In how many comedies...

# => New features are easy to create, and they
# provide useful information for some tasks

movies_tmdb.dropna(inplace=True)

movies_movielens.dropna(inplace=True)

tags.dropna(inplace=True)

In [18]:
movies_movielens.sort_values(by='movieId', inplace=True)
ratings.sort_values(by='movieId', inplace=True)
tags.sort_values(by='movieId', inplace=True)
links.sort_values(by='movieId', inplace=True)
scores.sort_values(by='movieId', inplace=True)


In [19]:
movies_ratings = pd.merge(movies_movielens, ratings, on='movieId')

In [27]:
rating_count = pd.DataFrame(movies_ratings.groupby('movieId', as_index = False)['rating']
                           .count().rename(columns={'rating' : 'votes_ml'}))

In [28]:
rating_count.sort_values('votes_ml',ascending=False).head()

Unnamed: 0,movieId,votes_ml
315,318,97999
352,356,97040
293,296,92406
587,593,87899
2487,2571,84545


In [29]:
rating_score = pd.DataFrame(movies_ratings.groupby('movieId', as_index = False)['rating']
                            .mean().rename(columns={'rating' : 'score_ml'}))
rating_score.sort_values('score_ml',ascending=False).head(5)

Unnamed: 0,movieId,score_ml
43805,169338,5.0
51585,187729,5.0
45018,172149,5.0
40484,160966,5.0
30667,134387,5.0


In [31]:
movies_ratings_ml = pd.merge(rating_count, rating_score, on='movieId')
movies_ratings_ml.sort_values('votes_ml',ascending=False).head(5)

Unnamed: 0,movieId,votes_ml,score_ml
315,318,97999,4.424188
352,356,97040,4.056585
293,296,92406,4.173971
587,593,87899,4.151412
2487,2571,84545,4.149695
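
The vote count and the mean score can also be computed in a single pass with named aggregation. The sketch below adds a filter on a minimum number of votes, on the assumption (suggested by the 5.0 averages above, but not verified here) that the highest averages come from films with very few ratings. The threshold `min_votes` is a hypothetical parameter and the ratings are toy values.

```python
import pandas as pd

ratings = pd.DataFrame({
    "movieId": [1, 1, 1, 2],
    "rating": [4.0, 3.5, 4.5, 5.0],
})

# One groupby producing both columns at once.
stats = (ratings.groupby("movieId", as_index=False)
         .agg(votes_ml=("rating", "count"), score_ml=("rating", "mean")))

# Keep only films with enough votes for the mean to be meaningful.
min_votes = 2
reliable = stats[stats["votes_ml"] >= min_votes]

print(reliable.sort_values("score_ml", ascending=False))
```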


In [32]:
movies_three_ratings = pd.merge(movies_all_gross, movies_ratings_ml, on='movieId')

In [33]:
movies_three_ratings.head(2)

Unnamed: 0,tmdbId,original_language,original_title,popularity,release_date,title,score_tmdb,votes_tmdb,movieId,imdbId,...,country,director,domestic_gross,runtime,score_imdb,star,votes_imdb,writer,votes_ml,score_ml
0,862,en,Toy Story,22.773,1995-10-30,Toy Story,7.9,9550,1,114709,...,USA,John Lasseter,191796233.0,81,8.3,Tom Hanks,694113,John Lasseter,68469,3.886649
1,8844,en,Jumanji,2.947,1995-12-15,Jumanji,7.1,5594,2,113497,...,USA,Joe Johnston,100475249.0,104,6.9,Robin Williams,232339,Jonathan Hensleigh,27143,3.246583


In [46]:
# store movies complete data
movies_three_ratings.to_csv("movielens/movies_complete.csv", index=False)

In [49]:
movies_genres_reduced = movies_genres.drop(['original_language', 'original_title', \
                                            'popularity', 'release_date', 'title', \
                                           'score_tmdb', 'votes_tmdb', 'movieId', \
                                           'imdbId', 'year'], axis=1)
actors_tmdb_reduced = actors_tmdb.drop(['character', 'cast_id', 'gender', 'name', 'order'],
                                       axis=1)

In [51]:
movies_dummy_genres = movies_genres_reduced.join(movies_genres_reduced.genres.str.get_dummies())
movies_dummy_genres.drop('genres', inplace=True, axis=1)

In [58]:
actors_dummy_genres = pd.merge(movies_dummy_genres, actors_tmdb_reduced, on='tmdbId')
actors_dummy_genres.drop(['tmdbId'], inplace=True, axis=1)

In [63]:
# store actors-genres dummy data
actors_dummy_genres.to_csv("movielens/actors_genres.csv", index=False)

In [62]:
actors_dummy_genres.head(2)

Unnamed: 0,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,...,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western,id
0,0,0,1,1,1,1,0,0,0,1,...,0,0,0,0,0,0,0,0,0,31
1,0,0,1,1,1,1,0,0,0,1,...,0,0,0,0,0,0,0,0,0,12898
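
The table above has one row per (film, actor) pair. One plausible next step, sketched here on toy data, is to sum these rows per actor `id` to obtain a genre-count profile, then normalise each profile so that prolific and occasional actors become comparable. Whether the notebook's per-actor genre file was produced this way is not stated, so treat this as an assumption; `per_actor` and `profile` are hypothetical names.

```python
import pandas as pd

# Toy (film, actor) rows with dummy-coded genres; `id` is the actor id.
rows = pd.DataFrame({
    "Animation": [1, 1, 0],
    "Comedy": [1, 1, 1],
    "Drama": [0, 0, 1],
    "id": [31, 12898, 31],
})

# Number of films per genre for each actor.
per_actor = rows.groupby("id").sum()

# Row-normalised profile: the share of each genre in the actor's genre occurrences.
# An actor whose counts sum to zero would give NaN here.
profile = per_actor.div(per_actor.sum(axis=1), axis=0)

print(per_actor)
print(profile)
```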


## Feature engineering
It is up to you to create the features describing the data that will improve performance on the tasks you choose to address in the project.

In [29]:
# Build a dictionary of all actors (actor => index)
# plus the inverse dictionary (index => actor)
actor_index = dict()
index_actor = dict()
for film in actors_pkl:
    for act in film:
        # setdefault assigns a value to the key only if the key is not already in use
        res = actor_index.setdefault(act['name'], len(actor_index))
        if res == len(actor_index)-1:
            index_actor[len(actor_index)-1] = act['name']

# Example of a further transformation
# In how many films of the base does Tom Hanks appear? (Answer: 57)
# In how many comedies...

# => New features are easy to create, and they
# provide useful information for some tasks
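
The two questions raised in the comments above (how many films, and how many films of a given genre, an actor appears in) can be answered with counters. This is a minimal sketch on toy data: the casts and genre lists are illustrative, and `films_per_actor` and `genre_counts` are hypothetical names.

```python
from collections import Counter, defaultdict

# Toy data: one cast list and one genre list per film, aligned by position.
casts = [
    [{"name": "Tom Hanks"}, {"name": "Tim Allen"}],
    [{"name": "Tom Hanks"}],
]
film_genres = [["Animation", "Comedy"], ["Drama"]]

films_per_actor = Counter()          # actor => number of films
genre_counts = defaultdict(Counter)  # actor => genre => number of films

for cast, film_genre_list in zip(casts, film_genres):
    # Use a set so an actor credited twice in one film is counted once.
    for name in {member["name"] for member in cast}:
        films_per_actor[name] += 1
        genre_counts[name].update(film_genre_list)

print(films_per_actor["Tom Hanks"], genre_counts["Tom Hanks"]["Comedy"])
```

Counts like these can be used directly as numeric features, for example the number of comedies of a film's lead actor.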

In [32]:
index_actor[0], actor_index["Tom Hanks"]

('Tom Hanks', 0)

In [31]:
genres = dict()
genres_inv = dict()
for g in genres_unique.itertuples():
    # setdefault assigns a value to the key only if the key is not already in use
    res = genres.setdefault(g.genre, len(genres))
    if res == len(genres)-1:
        genres_inv[len(genres)-1] = g.genre

In [32]:
genres["Fantasy"]

4

In [33]:
actors_tmdb.sort_values(by='id', inplace=True)

In [34]:
movies_tmdb.sort_values(by='id', inplace=True)

In [35]:
actors_tmdb.head()

Unnamed: 0,cast_id,character,credit_id,gender,id,name,order
267934,11,Himself,56757ade92514179d4002e15,2,1,George Lucas,7
182264,78,Baron Papanoida,584a4128c3a368141f01b620,2,1,George Lucas,24
182650,13,Himself,5be613610e0a263bf80039c4,2,1,George Lucas,11
8009,5,Disappointed Man,52fe4235c3a36847f800c2f7,2,1,George Lucas,21
339106,1,Himself,52fe48799251416c9108db99,2,1,George Lucas,0


In [36]:
movies_tmdb.loc[movies_tmdb.original_title == 'Toy Story']

Unnamed: 0,adult,genre_ids,id,original_language,original_title,overview,popularity,release_date,title,video,vote_average,vote_count
0,False,"[16, 35, 10751]",862,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",22.773,1995-10-30,Toy Story,False,7.9,9550


In [37]:
movies_movielens.loc[movies_movielens.title == 'Toy Story']

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1995


In [38]:
movies_tmdb.loc[movies_tmdb.title == 'Toy Story']

Unnamed: 0,adult,genre_ids,id,original_language,original_title,overview,popularity,release_date,title,video,vote_average,vote_count
0,False,"[16, 35, 10751]",862,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",22.773,1995-10-30,Toy Story,False,7.9,9550


In [39]:
actors_pkl[0]

[{'cast_id': 14,
  'character': 'Woody (voice)',
  'credit_id': '52fe4284c3a36847f8024f95',
  'gender': 2,
  'id': 31,
  'name': 'Tom Hanks',
  'order': 0,
  'profile_path': '/xxPMucou2wRDxLrud8i2D4dsywh.jpg'},
 {'cast_id': 15,
  'character': 'Buzz Lightyear (voice)',
  'credit_id': '52fe4284c3a36847f8024f99',
  'gender': 2,
  'id': 12898,
  'name': 'Tim Allen',
  'order': 1,
  'profile_path': '/dDbtWMGdhatUjCIYolc312R2ygu.jpg'},
 {'cast_id': 16,
  'character': 'Mr. Potato Head (voice)',
  'credit_id': '52fe4284c3a36847f8024f9d',
  'gender': 2,
  'id': 7167,
  'name': 'Don Rickles',
  'order': 2,
  'profile_path': '/h5BcaDMPRVLHLDzbQavec4xfSdt.jpg'},
 {'cast_id': 17,
  'character': 'Slinky Dog (voice)',
  'credit_id': '52fe4284c3a36847f8024fa1',
  'gender': 2,
  'id': 12899,
  'name': 'Jim Varney',
  'order': 3,
  'profile_path': '/eIo2jVVXYgjDtaHoF19Ll9vtW7h.jpg'},
 {'cast_id': 18,
  'character': 'Rex (voice)',
  'credit_id': '52fe4284c3a36847f8024fa5',
  'gender': 2,
  'id': 12900,
 

In [40]:
actors_tmdb.loc[actors_tmdb.name == 'Tim Allen']

Unnamed: 0,cast_id,character,credit_id,gender,id,name,order
356723,7,Buzz Lightyear,52fe4dbec3a368484e1fac15,2,12898,Tim Allen,1
266660,9,Buzz Lightyear (voice),52fe433f9251416c75009169,2,12898,Tim Allen,1
5988,10,Santa Claus / Scott Calvin,52fe44379251416c7502ce9b,2,12898,Tim Allen,0
152772,1,Luther Krank,52fe458a9251416c7505a0a7,2,12898,Tim Allen,0
225420,2,Chet Frank,52fe44dc9251416c750437b5,2,12898,Tim Allen,1
195611,22,Dave Douglas,52fe431a9251416c75003823,2,12898,Tim Allen,0
278490,1,Tommy,52fe45a69251416c910399e7,2,12898,Tim Allen,0
28496,11,Michael Cromwell,52fe44f8c3a36847f80b4f77,2,12898,Tim Allen,0
389622,1,Buzz Lightyear (voice),52fe48509251416c91087f5d,2,12898,Tim Allen,1
197625,68,Buzz Lightyear Car (voice),550da0af9251414695005f49,2,12898,Tim Allen,23


In [41]:
movies_movielens.loc[movies_movielens.title == 'Pulp Fiction']

Unnamed: 0,movieId,title,genres,year
293,296,Pulp Fiction,Comedy|Crime|Drama|Thriller,1994


In [51]:
movies_tmdb.loc[movies_tmdb.original_title == 'Pulp Fiction']

Unnamed: 0,adult,genre_ids,id,original_language,original_title,overview,popularity,release_date,title,video,vote_average,vote_count
293,False,"[53, 80]",680,en,Pulp Fiction,"A burger-loving hit man, his philosophical par...",29.059,1994-09-10,Pulp Fiction,False,8.4,14296


In [49]:
genres_vals = movies_tmdb.genre_ids.values
genres_vals = [el for sub in genres_vals for el in sub]
un_genre_vals = np.unique(genres_vals)

In [50]:
un_genre_vals

array([   12,    14,    16,    18,    27,    28,    35,    36,    37,
          53,    80,    99,   878,  9648, 10402, 10749, 10751, 10752,
       10770])

## Problem encountered while trying to build the dictionaries Films: {actors} and Actors: {categories}:
## I do not understand which genre each genre_ids value corresponds to, nor how to find the films each actor has appeared in
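
A possible way around both points, sketched on toy data. The 19 identifiers printed above match TMDb's public list of movie genres; the id-to-name table below reproduces that list as best we know it. It is not part of the provided files, so check it against TMDb's genre list before relying on it. Note that TMDb names differ from MovieLens names (for example "Family" versus "Children"). For the second point, the cast list at position `i` of `act_v2.pkl` is assumed to belong to the film at position `i` of `film_v2.pkl`, as in the loading cell. The second toy film is hypothetical.

```python
from collections import Counter, defaultdict

# Assumed TMDb movie-genre ids (not in the provided files; verify against TMDb).
TMDB_GENRES = {
    12: "Adventure", 14: "Fantasy", 16: "Animation", 18: "Drama", 27: "Horror",
    28: "Action", 35: "Comedy", 36: "History", 37: "Western", 53: "Thriller",
    80: "Crime", 99: "Documentary", 878: "Science Fiction", 9648: "Mystery",
    10402: "Music", 10749: "Romance", 10751: "Family", 10752: "War",
    10770: "TV Movie",
}

# Toy stand-ins for film_v2.pkl and act_v2.pkl; the film with id -1 is hypothetical.
movies = [{"id": 862, "genre_ids": [16, 35, 10751]}, {"id": -1, "genre_ids": [18]}]
casts = [
    [{"id": 31, "name": "Tom Hanks"}, {"id": 12898, "name": "Tim Allen"}],
    [{"id": 31, "name": "Tom Hanks"}],
]

film_actors = {}                     # film id => set of actor ids
actor_films = defaultdict(list)      # actor id => list of film ids
actor_genres = defaultdict(Counter)  # actor id => genre name => number of films

for film, cast in zip(movies, casts):
    film_actors[film["id"]] = {member["id"] for member in cast}
    names = [TMDB_GENRES.get(g, f"unknown:{g}") for g in film["genre_ids"]]
    for actor_id in film_actors[film["id"]]:
        actor_films[actor_id].append(film["id"])
        actor_genres[actor_id].update(names)

print(actor_films[31], dict(actor_genres[31]))
```

Keying the dictionaries by actor `id` instead of `name` avoids merging two different people who share a name.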