# Movie data analysis

We propose to work on data describing films. The possibilities are wide, and you are assessed on your proposals and your methodology more than on your results.

The starting data is available in CSV format at:
https://grouplens.org/datasets/movielens/

We focus in particular on the **MovieLens 20M Dataset**. Among other things, this dataset provides:
* The film's identifier in IMDb and TMDb (this will matter later)
* The film's category or categories
* The film's title
* The ratings given to the films by users

To make the project more interesting, we add data on the actors and producers associated with the films (retrieved from TMDb). This data is available at the following links:

http://webia.lip6.fr/~guigue/film_v2.pkl <br>
http://webia.lip6.fr/~guigue/act_v2.pkl <br>
http://webia.lip6.fr/~guigue/crew_v2.pkl

These files contain, respectively: a new description of the films (including the TMDb identifier, the average rating given by users, the release date, ...), a description of the actors of each film, and a description of the crew (screenwriter, producer, director) of each film.

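Each file holds a list of length 26908, with one element per film. In the films file, each element is a dictionary; in the actors and crew files, each element is a list of dictionaries (one per person). Study the keys of these dictionaries to retrieve the useful information.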
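A quick way to study the keys is to count how often each one appears across the list. The sketch below uses a two-element toy list in place of the real `film_v2.pkl` content; only the key names and values come from the sample output shown in this notebook.

```python
from collections import Counter

# Toy stand-in for the list loaded from film_v2.pkl (assumption: the real
# dictionaries carry more keys than the few shown here)
films = [
    {"id": 862, "title": "Toy Story", "vote_average": 7.9},
    {"id": 8844, "title": "Jumanji", "vote_average": 7.1, "release_date": "1995-12-15"},
]

# Count, for every key, the number of dictionaries that contain it
key_counts = Counter(k for f in films for k in f)

# Keys missing from some dictionaries show a count below len(films)
for key, count in sorted(key_counts.items()):
    print(f"{key}: present in {count}/{len(films)} films")
```

Keys with a count below the list length point to fields that will need handling as missing values.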

**WARNING** Because of the constraints of retrieving information online, the MovieLens base contains 27278 films but the files above contain only 26908. The simplest option is probably to drop the MovieLens films that are not in this second base.
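One possible way to drop the unmatched films is to filter the MovieLens `links` table on the TMDb identifiers found in the films file. This is a minimal sketch on toy data: the column names `movieId`, `imdbId`, `tmdbId` and the key `id` follow the outputs shown in this notebook, while the rows themselves are illustrative.

```python
import pandas as pd

# Toy stand-ins for links.csv and film_v2.pkl (illustrative rows only)
links = pd.DataFrame({
    "movieId": [1, 2, 3, 4],
    "imdbId": [114709, 113497, 113228, 114885],
    "tmdbId": [862.0, 8844.0, None, 31357.0],  # a missing tmdbId is possible
})
films = [{"id": 862, "title": "Toy Story"}, {"id": 8844, "title": "Jumanji"}]

# TMDb identifiers available in the enrichment files
tmdb_ids = [f["id"] for f in films]

# Keep only the MovieLens rows whose tmdbId appears in the second base;
# rows with a missing tmdbId are dropped as well
links_kept = links[links["tmdbId"].isin(tmdb_ids)].reset_index(drop=True)
print(links_kept)
```

The same `movieId` values can then be used to filter the other MovieLens tables.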

## General guidelines for the data analysis

You must propose several analyses of the data, which must at a minimum use the following techniques:

1. Format the data to identify and index the actors and the categories
1. Address at least one supervised regression problem (for example, predicting the average rating given to a film by users)
1. Address at least one supervised classification problem (for example, predicting the category of a film)
1. Use the categorical data (categories, actors, ...) in discrete form AND in continuous form (*dummy coding*) in different approaches
1. Propose at least one unsupervised clustering approach (to group the actors, for example)
1. Run an experimental campaign that compares performance on a problem as a function of the values of a parameter (and thus, in the end, finds the best value of that parameter)
1. Propose a few illustrations
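The dummy coding mentioned in the list above can be sketched as follows for film genres: each film becomes a row of 0/1 indicators, one column per genre. The genre lists are toy values shaped like the `genre_ids` field of the films file; the variable names are illustrative and not part of the project code.

```python
import numpy as np

# Toy genre lists, shaped like the 'genre_ids' field (one list per film)
genre_ids = [[16, 35, 10751], [12, 14, 10751], [35, 10749]]

# Discrete view: give every distinct genre id a column index
genres = sorted({g for ids in genre_ids for g in ids})
col = {g: j for j, g in enumerate(genres)}

# Dummy coding: one row per film, one 0/1 column per genre
X = np.zeros((len(genre_ids), len(genres)), dtype=int)
for i, ids in enumerate(genre_ids):
    for g in ids:
        X[i, col[g]] = 1

print(genres)
print(X)
```

A matrix of this kind can be fed to models that expect numeric vectors, whereas the raw identifiers suit approaches that handle discrete attributes directly.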

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import pickle as pkl
import sys
sys.path.append('../')
import iads as iads
from iads import LabeledSet as ls
from iads import Classifiers as cl

## Loading the data (MovieLens base + enrichments)

In [2]:
# Load the MovieLens data
fname_links = "data/ml-20m/links.csv"  # replace with the path to your copy of the file
links = pd.read_csv(fname_links, encoding='utf8')

In [3]:
# Load the complementary data (context managers close the files afterwards)
with open("data/act_v2.pkl", "rb") as f:
    acteurs = pkl.load(f)
with open("data/crew_v2.pkl", "rb") as f:
    equipes = pkl.load(f)
with open("data/film_v2.pkl", "rb") as f:
    films = pkl.load(f)

In [None]:
acteursPd = pd.DataFrame.from_dict(acteurs)
equipesPd = pd.DataFrame.from_dict(equipes)
filmsPd = pd.DataFrame.from_dict(films)

In [45]:
links

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0
5,6,113277,949.0
6,7,114319,11860.0
7,8,112302,45325.0
8,9,114576,9091.0
9,10,113189,710.0


In [39]:
filmsPd

Unnamed: 0,adult,backdrop_path,genre_ids,id,original_language,original_title,overview,popularity,poster_path,release_date,title,video,vote_average,vote_count
0,False,/dji4Fm0gCDVb9DQQMRvAI8YNnTz.jpg,"[16, 35, 10751]",862,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",22.773,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,1995-10-30,Toy Story,False,7.9,9550
1,False,/7k4zEgUZbzMHawDaMc9yIkmY1qR.jpg,"[12, 14, 10751]",8844,en,Jumanji,When siblings Judy and Peter discover an encha...,2.947,/vgpXmVaVyUL7GGiDeiK1mKEKzcX.jpg,1995-12-15,Jumanji,False,7.1,5594
2,False,/1ENbkuIYK2taNGGKNMs2hw6SaJb.jpg,"[35, 10749]",15602,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,6.076,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,1995-12-22,Grumpier Old Men,False,6.5,140
3,False,/u0hQzp4xfag3ZhsKKBBdgyIVvCl.jpg,"[35, 18, 10749]",31357,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",2.917,/4wjGMwPsdlvi025ZqR4rXnFDvBz.jpg,1995-12-22,Waiting to Exhale,False,6.1,55
4,False,/cZs50rEk4T13qWedon0uCnbYQzW.jpg,[35],11862,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,6.817,/e64sOI48hQXyru7naBFyssKFxVd.jpg,1995-02-10,Father of the Bride Part II,False,6.1,288
5,False,/jMzVSwQp1lLVq9fnQQ4yOjr1YZ2.jpg,"[28, 80, 18, 53]",949,en,Heat,"Obsessive master thief, Neil McCauley leads a ...",13.666,/zMyfPUelumio3tiDKPffaUpsQTD.jpg,1995-12-15,Heat,False,7.8,3002
6,False,/hSy5yZG18ogNQn1tHSlxSqV24cf.jpg,"[35, 10749]",11860,en,Sabrina,An ugly duckling having undergone a remarkable...,6.177,/jQh15y5YB7bWz1NtffNZmRw0s9D.jpg,1995-12-15,Sabrina,False,6.1,260
7,False,/43r8WYBhOrj0SLSTuShynuWj6Z.jpg,"[28, 12, 18, 10751]",45325,en,Tom and Huck,"A mischievous young boy, Tom Sawyer, witnesses...",3.567,/sGO5Qa55p7wTu7FJcX4H4xIVKvS.jpg,1995-12-22,Tom and Huck,False,5.3,73
8,False,/y6A4PUAD61r15CgtuuQhWxLh6Vx.jpg,"[28, 12, 18, 53]",9091,en,Sudden Death,When a man's daughter is suddenly taken during...,5.890,/ridz4IucWay8dBP5t68rGYykCvi.jpg,1995-10-27,Sudden Death,False,5.7,279
9,False,/dA9I0Vd9OZzRQ2GyGcsFXdKGMz3.jpg,"[12, 28, 53]",710,en,GoldenEye,James Bond must unmask the mysterious head of ...,16.629,/5c0ovjT41KnYIHYuF4AWsTe3sKh.jpg,1995-11-16,GoldenEye,False,6.8,1853


In [38]:
equipesPd

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,962,963,964,965,966,967,968,969,970,971
0,"{'credit_id': '52fe4284c3a36847f8024f49', 'dep...","{'credit_id': '52fe4284c3a36847f8024f4f', 'dep...","{'credit_id': '52fe4284c3a36847f8024f55', 'dep...","{'credit_id': '52fe4284c3a36847f8024f5b', 'dep...","{'credit_id': '52fe4284c3a36847f8024f61', 'dep...","{'credit_id': '52fe4284c3a36847f8024f67', 'dep...","{'credit_id': '52fe4284c3a36847f8024f6d', 'dep...","{'credit_id': '52fe4284c3a36847f8024f73', 'dep...","{'credit_id': '52fe4284c3a36847f8024f79', 'dep...","{'credit_id': '52fe4284c3a36847f8024f8b', 'dep...",...,,,,,,,,,,
1,"{'credit_id': '52fe44bfc3a36847f80a7c7d', 'dep...","{'credit_id': '52fe44bfc3a36847f80a7c83', 'dep...","{'credit_id': '52fe44bfc3a36847f80a7c89', 'dep...","{'credit_id': '52fe44bfc3a36847f80a7c8f', 'dep...","{'credit_id': '52fe44bfc3a36847f80a7c95', 'dep...","{'credit_id': '52fe44bfc3a36847f80a7cb3', 'dep...","{'credit_id': '52fe44bfc3a36847f80a7cb9', 'dep...","{'credit_id': '52fe44bfc3a36847f80a7cbf', 'dep...","{'credit_id': '52fe44bfc3a36847f80a7cc5', 'dep...","{'credit_id': '52fe44bfc3a36847f80a7ccb', 'dep...",...,,,,,,,,,,
2,"{'credit_id': '52fe466a9251416c75077a89', 'dep...","{'credit_id': '52fe466b9251416c75077aa3', 'dep...","{'credit_id': '52fe466b9251416c75077aa9', 'dep...","{'credit_id': '5af991bac3a368103d00dd69', 'dep...","{'credit_id': '5af9925892514139c300ae22', 'dep...","{'credit_id': '5af9926dc3a368105d00d633', 'dep...","{'credit_id': '5af9930692514139b90089ba', 'dep...","{'credit_id': '5af9933d92514139b0009a00', 'dep...","{'credit_id': '5af9934d0e0a263eee00a8be', 'dep...","{'credit_id': '5af9935c92514139d600c5aa', 'dep...",...,,,,,,,,,,
3,"{'credit_id': '52fe44779251416c91011acb', 'dep...","{'credit_id': '52fe44779251416c91011ad5', 'dep...","{'credit_id': '52fe44779251416c91011adb', 'dep...","{'credit_id': '52fe44779251416c91011ae1', 'dep...","{'credit_id': '52fe44779251416c91011ae7', 'dep...","{'credit_id': '52fe44779251416c91011aed', 'dep...","{'credit_id': '52fe44779251416c91011af3', 'dep...","{'credit_id': '52fe44779251416c91011af9', 'dep...","{'credit_id': '52fe44779251416c91011aff', 'dep...","{'credit_id': '52fe44779251416c91011b05', 'dep...",...,,,,,,,,,,
4,"{'credit_id': '52fe44959251416c75039ecb', 'dep...","{'credit_id': '52fe44959251416c75039ed1', 'dep...","{'credit_id': '52fe44959251416c75039ed7', 'dep...","{'credit_id': '52fe44959251416c75039edd', 'dep...","{'credit_id': '52fe44959251416c75039ee3', 'dep...","{'credit_id': '52fe44959251416c75039ee9', 'dep...","{'credit_id': '52fe44959251416c75039eef', 'dep...",,,,...,,,,,,,,,,
5,"{'credit_id': '52fe4292c3a36847f802916d', 'dep...","{'credit_id': '52fe4292c3a36847f8029173', 'dep...","{'credit_id': '52fe4292c3a36847f8029179', 'dep...","{'credit_id': '52fe4292c3a36847f802917f', 'dep...","{'credit_id': '52fe4292c3a36847f8029185', 'dep...","{'credit_id': '52fe4292c3a36847f802918b', 'dep...","{'credit_id': '52fe4292c3a36847f8029191', 'dep...","{'credit_id': '52fe4292c3a36847f8029197', 'dep...","{'credit_id': '52fe4292c3a36847f802919d', 'dep...","{'credit_id': '52fe4292c3a36847f80291a3', 'dep...",...,,,,,,,,,,
6,"{'credit_id': '52fe44959251416c75039da9', 'dep...","{'credit_id': '52fe44959251416c75039daf', 'dep...","{'credit_id': '52fe44959251416c75039db5', 'dep...","{'credit_id': '52fe44959251416c75039dbb', 'dep...","{'credit_id': '52fe44959251416c75039dc7', 'dep...","{'credit_id': '55a3b9c2c3a3681ce30058b3', 'dep...","{'credit_id': '55a3ba349251412974005664', 'dep...","{'credit_id': '569cf5c89251415e7000342f', 'dep...","{'credit_id': '569cfb339251415e5e0033b3', 'dep...","{'credit_id': '569cf55ec3a36858e500357c', 'dep...",...,,,,,,,,,,
7,"{'credit_id': '52fe46bdc3a36847f810f76d', 'dep...","{'credit_id': '52fe46bdc3a36847f810f78b', 'dep...","{'credit_id': '52fe46bdc3a36847f810f791', 'dep...","{'credit_id': '52fe46bdc3a36847f810f797', 'dep...",,,,,,,...,,,,,,,,,,
8,"{'credit_id': '52fe44dbc3a36847f80ae0f1', 'dep...","{'credit_id': '52fe44dbc3a36847f80ae0f7', 'dep...","{'credit_id': '52fe44dbc3a36847f80ae103', 'dep...","{'credit_id': '52fe44dbc3a36847f80ae109', 'dep...","{'credit_id': '52fe44dbc3a36847f80ae115', 'dep...","{'credit_id': '52fe44dbc3a36847f80ae121', 'dep...","{'credit_id': '52fe44dbc3a36847f80ae127', 'dep...","{'credit_id': '5a5d02530e0a2674360011e7', 'dep...","{'credit_id': '5a5d0267925141119a001198', 'dep...","{'credit_id': '5a5d027d0e0a26743f001303', 'dep...",...,,,,,,,,,,
9,"{'credit_id': '52fe426ec3a36847f801e14b', 'dep...","{'credit_id': '52fe426ec3a36847f801e157', 'dep...","{'credit_id': '52fe426ec3a36847f801e15d', 'dep...","{'credit_id': '52fe426ec3a36847f801e163', 'dep...","{'credit_id': '52fe426ec3a36847f801e169', 'dep...","{'credit_id': '52fe426ec3a36847f801e16f', 'dep...","{'credit_id': '52fe426ec3a36847f801e175', 'dep...","{'credit_id': '52fe426ec3a36847f801e17b', 'dep...","{'credit_id': '52fe426ec3a36847f801e181', 'dep...","{'credit_id': '52fe426ec3a36847f801e187', 'dep...",...,,,,,,,,,,


In [46]:
equipesPd[0][0]

{'credit_id': '52fe4284c3a36847f8024f49',
 'department': 'Directing',
 'gender': 2,
 'id': 7879,
 'job': 'Director',
 'name': 'John Lasseter',
 'profile_path': '/7EdqiNbr4FRjIhKHyPPdFfEEEFG.jpg'}
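Given this structure, a per-film feature such as the director can be extracted by filtering each crew list on the `job` key. The sketch below works on a toy crew list; "Person A" and "Person B" are hypothetical entries, and the helper `directors` is an illustrative name.

```python
# Toy crew lists shaped like crew_v2.pkl: one list of dictionaries per film
equipes = [
    [
        {"job": "Director", "department": "Directing", "name": "John Lasseter"},
        {"job": "Producer", "department": "Production", "name": "Person A"},
    ],
    [
        {"job": "Producer", "department": "Production", "name": "Person B"},
    ],
]

def directors(crew):
    """Return the names of all crew members whose job is 'Director'."""
    return [c["name"] for c in crew if c.get("job") == "Director"]

# One list of director names per film (empty when none is recorded)
directors_per_film = [directors(crew) for crew in equipes]
print(directors_per_film)
```

Returning a list rather than a single name keeps films with several directors, or none, from raising errors.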

In [37]:
acteursPd

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,303,304,305,306,307,308,309,310,311,312
0,"{'cast_id': 14, 'character': 'Woody (voice)', ...","{'cast_id': 15, 'character': 'Buzz Lightyear (...","{'cast_id': 16, 'character': 'Mr. Potato Head ...","{'cast_id': 17, 'character': 'Slinky Dog (voic...","{'cast_id': 18, 'character': 'Rex (voice)', 'c...","{'cast_id': 19, 'character': 'Hamm (voice)', '...","{'cast_id': 20, 'character': 'Bo Peep (voice)'...","{'cast_id': 26, 'character': 'Andy (voice)', '...","{'cast_id': 22, 'character': 'Sid (voice)', 'c...","{'cast_id': 23, 'character': 'Mrs. Davis (voic...",...,,,,,,,,,,
1,"{'cast_id': 1, 'character': 'Alan Parrish', 'c...","{'cast_id': 8, 'character': 'Samuel Alan Parri...","{'cast_id': 2, 'character': 'Judy Shepherd', '...","{'cast_id': 24, 'character': 'Peter Shepherd',...","{'cast_id': 10, 'character': 'Sarah Whittle', ...","{'cast_id': 25, 'character': 'Nora Shepherd', ...","{'cast_id': 26, 'character': 'Carl Bentley', '...","{'cast_id': 11, 'character': 'Carol Anne Parri...","{'cast_id': 14, 'character': 'Young Alan', 'cr...","{'cast_id': 13, 'character': 'Young Sarah', 'c...",...,,,,,,,,,,
2,"{'cast_id': 2, 'character': 'Max Goldman', 'cr...","{'cast_id': 3, 'character': 'John Gustafson', ...","{'cast_id': 4, 'character': 'Ariel Gustafson',...","{'cast_id': 5, 'character': 'Maria Sophia Cole...","{'cast_id': 6, 'character': 'Melanie Gustafson...","{'cast_id': 9, 'character': 'Grandpa Gustafson...","{'cast_id': 10, 'character': 'Jacob Goldman', ...",,,,...,,,,,,,,,,
3,"{'cast_id': 1, 'character': 'Savannah 'Vannah'...","{'cast_id': 2, 'character': 'Bernadine 'Bernie...","{'cast_id': 3, 'character': 'Gloria 'Glo' Matt...","{'cast_id': 4, 'character': 'Robin Stokes', 'c...","{'cast_id': 5, 'character': 'Marvin King', 'cr...","{'cast_id': 6, 'character': 'Kenneth Dawkins',...","{'cast_id': 8, 'character': 'John Harris, Sr.'...","{'cast_id': 10, 'character': 'Troy', 'credit_i...","{'cast_id': 20, 'character': 'Joseph', 'credit...","{'cast_id': 21, 'character': 'James Wheeler', ...",...,,,,,,,,,,
4,"{'cast_id': 1, 'character': 'George Banks', 'c...","{'cast_id': 2, 'character': 'Nina Banks', 'cre...","{'cast_id': 3, 'character': 'Franck Eggelhoffe...","{'cast_id': 4, 'character': 'Annie Banks-MacKe...","{'cast_id': 13, 'character': 'Bryan MacKenzie'...","{'cast_id': 14, 'character': 'Matty Banks', 'c...","{'cast_id': 15, 'character': 'Howard Weinstein...","{'cast_id': 16, 'character': 'John MacKenzie',...","{'cast_id': 17, 'character': 'Joanna MacKenzie...","{'cast_id': 18, 'character': 'Dr. Megan Eisenb...",...,,,,,,,,,,
5,"{'cast_id': 25, 'character': 'Lt. Vincent Hann...","{'cast_id': 26, 'character': 'Neil McCauley', ...","{'cast_id': 27, 'character': 'Chris Shiherlis'...","{'cast_id': 28, 'character': 'Nate', 'credit_i...","{'cast_id': 29, 'character': 'Michael Cheritto...","{'cast_id': 30, 'character': 'Justine Hanna', ...","{'cast_id': 31, 'character': 'Eady', 'credit_i...","{'cast_id': 32, 'character': 'Charlene Shiherl...","{'cast_id': 33, 'character': 'Sergeant Drucker...","{'cast_id': 38, 'character': 'Lauren Gustafson...",...,,,,,,,,,,
6,"{'cast_id': 1, 'character': 'Linus Larrabee', ...","{'cast_id': 2, 'character': 'Sabrina Fairchild...","{'cast_id': 3, 'character': 'David Larrabee', ...","{'cast_id': 4, 'character': 'Mrs. Ingrid Tyson...","{'cast_id': 11, 'character': 'Maude Larrabee',...","{'cast_id': 12, 'character': 'Fairchild', 'cre...","{'cast_id': 13, 'character': 'Patrick Tyson', ...","{'cast_id': 14, 'character': 'Elizabeth Tyson'...","{'cast_id': 15, 'character': 'Mack', 'credit_i...","{'cast_id': 16, 'character': 'Irene', 'credit_...",...,,,,,,,,,,
7,"{'cast_id': 2, 'character': 'Tom Sawyer', 'cre...","{'cast_id': 3, 'character': 'Huck Finn', 'cred...","{'cast_id': 4, 'character': 'Becky Thatcher', ...","{'cast_id': 5, 'character': 'Muff Potter', 'cr...","{'cast_id': 6, 'character': 'Aunt Polly', 'cre...","{'cast_id': 7, 'character': 'Injun Joe', 'cred...","{'cast_id': 11, 'character': 'Townsperson', 'c...",,,,...,,,,,,,,,,
8,"{'cast_id': 1, 'character': 'Darren Francis Th...","{'cast_id': 2, 'character': 'Joshua Foss', 'cr...","{'cast_id': 4, 'character': 'Matthew Hallmark'...","{'cast_id': 15, 'character': 'Vizepräsident Da...","{'cast_id': 16, 'character': 'Tyler', 'credit_...","{'cast_id': 17, 'character': 'Emily McCord', '...",,,,,...,,,,,,,,,,
9,"{'cast_id': 1, 'character': 'James Bond', 'cre...","{'cast_id': 2, 'character': 'Alec Trevelyan', ...","{'cast_id': 3, 'character': 'Natalya Fyodorovn...","{'cast_id': 4, 'character': 'Xenia Onatopp', '...","{'cast_id': 5, 'character': 'Jack Wade', 'cred...","{'cast_id': 6, 'character': 'M', 'credit_id': ...","{'cast_id': 7, 'character': 'General Arkady Gr...","{'cast_id': 8, 'character': 'Valentin Dmitrovi...","{'cast_id': 9, 'character': 'Boris Grishenko',...","{'cast_id': 10, 'character': 'Defense Minister...",...,,,,,,,,,,


In [44]:
acteursPd.head(1)[0][0]

{'cast_id': 14,
 'character': 'Woody (voice)',
 'credit_id': '52fe4284c3a36847f8024f95',
 'gender': 2,
 'id': 31,
 'name': 'Tom Hanks',
 'order': 0,
 'profile_path': '/xxPMucou2wRDxLrud8i2D4dsywh.jpg'}

## Feature engineering
It is up to you to create the descriptive features that will improve performance on the tasks you choose to tackle in the project.

In [4]:
# Build a dictionary of all actors (actor => index)
# + an inverse dictionary (index => actor)
actors = dict()
actors_inv = dict()
for lista in acteurs:
    for a in lista:
        # setdefault assigns the value only if the key is not already present
        res = actors.setdefault(a['name'], len(actors))
        # a new entry receives the index len(actors)-1: record the inverse mapping
        if res == len(actors)-1:
            actors_inv[len(actors)-1] = a['name']

# Example of an additional transformation
# In how many films of the base does Tom Hanks appear? (Answer: 57)
# In how many comedies...

# => It is easy to create new features that
# will carry useful information for some tasks
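Counts of this kind can be computed with `collections.Counter`. The following is a minimal sketch on toy data, assuming that the cast lists and the film dictionaries are aligned by position and that genre id 35 denotes comedies in the TMDb numbering; "Actor A" is a hypothetical entry.

```python
from collections import Counter

# Toy data: per-film cast lists and film metadata, aligned by position (assumption)
acteurs = [
    [{"name": "Tom Hanks"}, {"name": "Actor A"}],
    [{"name": "Actor A"}],
    [{"name": "Tom Hanks"}, {"name": "Tom Hanks"}],  # same actor, two roles
]
films = [{"genre_ids": [16, 35]}, {"genre_ids": [12]}, {"genre_ids": [18]}]

COMEDY = 35  # assumption: TMDb genre id for comedies

n_films = Counter()     # actor name => number of films
n_comedies = Counter()  # actor name => number of comedies
for cast, film in zip(acteurs, films):
    # A set ensures an actor with several roles is counted once per film
    for name in {a["name"] for a in cast}:
        n_films[name] += 1
        if COMEDY in film["genre_ids"]:
            n_comedies[name] += 1

print(n_films["Tom Hanks"], n_comedies["Tom Hanks"])
```

Both counters can then be attached to each film, for example as the mean number of films per cast member.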