<center><h3 style="color:red;">Data preparation</h3></center>

This notebook prepares the data for visualization: the goal is to produce clean datasets that are ready to be explored graphically.

In [1]:
import pandas as pd
import numpy as np
from geopy.geocoders import Nominatim
import time

# Define the columns to keep for D3
COLUMNS_D3 = ["type", "title", "director", "country", "date_added", "release_year", "rating", "duration_num", "duration_type", "genre", "month_added", "year_added", "audience", "continent"]

In [2]:
import pycountry_convert as pc

def get_continent_from_country(country_name):
    try:
        country_code = pc.country_name_to_country_alpha2(country_name)
        continent_code = pc.country_alpha2_to_continent_code(country_code)
        continent_map = {
            "AF": "Africa",
            "AS": "Asia",
            "EU": "Europe",
            "NA": "North America",
            "SA": "South America",
            "OC": "Oceania",
            "AN": "Antarctica",
        }
        return continent_map[continent_code]
    except Exception:
        # pycountry_convert could not resolve the country name or its continent code
        return "Unknown"
    
def map_rating_to_audience(rating):
    if rating in ['TV-Y', 'TV-G', 'TV-Y7', 'ALL']:
        return 'kids'
    elif rating in ['TV-PG', 'TV-14', 'PG', '13+']:
        return 'teens'
    elif rating in ['TV-MA', 'R', 'NC-17', '18+']:
        return 'adults'
    else:  # ratings not listed above (e.g. 'PG-13', 'TV-NR') end up as 'unknown'
        return 'unknown'
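
The rating mapping above can also be written as a lookup table, which makes gaps in the mapping easier to spot. The following is a minimal sketch that assumes the same three audience groups; `AUDIENCE_BY_RATING` and `map_rating_to_audience_lookup` are illustrative names that do not appear in the notebook. Ratings absent from the table, such as `PG-13` or `TV-NR`, fall through to `'unknown'`, which matches the outputs shown further down.

```python
# Hypothetical table-driven variant of map_rating_to_audience.
# The groups mirror the if/elif chain; the names are assumptions.
AUDIENCE_BY_RATING = {
    **dict.fromkeys(["TV-Y", "TV-G", "TV-Y7", "ALL"], "kids"),
    **dict.fromkeys(["TV-PG", "TV-14", "PG", "13+"], "teens"),
    **dict.fromkeys(["TV-MA", "R", "NC-17", "18+"], "adults"),
}


def map_rating_to_audience_lookup(rating):
    """Return the audience group for a rating, or 'unknown' if it is not listed."""
    return AUDIENCE_BY_RATING.get(rating, "unknown")


# Quick check on a few ratings, including two that are not in the table.
for rating in ["TV-Y7", "PG", "TV-MA", "PG-13", "TV-NR"]:
    print(rating, "->", map_rating_to_audience_lookup(rating))
```

Extending the mapping then only requires adding a key to the dictionary, for instance if `PG-13` should count as `teens`.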

#### Netflix Titles And Movies

In [3]:
netflix_df = pd.read_csv("datasets/netflix_titles.csv")
netflix_df.head(5)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...


In [4]:
netflix_df.drop(columns="show_id", inplace=True)

In [5]:
netflix_df.isnull().sum()

type               0
title              0
director        2634
cast             825
country          831
date_added        10
release_year       0
rating             4
duration           3
listed_in          0
description        0
dtype: int64

Handling missing values and removing duplicates

In [6]:
# Handle missing values
netflix_df['director'] = netflix_df['director'].fillna('Unknown')
netflix_df['cast'] = netflix_df['cast'].fillna('Unknown')
netflix_df['country'] = netflix_df['country'].fillna(netflix_df['country'].mode()[0])
netflix_df.dropna(inplace=True)

# Remove duplicates
netflix_df.drop_duplicates(inplace=True)
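
To make the effect of these steps concrete, here is a minimal sketch on a small made-up table (the `toy` DataFrame and its values are assumptions, not data from the notebook). It fills missing directors with `'Unknown'`, fills missing countries with the most frequent value, then drops the remaining incomplete rows and exact duplicates.

```python
import pandas as pd

# Made-up rows: one missing director, one missing country,
# one missing rating and one exact duplicate.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "C", "D"],
    "director": ["X", None, "Z", "Z", "W"],
    "country": ["India", None, "India", "India", "Canada"],
    "rating": ["PG", "R", "TV-G", "TV-G", None],
})

# Placeholder for unknown directors, most frequent value for countries.
toy["director"] = toy["director"].fillna("Unknown")
most_common_country = toy["country"].mode()[0]
toy["country"] = toy["country"].fillna(most_common_country)

# Rows that still contain a missing value are dropped, then exact duplicates.
rows_before = len(toy)
toy = toy.dropna().drop_duplicates()
print(f"{rows_before} rows before, {len(toy)} rows after")
```

Note that filling `country` with the mode assigns the most common country to every title without one, which can inflate that country's share in later charts.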

In [7]:
netflix_df.isnull().sum()

type            0
title           0
director        0
cast            0
country         0
date_added      0
release_year    0
rating          0
duration        0
listed_in       0
description     0
dtype: int64

In [8]:
netflix_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8790 entries, 0 to 8806
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   type          8790 non-null   object
 1   title         8790 non-null   object
 2   director      8790 non-null   object
 3   cast          8790 non-null   object
 4   country       8790 non-null   object
 5   date_added    8790 non-null   object
 6   release_year  8790 non-null   int64 
 7   rating        8790 non-null   object
 8   duration      8790 non-null   object
 9   listed_in     8790 non-null   object
 10  description   8790 non-null   object
dtypes: int64(1), object(10)
memory usage: 824.1+ KB


Convert the date column to datetime and add month and year columns

In [9]:
netflix_df["date_added"] = pd.to_datetime(netflix_df['date_added'])

netflix_df['month_added'] = netflix_df['date_added'].dt.month_name()
netflix_df['year_added'] = netflix_df['date_added'].dt.year

netflix_df.head(5)

Unnamed: 0,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,month_added,year_added
0,Movie,Dick Johnson Is Dead,Kirsten Johnson,Unknown,United States,2021-09-25,2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm...",September,2021
1,TV Show,Blood & Water,Unknown,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,2021-09-24,2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t...",September,2021
2,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",United States,2021-09-24,2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...,September,2021
3,TV Show,Jailbirds New Orleans,Unknown,Unknown,United States,2021-09-24,2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo...",September,2021
4,TV Show,Kota Factory,Unknown,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,2021-09-24,2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...,September,2021
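
The date conversion can be sketched on a few sample strings. This is a hedged example: the format string `"%B %d, %Y"` is inferred from values such as `September 25, 2021` shown above, and the stripping and `errors="coerce"` arguments are defensive additions that the notebook itself does not use.

```python
import pandas as pd

# Sample strings; the second has stray whitespace, the third is invalid on purpose.
raw = pd.Series(["September 25, 2021", " August 4, 2017", "not a date"])

# Strip whitespace first, then parse; unparseable values become NaT.
parsed = pd.to_datetime(raw.str.strip(), format="%B %d, %Y", errors="coerce")

dates = pd.DataFrame({"date_added": parsed})
dates["month_added"] = dates["date_added"].dt.month_name()
dates["year_added"] = dates["date_added"].dt.year
print(dates)
```

With `errors="coerce"`, invalid dates surface as missing values that can be counted or dropped explicitly rather than stopping the cell with an exception.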


Cleaning the text columns

In [10]:
netflix_df['title'] = netflix_df['title'].str.strip()
netflix_df['director'] = netflix_df['director'].str.strip()
netflix_df['country'] = netflix_df['country'].str.strip()
netflix_df['listed_in'] = netflix_df['listed_in'].str.strip()
netflix_df['description'] = netflix_df['description'].str.strip()

netflix_df.head(5)

Unnamed: 0,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,month_added,year_added
0,Movie,Dick Johnson Is Dead,Kirsten Johnson,Unknown,United States,2021-09-25,2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm...",September,2021
1,TV Show,Blood & Water,Unknown,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,2021-09-24,2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t...",September,2021
2,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",United States,2021-09-24,2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...,September,2021
3,TV Show,Jailbirds New Orleans,Unknown,Unknown,United States,2021-09-24,2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo...",September,2021
4,TV Show,Kota Factory,Unknown,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,2021-09-24,2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...,September,2021


Creating the derived columns

In [11]:
netflix_df[['duration_num', 'duration_type']] = netflix_df['duration'].str.extract(r'(\d+)\s*(\w+)', expand=True)
netflix_df['duration_num'] = pd.to_numeric(netflix_df['duration_num'], errors='coerce')
netflix_df['country'] = netflix_df['country'].str.split(',').str[0].str.strip()
netflix_df['genre'] = netflix_df['listed_in'].str.split(',').str[0].str.strip()
netflix_df['audience'] = netflix_df['rating'].apply(map_rating_to_audience)
netflix_df['continent'] = netflix_df['country'].apply(get_continent_from_country)


netflix_df.head(5)

Unnamed: 0,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,month_added,year_added,duration_num,duration_type,genre,audience,continent
0,Movie,Dick Johnson Is Dead,Kirsten Johnson,Unknown,United States,2021-09-25,2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm...",September,2021,90,min,Documentaries,unknown,North America
1,TV Show,Blood & Water,Unknown,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,2021-09-24,2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t...",September,2021,2,Seasons,International TV Shows,adults,Africa
2,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",United States,2021-09-24,2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...,September,2021,1,Season,Crime TV Shows,adults,North America
3,TV Show,Jailbirds New Orleans,Unknown,Unknown,United States,2021-09-24,2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo...",September,2021,1,Season,Docuseries,adults,North America
4,TV Show,Kota Factory,Unknown,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,2021-09-24,2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...,September,2021,2,Seasons,International TV Shows,adults,Asia
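
The duration split can be illustrated in isolation. The sketch below uses the same regular expression with named groups, so the extracted columns get their names directly; the `duration_unit` column is a hypothetical extra that merges `Season` and `Seasons` into one category and is not part of the notebook.

```python
import pandas as pd

durations = pd.DataFrame({"duration": ["90 min", "2 Seasons", "1 Season"]})

# Named groups become the column names of the extracted frame.
parts = durations["duration"].str.extract(
    r"(?P<duration_num>\d+)\s*(?P<duration_type>\w+)"
)
parts["duration_num"] = pd.to_numeric(parts["duration_num"], errors="coerce")

# Hypothetical normalisation: treat "Season" and "Seasons" as the same unit.
parts["duration_unit"] = parts["duration_type"].replace({"Seasons": "Season"})

durations = durations.join(parts)
print(durations)
```

Without some normalisation of this kind, a chart grouped by `duration_type` would show `Season` and `Seasons` as two separate categories.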


Keep only the columns needed for D3

In [12]:
# .copy() avoids a SettingWithCopyWarning when the 'platform' column is added later
netflix_df = netflix_df[COLUMNS_D3].copy()
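
Selecting a column subset can be made a little more defensive. The sketch below, on a made-up DataFrame, checks that every requested column exists and takes an explicit copy so that columns added afterwards do not touch the original frame; `full`, `wanted` and `subset` are illustrative names.

```python
import pandas as pd

full = pd.DataFrame({"title": ["A", "B"], "cast": ["x", "y"], "rating": ["PG", "R"]})
wanted = ["title", "rating"]

# Fail early with a readable message if a requested column is absent.
missing = [col for col in wanted if col not in full.columns]
if missing:
    raise KeyError(f"Missing columns: {missing}")

# An explicit copy makes the subset independent of the original frame.
subset = full[wanted].copy()
subset["platform"] = "Example"
print(subset.columns.tolist())
```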

#### Hulu Titles And Movies

In [13]:
hulu_df = pd.read_csv("datasets/hulu_titles.csv")
hulu_df.head(5)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Ricky Velez: Here's Everything,,,,"October 24, 2021",2021,TV-MA,,"Comedy, Stand Up",​Comedian Ricky Velez bares it all with his ho...
1,s2,Movie,Silent Night,,,,"October 23, 2021",2020,,94 min,"Crime, Drama, Thriller","Mark, a low end South London hitman recently r..."
2,s3,Movie,The Marksman,,,,"October 23, 2021",2021,PG-13,108 min,"Action, Thriller",A hardened Arizona rancher tries to protect an...
3,s4,Movie,Gaia,,,,"October 22, 2021",2021,R,97 min,Horror,A forest ranger and two survivalists with a cu...
4,s5,Movie,Settlers,,,,"October 22, 2021",2021,,104 min,"Science Fiction, Thriller",Mankind's earliest settlers on the Martian fro...


In [14]:
hulu_df.drop(columns="show_id", inplace=True)

In [15]:
hulu_df.isnull().sum()

type               0
title              0
director        3070
cast            3073
country         1453
date_added        28
release_year       0
rating           520
duration         479
listed_in          0
description        4
dtype: int64

Handling missing values and removing duplicates

In [16]:
# Handle missing values
hulu_df['director'] = hulu_df['director'].fillna('Unknown')
hulu_df['cast'] = hulu_df['cast'].fillna('Unknown')
hulu_df['country'] = hulu_df['country'].fillna(hulu_df['country'].mode()[0])
hulu_df.dropna(inplace=True)

# Remove duplicates
hulu_df.drop_duplicates(inplace=True)

In [17]:
hulu_df.isnull().sum()

type            0
title           0
director        0
cast            0
country         0
date_added      0
release_year    0
rating          0
duration        0
listed_in       0
description     0
dtype: int64

In [18]:
hulu_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2152 entries, 2 to 3044
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   type          2152 non-null   object
 1   title         2152 non-null   object
 2   director      2152 non-null   object
 3   cast          2152 non-null   object
 4   country       2152 non-null   object
 5   date_added    2152 non-null   object
 6   release_year  2152 non-null   int64 
 7   rating        2152 non-null   object
 8   duration      2152 non-null   object
 9   listed_in     2152 non-null   object
 10  description   2152 non-null   object
dtypes: int64(1), object(10)
memory usage: 201.8+ KB


Convert the date column to datetime and add month and year columns

In [19]:
hulu_df["date_added"] = pd.to_datetime(hulu_df['date_added'])

hulu_df['month_added'] = hulu_df['date_added'].dt.month_name()
hulu_df['year_added'] = hulu_df['date_added'].dt.year

hulu_df.head(5)

Unnamed: 0,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,month_added,year_added
2,Movie,The Marksman,Unknown,Unknown,United States,2021-10-23,2021,PG-13,108 min,"Action, Thriller",A hardened Arizona rancher tries to protect an...,October,2021
3,Movie,Gaia,Unknown,Unknown,United States,2021-10-22,2021,R,97 min,Horror,A forest ranger and two survivalists with a cu...,October,2021
8,TV Show,Queens,Unknown,Unknown,United States,2021-10-20,2021,TV-14,1 Season,"Drama, Music",Four women in their 40s reunite for a chance t...,October,2021
9,TV Show,The Bachelorette,Unknown,Unknown,United States,2021-10-20,2003,TV-14,3 Seasons,"Reality, Romance",ABC's romance reality show lets one lucky lady...,October,2021
11,Movie,Dream Horse,Unknown,Unknown,United States,2021-10-18,2020,PG,113 min,"Comedy, Drama",The film tells the inspiring true story of a s...,October,2021


Cleaning the text columns

In [20]:
hulu_df['title'] = hulu_df['title'].str.strip()
hulu_df['director'] = hulu_df['director'].str.strip()
hulu_df['country'] = hulu_df['country'].str.strip()
hulu_df['listed_in'] = hulu_df['listed_in'].str.strip()
hulu_df['description'] = hulu_df['description'].str.strip()

hulu_df.head(5)

Unnamed: 0,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,month_added,year_added
2,Movie,The Marksman,Unknown,Unknown,United States,2021-10-23,2021,PG-13,108 min,"Action, Thriller",A hardened Arizona rancher tries to protect an...,October,2021
3,Movie,Gaia,Unknown,Unknown,United States,2021-10-22,2021,R,97 min,Horror,A forest ranger and two survivalists with a cu...,October,2021
8,TV Show,Queens,Unknown,Unknown,United States,2021-10-20,2021,TV-14,1 Season,"Drama, Music",Four women in their 40s reunite for a chance t...,October,2021
9,TV Show,The Bachelorette,Unknown,Unknown,United States,2021-10-20,2003,TV-14,3 Seasons,"Reality, Romance",ABC's romance reality show lets one lucky lady...,October,2021
11,Movie,Dream Horse,Unknown,Unknown,United States,2021-10-18,2020,PG,113 min,"Comedy, Drama",The film tells the inspiring true story of a s...,October,2021


Creating the derived columns

In [21]:
hulu_df[['duration_num', 'duration_type']] = hulu_df['duration'].str.extract(r'(\d+)\s*(\w+)', expand=True)
hulu_df['duration_num'] = pd.to_numeric(hulu_df['duration_num'], errors='coerce')
hulu_df['country'] = hulu_df['country'].str.split(',').str[0].str.strip()
hulu_df['genre'] = hulu_df['listed_in'].str.split(',').str[0].str.strip()
hulu_df['audience'] = hulu_df['rating'].apply(map_rating_to_audience)
hulu_df['continent'] = hulu_df['country'].apply(get_continent_from_country)

hulu_df.head(5)

Unnamed: 0,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,month_added,year_added,duration_num,duration_type,genre,audience,continent
2,Movie,The Marksman,Unknown,Unknown,United States,2021-10-23,2021,PG-13,108 min,"Action, Thriller",A hardened Arizona rancher tries to protect an...,October,2021,108,min,Action,unknown,North America
3,Movie,Gaia,Unknown,Unknown,United States,2021-10-22,2021,R,97 min,Horror,A forest ranger and two survivalists with a cu...,October,2021,97,min,Horror,adults,North America
8,TV Show,Queens,Unknown,Unknown,United States,2021-10-20,2021,TV-14,1 Season,"Drama, Music",Four women in their 40s reunite for a chance t...,October,2021,1,Season,Drama,teens,North America
9,TV Show,The Bachelorette,Unknown,Unknown,United States,2021-10-20,2003,TV-14,3 Seasons,"Reality, Romance",ABC's romance reality show lets one lucky lady...,October,2021,3,Seasons,Reality,teens,North America
11,Movie,Dream Horse,Unknown,Unknown,United States,2021-10-18,2020,PG,113 min,"Comedy, Drama",The film tells the inspiring true story of a s...,October,2021,113,min,Comedy,teens,North America


Keep only the columns needed for D3

In [22]:
# .copy() avoids a SettingWithCopyWarning when the 'platform' column is added later
hulu_df = hulu_df[COLUMNS_D3].copy()

#### Disney+ Titles And Movies

In [23]:
disney_df = pd.read_csv("datasets/disney_plus_titles.csv")
disney_df.head(5)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Duck the Halls: A Mickey Mouse Christmas Special,"Alonso Ramirez Ramos, Dave Wasson","Chris Diamantopoulos, Tony Anselmo, Tress MacN...",,"November 26, 2021",2016,TV-G,23 min,"Animation, Family",Join Mickey and the gang as they duck the halls!
1,s2,Movie,Ernest Saves Christmas,John Cherry,"Jim Varney, Noelle Parker, Douglas Seale",,"November 26, 2021",1988,PG,91 min,Comedy,Santa Claus passes his magic bag to a new St. ...
2,s3,Movie,Ice Age: A Mammoth Christmas,Karen Disher,"Raymond Albert Romano, John Leguizamo, Denis L...",United States,"November 26, 2021",2011,TV-G,23 min,"Animation, Comedy, Family",Sid the Sloth is on Santa's naughty list.
3,s4,Movie,The Queen Family Singalong,Hamish Hamilton,"Darren Criss, Adam Lambert, Derek Hough, Alexa...",,"November 26, 2021",2021,TV-PG,41 min,Musical,"This is real life, not just fantasy!"
4,s5,TV Show,The Beatles: Get Back,,"John Lennon, Paul McCartney, George Harrison, ...",,"November 25, 2021",2021,,1 Season,"Docuseries, Historical, Music",A three-part documentary from Peter Jackson ca...


In [24]:
disney_df.drop(columns="show_id", inplace=True)

In [25]:
disney_df.isnull().sum()

type              0
title             0
director        473
cast            190
country         219
date_added        3
release_year      0
rating            3
duration          0
listed_in         0
description       0
dtype: int64

Handling missing values and removing duplicates

In [26]:
# Handle missing values
disney_df['director'] = disney_df['director'].fillna('Unknown')
disney_df['cast'] = disney_df['cast'].fillna('Unknown')
disney_df['country'] = disney_df['country'].fillna(disney_df['country'].mode()[0])
disney_df.dropna(inplace=True)

# Remove duplicates
disney_df.drop_duplicates(inplace=True)

In [27]:
disney_df.isnull().sum()

type            0
title           0
director        0
cast            0
country         0
date_added      0
release_year    0
rating          0
duration        0
listed_in       0
description     0
dtype: int64

In [28]:
disney_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1444 entries, 0 to 1449
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   type          1444 non-null   object
 1   title         1444 non-null   object
 2   director      1444 non-null   object
 3   cast          1444 non-null   object
 4   country       1444 non-null   object
 5   date_added    1444 non-null   object
 6   release_year  1444 non-null   int64 
 7   rating        1444 non-null   object
 8   duration      1444 non-null   object
 9   listed_in     1444 non-null   object
 10  description   1444 non-null   object
dtypes: int64(1), object(10)
memory usage: 135.4+ KB


Convert the date column to datetime and add month and year columns

In [29]:
disney_df["date_added"] = pd.to_datetime(disney_df['date_added'])

disney_df['month_added'] = disney_df['date_added'].dt.month_name()
disney_df['year_added'] = disney_df['date_added'].dt.year

disney_df.head(5)

Unnamed: 0,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,month_added,year_added
0,Movie,Duck the Halls: A Mickey Mouse Christmas Special,"Alonso Ramirez Ramos, Dave Wasson","Chris Diamantopoulos, Tony Anselmo, Tress MacN...",United States,2021-11-26,2016,TV-G,23 min,"Animation, Family",Join Mickey and the gang as they duck the halls!,November,2021
1,Movie,Ernest Saves Christmas,John Cherry,"Jim Varney, Noelle Parker, Douglas Seale",United States,2021-11-26,1988,PG,91 min,Comedy,Santa Claus passes his magic bag to a new St. ...,November,2021
2,Movie,Ice Age: A Mammoth Christmas,Karen Disher,"Raymond Albert Romano, John Leguizamo, Denis L...",United States,2021-11-26,2011,TV-G,23 min,"Animation, Comedy, Family",Sid the Sloth is on Santa's naughty list.,November,2021
3,Movie,The Queen Family Singalong,Hamish Hamilton,"Darren Criss, Adam Lambert, Derek Hough, Alexa...",United States,2021-11-26,2021,TV-PG,41 min,Musical,"This is real life, not just fantasy!",November,2021
5,Movie,Becoming Cousteau,Liz Garbus,"Jacques Yves Cousteau, Vincent Cassel",United States,2021-11-24,2021,PG-13,94 min,"Biographical, Documentary",An inside look at the legendary life of advent...,November,2021


Cleaning the text columns

In [30]:
disney_df['title'] = disney_df['title'].str.strip()
disney_df['director'] = disney_df['director'].str.strip()
disney_df['country'] = disney_df['country'].str.strip()
disney_df['listed_in'] = disney_df['listed_in'].str.strip()
disney_df['description'] = disney_df['description'].str.strip()

disney_df.head(5)

Unnamed: 0,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,month_added,year_added
0,Movie,Duck the Halls: A Mickey Mouse Christmas Special,"Alonso Ramirez Ramos, Dave Wasson","Chris Diamantopoulos, Tony Anselmo, Tress MacN...",United States,2021-11-26,2016,TV-G,23 min,"Animation, Family",Join Mickey and the gang as they duck the halls!,November,2021
1,Movie,Ernest Saves Christmas,John Cherry,"Jim Varney, Noelle Parker, Douglas Seale",United States,2021-11-26,1988,PG,91 min,Comedy,Santa Claus passes his magic bag to a new St. ...,November,2021
2,Movie,Ice Age: A Mammoth Christmas,Karen Disher,"Raymond Albert Romano, John Leguizamo, Denis L...",United States,2021-11-26,2011,TV-G,23 min,"Animation, Comedy, Family",Sid the Sloth is on Santa's naughty list.,November,2021
3,Movie,The Queen Family Singalong,Hamish Hamilton,"Darren Criss, Adam Lambert, Derek Hough, Alexa...",United States,2021-11-26,2021,TV-PG,41 min,Musical,"This is real life, not just fantasy!",November,2021
5,Movie,Becoming Cousteau,Liz Garbus,"Jacques Yves Cousteau, Vincent Cassel",United States,2021-11-24,2021,PG-13,94 min,"Biographical, Documentary",An inside look at the legendary life of advent...,November,2021


Creating the derived columns

In [31]:
disney_df[['duration_num', 'duration_type']] = disney_df['duration'].str.extract(r'(\d+)\s*(\w+)', expand=True)
disney_df['duration_num'] = pd.to_numeric(disney_df['duration_num'], errors='coerce')
disney_df['country'] = disney_df['country'].str.split(',').str[0].str.strip()
disney_df['genre'] = disney_df['listed_in'].str.split(',').str[0].str.strip()
disney_df['audience'] = disney_df['rating'].apply(map_rating_to_audience)
disney_df['continent'] = disney_df['country'].apply(get_continent_from_country)

disney_df.head(5)

Unnamed: 0,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,month_added,year_added,duration_num,duration_type,genre,audience,continent
0,Movie,Duck the Halls: A Mickey Mouse Christmas Special,"Alonso Ramirez Ramos, Dave Wasson","Chris Diamantopoulos, Tony Anselmo, Tress MacN...",United States,2021-11-26,2016,TV-G,23 min,"Animation, Family",Join Mickey and the gang as they duck the halls!,November,2021,23,min,Animation,kids,North America
1,Movie,Ernest Saves Christmas,John Cherry,"Jim Varney, Noelle Parker, Douglas Seale",United States,2021-11-26,1988,PG,91 min,Comedy,Santa Claus passes his magic bag to a new St. ...,November,2021,91,min,Comedy,teens,North America
2,Movie,Ice Age: A Mammoth Christmas,Karen Disher,"Raymond Albert Romano, John Leguizamo, Denis L...",United States,2021-11-26,2011,TV-G,23 min,"Animation, Comedy, Family",Sid the Sloth is on Santa's naughty list.,November,2021,23,min,Animation,kids,North America
3,Movie,The Queen Family Singalong,Hamish Hamilton,"Darren Criss, Adam Lambert, Derek Hough, Alexa...",United States,2021-11-26,2021,TV-PG,41 min,Musical,"This is real life, not just fantasy!",November,2021,41,min,Musical,teens,North America
5,Movie,Becoming Cousteau,Liz Garbus,"Jacques Yves Cousteau, Vincent Cassel",United States,2021-11-24,2021,PG-13,94 min,"Biographical, Documentary",An inside look at the legendary life of advent...,November,2021,94,min,Biographical,unknown,North America


Keep only the columns needed for D3

In [32]:
# .copy() avoids a SettingWithCopyWarning when the 'platform' column is added later
disney_df = disney_df[COLUMNS_D3].copy()

#### Amazon Prime Titles And Movies

In [33]:
amazon_df = pd.read_csv("datasets/amazon_prime_titles.csv")
amazon_df.head(5)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,The Grand Seduction,Don McKellar,"Brendan Gleeson, Taylor Kitsch, Gordon Pinsent",Canada,"March 30, 2021",2014,,113 min,"Comedy, Drama",A small fishing village must procure a local d...
1,s2,Movie,Take Care Good Night,Girish Joshi,"Mahesh Manjrekar, Abhay Mahajan, Sachin Khedekar",India,"March 30, 2021",2018,13+,110 min,"Drama, International",A Metro Family decides to fight a Cyber Crimin...
2,s3,Movie,Secrets of Deception,Josh Webber,"Tom Sizemore, Lorenzo Lamas, Robert LaSardo, R...",United States,"March 30, 2021",2017,,74 min,"Action, Drama, Suspense",After a man discovers his wife is cheating on ...
3,s4,Movie,Pink: Staying True,Sonia Anderson,"Interviews with: Pink, Adele, Beyoncé, Britney...",United States,"March 30, 2021",2014,,69 min,Documentary,"Pink breaks the mold once again, bringing her ..."
4,s5,Movie,Monster Maker,Giles Foster,"Harry Dean Stanton, Kieran O'Brien, George Cos...",United Kingdom,"March 30, 2021",1989,,45 min,"Drama, Fantasy",Teenage Matt Banting wants to work with a famo...


In [34]:
amazon_df.drop(columns="show_id", inplace=True)

In [35]:
amazon_df.isnull().sum()

type               0
title              0
director        2082
cast            1233
country         8996
date_added      9513
release_year       0
rating           337
duration           0
listed_in          0
description        0
dtype: int64

Handling missing values and removing duplicates

In [36]:
# Handle missing values
amazon_df['director'] = amazon_df['director'].fillna('Unknown')
amazon_df['cast'] = amazon_df['cast'].fillna('Unknown')
amazon_df['country'] = amazon_df['country'].fillna(amazon_df['country'].mode()[0])
amazon_df.dropna(inplace=True)

# Remove duplicates
amazon_df.drop_duplicates(inplace=True)

In [37]:
amazon_df.isnull().sum()

type            0
title           0
director        0
cast            0
country         0
date_added      0
release_year    0
rating          0
duration        0
listed_in       0
description     0
dtype: int64

In [38]:
amazon_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 140 entries, 1 to 9647
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   type          140 non-null    object
 1   title         140 non-null    object
 2   director      140 non-null    object
 3   cast          140 non-null    object
 4   country       140 non-null    object
 5   date_added    140 non-null    object
 6   release_year  140 non-null    int64 
 7   rating        140 non-null    object
 8   duration      140 non-null    object
 9   listed_in     140 non-null    object
 10  description   140 non-null    object
dtypes: int64(1), object(10)
memory usage: 13.1+ KB
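
Only 140 Amazon Prime rows survive `dropna`, largely because `date_added` is missing in 9513 rows. Before dropping, it can help to see which column drives the loss. The sketch below does this on a made-up table (the `toy` DataFrame and its values are assumptions) by comparing the share of missing values per column with the number of rows kept under different `dropna` calls.

```python
import pandas as pd

# Made-up rows where date_added is mostly missing, as in the Amazon file.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "date_added": ["March 30, 2021", None, None, None],
    "rating": ["13+", "R", None, "ALL"],
})

# Share of missing values per column, largest first.
missing_share = toy.isnull().mean().sort_values(ascending=False)

# Rows kept when every column must be present vs. only 'rating'.
kept_all = len(toy.dropna())
kept_rating_only = len(toy.dropna(subset=["rating"]))

print(missing_share)
print(kept_all, kept_rating_only)
```

Whether to keep rows without `date_added` depends on the charts planned: any view based on `month_added` or `year_added` needs the date, while other views could use the larger set.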


Convert the date column to datetime and add month and year columns

In [39]:
amazon_df["date_added"] = pd.to_datetime(amazon_df['date_added'])

amazon_df['month_added'] = amazon_df['date_added'].dt.month_name()
amazon_df['year_added'] = amazon_df['date_added'].dt.year

amazon_df.head(5)

Unnamed: 0,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,month_added,year_added
1,Movie,Take Care Good Night,Girish Joshi,"Mahesh Manjrekar, Abhay Mahajan, Sachin Khedekar",India,2021-03-30,2018,13+,110 min,"Drama, International",A Metro Family decides to fight a Cyber Crimin...,March,2021
553,TV Show,The Lucy Show,Unknown,"Lucille Ball, Vivian Vance, Gale Gordon, Jimmy...",United States,2021-03-30,1967,13+,3 Seasons,Comedy,"After her husband's death, Lucy Carmichael, an...",March,2021
1089,TV Show,Scaredy Squirrel,Unknown,"Terry McGurrin, Jonathan Gould, Pat McKenna, L...",United States,2021-03-30,2012,TV-G,4 Seasons,"Animation, Kids",Based on a series of children's books written ...,March,2021
1442,TV Show,Oddbods,Unknown,Unknown,United States,2021-03-30,2020,ALL,2 Seasons,"Animation, Comedy, Kids",Oddbods is an award winning animated series th...,March,2021
2096,TV Show,Inspector Manara (English Subtitled),Unknown,"Guido Caprino, Roberta Giarrusso, Anna Safronc...",United States,2021-03-30,2011,TV-NR,2 Seasons,"International, Suspense",In Italian with English subtitles. In season 2...,March,2021


Cleaning the text columns

In [40]:
amazon_df['title'] = amazon_df['title'].str.strip()
amazon_df['director'] = amazon_df['director'].str.strip()
amazon_df['country'] = amazon_df['country'].str.strip()
amazon_df['listed_in'] = amazon_df['listed_in'].str.strip()
amazon_df['description'] = amazon_df['description'].str.strip()

amazon_df.head(5)

Unnamed: 0,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,month_added,year_added
1,Movie,Take Care Good Night,Girish Joshi,"Mahesh Manjrekar, Abhay Mahajan, Sachin Khedekar",India,2021-03-30,2018,13+,110 min,"Drama, International",A Metro Family decides to fight a Cyber Crimin...,March,2021
553,TV Show,The Lucy Show,Unknown,"Lucille Ball, Vivian Vance, Gale Gordon, Jimmy...",United States,2021-03-30,1967,13+,3 Seasons,Comedy,"After her husband's death, Lucy Carmichael, an...",March,2021
1089,TV Show,Scaredy Squirrel,Unknown,"Terry McGurrin, Jonathan Gould, Pat McKenna, L...",United States,2021-03-30,2012,TV-G,4 Seasons,"Animation, Kids",Based on a series of children's books written ...,March,2021
1442,TV Show,Oddbods,Unknown,Unknown,United States,2021-03-30,2020,ALL,2 Seasons,"Animation, Comedy, Kids",Oddbods is an award winning animated series th...,March,2021
2096,TV Show,Inspector Manara (English Subtitled),Unknown,"Guido Caprino, Roberta Giarrusso, Anna Safronc...",United States,2021-03-30,2011,TV-NR,2 Seasons,"International, Suspense",In Italian with English subtitles. In season 2...,March,2021


Creating the derived columns

In [41]:
amazon_df[['duration_num', 'duration_type']] = amazon_df['duration'].str.extract(r'(\d+)\s*(\w+)', expand=True)
amazon_df['duration_num'] = pd.to_numeric(amazon_df['duration_num'], errors='coerce')
amazon_df['country'] = amazon_df['country'].str.split(',').str[0].str.strip()
amazon_df['genre'] = amazon_df['listed_in'].str.split(',').str[0].str.strip()
amazon_df['audience'] = amazon_df['rating'].apply(map_rating_to_audience)
amazon_df['continent'] = amazon_df['country'].apply(get_continent_from_country)

amazon_df.head(5)

Unnamed: 0,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,month_added,year_added,duration_num,duration_type,genre,audience,continent
1,Movie,Take Care Good Night,Girish Joshi,"Mahesh Manjrekar, Abhay Mahajan, Sachin Khedekar",India,2021-03-30,2018,13+,110 min,"Drama, International",A Metro Family decides to fight a Cyber Crimin...,March,2021,110,min,Drama,teens,Asia
553,TV Show,The Lucy Show,Unknown,"Lucille Ball, Vivian Vance, Gale Gordon, Jimmy...",United States,2021-03-30,1967,13+,3 Seasons,Comedy,"After her husband's death, Lucy Carmichael, an...",March,2021,3,Seasons,Comedy,teens,North America
1089,TV Show,Scaredy Squirrel,Unknown,"Terry McGurrin, Jonathan Gould, Pat McKenna, L...",United States,2021-03-30,2012,TV-G,4 Seasons,"Animation, Kids",Based on a series of children's books written ...,March,2021,4,Seasons,Animation,kids,North America
1442,TV Show,Oddbods,Unknown,Unknown,United States,2021-03-30,2020,ALL,2 Seasons,"Animation, Comedy, Kids",Oddbods is an award winning animated series th...,March,2021,2,Seasons,Animation,kids,North America
2096,TV Show,Inspector Manara (English Subtitled),Unknown,"Guido Caprino, Roberta Giarrusso, Anna Safronc...",United States,2021-03-30,2011,TV-NR,2 Seasons,"International, Suspense",In Italian with English subtitles. In season 2...,March,2021,2,Seasons,International,unknown,North America
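
To make the duration split easier to follow, here is a minimal sketch of the same `str.extract` pattern on a few hand-written values. The `sample` frame and its contents are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample values, not taken from the dataset.
sample = pd.DataFrame({"duration": ["110 min", "3 Seasons", "1 Season", None]})

# (\d+) captures the number; (\w+) captures the unit after optional whitespace.
parts = sample["duration"].str.extract(r"(\d+)\s*(\w+)", expand=True)

# Missing durations yield NaN in both parts; errors="coerce" keeps them as NaN.
sample["duration_num"] = pd.to_numeric(parts[0], errors="coerce")
sample["duration_type"] = parts[1]

print(sample)
```

Note that the unit is kept verbatim, so `Season` and `Seasons` remain two distinct values of `duration_type`.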


Keep only the columns needed for D3

In [42]:
# .copy() avoids a SettingWithCopyWarning when the 'platform' column is added later
amazon_df = amazon_df[COLUMNS_D3].copy()

### Combining the dataframes and saving

In [43]:
netflix_df['platform'] = "Netflix"
disney_df['platform'] = "Disney"
amazon_df['platform'] = "Amazon"
hulu_df['platform'] = "Hulu"

combined_df = pd.concat([netflix_df, disney_df, amazon_df, hulu_df], ignore_index=True)
combined_df.head(5)

Unnamed: 0,type,title,director,country,date_added,release_year,rating,duration_num,duration_type,genre,month_added,year_added,audience,continent,platform
0,Movie,Dick Johnson Is Dead,Kirsten Johnson,United States,2021-09-25,2020,PG-13,90,min,Documentaries,September,2021,unknown,North America,Netflix
1,TV Show,Blood & Water,Unknown,South Africa,2021-09-24,2021,TV-MA,2,Seasons,International TV Shows,September,2021,adults,Africa,Netflix
2,TV Show,Ganglands,Julien Leclercq,United States,2021-09-24,2021,TV-MA,1,Season,Crime TV Shows,September,2021,adults,North America,Netflix
3,TV Show,Jailbirds New Orleans,Unknown,United States,2021-09-24,2021,TV-MA,1,Season,Docuseries,September,2021,adults,North America,Netflix
4,TV Show,Kota Factory,Unknown,India,2021-09-24,2021,TV-MA,2,Seasons,International TV Shows,September,2021,adults,Asia,Netflix
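
As a small illustration of the tagging and concatenation step, the sketch below uses two hypothetical frames (`a` and `b`) in place of the per-platform dataframes. It only shows how `ignore_index=True` renumbers the rows.

```python
import pandas as pd

# Hypothetical small frames standing in for the per-platform dataframes.
a = pd.DataFrame({"title": ["A1", "A2"], "genre": ["Drama", "Comedy"]})
b = pd.DataFrame({"title": ["B1"], "genre": ["Action"]})

# Tag each frame with its source before stacking them.
a["platform"] = "Netflix"
b["platform"] = "Hulu"

# ignore_index=True discards each frame's own index and renumbers rows 0..n-1.
combined = pd.concat([a, b], ignore_index=True)

print(combined)
print(combined.index.tolist())
```

Without `ignore_index=True`, the row labels of each source frame would be kept, so the combined index could contain duplicates.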


In [44]:
import pandas as pd
import nltk
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Download required NLTK data
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

def get_similarity(word1, word2):
    """Return the highest WordNet path similarity over all synset pairs (0 if none)."""
    synsets1 = wordnet.synsets(word1)
    synsets2 = wordnet.synsets(word2)
    max_similarity = 0
    for s1 in synsets1:
        for s2 in synsets2:
            similarity = s1.path_similarity(s2)
            if similarity and similarity > max_similarity:
                max_similarity = similarity
    return max_similarity

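def categorize_genres(genres, similarity_threshold=0.5):
    """Greedily group genres: each genre joins the first existing category whose
    similarity exceeds the threshold; otherwise it starts a new category keyed
    by its lowercased, lemmatized form."""
    lemmatizer = WordNetLemmatizer()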
    categorized = {}
    
    for genre in genres:
        tokens = word_tokenize(genre.lower())
        lemmas = [lemmatizer.lemmatize(token) for token in tokens]
        found_category = False
        
        for category in categorized:
            similarity = get_similarity(' '.join(lemmas), category)
            if similarity > similarity_threshold:
                categorized[category].append(genre)
                found_category = True
                break
        
        if not found_category:
            categorized[' '.join(lemmas)] = [genre]
    
    return categorized

# Get unique genres
unique_genres = combined_df['genre'].unique()

# Categorize genres
categorized_genres = categorize_genres(unique_genres)

# Create a mapping dictionary
genre_mapping = {genre: category for category, genres in categorized_genres.items() for genre in genres}

# Apply the mapping to create a new column
combined_df['genre_category'] = combined_df['genre'].map(genre_mapping)

# Display the result
print(combined_df[['title', 'genre', 'genre_category']])


                       title                   genre         genre_category
0       Dick Johnson Is Dead           Documentaries            documentary
1              Blood & Water  International TV Shows  international tv show
2                  Ganglands          Crime TV Shows          crime tv show
3      Jailbirds New Orleans              Docuseries             docuseries
4               Kota Factory  International TV Shows  international tv show
...                      ...                     ...                    ...
12521             Heroic Age                  Action                 action
12522   Black Blood Brothers                  Action                 action
12523    Doogie Howser, M.D.                  Comedy                 comedy
12524          Lost in Space                  Action                 action
12525           Hikaru no Go                   Anime                  anime

[12526 rows x 3 columns]


In [45]:
combined_df['genre_category'].value_counts()[0:20]

drama                    1949
comedy                   1766
documentary              1104
action & adventure        859
international tv show     773
child & family movie      605
action                    487
action-adventure          450
crime tv show             399
kid ' tv                  385
animation                 361
stand-up comedy           334
horror movie              275
british tv show           252
docuseries                252
anime series              174
animal & nature           173
crime                     141
international movie       128
reality tv                120
Name: genre_category, dtype: int64

In [46]:
import re

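def categorize_genre(genre):
    """Map a genre label to a broad category; rules are tested in order, first match wins."""
    genre = genre.lower()
    
    # 'drama' also matches 'tv drama', so a single alternative is enough
    if re.search(r'drama', genre):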
        return 'Drama'
    elif re.search(r'comed|stand-up', genre):
        return 'Comedy'
    elif re.search(r'document|docuseries', genre):
        return 'Documentary'
    elif re.search(r'action|adventure', genre):
        return 'Action & Adventure'
    elif 'international' in genre:
        return 'International'
    elif re.search(r'child|family|kid', genre):
        return 'Children & Family'
    elif 'crime' in genre:
        return 'Crime'
    elif 'animation' in genre:
        return 'Animation'
    elif 'horror' in genre:
        return 'Horror'
    elif 'british' in genre:
        return 'British'
    elif 'anime' in genre:
        return 'Anime'
    elif re.search(r'animal|nature', genre):
        return 'Nature & Animals'
    elif 'reality' in genre:
        return 'Reality TV'
    else:
        return 'Other'
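
Because the rules are tested in order, a label that matches several patterns gets the category of the first one. The sketch below is a hypothetical table-driven variant (`RULES` and `categorize_with_rules` are illustrative names, and only a subset of the rules is shown) that makes this ordering explicit.

```python
import re

# Ordered (pattern, label) pairs; the first pattern that matches wins.
# Hypothetical table-driven variant, limited to a subset of the rules.
RULES = [
    (r"drama", "Drama"),
    (r"comed|stand-up", "Comedy"),
    (r"document|docuseries", "Documentary"),
    (r"action|adventure", "Action & Adventure"),
    (r"international", "International"),
    (r"child|family|kid", "Children & Family"),
    (r"crime", "Crime"),
]

def categorize_with_rules(genre, rules=RULES, default="Other"):
    """Return the label of the first rule whose pattern occurs in the genre."""
    genre = genre.lower()
    for pattern, label in rules:
        if re.search(pattern, genre):
            return label
    return default

examples = ["crime tv show", "crime dramas", "international movie", "kid ' tv", "sci-fi"]
results = {g: categorize_with_rules(g) for g in examples}
for g, label in results.items():
    print(g, "->", label)
```

Here `crime dramas` falls under Drama rather than Crime, simply because the drama rule is listed first. Reordering the table changes the outcome for such overlapping labels.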

In [47]:
# Replace the raw genre with its broad category, then drop the intermediate column
combined_df['genre'] = combined_df['genre_category'].apply(categorize_genre)
combined_df.drop(columns='genre_category', inplace=True)

In [48]:
output_path = "../streaming_data.csv"
combined_df.to_csv(output_path)
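
One detail worth knowing about this save step: by default `to_csv` also writes the row index, which shows up as an `Unnamed: 0` column when the file is read back. The sketch below demonstrates this on a hypothetical two-row frame, using an in-memory buffer instead of a file.

```python
import io
import pandas as pd

# Hypothetical frame; an in-memory buffer stands in for the CSV file.
df = pd.DataFrame({"title": ["A", "B"], "platform": ["Netflix", "Hulu"]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)     # ['Unnamed: 0', 'title', 'platform']
print(cols_without)  # ['title', 'platform']
```

Whether to pass `index=False` depends on whether the consumer of the file expects that first column.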