# Netflix Catalog Analysis - Data Cleaning Project

## Project context
This project is part of my data analysis learning path. The goal is to master the fundamental techniques of **data cleaning and manipulation** with pandas before moving on to visualization in my next projects.

## Learning objectives
- Handle encoding problems when importing CSV files
- Clean and transform columns (types, missing values, formats)
- Normalize multi-valued data (columns holding comma-separated lists)
- Create time-based features from dates
- Carry out a basic exploratory analysis

## Dataset
The dataset contains 8,809 Netflix titles (movies and TV shows) with information on directors, cast, production countries, dates added, durations, and categories.

**Source**: Netflix Titles Dataset  
**Period**: January 2026  
**Level**: Learning - Focus on data cleaning

<h1>Data Cleaning and Data Modeling</h1>

In [1]:
import pandas as pd
import numpy as np

## Phase 1: Importing and exploring the data

### Step 1: Importing the CSV file
I start by importing the dataset. If an encoding problem occurs, I try other encodings (UTF-8 first, then latin-1, which Python treats as an alias of ISO-8859-1).
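
As a minimal sketch of that fallback, the loop below tries each encoding in turn. The helper name `read_csv_with_fallback` is hypothetical (it is not part of pandas), and the order of encodings is an assumption: latin-1 goes last because it can decode any byte sequence, so it never raises and would hide a better match.

```python
import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "latin-1")):
    """Hypothetical helper: return (DataFrame, encoding_used).

    Encodings are tried in order; the first one that decodes the
    whole file without error wins.
    """
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc), enc
        except UnicodeDecodeError as err:
            # Remember the failure and move on to the next candidate.
            last_error = err
    raise last_error
```

Calling `read_csv_with_fallback('netflix_titles.csv')` would then return both the DataFrame and the encoding that worked, which is worth recording in the notebook.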

In [2]:
# Import the Netflix dataset
df = pd.read_csv('netflix_titles.csv', encoding='latin')
df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,...,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24,Unnamed: 25
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,...,,,,,,,,,,
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,...,,,,,,,,,,
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,...,,,,,,,,,,
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,...,,,,,,,,,,
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,...,,,,,,,,,,


### Step 2: Creating a working copy
I create a copy of the original DataFrame so the raw data stays untouched.

In [3]:
# Create a working copy
df_wip = df.copy()
df_wip.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,...,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24,Unnamed: 25
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,...,,,,,,,,,,
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,...,,,,,,,,,,
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,...,,,,,,,,,,
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,...,,,,,,,,,,
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,...,,,,,,,,,,


### Step 3: Initial cleaning
I drop the empty "Unnamed" columns, then set `show_id` as the index after checking that it contains neither duplicates nor missing values.
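
The same step can be sketched without listing every column by hand. This is only an illustration on a toy frame: the function name `drop_unnamed_and_check_key` is hypothetical, and dropping by name prefix assumes the "Unnamed" columns really are empty, which should be checked first.

```python
import pandas as pd


def drop_unnamed_and_check_key(frame, key="show_id"):
    """Hypothetical helper: drop 'Unnamed: N' columns, validate `key`, index on it."""
    keep = ~frame.columns.str.startswith("Unnamed")
    cleaned = frame.loc[:, keep]
    # A usable index must have no missing values and no duplicates.
    if cleaned[key].isna().any():
        raise ValueError(f"{key} contains missing values")
    if cleaned[key].duplicated().any():
        raise ValueError(f"{key} contains duplicates")
    return cleaned.set_index(key)


# Toy data standing in for the real file.
raw = pd.DataFrame(
    {"show_id": ["s1", "s2"], "title": ["A", "B"], "Unnamed: 12": [None, None]}
)
tidy = drop_unnamed_and_check_key(raw)
```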

In [4]:
# Drop the Unnamed columns by keeping only the named ones
df_wip = df_wip[["show_id", "type", "title", "director", "cast", "country", "date_added", "release_year", "rating", "duration", "listed_in", "description"]]
df_wip.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...


In [5]:
# Check the 'show_id' column for missing values, then for duplicates
print(df['show_id'].isnull().any())
print(df['show_id'].duplicated().any())

False
False


In [6]:
# Set "show_id" as the index
df_wip.set_index('show_id', inplace=True)
df_wip.head()

Unnamed: 0_level_0,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
show_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...


In [7]:
# Dataset information (info() prints its report itself and returns None, so no print() is needed)
df_wip.info()

<class 'pandas.core.frame.DataFrame'>
Index: 8809 entries, s1 to s8809
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   type          8809 non-null   object
 1   title         8809 non-null   object
 2   director      6175 non-null   object
 3   cast          7984 non-null   object
 4   country       7978 non-null   object
 5   date_added    8799 non-null   object
 6   release_year  8809 non-null   int64 
 7   rating        8805 non-null   object
 8   duration      8806 non-null   object
 9   listed_in     8809 non-null   object
 10  description   8809 non-null   object
dtypes: int64(1), object(10)
memory usage: 825.8+ KB


In [8]:
# Dataset dimensions (rows, columns)
df_wip.shape

(8809, 11)

## Phase 2: Processing the date columns

### Problem identified
The `date_added` column is stored as text ("September 25, 2021"), whereas it should be a datetime to allow time-based analysis.

### Solution
I use `pd.to_datetime()` with `errors='coerce'` so that invalid dates are converted to NaT (Not a Time) instead of raising an error.

In [9]:
# Preview the "date_added" column
df_wip.loc[:, 'date_added']

show_id
s1       September 25, 2021
s2       September 24, 2021
s3       September 24, 2021
s4       September 24, 2021
s5       September 24, 2021
                ...        
s8805      November 1, 2019
s8806      January 11, 2020
s8807         March 2, 2019
s8808         April 5, 2024
s8809         April 5, 2024
Name: date_added, Length: 8809, dtype: object

In [10]:
# Convert the "date_added" column to datetime
df_wip['date_added'] = pd.to_datetime(df_wip['date_added'],
                                      format='%B %d, %Y',
                                      errors='coerce')
df_wip.info()

<class 'pandas.core.frame.DataFrame'>
Index: 8809 entries, s1 to s8809
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   type          8809 non-null   object        
 1   title         8809 non-null   object        
 2   director      6175 non-null   object        
 3   cast          7984 non-null   object        
 4   country       7978 non-null   object        
 5   date_added    8711 non-null   datetime64[ns]
 6   release_year  8809 non-null   int64         
 7   rating        8805 non-null   object        
 8   duration      8806 non-null   object        
 9   listed_in     8809 non-null   object        
 10  description   8809 non-null   object        
dtypes: datetime64[ns](1), int64(1), object(9)
memory usage: 1.1+ MB
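
The two `info()` reports show `date_added` going from 8799 to 8711 non-null values, so 88 non-empty entries did not match the strict format and were silently turned into NaT. The sketch below shows one way to audit such losses. It assumes, without having verified it on this file, that stray whitespace around the text is a likely cause; the function name `parse_dates_with_audit` is hypothetical.

```python
import pandas as pd


def parse_dates_with_audit(raw, fmt="%B %d, %Y"):
    """Hypothetical helper: parse dates and return the values that failed.

    Returns (parsed, failed), where `failed` holds the original strings
    that were present but could not be parsed.
    """
    cleaned = raw.str.strip()  # assumption: whitespace is the main culprit
    parsed = pd.to_datetime(cleaned, format=fmt, errors="coerce")
    failed = raw[raw.notna() & parsed.isna()]
    return parsed, failed


# Toy data: one clean value, one with a leading space, one missing, one invalid.
toy_dates = pd.Series(["September 25, 2021", " April 5, 2024", None, "not a date"])
parsed, failed = parse_dates_with_audit(toy_dates)
```

Inspecting `failed` before accepting the conversion makes it clear whether the NaT values come from genuinely missing dates or from a formatting quirk that can be fixed.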


## Phase 3: Restructuring the Duration column

### Challenge
The `duration` column mixes two kinds of information:
- For movies: length in minutes (e.g. "90 min")
- For TV shows: number of seasons (e.g. "2 Seasons")

Because the two units share one text column, no statistical analysis of durations is possible as is. I need to split the information into two separate columns.

### Approach
1. Check that the `type` column only contains 'Movie' and 'TV Show'
2. Inspect the format of the values in `duration`
3. Create two new numeric columns:
   - `duration_movies`: length in minutes (float) for movies
   - `duration_tv_show`: number of seasons (float) for TV shows
4. Drop the old `duration` column

In [11]:
# Check the values in the "type" column
df_wip['type'].unique()

array(['Movie', 'TV Show'], dtype=object)

In [12]:
# Inspect the 'duration' column

df_wip['duration'].head(10)

show_id
s1        90 min
s2     2 Seasons
s3      1 Season
s4      1 Season
s5     2 Seasons
s6      1 Season
s7        91 min
s8       125 min
s9     9 Seasons
s10      104 min
Name: duration, dtype: object

In [13]:
df_wip.groupby("duration")['type'].value_counts()

duration    type   
1 Season    TV Show    1794
10 Seasons  TV Show       7
10 min      Movie         1
100 min     Movie       108
101 min     Movie       116
                       ... 
95 min      Movie       137
96 min      Movie       130
97 min      Movie       146
98 min      Movie       120
99 min      Movie       118
Name: count, Length: 220, dtype: int64

In [14]:
df_wip.groupby("duration")['type'].unique()

duration
1 Season      [TV Show]
10 Seasons    [TV Show]
10 min          [Movie]
100 min         [Movie]
101 min         [Movie]
                ...    
95 min          [Movie]
96 min          [Movie]
97 min          [Movie]
98 min          [Movie]
99 min          [Movie]
Name: type, Length: 220, dtype: object
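
The two group-by checks above produce 220 rows each, which is hard to scan. As a possible shortcut, sketched here on toy data, the unit can be extracted from `duration` and cross-tabulated against `type`. The regular expression assumes the only units are "min", "Season" and "Seasons", as seen in the outputs above.

```python
import pandas as pd

# Toy data standing in for df_wip[['type', 'duration']].
sample = pd.DataFrame(
    {
        "type": ["Movie", "TV Show", "TV Show", "Movie"],
        "duration": ["90 min", "2 Seasons", "1 Season", "125 min"],
    }
)

# Split each value into its number and its unit (named groups become columns).
parts = sample["duration"].str.extract(r"(?P<value>\d+)\s*(?P<unit>min|Seasons?)")
# Treat "Season" and "Seasons" as the same unit.
parts["unit"] = parts["unit"].str.replace("Seasons", "Season", regex=False)

# One row per type, one column per unit; mismatched cells should be zero.
unit_by_type = pd.crosstab(sample["type"], parts["unit"])
```

A non-zero count in the Movie/Season or TV Show/min cell would flag rows where the unit does not match the type.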

## Phase 4: Normalizing multi-valued columns

### Problem
Some columns, such as `country`, `cast` and `listed_in`, hold several comma-separated values in a single cell. For example:
- country: "United States, United Kingdom, France"
- cast: "Adam Sandler, Drew Barrymore, Rob Schneider"

This structure makes it impossible to answer questions such as "Which actors appear most often?" or "Which country produces the most documentaries?".

### Solution: the `.explode()` method
I create auxiliary DataFrames in which each value gets its own row. This makes analysis by country, actor or category possible.

**Process**:
1. Turn the strings into lists with `str.split(',')`
2. Use `.explode()` to create one row per list element
3. Create 3 auxiliary DataFrames: `countries_exploded`, `categories_exploded`, `cast_list`
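
The process can be sketched on a toy frame as follows. Since the source values separate items with a comma followed by a space, splitting on `','` alone leaves a leading space on every item after the first, so this sketch strips each item after exploding. The column name `country_name` is only illustrative.

```python
import pandas as pd

# Toy data with the same shape as the real column: one string per title.
toy = pd.DataFrame(
    {"country": ["United States, Ghana", "India", None]},
    index=pd.Index(["s1", "s2", "s3"], name="show_id"),
)

countries_long = (
    toy["country"]
    .str.split(",")   # string -> list of items
    .explode()        # one row per item, index value repeated
    .str.strip()      # remove the space left after each comma
    .rename("country_name")
)
```

Because the `show_id` index is repeated for each item, the long-format Series can later be joined back to the main DataFrame on its index.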

In [15]:
# Create the "duration_movies" column holding the numeric movie length in minutes.
df_wip['duration_movies'] = df_wip.loc[df_wip['type'] == 'Movie', 'duration'].str.extract(r'(\d+)', expand=False).astype(float)
df_wip['duration_movies']

show_id
s1        90.0
s2         NaN
s3         NaN
s4         NaN
s5         NaN
         ...  
s8805     88.0
s8806     88.0
s8807    111.0
s8808      NaN
s8809    110.0
Name: duration_movies, Length: 8809, dtype: float64

In [16]:
# Create the 'duration_tv_show' column holding the numeric number of seasons.
df_wip['duration_tv_show'] = df_wip.loc[df_wip['type'] == 'TV Show', 'duration'].str.extract(r'(\d+)', expand=False).astype(float)
df_wip['duration_tv_show']

show_id
s1       NaN
s2       2.0
s3       1.0
s4       1.0
s5       2.0
        ... 
s8805    NaN
s8806    NaN
s8807    NaN
s8808    1.0
s8809    NaN
Name: duration_tv_show, Length: 8809, dtype: float64

In [17]:
# Drop the 'duration' column
df_wip.drop(columns = 'duration', inplace=True)
df_wip.head()

Unnamed: 0_level_0,type,title,director,cast,country,date_added,release_year,rating,listed_in,description,duration_movies,duration_tv_show
show_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,2021-09-25,2020,PG-13,Documentaries,"As her father nears the end of his life, filmm...",90.0,
s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,2021-09-24,2021,TV-MA,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t...",,2.0
s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,2021-09-24,2021,TV-MA,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...,,1.0
s4,TV Show,Jailbirds New Orleans,,,,2021-09-24,2021,TV-MA,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo...",,1.0
s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,2021-09-24,2021,TV-MA,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...,,2.0


In [18]:
# Preview the 'country' column

df_wip['country'].head(10).to_frame()

Unnamed: 0_level_0,country
show_id,Unnamed: 1_level_1
s1,United States
s2,South Africa
s3,
s4,
s5,India
s6,
s7,
s8,"United States, Ghana, Burkina Faso, United Kin..."
s9,United Kingdom
s10,United States


In [37]:
# Build the 'countries_exploded' DataFrame from "country"
df_wip['countries'] = df_wip['country'].str.split(',')

# Strip the spaces around each element of the list
df_wip['countries'] = df_wip['countries'].apply(
    lambda x: [i.strip() for i in x] if isinstance(x, list) else x
)

countries_exploded = df_wip.explode('countries')
countries_exploded = countries_exploded[['countries']]

In [21]:
# Build 'categories_exploded' from "listed_in"
df_wip['listed_in_list'] = df_wip['listed_in'].str.split(',')
categories_exploded = df_wip.explode('listed_in_list')
categories_exploded = categories_exploded['listed_in_list']
categories_exploded.unique()

array(['Documentaries', 'International TV Shows', ' TV Dramas',
       ' TV Mysteries', 'Crime TV Shows', ' International TV Shows',
       ' TV Action & Adventure', 'Docuseries', ' Reality TV',
       ' Romantic TV Shows', ' TV Comedies', 'TV Dramas', ' TV Horror',
       'Children & Family Movies', 'Dramas', ' Independent Movies',
       ' International Movies', 'British TV Shows', 'Comedies', ' Dramas',
       ' Docuseries', ' Comedies', ' Crime TV Shows', 'TV Comedies',
       ' Spanish-Language TV Shows', 'Thrillers', ' Romantic Movies',
       ' Music & Musicals', 'Horror Movies', ' Sci-Fi & Fantasy',
       ' TV Thrillers', "Kids' TV", ' Thrillers', 'Action & Adventure',
       ' TV Sci-Fi & Fantasy', ' Classic Movies', ' Horror Movies',
       ' Anime Features', 'Reality TV', ' Sports Movies', 'Anime Series',
       " Kids' TV", 'International Movies', ' Korean TV Shows',
       'Sci-Fi & Fantasy', ' Science & Nature TV', ' Teen TV Shows',
       ' Cult Movies', 'Classic Movies

In [22]:
# Build 'cast_list' from "cast"
df_wip['cast_list'] = df_wip['cast'].str.split(',')
cast_list = df_wip.explode('cast_list')
cast_list = cast_list['cast_list']
cast_list.unique()

array([nan, 'Ama Qamata', ' Khosi Ngema', ..., ' Petr Drozda',
       ' John Comer', ' Benedetta Degli Innocenti'],
      shape=(39324,), dtype=object)

In [23]:
# Save the auxiliary data to CSV
countries_exploded.to_csv('countries_exploded.csv')
categories_exploded.to_csv('categories_exploded.csv')
cast_list.to_csv('cast_list.csv')

In [24]:
# Drop the helper list columns 'countries', 'listed_in_list' and 'cast_list' (the original text columns are kept)
df_wip.drop(columns=['countries','listed_in_list','cast_list'], inplace=True)
df_wip.head()

Unnamed: 0_level_0,type,title,director,cast,country,date_added,release_year,rating,listed_in,description,duration_movies,duration_tv_show
show_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,2021-09-25,2020,PG-13,Documentaries,"As her father nears the end of his life, filmm...",90.0,
s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,2021-09-24,2021,TV-MA,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t...",,2.0
s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,2021-09-24,2021,TV-MA,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...,,1.0
s4,TV Show,Jailbirds New Orleans,,,,2021-09-24,2021,TV-MA,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo...",,1.0
s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,2021-09-24,2021,TV-MA,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...,,2.0


### Extracting time-based features

Now that `date_added` is a datetime, I can extract its components into separate columns. These new features make it possible to analyze the patterns in how content is added to Netflix.

**Features created**:
- `year_added`: year added (to analyze yearly trends)
- `month_added`: month added (to detect seasonality)
- `day_of_the_week_added`: day of the week added, as a number (0 = Monday, 6 = Sunday)

These columns help answer questions such as "Does Netflix add more content on certain days of the week?" or "Have additions accelerated in recent years?".

**Method**: I use the pandas datetime accessors (`.dt.year`, `.dt.month`, `.dt.dayofweek`), which extract each component directly.

In [25]:
df_wip['year_added'] = df_wip['date_added'].dt.year
df_wip['month_added'] = df_wip['date_added'].dt.month
df_wip['day_of_the_week_added'] = df_wip['date_added'].dt.dayofweek

df_wip[['year_added', 'month_added', 'day_of_the_week_added']].head()

Unnamed: 0_level_0,year_added,month_added,day_of_the_week_added
show_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
s1,2021.0,9.0,5.0
s2,2021.0,9.0,4.0
s3,2021.0,9.0,4.0
s4,2021.0,9.0,4.0
s5,2021.0,9.0,4.0
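
The values display as floats (2021.0, 9.0) because rows where `date_added` is NaT produce NaN, and NaN forces a float column. One possible refinement, sketched on toy data, is to cast to the nullable `Int64` dtype and to derive readable day names with `.dt.day_name()`.

```python
import pandas as pd

# Toy dates, including one missing value.
dates = pd.Series(pd.to_datetime(["2021-09-25", None, "2024-04-05"]))

# Nullable integers keep whole numbers and represent the gap as <NA>.
year_added = dates.dt.year.astype("Int64")
day_number = dates.dt.dayofweek.astype("Int64")  # 0 = Monday, 6 = Sunday

# Day names are easier to read than numeric codes in a report.
day_name = dates.dt.day_name()
```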


## 💾 Exporting the cleaned DataFrame

Now that all data cleaning operations are done, I export the DataFrame to a new CSV file. This lets me:
- Keep the cleaning and analysis phases clearly separated
- Reuse the cleaned data in other projects
- Save time by not redoing the preprocessing for each analysis

**Exported file**: `netflix_titles_cleaned.csv`  
**Encoding**: UTF-8 for compatibility  
**Index**: Kept, because `show_id` serves as the unique identifier

In [26]:
# Export the cleaned DataFrame
df_wip.to_csv('netflix_titles_cleaned.csv', encoding='utf-8', index=True)

# Check the export
print(f"Export successful: {len(df_wip)} rows and {len(df_wip.columns)} columns")
print(f"Exported columns: {list(df_wip.columns)}")

Export successful: 8809 rows and 15 columns
Exported columns: ['type', 'title', 'director', 'cast', 'country', 'date_added', 'release_year', 'rating', 'listed_in', 'description', 'duration_movies', 'duration_tv_show', 'year_added', 'month_added', 'day_of_the_week_added']
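
A CSV file does not store dtypes, so the datetime column and the index have to be declared again when the file is reloaded. The sketch below illustrates this round trip on a toy frame, using an in-memory buffer in place of `netflix_titles_cleaned.csv`.

```python
import io

import pandas as pd

# Toy frame with the same index name and a datetime column.
cleaned = pd.DataFrame(
    {"date_added": pd.to_datetime(["2021-09-25", None]), "title": ["A", "B"]},
    index=pd.Index(["s1", "s2"], name="show_id"),
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=True)
buffer.seek(0)

# Restore the index and the datetime dtype explicitly on reload.
reloaded = pd.read_csv(buffer, index_col="show_id", parse_dates=["date_added"])
```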


## Phase 5: Exploratory data analysis

Now that the data is cleaned and structured, I can answer business questions about the Netflix catalog. This exploratory analysis brings out trends and patterns in the data.

**Note**: This analysis is purely statistical (no visualization). Charts will be added in version 2 of this project, once I have learned matplotlib and seaborn.

### Questions about the composition of the catalog

In [27]:
# How many titles are in the dataset?
len(df_wip)

8809

In [28]:
# Split between "Movie" and "TV Show" in the "type" column
df_wip['type'].value_counts()

type
Movie      6132
TV Show    2677
Name: count, dtype: int64

### Questions about trends over time

In [29]:
# Number of titles added per year
df_wip['year_added'].value_counts()

year_added
2019.0    1999
2020.0    1878
2018.0    1625
2021.0    1498
2017.0    1164
2016.0     418
2015.0      73
2014.0      23
2011.0      13
2013.0      10
2012.0       3
2009.0       2
2008.0       2
2024.0       2
2010.0       1
Name: count, dtype: int64
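
`value_counts()` sorts by frequency, which mixes up the years. To read a trend, chronological order may be easier. A small sketch on toy data, using `sort_index()` and the `normalize=True` option to get shares instead of raw counts:

```python
import pandas as pd

# Toy years standing in for df_wip['year_added'].
years = pd.Series([2019, 2020, 2019, 2018, 2019, 2020], name="year_added")

# Counts per year, ordered chronologically rather than by frequency.
per_year = years.value_counts().sort_index()

# Same ordering, expressed as a share of all titles.
share_per_year = years.value_counts(normalize=True).sort_index()
```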

In [30]:
# How are additions distributed across the days of the week? (0 = Monday, 6 = Sunday)
df_wip['day_of_the_week_added'].value_counts()

day_of_the_week_added
4.0    2478
3.0    1387
2.0    1276
1.0    1182
0.0     845
5.0     803
6.0     740
Name: count, dtype: int64

### Questions about content and actors

In [31]:
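# Top 5 most frequent categories
# Note: labels are not stripped, so ' International Movies' (leading space) and 'International Movies' are counted separately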
categories_exploded.value_counts().head(5)

listed_in_list
 International Movies    2624
Dramas                   1600
Comedies                 1210
Action & Adventure        859
Documentaries             829
Name: count, dtype: int64

In [32]:
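# Top 5 actors appearing most often in titles whose country list includes the United States
# Note: names keep the leading space left by split(','), so the same actor can appear under two labels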

merge_cast = pd.merge(cast_list, df_wip, how='inner', left_index=True, right_index=True)
usa_data = merge_cast[merge_cast['country'].str.contains('United States', na=False)]
usa_data['cast_list'].value_counts().head(5)

cast_list
Adam Sandler        20
 Fred Tatasciore    19
 Molly Shannon      16
 Alfred Molina      15
 Tara Strong        15
Name: count, dtype: int64

In [33]:
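# Top 5 countries producing the most documentaries
# Note: this output appears to predate the whitespace cleanup in cell In [37], since ' United States' is still listed separately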

merge_countries = pd.merge(countries_exploded, df_wip, how='inner', left_index=True, right_index=True)
countries_data = merge_countries[merge_countries['listed_in'].str.contains('Documentaries', na=False)]
countries_data['countries'].value_counts().head(5)

countries
United States     453
United Kingdom    103
 United States     59
Canada             33
France             27
Name: count, dtype: int64

### Statistical questions

In [34]:
# Average number of seasons per TV show
df_wip['duration_tv_show'].mean()

np.float64(1.7646619350018677)

In [35]:
# Distribution of movie lengths in minutes (quartiles)
df_wip['duration_movies'].describe()

count    6129.000000
mean       99.578887
std        28.288598
min         3.000000
25%        87.000000
50%        98.000000
75%       114.000000
max       312.000000
Name: duration_movies, dtype: float64

In [36]:
# Titles (movies and TV shows) whose description contains 'drug' (case-sensitive substring match)

drug_show = df_wip[df_wip['description'].str.contains('drug')]
print(len(drug_show))

158
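
The count above comes from a case-sensitive substring match, so it skips "Drug" at the start of a sentence and also matches words like "drugstore". A stricter variant is sketched below on toy descriptions. The pattern `\bdrugs?\b` is my own choice and would change the count on the real data.

```python
import pandas as pd

# Toy descriptions; the last one is missing.
descriptions = pd.Series(
    [
        "To protect his family from a powerful drug lord",
        "Drugs and crime collide in a small town",
        "A drugstore clerk dreams of a new life",
        None,
    ]
)

# Whole-word match for "drug" or "drugs", ignoring case; missing text counts as no match.
mask = descriptions.str.contains(r"\bdrugs?\b", case=False, na=False)
drug_titles = descriptions[mask]
```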
