
# Libraries import

In [57]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Data Set Load

In [58]:
df = pd.read_csv('/content/netflix_titles.csv')

# Data Exploration

In [59]:
df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...


In [60]:
df.shape

(8807, 12)

In [61]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB


In [62]:
df.describe()

Unnamed: 0,release_year
count,8807.0
mean,2014.180198
std,8.819312
min,1925.0
25%,2013.0
50%,2017.0
75%,2019.0
max,2021.0


In [63]:
df.duplicated().value_counts()

False    8807
dtype: int64

In [64]:
df.duplicated().sum()

0
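The check above compares entire rows, so a unique `show_id` per row would make the count zero even if the same title appeared twice. A minimal sketch on made-up data, assuming that `title` plus `type` is a reasonable key for spotting repeated entries (that choice of key is an assumption, not something the notebook establishes):

```python
import pandas as pd

# Toy frame: the rows differ in show_id, but two of them share title and type.
toy = pd.DataFrame({
    "show_id": ["s1", "s2", "s3"],
    "title": ["A", "A", "B"],
    "type": ["Movie", "Movie", "Movie"],
})

# Whole-row comparison: no duplicates, because show_id is unique.
full_row_dupes = toy.duplicated().sum()

# Comparison restricted to a subset of columns.
key_dupes = toy.duplicated(subset=["title", "type"]).sum()

print(full_row_dupes, key_dupes)
```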

There are no duplicate rows.

# Manipulation with Date Time Data

date_added has object dtype and needs to be converted to datetime.

Format: month name, day, year (e.g. "September 25, 2021")

In [65]:
df['date_added'] = pd.to_datetime(df['date_added'])
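The conversion can also be written with an explicit format string, which makes the expected layout visible in the code. The sketch below runs on made-up values; the stray leading space and the missing entry are assumptions about what raw data might contain, and `errors="coerce"` turns anything unparseable into `NaT` instead of raising.

```python
import pandas as pd

# Toy values in the "Month day, year" layout, plus a padded and a missing entry.
dates = pd.Series(["September 25, 2021", " August 4, 2017", None])

# Strip whitespace first, then parse with an explicit format.
parsed = pd.to_datetime(dates.str.strip(), format="%B %d, %Y", errors="coerce")

print(parsed)
```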

In [66]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   show_id       8807 non-null   object        
 1   type          8807 non-null   object        
 2   title         8807 non-null   object        
 3   director      6173 non-null   object        
 4   cast          7982 non-null   object        
 5   country       7976 non-null   object        
 6   date_added    8797 non-null   datetime64[ns]
 7   release_year  8807 non-null   int64         
 8   rating        8803 non-null   object        
 9   duration      8804 non-null   object        
 10  listed_in     8807 non-null   object        
 11  description   8807 non-null   object        
dtypes: datetime64[ns](1), int64(1), object(10)
memory usage: 825.8+ KB


# Separation of Listed Columns

The listed_in and director columns hold comma-separated lists, so we split each of them into multiple columns.

Columns we don't need and will drop:
show_id,
listed_in,
director

In [67]:
df[['Listed_in_1', 'Listed_in_2', 'Listed_in_3']] = df['listed_in'].str.split(', ', expand=True)
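Assigning to exactly three target columns works only when no row has more than three genres. A small sketch on toy data that derives the column names from however many parts the split produces (the `Listed_in_` prefix follows the notebook, the rest is illustrative):

```python
import pandas as pd

# Toy genre strings with one, three and two entries.
genres = pd.Series([
    "Documentaries",
    "International TV Shows, TV Dramas, TV Mysteries",
    "Docuseries, Reality TV",
])

# expand=True returns one column per part; shorter rows are padded with missing values.
split = genres.str.split(", ", expand=True)
split.columns = [f"Listed_in_{i + 1}" for i in range(split.shape[1])]

print(split)
```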

In [68]:
df[['Director_1', 'Director_Others']] = df['director'].str.split(', ', n=1, expand=True)
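To illustrate what limiting the split to one cut does, here is a sketch on toy data. Passing `n=1` as a keyword splits only at the first separator, so the first name goes to one column and everything after it stays together in the second.

```python
import pandas as pd

# Toy director strings: a single name, a two-name list, and a missing value.
directors = pd.Series(["Kirsten Johnson", "Clementine Malpas, Leslie Knott", None])

# n=1 means at most one split, which yields at most two columns.
parts = directors.str.split(", ", n=1, expand=True)
parts.columns = ["Director_1", "Director_Others"]

print(parts)
```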


In [69]:
df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,Listed_in_1,Listed_in_2,Listed_in_3,Director_1,Director_Others
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,2021-09-25,2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm...",Documentaries,,,Kirsten Johnson,
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,2021-09-24,2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t...",International TV Shows,TV Dramas,TV Mysteries,,
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,2021-09-24,2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...,Crime TV Shows,International TV Shows,TV Action & Adventure,Julien Leclercq,
3,s4,TV Show,Jailbirds New Orleans,,,,2021-09-24,2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo...",Docuseries,Reality TV,,,
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,2021-09-24,2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...,International TV Shows,Romantic TV Shows,TV Comedies,,


# Manipulation with Duration Column

In [70]:
df['duration'].value_counts()

1 Season     1793
2 Seasons     425
3 Seasons     199
90 min        152
94 min        146
             ... 
16 min          1
186 min         1
193 min         1
189 min         1
191 min         1
Name: duration, Length: 220, dtype: int64

duration has object dtype and needs to be converted to a numeric type.

The values are given either in minutes (e.g. "90 min") or in seasons (e.g. "2 Seasons").

In [71]:
columns_to_drop = ['show_id', 'director', 'listed_in']
df = df.drop(columns=columns_to_drop)

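I assume that 1 season is equal to 30 episodes and 1 episode is 24 minutes, i.e. 30 * 24 = 720 minutes per season.

In [72]:
df['duration_numeric'] = df['duration'].str.extract(r'(\d+)', expand=False).astype(float)

Convert durations to minutes based on their units. Note that the multiplier in the next cell is 30 * 24 * 60 = 43200, which is 60 times larger than the 720 minutes per season assumed above, so the season-based values are not on the same scale as the movie values in minutes. Use 30 * 24 to get minutes.

In [73]:
df.loc[df['duration'].str.contains('Season', na=False),'duration_numeric'] *= 30 * 24 * 60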
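Under the stated assumption of 30 episodes per season and 24 minutes per episode, one season corresponds to 30 * 24 = 720 minutes. The following is a minimal sketch of that conversion on toy data; the constant name `MINUTES_PER_SEASON` and the sample values are illustrative, not taken from the notebook.

```python
import pandas as pd

# Assumption from the text: 30 episodes per season, 24 minutes per episode.
MINUTES_PER_SEASON = 30 * 24

# Toy duration strings, including a missing value.
duration = pd.Series(["90 min", "1 Season", "5 Seasons", None])

# Pull out the leading number; expand=False returns a Series.
number = duration.str.extract(r"(\d+)", expand=False).astype(float)

# Rows measured in seasons; na=False treats missing values as "not a season".
is_season = duration.str.contains("Season", na=False)

# Keep minute values as they are, scale season values to minutes.
minutes = number.where(~is_season, number * MINUTES_PER_SEASON)

print(minutes)
```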

In [74]:
df.loc[df['duration'].str.contains('min').fillna(False),'duration_numeric'] # no conversion needed for minutes

0        90.0
6        91.0
7       125.0
9       104.0
12      127.0
        ...  
8801     96.0
8802    158.0
8804     88.0
8805     88.0
8806    111.0
Name: duration_numeric, Length: 6128, dtype: float64

In [75]:
df['duration_numeric'].isna().sum()

3

drop old duration column

In [76]:
df = df.drop(columns=['duration'])

filter rows with missing values

In [83]:
rows_to_drop = df.loc[df['duration_numeric'].isna()]

drop rows with missing values

In [84]:
df.drop(rows_to_drop.index, inplace=True)
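The filter-then-drop steps above can also be expressed in one call with `dropna` and its `subset` parameter. A short sketch on a toy frame (the data is made up for illustration):

```python
import pandas as pd

# Toy frame with one missing duration.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "duration_numeric": [90.0, None, 720.0],
})

# Drop only the rows where duration_numeric is missing.
cleaned = toy.dropna(subset=["duration_numeric"])

print(cleaned)
```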

In [85]:
df['duration_numeric'].isna().sum()

0

In [90]:
df.loc[df['title'].isin(["Peaky Blinders"])]

Unnamed: 0,type,title,cast,country,date_added,release_year,rating,description,Listed_in_1,Listed_in_2,Listed_in_3,Director_1,Director_Others,duration_numeric
3452,TV Show,Peaky Blinders,"Cillian Murphy, Sam Neill, Helen McCrory, Paul...",United Kingdom,2019-10-04,2019,TV-MA,"A notorious gang in 1919 Birmingham, England, ...",British TV Shows,Crime TV Shows,International TV Shows,,,216000.0


In [91]:
df.loc[df['duration_numeric'].between(0, 30)]

Unnamed: 0,type,title,cast,country,date_added,release_year,rating,description,Listed_in_1,Listed_in_2,Listed_in_3,Director_1,Director_Others,duration_numeric
45,Movie,My Heroes Were Cowboys,,,2021-09-16,2021,PG,Robin Wiltshire's painful childhood was rescue...,Documentaries,,,Tyler Greco,,23.0
71,Movie,A StoryBots Space Adventure,"Evan Spiridellis, Erin Fitzgerald, Jeff Gill, ...",,2021-09-14,2021,TV-Y,Join the StoryBots and the space travelers of ...,Children & Family Movies,,,David A. Vargas,,13.0
694,Movie,Aziza,"Caress Bashar, Abdel Moneim Amayri","Lebanon, Syria",2021-06-17,2019,TV-PG,This short film follows a newly displaced Syri...,Comedies,Dramas,Independent Movies,Soudade Kaadan,,13.0
695,Movie,Besieged Bread,"Lama Hakeim, Gabriel Malki, Ehab Shaaban",,2021-06-17,2015,TV-14,"In battle-ridden Syria, a woman trying to smug...",Dramas,International Movies,,Soudade Kaadan,,12.0
811,Movie,Super Monsters: Once Upon a Rhyme,"Elyse Maloway, Vincent Tong, Andrea Libman, Al...",,2021-06-02,2021,TV-Y,"From Goldilocks to Hansel and Gretel, the Supe...",Children & Family Movies,,,Steve Ball,,25.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7787,Movie,Power Rangers: Megaforce: Raising Spirits,"Andrew M. Gray, John Mark Loudermilk, Ciara Ha...",United States,2016-01-01,2013,TV-Y7,"On Halloween, the scariest night of the year, ...",Children & Family Movies,,,,,24.0
7788,Movie,Power Rangers: Megaforce: The Robo Knight Befo...,"Andrew M. Gray, Ciara Hanna, John Mark Louderm...",United States,2016-01-01,2013,TV-Y7,Robo Knight learns the meaning of Christmas fr...,Children & Family Movies,,,James Barr,,24.0
7848,Movie,Refugee,"Cate Blanchett, Lynsey Addario, Omar Victor Di...",,2017-03-10,2016,TV-PG,Five acclaimed photographers travel the world ...,Documentaries,,,Clementine Malpas,Leslie Knott,24.0
7891,Movie,Room on the Broom,"Simon Pegg, Gillian Anderson, Rob Brydon, Mart...","United Kingdom, Germany",2019-07-01,2012,TV-Y7,A gentle witch with a ginger braid offers ride...,Children & Family Movies,Independent Movies,,Max Lang,Jani Lachauer,26.0
