# Data Cleaning on the Netflix dataset

In this notebook we clean the Netflix dataset obtained from Kaggle:
https://www.kaggle.com/shivamb/netflix-shows

-------------------------------------------------------------------------------------------------------------------------------------------------------------

## 0. Import basic libraries

In [1]:
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd 

print("Libraries imported!!")

Libraries imported!!


----------------------------------------------------------------------------------------
## 1. Load and read the dataset

Here, we read the dataset and inspect its shape and column names.

In [2]:
df = pd.read_csv('netflix_titles.csv')
df.head()

,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,TV Show,3%,,"João Miguel, Bianca Comparato, Michel Gomes, R...",Brazil,"August 14, 2020",2020,TV-MA,4 Seasons,"International TV Shows, TV Dramas, TV Sci-Fi &...",In a future where the elite inhabit an island ...
1,s2,Movie,7:19,Jorge Michel Grau,"Demián Bichir, Héctor Bonilla, Oscar Serrano, ...",Mexico,"December 23, 2016",2016,TV-MA,93 min,"Dramas, International Movies",After a devastating earthquake hits Mexico Cit...
2,s3,Movie,23:59,Gilbert Chan,"Tedd Chan, Stella Chung, Henley Hii, Lawrence ...",Singapore,"December 20, 2018",2011,R,78 min,"Horror Movies, International Movies","When an army recruit is found dead, his fellow..."
3,s4,Movie,9,Shane Acker,"Elijah Wood, John C. Reilly, Jennifer Connelly...",United States,"November 16, 2017",2009,PG-13,80 min,"Action & Adventure, Independent Movies, Sci-Fi...","In a postapocalyptic world, rag-doll robots hi..."
4,s5,Movie,21,Robert Luketic,"Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar...",United States,"January 1, 2020",2008,PG-13,123 min,Dramas,A brilliant group of students become card-coun...


In [3]:
print('There are', df.shape[0], 'rows and', df.shape[1], 'columns in the dataset.')

There are 7787 rows and 12 columns in the dataset.


In [4]:
print('The columns of the dataset are the following: ')
for col in df.columns.tolist() :
    print('-', col)

The columns of the dataset are the following: 
- show_id
- type
- title
- director
- cast
- country
- date_added
- release_year
- rating
- duration
- listed_in
- description


----------------------------------------------------------------------------------------------
## 2. Find and handle missing values

In this step, we find and handle missing values in the dataset.

First, we identify missing values.

In [5]:
missing_data = df.isnull()
missing_data.head(5)

,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,False,False,False,True,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,False


Next, we count how many missing values each column has.

In [6]:
# For each column, count the missing (True) and present (False) values
for column in missing_data.columns:
    print(column)
    print(missing_data[column].value_counts())
    print("#################################")

show_id
False    7787
Name: show_id, dtype: int64
#################################
type
False    7787
Name: type, dtype: int64
#################################
title
False    7787
Name: title, dtype: int64
#################################
director
False    5398
True     2389
Name: director, dtype: int64
#################################
cast
False    7069
True      718
Name: cast, dtype: int64
#################################
country
False    7280
True      507
Name: country, dtype: int64
#################################
date_added
False    7777
True       10
Name: date_added, dtype: int64
#################################
release_year
False    7787
Name: release_year, dtype: int64
#################################
rating
False    7780
True        7
Name: rating, dtype: int64
#################################
duration
False    7787
Name: duration, dtype: int64
#################################
listed_in
False    7787
Name: listed_in, dtype: int64
#################################


Based on the summary above, every column has 7787 rows, and five columns contain missing values:

<ol>
    <li>"director": 2389 missing values</li>
    <li>"cast": 718 missing values</li>
    <li>"country": 507 missing values</li>
    <li>"date_added": 10 missing values</li>
    <li>"rating": 7 missing values</li>
</ol>
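
The per-column counts listed above can also be produced in a single expression. The following is a minimal sketch, assuming a small toy DataFrame (`toy`, with made-up values) in place of the real dataset; it is an alternative to the loop in the notebook, not the method the notebook uses.

```python
import pandas as pd

# Toy frame standing in for the Netflix data (values are illustrative only).
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": [None, "X", None, "Y"],
    "rating": ["PG", None, "R", "R"],
})

# Number of missing values per column, largest first.
missing_counts = toy.isnull().sum().sort_values(ascending=False)

# Keep only the columns that actually contain missing values.
missing_counts = missing_counts[missing_counts > 0]
print(missing_counts)
```

Applied to `df`, the same expression would give one line per affected column instead of a `value_counts()` block for every column.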

The director column is missing in 2389 of the 7787 rows (about 31%), so we drop the column entirely. We then drop every remaining row that has at least one missing value.

In [7]:
# Drop the director column
df.drop(['director'], axis=1, inplace=True)
# Drop rows with at least one NaN value
df = df.dropna(axis=0, how='any')
# Print the new shape of the dataset
print('There are', df.shape[0], 'rows and', df.shape[1], 'columns in the dataset after handling missing values.')

There are 6643 rows and 11 columns in the dataset after handling missing values.
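
The decision to drop a column because of its share of missing values can be made explicit with a threshold. The sketch below assumes a toy DataFrame and a hypothetical cut-off, `MAX_MISSING_FRACTION = 0.30`; the notebook itself does not define a threshold, so the value is only an illustration.

```python
import pandas as pd

# Toy frame standing in for the Netflix data (values are illustrative only).
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": [None, "X", None, None],
    "country": ["US", None, "MX", "BR"],
})

# Hypothetical cut-off: drop columns where more than 30% of values are missing.
MAX_MISSING_FRACTION = 0.30

# isnull().mean() gives the fraction of missing values in each column.
missing_fraction = toy.isnull().mean()
cols_to_drop = missing_fraction[missing_fraction > MAX_MISSING_FRACTION].index.tolist()

# Drop the sparse columns first, then any row that still has a missing value.
cleaned = toy.drop(columns=cols_to_drop).dropna(axis=0, how="any")
print(cols_to_drop)
print(cleaned.shape)
```

Dropping the sparse column before the rows matters: if the rows were dropped first, every row with a missing director would be lost as well.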


----------------------------------------------------------------------------------------------
## 3. Correct data format

In this step we check that every column has the correct data type.

First, we check the data types of each column.

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6643 entries, 0 to 7785
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       6643 non-null   object
 1   type          6643 non-null   object
 2   title         6643 non-null   object
 3   cast          6643 non-null   object
 4   country       6643 non-null   object
 5   date_added    6643 non-null   object
 6   release_year  6643 non-null   int64 
 7   rating        6643 non-null   object
 8   duration      6643 non-null   object
 9   listed_in     6643 non-null   object
 10  description   6643 non-null   object
dtypes: int64(1), object(10)
memory usage: 622.8+ KB


The date_added column has dtype object, so we convert it to a datetime type and then split it into separate day, month, and year columns.

In [10]:
# Convert the column to datetime
df['date_added'] = pd.to_datetime(df['date_added'])
# Extract the day, month name, and year
df['day_added'] = df['date_added'].dt.day
df['month_added'] = df['date_added'].dt.month_name()
df['year_added'] = df['date_added'].dt.year
# Drop the original date_added column
df.drop('date_added',axis=1,inplace=True)
df.head(5)

,show_id,type,title,cast,country,release_year,rating,duration,listed_in,description,day_added,month_added,year_added
0,s1,TV Show,3%,"João Miguel, Bianca Comparato, Michel Gomes, R...",Brazil,2020,TV-MA,4 Seasons,"International TV Shows, TV Dramas, TV Sci-Fi &...",In a future where the elite inhabit an island ...,14,August,2020
1,s2,Movie,7:19,"Demián Bichir, Héctor Bonilla, Oscar Serrano, ...",Mexico,2016,TV-MA,93 min,"Dramas, International Movies",After a devastating earthquake hits Mexico Cit...,23,December,2016
2,s3,Movie,23:59,"Tedd Chan, Stella Chung, Henley Hii, Lawrence ...",Singapore,2011,R,78 min,"Horror Movies, International Movies","When an army recruit is found dead, his fellow...",20,December,2018
3,s4,Movie,9,"Elijah Wood, John C. Reilly, Jennifer Connelly...",United States,2009,PG-13,80 min,"Action & Adventure, Independent Movies, Sci-Fi...","In a postapocalyptic world, rag-doll robots hi...",16,November,2017
4,s5,Movie,21,"Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar...",United States,2008,PG-13,123 min,Dramas,A brilliant group of students become card-coun...,1,January,2020
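
If some date strings carried stray whitespace or could not be parsed, the conversion could be made more defensive. The following is a sketch on toy data, not the call used in the notebook: the explicit `format`, `errors="coerce"`, and the `.str.strip()` step are assumptions added for illustration.

```python
import pandas as pd

# Toy values in the same "Month day, year" style as date_added,
# plus one entry with leading whitespace and one unparseable entry.
toy = pd.DataFrame(
    {"date_added": ["August 14, 2020", " December 23, 2016", "not a date"]}
)

# Strip surrounding whitespace, parse with an explicit format,
# and turn anything unparseable into NaT instead of raising.
parsed = pd.to_datetime(
    toy["date_added"].str.strip(),
    format="%B %d, %Y",
    errors="coerce",
)

# Count the entries that failed to parse before deriving new columns.
n_failed = parsed.isna().sum()
print(n_failed)
print(parsed.dt.month_name().tolist())
```

Rows that end up as `NaT` would produce missing values in the derived day, month, and year columns, so they may need to be dropped or inspected before saving.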


----------------------------------------------------------------------------------------------
## 4. Save the cleaned dataset

In this final step of the notebook we save the cleaned dataset for further analysis.

In [12]:
df.to_csv('netflix_data_cleaned.csv', index=False)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6643 entries, 0 to 7785
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       6643 non-null   object
 1   type          6643 non-null   object
 2   title         6643 non-null   object
 3   cast          6643 non-null   object
 4   country       6643 non-null   object
 5   release_year  6643 non-null   int64 
 6   rating        6643 non-null   object
 7   duration      6643 non-null   object
 8   listed_in     6643 non-null   object
 9   description   6643 non-null   object
 10  day_added     6643 non-null   int64 
 11  month_added   6643 non-null   object
 12  year_added    6643 non-null   int64 
dtypes: int64(3), object(10)
memory usage: 726.6+ KB
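
A quick way to check what `index=False` does to the saved file is a round trip through CSV. This sketch writes a toy DataFrame to an in-memory buffer rather than to `netflix_data_cleaned.csv`; the buffer and the toy values are assumptions for illustration.

```python
import io

import pandas as pd

# Toy frame with gaps in its index, like the one left behind by dropna().
toy = pd.DataFrame(
    {"show_id": ["s1", "s3"], "year_added": [2020, 2018]},
    index=[0, 2],
)

# Write without the index, then read the result back.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# The old row labels are not stored, so the reloaded frame gets a fresh 0..n-1 index.
print(reloaded.index.tolist())
print(reloaded.dtypes)
```

The reloaded frame keeps the same columns and values, but the original row labels (0 to 7785 in the notebook's `df.info()` output) are not preserved in the file.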
