# Netflix Exploratory Data Analysis



## Introduction

This notebook is an exploratory data analysis, in Python, of the "Netflix Movies & TV Shows" dataset. Its purpose is to identify and visualize the main characteristics and trends in the data using descriptive statistics and data visualization.



# Phase 1: Ask


## About the data

The Netflix Movies & TV Shows dataset can be found on [Kaggle](https://www.kaggle.com/shivamb/netflix-shows). It contains metadata for the TV shows and movies available on Netflix and is updated every month. The version used here has 8807 records and 12 columns.

[Netflix](http://en.wikipedia.org/wiki/Netflix) was founded on August 29, 1997, as a mail-based rental business. In January 2007, the company launched a streaming media service, introducing video on demand via the Internet.


## Objective

The purpose of this analysis is to answer the following questions:

* How many movies/TV shows were added to the streaming platform each year?
* Which month has the most added movies/TV shows?
* Which day of the month has the most added movies/TV shows?
* How many movies are there compared with TV shows?
* Which release year has the most movies/TV shows?
* Which is the oldest movie/TV show on the platform?
* Which countries produce the most movies/TV shows?
* Which cast members appear in the most titles?


# Phase 2: Data Preparation


### Importing Libraries

In [None]:
import pandas as pd 
import plotly.express as px

### Loading the dataset with Pandas

In [None]:
netflix = pd.read_csv('../input/netflix-shows/netflix_titles.csv')
netflix

### Column names and types

In [None]:
netflix.info()

### Summary statistics for the numerical columns

In [None]:
netflix.describe()

# Phase 3: Process


### Checking for missing data

In [None]:
missing_data = netflix.isna().sum().sort_values(ascending=False)
missing_data

In [None]:
netflix_isna = pd.isna(netflix['director'])
netflix[netflix_isna]
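
Raw counts are easier to judge next to the share of rows they represent. The following is a minimal sketch of that idea; the `missing_summary` helper and the small `sample` frame are illustrative stand-ins and are not part of the original notebook.

```python
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages per column, largest first."""
    counts = df.isna().sum()
    percent = (counts / len(df) * 100).round(2)
    summary = pd.DataFrame({"missing": counts, "percent": percent})
    return summary.sort_values("missing", ascending=False)

# Hypothetical sample standing in for the real `netflix` frame.
sample = pd.DataFrame({
    "show_id": ["s1", "s2", "s3", "s4"],
    "director": ["A", None, None, "B"],
    "country": ["India", "Japan", None, "Brazil"],
})

summary = missing_summary(sample)
print(summary)
```

On the real data the same helper could be called as `missing_summary(netflix)`.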

### Checking for duplicates

In [None]:
netflix['show_id'].duplicated().any()
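
Checking `show_id` only detects repeated identifiers. The same title could still appear under two different IDs. A possible complementary check, sketched on made-up rows (the choice of `type`, `title`, and `release_year` as the comparison key is an assumption, not something the dataset documents):

```python
import pandas as pd

# Hypothetical rows: unique IDs, but s1 and s2 describe the same movie.
sample = pd.DataFrame({
    "show_id": ["s1", "s2", "s3"],
    "type": ["Movie", "Movie", "TV Show"],
    "title": ["Example", "Example", "Example"],
    "release_year": [2020, 2020, 2020],
})

# True if any identifier is repeated.
id_dupes = bool(sample["show_id"].duplicated().any())

# keep=False marks every member of a duplicate group, not just the later ones.
key = ["type", "title", "release_year"]
content_dupes = sample[sample.duplicated(subset=key, keep=False)]

print(id_dupes)
print(content_dupes)
```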

### Converting the column 'date_added' to 'datetime'


In [None]:
# strip stray whitespace first, then parse dates such as "September 25, 2021"
netflix['date_added'] = pd.to_datetime(netflix['date_added'].str.strip(), format="%B %d, %Y")
netflix
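
With an explicit `format`, a single value that does not match makes `pd.to_datetime` raise an error. If that ever happens, one option is `errors="coerce"`, which turns unparseable values into `NaT` so they can be inspected. A small sketch on made-up strings (the `raw` series is illustrative only):

```python
import pandas as pd

# Hypothetical values: a clean date, one with leading whitespace,
# a missing value, and a string that does not match the format.
raw = pd.Series(["September 25, 2021", " August 4, 2017", None, "not a date"])

parsed = pd.to_datetime(raw.str.strip(), format="%B %d, %Y", errors="coerce")

# Values that were present but could not be parsed.
unparsed = raw[parsed.isna() & raw.notna()]

print(parsed)
print(unparsed)
```

Coercing hides problems silently, so listing `unparsed` afterwards is a reasonable safeguard.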

### Checking the unique values of the column 'type'

There are only two values: Movie and TV Show.

In [None]:
type_data = netflix.type.unique()
type_data


# Phase 4: Exploratory Data Analysis


## About Netflix streaming platform



#### Number of movies/tv-shows added to the streaming platform by Year

In [None]:
# number of titles added per year, sorted so the year with the most additions comes first
netflix_release_year = netflix.date_added.dt.year.astype('Int64').value_counts()
netflix_release_year
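
The yearly totals can also be split by type in a single table. A minimal sketch using `pd.crosstab` on a made-up frame (the `sample` data and the added `Total` column are illustrative):

```python
import pandas as pd

# Hypothetical sample with the same two columns as the real dataset.
sample = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "Movie", "TV Show"],
    "date_added": pd.to_datetime(
        ["2019-01-01", "2019-03-15", "2019-11-01", "2020-07-01", None]
    ),
})

# Drop rows without a date so the year stays an integer.
dated = sample.dropna(subset=["date_added"])

# Rows are years, columns are types, cells are title counts.
added_by_year = pd.crosstab(dated["date_added"].dt.year, dated["type"])
added_by_year["Total"] = added_by_year.sum(axis=1)

print(added_by_year)
```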

#### The month with the most added movies/tv-shows

In [None]:
netflix_release_month = netflix.date_added.dt.month.astype('Int64').value_counts()
netflix_release_month
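
Month numbers can be harder to read than names. As a sketch, `dt.month_name()` gives the same counts labelled by month; the `dates` series here is made up for illustration.

```python
import pandas as pd

# Hypothetical dates, including one missing value.
dates = pd.Series(pd.to_datetime(["2020-01-01", "2020-07-01", "2021-07-15", None]))

# Missing dates become NaN and are left out of the counts by default.
month_counts = dates.dt.month_name().value_counts()

print(month_counts)
```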

#### Day with the most added movies/tv-shows


In [None]:
netflix_release_day = netflix.date_added.dt.day.astype('Int64').value_counts()
netflix_release_day

#### Number of Movies vs. TV-Shows

In [None]:
netflix_type = netflix.type.value_counts()
netflix_type
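
The same call can report shares instead of counts by passing `normalize=True`. A short sketch on a made-up series:

```python
import pandas as pd

# Hypothetical 'type' column with three movies and one TV show.
types = pd.Series(["Movie", "Movie", "Movie", "TV Show"], name="type")

# Fractions of the total, converted to percentages.
shares = types.value_counts(normalize=True).mul(100).round(1)

print(shares)
```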

## About the Movies/TV-Shows


#### The release year with the most movies/tv-shows

In [None]:
movietv_release_year = netflix.release_year.value_counts()
movietv_release_year

#### The oldest movie/tv-show on streaming

In [None]:
# select the earliest release year instead of hard-coding it
netflix[netflix['release_year'] == netflix['release_year'].min()]

#### Top 30 Countries producing the most movies/tv-shows

In [None]:
# country unique
#netflix['country'].unique()

In [None]:
# new dataset for the country count
country_count = netflix.copy()
country_count = pd.concat([country_count, netflix['country'].str.split(",", expand=True)], axis=1)
country_count = country_count.melt(id_vars=["type","title"], value_vars=range(12), value_name="Country")
country_count = country_count[country_count["Country"].notna()]
country_count["Country"] = country_count["Country"].str.strip()
country_count

In [None]:
# countries unique
country_count.Country.unique()

In [None]:
# countries with the largest number of titles on the platform
country_count.Country.value_counts()
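
The split/melt approach above relies on `range(12)` matching the largest number of countries listed in one row. A possible alternative that avoids that assumption is `Series.explode`. The sketch below uses a made-up frame, and the `count_listed_values` helper is hypothetical; the same idea would apply to the `cast` column.

```python
import pandas as pd

def count_listed_values(df: pd.DataFrame, column: str) -> pd.Series:
    """Count individual entries in a comma-separated text column."""
    values = (
        df[column]
        .dropna()          # skip rows with no value
        .str.split(",")    # one list of entries per row
        .explode()         # one row per entry
        .str.strip()       # remove spaces left around the commas
    )
    values = values[values != ""]  # ignore empty entries from stray commas
    return values.value_counts()

# Hypothetical sample standing in for the real dataset.
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "country": ["United States, India", "India", None, "United States, India, Japan"],
})

country_totals = count_listed_values(sample, "country")
print(country_totals)
```

Unlike the melted frame, this returns only the counts, so it suits the tables here more than the stacked charts in Phase 5, which also need `type`.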

#### Top 30 Cast members with the most content

In [None]:
# new dataset for the cast count
cast_count = netflix.copy()
cast_count = pd.concat([cast_count, netflix['cast'].str.split(",", expand=True)], axis=1)
cast_count = cast_count.melt(id_vars=["type","title"], value_vars=range(44), value_name="Cast_name")
cast_count = cast_count[cast_count["Cast_name"].notna()]
cast_count["Cast_name"] = cast_count["Cast_name"].str.strip()
cast_count

In [None]:
cast_count.Cast_name.value_counts()[:30]

# Phase 5: Visualization

### Movies vs. TV-Shows

In [None]:
px.histogram(netflix, x= 'type', color= 'type',
             title="Movies vs. TV-Shows",
             color_discrete_sequence= px.colors.sequential.Sunsetdark)

### Number of Movies/TV-Shows added by Year

In [None]:
px.histogram(netflix, x=netflix['date_added'].dt.year, color=netflix['type'],
             title="Movies/TV Shows added by Year",
             color_discrete_sequence=px.colors.sequential.Sunsetdark,
             labels=dict(x="Year", color="Type"))

### Number of Movies/TV-Shows added by Month

#### Dropping records with missing 'date_added'

Ten records have no value in the column 'date_added'. They are dropped before plotting by month and by day.

In [None]:
#counting the number of 'NA' on the column 'date_added'
netflix['date_added'].isna().sum()

In [None]:
#dropping 'NA'
netflix = netflix.dropna(subset=['date_added'])

In [None]:
px.histogram(netflix, x= netflix['date_added'].dt.month, color= netflix['type'],
             color_discrete_sequence= px.colors.sequential.Sunsetdark,
             title="Movies/TV Shows added by Month",
             labels=dict(x="Month")) 

### Number of Movies/TV-Shows added by Day

In [None]:
px.histogram(netflix, x= netflix['date_added'].dt.day,color= netflix['type'],
             color_discrete_sequence= px.colors.sequential.Sunsetdark,            
             title="Movies/TV Shows added by Day",
             labels=dict(x="Day")) 

## About the content

### Number of Movies/TV-Shows by year of release

In [None]:
px.histogram(netflix, x= 'release_year', color= 'type',
             title="Number of Movies/TV-Shows by year of release",
             color_discrete_sequence= px.colors.sequential.Sunsetdark,  
             labels={'release_year':'Year of release'}                     
                   )

### Top 30 Countries with the most streaming content

In [None]:
px.histogram(country_count, x='Country', color='type',
        title="Top 30 Countries with the most streaming content",
        color_discrete_sequence=px.colors.sequential.Sunsetdark).update_xaxes(
        # category positions start at 0, so this window shows the first 30 bars in full
        categoryorder="total descending", range=(-0.5, 29.5))

### Top 30 Cast members with the most titles

In [None]:
px.histogram(cast_count, x='Cast_name', color='type',
    title="Top 30 Cast members with the most streaming content",
    color_discrete_sequence=px.colors.sequential.Sunsetdark).update_xaxes(
    # category positions start at 0, so this window shows the first 30 bars in full
    categoryorder="total descending", range=(-0.5, 29.5))




# Findings


### About Netflix

* There are more movies than TV shows on the platform: 6131 movies and 2676 TV shows.
* 2019 is the year with the most additions to the platform, with 2016 titles added, followed by 2020 with 1879 and 2018 with 1649.
* July and December are the months with the most additions, with 827 and 813 titles added respectively.
* Netflix adds more content on the first day of the month than on any other day.

### About the content available

* Of the titles available, 1147 were originally released in 2018, followed by 2017 with 1032 and 2019 with 1030.
* Pioneers: First Women Filmmakers is the oldest title available on the platform. It is a collection of restored films dating from 1925.
* The United States produces the most content, with 3690 titles, followed by India with 1046 titles and the United Kingdom with 806 titles.
* Anupam Kher is the cast member with the highest number of titles, 43. [Anupam Kher](http://en.wikipedia.org/wiki/Anupam_Kher) is an Indian actor, director, and producer who has appeared in over 500 films.



