### Introduction
- This dataset consists of TV shows and movies available on Netflix as of 2019. It was collected from Flixable, a third-party Netflix search engine.
- In 2018, Flixable released a report showing that the number of TV shows on Netflix has nearly tripled since 2010, while the number of movies has decreased by more than 2,000 titles. This analysis explores what other insights can be obtained from the same dataset.

### Data source
- https://www.kaggle.com/shivamb/netflix-shows

### Data description
- This data set consists of 6234 rows and 12 columns.
- 'show_id' - unique ID of each show.
- 'type' - identifier: movie or TV show.
- 'title' - movie or TV show name.
- 'director' - name of the person who directed the TV show/movie.
- 'cast' - actors and actresses involved in the movie/show.
- 'country' - country where the movie/show was released.
- 'date_added' - date the content was added to Netflix.
- 'release_year' - actual release year.
- 'rating' - TV rating of the movie/show.
- 'duration' - duration in minutes (movies) or number of seasons (TV shows).
- 'listed_in' - content category such as comedy, action, etc.
- 'description' - short narration.
- The data is analysed with the Python libraries pandas and numpy, and visualised with the plotly library.

### Tasks performed in this analysis
- Distribution of rating.
- Distribution of movies and TV shows.
- Count the total number of content released per year.
- Count the number of TV shows and movies per year.
- In which month is the most content added?
- Which country released the most content?
- Distribution of categories available on Netflix.
- Top 10 actors on Netflix with most movies.
- Top 10 actors on Netflix with most TV shows.
- Which director directs the most content in India?
- Top 10 directors with number of content.
- Which countries produce stand-up comedy on Netflix?
- Distribution of movie duration.
- Find the number of seasons in TV shows.
- How soon are movies available on Netflix after release?

In [None]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import datetime
import warnings
warnings.filterwarnings('ignore')

In [None]:
data = pd.read_csv('C:\\Users\\piyush bhardwaj\\Documents\\data_analysis\\netflix-data\\netflix_titles.csv')
print('Total number of rows and columns:')
print(data.shape)
print('\n')
print('checking whether show_id contains duplicate values:')
print(data.show_id.duplicated().sum())
print('\n')
print('checking null values in data:')
print(data.isnull().sum(),'\n')
print("check column data type:",'\n',data.dtypes)
data.head(2)

## Distribution of rating.

In [None]:
ratings = data.filter(['show_id','rating']).dropna()
ratings.drop_duplicates(inplace=True)
ratings = pd.DataFrame(ratings.groupby(['rating']).show_id.count())
ratings.reset_index(level=0,inplace=True)
ratings.sort_values(by=['show_id'],ascending=False,inplace=True)
fig = px.bar(x=ratings.rating,y=ratings.show_id,labels={'x':'Ratings','y':'Number of Shows'},title='Distribution of Ratings')
fig.show()

### Observation 
- X-axis represents the rating.
- Y-axis represents the number of shows.
- The bar chart indicates that the majority of content on Netflix is rated TV-MA, which means it is intended for mature, adult audiences and may be unsuitable for children under 17. It is followed by TV-14, a rating for programs that may be unsuitable for children under 14.
- An NC-17 rating means that no one aged 17 or under is admitted to the film. This may be the reason that Netflix has only 2 such titles.

## Distribution of movies and TV shows.

In [None]:
category = data.filter(['show_id','type']).drop_duplicates()
category = pd.DataFrame(category.groupby(['type']).agg(count_type=('type','count')))
category.reset_index(level=0,inplace=True)
print(data.type.count())
fig = px.pie(category, values='count_type', names='type',title='TV-Shows and Movies',color_discrete_sequence=["blue","pink"])
fig.show()

### Observation
- Total number of titles: 6234.
- Number of movies: 4265.
- Number of TV shows: 1969.
- The number of movies on Netflix is more than twice the number of TV shows.
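
The "more than twice" statement can be checked directly from the counts quoted in this observation. The snippet below is a minimal sketch that uses only those figures; the variable names are illustrative and not part of the notebook.

```python
# Counts quoted in the observation above.
n_movies = 4265
n_tv_shows = 1969
total = n_movies + n_tv_shows

# Share of the catalogue that is movies, and movies per TV show.
movie_share = n_movies / total
movie_to_tv_ratio = n_movies / n_tv_shows

print(f"total titles: {total}")
print(f"movie share: {movie_share:.1%}")
print(f"movies per TV show: {movie_to_tv_ratio:.2f}")
```

The total matches the 6234 rows reported in the data description, and the ratio is a little above 2.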

## Count the total number of content released per year

In [None]:
content = data.filter(['show_id','release_year']).drop_duplicates()
content = pd.DataFrame(content.groupby(['release_year']).show_id.count())
content.reset_index(level=0,inplace=True)
content['rank_show_id'] = content.show_id.rank(method='first',ascending=False).astype(np.int32) 
# content = content[content.rank_show_id < 11]
fig = px.bar(x=content.show_id, y=content.release_year,labels={'x':'Number of content','y':'Year'},orientation='h',
             title='Movies and TV shows released per year')
fig.show()

### Observation
- X-axis represents the number of content.
- Y-axis represents the years.
- The amount of released content grows sharply from the year 2008 onwards.
- The count dips from 149 in 2010 to 136 in 2011. Apart from this, the content per year increases until 2018.
- The maximum number of content, 1063, was released in 2018.

## Count the number of tv-shows and movies per year.

In [None]:
year_content = data.filter(['show_id','type','release_year']).drop_duplicates()
year_content = pd.DataFrame(year_content.groupby(['type','release_year']).show_id.count())
year_content.reset_index(level=[0,1],inplace=True)
year_content['rank_show_id'] = year_content.show_id.rank(method='first',ascending=False).astype(np.int32)
# year_content = year_content[year_content.rank_show_id < 11]
fig = px.bar(x=year_content.release_year, y=year_content.show_id, color=year_content['type'], barmode='group',
            labels={'x': 'Year', 'y': 'Number of Movies/Tv-shows'},title='Movies/Tv-shows Released per year')
fig.show()

### Observation
- The bar graph represents the number of movies and TV shows per year.
- X-axis represents the years.
- Y-axis represents the count of movies and TV shows.
- From the year 2000 the amount of content starts to increase. The maximum number of movies was released in 2017 and the maximum number of TV shows in 2019.
- Both TV shows and movies grow sharply over time. TV shows are still increasing after 2017, but the number of movies decreases, perhaps due to increased demand for TV shows or a delay in adding new movies to the Netflix platform.

## In which month is the most content added?

In [None]:
month_content = data.filter(['show_id','date_added'])
print('Checking null values:','\n',month_content.isnull().sum(),'\n')
print('drop null values','\n')
month_content = month_content.dropna()  # actually drop the rows that have no date_added
# strip stray whitespace so every value parses the same way, e.g. 'September 9, 2019'
month_content['new_date_added'] = pd.to_datetime(month_content.date_added.str.strip())
month_content['month'] = pd.DatetimeIndex(month_content.new_date_added).month
month_content = pd.DataFrame(month_content.groupby(['month']).show_id.count())
month_content.reset_index(level=[0],inplace=True)
print('Total rows and columns:',month_content.shape)
fig = px.bar(x=month_content.month, y=month_content.show_id,title='Content added per month',
             labels={'x':'Months','y':'Number of Movies/TV-Shows'})
fig.show()

### Observation
- X-axis represents the months.
- Y-axis represents the number of content.
- The most content was added in December (696), followed by October (646) and November (612).
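
The month ranking depends on turning the `date_added` strings into dates. The snippet below is a minimal sketch of that step on made-up rows, not on the Netflix data. The toy frame, the explicit `format` string and the use of `dt.month_name()` are assumptions for illustration, and `%B` assumes English month names.

```python
import pandas as pd

# Toy rows, made up for illustration; note the leading space and the missing value.
toy = pd.DataFrame({
    'show_id': [1, 2, 3, 4],
    'date_added': ['December 1, 2018', ' October 12, 2017', None, 'December 25, 2019'],
})

# Drop rows without a date, strip stray whitespace, then parse with an explicit format.
toy = toy.dropna(subset=['date_added'])
parsed = pd.to_datetime(toy['date_added'].str.strip(), format='%B %d, %Y')

# Count titles per month name instead of per month number.
per_month = parsed.dt.month_name().value_counts()
print(per_month)
```

Month names make the bar chart easier to read than the numbers 1 to 12, at the cost of having to order the axis explicitly.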

## Which country released the most content?

In [None]:
country_data =  data.filter(['country', 'show_id']).dropna().drop_duplicates()
new_country_data = country_data.set_index('show_id')
new_country_data = new_country_data.country.str.split(',', expand=True)
new_country_data.reset_index(level=[0], inplace=True)
new_country_data = new_country_data.melt(id_vars='show_id').dropna().drop_duplicates()
print('Checking null values:','\n',new_country_data.isnull().sum(),'\n')
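new_country_data.drop(columns=['variable'], inplace=True)
new_country_data.rename(columns={'value':'country_updated'},inplace=True)
# splitting on ',' leaves a leading space on every country after the first;
# strip it so that 'India' and ' India' are counted together, and drop empty entries
new_country_data['country_updated'] = new_country_data.country_updated.str.strip()
new_country_data = new_country_data[new_country_data.country_updated != '']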
print('Total rows and columns:',new_country_data.shape)
content = new_country_data.filter(['show_id','country_updated'])
print('Checking null values:','\n',content.isnull().sum())
content = pd.DataFrame(content.groupby(['country_updated']).show_id.count())
content.reset_index(level=[0],inplace=True)
content['rank_show_id'] = content.show_id.rank(method='first',ascending=False).astype(np.int32)
content = content[content.rank_show_id < 11]
content.sort_values(by=['rank_show_id'],ascending=False,inplace=True)
print('\n','Total rows and columns:',content.shape)
fig = px.bar(x=content.show_id, y=content.country_updated,labels={'x':'Number of Movies/TV-Shows','y':'Country'},
            title='Top 10 countries in content production')
fig.show()

### Observation
- X-axis represents the number of movies and TV shows.
- Y-axis represents the country.
- The bar graph shows the top 10 countries that produce the most content.
- The largest share of content comes from the United States. This might be because Netflix has been very popular in the USA for a long time.
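
A title can list several countries in one comma-separated string, so each title has to be expanded to one row per country before counting. Splitting on `','` alone leaves a leading space on every country after the first, so `' India'` and `'India'` would be grouped separately unless the values are stripped. The snippet below is a minimal sketch on made-up rows that uses `str.split` with `explode` as a compact alternative to the split/melt steps; the toy frame and variable names are illustrative only.

```python
import pandas as pd

# Toy rows, made up for illustration.
toy = pd.DataFrame({
    'show_id': [1, 2, 3],
    'country': ['United States, India', 'India', 'United Kingdom, United States'],
})

# One row per (show_id, country) pair.
countries = toy.assign(country=toy['country'].str.split(',')).explode('country')

# Remove the space left behind after each comma.
countries['country'] = countries['country'].str.strip()

# Number of distinct titles per country.
counts = countries.groupby('country')['show_id'].nunique().sort_values(ascending=False)
print(counts)
```

Counting with `nunique` rather than `count` keeps a title from being counted twice for the same country.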

## Distribution of categories available on netflix.

In [None]:
df = data.filter(['show_id','listed_in'])
new_df = df.set_index('show_id')
new_df = new_df.listed_in.str.split(',', expand=True)
new_df.reset_index(level=[0], inplace=True)
new_df = new_df.melt(id_vars='show_id').dropna().drop_duplicates()
new_df.drop(columns=['variable'], inplace=True)
new_df.rename(columns={'value':'genre'},inplace=True)
print(new_df.shape)
top_category = new_df.filter(['show_id','genre'])
top_category['updated_genre'] = top_category['genre'].str.strip()
print('Checking null values:','\n',top_category.isnull().sum(),'\n')
top_category = pd.DataFrame(top_category.groupby(['updated_genre']).show_id.count())
top_category.reset_index(level=0,inplace=True)
print('Total rows and columns:',top_category.shape)
top_category.head()
fig = px.bar(y=top_category.updated_genre, x=top_category.show_id,labels={'y':'Categories of Movies/TV-Shows','x':'Number of content'},
            title='Categories on Netflix')
fig.show()

### Observation
- Y-axis represents the categories.
- X-axis represents the number of content.
- There are 42 categories on Netflix in total. International Movies is at the top with 1927 titles, followed by Dramas and Comedies.

## Top 10 actors on Netflix with most movies.

In [None]:
df_actor = data.filter(['show_id','type','cast'])
df_actor = df_actor[df_actor.type == 'Movie']
print('checking null values')
print(df_actor.isnull().sum(),'\n')
df_actor.dropna(inplace=True)
print('Drop null values')
print(df_actor.isnull().sum(),'\n')
new_df_actor = df_actor.set_index('show_id')
new_df_actor = new_df_actor.cast.str.split(',', expand=True)
new_df_actor.reset_index(level=[0], inplace=True)
new_df_actor = new_df_actor.melt(id_vars='show_id').dropna().drop_duplicates()
new_df_actor.drop(columns=['variable'], inplace=True)
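new_df_actor.rename(columns={'value':'new_cast'},inplace=True)
# remove the space left after each ',' so one actor is not counted under two spellings
new_df_actor['new_cast'] = new_df_actor.new_cast.str.strip()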
print('Total rows and columns:',new_df_actor.shape)
top_actors = new_df_actor.filter(['show_id','new_cast'])
top_actors = pd.DataFrame(top_actors.groupby(['new_cast']).show_id.count())
top_actors.reset_index(level=0,inplace=True)
top_actors['rank_show_id'] = top_actors.show_id.rank(method='first',ascending=False).astype(np.int32)
top_actors = top_actors[top_actors.rank_show_id < 11]
top_actors.sort_values(by=['rank_show_id'],ascending=False,inplace=True)
print('Total rows and columns:',top_actors.shape)
fig = px.bar(x=top_actors.show_id, y=top_actors.new_cast,labels={'x':'Number of Movies','y':'Actors'},
            title='Top 10 actors with most movies')
fig.show()

### Observation
- X-axis represents the number of movies.
- Y-axis represents the actors' names.
- Anupam Kher, an Indian actor, tops the list with 29 movies, followed by Om Puri and Shah Rukh Khan.
- Asrani and John Cleese have appeared in the same number of movies.

## Top 10 actors on Netflix with most tv shows.

In [None]:
df_tv_actor = data.filter(['show_id','type','cast'])
df_tv_actor = df_tv_actor[df_tv_actor.type == 'TV Show']
print('checking null values:')
print(df_tv_actor.isnull().sum(),'\n')
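# dropna(inplace=True) returns None, so drop first and then print the remaining null counts
df_tv_actor.dropna(inplace=True)
print('Drop null values:')
print(df_tv_actor.isnull().sum(),'\n')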
new_df_tv_actor = df_tv_actor.set_index('show_id')
new_df_tv_actor = new_df_tv_actor.cast.str.split(',', expand=True)
new_df_tv_actor.reset_index(level=[0], inplace=True)
new_df_tv_actor = new_df_tv_actor.melt(id_vars='show_id').dropna().drop_duplicates()
new_df_tv_actor.drop(columns=['variable'], inplace=True)
print('Total rows and columns:',new_df_tv_actor.shape)
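new_df_tv_actor.rename(columns={'value':'new_tv_cast'},inplace=True)
# remove the space left after each ',' so one actor is not counted under two spellings
new_df_tv_actor['new_tv_cast'] = new_df_tv_actor.new_tv_cast.str.strip()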
top_tv_actors = new_df_tv_actor.filter(['show_id','new_tv_cast'])
top_tv_actors = pd.DataFrame(top_tv_actors.groupby(['new_tv_cast']).show_id.count())
top_tv_actors.reset_index(level=0,inplace=True)
top_tv_actors['rank_show_id'] = top_tv_actors.show_id.rank(method='first',ascending=False).astype(np.int32)
top_tv_actors = top_tv_actors[top_tv_actors.rank_show_id < 11]
top_tv_actors.sort_values(by=['rank_show_id'],ascending=False,inplace=True)
print('Total rows and columns:',top_tv_actors.shape)
fig = px.bar(x=top_tv_actors.show_id, y=top_tv_actors.new_tv_cast,labels={'x':'Number of TV-Shows','y':'Actors'},
            title='Top 10 actors with most TV Shows')
fig.show()

### Observation
- The bar graph represents the top 10 actors who have appeared in the most TV shows.
- X-axis represents the number of TV shows.
- Y-axis represents the actors.
- Takahiro Sakurai is at the top with 18 TV shows.
- Yuki Kaji and David Attenborough have appeared in 14 TV shows each.
- Ai Kayano and Daisuke Ono have appeared in the same number of shows.
- Ashleigh Ball, Hiroshi Kamiya, and Mamoru Miyano have appeared in the same number of shows.

## Which director directs the most content in India?

In [None]:
india_director = data.filter(['show_id','director','country'])
print('checking null values')
print(india_director.isnull().sum(),'\n')
print('Drop null values')
india_director.dropna(inplace=True)
india_director = india_director.set_index('show_id')
india_director = india_director.director.str.split(',', expand=True)
india_director.reset_index(level=[0], inplace=True)
india_director = india_director.melt(id_vars='show_id').dropna().drop_duplicates()
print('Total rows and columns:',india_director.shape)
india_director.drop(columns=['variable'], inplace=True)
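india_director.rename(columns={'value':'new_director'},inplace=True)
# remove the space left after each ',' so one director is not counted under two spellings
india_director['new_director'] = india_director.new_director.str.strip()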
india_director = india_director.merge(new_country_data,on='show_id')
india_director = india_director[india_director.country_updated == 'India']
india_director = pd.DataFrame(india_director.groupby(['new_director']).show_id.count())
india_director.reset_index(level=0,inplace=True)
india_director['rank_show_id'] = india_director.show_id.rank(method='first',ascending=False).astype(np.int32)
india_director = india_director[india_director.rank_show_id < 11]
india_director.sort_values(by=['rank_show_id'],ascending=False,inplace=True)
fig = px.bar(x=india_director.show_id, y=india_director.new_director,labels={'x':'Number of content','y':'Directors'},
            title='Top 10 Indian Directors with most content')
fig.show()

### Observation
- X-axis represents the number of content.
- Y-axis represents the directors.
- David Dhawan has directed the most movies/TV shows in India, followed by S.S. Rajamouli.
- Of the top 10, the last 7 directors have directed the same number of titles.

## Top 10 directors with number of content.

In [None]:
directors = data.filter(['show_id','director','country'])
print('checking null values')
print(directors.isnull().sum(),'\n')
print('Drop null values')
directors.dropna(inplace=True)
directors = directors.set_index('show_id')
directors = directors.director.str.split(',', expand=True)
directors.reset_index(level=[0], inplace=True)
directors = directors.melt(id_vars='show_id').dropna().drop_duplicates()
print('Total rows and columns:',directors.shape)
directors.drop(columns=['variable'], inplace=True)
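directors.rename(columns={'value':'new_director'},inplace=True)
# remove the space left after each ',' so one director is not counted under two spellings
directors['new_director'] = directors.new_director.str.strip()
directors = directors.merge(new_country_data,on='show_id')
# the merge yields one row per (title, country), so count distinct titles
# to avoid counting a co-production once per country
directors = pd.DataFrame(directors.groupby(['new_director']).show_id.nunique())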
directors.reset_index(level=0,inplace=True)
directors['rank_show_id'] = directors.show_id.rank(method='first',ascending=False).astype(np.int32)
directors = directors[directors.rank_show_id < 11]
directors.sort_values(by=['rank_show_id'],ascending=False,inplace=True)
fig = px.bar(x=directors.show_id, y=directors.new_director,labels={'x':'Number of content','y':'Directors'},
            title='Top 10 directors with most content')
fig.show()

### Observation
- X-axis represents the number of content.
- Y-axis represents the directors.
- Jan Suter and Raul Campos are at the top with the same number of titles (18).
- Jay Karas and Steven Spielberg have directed the same number of movies/TV shows.
- Jay Chapman, Marcus Raboy, and Matthew Salleh have 12 titles each.

## Which countries produce stand-up comedy on Netflix?

In [None]:
test = new_df.filter(['show_id','genre'])
# genre values after the first ',' keep a leading space, so strip before comparing
test = test[test.genre.str.strip() == 'Stand-Up Comedy']
test = test.merge(new_country_data,on='show_id')
print('Checking null values','\n',test.isnull().sum())
test = pd.DataFrame(test.groupby(['country_updated']).show_id.count())
test.reset_index(level=0,inplace=True)
fig = px.bar(x=test.country_updated, y=test.show_id,labels={'x':'Countries','y':'Number of shows'},
            title='Stand-up comedy per country')
fig.show()

### Observation
- The bar graph represents the stand-up comedy category per country.
- There are 17 countries in total.
- X-axis represents the countries.
- Y-axis represents the number of shows.
- The United States is at the top with 181 shows.
- The United States has more stand-up shows than the remaining 16 countries combined.

## Distribution of Movie Duration.

In [None]:
movies = data.filter(['show_id','type','duration'])
print(movies.isnull().sum(),'\n')
movies = movies[movies['type'] == 'Movie']
movies.replace('min','',regex=True, inplace = True)
movies['time_duration'] = movies.duration.str.strip().astype(np.int32)
print(movies.dtypes)
fig = px.histogram(x=movies.time_duration)
fig.update_xaxes(title_text='Time in minutes')
fig.update_yaxes(title_text='Total number of movies')
fig.show()

## Find the number of seasons in tv shows.

In [None]:
tv = data.filter(['show_id','type','duration'])
tv = tv[tv['type'] == 'TV Show'].copy()
# pull the leading number out of strings such as '1 Season' or '3 Seasons'
tv['updated_duration'] = tv.duration.str.extract(r'(\d+)', expand=False).astype(np.int32)
tv = pd.DataFrame(tv.groupby(['updated_duration']).show_id.count()).astype(np.int32)
tv.reset_index(level=[0],inplace=True)
print(tv.dtypes)
fig = px.bar(x=tv.updated_duration,y=tv.show_id,labels={'x':'Number of Seasons','y':'Number of Shows'},
            title='TV Shows with Seasons')
fig.show()

### Observation
- X-axis represents the season count.
- Y-axis represents the show count.
- Season counts go up to 15.
- A large number of TV shows (1321) have only 1 season.
- 186 TV shows have more than 4 seasons.
- 'Supernatural' is a TV show that has 14 seasons.
- 'Grey's Anatomy' and 'NCIS' both have 15 seasons.
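
The season counts come from `duration` strings such as `'1 Season'` or `'3 Seasons'`. One explicit way to obtain the number is to extract the digits with a capture group, which is sketched below on made-up sample values rather than on the dataset itself.

```python
import pandas as pd

# Made-up sample values in the same shape as the TV show durations.
durations = pd.Series(['1 Season', '2 Seasons', '15 Seasons'])

# Capture the run of digits and convert it to an integer.
seasons = durations.str.extract(r'(\d+)', expand=False).astype('int32')
print(seasons.tolist())
```

Extracting the digits does not depend on how the unit is spelled, so it handles both the singular and the plural form.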

## How soon are movies available on Netflix after release?

In [None]:
available = data.filter(['show_id','type','date_added','release_year'])
print(available.isnull().sum())
print('\n','drop null values')
available.dropna(inplace=True)
available = available[available.type == 'Movie'].copy()
# strip stray whitespace so every value parses the same way, e.g. 'September 9, 2019'
available['new_date_added'] = pd.to_datetime(available.date_added.str.strip())
available['date_added_year'] = pd.DatetimeIndex(available.new_date_added).year
available['difference'] = available.date_added_year - available.release_year
available.drop(columns = ['date_added','new_date_added'],inplace=True)
available = pd.DataFrame(available.groupby(['date_added_year']).agg({'difference': ['mean', 'median']}))
available.reset_index(level=[0], inplace=True)
available.columns = ['year_date_added', 'difference_mean', 'difference_median']
fig = go.Figure()
fig.add_trace(go.Bar(
    x=available.year_date_added,
    y=available.difference_mean,
    name='mean',
    marker_color='indianred'
))
fig.add_trace(go.Bar(
    x=available.year_date_added,
    y=available.difference_median,
    name='median',
    marker_color='green'
))
fig.update_layout(barmode='group', xaxis_tickangle=-45, title='Years between movie release and addition to Netflix')
fig.update_xaxes(title_text="Year added to Netflix")
fig.update_yaxes(title_text="Mean/median difference in years")
fig.show()
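
The chart plots, for each year in which movies were added, the mean and the median of `date_added_year - release_year`. The snippet below is a minimal sketch on made-up rows showing why the two can differ: a single old movie pulls the mean up while the median barely moves. The toy values are illustrative and not taken from the dataset.

```python
import pandas as pd

# Made-up rows: three movies added in 2018 (one of them from 1990), two added in 2019.
toy = pd.DataFrame({
    'date_added_year': [2018, 2018, 2018, 2019, 2019],
    'release_year':    [2017, 2018, 1990, 2019, 2018],
})

# Years between release and addition to the catalogue.
toy['difference'] = toy['date_added_year'] - toy['release_year']

# Mean and median delay per year of addition.
summary = toy.groupby('date_added_year')['difference'].agg(['mean', 'median'])
print(summary)
```

When the two bars for a year are far apart, the median is usually the better indicator of how quickly a typical movie arrives.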