- Recently I got a lot of feedback from friends who have just changed, or are about to change, their careers towards Data Analysis, Data Science and Machine Learning. They pointed out the lack of material between the very beginning of the analysis journey and the advanced techniques.

- They are looking for Exploratory Data Analysis examples that are detailed but still beginner friendly and not overly complicated (no heavy use of different regression or normalization techniques, etc.), which show them how to start and, most importantly, how to read the descriptive statistics and graphs.

- After getting this feedback, I decided to make a series of EDAs on different datasets, without making things too complicated for people taking their first steps in the DS/ML journey.

### This notebook is part of the 9 Beginner Friendly EDAs. If these EDAs are helpful to anyone, I will be more than happy.



### **INTRO**

- In this study, we are going to do Exploratory Data Analysis (EDA) on the Netflix original films dataset.
- The study aims to be beginner friendly and to explain each step along the way as much as possible.
- The dataset has 584 Netflix original films from different genres.
- Each film has a language, premiere date, runtime and IMDB score.

- First, let's import the required libraries.
- We will use Plotly's interactive environment for visualization.

In [1]:
import pandas as pd
import numpy as np


import plotly 
import plotly.express as px
import plotly.graph_objs as go
import plotly.offline as py
from plotly.offline import iplot
from plotly.subplots import make_subplots
import plotly.figure_factory as ff

### Overview Stage

- Read the csv
- Look for basic information about the dataset

In [2]:
df= pd.read_csv('../input/netflix-original-films-imdb-scores/NetflixOriginals.csv')

In [3]:
df.head()

Unnamed: 0,Title,Genre,Premiere,Runtime,IMDB Score,Language
0,Enter the Anime,Documentary,"August 5, 2019",58,2.5,English/Japanese
1,Dark Forces,Thriller,"August 21, 2020",81,2.6,Spanish
2,The App,Science fiction/Drama,"December 26, 2019",79,2.6,Italian
3,The Open House,Horror thriller,"January 19, 2018",94,3.2,English
4,Kaali Khuhi,Mystery,"October 30, 2020",90,3.4,Hindi


In [4]:
df.shape

(584, 6)

- We have 584 films and 6 attributes

In [5]:
df.isnull().sum()

Title         0
Genre         0
Premiere      0
Runtime       0
IMDB Score    0
Language      0
dtype: int64

- Yeah, it is very hard to find data this clean in real life.
- No missing values. Hurray!

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 584 entries, 0 to 583
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Title       584 non-null    object 
 1   Genre       584 non-null    object 
 2   Premiere    584 non-null    object 
 3   Runtime     584 non-null    int64  
 4   IMDB Score  584 non-null    float64
 5   Language    584 non-null    object 
dtypes: float64(1), int64(1), object(4)
memory usage: 27.5+ KB


- We need to adjust the Premiere feature: it is stored as text (object), but it should be a datetime object.
- Other than that, everything seems OK.

In [7]:
df['date'] = pd.to_datetime(df['Premiere'])
df['date']

0     2019-08-05
1     2020-08-21
2     2019-12-26
3     2018-01-19
4     2020-10-30
         ...    
579   2018-12-31
580   2015-10-09
581   2018-12-16
582   2020-12-08
583   2020-10-04
Name: date, Length: 584, dtype: datetime64[ns]

- OK, it is much better.
- Let's make use of it and create new columns from it, such as year, month and day of the week.

In [8]:
df['year_month']= df['date'].dt.strftime('%Y-%m')
df['year'] = df['date'].dt.year
df['month']= df['date'].dt.month
df['day_of_week']=df['date'].dt.dayofweek

- Now we are ready to move on to the analysis part.
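
- As a quick check of what the `.dt` accessor returns, here is a minimal sketch on three of the premiere dates shown in the `df.head()` output. The `sample` frame is made up for illustration. Note that `dayofweek` counts from Monday = 0, which matters when the days get labels later.

```python
import pandas as pd

# Three premiere dates copied from the df.head() output (illustrative sample only).
sample = pd.DataFrame(
    {"Premiere": ["August 5, 2019", "August 21, 2020", "December 26, 2019"]}
)
sample["date"] = pd.to_datetime(sample["Premiere"])

sample["year"] = sample["date"].dt.year
sample["month"] = sample["date"].dt.month
sample["day_of_week"] = sample["date"].dt.dayofweek  # Monday = 0 ... Sunday = 6
sample["day_name"] = sample["date"].dt.day_name()    # readable version of the same info

print(sample[["date", "year", "month", "day_of_week", "day_name"]])
```

- Printing `day_name` next to `day_of_week` is an easy way to confirm which number stands for which day.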

### Analysis Part

#### **Genre**

In [9]:
df['Genre'].nunique()

115

In [10]:
df['Genre'].value_counts(normalize=True)

Documentary                 0.272260
Drama                       0.131849
Comedy                      0.083904
Romantic comedy             0.066781
Thriller                    0.056507
                              ...   
Anime / Short               0.001712
Romantic thriller           0.001712
Science fiction thriller    0.001712
Urban fantasy               0.001712
Action thriller             0.001712
Name: Genre, Length: 115, dtype: float64

- We have 115 different genres.
- Let's look at the 20 most common genres.

In [11]:
genre = df['Genre'].value_counts()[:20]
genre

Documentary                 159
Drama                        77
Comedy                       49
Romantic comedy              39
Thriller                     33
Comedy-drama                 14
Crime drama                  11
Horror                        9
Biopic                        9
Action                        7
Aftershow / Interview         6
Concert Film                  6
Romance                       6
Action comedy                 5
Animation                     5
Romantic drama                5
Science fiction/Thriller      4
Animation / Short             4
Science fiction               4
Variety show                  4
Name: Genre, dtype: int64

- 27.2% of the movies are in the Documentary genre, followed by Drama with about 13%.
- The top five genres cover 357 of the 584 movies (about 61%). The rest are spread over many small genres, each accounting for less than 3% of the movies.
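
- With this many small genres, one option is to lump the rare ones into a single "Other" group before plotting. The sketch below shows the idea on made-up data; the `min_count` threshold is a hypothetical choice, not something used in this notebook.

```python
import pandas as pd

# Toy genre column (made-up counts, not the real dataset).
genres = pd.Series(
    ["Documentary"] * 5 + ["Drama"] * 3 + ["Comedy"] * 2 + ["Heist film", "Urban fantasy"]
)

counts = genres.value_counts()

# Genres with fewer than `min_count` films are treated as rare.
min_count = 2  # hypothetical threshold
rare = counts[counts < min_count].index

# Keep the original label unless the genre is rare, then replace it with "Other".
grouped = genres.where(~genres.isin(rare), "Other")

print(grouped.value_counts())
```

- The same pattern would work on `df['Genre']`, but the right threshold depends on what you want to show.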

In [12]:
fig = px.bar(genre, x= genre.index, y=genre.values, labels={'y':'Number of Movies from the Genre', 'index':'Genres'})
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

#### Languages

In [13]:
df['Language'].nunique()

38

In [14]:
top_10_languages_used= df['Language'].value_counts()[:10]
top_10_languages_used

English       401
Hindi          33
Spanish        31
French         20
Italian        14
Portuguese     12
Indonesian      9
Korean          6
Japanese        6
German          5
Name: Language, dtype: int64

In [15]:
fig = px.bar(top_10_languages_used, x= top_10_languages_used.index, y=top_10_languages_used.values, labels={'y':'Count', 'index':'Language'})
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

- English is the most used language in the programs; Hindi and Spanish follow it.

#### **Runtime**

In [16]:
df['Runtime'].describe()

count    584.000000
mean      93.577055
std       27.761683
min        4.000000
25%       86.000000
50%       97.000000
75%      108.000000
max      209.000000
Name: Runtime, dtype: float64

- The typical runtime of a Netflix program is around 93-97 minutes (mean 93.6, median 97).
- Based on the descriptive statistics, we can expect outliers on both the maximum side (209 minutes) and the minimum side (4 minutes).
- Since the mean is lower than the median, we can expect a left-skewed distribution: the tail stretches towards the minimum side, where a few very short programs pull the mean down.
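
- The "mean below median means left skew" rule of thumb can be checked on a small example. The runtimes below are made up (most near 95 minutes, a few very short ones), and `Series.skew()` is used as a numeric check: a negative value points to a longer tail on the left.

```python
import pandas as pd

# Made-up runtimes: most are near 95 minutes, a few very short films sit far to the left.
runtimes = pd.Series([4, 15, 30, 85, 90, 95, 97, 100, 105, 110, 120])

mean = runtimes.mean()
median = runtimes.median()
skew = runtimes.skew()  # negative = longer tail on the left (minimum) side

print(f"mean={mean:.1f}, median={median:.1f}, skew={skew:.2f}")
```

- On the real data, `df['Runtime'].skew()` would give the same kind of check.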

In [17]:
fig = px.histogram(df, x= 'Runtime', title='Runtime of the Programs in Netflix')

fig.show()

In [18]:
fig = px.box(df, x='Runtime', hover_data=['Title', 'Genre'])
fig.update_traces(quartilemethod="inclusive")
fig.show()

- As expected, we have a left-skewed distribution with multiple outliers on both sides, but many more on the left (minimum) side.

- The movie with the maximum runtime is 'The Irishman'. Yeah, agreed, it was quite a long movie. But no complaints: I loved seeing Al Pacino and Robert De Niro in the same movie.

- The minimum runtime belongs to the 4-minute animation 'Sol Levante'.
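
- To see why values like 209 and 4 minutes show up as outliers in the box plot, here is a minimal sketch of the common 1.5 * IQR rule, using the quartiles from the `describe()` output. It is only an approximation of the plot: Plotly computes its own quartiles with the `inclusive` method, so its fences may differ slightly. `is_outlier` is a helper name made up for this example.

```python
# Quartiles copied from the df['Runtime'].describe() output.
q1, q3 = 86.0, 108.0

iqr = q3 - q1                  # interquartile range
lower_fence = q1 - 1.5 * iqr   # values below this are flagged as outliers
upper_fence = q3 + 1.5 * iqr   # values above this are flagged as outliers


def is_outlier(value):
    """Return True if the value falls outside the 1.5 * IQR fences."""
    return value < lower_fence or value > upper_fence


print(lower_fence, upper_fence)
print(is_outlier(209), is_outlier(4), is_outlier(97))
```

- Under this rule, both the longest and the shortest program fall outside the fences, while the median runtime does not.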

#### IMDB Score

In [19]:
df['IMDB Score'].describe()

count    584.000000
mean       6.271747
std        0.979256
min        2.500000
25%        5.700000
50%        6.350000
75%        7.000000
max        9.000000
Name: IMDB Score, dtype: float64

- Before going further, I have to admit that I am a regular follower of the IMDB website. Most of the time, I agree with their rating scores.

- Netflix programs got an average rating of around 6.3. The maximum is 9.0 and the minimum is 2.5.

- The mean (6.27) and median (6.35) are close to each other. Since the median is slightly bigger than the mean, we can expect a mildly left-skewed distribution with several outliers on the left (minimum) side.

In [20]:
fig = px.histogram(df, x= 'IMDB Score', title='IMDB Score of the Programs in Netflix')

fig.show()

In [21]:
fig = px.box(df, x='IMDB Score', hover_data=['Title', 'Genre'])
fig.update_traces(quartilemethod="inclusive")
fig.show()

- **David Attenborough** and his documentaries: I love him. He is a true hero and an excellent documentary producer and presenter. It is no surprise to me that his documentary got the maximum score of 9 in the list.

- The minimum rating belongs to 'Enter the Anime'.

- Interestingly, both the maximum and the minimum rated programs are from the Documentary genre.

#### Correlation Between Runtime and IMDB Ratings

In [22]:
df[['IMDB Score','Runtime']].corr()

Unnamed: 0,IMDB Score,Runtime
IMDB Score,1.0,-0.040896
Runtime,-0.040896,1.0


In [23]:
fig = px.scatter(df, x='IMDB Score', y='Runtime')
fig.show()

- With a correlation coefficient of about -0.04, there is essentially no linear relationship between runtime and IMDB score.
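
- For readers wondering what `.corr()` computes: by default it is the Pearson correlation coefficient. The sketch below calculates it by hand on made-up numbers and compares the result with pandas. The `toy` frame is an assumption for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up scores and runtimes, only to show the calculation.
toy = pd.DataFrame({
    "IMDB Score": [5.5, 6.0, 6.5, 7.0, 7.5, 8.0],
    "Runtime": [100, 90, 120, 95, 110, 85],
})

# Pearson r: sum of products of deviations from the mean,
# divided by the square root of the product of the summed squared deviations.
x = toy["IMDB Score"] - toy["IMDB Score"].mean()
y = toy["Runtime"] - toy["Runtime"].mean()
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

r_pandas = toy["IMDB Score"].corr(toy["Runtime"])

print(round(r_manual, 4), round(r_pandas, 4))
```

- Values close to 0 mean almost no linear relationship; values close to -1 or 1 mean a strong one.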

#### **Year**

In [24]:
Year = df['year'].value_counts()
Year

2020    183
2019    125
2018     99
2021     71
2017     66
2016     30
2015      9
2014      1
Name: year, dtype: int64

In [25]:
fig = px.bar(Year, x= Year.index, y=Year.values, labels={'y':'Count of Movies in Each Year', 'index':'Year'})
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

- As one might expect, the number of Netflix programs increases every year up to 2020.
- Since we don't have full data for 2021, the drop from 2020 to 2021 is expected.

#### **Month**

In [26]:
# sort_index() guarantees calendar order (1..12), so the counts line up with the month labels
Month = df['month'].value_counts().sort_index()
Month

1     37
2     39
3     48
4     63
5     53
6     35
7     34
8     37
9     53
10    77
11    57
12    51
Name: month, dtype: int64

In [27]:
months = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']

fig = px.bar(Month, x= months, y=Month.values, labels={'y':'Count of Movies in Each Month', 'x':'Month'})
fig.show()

- The number of releases differs by month. October and April have the highest number of releases.

- The summer months (Jun-Aug) have the fewest releases.
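
- Passing a separate list of month names as `x` only works if the counts are in calendar order and no month is missing. A minimal sketch of a safer pattern, on made-up month numbers, is to force the order with `reindex` and then attach the names to the index.

```python
import pandas as pd

months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# Toy month numbers (made up); several months never appear.
month_numbers = pd.Series([10, 4, 10, 1, 12, 4, 10])

# Force calendar order 1..12 and keep months with no releases as 0.
counts = month_numbers.value_counts().reindex(range(1, 13), fill_value=0)

# Now it is safe to swap the numbers for names, position by position.
counts.index = months

print(counts)
```

- The resulting series can be plotted directly, because labels and counts travel together.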

#### **Day**

In [28]:
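# sort_index() guarantees the order 0 (Mon) .. 6 (Sun), so the counts line up with the day labels
days = df['day_of_week'].value_counts().sort_index()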
days

0     17
1     29
2     82
3     59
4    383
5      5
6      9
Name: day_of_week, dtype: int64

In [29]:
day = ['Mon','Tue','Wed','Thu','Fri','Sat','Sun']

fig = px.bar(days, x= day, y=days.values, labels={'y':'Count of Movies in Each Day', 'x':'Day'})
fig.show()

- Friday has by far the most new releases: 383 of the 584 (about 66%).

- Saturday and Sunday have the lowest number of releases.

#### **Top 10 Ratings by Genre**

In [30]:
top_10_ratings_by_genre = df.groupby('Genre')['IMDB Score'].mean().sort_values(ascending=False)[:10]
top_10_ratings_by_genre

Genre
Animation/Christmas/Comedy/Adventure    8.200000
Musical / Short                         7.700000
Concert Film                            7.633333
Anthology/Dark comedy                   7.600000
Animation / Science Fiction             7.500000
Making-of                               7.450000
Action-adventure                        7.300000
Historical drama                        7.200000
Drama-Comedy                            7.200000
Coming-of-age comedy-drama              7.200000
Name: IMDB Score, dtype: float64

#### Top 10 Rating Genres

In [31]:
fig = px.bar(top_10_ratings_by_genre, x= top_10_ratings_by_genre.index, y=top_10_ratings_by_genre.values, labels={'y':'Average Rating Score', 'x':'Genre'})
fig.show()

- The top average rating belongs to the Animation/Christmas/Comedy/Adventure genre, followed by Musical / Short and Concert Film.
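
- One thing to keep in mind when reading genre averages: with 115 genres, many of them contain only one or two films, so a single film can decide the "average". A hedged sketch of how to check this is to compute the film count next to the mean. The data and the `min_films` threshold below are made up for illustration.

```python
import pandas as pd

# Toy data (made up): one genre has a single, highly rated film.
toy = pd.DataFrame({
    "Genre": ["Documentary", "Documentary", "Documentary",
              "Drama", "Drama", "Animation/Christmas"],
    "IMDB Score": [7.0, 6.5, 8.0, 6.0, 6.4, 8.2],
})

# Mean score and number of films per genre, side by side.
summary = (
    toy.groupby("Genre")["IMDB Score"]
    .agg(mean_score="mean", films="count")
    .sort_values("mean_score", ascending=False)
)

# Keep only genres with enough films for the mean to say something.
min_films = 2  # hypothetical threshold
reliable = summary[summary["films"] >= min_films]

print(summary)
print(reliable)
```

- In the toy data, the single-film genre tops the unfiltered ranking but disappears once the threshold is applied.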

#### **Lowest 10 Ratings by Genre**

In [32]:
bottom_10_ratings_by_genre = df.groupby('Genre')['IMDB Score'].mean().sort_values()[:10]
bottom_10_ratings_by_genre

Genre
Heist film/Thriller        3.700000
Musical/Western/Fantasy    3.900000
Horror anthology           4.300000
Political thriller         4.300000
Superhero-Comedy           4.400000
Science fiction/Drama      4.533333
Romance drama              4.600000
Mystery                    4.650000
Horror thriller            4.700000
Anime / Short              4.700000
Name: IMDB Score, dtype: float64

#### Bottom 10 Ratings

In [33]:
fig = px.bar(bottom_10_ratings_by_genre, x= bottom_10_ratings_by_genre.index, y=bottom_10_ratings_by_genre.values, labels={'y':'Average Rating Score', 'x':'Genre'})
fig.show()

- The lowest average ratings belong to the Heist film/Thriller and Musical/Western/Fantasy genres, followed by Horror anthology and Political thriller (both 4.3).

#### **Top 20 High Rating Movies** 

In [34]:
top_20 = df[['IMDB Score','Title','Genre','year','Language']].sort_values(['IMDB Score'], ascending=False)[:20]
top_20

Unnamed: 0,IMDB Score,Title,Genre,year,Language
583,9.0,David Attenborough: A Life on Our Planet,Documentary,2020,English
582,8.6,Emicida: AmarElo - It's All For Yesterday,Documentary,2020,Portuguese
581,8.5,Springsteen on Broadway,One-man show,2018,English
580,8.4,Winter on Fire: Ukraine's Fight for Freedom,Documentary,2015,English/Ukranian/Russian
579,8.4,Taylor Swift: Reputation Stadium Tour,Concert Film,2018,English
578,8.4,Ben Platt: Live from Radio City Music Hall,Concert Film,2020,English
577,8.3,Dancing with the Birds,Documentary,2019,English
576,8.3,Cuba and the Cameraman,Documentary,2017,English
573,8.2,Klaus,Animation/Christmas/Comedy/Adventure,2019,English
571,8.2,13th,Documentary,2016,English


In [35]:
fig = px.scatter(top_20, y='Title', x='IMDB Score',
                 hover_data=['Genre', 'year', 'Language'], color='Genre',
                 title="Top 20 High Rated Programs")
fig.show()

- 16 out of the 20 top-rated movies come from the Documentary genre.
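
- A statement like this can be checked with `value_counts()` on the genre column of the selected rows. The sketch below uses only the ten rows visible in the truncated output (scores and genres copied from it), so it illustrates the method and does not reproduce the full top 20.

```python
import pandas as pd

# The ten rows visible in the top_20 output (genres and scores only).
shown = pd.DataFrame({
    "IMDB Score": [9.0, 8.6, 8.5, 8.4, 8.4, 8.4, 8.3, 8.3, 8.2, 8.2],
    "Genre": [
        "Documentary", "Documentary", "One-man show", "Documentary",
        "Concert Film", "Concert Film", "Documentary", "Documentary",
        "Animation/Christmas/Comedy/Adventure", "Documentary",
    ],
})

# Count how many of the selected rows fall into each genre.
genre_counts = shown["Genre"].value_counts()

print(genre_counts)
```

- On the notebook's own data, the equivalent call would be `top_20['Genre'].value_counts()`.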

#### **20 Lowest Rated Movies** 

In [36]:
bottom_20 = df[['IMDB Score','Title','Genre','year','Language']].sort_values(['IMDB Score'])[:20]
bottom_20

Unnamed: 0,IMDB Score,Title,Genre,year,Language
0,2.5,Enter the Anime,Documentary,2019,English/Japanese
1,2.6,Dark Forces,Thriller,2020,Spanish
2,2.6,The App,Science fiction/Drama,2019,Italian
3,3.2,The Open House,Horror thriller,2018,English
4,3.4,Kaali Khuhi,Mystery,2020,Hindi
5,3.5,Drive,Action,2019,Hindi
6,3.7,Leyla Everlasting,Comedy,2020,Turkish
7,3.7,The Last Days of American Crime,Heist film/Thriller,2020,English
8,3.9,Paradox,Musical/Western/Fantasy,2018,English
9,4.1,Sardar Ka Grandson,Comedy,2021,Hindi


In [37]:
fig = px.scatter(bottom_20, y='Title', x='IMDB Score',
                 hover_data=['Genre', 'year', 'Language'], color='Genre',
                 title="20 Lowest Rated Programs")
fig.show()

- The lowest-rated movies come from many different genres.

## This notebook is a part of the 9 Beginner Friendly EDAs
## If you like this one, you can also check out other notebooks in the Beginner Friendly EDAs series!

* [Data Analyst Jobs - EDA](https://www.kaggle.com/kaanboke/plotly-data-analyst-jobs)
* [Top Games on Google Play Store](https://www.kaggle.com/kaanboke/plotly-beginner-friendly-top-games)
* [Hollywood Top Movies- EDA](https://www.kaggle.com/kaanboke/plotly-beginner-friendly-top-movies)
* [UDEMY Courses EDA](https://www.kaggle.com/kaanboke/plotly-beginner-friendly-udemy)
* [World Happiness Report - EDA](https://www.kaggle.com/kaanboke/plotly-beginner-friendly-eda)
* [Countries Life Expectancy](https://www.kaggle.com/kaanboke/plotly-beginner-friendly)
* [Amazon Top 50 Bestselling Books EDA](https://www.kaggle.com/kaanboke/plotly-beginner-friendly-amazon)
* [London Bike Sharing - EDA](https://www.kaggle.com/kaanboke/plotly-beginner-friendly-london-bike)


- Thanks to the dataset contributor for this data. I really enjoyed working on it.

- It was quite a pleasure to share this detailed, beginner friendly EDA with you. Thanks for your time.

- All the best 