> # Netflix Films EDA

### _Data features_
🔘Release_Date: Date when the movie was released.

🔘Title: Name of the movie.

🔘Overview: Brief summary of the movie.

🔘Popularity: A key metric computed by TMDB from, among other signals, the number of views per day, votes per day, the number of users who marked the movie as a "favorite" or added it to their "watchlist", and the release date.

🔘Vote_Count: Total votes received from the viewers.

🔘Vote_Average: Average rating, out of 10, computed from the votes received.

🔘Original_Language: Original language of the movie. A dubbed version does not count as the original language.

🔘Genre: Categories the movie can be classified under.

🔘Poster_Url: URL of the movie poster.

> # Wrangling and Exploring

In [1]:
# import libraries
import pandas as pd 
import numpy as np 
import warnings 
import plotly.express as px 
pd.set_option('display.max_columns', None) 
warnings.filterwarnings('ignore')

In [2]:
df = pd.read_csv("mymoviedb.csv", lineterminator='\n')  # lineterminator='\n' is needed; without it, reading this file raises an error

In [3]:
df.head()

Unnamed: 0,Release_Date,Title,Overview,Popularity,Vote_Count,Vote_Average,Original_Language,Genre,Poster_Url
0,2021-12-15,Spider-Man: No Way Home,Peter Parker is unmasked and no longer able to...,5083.954,8940,8.3,en,"Action, Adventure, Science Fiction",https://image.tmdb.org/t/p/original/1g0dhYtq4i...
1,2022-03-01,The Batman,"In his second year of fighting crime, Batman u...",3827.658,1151,8.1,en,"Crime, Mystery, Thriller",https://image.tmdb.org/t/p/original/74xTEgt7R3...
2,2022-02-25,No Exit,Stranded at a rest stop in the mountains durin...,2618.087,122,6.3,en,Thriller,https://image.tmdb.org/t/p/original/vDHsLnOWKl...
3,2021-11-24,Encanto,"The tale of an extraordinary family, the Madri...",2402.201,5076,7.7,en,"Animation, Comedy, Family, Fantasy",https://image.tmdb.org/t/p/original/4j0PNHkMr5...
4,2021-12-22,The King's Man,As a collection of history's worst tyrants and...,1895.511,1793,7.0,en,"Action, Adventure, Thriller, War",https://image.tmdb.org/t/p/original/aq4Pwv5Xeu...


In [4]:
df.tail()

Unnamed: 0,Release_Date,Title,Overview,Popularity,Vote_Count,Vote_Average,Original_Language,Genre,Poster_Url
9822,1973-10-15,Badlands,A dramatization of the Starkweather-Fugate kil...,13.357,896,7.6,en,"Drama, Crime",https://image.tmdb.org/t/p/original/z81rBzHNgi...
9823,2020-10-01,Violent Delights,A female vampire falls in love with a man she ...,13.356,8,3.5,es,Horror,https://image.tmdb.org/t/p/original/4b6HY7rud6...
9824,2016-05-06,The Offering,When young and successful reporter Jamie finds...,13.355,94,5.0,en,"Mystery, Thriller, Horror",https://image.tmdb.org/t/p/original/h4uMM1wOhz...
9825,2021-03-31,The United States vs. Billie Holiday,Billie Holiday spent much of her career being ...,13.354,152,6.7,en,"Music, Drama, History",https://image.tmdb.org/t/p/original/vEzkxuE2sJ...
9826,1984-09-23,Threads,Documentary style account of a nuclear holocau...,13.354,186,7.8,en,"War, Drama, Science Fiction",https://image.tmdb.org/t/p/original/lBhU4U9Eeh...


In [5]:
df.describe().T.style.bar(subset=['mean'], color='#205ff2').background_gradient(subset=['std'], cmap='Reds').background_gradient(subset=['50%'], cmap='coolwarm')

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Popularity,9827.0,40.326088,108.873998,13.354,16.1285,21.199,35.1915,5083.954
Vote_Count,9827.0,1392.805536,2611.206907,0.0,146.0,444.0,1376.0,31077.0
Vote_Average,9827.0,6.439534,1.129759,0.0,5.9,6.5,7.1,10.0


🔘 ___Vote count looks skewed: its mean (about 1393) is far above its median (444). Let's check the distribution of the vote count.___
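
A quick way to confirm the skew without a plot is to compare the mean, the median, and an upper quantile, and to compute the sample skewness. The sketch below is only illustrative: it uses a small made-up series in place of `df["Vote_Count"]`, and the helper name `skew_summary` is hypothetical.

```python
import pandas as pd


def skew_summary(s: pd.Series) -> pd.Series:
    """Return a few statistics that reveal a long right tail."""
    return pd.Series({
        "mean": s.mean(),
        "median": s.median(),
        "p90": s.quantile(0.90),
        "max": s.max(),
        "skew": s.skew(),  # a positive value indicates a long right tail
    })


# Toy values standing in for df["Vote_Count"] (made up for illustration)
votes = pd.Series([0, 10, 50, 150, 400, 450, 900, 1400, 5000, 31000])
summary = skew_summary(votes)
print(summary)
```

In the notebook, the same helper could be called as `skew_summary(df["Vote_Count"])`.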

In [6]:
df.describe(include='object').T

Unnamed: 0,count,unique,top,freq
Release_Date,9827,5893,2022-03-10,16
Title,9827,9513,Beauty and the Beast,4
Overview,9827,9822,Wilbur the pig is scared of the end of the sea...,2
Original_Language,9827,43,en,7570
Genre,9827,2337,Drama,466
Poster_Url,9827,9827,https://image.tmdb.org/t/p/original/1g0dhYtq4i...,1


In [7]:
df.isna().sum()

Release_Date         0
Title                0
Overview             0
Popularity           0
Vote_Count           0
Vote_Average         0
Original_Language    0
Genre                0
Poster_Url           0
dtype: int64

🔘 ___No null values in the dataset.___

In [8]:
df.dtypes

Release_Date          object
Title                 object
Overview              object
Popularity           float64
Vote_Count             int64
Vote_Average         float64
Original_Language     object
Genre                 object
Poster_Url            object
dtype: object

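🔘 ___All data types are correct, except Release_Date, which is stored as a string (object) and is converted to datetime in the cleaning step.___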

In [9]:
duplicates = df.duplicated(subset= ["Title","Release_Date"])

🔘 ___Some films share a title but were released in different years, so they are not duplicates.___
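
The mask returned by `df.duplicated` only becomes informative once it is summed or used to filter rows. A minimal sketch on a made-up frame (the titles and dates below are placeholders, not rows from the dataset):

```python
import pandas as pd

# Toy rows standing in for df (placeholder values)
films = pd.DataFrame({
    "Title": ["Film A", "Film A", "Film B", "Film B"],
    "Release_Date": ["1990-01-01", "2010-01-01", "2000-05-05", "2000-05-05"],
})

# True for every row that repeats an earlier (Title, Release_Date) pair
dup_mask = films.duplicated(subset=["Title", "Release_Date"])
n_duplicates = int(dup_mask.sum())

# keep=False marks every member of a duplicated group, which helps inspection
dup_rows = films[films.duplicated(subset=["Title", "Release_Date"], keep=False)]

print(n_duplicates)
print(dup_rows)
```

Here "Film A" appears twice with different dates and is not flagged, while the two identical "Film B" rows are.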

> ### Cleaning 

In [10]:
# dropping unwanted columns
df.drop(columns=["Overview","Poster_Url"],inplace=True)

#### _"Release_Date" column_

In [11]:
df['Release_Date'] = pd.to_datetime(df['Release_Date'])  # converting to datetime format 

In [12]:
df['decade'] = (df['Release_Date'].dt.year // 10) * 10 # creating decade column 

In [13]:
df["decade"] = df["decade"].astype("category")

In [14]:
# make year column 
df['year'] = df['Release_Date'].dt.year 
# make month column
df['Month'] = df['Release_Date'].dt.strftime('%B') 
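
As a small worked example of the date features above, the sketch below applies the same floor-division trick to four release dates that appear in the outputs of this notebook. It uses `dt.month_name()` instead of `strftime('%B')`; that is a swapped-in alternative which returns English month names by default, whereas `strftime` follows the system locale.

```python
import pandas as pd

dates = pd.to_datetime(
    pd.Series(["1946-10-29", "1991-10-22", "2017-03-16", "2021-12-15"])
)

years = dates.dt.year
decades = (years // 10) * 10    # 1946 -> 194 -> 1940
months = dates.dt.month_name()  # English month names by default

print(pd.DataFrame({"year": years, "decade": decades, "Month": months}))
```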

#### _"Title" column_

In [15]:
df["Title"].value_counts()

Beauty and the Beast                      4
Alice in Wonderland                       4
The Little Mermaid                        3
The Call                                  3
Halloween                                 3
                                         ..
There's Something About Mary              1
Amores Perros                             1
The Human Centipede 3 (Final Sequence)    1
Newness                                   1
Threads                                   1
Name: Title, Length: 9513, dtype: int64

In [16]:
df.loc[df["Title"] == "Beauty and the Beast"]

Unnamed: 0,Release_Date,Title,Popularity,Vote_Count,Vote_Average,Original_Language,Genre,decade,year,Month
467,1991-10-22,Beauty and the Beast,109.649,8280,7.7,en,"Romance, Family, Animation, Fantasy",1990,1991,October
506,2017-03-16,Beauty and the Beast,104.094,13992,7.0,en,"Family, Fantasy, Romance",2010,2017,March
2996,2014-02-12,Beauty and the Beast,30.17,1763,6.1,fr,"Fantasy, Romance",2010,2014,February
5761,1946-10-29,Beauty and the Beast,19.024,449,7.5,fr,"Drama, Fantasy, Romance",1940,1946,October


🔘 __Again, these films share a title but differ in release year (and production company), so they are not duplicates.__

In [17]:
df["Title"].isna().sum() # checking for null values 

0

#### _"Original_Language" column_

In [18]:
# changing abbreviations to full form
# NOTE: codes and names are paired by position, so `values` must follow the exact order
# of value_counts() (most frequent first); ties between rare languages can break the pairing
keys = list(df["Original_Language"].value_counts().index)  # language codes, most frequent first
values = ['English', 'Japanese', 'Spanish', 'French', 'Korean', 'Chinese', 'Italian', 'Mandarin (China)', 'Russian', 'German', 'Portuguese', 'Danish', 'Norwegian', 'Hindi', 'Swedish', 'Dutch', 'Polish', 'Thai', 'Indonesian', 'Turkish', 'Tagalog', 'Telugu', 'Greek', 'Finnish', 'Serbian', 'Czech', 'Persian', 'Hungarian', 'Icelandic', 'Romanian', 'Ukrainian', 'Tamil', 'Arabic', 'Hebrew', 'Catalan', 'Latin', 'Norwegian Bokmål', 'Bengali', 'Malay', 'Latvian', 'Basque', 'Malayalam', 'Estonian']
replace_dict = dict(zip(keys, values))  # mapping: code -> full name
df["Original_Language"] = df["Original_Language"].replace(replace_dict)  # replacing the values
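
Pairing codes and names by position is fragile. A safer pattern, sketched below under the assumption that the column holds ISO 639-1 style codes, is an explicit lookup table. The dictionary `LANGUAGE_NAMES` is hypothetical and deliberately partial; it would need to be extended to all 43 codes found in the data.

```python
import pandas as pd

# Partial, hypothetical lookup table: ISO 639-1 code -> language name
LANGUAGE_NAMES = {
    "en": "English",
    "ja": "Japanese",
    "es": "Spanish",
    "fr": "French",
    "ko": "Korean",
    "it": "Italian",
    "ru": "Russian",
    "de": "German",
}

codes = pd.Series(["en", "fr", "ja", "xx"])  # "xx" is a made-up unknown code

# map() looks each code up; unknown codes become NaN, so fall back to the raw code
names = codes.map(LANGUAGE_NAMES).fillna(codes)
print(names.tolist())
```

With this approach the result does not depend on how `value_counts()` orders languages that have the same frequency.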

In [19]:
df["Original_Language"].value_counts()

English             7570
Japanese             645
Spanish              339
French               292
Korean               170
Chinese              129
Italian              123
Mandarin (China)     109
Russian               83
German                82
Portuguese            37
Danish                28
Norwegian             26
Hindi                 26
Swedish               23
Dutch                 21
Polish                17
Thai                  17
Indonesian            15
Turkish               15
Tagalog                8
Telugu                 6
Greek                  5
Finnish                5
Serbian                5
Czech                  4
Persian                3
Hungarian              3
Icelandic              2
Romanian               2
Ukrainian              2
Tamil                  2
Arabic                 2
Hebrew                 2
Catalan                1
Latin                  1
Norwegian Bokmål       1
Bengali                1
Malay                  1
Latvian                1


#### _"Genre" column_

In [20]:
df["Genre"] = df["Genre"].str.split(",").str[0]  # keep only the first listed genre of each film
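
Keeping only the first genre discards the remaining labels of multi-genre films. If every label should be counted instead, one option (not what this notebook does) is to split and `explode` the column. A minimal sketch using three genre strings taken from the `df.head()` output:

```python
import pandas as pd

genres = pd.Series([
    "Action, Adventure, Science Fiction",
    "Crime, Mystery, Thriller",
    "Thriller",
])

# Approach used in this notebook: first listed genre only
first_genre = genres.str.split(",").str[0]

# Alternative: one row per (film, genre) pair, then count every label
all_genres = genres.str.split(", ").explode()
genre_counts = all_genres.value_counts()

print(first_genre.tolist())
print(genre_counts)
```

In this toy sample "Thriller" is counted twice by the second approach but only once by the first.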


In [21]:
df["Genre"].unique() # checking the unique values 
df["Genre"] = df["Genre"].astype("category") # converting to category type 

In [22]:
df["Genre"].value_counts()

Drama              1791
Action             1570
Comedy             1484
Horror              868
Animation           805
Adventure           586
Thriller            515
Crime               391
Family              350
Romance             304
Science Fiction     296
Fantasy             254
Documentary         184
Mystery             102
War                  89
Music                82
Western              73
History              45
TV Movie             38
Name: Genre, dtype: int64

In [23]:
df["Vote_score"] = df["Vote_Count"] * df["Vote_Average"]  # new column: product of vote count and vote average

# Vote_score is more representative of how well a movie is rated, as it takes into account the number of votes as well as the average vote

In [24]:
# rescale vote score to a percentage of the maximum vote score (0 to 100)
df["Vote_score"] = df["Vote_score"] / df["Vote_score"].max() * 100
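
To make the arithmetic concrete, here is the same two-step computation on the first three rows shown by `df.head()`. Note that the maximum here is taken over these three rows only, so the percentages differ from those in the full dataset.

```python
import pandas as pd

sample = pd.DataFrame({
    "Title": ["Spider-Man: No Way Home", "The Batman", "No Exit"],
    "Vote_Count": [8940, 1151, 122],
    "Vote_Average": [8.3, 8.1, 6.3],
})

# Step 1: raw score = number of votes times average vote
sample["Vote_score"] = sample["Vote_Count"] * sample["Vote_Average"]

# Step 2: rescale so that the best film in the sample scores 100
sample["Vote_score"] = sample["Vote_score"] / sample["Vote_score"].max() * 100

print(sample[["Title", "Vote_score"]].round(2))
```

Because vote averages only span 0 to 10 while vote counts span several orders of magnitude, this score is driven mostly by the number of votes.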

> # _EDA_ 📊

>>### ___Which year has the most films released?___

In [25]:
release_year_df = df["year"].value_counts().rename_axis("year").reset_index(name="count")  # number of movies released in each year (works across pandas versions)

In [26]:

fig = px.bar(release_year_df, x='year', y='count', color='count', width=950, height=500)
fig.update_layout(
    title='Number of movies released per year',
    xaxis_title="Year",
    yaxis_title="Number of Movies",
    font=dict(
        size=14
    )
)
fig.layout.template = 'plotly_dark'
fig.show()

⭐ ___2021 is the year with the most releases, with 714 films.___

>>### ___Which decade has the most films released?___

In [27]:
release_decade_df = df["decade"].value_counts().rename_axis("decade").reset_index(name="count")  # number of movies per decade


fig = px.bar(release_decade_df, x='decade', y='count', color='count', width=950, height=500, text_auto=True)
fig.update_layout(
    title='Number of movies released per decade',
    xaxis_title="Decade",
    yaxis_title="Number of Movies",
    font=dict(
        size=14
    )
    ,
)
fig.layout.template = 'plotly_dark'
fig.show()

⭐ ___The 2010s are the decade with the most releases, with __3999__ films.___

⭐ ___Very few films in the dataset were released between 1920 and 1960.___

>>### ___Which month has the most films released across all years?___

In [28]:
release_Month_df = df["Month"].value_counts().rename_axis("Month").reset_index(name="count")  # number of movies per calendar month


fig = px.bar(release_Month_df, x='Month', y='count', color='count', width=950, height=500, text_auto=True)
fig.update_layout(
    title='Number of movies released in each month, all years combined',
    xaxis_title="Month",
    yaxis_title="Number of Movies",
    legend_title="Frequency",
    font=dict(
        size=14
    )
    ,
)
fig.layout.template =  'plotly_dark'
fig.show()

⭐ ___October, September and December have the most films released over time.___

⭐ ___April, May and June have the fewest films released over time.___

>> ### ___What are the 30 most popular films?___

In [29]:
popular = df.sort_values(by='Popularity', ascending=False).head(30).reset_index(drop=True)  # top 30 rows by popularity

fig = px.scatter(popular, x='Title', y='Popularity', color='Popularity', size='Popularity', width=1000, height=950)

fig.update_layout(
    title='Top 30 Popular Movies',
    xaxis_title="Movie Names",
    yaxis_title="Movie Popularity",
    font=dict(
        size=15
    )
)
fig.layout.template = 'plotly_dark'
fig.show()

⭐ ___Spider-Man: No Way Home is the most popular movie, and The Batman is in second place.___


>> ### ___Which genre has the most films released?___

In [30]:
release_Genere_df = df["Genre"].value_counts().rename_axis("Genre").reset_index(name="count")  # number of movies per genre


fig = px.bar(release_Genere_df, x='Genre', y='count', color='count', width=950, height=500, text_auto=True)
fig.update_layout(
    title='Number of movies released per genre',
    xaxis_title="Genre",
    yaxis_title="Number of Movies",
    legend_title="Frequency",
    font=dict(
        size=14
    )
    
)
fig.layout.template =  'plotly_dark'
fig.show()

In [31]:
Genres = df["Genre"].value_counts().rename_axis("Genre").reset_index(name="count")  # number of movies per genre
fig = px.pie(Genres, values='count', names='Genre', width=800, height=500)

fig.update_layout(
    title='Distribution of Genres',
    legend_title="Genres",
    font=dict(
        size=14
    )
)
fig.layout.template = 'plotly_dark'

fig.show()

⭐ ___Drama is the genre with the most films released.___


>> ### ___Which language has the most films released?___

In [32]:
lang = df["Original_Language"].value_counts().rename_axis("Original_Language").reset_index(name="count")  # number of movies per language
fig = px.pie(lang, values='count', names='Original_Language', width=800, height=500)
fig.update_layout(
    title='Distribution of Languages',
    legend_title="Languages",
    font=dict(
        size=14
    )
)
fig.layout.template = 'plotly_dark'
fig.update_traces(textposition='inside',textfont_size=50)
fig.show()

⭐ ___English is the language with the most films released, accounting for about 77% of the dataset (7570 of 9827 films).___
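
The share can be checked directly from the counts printed earlier (7570 English films out of 9827), or computed for every language at once with `value_counts(normalize=True)`. The series below is a toy stand-in for `df["Original_Language"]`.

```python
import pandas as pd

# Share of English films, from the counts shown in the outputs above
english_share = round(7570 / 9827 * 100, 1)

# Same idea for all languages at once, on made-up data
langs = pd.Series(["English"] * 7 + ["Japanese"] * 2 + ["French"])
share = langs.value_counts(normalize=True).mul(100).round(1)

print(english_share)
print(share)
```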


>> ### ___What are the 10 movies with the highest vote score?___

In [33]:
vote_score = df.sort_values(by='Vote_score', ascending=False).head(10).reset_index(drop=True)  # top 10 rows by vote score

fig = px.scatter(vote_score, x='Title', y='Vote_score', color='Vote_score', size='Vote_score', width=1000, height=950)

fig.update_layout(
    title='Top 10 liked movies',
    xaxis_title="Movie Names",
    yaxis_title="Movie Vote Score",
    font=dict(
        size=15
    )
)
fig.layout.template = 'plotly_dark'
fig.show()

⭐ ___Inception has the highest vote score, and Interstellar is in second place.___


>>> ### ___What if we combine vote score and popularity?___ 🤔

In [34]:
# sort by vote score first; popularity only breaks ties between equal vote scores
vote_score_pop = df.sort_values(by=['Vote_score', 'Popularity'], ascending=False).head(10).reset_index(drop=True)

fig = px.scatter(vote_score_pop, x='Title', y='Vote_score', color='Popularity', size='Popularity', width=1000, height=950)

fig.update_layout(
    title='Top 10 liked movies, colored and sized by popularity',
    xaxis_title="Movie Names",
    yaxis_title="Movie Vote Score",
    font=dict(
        size=15
    )
)
fig.layout.template = 'plotly_dark'
fig.show()

⭐ ___Among the movies with the highest vote scores, Avengers: Infinity War is the most popular, and Avatar is in second place.___
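
Sorting by two columns is lexicographic: the second column matters only when the first one ties. The toy frame below (placeholder titles and values) illustrates why the ranking in this cell is still driven by `Vote_score`, with `Popularity` showing up only in the color and size of the markers.

```python
import pandas as pd

toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Vote_score": [100.0, 90.0, 90.0, 50.0],
    "Popularity": [20.0, 30.0, 80.0, 500.0],
})

ranked = toy.sort_values(by=["Vote_score", "Popularity"], ascending=False)

# "C" is placed ahead of "B" because they tie on Vote_score and "C" is more popular;
# "D" stays last despite having by far the highest popularity
print(ranked["Title"].tolist())
```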


>>> ### ___Finally, what is the distribution of each genre in each language?___

In [38]:
fig = px.sunburst(df, path=['Original_Language','Genre'], width=850, height=750,)
                
fig.layout.template = 'plotly_dark'
fig.show()