### Table of Contents

* [Data](#section1)
* [Data Description](#section2)
* [Data Cleaning](#section3)
    * [Filling Nulls](#section_2_1)
    * [Deletions](#section_2_2)
    * [Reformatting](#section_2_3)
    * [Added Columns](#section_2_4)
* [Data Visualization](#section4)
    * [Plot 1](#section_4_1)
    * [Plot 2](#section_4_2)
    * [Plot 3](#section_4_3)
* [Highlights](#section5)


### Data <a class="anchor" id="section1"></a>

In [362]:
pip install squarify

Collecting squarify
  Downloading squarify-0.4.3-py3-none-any.whl (4.3 kB)
Installing collected packages: squarify
Successfully installed squarify-0.4.3
Note: you may need to restart the kernel to use updated packages.



In [3]:
pip install plotly




In [4]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
import squarify 
import plotly.express as px
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)
from plotly.subplots import make_subplots
import plotly.graph_objects as go


In [5]:
pd.set_option('display.max_rows', None)

In [6]:
df = pd.read_csv('netflix_titles.csv')
df.head(2)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,TV Show,3%,,"João Miguel, Bianca Comparato, Michel Gomes, R...",Brazil,"August 14, 2020",2020,TV-MA,4 Seasons,"International TV Shows, TV Dramas, TV Sci-Fi &...",In a future where the elite inhabit an island ...
1,s2,Movie,7:19,Jorge Michel Grau,"Demián Bichir, Héctor Bonilla, Oscar Serrano, ...",Mexico,"December 23, 2016",2016,TV-MA,93 min,"Dramas, International Movies",After a devastating earthquake hits Mexico Cit...


### Data Description<a class="anchor" id="section2"></a>
TV Shows and Movies listed on Netflix
This dataset consists of TV shows and movies available on Netflix. It was collected from Flixable, a third-party Netflix search engine. The source description dates the listing to 2019, although this copy includes titles added in 2020.

In [7]:
print(df.isnull().sum())
print("Number of Duplicates",df.duplicated().sum())
df.dtypes

show_id            0
type               0
title              0
director        2389
cast             718
country          507
date_added        10
release_year       0
rating             7
duration           0
listed_in          0
description        0
dtype: int64
Number of Duplicates 0


show_id         object
type            object
title           object
director        object
cast            object
country         object
date_added      object
release_year     int64
rating          object
duration        object
listed_in       object
description     object
dtype: object

### Data Cleaning <a class="anchor" id="section3"></a>
Despite the nulls, I was reluctant to delete the affected rows, because a quick search by title could recover the missing information. Instead, I'll make a separate DataFrame for the rows with missing values if they turn out to be relevant to my findings.
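
A minimal sketch of that idea, assuming a small stand-in frame rather than `netflix_titles.csv`. The `sample` data and the names `missing_rows` and `filled` are illustrative and not part of the original notebook:

```python
import pandas as pd

# Hypothetical stand-in for the Netflix data; the values are illustrative only.
sample = pd.DataFrame({
    "title": ["A", "B", "C"],
    "director": [None, "Jane Doe", "John Roe"],
    "country": ["Brazil", None, "Mexico"],
})

# Keep every row with at least one null in a separate frame,
# so the gaps can be looked up by title later.
missing_rows = sample[sample.isnull().any(axis=1)].copy()

# Fill the nulls in the working frame with a placeholder instead of dropping rows.
filled = sample.fillna("Missing")

print(missing_rows["title"].tolist())
print(len(filled), "rows kept")
```

Keeping `missing_rows` as a copy means later edits to the working frame do not change the record of which titles were incomplete.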

#### Filling nulls<a class="anchor" id="section_2_1"></a>

In [8]:
# Assign the result back instead of using inplace=True on a column,
# which may not update df in recent pandas versions.
for col in ["director", "cast", "country"]:
    df[col] = df[col].fillna("Missing")

#### Deletions<a class="anchor" id="section_2_2"></a>

In [9]:
del df['description']
df.head(1)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in
0,s1,TV Show,3%,Missing,"João Miguel, Bianca Comparato, Michel Gomes, R...",Brazil,"August 14, 2020",2020,TV-MA,4 Seasons,"International TV Shows, TV Dramas, TV Sci-Fi &..."


#### Reformatting<a class="anchor" id="section_2_3"></a>

In [10]:
df['date_added'] = pd.to_datetime(df['date_added'])
df.info()
df.head(1)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7787 entries, 0 to 7786
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   show_id       7787 non-null   object        
 1   type          7787 non-null   object        
 2   title         7787 non-null   object        
 3   director      7787 non-null   object        
 4   cast          7787 non-null   object        
 5   country       7787 non-null   object        
 6   date_added    7777 non-null   datetime64[ns]
 7   release_year  7787 non-null   int64         
 8   rating        7780 non-null   object        
 9   duration      7787 non-null   object        
 10  listed_in     7787 non-null   object        
dtypes: datetime64[ns](1), int64(1), object(9)
memory usage: 669.3+ KB


Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in
0,s1,TV Show,3%,Missing,"João Miguel, Bianca Comparato, Michel Gomes, R...",Brazil,2020-08-14,2020,TV-MA,4 Seasons,"International TV Shows, TV Dramas, TV Sci-Fi &..."


In [11]:
nc = df.columns[df.isnull().any()]
print(df[df["date_added"].isnull()][nc])

     date_added rating
258         NaT  TV-MA
549         NaT  TV-PG
2263        NaT  TV-PG
2288        NaT  TV-14
2555        NaT  TV-14
3374        NaT   TV-Y
3492        NaT  TV-14
3946        NaT  TV-MA
5137        NaT     NR
6065        NaT  TV-Y7


In [12]:
df.iloc[258]

show_id                                                      s259
type                                                      TV Show
title                 A Young Doctor's Notebook and Other Stories
director                                                  Missing
cast            Daniel Radcliffe, Jon Hamm, Adam Godley, Chris...
country                                            United Kingdom
date_added                                                    NaT
release_year                                                 2013
rating                                                      TV-MA
duration                                                2 Seasons
listed_in                British TV Shows, TV Comedies, TV Dramas
Name: 258, dtype: object

#### Added Columns<a class="anchor" id="section_2_4"></a>

In [13]:
df['Count'] = 1

### Data Visualization <a class="anchor" id="section4"></a>

In [14]:
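# Work on a copy so df keeps its full country strings
M = df.copy()
# Keep only the first listed country for titles produced in several countries
M['country'] = M['country'].str.split(',').str[0].str.strip()
# Total number of titles per country
finalCountry = M.groupby('country')[['Count']].sum()
df_Country = finalCountry.reset_index()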


#### Plot 1 <a class="anchor" id="section_4_1"></a>

In [15]:
# Size each country's tile by its number of titles
fig_tree = px.treemap(df_Country, path=[px.Constant("Country title count"), 'country'], values='Count')
fig_tree.update_layout(title='Countries that have the most Titles',
                  margin=dict(t=40, b=0, l=70, r=40),
                  plot_bgcolor='#fff', paper_bgcolor='#fff',
                  title_font=dict(size=25, color='#000', family="Lato, sans-serif"),
                  font=dict(color='#8a8d93'),
                  hoverlabel=dict(bgcolor="#444", font_size=13, font_family="Lato, sans-serif"))

In [16]:
# Count the titles added in each year from 2008 through 2021
years = []     # one DataFrame of titles per year
yearSum = []   # number of titles added in that year
yearName = []  # labels for the pie chart
for n in range(2008, 2022):
    dd = df[df['date_added'].dt.year == n]
    years.append(dd)
    yearSum.append(dd['Count'].sum())
    yearName.append('Year ' + str(n))

#### Plot 2 <a class="anchor" id="section_4_2"></a>

In [17]:

fig = go.Figure(data=[go.Pie(labels=yearName, values=yearSum, textinfo='label')])
fig.update_layout(
    title_text="Count of Titles per Year",)
fig.show()

In [18]:
df.head(2)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,Count
0,s1,TV Show,3%,Missing,"João Miguel, Bianca Comparato, Michel Gomes, R...",Brazil,2020-08-14,2020,TV-MA,4 Seasons,"International TV Shows, TV Dramas, TV Sci-Fi &...",1
1,s2,Movie,7:19,Jorge Michel Grau,"Demián Bichir, Héctor Bonilla, Oscar Serrano, ...",Mexico,2016-12-23,2016,TV-MA,93 min,"Dramas, International Movies",1


In [49]:
# Count the titles added in each calendar month, across all years
months = []     # one DataFrame of titles per month
monthsSum = []  # number of titles added in that month
monthsName = ['January', 'February', 'March', 'April',
              'May', 'June', 'July', 'August',
              'September', 'October', 'November', 'December']
for a in range(1, 13):
    mm = df[df['date_added'].dt.month == a]
    months.append(mm)
    monthsSum.append(mm['Count'].sum())

#### Plot 3 <a class="anchor" id="section_4_3"></a>

In [62]:
colors = ['grey'] * 12
colors[11] = 'red'

fig = go.Figure([go.Bar(x=monthsName, y=monthsSum,marker_color=colors)])
fig.update_layout(title_text='Top months for releases')
fig.show()

### Highlights <a class="anchor" id="section5"></a>
* The first visual shows that the countries with the most titles are the USA, India and Canada.
* The second visual shows that the most titles were added in 2019, followed by 2020 and then 2018.
* Lastly, the most popular month for adding titles to Netflix seems to be December.
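
The rankings above can also be cross-checked without plotting. This is a sketch under stated assumptions: `sample` is a tiny illustrative frame rather than the real data, and `top_n` is a hypothetical helper that does not appear in the notebook:

```python
import pandas as pd

# Illustrative stand-in data; not taken from netflix_titles.csv.
sample = pd.DataFrame({
    "country": ["United States, Canada", "India", "United States", "Missing"],
    "date_added": pd.to_datetime(["2019-12-01", "2019-12-15", "2020-01-10", None]),
})

def top_n(series, n=3):
    """Return the n most frequent non-null values with their counts."""
    return series.dropna().value_counts().head(n)

# First listed country per title, ignoring the "Missing" placeholder
first_country = sample["country"].str.split(",").str[0].str.strip()
by_country = top_n(first_country[first_country != "Missing"])

# Year and month in which each title was added (NaT rows are dropped)
by_year = top_n(sample["date_added"].dt.year)
by_month = top_n(sample["date_added"].dt.month_name())

print(by_country)
print(by_year)
print(by_month)
```

Running the same three calls on `df` would give tables to compare against the treemap, pie chart and bar chart.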