## Data Analysis of Netflix Movies and TV Shows

1. Understanding what content is available in different countries.
2. Identifying similar content by matching text-based features.
3. Network analysis of actors and directors to find interesting insights.
4. Does Netflix have an increasing focus on TV rather than movies in recent years?

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
pd.__version__

'0.20.1'

### Read Data

In [41]:
# Making a list of missing value types
missing_values = ['n/a', 'na', '--']
netflix = pd.read_csv('netflix_titles.csv', na_values=missing_values)
netflix.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,81145628,Movie,Norm of the North: King Sized Adventure,"Richard Finn, Tim Maltby","Alan Marriott, Andrew Toth, Brian Dobson, Cole...","United States, India, South Korea, China","September 9, 2019",2019,TV-PG,90 min,"Children & Family Movies, Comedies",Before planning an awesome wedding for his gra...
1,80117401,Movie,Jandino: Whatever it Takes,,Jandino Asporaat,United Kingdom,"September 9, 2016",2016,TV-MA,94 min,Stand-Up Comedy,Jandino Asporaat riffs on the challenges of ra...
2,70234439,TV Show,Transformers Prime,,"Peter Cullen, Sumalee Montano, Frank Welker, J...",United States,"September 8, 2018",2013,TV-Y7-FV,1 Season,Kids' TV,"With the help of three human allies, the Autob..."
3,80058654,TV Show,Transformers: Robots in Disguise,,"Will Friedle, Darren Criss, Constance Zimmer, ...",United States,"September 8, 2018",2016,TV-Y7,1 Season,Kids' TV,When a prison ship crash unleashes hundreds of...
4,80125979,Movie,#realityhigh,Fernando Lebrija,"Nesta Cooper, Kate Walsh, John Michael Higgins...",United States,"September 8, 2017",2017,TV-14,99 min,Comedies,When nerdy high schooler Dani finally attracts...


##### Summary of Dataset

In [42]:
print('Rows     :', netflix.shape[0])
print('Columns  :', netflix.shape[1])
print('\nFeatures :\n     :', netflix.columns.tolist())
print('\nMissing values    :', netflix.isnull().values.sum())
print('\nUnique values :\n  ', netflix.nunique())
print('\nData Types :   \n', netflix.dtypes)

Rows     : 6234
Columns  : 12

Features :
     : ['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added', 'release_year', 'rating', 'duration', 'listed_in', 'description']

Missing values    : 3036

Unique values :
   show_id         6234
type               2
title           6172
director        3301
cast            5469
country          554
date_added      1524
release_year      72
rating            14
duration         201
listed_in        461
description     6226
dtype: int64

Data Types :   
 show_id          int64
type            object
title           object
director        object
cast            object
country         object
date_added      object
release_year     int64
rating          object
duration        object
listed_in       object
description     object
dtype: object


#### Changing to appropriate data types

In [43]:
netflix['date_added'] = pd.to_datetime(netflix['date_added'])
netflix['year_added'] = netflix['date_added'].dt.year
# netflix['year_added'] = netflix['year_added'].astype(int)
# not able to change type as there are NaN values in year_added
netflix[netflix['year_added'].isnull()].head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,year_added
6223,70204989,TV Show,Gunslinger Girl,,"Yuuka Nanri, Kanako Mitsuhashi, Eri Sendai, Am...",Japan,NaT,2008,TV-14,2 Seasons,"Anime Series, Crime TV Shows","On the surface, the Social Welfare Agency appe...",
6224,70304979,TV Show,Anthony Bourdain: Parts Unknown,,Anthony Bourdain,United States,NaT,2018,TV-PG,5 Seasons,Docuseries,This CNN original series has chef Anthony Bour...,
6225,70153412,TV Show,Frasier,,"Kelsey Grammer, Jane Leeves, David Hyde Pierce...",United States,NaT,2003,TV-PG,11 Seasons,"Classic & Cult TV, TV Comedies",Frasier Crane is a snooty but lovable Seattle ...,
6226,70243132,TV Show,La Familia P. Luche,,"Eugenio Derbez, Consuelo Duval, Luis Manuel Áv...",United States,NaT,2012,TV-14,3 Seasons,"International TV Shows, Spanish-Language TV Sh...","This irreverent sitcom featues Ludovico, Feder...",
6227,80005756,TV Show,The Adventures of Figaro Pho,,"Luke Jurevicius, Craig Behenna, Charlotte Haml...",Australia,NaT,2015,TV-Y7,2 Seasons,"Kids' TV, TV Comedies","Imagine your worst fears, then multiply them: ...",


In [44]:
# rows with 'year_added' NaN values were due to date_added NaN values
# instead of dropping row with NaN in date, I chose to use float type
netflix['year_added'] = netflix['year_added'].astype(float)
netflix.dtypes

show_id                  int64
type                    object
title                   object
director                object
cast                    object
country                 object
date_added      datetime64[ns]
release_year             int64
rating                  object
duration                object
listed_in               object
description             object
year_added             float64
dtype: object
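
As an alternative to storing `year_added` as a float, newer pandas releases offer a nullable integer dtype. The sketch below assumes pandas 0.24 or later (the notebook above runs 0.20.1, where this dtype is not available) and uses a small made-up Series in place of `netflix['date_added']`.

```python
import pandas as pd

# Toy stand-in for netflix['date_added']; the values are illustrative only.
date_added = pd.to_datetime(
    pd.Series(["September 9, 2019", None, "September 8, 2017"])
)

# .dt.year yields floats because of the missing date (NaT -> NaN).
year_as_float = date_added.dt.year

# Assumption: pandas >= 0.24, where the nullable 'Int64' dtype exists.
# Whole-number years are kept as integers and the gap becomes <NA>.
year_added = year_as_float.astype("Int64")
print(year_added)
```

With this dtype the years display as `2019` rather than `2019.0`, and rows with a missing date are kept rather than dropped.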

#### Dealing with missing data

In [45]:
netflix.isnull().sum()

show_id            0
type               0
title              0
director        1969
cast             570
country          476
date_added        11
release_year       0
rating            10
duration           0
listed_in          0
description        0
year_added        11
dtype: int64

In [46]:
# rating only has 10 missing values
# what are the unique values in rating, perhaps make the missing ones into something that already exists, like 'not rated'
netflix['rating'].unique()

array(['TV-PG', 'TV-MA', 'TV-Y7-FV', 'TV-Y7', 'TV-14', 'R', 'TV-Y', 'NR',
       'PG-13', 'TV-G', 'PG', 'G', nan, 'UR', 'NC-17'], dtype=object)

In [47]:
# The 'Not Rated' (NR) film rating indicates that a film was not submitted 
# for a rating or is an uncut version. Changing the missing values to 'NR' 
# would therefore be inappropriate, as we don't know whether those titles were submitted. 
# The NaN values still need a label, so they are changed to 'Unknown'. 
netflix['rating'] = netflix['rating'].fillna('Unknown')
netflix['rating'].unique()

array(['TV-PG', 'TV-MA', 'TV-Y7-FV', 'TV-Y7', 'TV-14', 'R', 'TV-Y', 'NR',
       'PG-13', 'TV-G', 'PG', 'G', 'Unknown', 'UR', 'NC-17'], dtype=object)

In [48]:
# NR (Not Rated) and UR (Unrated) are often used interchangeably, so 
# UR is replaced with NR. 
netflix['rating'] = netflix['rating'].replace(to_replace='UR', value='NR')
netflix['rating'].unique()

array(['TV-PG', 'TV-MA', 'TV-Y7-FV', 'TV-Y7', 'TV-14', 'R', 'TV-Y', 'NR',
       'PG-13', 'TV-G', 'PG', 'G', 'Unknown', 'NC-17'], dtype=object)

In [49]:
# missing data for 'director'
netflix['director'].isnull().sum()

1969

In [50]:
# let's take a look
netflix[netflix['director'].isnull()].sample(5)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,year_added
677,80136787,TV Show,La Femme,,"Zoe Tay, Ann Kok, Tiffany Leong, Tay Ping Hui,...",,2017-10-16,2016,TV-14,1 Season,"International TV Shows, TV Dramas",Personal desires guide the lives of a marriage...,2017.0
1228,81105522,TV Show,No Time for Shame,,Santiago Artemis,Argentina,2019-11-19,2019,TV-MA,1 Season,"International TV Shows, Reality TV, Spanish-La...","Follow Santiago Artemis, a Buenos Aires fashio...",2019.0
5261,80161826,Movie,2015 Dream Concert,,"4Minute, B1A4, BtoB, ELSIE, EXID, EXO, Got7, I...",South Korea,2017-04-28,2015,TV-PG,107 min,"International Movies, Music & Musicals",The world's biggest K-pop festival marked its ...,2017.0
2079,80222788,TV Show,Day and Night,,"Pan Yueming, Wang Longzheng, Liang Yuen, Lü Xi...",China,2018-03-23,2017,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Dramas",A detective assists with an investigation into...,2018.0
5597,80113201,TV Show,Skylanders Academy,,"Justin Long, Ashley Tisdale, Jonathan Banks, C...","South Korea, United States",2018-09-28,2018,TV-Y7,3 Seasons,"Kids' TV, TV Comedies",Travel the vast Skylander universe in this ani...,2018.0


In [51]:
# TV Shows often have several directors, each directing different 
# episodes. Are the missing director values mostly TV Shows?
netflix['director'].isnull().groupby(netflix['type']).sum()

type
Movie       128.0
TV Show    1841.0
Name: director, dtype: float64

In [55]:
# Of the 1,969 missing values in 'director', 1,841 (about 93%) are TV Shows, 
# which likely have several directors per show, so no single person can be 
# named as "the" director in this field. We change the NaN director values
# for TV Shows to "Various". 
netflix.loc[(netflix['type'] == 'TV Show') & netflix['director'].isnull(), 'director'] = 'Various'


##### What types do we have?

In [None]:
netflix.type.value_counts()

In [None]:
netflix.type.value_counts().plot(kind='pie', autopct='%1.f%%', startangle=90, colors=['cornflowerblue','burlywood'])

In [None]:
netflix.type.value_counts().plot(kind='barh', color=['cornflowerblue','burlywood'])

In [None]:
netflix['year_added'].dropna().astype(int).value_counts().sort_index().plot()
netflix['date_added'].max()

**The sharp drop in 2020 does not reflect a real decline in titles: the dataset contains only partial information for 2020.**
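
One way to keep a partial final year from distorting the trend is to check the latest `date_added` and exclude that year when it does not run to 31 December. The following is a minimal sketch on made-up dates; the variable names (`dates`, `per_year`, `complete_years`) are hypothetical and not part of the notebook.

```python
import pandas as pd

# Toy stand-in for netflix['date_added']; the values are illustrative only.
dates = pd.to_datetime(pd.Series([
    "2018-03-01", "2018-11-20", "2019-06-15", "2019-12-30", "2020-01-18",
]))

last_date = dates.max()

# Treat the final year as partial unless the data reaches 31 December.
is_partial = (last_date.month, last_date.day) != (12, 31)

# Titles added per year, in chronological order.
per_year = dates.dt.year.value_counts().sort_index()

# Keep only years that are fully covered by the data.
complete_years = per_year[per_year.index < last_date.year] if is_partial else per_year
print(complete_years)
```

Plotting `complete_years` instead of `per_year` would leave out the incomplete year, at the cost of hiding the most recent additions.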

In [None]:
#netflix['year_added'].dropna().astype(int).value_counts().sort_index()


##### What ratings do we have?

In [None]:
netflix.rating.value_counts()

**Separate the ratings for movies and TV**

In [None]:
netflix.rating.str.contains('TV-')
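
The `TV-` prefix alone does not identify TV shows: the first rows of the dataset include movies rated TV-PG and TV-MA. A cross-tabulation of rating against `type` may therefore be a more direct way to separate the two. The sketch below uses a small made-up frame, not the real data.

```python
import pandas as pd

# Toy stand-in for the netflix frame; the values are illustrative only.
sample = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "TV Show", "Movie"],
    "rating": ["R", "TV-MA", "TV-14", "TV-Y7", "PG-13"],
})

# Flag ratings that carry the 'TV-' prefix.
sample["is_tv_rating"] = sample["rating"].str.startswith("TV-")

# Count titles for every (rating, type) pair.
rating_by_type = pd.crosstab(sample["rating"], sample["type"])
print(rating_by_type)
```

Applied to the full dataset, the same call would be `pd.crosstab(netflix['rating'], netflix['type'])`.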

**The following listing of countries is inaccurate.**
For example, the combined entry "United Kingdom, United States" appears 50 times. Each of those rows should add one to the count for 'United States' and one to the count for 'United Kingdom', but the grouping below counts the combined string as a country of its own. Every occurrence of 'United States', alone or in a combined entry, should count towards 'United States'. 

In [None]:
netflix.groupby('country')['show_id'].count().sort_values(ascending=False)
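
To count each country separately, the comma-separated entries can be split before counting. This is a minimal sketch on a made-up Series standing in for `netflix['country']`; it uses a plain list comprehension so it should run on old and new pandas versions alike.

```python
import pandas as pd

# Toy stand-in for netflix['country']; the values are illustrative only.
country = pd.Series([
    "United Kingdom, United States",
    "United States",
    "India",
    None,
    "United States, India",
])

# Split every non-missing entry on commas and strip the leftover spaces,
# producing one item per (title, country) pair.
individual = [
    name.strip()
    for entry in country.dropna()
    for name in entry.split(",")
]

# Count how many titles list each country.
country_counts = pd.Series(individual).value_counts()
print(country_counts)
```

Here 'United States' is counted three times, once for each row that mentions it, instead of being split across three different strings.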