The dataset used for analysis can be obtained through the following link: https://www.kaggle.com/shivamb/netflix-shows.

A short description of the dataset follows (excerpt taken from the above link):
This dataset consists of TV shows and movies available on Netflix as of 2019. The dataset is collected from Flixable, which is a third-party Netflix search engine.

In 2018, Flixable released an interesting report showing that, since 2010, the number of TV shows on Netflix has nearly tripled while the number of movies has decreased by more than 2,000 titles. It will be interesting to explore what other insights can be obtained from the same dataset.

Integrating this dataset with external datasets such as IMDb ratings or Rotten Tomatoes could also yield many interesting findings.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from imdb import IMDb

<h1> Load dataset and get general information about the dataset. </h1>

In [3]:
# Load the existing netflix data csv file.
netflix_df = pd.read_csv('netflix_titles.csv')
netflix_df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,81145628,Movie,Norm of the North: King Sized Adventure,"Richard Finn, Tim Maltby","Alan Marriott, Andrew Toth, Brian Dobson, Cole...","United States, India, South Korea, China","September 9, 2019",2019,TV-PG,90 min,"Children & Family Movies, Comedies",Before planning an awesome wedding for his gra...
1,80117401,Movie,Jandino: Whatever it Takes,,Jandino Asporaat,United Kingdom,"September 9, 2016",2016,TV-MA,94 min,Stand-Up Comedy,Jandino Asporaat riffs on the challenges of ra...
2,70234439,TV Show,Transformers Prime,,"Peter Cullen, Sumalee Montano, Frank Welker, J...",United States,"September 8, 2018",2013,TV-Y7-FV,1 Season,Kids' TV,"With the help of three human allies, the Autob..."
3,80058654,TV Show,Transformers: Robots in Disguise,,"Will Friedle, Darren Criss, Constance Zimmer, ...",United States,"September 8, 2018",2016,TV-Y7,1 Season,Kids' TV,When a prison ship crash unleashes hundreds of...
4,80125979,Movie,#realityhigh,Fernando Lebrija,"Nesta Cooper, Kate Walsh, John Michael Higgins...",United States,"September 8, 2017",2017,TV-14,99 min,Comedies,When nerdy high schooler Dani finally attracts...


In [4]:
netflix_df.shape

(6234, 12)

In [5]:
netflix_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6234 entries, 0 to 6233
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       6234 non-null   int64 
 1   type          6234 non-null   object
 2   title         6234 non-null   object
 3   director      4265 non-null   object
 4   cast          5664 non-null   object
 5   country       5758 non-null   object
 6   date_added    6223 non-null   object
 7   release_year  6234 non-null   int64 
 8   rating        6224 non-null   object
 9   duration      6234 non-null   object
 10  listed_in     6234 non-null   object
 11  description   6234 non-null   object
dtypes: int64(2), object(10)
memory usage: 584.6+ KB


<h1> Gathering relevant IMDB data for each movie in the dataset </h1>

In [None]:
# Loop through all the titles in the Netflix dataset and obtain the IMDb rating for each one.
# The IMDb client is created once, rather than once per title.
im = IMDb()
ratings = []
for movie in netflix_df.title:
    # If there is any error while searching for a title (no search results, network error, ...),
    # the rating for that title is recorded as None.
    try:
        # The first search result is assumed to be the matching title.
        search_result = im.search_movie(movie)
        movie_id = search_result[0].movieID
        movie_info = im.get_movie(movie_id)
        # .get() returns None when the title has no rating on IMDb.
        ratings.append(movie_info.get('rating'))
    except Exception:
        ratings.append(None)
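
The lookup-with-fallback pattern used above can be sketched independently of IMDb. This is a minimal sketch, not the notebook's actual code: `collect_ratings` and `fake_lookup` are hypothetical names, and the stand-in lookup replaces the real network call so the example can be run offline. It also caches results per title, so a title that appears more than once is only looked up once.

```python
from typing import Callable, Dict, Iterable, List, Optional


def collect_ratings(
    titles: Iterable[str],
    lookup: Callable[[str], Optional[float]],
) -> List[Optional[float]]:
    """Return one rating per title, calling `lookup` at most once per distinct title."""
    cache: Dict[str, Optional[float]] = {}
    ratings: List[Optional[float]] = []
    for title in titles:
        if title not in cache:
            try:
                cache[title] = lookup(title)
            except Exception:
                # Any lookup failure is recorded as a missing rating.
                cache[title] = None
        ratings.append(cache[title])
    return ratings


# Stand-in lookup for illustration only; a real one would query IMDb.
calls: List[str] = []


def fake_lookup(title: str) -> Optional[float]:
    calls.append(title)
    if title == "Unknown Title":
        raise IndexError("no search results")
    return {"Transformers Prime": 7.8}.get(title)


demo_ratings = collect_ratings(
    ["Transformers Prime", "Unknown Title", "Transformers Prime"], fake_lookup
)
print(demo_ratings)  # one entry per input title, None where the lookup failed
print(len(calls))    # number of lookups actually performed
```

Passing the lookup in as a function keeps the error handling testable without a network connection.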

In [13]:
netflix_df['imdb_rating'] = ratings
netflix_df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,imdb_rating
0,81145628,Movie,Norm of the North: King Sized Adventure,"Richard Finn, Tim Maltby","Alan Marriott, Andrew Toth, Brian Dobson, Cole...","United States, India, South Korea, China","September 9, 2019",2019,TV-PG,90 min,"Children & Family Movies, Comedies",Before planning an awesome wedding for his gra...,3.1
1,80117401,Movie,Jandino: Whatever it Takes,,Jandino Asporaat,United Kingdom,"September 9, 2016",2016,TV-MA,94 min,Stand-Up Comedy,Jandino Asporaat riffs on the challenges of ra...,5.2
2,70234439,TV Show,Transformers Prime,,"Peter Cullen, Sumalee Montano, Frank Welker, J...",United States,"September 8, 2018",2013,TV-Y7-FV,1 Season,Kids' TV,"With the help of three human allies, the Autob...",7.8
3,80058654,TV Show,Transformers: Robots in Disguise,,"Will Friedle, Darren Criss, Constance Zimmer, ...",United States,"September 8, 2018",2016,TV-Y7,1 Season,Kids' TV,When a prison ship crash unleashes hundreds of...,6.0
4,80125979,Movie,#realityhigh,Fernando Lebrija,"Nesta Cooper, Kate Walsh, John Michael Higgins...",United States,"September 8, 2017",2017,TV-14,99 min,Comedies,When nerdy high schooler Dani finally attracts...,5.2


In [19]:
# Save new dataframe to a new csv file which will then be used for cleaning.
netflix_df.to_csv('netflix_data.csv', index = False)

<h1> Clean Data </h1>

In [2]:
netflix_df = pd.read_csv('netflix_data.csv')
netflix_df_clean = netflix_df.copy()
netflix_df_clean.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,imdb_rating
0,81145628,Movie,Norm of the North: King Sized Adventure,"Richard Finn, Tim Maltby","Alan Marriott, Andrew Toth, Brian Dobson, Cole...","United States, India, South Korea, China","September 9, 2019",2019,TV-PG,90 min,"Children & Family Movies, Comedies",Before planning an awesome wedding for his gra...,3.1
1,80117401,Movie,Jandino: Whatever it Takes,,Jandino Asporaat,United Kingdom,"September 9, 2016",2016,TV-MA,94 min,Stand-Up Comedy,Jandino Asporaat riffs on the challenges of ra...,5.2
2,70234439,TV Show,Transformers Prime,,"Peter Cullen, Sumalee Montano, Frank Welker, J...",United States,"September 8, 2018",2013,TV-Y7-FV,1 Season,Kids' TV,"With the help of three human allies, the Autob...",7.8
3,80058654,TV Show,Transformers: Robots in Disguise,,"Will Friedle, Darren Criss, Constance Zimmer, ...",United States,"September 8, 2018",2016,TV-Y7,1 Season,Kids' TV,When a prison ship crash unleashes hundreds of...,6.0
4,80125979,Movie,#realityhigh,Fernando Lebrija,"Nesta Cooper, Kate Walsh, John Michael Higgins...",United States,"September 8, 2017",2017,TV-14,99 min,Comedies,When nerdy high schooler Dani finally attracts...,5.2


<b> Issue #1 </b>: Let's deal with missing values (NaN and None) first.

In [3]:
netflix_df_clean.isna().sum()

show_id            0
type               0
title              0
director        1969
cast             570
country          476
date_added        11
release_year       0
rating            10
duration           0
listed_in          0
description        0
imdb_rating      323
dtype: int64

<b> The date_added column is not important to our analysis, so it can be dropped. While we're at it, the description column can be dropped too, since it isn't important to our analysis either. </b>

In [4]:
netflix_df_clean.drop(columns = ['date_added', 'description'], inplace = True)
netflix_df_clean.columns

Index(['show_id', 'type', 'title', 'director', 'cast', 'country',
       'release_year', 'rating', 'duration', 'listed_in', 'imdb_rating'],
      dtype='object')

<b> Before making a decision on whether the null values should be dropped or not, let us first analyze them to understand why they are null and if they should be dropped. </b>

In [5]:
netflix_df_clean[netflix_df_clean.isnull().any(axis = 1)]

Unnamed: 0,show_id,type,title,director,cast,country,release_year,rating,duration,listed_in,imdb_rating
1,80117401,Movie,Jandino: Whatever it Takes,,Jandino Asporaat,United Kingdom,2016,TV-MA,94 min,Stand-Up Comedy,5.2
2,70234439,TV Show,Transformers Prime,,"Peter Cullen, Sumalee Montano, Frank Welker, J...",United States,2013,TV-Y7-FV,1 Season,Kids' TV,7.8
3,80058654,TV Show,Transformers: Robots in Disguise,,"Will Friedle, Darren Criss, Constance Zimmer, ...",United States,2016,TV-Y7,1 Season,Kids' TV,6.0
5,80163890,TV Show,Apaches,,"Alberto Ammann, Eloy Azorín, Verónica Echegui,...",Spain,2016,TV-MA,1 Season,"Crime TV Shows, International TV Shows, Spanis...",6.9
8,80117902,TV Show,Fire Chasers,,,United States,2017,TV-MA,1 Season,"Docuseries, Science & Nature TV",6.6
...,...,...,...,...,...,...,...,...,...,...,...
6229,80000063,TV Show,Red vs. Blue,,"Burnie Burns, Jason Saldaña, Gustavo Sorola, G...",United States,2015,NR,13 Seasons,"TV Action & Adventure, TV Comedies, TV Sci-Fi ...",8.3
6230,70286564,TV Show,Maron,,"Marc Maron, Judd Hirsch, Josh Brener, Nora Zeh...",United States,2016,TV-MA,4 Seasons,TV Comedies,7.7
6231,80116008,Movie,Little Baby Bum: Nursery Rhyme Friends,,,,2016,,60 min,Movies,5.7
6232,70281022,TV Show,A Young Doctor's Notebook and Other Stories,,"Daniel Radcliffe, Jon Hamm, Adam Godley, Chris...",United Kingdom,2013,TV-MA,2 Seasons,"British TV Shows, TV Comedies, TV Dramas",7.9


Because so many rows contain at least one null value, dropping them is not a good approach: the director column alone is missing in 1,969 of the 6,234 rows, so we would lose roughly 2,000 or more rows of information. Many of the rows with an empty director column, for example, still contain important information such as the Netflix rating, category information and release year, which will be essential to our analysis and visualizations later. The decision here is therefore not to drop any rows that contain null values. Instead, null values will be ignored when visualizing the affected columns.
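
The trade-off described above can be checked directly. The following is a minimal sketch on a made-up four-row frame (the `toy` data is an assumption for illustration, not taken from the Netflix dataset): it counts how many rows a blanket `dropna()` would remove, and shows two ways of ignoring nulls for a single column only.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", np.nan, np.nan, "Y"],
    "rating": ["TV-MA", "TV-14", np.nan, "TV-MA"],
})

# Rows that would be lost if every row with any null value were dropped.
rows_lost = len(toy) - len(toy.dropna())

# Share of missing values per column.
missing_share = toy.isna().mean()

# value_counts() skips NaN by default, so nulls are ignored per column.
rating_counts = toy["rating"].value_counts()

# Drop nulls only for the column being analyzed, keeping the other rows intact.
with_director = toy.dropna(subset=["director"])

print(rows_lost)
print(missing_share)
print(rating_counts)
print(with_director)
```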

<b>Issue #2</b>: The next step is to categorize titles by IMDb rating, as this will make our analysis easier. Since IMDb itself does not provide a way of categorizing the quality of a title by its rating, we will use the IGN rating scale. The scale is as follows:

[1,2) - Unbearable <br>
[2,3) - Painful<br>
[3,4) - Awful<br>
[4,5) - Bad<br>
[5,6) - Mediocre<br>
[6,7) - Okay<br>
[7,8) - Good<br>
[8,9) - Great<br>
[9,10) - Amazing<br>
10 - Masterpiece<br>

In [20]:
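bins = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
labels = ['Unbearable', 'Painful', 'Awful', 'Bad', 'Mediocre', 'Okay', 'Good', 'Great', 'Amazing', 'Masterpiece']
# Note: pd.cut uses right-closed intervals by default, so a rating exactly on a boundary (e.g. 6.0)
# is placed in the lower category ('Mediocre'); pass right = False for the left-closed scale above.
netflix_df_clean['imdb_rating'] = pd.cut(netflix_df_clean['imdb_rating'], bins = bins, labels = labels, include_lowest = True)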
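
How `pd.cut` treats bin edges matters for ratings that sit exactly on a boundary. The sketch below uses a handful of made-up ratings (the `sample` series is an assumption for illustration) to compare the default right-closed intervals with `right=False`, which gives the left-closed intervals `[1,2)`, `[2,3)`, ... listed in the scale.

```python
import pandas as pd

bins = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
labels = ['Unbearable', 'Painful', 'Awful', 'Bad', 'Mediocre',
          'Okay', 'Good', 'Great', 'Amazing', 'Masterpiece']

# Made-up ratings; 6.0 and 10.0 sit exactly on bin edges.
sample = pd.Series([3.1, 5.2, 6.0, 7.8, 10.0])

# Default behaviour: intervals are (1, 2], (2, 3], ..., with 1 included via include_lowest.
right_closed = pd.cut(sample, bins=bins, labels=labels, include_lowest=True)

# right=False: intervals are [1, 2), [2, 3), ..., [10, 11).
left_closed = pd.cut(sample, bins=bins, labels=labels, right=False)

print(pd.DataFrame({
    "rating": sample,
    "right_closed": right_closed,
    "left_closed": left_closed,
}))
```

Only the boundary values differ between the two columns; ratings strictly inside a bin get the same label either way.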

In [21]:
netflix_df_clean.head()

Unnamed: 0,show_id,type,title,director,cast,country,release_year,rating,duration,listed_in,imdb_rating
0,81145628,Movie,Norm of the North: King Sized Adventure,"Richard Finn, Tim Maltby","Alan Marriott, Andrew Toth, Brian Dobson, Cole...","United States, India, South Korea, China",2019,TV-PG,90 min,"Children & Family Movies, Comedies",Awful
1,80117401,Movie,Jandino: Whatever it Takes,,Jandino Asporaat,United Kingdom,2016,TV-MA,94 min,Stand-Up Comedy,Mediocre
2,70234439,TV Show,Transformers Prime,,"Peter Cullen, Sumalee Montano, Frank Welker, J...",United States,2013,TV-Y7-FV,1 Season,Kids' TV,Good
3,80058654,TV Show,Transformers: Robots in Disguise,,"Will Friedle, Darren Criss, Constance Zimmer, ...",United States,2016,TV-Y7,1 Season,Kids' TV,Mediocre
4,80125979,Movie,#realityhigh,Fernando Lebrija,"Nesta Cooper, Kate Walsh, John Michael Higgins...",United States,2017,TV-14,99 min,Comedies,Mediocre


In [22]:
# Now let's check the type of the imdb_rating column
netflix_df_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6234 entries, 0 to 6233
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype   
---  ------        --------------  -----   
 0   show_id       6234 non-null   int64   
 1   type          6234 non-null   object  
 2   title         6234 non-null   object  
 3   director      4265 non-null   object  
 4   cast          5664 non-null   object  
 5   country       5758 non-null   object  
 6   release_year  6234 non-null   int64   
 7   rating        6224 non-null   object  
 8   duration      6234 non-null   object  
 9   listed_in     6234 non-null   object  
 10  imdb_rating   5911 non-null   category
dtypes: category(1), int64(2), object(8)
memory usage: 493.6+ KB


<b> As can be seen above, the imdb_rating column has automatically been converted to a categorical data type. </b>

In [23]:
# Save resulting dataset to a csv file.
netflix_df_clean.to_csv('clean_netflix.csv', index = False)

<h1> Analyze and visualize data </h1>

In [24]:
# Read in netflix data
netflix_df = pd.read_csv('clean_netflix.csv')
netflix_df.head()

Unnamed: 0,show_id,type,title,director,cast,country,release_year,rating,duration,listed_in,imdb_rating
0,81145628,Movie,Norm of the North: King Sized Adventure,"Richard Finn, Tim Maltby","Alan Marriott, Andrew Toth, Brian Dobson, Cole...","United States, India, South Korea, China",2019,TV-PG,90 min,"Children & Family Movies, Comedies",Awful
1,80117401,Movie,Jandino: Whatever it Takes,,Jandino Asporaat,United Kingdom,2016,TV-MA,94 min,Stand-Up Comedy,Mediocre
2,70234439,TV Show,Transformers Prime,,"Peter Cullen, Sumalee Montano, Frank Welker, J...",United States,2013,TV-Y7-FV,1 Season,Kids' TV,Good
3,80058654,TV Show,Transformers: Robots in Disguise,,"Will Friedle, Darren Criss, Constance Zimmer, ...",United States,2016,TV-Y7,1 Season,Kids' TV,Mediocre
4,80125979,Movie,#realityhigh,Fernando Lebrija,"Nesta Cooper, Kate Walsh, John Michael Higgins...",United States,2017,TV-14,99 min,Comedies,Mediocre
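
A CSV file does not store the categorical dtype, so after `pd.read_csv` the `imdb_rating` column comes back as plain strings. The following is a minimal sketch of restoring an ordered categorical after loading; the in-memory `csv_text` is made up for illustration, and treating the scale as ordered is an assumption that suits comparisons such as "Good or better".

```python
import io

import pandas as pd

labels = ['Unbearable', 'Painful', 'Awful', 'Bad', 'Mediocre',
          'Okay', 'Good', 'Great', 'Amazing', 'Masterpiece']
rating_type = pd.CategoricalDtype(categories=labels, ordered=True)

# Made-up CSV content standing in for the cleaned file; title C has no rating.
csv_text = "title,imdb_rating\nA,Good\nB,Awful\nC,\n"
df = pd.read_csv(io.StringIO(csv_text))
dtype_after_read = df["imdb_rating"].dtype  # plain strings, not categorical

# Restore the ordered categorical dtype; missing values stay missing.
df["imdb_rating"] = df["imdb_rating"].astype(rating_type)

# With an ordered categorical, comparisons follow the scale rather than the alphabet.
good_or_better = df[df["imdb_rating"] >= "Good"]

print(dtype_after_read)
print(df["imdb_rating"].dtype)
print(good_or_better)
```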
