# Netflix Movies and TV Shows Analysis

The original dataset consists of TV shows and movies available on <strong>Netflix</strong> as of 2019.

The dataset is available at https://www.kaggle.com/shivamb/netflix-shows

In [1]:
import pandas as pd

# Loading and checking dataset

In [2]:
netflix = pd.read_csv('netflix_titles.csv')

In [3]:
netflix.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,TV Show,3%,,"João Miguel, Bianca Comparato, Michel Gomes, R...",Brazil,"August 14, 2020",2020,TV-MA,4 Seasons,"International TV Shows, TV Dramas, TV Sci-Fi &...",In a future where the elite inhabit an island ...
1,s2,Movie,7:19,Jorge Michel Grau,"Demián Bichir, Héctor Bonilla, Oscar Serrano, ...",Mexico,"December 23, 2016",2016,TV-MA,93 min,"Dramas, International Movies",After a devastating earthquake hits Mexico Cit...
2,s3,Movie,23:59,Gilbert Chan,"Tedd Chan, Stella Chung, Henley Hii, Lawrence ...",Singapore,"December 20, 2018",2011,R,78 min,"Horror Movies, International Movies","When an army recruit is found dead, his fellow..."
3,s4,Movie,9,Shane Acker,"Elijah Wood, John C. Reilly, Jennifer Connelly...",United States,"November 16, 2017",2009,PG-13,80 min,"Action & Adventure, Independent Movies, Sci-Fi...","In a postapocalyptic world, rag-doll robots hi..."
4,s5,Movie,21,Robert Luketic,"Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar...",United States,"January 1, 2020",2008,PG-13,123 min,Dramas,A brilliant group of students become card-coun...


In [4]:
netflix.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7787 entries, 0 to 7786
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       7787 non-null   object
 1   type          7787 non-null   object
 2   title         7787 non-null   object
 3   director      5398 non-null   object
 4   cast          7069 non-null   object
 5   country       7280 non-null   object
 6   date_added    7777 non-null   object
 7   release_year  7787 non-null   int64 
 8   rating        7780 non-null   object
 9   duration      7787 non-null   object
 10  listed_in     7787 non-null   object
 11  description   7787 non-null   object
dtypes: int64(1), object(11)
memory usage: 730.2+ KB


In [5]:
netflix.isna().sum()

show_id            0
type               0
title              0
director        2389
cast             718
country          507
date_added        10
release_year       0
rating             7
duration           0
listed_in          0
description        0
dtype: int64

In [6]:
number_of_shows = netflix[netflix['type'] == 'TV Show'].shape[0]
number_of_movies = netflix[netflix['type'] == 'Movie'].shape[0]
number_of_titles = netflix.shape[0]

print(f'There were {number_of_titles} titles available on Netflix in 2019\n{number_of_movies} Movies and {number_of_shows} TV Shows')

There were 7787 titles available on Netflix in 2019
5377 Movies and 2410 TV Shows
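
As a side note, the same per-type counts can be obtained in a single call with `value_counts`. The snippet below is only a minimal sketch on a toy DataFrame; the `sample` rows are hypothetical and not taken from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for the real DataFrame
sample = pd.DataFrame({'type': ['Movie', 'TV Show', 'Movie', 'Movie']})

# One pass over the column gives the count for every category
type_counts = sample['type'].value_counts()

print(f"{type_counts['Movie']} Movies and {type_counts['TV Show']} TV Shows")
```

The counts always add up to the number of rows, as long as the column has no missing values.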


# Data Cleaning

## Dealing with missing information

### Director

In [7]:
netflix.director.value_counts()

Raúl Campos, Jan Suter                   18
Marcus Raboy                             16
Jay Karas                                14
Cathy Garcia-Molina                      13
Youssef Chahine                          12
                                         ..
Joseduardo Giordano, Sergio Goyri Jr.     1
Antonio Díaz                              1
Shanjey Kumar Perumal                     1
Timo Tjahjanto, Kimo Stamboel             1
Amit Saxena                               1
Name: director, Length: 4049, dtype: int64

The director doesn't seem relevant to the analysis planned here, so we'll drop this column from our DataFrame.

In [8]:
netflix.drop(['director'], axis = 1, inplace = True)

### Rating

Only 7 ratings are NaN, so we'll fix them manually.

In [9]:
netflix[netflix.rating.isna()]

Unnamed: 0,show_id,type,title,cast,country,date_added,release_year,rating,duration,listed_in,description
67,s68,Movie,13TH: A Conversation with Oprah Winfrey & Ava ...,"Oprah Winfrey, Ava DuVernay",,"January 26, 2017",2017,,37 min,Movies,Oprah Winfrey sits down with director Ava DuVe...
2359,s2360,TV Show,Gargantia on the Verdurous Planet,"Kaito Ishikawa, Hisako Kanemoto, Ai Kayano, Ka...",Japan,"December 1, 2016",2013,,1 Season,"Anime Series, International TV Shows","After falling through a wormhole, a space-dwel..."
3660,s3661,TV Show,Little Lunch,"Flynn Curry, Olivia Deeble, Madison Lu, Oisín ...",Australia,"February 1, 2018",2015,,1 Season,"Kids' TV, TV Comedies","Adopting a child's perspective, this show take..."
3736,s3737,Movie,Louis C.K. 2017,Louis C.K.,United States,"April 4, 2017",2017,,74 min,Movies,"Louis C.K. muses on religion, eternal love, gi..."
3737,s3738,Movie,Louis C.K.: Hilarious,Louis C.K.,United States,"September 16, 2016",2010,,84 min,Movies,Emmy-winning comedy writer Louis C.K. brings h...
3738,s3739,Movie,Louis C.K.: Live at the Comedy Store,Louis C.K.,United States,"August 15, 2016",2015,,66 min,Movies,The comic puts his trademark hilarious/thought...
4323,s4324,Movie,My Honor Was Loyalty,"Leone Frisa, Paolo Vaccarino, Francesco Miglio...",Italy,"March 1, 2017",2015,,115 min,Dramas,"Amid the chaos and horror of World War II, a c..."


In [10]:
# Ratings according to IMDb and Netflix, keyed by row index
fixing_rating = {67: 'TV-PG', 2359: 'TV-14', 3660: 'TV-MA', 3736: 'TV-MA', 3737: 'TV-MA', 3738: 'TV-MA', 4323: 'PG-13'}

In [11]:
# Assign by index label, so the fix stays correct even if rows are removed later
for item_id, item_rating in fixing_rating.items():
    netflix.loc[item_id, 'rating'] = item_rating

In [12]:
netflix['rating'].isna().sum()

0
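
Row labels and row positions coincide at this point only because no rows have been removed yet. The following is a minimal sketch, on hypothetical toy data, of how the two diverge once rows are dropped, and of why label-based assignment with `.loc` keeps targeting the intended rows.

```python
import pandas as pd

# Hypothetical toy data: two ratings are missing
sample = pd.DataFrame({'rating': ['PG', None, 'R', None]})

# Dropping the first row leaves labels 1, 2, 3 at positions 0, 1, 2
filtered = sample.drop(index=0)

# Fixes keyed by index label, as in the dictionary used for the real data
fixes = {1: 'TV-14', 3: 'TV-MA'}
for label, rating in fixes.items():
    filtered.loc[label, 'rating'] = rating

print(filtered)
```

With `.iloc`, the key `3` would point past the end of `filtered`, because only positions 0 to 2 exist after the drop.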

### Cast and Date_Added

We'll use 'cast' and 'date_added' in our analysis, so we'll drop the rows where either value is NaN.

In [13]:
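# .copy() makes the filtered result an independent DataFrame, so later column assignments don't raise SettingWithCopyWarning
netflix = netflix[netflix['cast'].notna() & netflix['date_added'].notna()].copy()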


### Country

In [14]:
netflix.country.isna().sum()

410

There are a lot of NaN values in this column, so we'll fill them with the most common country.

In [15]:
most_common_country = netflix['country'].value_counts().idxmax()
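# Assign the result back: fillna(inplace=True) on a selected column may not update the DataFrame in recent pandas versions
netflix['country'] = netflix['country'].fillna(most_common_country)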

In [16]:
netflix.country.isna().sum()

0
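
Filling every missing value with the most common country increases that country's share of the catalogue, which is worth keeping in mind when reading the country distribution later. The sketch below illustrates the effect on hypothetical toy data; the `sample` values are invented for the example.

```python
import pandas as pd

# Hypothetical toy data: two of five countries are missing
sample = pd.DataFrame(
    {'country': ['India', None, 'United States', 'United States', None]}
)

# value_counts ignores NaN by default, so this is the mode of the known values
most_common = sample['country'].value_counts().idxmax()
share_before = sample['country'].value_counts(normalize=True)[most_common]

# Fill the gaps and recompute the share
sample['country'] = sample['country'].fillna(most_common)
share_after = sample['country'].value_counts(normalize=True)[most_common]

print(most_common, round(share_before, 2), round(share_after, 2))
```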

In [17]:
netflix.country.value_counts()

United States                                     2655
India                                              894
United Kingdom                                     331
Japan                                              221
South Korea                                        180
                                                  ... 
United States, Philippines                           1
Argentina, Uruguay, Serbia                           1
Hong Kong, Taiwan                                    1
Chile, France                                        1
Ireland, Canada, United States, United Kingdom       1
Name: country, Length: 626, dtype: int64

Some rows list more than one country; we'll keep only the first country listed.

In [18]:
netflix['country'] = netflix['country'].apply(lambda x: x.split(',')[0].strip())

In [19]:
netflix.country.value_counts()

United States     2954
India              927
United Kingdom     499
Canada             234
Japan              231
                  ... 
Georgia              1
Zimbabwe             1
Jamaica              1
Paraguay             1
Belarus              1
Name: country, Length: 79, dtype: int64
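
The same "first country" extraction can also be written with the vectorized `.str` accessor instead of `apply`. This is a minimal sketch on a hypothetical Series, not a replacement for the cell above.

```python
import pandas as pd

# Hypothetical values, including one with leading whitespace
sample = pd.Series(
    ['United States, Philippines', 'Chile, France', 'India', ' Hong Kong, Taiwan']
)

# Split on commas, take the first element, and trim surrounding whitespace
first_country = sample.str.split(',').str[0].str.strip()

print(first_country.tolist())
```

Unlike the lambda version, the `.str` methods pass NaN through unchanged, so this form would also work before the missing values are filled.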

## Deleting irrelevant data

In [20]:
netflix.head()

Unnamed: 0,show_id,type,title,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,TV Show,3%,"João Miguel, Bianca Comparato, Michel Gomes, R...",Brazil,"August 14, 2020",2020,TV-MA,4 Seasons,"International TV Shows, TV Dramas, TV Sci-Fi &...",In a future where the elite inhabit an island ...
1,s2,Movie,7:19,"Demián Bichir, Héctor Bonilla, Oscar Serrano, ...",Mexico,"December 23, 2016",2016,TV-MA,93 min,"Dramas, International Movies",After a devastating earthquake hits Mexico Cit...
2,s3,Movie,23:59,"Tedd Chan, Stella Chung, Henley Hii, Lawrence ...",Singapore,"December 20, 2018",2011,R,78 min,"Horror Movies, International Movies","When an army recruit is found dead, his fellow..."
3,s4,Movie,9,"Elijah Wood, John C. Reilly, Jennifer Connelly...",United States,"November 16, 2017",2009,PG-13,80 min,"Action & Adventure, Independent Movies, Sci-Fi...","In a postapocalyptic world, rag-doll robots hi..."
4,s5,Movie,21,"Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar...",United States,"January 1, 2020",2008,PG-13,123 min,Dramas,A brilliant group of students become card-coun...


In [21]:
netflix.drop(['show_id', 'title', 'duration', 'description'], axis = 1, inplace = True)
netflix.columns

Index(['type', 'cast', 'country', 'date_added', 'release_year', 'rating',
       'listed_in'],
      dtype='object')

## Adding new columns

We'll add 2 new columns:
- Month_Added --> the month of the year in which the movie/show was added to Netflix
- Year_Added --> the year in which the movie/show was added to Netflix

In [22]:
netflix['Month_Added'] = netflix['date_added'].apply(lambda x: x.split(',')[0].strip().split(' ')[0])

In [23]:
netflix['Year_Added'] = netflix['date_added'].apply(lambda x: x.split(',')[1].strip())

In [24]:
netflix.drop(['date_added'], axis = 1, inplace = True)
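
An alternative to splitting the date strings by hand is to parse them with `pd.to_datetime` and read the month and year from the resulting datetimes. The sketch below assumes every date follows the "Month day, year" pattern seen in the preview above; the `sample` Series is hypothetical.

```python
import pandas as pd

# Hypothetical values in the same "Month day, year" format as date_added
sample = pd.Series(['August 14, 2020', ' January 1, 2020', 'December 23, 2016'])

# Trim stray whitespace, then parse with an explicit format
parsed = pd.to_datetime(sample.str.strip(), format='%B %d, %Y')

month_added = parsed.dt.month_name()  # month as a name, e.g. 'August'
year_added = parsed.dt.year           # year as an integer

print(month_added.tolist(), year_added.tolist())
```

One difference from the string-splitting approach is that the year comes out as an integer rather than a string, which makes sorting and plotting by year more direct.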

## Renaming Columns

In [25]:
netflix.rename(columns={'type': 'Type', 
                        'cast': 'Cast', 
                        'country': 'Country', 
                        'release_year': 'Year_Released', 
                        'rating': 'Rating',
                        'listed_in': 'Genres'}, inplace = True)

# Data Visualization

Objectives:
- Actor/actress who has appeared in the most Netflix movies or TV shows (possibly split into 2 categories: movies and TV shows)
- The most common country (see if the distribution differs between TV shows and movies)
- Season of the year with the most additions to the catalogue
- Number of releases per year (evolution over the years)
- Distribution of TV ratings
- Most common genre (movie and TV show) -> see evolution over the years
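
For the first objective, each 'Cast' entry holds several comma-separated names, so the names need to be separated before they can be counted. The following is one possible sketch using `explode`, on a hypothetical toy DataFrame with the renamed 'Type' and 'Cast' columns; the actor names are placeholders.

```python
import pandas as pd

# Hypothetical toy data using the renamed columns
sample = pd.DataFrame({
    'Type': ['Movie', 'TV Show', 'Movie'],
    'Cast': ['Actor A, Actor B', 'Actor B, Actor C', 'Actor B'],
})

# Turn each comma-separated string into a list, then give every name its own row
exploded = (
    sample.assign(Cast=sample['Cast'].str.split(','))
          .explode('Cast')
          .reset_index(drop=True)
)
exploded['Cast'] = exploded['Cast'].str.strip()

# Appearances overall, and split by Movie / TV Show
overall = exploded['Cast'].value_counts()
per_type = exploded.groupby('Type')['Cast'].value_counts()

print(overall.idxmax(), overall.max())
```

The same split-and-explode pattern would apply to the 'Genres' column, which is also stored as a comma-separated string.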