### TV Shows and Movies listed on Netflix

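This dataset consists of TV shows and movies available on Netflix as of 2019. It was collected from Flixable, a third-party Netflix search engine. Note that some `date_added` values in the preview below fall in 2020, so this copy of the file appears to be a later snapshot.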

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
netflix_data = pd.read_csv('netflix_titles.csv')
netflix_data.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,TV Show,3%,,"João Miguel, Bianca Comparato, Michel Gomes, R...",Brazil,"August 14, 2020",2020,TV-MA,4 Seasons,"International TV Shows, TV Dramas, TV Sci-Fi &...",In a future where the elite inhabit an island ...
1,s2,Movie,7:19,Jorge Michel Grau,"Demián Bichir, Héctor Bonilla, Oscar Serrano, ...",Mexico,"December 23, 2016",2016,TV-MA,93 min,"Dramas, International Movies",After a devastating earthquake hits Mexico Cit...
2,s3,Movie,23:59,Gilbert Chan,"Tedd Chan, Stella Chung, Henley Hii, Lawrence ...",Singapore,"December 20, 2018",2011,R,78 min,"Horror Movies, International Movies","When an army recruit is found dead, his fellow..."
3,s4,Movie,9,Shane Acker,"Elijah Wood, John C. Reilly, Jennifer Connelly...",United States,"November 16, 2017",2009,PG-13,80 min,"Action & Adventure, Independent Movies, Sci-Fi...","In a postapocalyptic world, rag-doll robots hi..."
4,s5,Movie,21,Robert Luketic,"Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar...",United States,"January 1, 2020",2008,PG-13,123 min,Dramas,A brilliant group of students become card-coun...


#### The tasks we will work through are:
1. Movie Recommendation System
2. What to watch on Netflix?
3. How to find the best-rated movie on Netflix?
4. Show me the Ratings!
5. Top Actors/Actresses, Directors, Genres and Countries.

But first, we will explore the data itself as part of our **Data Understanding** analysis.

In [9]:
# Count the UNIQUE VALUES in several columns

columns_ = ['show_id','type','director','country','release_year','rating','duration','listed_in']

for val in columns_:
    print('Column {} has the below UNIQUE VALUES: '.format(val))
    print(netflix_data[val].nunique())

Column show_id has the below UNIQUE VALUES: 
7787
Column type has the below UNIQUE VALUES: 
2
Column director has the below UNIQUE VALUES: 
4049
Column country has the below UNIQUE VALUES: 
681
Column release_year has the below UNIQUE VALUES: 
73
Column rating has the below UNIQUE VALUES: 
14
Column duration has the below UNIQUE VALUES: 
216
Column listed_in has the below UNIQUE VALUES: 
492
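
The same counts can be obtained without a loop. The block below is a minimal sketch on a hypothetical three-row stand-in for `netflix_data` (the `sample` frame is illustrative, not the real file); it relies on `DataFrame.nunique()`, which returns one count per column and ignores missing values by default.

```python
import pandas as pd

# Hypothetical stand-in for netflix_data; only a few columns are included.
sample = pd.DataFrame({
    "show_id": ["s1", "s2", "s3"],
    "type": ["TV Show", "Movie", "Movie"],
    "director": [None, "Jorge Michel Grau", "Gilbert Chan"],
    "rating": ["TV-MA", "TV-MA", "R"],
})

# One unique-value count per selected column; NaN/None is excluded (dropna=True).
unique_counts = sample[["show_id", "type", "director", "rating"]].nunique()
print(unique_counts)
```

Because missing values are excluded, a column such as `director` can report fewer unique values than there are distinct rows.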


In [38]:
# 'cast' and 'listed_in' pack several values into one comma-separated string,
# so split each of them into one column per value.
# Note: splitting on ',' leaves a leading space on every value after the first.

df1 = netflix_data['cast'].str.split(',', expand=True)
df1.columns = ['cast_{}'.format(x+1) for x in df1.columns]
# Count the non-null cast columns to get the number of cast members per title
df1['cast_count'] = df1.count(axis=1)

# Do the same for the 'listed_in' (genre) column
df2 = netflix_data['listed_in'].str.split(',', expand=True)
df2.columns = ['listed_in_{}'.format(x+1) for x in df2.columns]
df2['listed_in_count'] = df2.count(axis=1)

# Concatenate the new columns onto the original frame
f1 = pd.concat([netflix_data, df1], axis=1)
f2 = pd.concat([f1, df2], axis=1)
my_netflix_data = f2.copy()

# Drop the original columns now that they have been split
my_netflix_data = my_netflix_data.drop(columns=['cast', 'listed_in'])


# Cast sizes differ between titles (one may list 5 people, another 15), so the split leaves many NaNs.
# Fill them with 0 for readability. This also turns a missing director or country into 0.

my_netflix_data = my_netflix_data.fillna(0)
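
As an alternative to the wide `cast_1 ... cast_50` layout, a long format with one row per value is often easier to count and group. The block below is a sketch of that different technique (`str.split` followed by `DataFrame.explode`), not what the cell above does; the `sample` frame and the `genre` column name are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical stand-in using three 'listed_in' values from the preview above.
sample = pd.DataFrame({
    "title": ["3%", "7:19", "21"],
    "listed_in": [
        "International TV Shows, TV Dramas, TV Sci-Fi & Fantasy",
        "Dramas, International Movies",
        "Dramas",
    ],
})

# Split each string into a list, then give every list element its own row.
genres_long = sample.assign(genre=sample["listed_in"].str.split(",")).explode("genre")

# Remove the leading space that splitting on ',' leaves behind.
genres_long["genre"] = genres_long["genre"].str.strip()

# Number of titles per genre.
genre_counts = genres_long["genre"].value_counts()
print(genre_counts)
```

This layout avoids the NaN padding entirely, at the cost of repeating the other columns once per value.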

In [39]:
my_netflix_data.head()

Unnamed: 0,show_id,type,title,director,country,date_added,release_year,rating,duration,description,...,cast_46,cast_47,cast_48,cast_49,cast_50,cast_count,listed_in_1,listed_in_2,listed_in_3,listed_in_count
0,s1,TV Show,3%,0,Brazil,"August 14, 2020",2020,TV-MA,4 Seasons,In a future where the elite inhabit an island ...,...,0,0,0,0,0,11,International TV Shows,TV Dramas,TV Sci-Fi & Fantasy,3
1,s2,Movie,7:19,Jorge Michel Grau,Mexico,"December 23, 2016",2016,TV-MA,93 min,After a devastating earthquake hits Mexico Cit...,...,0,0,0,0,0,6,Dramas,International Movies,0,2
2,s3,Movie,23:59,Gilbert Chan,Singapore,"December 20, 2018",2011,R,78 min,"When an army recruit is found dead, his fellow...",...,0,0,0,0,0,9,Horror Movies,International Movies,0,2
3,s4,Movie,9,Shane Acker,United States,"November 16, 2017",2009,PG-13,80 min,"In a postapocalyptic world, rag-doll robots hi...",...,0,0,0,0,0,9,Action & Adventure,Independent Movies,Sci-Fi & Fantasy,3
4,s5,Movie,21,Robert Luketic,United States,"January 1, 2020",2008,PG-13,123 min,A brilliant group of students become card-coun...,...,0,0,0,0,0,12,Dramas,0,0,1


In [46]:
# Titles with no country or director are of little use for the analysis that follows.
# So, let's drop the rows where either is missing (missing values were filled with 0 above)

my_netflix_data = my_netflix_data[my_netflix_data['director']!=0]
my_netflix_data = my_netflix_data[my_netflix_data['country']!=0]
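
The filter above depends on the earlier `fillna(0)` step using 0 as a placeholder. A minimal sketch of another route, assuming the missing values are still NaN, is to count them with `isna().sum()` and remove them with `dropna(subset=...)`. The `sample` frame and its titles are made up for illustration.

```python
import pandas as pd

# Hypothetical rows; None marks a missing director or country.
sample = pd.DataFrame({
    "title": ["A", "B", "C"],
    "director": [None, "Jorge Michel Grau", "Gilbert Chan"],
    "country": ["Brazil", "Mexico", None],
})

# How many values are missing in each column?
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Keep only the rows where both director and country are present.
complete = sample.dropna(subset=["director", "country"])
print(complete)
```

Dropping on NaN directly avoids confusing a placeholder 0 with a real value.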

In [62]:
# Now, let's find the MOVIES and TV SHOWS with the largest casts by sorting on cast_count

top_10_cast_movies = my_netflix_data[my_netflix_data['type'] == 'Movie'].sort_values(by = ['cast_count'],ascending = False).head(10)
top_10_cast_tvshows = my_netflix_data[my_netflix_data['type'] == 'TV Show'].sort_values(by = ['cast_count'],ascending = False).head(10)

print('Top 10 Movies with Huge Cast are:')
for movie in top_10_cast_movies['title'].tolist():
    print(movie)
print()
print('Top 10 TV SHOWS with Huge Cast are:')
for tvshows in top_10_cast_tvshows['title'].tolist():
    print(tvshows)


Top 10 Movies with Huge Cast are:
Arthur Christmas
Michael Bolton's Big, Sexy Valentine's Day Special
Movie 43
The Princess and the Frog
John Carter
Race to Witch Mountain
The Help
Bedtime Stories
Avengers: Infinity War
Star Wars: Episode VIII: The Last Jedi

Top 10 TV SHOWS with Huge Cast are:
Afronta! Facing It!
Gotham
Riverdale
The Underclass
Watership Down
Fullmetal Alchemist: Brotherhood
The Minions of Midas
Sleepless Society: Insomnia
Miraculous: Tales of Ladybug & Cat Noir
Hemlock Grove
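
The two sort-and-head expressions above follow the same pattern, so they could be folded into one helper. The sketch below uses `DataFrame.nlargest`; the helper name `top_by_cast` and the `sample` data are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in with made-up titles and cast sizes.
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "type": ["Movie", "Movie", "TV Show", "Movie", "TV Show"],
    "cast_count": [12, 6, 11, 9, 4],
})


def top_by_cast(df, kind, n=10):
    """Return the titles of the n rows of the given type with the largest cast_count."""
    subset = df[df["type"] == kind]
    return subset.nlargest(n, "cast_count")["title"].tolist()


top_movies = top_by_cast(sample, "Movie", n=2)
top_shows = top_by_cast(sample, "TV Show", n=2)
print(top_movies)
print(top_shows)
```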


In [None]:
# Next, we will draw a PIE CHART showing how the MOVIES and TV Shows are distributed across their CATEGORIES
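
One possible shape for that chart is sketched below. It assumes "categories" means the `type` column (Movie vs. TV Show); a genre breakdown would need the split `listed_in` values instead. The `sample` frame is hypothetical, and the `Agg` backend line is only there so the sketch runs without a display (drop it inside a notebook).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the 'type' column of the dataset.
sample = pd.DataFrame({"type": ["Movie", "Movie", "TV Show", "Movie"]})

# Count titles per type; value_counts sorts from most to least frequent.
type_counts = sample["type"].value_counts()

# One wedge per type, labelled with its percentage share.
fig, ax = plt.subplots(figsize=(5, 5))
ax.pie(type_counts.values, labels=type_counts.index, autopct="%1.1f%%", startangle=90)
ax.set_title("Share of Movies vs. TV Shows")
```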