# Netflix Dataset
https://www.kaggle.com/shivamb/netflix-shows

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

In [2]:
df = pd.read_csv('netflix_titles.csv')

In [3]:
df.head()

,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,81145628,Movie,Norm of the North: King Sized Adventure,"Richard Finn, Tim Maltby","Alan Marriott, Andrew Toth, Brian Dobson, Cole...","United States, India, South Korea, China","September 9, 2019",2019,TV-PG,90 min,"Children & Family Movies, Comedies",Before planning an awesome wedding for his gra...
1,80117401,Movie,Jandino: Whatever it Takes,,Jandino Asporaat,United Kingdom,"September 9, 2016",2016,TV-MA,94 min,Stand-Up Comedy,Jandino Asporaat riffs on the challenges of ra...
2,70234439,TV Show,Transformers Prime,,"Peter Cullen, Sumalee Montano, Frank Welker, J...",United States,"September 8, 2018",2013,TV-Y7-FV,1 Season,Kids' TV,"With the help of three human allies, the Autob..."
3,80058654,TV Show,Transformers: Robots in Disguise,,"Will Friedle, Darren Criss, Constance Zimmer, ...",United States,"September 8, 2018",2016,TV-Y7,1 Season,Kids' TV,When a prison ship crash unleashes hundreds of...
4,80125979,Movie,#realityhigh,Fernando Lebrija,"Nesta Cooper, Kate Walsh, John Michael Higgins...",United States,"September 8, 2017",2017,TV-14,99 min,Comedies,When nerdy high schooler Dani finally attracts...
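The preview shows that `listed_in` (like `cast` and `country`) packs several values into one comma-separated string. A minimal sketch of one way to count individual genres, assuming `", "` is the separator throughout; the `toy` frame and the `genre` column name are illustrative and not part of the dataset:

```python
import pandas as pd

# Toy rows copied from the preview above; only the columns needed here.
toy = pd.DataFrame({
    "title": ["Norm of the North: King Sized Adventure", "#realityhigh"],
    "listed_in": ["Children & Family Movies, Comedies", "Comedies"],
})

# Split each comma-separated string into a list, then give every genre its own row.
genres = (
    toy.assign(genre=toy["listed_in"].str.split(", "))
       .explode("genre")
)

# Count how many titles fall under each genre.
genre_counts = genres["genre"].value_counts()
print(genre_counts)
```

The same pattern should apply to `cast` and `country`, although missing values in those columns stay missing after the split.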


In [4]:
df.tail()

,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
6229,80000063,TV Show,Red vs. Blue,,"Burnie Burns, Jason Saldaña, Gustavo Sorola, G...",United States,,2015,NR,13 Seasons,"TV Action & Adventure, TV Comedies, TV Sci-Fi ...","This parody of first-person shooter games, mil..."
6230,70286564,TV Show,Maron,,"Marc Maron, Judd Hirsch, Josh Brener, Nora Zeh...",United States,,2016,TV-MA,4 Seasons,TV Comedies,"Marc Maron stars as Marc Maron, who interviews..."
6231,80116008,Movie,Little Baby Bum: Nursery Rhyme Friends,,,,,2016,,60 min,Movies,Nursery rhymes and original music for children...
6232,70281022,TV Show,A Young Doctor's Notebook and Other Stories,,"Daniel Radcliffe, Jon Hamm, Adam Godley, Chris...",United Kingdom,,2013,TV-MA,2 Seasons,"British TV Shows, TV Comedies, TV Dramas","Set during the Russian Revolution, this comic ..."
6233,70153404,TV Show,Friends,,"Jennifer Aniston, Courteney Cox, Lisa Kudrow, ...",United States,,2003,TV-14,10 Seasons,"Classic & Cult TV, TV Comedies",This hit sitcom follows the merry misadventure...


In [5]:
df.shape

(6234, 12)

In [6]:
print('Number of Columns:', len(df.columns))
print(df.columns)

Number of Columns: 12
Index(['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description'],
      dtype='object')


##### VARIABLES
show_id -> Unique Key that is tagged to each Movie/TV Show

type -> Either 'Movie' or 'TV Show'

title -> Title of the Movie/TV Show

director -> Director of the Movie/TV Show

cast -> Cast of the Movie/TV Show

country -> Country where the Movie/TV Show was produced

date_added -> Date Movie/TV Show was added to Netflix

release_year -> Year the Movie/TV Show was released/screened

rating -> Maturity rating of the Movie/TV Show (e.g. TV-PG, TV-MA)

duration -> Runtime in minutes for Movies, or number of seasons for TV Shows

listed_in -> Genre(s) of the Movie/TV Show

description -> Movie's/TV Show's synopsis
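Because `duration` mixes two units in one text column ("90 min" for Movies, "1 Season" for TV Shows), it is hard to use numerically as-is. A rough sketch of splitting it into a number and a unit, assuming every value follows the "number, space, word" pattern seen in the preview; the column names `duration_value` and `duration_unit` are made up for this example:

```python
import pandas as pd

# Values taken from the head/tail previews above.
toy = pd.DataFrame({
    "type": ["Movie", "TV Show", "TV Show"],
    "duration": ["90 min", "1 Season", "13 Seasons"],
})

# Pull out the leading number and the unit word that follows it.
parts = toy["duration"].str.extract(
    r"^(?P<duration_value>\d+)\s+(?P<duration_unit>\w+)$"
)
parts["duration_value"] = parts["duration_value"].astype(int)

# Normalise "Seasons" to "Season" so the unit has only two levels.
parts["duration_unit"] = parts["duration_unit"].str.replace(r"s$", "", regex=True)

toy = toy.join(parts)
print(toy)
```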

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6234 entries, 0 to 6233
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       6234 non-null   int64 
 1   type          6234 non-null   object
 2   title         6234 non-null   object
 3   director      4265 non-null   object
 4   cast          5664 non-null   object
 5   country       5758 non-null   object
 6   date_added    6223 non-null   object
 7   release_year  6234 non-null   int64 
 8   rating        6224 non-null   object
 9   duration      6234 non-null   object
 10  listed_in     6234 non-null   object
 11  description   6234 non-null   object
dtypes: int64(2), object(10)
memory usage: 584.6+ KB


In [8]:
df['date_added'] = pd.to_datetime(df['date_added'])
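If the plain `pd.to_datetime` call ever trips over inconsistent strings, a more defensive variant is possible. The sketch below assumes the "Month D, YYYY" format seen in the preview and that some values may carry stray whitespace or be missing; the sample values are hypothetical:

```python
import pandas as pd

# Hypothetical sample values: a clean date, one with stray whitespace, and a missing one.
raw = pd.Series(["September 9, 2019", " September 8, 2018", None], name="date_added")

# Strip whitespace, then parse with an explicit format; unparseable or missing
# values become NaT instead of raising.
parsed = pd.to_datetime(raw.str.strip(), format="%B %d, %Y", errors="coerce")

# The .dt accessor exposes parts of the date for later grouping.
year_added = parsed.dt.year
print(parsed)
```

Note that `errors="coerce"` hides bad values silently, so it is worth comparing the null count before and after parsing.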

In [9]:
print('Shape of Movies Sub-Dataset:', df[df['type'] == 'Movie'].shape)
print('Shape of TV Shows Sub-Dataset:', df[df['type'] == 'TV Show'].shape)

Shape of Movies Sub-Dataset: (4265, 12)
Shape of TV Shows Sub-Dataset: (1969, 12)


In [10]:
print('Total Number of Shows (Movies/TV Shows) in Dataset:', df.shape[0])
print('Number of Movies in Dataset:', df[df['type'] == 'Movie'].shape[0])
print('Number of TV Shows in Dataset:', df[df['type'] == 'TV Show'].shape[0])

Total Number of Shows (Movies/TV Shows) in Dataset: 6234
Number of Movies in Dataset: 4265
Number of TV Shows in Dataset: 1969


In [11]:
print('Percentage of Movies in Dataset:', str(100 *  df[df['type'] == 'Movie'].shape[0] / df.shape[0])[:5], '%')
print('Percentage of TV Shows in Dataset:', str(100 *  df[df['type'] == 'TV Show'].shape[0] / df.shape[0])[:5], '%')

Percentage of Movies in Dataset: 68.41 %
Percentage of TV Shows in Dataset: 31.58 %
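The counts and percentages above can also be obtained in one step with `value_counts`. This is a sketch on a hypothetical four-row frame rather than the real dataset; note that `round` rounds to the nearest value, whereas slicing the string with `[:5]` truncates, so the last digit can differ:

```python
import pandas as pd

# Hypothetical miniature of the 'type' column: 3 movies and 1 TV show.
toy = pd.DataFrame({"type": ["Movie", "Movie", "TV Show", "Movie"]})

# Absolute counts per type.
counts = toy["type"].value_counts()

# normalize=True returns fractions; scale to percent and round to 2 decimals.
shares = toy["type"].value_counts(normalize=True).mul(100).round(2)

summary = pd.DataFrame({"count": counts, "percent": shares})
print(summary)
```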


##### INITIAL ANALYSIS
1. Null values are present in director, cast, country, date_added and rating.
2. show_id is unlikely to be useful for analysis beyond identifying individual titles.
3. date_added has been converted to a datetime column, which enables time-series analysis.
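The null values noted in point 1 can be quantified per column. A minimal sketch on three rows taken from the previews above (only a few columns are kept); the `report` frame and its column names are illustrative:

```python
import numpy as np
import pandas as pd

# Three rows from the head/tail previews, reduced to three columns.
toy = pd.DataFrame({
    "title": ["Red vs. Blue", "Little Baby Bum: Nursery Rhyme Friends", "#realityhigh"],
    "director": [np.nan, np.nan, "Fernando Lebrija"],
    "country": ["United States", np.nan, "United States"],
})

# Count missing values per column and express them as a share of all rows.
missing = toy.isna().sum()
missing_pct = toy.isna().mean().mul(100).round(1)

# Keep only columns that actually have gaps, worst first.
report = pd.DataFrame({"missing": missing, "percent": missing_pct})
report = report[report["missing"] > 0].sort_values("missing", ascending=False)
print(report)
```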

##### MOVING FORWARD
1. Split the dataset into Movies and TV Shows (DONE)
2. Work on Movies first, then TV Shows

#### Splitting Dataset and Saving

In [12]:
#df[df['type'] == 'Movie'].to_csv('netflix_movies.csv')

In [13]:
#df[df['type'] == 'TV Show'].to_csv('netflix_tvshows.csv')
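When the two subsets are written out, `to_csv` also writes the row index by default, and that index comes back as an extra `Unnamed: 0` column on reload. The sketch below passes `index=False` to avoid this; it uses a two-row toy frame and a temporary directory (both assumptions for the example) so that nothing in the working folder is overwritten:

```python
import os
import tempfile

import pandas as pd

# Two rows from the previews above, reduced to three columns.
toy = pd.DataFrame({
    "show_id": [80125979, 70153404],
    "type": ["Movie", "TV Show"],
    "title": ["#realityhigh", "Friends"],
})

out_dir = tempfile.mkdtemp()
saved_columns = {}
for kind, filename in [("Movie", "netflix_movies.csv"), ("TV Show", "netflix_tvshows.csv")]:
    path = os.path.join(out_dir, filename)
    # index=False keeps the row index out of the file, so reloading does not
    # produce an extra 'Unnamed: 0' column.
    toy[toy["type"] == kind].to_csv(path, index=False)
    saved_columns[kind] = list(pd.read_csv(path).columns)

print(saved_columns)
```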

#### Basic Analysis on Sample Dataset (10% of Samples)
The analysis itself is in a separate notebook.

In [14]:
# Using train_test_split to create random sample of observations

#from sklearn.model_selection import train_test_split 
#X = df.drop('director', axis= 1)
#y = df['director']
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1)

In [15]:
# X_test and y_test do not exist because the train_test_split cell above is
# commented out, so draw the 10% random sample directly with DataFrame.sample.
sample_df = df.sample(frac=0.1)

In [None]:
#sample_df.to_csv('sample.csv')
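A 10% sample can also be drawn so that it preserves the Movie/TV Show proportions, which a plain random draw only does approximately. This is a sketch of that variant and not what the notebook originally ran; the toy frame, the `stratified_sample` name, and the `random_state` value are arbitrary choices for illustration:

```python
import pandas as pd

# Hypothetical frame with the same 'type' column: 20 movies and 10 TV shows.
toy = pd.DataFrame({
    "show_id": range(30),
    "type": ["Movie"] * 20 + ["TV Show"] * 10,
})

# Draw 10% from each type separately; random_state makes the draw repeatable.
stratified_sample = toy.groupby("type").sample(frac=0.1, random_state=0)
print(stratified_sample["type"].value_counts())
```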