## Netflix EDA

Business Problem

Analyze the data and generate insights that could help Netflix decide which types of shows and movies to produce and how to grow the business in different countries.

In [1]:
# Loading libraries
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns

In [2]:
df = pd.read_csv("netflix.csv") # reading the data

In [3]:
df.head(5)

| | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | s1 | Movie | Dick Johnson Is Dead | Kirsten Johnson | | United States | September 25, 2021 | 2020 | PG-13 | 90 min | Documentaries | As her father nears the end of his life, filmm... |
| 1 | s2 | TV Show | Blood & Water | | Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban... | South Africa | September 24, 2021 | 2021 | TV-MA | 2 Seasons | International TV Shows, TV Dramas, TV Mysteries | After crossing paths at a party, a Cape Town t... |
| 2 | s3 | TV Show | Ganglands | Julien Leclercq | Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi... | | September 24, 2021 | 2021 | TV-MA | 1 Season | Crime TV Shows, International TV Shows, TV Act... | To protect his family from a powerful drug lor... |
| 3 | s4 | TV Show | Jailbirds New Orleans | | | | September 24, 2021 | 2021 | TV-MA | 1 Season | Docuseries, Reality TV | Feuds, flirtations and toilet talk go down amo... |
| 4 | s5 | TV Show | Kota Factory | | Mayur More, Jitendra Kumar, Ranjan Raj, Alam K... | India | September 24, 2021 | 2021 | TV-MA | 2 Seasons | International TV Shows, Romantic TV Shows, TV ... | In a city of coaching centers known to train I... |


In [4]:
print(df.columns)
print(df.shape)
df.info() # check the dtype and the non-null count of each column

Index(['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description'],
      dtype='object')
(8807, 12)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB


In [5]:
df.nunique() # number of unique values in each column

show_id         8807
type               2
title           8807
director        4528
cast            7692
country          748
date_added      1767
release_year      74
rating            17
duration         220
listed_in        514
description     8775
dtype: int64

### Converting columns to the right data type can reduce the size of the data

Compare the memory usage reported by `df.info()` before and after the type conversion: it drops from 825.8+ KB to 676.6+ KB.
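To make the saving concrete, here is a minimal sketch on a hypothetical toy column (`toy` is not part of the notebook). It compares the memory of a low-cardinality string column before and after conversion to `category`. Note that the `+` in the `df.info()` figure means the reported size is a lower bound for object columns; passing `deep=True` measures the strings themselves.

```python
import pandas as pd

# Hypothetical toy data: a low-cardinality string column repeated many times
toy = pd.DataFrame({"type": ["Movie", "TV Show"] * 5000})

# deep=True counts the string objects themselves, not just the references to them
before = toy["type"].memory_usage(deep=True)

# category stores each distinct label once plus small integer codes
after = toy["type"].astype("category").memory_usage(deep=True)

print(f"object: {before} bytes, category: {after} bytes")
```

The gain is largest for columns with few distinct values relative to the number of rows, such as `type` (2 unique values) or `rating` (17 unique values).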

In [6]:
df["date_added"] = pd.to_datetime(df["date_added"])

In [7]:
df["date_added"]

0      2021-09-25
1      2021-09-24
2      2021-09-24
3      2021-09-24
4      2021-09-24
          ...    
8802   2019-11-20
8803   2019-07-01
8804   2019-11-01
8805   2020-01-11
8806   2019-03-02
Name: date_added, Length: 8807, dtype: datetime64[ns]
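Since `date_added` has 8797 non-null values, 10 rows become `NaT` after the conversion. The sketch below shows one defensive way to parse such strings, assuming the raw values follow the "Month day, year" pattern seen in `df.head()` and that some of them may carry stray whitespace (the sample values, including the leading space, are hypothetical).

```python
import pandas as pd

# Hypothetical raw values; the leading space in the second entry is an assumption
raw = pd.Series(["September 25, 2021", " August 4, 2017", None])

# Strip whitespace, parse with an explicit format, and turn unparseable values into NaT
parsed = pd.to_datetime(raw.str.strip(), format="%B %d, %Y", errors="coerce")

# Derived columns that are often handy for trend analysis
year_added = parsed.dt.year
month_added = parsed.dt.month_name()
print(parsed)
```

An explicit `format` avoids per-value format guessing, and `errors="coerce"` keeps the conversion from failing on a single malformed entry.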

In [8]:
df["type"].value_counts()

Movie      6131
TV Show    2676
Name: type, dtype: int64

In [9]:
df["type"]=df["type"].astype("category")

In [10]:
df["rating"] = df["rating"].astype("category")

In [11]:
df["country"] = df["country"].astype("category")

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   show_id       8807 non-null   object        
 1   type          8807 non-null   category      
 2   title         8807 non-null   object        
 3   director      6173 non-null   object        
 4   cast          7982 non-null   object        
 5   country       7976 non-null   category      
 6   date_added    8797 non-null   datetime64[ns]
 7   release_year  8807 non-null   int64         
 8   rating        8803 non-null   category      
 9   duration      8804 non-null   object        
 10  listed_in     8807 non-null   object        
 11  description   8807 non-null   object        
dtypes: category(3), datetime64[ns](1), int64(1), object(7)
memory usage: 676.6+ KB


In [23]:
df.isna().sum()

show_id            0
type               0
title              0
director        2634
cast             825
country          831
date_added        10
release_year       0
rating             4
duration           3
listed_in          0
description        0
dtype: int64
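Raw counts are easier to judge as a share of rows: `director`, for example, is missing in 2634 of 8807 rows, roughly 30%. A minimal sketch of that calculation on a hypothetical toy frame (`toy` and its values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Netflix data
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", np.nan, np.nan, "Y"],
    "rating": ["PG", "TV-MA", np.nan, "PG"],
})

# isna().mean() is the fraction of missing rows per column
missing_pct = (toy.isna().mean() * 100).round(2).sort_values(ascending=False)
print(missing_pct)
```

Applied to `df`, the same expression ranks the columns by how much of their data is missing.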

## String manipulation and missing values

##### Some features, such as cast, director and listed_in, hold several comma-separated values in a single string
##### Unpacking these features into one value per row makes further analysis easier

In [22]:
df[["title","cast","director","listed_in"]]



| | title | cast | director | listed_in |
|---|---|---|---|---|
| 0 | Dick Johnson Is Dead | | Kirsten Johnson | Documentaries |
| 1 | Blood & Water | Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban... | | International TV Shows, TV Dramas, TV Mysteries |
| 2 | Ganglands | Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi... | Julien Leclercq | Crime TV Shows, International TV Shows, TV Act... |
| 3 | Jailbirds New Orleans | | | Docuseries, Reality TV |
| 4 | Kota Factory | Mayur More, Jitendra Kumar, Ranjan Raj, Alam K... | | International TV Shows, Romantic TV Shows, TV ... |
| ... | ... | ... | ... | ... |
| 8802 | Zodiac | Mark Ruffalo, Jake Gyllenhaal, Robert Downey J... | David Fincher | Cult Movies, Dramas, Thrillers |
| 8803 | Zombie Dumb | | | Kids' TV, Korean TV Shows, TV Comedies |
| 8804 | Zombieland | Jesse Eisenberg, Woody Harrelson, Emma Stone, ... | Ruben Fleischer | Comedies, Horror Movies |
| 8805 | Zoom | Tim Allen, Courteney Cox, Chevy Chase, Kate Ma... | Peter Hewitt | Children & Family Movies, Comedies |


In [108]:
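# Split each comma-separated director string into a list of names and strip
# the space left after each comma.
# Note: str(x) turns a missing value (NaN) into the literal string "nan".
dir_con = df["director"].apply(lambda x: [name.strip() for name in str(x).split(",")]).to_list()

# One row per title, one column per director position;
# stack() then yields one row per (title, director) pair.
df_dir = pd.DataFrame(dir_con, index=df["title"]).stack()

# Turn the stacked Series back into a flat two-column DataFrame
df_dir = df_dir.reset_index()
df_dir = df_dir.drop(columns=["level_1"])
df_dir = df_dir.rename(columns={0: "director"})
df_dir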

| | title | director |
|---|---|---|
| 0 | Dick Johnson Is Dead | Kirsten Johnson |
| 1 | Blood & Water | |
| 2 | Ganglands | Julien Leclercq |
| 3 | Jailbirds New Orleans | |
| 4 | Kota Factory | |
| ... | ... | ... |
| 9607 | Zodiac | David Fincher |
| 9608 | Zombie Dumb | |
| 9609 | Zombieland | Ruben Fleischer |
| 9610 | Zoom | Peter Hewitt |
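The same unpacking can also be sketched with `str.split` and `DataFrame.explode`. This is an alternative to the stack-based cell above, not the approach the notebook uses, and it is shown on a hypothetical toy frame (`toy` and the names in it are made up). One difference is that missing directors stay as real `NaN` values rather than becoming the string "nan".

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the same column names as the notebook
toy = pd.DataFrame({
    "title": ["T1", "T2", "T3"],
    "director": ["Ann Lee, Bo Kim", np.nan, "Cy Ray"],
})

# Split on commas into lists; missing values stay NaN instead of becoming "nan"
directors = toy.assign(director=toy["director"].str.split(","))

# explode() emits one row per list element and keeps each NaN as a single row
directors = directors.explode("director", ignore_index=True)

# Remove the space left after each comma
directors["director"] = directors["director"].str.strip()
print(directors)
```

Keeping `NaN` intact means `isna()` and `dropna()` still work on the unpacked column, which matters when the missing values are handled later.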


In [120]:
# Stack the selected columns into one long Series; NaN values are dropped
df[["type","title","director"]].stack()

0     type                       Movie
      title       Dick Johnson Is Dead
      director         Kirsten Johnson
1     type                     TV Show
      title              Blood & Water
                          ...         
8805  title                       Zoom
      director            Peter Hewitt
8806  type                       Movie
      title                     Zubaan
      director             Mozez Singh
Length: 23787, dtype: object