Evaluation will be kept lenient, so make sure you attempt this case study.
It is understandable that you might struggle with getting started on this. Just brainstorm, discuss with peers, or get help from TAs.
There is no right or wrong answer. *We have to get used to dealing with uncertainty in business. This is exactly the skill we want to develop.*

#### About NETFLIX
Netflix is one of the most popular media and video streaming platforms. It has over 10,000 movies and TV shows available on its platform and, as of mid-2021, over 222M subscribers globally. This tabular dataset lists all the movies and TV shows available on Netflix, along with details such as cast, directors, ratings, release year, and duration.

## Business Problem
### Analyze the data and generate insights that could help Netflix in deciding which type of shows/movies to produce and how they can grow the business in different countries

The dataset provided to you consists of a list of all the TV shows/movies available on Netflix:

* Show_id: Unique ID for every movie / TV show
* Type: Identifier - a Movie or TV Show
* Title: Title of the movie / TV show
* Director: Director of the movie/show
* Cast: Actors involved in the movie/show
* Country: Country where the movie/show was produced
* Date_added: Date it was added on Netflix
* Release_year: Actual Release year of the movie/show
* Rating: TV Rating of the movie/show
* Duration: Total Duration - in minutes or number of seasons
* Listed_in: Genre
* Description: The summary description

## Hints

* The exploration should have a goal. As you explore the data, keep in mind that you want to answer which type of shows to produce and how to grow the business.
* Ensure each recommendation is backed by data. The company is looking for data-driven insights, not personal opinions or anecdotes.
* Assume that you are presenting your findings to business executives who have only a basic understanding of data science. Avoid unnecessary technical jargon.
* Start by exploring a few questions: What type of content is available in different countries?
* How has the number of movies released per year changed over the last 20-30 years?
* Comparison of tv shows vs. movies.
* What is the best time to launch a TV show?
* Analysis of actors/directors of different types of shows/movies.
* Has Netflix focused more on TV shows than movies in recent years?
* Understanding what content is available in different countries
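
As a starting point for the year-wise and TV-vs-movie questions above, here is a minimal sketch on a made-up toy frame (the `toy` data and the `per_year` name are assumptions for illustration, not values from the dataset). It counts titles per release year, split by type:

```python
import pandas as pd

# Toy stand-in for the Netflix listing; values are invented for illustration.
toy = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "Movie", "TV Show"],
    "release_year": [2019, 2020, 2020, 2020, 2021],
})

# One row per year, one column per content type, cells hold title counts.
per_year = (
    toy.groupby(["release_year", "type"])
       .size()
       .unstack(fill_value=0)
)
print(per_year)
```

The same table can be fed straight into a line or bar plot to compare the two content types over time.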

## Evaluation Criteria (100 Points):

1. Defining Problem Statement and Analysing basic metrics (10 Points)

2. Observations on the shape of data, data types of all the attributes, conversion of categorical attributes to 'category' (If required), missing value detection, statistical summary (10 Points)

3. Non-Graphical Analysis: Value counts and unique attributes (10 Points)

4. Visual Analysis - Univariate, Bivariate after pre-processing of the data

Note: Pre-processing involves unnesting of the data in columns like Actor, Director, Country

4.1 For continuous variable(s): Distplot, countplot, histogram for univariate analysis (10 Points)

4.2 For categorical variable(s): Boxplot (10 Points)

4.3 For correlation: Heatmaps, Pairplots (10 Points)

5. Missing Value & Outlier check (Treatment optional) (10 Points)

6. Insights based on Non-Graphical and Visual Analysis (10 Points)

6.1 Comments on the range of attributes

6.2 Comments on the distribution of the variables and relationship between them

6.3 Comments for each univariate and bivariate plot

7. Business Insights (10 Points) - Should include patterns observed in the data along with what you can infer from it

8. Recommendations (10 Points) - Actionable items for business. No technical jargon. No complications. Simple action items that everyone can understand

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Raw string so the Windows backslashes are not treated as escape sequences
netflix_raw = pd.read_csv(r"D:\Scaler DSML\Business cases\2-Netflix exploration and visualization\netflix_data.csv")

In [3]:
netflix_raw.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...


In [4]:
netflix_raw.info()
# Columns with NAN values - director, cast, country, date_added, rating, duration 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB


* Drop rows with null values in any column where fewer than 2% of the values are null:
    - duration - 3 null rows
    - rating - 4 null rows
    - date_added - 10 null rows

* The remaining columns with nulls (director, cast, country) will be imputed later.
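
The 2% rule can also be applied programmatically rather than by reading the `info()` output. The sketch below runs on a made-up 100-row frame; the `threshold` value and the names `null_share` and `sparse_null_cols` are assumptions for illustration:

```python
import pandas as pd

# Toy frame: 'rating' is 1% null, 'director' is 30% null (invented values).
n = 100
toy = pd.DataFrame({
    "title": [f"t{i}" for i in range(n)],
    "rating": ["PG"] * 99 + [None],
    "director": ["X"] * 70 + [None] * 30,
})

null_share = toy.isna().mean()   # fraction of nulls per column
threshold = 0.02                 # assumed cutoff: fewer than 2% nulls

# Columns that have some nulls, but fewer than the cutoff.
sparse_null_cols = null_share[(null_share > 0) & (null_share < threshold)].index.tolist()

# Drop only the rows that are null in those sparsely-null columns.
cleaned = toy.dropna(subset=sparse_null_cols)
print(sparse_null_cols, len(cleaned))
```

Columns above the cutoff (here `director`) are left untouched so they can be imputed later instead of losing a large share of rows.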

In [5]:
netflix_raw = netflix_raw.dropna(subset=['duration', 'rating', 'date_added'])
netflix_raw.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8790 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8790 non-null   object
 1   type          8790 non-null   object
 2   title         8790 non-null   object
 3   director      6169 non-null   object
 4   cast          7965 non-null   object
 5   country       7961 non-null   object
 6   date_added    8790 non-null   object
 7   release_year  8790 non-null   int64 
 8   rating        8790 non-null   object
 9   duration      8790 non-null   object
 10  listed_in     8790 non-null   object
 11  description   8790 non-null   object
dtypes: int64(1), object(11)
memory usage: 892.7+ KB


In [6]:
netflix_raw.describe(include='all')

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
count,8790,8790,8790,6169,7965,7961,8790,8790.0,8790,8790,8790,8790
unique,8790,2,8790,4526,7678,748,1765,,14,220,513,8758
top,s1,Movie,Dick Johnson Is Dead,Rajiv Chilaka,David Attenborough,United States,"January 1, 2020",,TV-MA,1 Season,"Dramas, International Movies","Paranormal activity at a lush, abandoned prope..."
freq,1,6126,1,19,19,2809,109,,3205,1791,362,4
mean,,,,,,,,2014.183163,,,,
std,,,,,,,,8.825466,,,,
min,,,,,,,,1925.0,,,,
25%,,,,,,,,2013.0,,,,
50%,,,,,,,,2017.0,,,,
75%,,,,,,,,2019.0,,,,


* Change the dtypes of individual columns to appropriate dtypes.
* Modify the 'duration' column by removing the string part and keeping only the number. The unit then depends on the type:
    - Movie - minutes
    - TV Show - seasons
* Extract the month and year from the 'date_added' column.
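
A minimal sketch of these conversions on a two-row toy frame (invented values). It assumes every `date_added` string follows the "Month day, year" pattern seen in the sample rows, possibly with stray leading whitespace; the explicit `format` argument is that assumption made concrete:

```python
import pandas as pd

toy = pd.DataFrame({
    "type": ["Movie", "TV Show"],
    "duration": ["90 min", "2 Seasons"],
    "date_added": ["September 25, 2021", " September 24, 2021"],
})

# Keep only the leading number, then convert it to a numeric dtype.
toy["duration"] = pd.to_numeric(
    toy["duration"].str.extract(r"(\d+)", expand=False)
)

# Parse the date once, then derive both year and month name from it.
added = pd.to_datetime(toy["date_added"].str.strip(), format="%B %d, %Y")
toy["Year_added"] = added.dt.year
toy["Month_added"] = added.dt.month_name()
print(toy)
```

Converting `duration` to a number (rather than leaving it as a digit string) is what makes later histograms and summary statistics possible.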

In [7]:
# Raw string keeps \d as a regex token; the result is still a string column of digits
netflix_raw['duration'] = netflix_raw['duration'].str.extract(r'(\d+)')
netflix_raw.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...


In [8]:
netflix_raw['Year_added'] = pd.DatetimeIndex(netflix_raw['date_added']).year
netflix_raw['Month_added'] = pd.DatetimeIndex(netflix_raw['date_added']).month_name()
netflix_raw.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,Year_added,Month_added
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90,Documentaries,"As her father nears the end of his life, filmm...",2021,September
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t...",2021,September
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...,2021,September
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo...",2021,September
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...,2021,September


In [9]:
netflix_raw.isna().sum()

show_id            0
type               0
title              0
director        2621
cast             825
country          829
date_added         0
release_year       0
rating             0
duration           0
listed_in          0
description        0
Year_added         0
Month_added        0
dtype: int64

In [10]:
netflix_raw = netflix_raw.convert_dtypes()
netflix_raw.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8790 entries, 0 to 8806
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8790 non-null   string
 1   type          8790 non-null   string
 2   title         8790 non-null   string
 3   director      6169 non-null   string
 4   cast          7965 non-null   string
 5   country       7961 non-null   string
 6   date_added    8790 non-null   string
 7   release_year  8790 non-null   Int64 
 8   rating        8790 non-null   string
 9   duration      8790 non-null   string
 10  listed_in     8790 non-null   string
 11  description   8790 non-null   string
 12  Year_added    8790 non-null   Int64 
 13  Month_added   8790 non-null   string
dtypes: Int64(2), string(12)
memory usage: 1.0 MB


### Unnesting columns
    cast
    director
    country
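
Since the same split-and-explode steps apply to all three columns, they can be wrapped in one function. This is a sketch only: `unnest` is a hypothetical helper name, and the toy frame uses invented rows. Unlike a plain split on `','`, it also strips the leading space left after each comma:

```python
import pandas as pd

def unnest(df, id_col, list_col, sep=","):
    """Return one row per (id, item) pair for a comma-separated column."""
    out = df[[id_col, list_col]].copy()          # copy avoids chained-assignment warnings
    out[list_col] = out[list_col].str.split(sep) # string -> list of items
    out = out.explode(list_col)                  # one row per list item
    out[list_col] = out[list_col].str.strip()    # drop whitespace around names
    return out.reset_index(drop=True)

toy = pd.DataFrame({
    "show_id": ["s1", "s2"],
    "cast": [None, "Ama Qamata, Khosi Ngema"],
})
long_cast = unnest(toy, "show_id", "cast")
print(long_cast)
```

Rows with a missing value are kept as a single missing row, so the null count per title is preserved for later imputation.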

In [11]:
# Take an explicit copy so the column assignment below does not raise SettingWithCopyWarning
df_cast = netflix_raw[['show_id','cast']].copy()
df_cast['cast'] = df_cast['cast'].str.split(pat=',')
df_cast = df_cast.explode(column='cast')
df_cast


Unnamed: 0,show_id,cast
0,s1,
1,s2,Ama Qamata
1,s2,Khosi Ngema
1,s2,Gail Mabalane
1,s2,Thabang Molaba
...,...,...
8806,s8807,Manish Chaudhary
8806,s8807,Meghna Malik
8806,s8807,Malkeet Rauni
8806,s8807,Anita Shabdish


In [12]:
# Unnest directors the same way as cast: one row per (show_id, director)
df_director = netflix_raw[['show_id','director']].copy()
df_director['director'] = df_director['director'].str.split(pat=',')
df_director = df_director.explode(column='director')
df_director

In [13]:
country_split = netflix_raw['country'].str.split(pat=',')
df_country = pd.concat([netflix_raw[['show_id']],country_split], axis=1)
df_country = df_country.explode(column='country')
df_country

Unnamed: 0,show_id,country
0,s1,United States
1,s2,South Africa
2,s3,
3,s4,
4,s5,India
...,...,...
8802,s8803,United States
8803,s8804,
8804,s8805,United States
8805,s8806,United States


In [14]:
## Using str.split() and .stack()

cast_split1 = netflix_raw['cast'].str.split(pat=',',expand=True)
df1_cast = pd.concat([netflix_raw[['show_id']],cast_split1], axis=1)
df1_cast.set_index('show_id',inplace=True)
df1_cast = df1_cast.stack()
df1_cast = df1_cast.reset_index()
df1_cast.drop('level_1', axis=1,inplace=True)
df1_cast

Unnamed: 0,show_id,0
0,s2,Ama Qamata
1,s2,Khosi Ngema
2,s2,Gail Mabalane
3,s2,Thabang Molaba
4,s2,Dillon Windvogel
...,...,...
64011,s8807,Manish Chaudhary
64012,s8807,Meghna Malik
64013,s8807,Malkeet Rauni
64014,s8807,Anita Shabdish


* Next, handle the NaN values in the 'director', 'cast' and 'country' columns.
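
One simple option, shown here as a sketch and not necessarily the approach this notebook ends up taking, is to fill the gaps with explicit placeholder labels so that missing entries remain visible as their own category in counts and plots. The placeholder strings and the toy rows are assumptions for illustration:

```python
import pandas as pd

toy = pd.DataFrame({
    "show_id": ["s1", "s2", "s3"],
    "director": ["Kirsten Johnson", None, "Julien Leclercq"],
    "cast": [None, "Ama Qamata, Khosi Ngema", None],
    "country": ["United States", "South Africa", None],
})

# Assumed placeholder labels, one per column.
placeholders = {
    "director": "Unknown Director",
    "cast": "Unknown Cast",
    "country": "Unknown Country",
}

# fillna with a dict fills each listed column with its own value.
filled = toy.fillna(value=placeholders)
print(filled)
```

Placeholders avoid inventing a director or country for a title, at the cost of an extra "Unknown" bucket that should be excluded or called out when ranking.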

In [15]:
netflix_raw['cast'].head(20)

0                                                  <NA>
1     Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...
2     Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...
3                                                  <NA>
4     Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...
5     Kate Siegel, Zach Gilford, Hamish Linklater, H...
6     Vanessa Hudgens, Kimiko Glenn, James Marsden, ...
7     Kofi Ghanaba, Oyafunmike Ogunlano, Alexandra D...
8     Mel Giedroyc, Sue Perkins, Mary Berry, Paul Ho...
9     Melissa McCarthy, Chris O'Dowd, Kevin Kline, T...
10                                                 <NA>
11    Sukollawat Kanarot, Sushar Manaying, Pavarit M...
12    Luna Wedler, Jannis Niewöhner, Milan Peschel,...
13    Klara Castanho, Lucca Picon, Júlia Gomes, Mar...
14                                                 <NA>
15    Logan Browning, Brandon P. Bell, DeRon Horton,...
16                                                 <NA>
17    Luis Ernesto Franco, Camila Sodi, Sergio G

In [16]:
netflix_raw.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,Year_added,Month_added
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90,Documentaries,"As her father nears the end of his life, filmm...",2021,September
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t...",2021,September
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...,2021,September
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo...",2021,September
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...,2021,September


In [17]:
cast_list = netflix_raw[['country','cast']]
cast_list.head()
#full_director_list = netflix_raw['cast'].fillna(netflix_raw.groupby(by='listed_in')['cast'].mode())

Unnamed: 0,country,cast
0,United States,
1,South Africa,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban..."
2,,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi..."
3,,
4,India,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K..."


In [None]:
# cast popularity in each country
df
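
A possible way to approach cast popularity per country, sketched on invented toy rows: unnest both columns, join them on `show_id`, and count distinct titles per (country, cast) pair. The helper name `explode_col` and the output column `titles` are hypothetical, and titles with a missing country or cast are simply dropped here, which is an assumption:

```python
import pandas as pd

toy = pd.DataFrame({
    "show_id": ["s1", "s2", "s3"],
    "country": ["India, United States", "India", None],
    "cast": ["A, B", "A", "C"],
})

def explode_col(df, col):
    """One row per (show_id, item); rows with a missing value are dropped."""
    out = df[["show_id", col]].dropna().copy()
    out[col] = out[col].str.split(",")
    out = out.explode(col)
    out[col] = out[col].str.strip()
    return out

# Every (country, cast member) combination that shares a title.
pairs = explode_col(toy, "country").merge(explode_col(toy, "cast"), on="show_id")

# Number of distinct titles per actor within each country, most frequent first.
popularity = (
    pairs.groupby(["country", "cast"])["show_id"]
         .nunique()
         .sort_values(ascending=False)
         .reset_index(name="titles")
)
print(popularity)
```

Note that a co-production is credited to every listed country, so one title can count toward an actor's total in several countries.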