<h1 style="color: #E50914;" align="center">Netflix Shows and Movies: Analysis</h1>

**`Short introduction:`** <br>

`This dataset contains over 8,000 unique titles available on Netflix as of September 2021, with 12 columns of information.`

In [2]:
import pandas as pd
import numpy as np

<h3 style="color: #333333;">Importing and observing data:</h3> 

In [3]:
netflix = pd.read_csv('./netflix_titles.csv')
netflix.sample(5)

,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...
5,s6,TV Show,Midnight Mass,Mike Flanagan,"Kate Siegel, Zach Gilford, Hamish Linklater, H...",,"September 24, 2021",2021,TV-MA,1 Season,"TV Dramas, TV Horror, TV Mysteries",The arrival of a charismatic young priest brin...
6,s7,Movie,My Little Pony: A New Generation,"Robert Cullen, José Luis Ucha","Vanessa Hudgens, Kimiko Glenn, James Marsden, ...",,"September 24, 2021",2021,PG,91 min,Children & Family Movies,Equestria's divided. But a bright-eyed hero be...
7,s8,Movie,Sankofa,Haile Gerima,"Kofi Ghanaba, Oyafunmike Ogunlano, Alexandra D...","United States, Ghana, Burkina Faso, United Kin...","September 24, 2021",1993,TV-MA,125 min,"Dramas, Independent Movies, International Movies","On a photo shoot in Ghana, an American model s..."
8,s9,TV Show,The Great British Baking Show,Andy Devonshire,"Mel Giedroyc, Sue Perkins, Mary Berry, Paul Ho...",United Kingdom,"September 24, 2021",2021,TV-14,9 Seasons,"British TV Shows, Reality TV",A talented batch of amateur bakers face off in...
9,s10,Movie,The Starling,Theodore Melfi,"Melissa McCarthy, Chris O'Dowd, Kevin Kline, T...",United States,"September 24, 2021",2021,PG-13,104 min,"Comedies, Dramas",A woman adjusting to life after a loss contend...


In [3]:
#Showing information about the dataset: columns, types, non-null counts etc.
netflix.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB


In [4]:
#Renaming listed_in column to genre
netflix.rename(columns={'listed_in': 'genre'}, inplace=True)

In [5]:
#Dropping description column as it is not needed
netflix.drop(['description'], axis=1, inplace=True)

In [6]:
#Checking null values in columns(%)
round(netflix.isnull().sum()/(len(netflix))*100, 2)

show_id          0.00
type             0.00
title            0.00
director        29.91
cast             9.37
country          9.44
date_added       0.11
release_year     0.00
rating           0.05
duration         0.03
genre            0.00
dtype: float64

In [7]:
#Checking for duplicate values
netflix.duplicated().sum()

0

In [8]:
#Checking unique values in all columns
netflix.nunique()

show_id         8807
type               2
title           8807
director        4528
cast            7692
country          748
date_added      1767
release_year      74
rating            17
duration         220
genre            514
dtype: int64

<h3 style="color: #333333;">What is the ratio of movies vs. TV shows on Netflix?</h3>

In [9]:
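#Look the shares up by label instead of relying on the order value_counts returns
type_share = netflix['type'].value_counts(normalize=True) * 100
movies, shows = type_share['Movie'], type_share['TV Show']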
print(f'Movies make up {round(movies, 2)}% of content, while tv shows make up remaining {round(shows,2)}% of content on Netflix')

Movies make up 69.62% of content, while tv shows make up remaining 30.38% of content on Netflix


<h3 style="color: #333333;">How much content was added each year?</h3> <br>

`First we converted the date_added column to datetime and extracted the year as a string.
To convert the year to an integer, we first replaced NaN values with zeros; we then inserted the result into the original dataframe as a new column, year_added.`

In [10]:
year = netflix['date_added'].astype('datetime64[ns]').dt.strftime('%Y').copy()
year.loc[year.isna()] = 0
year = year.astype(int)
netflix.insert(7, 'year_added', year)
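
The year can also be extracted numerically, without the string round trip. The sketch below is one possible alternative, not the method used in this notebook. It runs on a small made-up `sample` frame, assumes the dates follow the "Month D, YYYY" pattern seen in the preview, and keeps missing dates as `<NA>` in a nullable `Int64` column instead of using zero as a placeholder.

```python
import pandas as pd

# Made-up rows for illustration; the real column comes from netflix_titles.csv
sample = pd.DataFrame({
    'show_id': ['s1', 's2', 's3'],
    'date_added': ['September 25, 2021', ' August 4, 2017', None],
})

# Strip stray whitespace, then parse; unparseable or missing values become NaT
parsed = pd.to_datetime(sample['date_added'].str.strip(),
                        format='%B %d, %Y', errors='coerce')

# Nullable integer dtype keeps missing years as <NA> rather than 0
sample['year_added'] = parsed.dt.year.astype('Int64')
print(sample)
```

With this representation a later `groupby('year_added')` skips the missing rows on its own, so no separate filtering step for zeros is needed.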

`Then we made a copy of a subset of the original dataframe. To keep the`**`zeros`**`(which represent NaN values) out of the grouping, we filtered out and dropped the rows containing them. We then grouped by`**`year_added`**`and used the count method to count how many movies and TV shows were added each year, sorted in descending order.`

In [11]:
df = netflix[['show_id', 'year_added']].copy()
filt = (df['year_added'] == 0)
df.drop(df[filt].index, inplace=True)
df.groupby('year_added')['show_id'].count().sort_values(ascending=False)

year_added
2019    2016
2020    1879
2018    1649
2021    1498
2017    1188
2016     429
2015      82
2014      24
2011      13
2013      11
2012       3
2008       2
2009       2
2010       1
Name: show_id, dtype: int64

<h3 style="color: #333333;">How much content was added each month in 2019, split by movies and TV shows?</h3><br>

`We extracted the month from date_added and copied the subset of the data that was added in 2019. We then built two tables counting how many movies and how many TV shows were added each month and joined them on the month index. For a nicer view, we relabelled the month index with month names instead of numbers.`

In [12]:
month = netflix['date_added'].astype('datetime64[ns]').dt.strftime('%m')

#Inserting month series into original dataframe as month_added
netflix.insert(8, 'month_added', month)

#df already carries month_added, because it is copied after the column is inserted
df = netflix[(netflix['year_added'] == 2019)].copy()
movies_added = df[df['type'] == 'Movie'].groupby('month_added').agg(movies_added=('show_id', 'count'))
shows_added =  df[df['type'] == 'TV Show'].groupby('month_added').agg(shows_added=('show_id', 'count'))
table = pd.concat([movies_added, shows_added], axis=1)

In [13]:
month_map = {'01': 'Jan', '02': 'Feb', '03': 'Mar', '04': 'Apr', '05': 'May', '06': 'Jun',
             '07': 'Jul', '08': 'Aug', '09': 'Sep', '10': 'Oct', '11': 'Nov', '12': 'Dec'}
table.rename(index=month_map)

month_added,movies_added,shows_added
Jan,116,37
Feb,103,45
Mar,119,53
Apr,119,43
May,91,48
Jun,122,46
Jul,98,59
Aug,87,44
Sep,86,37
Oct,128,65
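
A cross-tabulation is another way to get both counts in one step. The following is a minimal sketch on made-up rows, not the notebook's own approach; `sample`, `MONTHS` and `monthly` are illustrative names, and the dates are assumed to follow the "Month D, YYYY" pattern.

```python
import pandas as pd

MONTHS = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# Made-up rows for illustration
sample = pd.DataFrame({
    'type': ['Movie', 'TV Show', 'Movie', 'Movie'],
    'date_added': ['January 1, 2019', 'January 15, 2019',
                   'March 2, 2019', 'June 5, 2020'],
})

dates = pd.to_datetime(sample['date_added'], format='%B %d, %Y')
in_2019 = dates.dt.year == 2019

# Rows: month number, columns: content type, cells: number of titles
monthly = pd.crosstab(dates[in_2019].dt.month, sample.loc[in_2019, 'type'])

# Replace month numbers with short names, keeping calendar order
monthly.index = [MONTHS[m - 1] for m in monthly.index]
print(monthly)
```

Because the month is kept numeric until the last step, the rows stay in calendar order and months with no titles of one type show up as 0 rather than NaN.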


<h3 style="color: #333333;">Ratio of movies produced in the 20th century compared to the 21st on Netflix</h3>

In [14]:
#Movies produced in 20th century
tw = netflix[(netflix['type'] == 'Movie') & (netflix['release_year'] < 2001)]['show_id'].count()
tw = (tw/len(netflix[(netflix['type'] == 'Movie')])) * 100

In [15]:
#Movies produced in 21st century
tf = netflix[(netflix['type'] == 'Movie') & (netflix['release_year'] > 2000)]['show_id'].count()
tf = (tf/len(netflix[(netflix['type'] == 'Movie')])) * 100

In [16]:
print(f'20th century movies make up {round(tw, 2)}%, while 21st century movies make up remaining {round(tf,2)}% of movie content on Netflix')

20th century movies make up 8.3%, while 21st century movies make up remaining 91.7% of movie content on Netflix
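
The same two percentages can be read off a single boolean mask, since the mean of a boolean Series is the share of `True` values. This is a small sketch on made-up rows. It assumes, as in the cells above, that every movie released in 2000 or earlier counts as 20th century, and `sample`, `share_20th` and `share_21st` are illustrative names.

```python
import pandas as pd

# Made-up rows for illustration
sample = pd.DataFrame({
    'type': ['Movie', 'Movie', 'Movie', 'Movie', 'TV Show'],
    'release_year': [1993, 2020, 2021, 2000, 1967],
})

movie_years = sample.loc[sample['type'] == 'Movie', 'release_year']

# Mean of a boolean mask = fraction of movies released up to and including 2000
share_20th = (movie_years <= 2000).mean() * 100
share_21st = 100 - share_20th
print(round(share_20th, 2), round(share_21st, 2))
```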


<h3 style="color: #333333;">What are the five oldest movies and TV shows on Netflix?</h3><br>

`With filtering and sorting, we can show the oldest produced content on Netflix.`

In [17]:
#5 oldest movies on Netflix
filt = (netflix['type'] == 'Movie')
netflix[filt][['title', 'director', 'release_year']].sort_values(by='release_year').head(5)

,title,director,release_year
7790,Prelude to War,Frank Capra,1942
8205,The Battle of Midway,John Ford,1942
8660,Undercover: How to Operate Behind Enemy Lines,John Ford,1943
8763,WWII: Report from the Aleutians,John Huston,1943
8739,Why We Fight: The Battle of Russia,"Frank Capra, Anatole Litvak",1943


In [18]:
#5 oldest TV shows on Netflix
filt = (netflix['type'] == 'TV Show')
netflix[filt][['title', 'director', 'release_year']].sort_values(by='release_year').head(5)

,title,director,release_year
4250,Pioneers: First Women Filmmakers*,,1925
1331,Five Came Back: The Reference Films,,1945
7743,Pioneers of African-American Cinema,"Oscar Micheaux, Spencer Williams, Richard E. N...",1946
8541,The Twilight Zone (Original Series),,1963
8189,The Andy Griffith Show,,1967


<h3 style="color: #333333;">Top 10 longest movies on Netflix</h3>

In [33]:
netflix.sample()

,show_id,type,title,director,cast,country,date_added,year_added,month_added,release_year,rating,duration,genre
3482,s3483,Movie,Deliha 2,Gupse Özay,"Gupse Özay, Eda Ece, Aksel Bonfil, Derya Alabo...",Turkey,"September 27, 2019",2019,9,2018,TV-PG,102 min,"Comedies, International Movies"


In [19]:
#Keep only movies with a known duration
filt = (netflix['type'] == 'Movie') & netflix['duration'].notna()
longest_movies = netflix[filt].copy()
#Pull the minutes out of strings such as '90 min'; the raw string avoids an invalid-escape warning
longest_movies['duration'] = longest_movies['duration'].str.extract(r'(\d+)', expand=False).astype(int)
longest_movies[['title', 'director', 'release_year', 'duration']].sort_values(by='duration', ascending=False).head(10)

,title,director,release_year,duration
4253,Black Mirror: Bandersnatch,,2018,312
717,Headspace: Unwind Your Mind,,2021,273
2491,The School of Mischief,Houssam El-Din Mustafa,1973,253
2487,No Longer kids,Samir Al Asfory,1979,237
2484,Lock Your Girls In,Fouad El-Mohandes,1982,233
2488,Raya and Sakina,Hussein Kamal,1984,230
166,Once Upon a Time in America,Sergio Leone,1984,229
7932,Sangam,Raj Kapoor,1964,228
1019,Lagaan,Ashutosh Gowariker,2001,224
4573,Jodhaa Akbar,Ashutosh Gowariker,2008,214


<h3 style="color: #333333;">Country Analysis: What countries produce the most content?</h3><br>

`The original dataset is copied with only two columns,`**`[show_id, country]`**`. Null values are replaced with`**`value=Unknown`**`. The country column can list several production countries, so using the`**`str.split and explode methods`**`we made sure that each field contains only one country while keeping the same index. We then stripped the surrounding whitespace and grouped the data by country, counting show_id, which is unique for each movie/show.`

In [20]:
#checking null values 
netflix['country'].isna().sum()

831

In [21]:
#Copying and data cleaning
df = netflix[['show_id', 'country']].copy()
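#Assign the result back; an inplace fillna on a selected column may not update df in recent pandas
df['country'] = df['country'].fillna('Unknown')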
df['country'] = df['country'].str.split(',')
df = df.explode('country')
df['country'] = df['country'].str.strip()

In [22]:
#Grouping data and showing top 10 countries with most produced content
country_count = df.groupby(['country'])['show_id'].count().sort_values(ascending=False)
country_count.head(10)

country
United States     3690
India             1046
Unknown            831
United Kingdom     806
Canada             445
France             393
Japan              318
Spain              232
South Korea        231
Germany            226
Name: show_id, dtype: int64
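
The split, explode, strip and count steps recur for several comma-separated columns in this notebook, so they could be wrapped in a small helper. The sketch below is an illustration on made-up rows: `count_multi_valued` is a hypothetical name, not a pandas function. Unlike the cell above, it drops missing values instead of labelling them `Unknown`, and it also discards empty strings left behind by a leading or trailing comma.

```python
import pandas as pd

def count_multi_valued(frame, column, sep=','):
    """Count each value of a separator-delimited text column (hypothetical helper)."""
    values = (frame[column]
              .dropna()          # ignore rows with no value at all
              .str.split(sep)    # 'A, B' -> ['A', ' B']
              .explode()         # one row per list element
              .str.strip())      # remove surrounding whitespace
    values = values[values != '']  # drop empties from stray separators
    return values.value_counts()

# Made-up rows for illustration
sample = pd.DataFrame({
    'show_id': ['s1', 's2', 's3', 's4'],
    'country': ['United States, Ghana', 'India', None, ', France, United States'],
})

counts = count_multi_valued(sample, 'country')
print(counts)
```

This counts occurrences rather than `show_id` values, which gives the same result as long as no title lists the same value twice.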

<h3 style="color: #333333;">Showing content distribution based on Genre</h3>

In [23]:
df = netflix[['show_id', 'genre', 'type']].copy()
df['genre'] = df['genre'].str.split(',')
df = df.explode('genre')
df['genre'] = df['genre'].str.strip()
genre_distribution = df.groupby(['genre'])['show_id'].count().sort_values(ascending=False)
genre_distribution.head(10)

genre
International Movies        2752
Dramas                      2427
Comedies                    1674
International TV Shows      1351
Documentaries                869
Action & Adventure           859
TV Dramas                    763
Independent Movies           756
Children & Family Movies     641
Romantic Movies              616
Name: show_id, dtype: int64

In [24]:
#5 most popular genres in movies
movies = df[df['type'] == 'Movie'].groupby(['genre'])['show_id'].count().sort_values(ascending=False)
movies.head(5)

genre
International Movies    2752
Dramas                  2427
Comedies                1674
Documentaries            869
Action & Adventure       859
Name: show_id, dtype: int64

In [25]:
#5 most popular genres in TV shows
shows = df[df['type'] == 'TV Show'].groupby(['genre'])['show_id'].count().sort_values(ascending=False)
shows.head(5)

genre
International TV Shows    1351
TV Dramas                  763
TV Comedies                581
Crime TV Shows             470
Kids' TV                   451
Name: show_id, dtype: int64
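
Both per-type rankings can also come from a single grouped count. This is a minimal sketch on made-up rows; it assumes the genre column has already been split and exploded to one genre per row, and `sample`, `per_type` and `top_per_type` are illustrative names.

```python
import pandas as pd

# Made-up rows for illustration, already one genre per row
sample = pd.DataFrame({
    'type': ['Movie', 'Movie', 'Movie', 'TV Show', 'TV Show', 'TV Show'],
    'genre': ['Dramas', 'Dramas', 'Comedies', 'TV Dramas', 'TV Dramas', "Kids' TV"],
})

# Genre counts within each type, sorted from most to least frequent per type
per_type = sample.groupby('type')['genre'].value_counts()

# Keep the first n rows of each type (n=1 here; use 5 for a top-five list)
top_per_type = per_type.groupby(level='type').head(1)
print(top_per_type)
```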

<h3 style="color: #333333;">Showing content distribution based on the rating</h3>

In [26]:
netflix['rating'].value_counts().sort_values(ascending=False)

TV-MA       3207
TV-14       2160
TV-PG        863
R            799
PG-13        490
TV-Y7        334
TV-Y         307
PG           287
TV-G         220
NR            80
G             41
TV-Y7-FV       6
NC-17          3
UR             3
74 min         1
84 min         1
66 min         1
Name: rating, dtype: int64
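
Three of the rating values above (`74 min`, `84 min`, `66 min`) look like durations, and the duration column is missing exactly three values. The sketch below shows one way such entries could be moved back, under the unverified assumption that the two sets of rows coincide. It only touches rows where the rating matches a minutes pattern and the duration is missing; the sample rows and titles are made up.

```python
import pandas as pd

# Made-up rows for illustration
sample = pd.DataFrame({
    'title': ['Title A', 'Title B', 'Title C'],
    'rating': ['TV-MA', '74 min', 'PG'],
    'duration': ['90 min', None, '2 Seasons'],
})

# Rating looks like '<digits> min' AND duration is missing
misplaced = (sample['rating'].str.fullmatch(r'\d+ min', na=False)
             & sample['duration'].isna())

# Move the value into duration and mark the rating as missing
sample.loc[misplaced, 'duration'] = sample.loc[misplaced, 'rating']
sample.loc[misplaced, 'rating'] = None
print(sample)
```

Running a check like this before the rating counts would keep duration strings out of the distribution.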

<h3 style="color: #333333;">TV show distribution based on the number of seasons they have</h3>

`After filtering the dataset down to TV shows, we used the str.extract method to pull out the number of seasons and cast it to int. The resulting dataset was grouped by seasons, with show_id as the column being counted.`

In [27]:
filt = (netflix['type'] == 'TV Show')
longest_shows = netflix[filt].copy()
#Pull the season count out of strings such as '2 Seasons'
longest_shows['seasons'] = longest_shows['duration'].str.extract(r'(\d+)', expand=False).astype(int)
longest_shows.groupby('seasons').agg(tv_shows=('show_id', 'count'))


seasons,tv_shows
1,1793
2,425
3,199
4,95
5,65
6,33
7,23
8,17
9,9
10,7


<h3 style="color: #333333;">Top 10 most popular movie directors based on amount of content.</h3>

In [28]:
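#checking null values
#Note: filt is a boolean mask and never contains nulls, so this check always returns 0;
#filt.sum() would count the movies that have no director listed
filt = ((netflix['type'] == 'Movie') & (netflix['director'].isna()))
filt.isnull().sum()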

0

In [29]:
netflix[netflix['type'] == 'Movie'].groupby('director')['show_id'].count().sort_values(ascending=False).head(10)

director
Rajiv Chilaka             19
Raúl Campos, Jan Suter    18
Suhas Kadav               16
Marcus Raboy              15
Jay Karas                 14
Cathy Garcia-Molina       13
Martin Scorsese           12
Jay Chapman               12
Youssef Chahine           12
Steven Spielberg          11
Name: show_id, dtype: int64

<h3 style="color: #333333;">Most popular actors on Netflix</h3>

In [30]:
#Extracting and cleaning data
df = netflix[['show_id', 'cast']].copy()
df.dropna(subset='cast', inplace=True)
df['cast'] = df['cast'].str.split(',')
df = df.explode('cast')
df['cast'] = df['cast'].str.strip()

In [31]:
#Grouping and counting actors' credits
pop_actors = df.groupby('cast')['show_id'].count().sort_values(ascending=False)
pop_actors.head(15)

cast
Anupam Kher         43
Shah Rukh Khan      35
Julie Tejwani       33
Naseeruddin Shah    32
Takahiro Sakurai    32
Rupa Bhimani        31
Om Puri             30
Akshay Kumar        30
Yuki Kaji           29
Paresh Rawal        28
Amitabh Bachchan    28
Boman Irani         27
Vincent Tong        26
Rajesh Kava         26
Kareena Kapoor      25
Name: show_id, dtype: int64

In [32]:
netflix.sample(5)

,show_id,type,title,director,cast,country,date_added,year_added,month_added,release_year,rating,duration,genre
3750,s3751,TV Show,Leila,,"Huma Qureshi, Siddharth, Rahul Khanna, Arif Za...",India,"June 14, 2019",2019,6,2019,TV-MA,1 Season,"British TV Shows, International TV Shows, TV D..."
1848,s1849,TV Show,Half & Half,,"Rachel True, Essence Atkins, Telma Hopkins, Ch...",United States,"October 15, 2020",2020,10,2005,TV-14,4 Seasons,TV Comedies
2618,s2619,Movie,Küçük Esnaf,Bedran Güzel,"İbrahim Büyükak, Zeynep Koçak, Gupse Özay, Cen...",Turkey,"April 28, 2020",2020,4,2016,TV-MA,100 min,"Comedies, International Movies"
8543,s8544,Movie,The Unborn Child,Poj Arnon,"Somchai Kemglad, Pitchanart Sakakorn, Chinarad...",Thailand,"July 30, 2018",2018,7,2011,TV-MA,93 min,"Horror Movies, International Movies"
2247,s2248,Movie,Mama's Boy,Amro Salah,"Hesham Maged, Shikoo, Mohammed Tharwat, Mahmou...",Egypt,"July 10, 2020",2020,7,2018,TV-14,101 min,"Comedies, International Movies"
