#WORLD OPTIMIZATION
#EXERCISE

Name: William Ona


This report explores the Netflix titles dataset (netflix1.csv) with Pandas, NumPy, and Plotly. It summarizes the columns and then charts trends by type, director, country, date added, rating, duration, and category.

#1.Data loading and basic information

In [1]:
import pandas as pd
df = pd.read_csv('netflix1.csv')
df.head()

,show_id,type,title,director,country,date_added,release_year,rating,duration,listed_in
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,United States,9/25/2021,2020,PG-13,90 min,Documentaries
1,s3,TV Show,Ganglands,Julien Leclercq,France,9/24/2021,2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act..."
2,s6,TV Show,Midnight Mass,Mike Flanagan,United States,9/24/2021,2021,TV-MA,1 Season,"TV Dramas, TV Horror, TV Mysteries"
3,s14,Movie,Confessions of an Invisible Girl,Bruno Garotti,Brazil,9/22/2021,2021,TV-PG,91 min,"Children & Family Movies, Comedies"
4,s8,Movie,Sankofa,Haile Gerima,United States,9/24/2021,1993,TV-MA,125 min,"Dramas, Independent Movies, International Movies"


In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8790 entries, 0 to 8789
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8790 non-null   object
 1   type          8790 non-null   object
 2   title         8790 non-null   object
 3   director      8790 non-null   object
 4   country       8790 non-null   object
 5   date_added    8790 non-null   object
 6   release_year  8790 non-null   int64 
 7   rating        8790 non-null   object
 8   duration      8790 non-null   object
 9   listed_in     8790 non-null   object
dtypes: int64(1), object(9)
memory usage: 686.8+ KB


In [3]:
df.isnull().sum()

show_id         0
type            0
title           0
director        0
country         0
date_added      0
release_year    0
rating          0
duration        0
listed_in       0
dtype: int64

In [4]:
df.describe()

,release_year
count,8790.0
mean,2014.183163
std,8.825466
min,1925.0
25%,2013.0
50%,2017.0
75%,2019.0
max,2021.0


NumPy is imported next so that `np.number` can be used to exclude the numeric columns from the summary, leaving only the text (object) columns.

In [5]:
import numpy as np

In [6]:
df.describe(exclude= np.number)

,show_id,type,title,director,country,date_added,rating,duration,listed_in
count,8790,8790,8790,8790,8790,8790,8790,8790,8790
unique,8790,2,8787,4528,86,1713,14,220,513
top,s1,Movie,9-Feb,Not Given,United States,1/1/2020,TV-MA,1 Season,"Dramas, International Movies"
freq,1,6126,2,2588,3240,110,3205,1791,362
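
The numeric / non-numeric split that `describe(exclude=np.number)` relies on can also be inspected directly with `select_dtypes`. The sketch below is only an illustration on a small made-up frame (`toy` is hypothetical, not the Netflix data).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Netflix data.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "type": ["Movie", "TV Show", "Movie"],
    "release_year": [2019, 2020, 2021],
})

# Columns that describe() summarizes by default (numeric dtypes).
numeric_cols = toy.select_dtypes(include=np.number).columns.tolist()

# Columns that describe(exclude=np.number) summarizes instead.
other_cols = toy.select_dtypes(exclude=np.number).columns.tolist()

print(numeric_cols)
print(other_cols)
```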


#2.Delete unnecessary columns


In [7]:
# nulls% is the share of missing rows per column: the mean of the boolean null mask, times 100.
pd.DataFrame({'count': df.shape[0],'Nulls': df.isnull().sum(),'nulls%': df.isnull().mean()*100,'cardinality':df.nunique()})

,count,Nulls,nulls%,cardinality
show_id,8790,0,0,8790
type,8790,0,0,2
title,8790,0,0,8787
director,8790,0,0,4528
country,8790,0,0,86
date_added,8790,0,0,1713
release_year,8790,0,0,74
rating,8790,0,0,14
duration,8790,0,0,220
listed_in,8790,0,0,513
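
A per-column summary like the one above can be wrapped in a small helper. This is a minimal sketch, assuming a hypothetical function name `column_summary` and a made-up frame `toy`. The null percentage is computed as the mean of the boolean null mask times 100, so a column with half its values missing reports 50.

```python
import pandas as pd

def column_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Per-column row count, null count, null percentage, and cardinality."""
    return pd.DataFrame({
        "count": len(frame),                      # same row count for every column
        "nulls": frame.isnull().sum(),            # number of missing values
        "nulls%": frame.isnull().mean() * 100,    # share of missing values, in percent
        "cardinality": frame.nunique(),           # distinct non-null values
    })

# Hypothetical toy data with two missing directors out of four rows.
toy = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "Movie"],
    "director": ["X", None, "Y", None],
})
summary = column_summary(toy)
print(summary)
```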


In [8]:
df.drop(columns=['show_id'],inplace=True)

#3.Compare between 2 types

In [11]:
count_types = df['type'].value_counts()
count_types

Movie      6126
TV Show    2664
Name: type, dtype: int64
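
Alongside the raw counts, the share of each type is often useful. The sketch below shows one way to get it with `value_counts(normalize=True)`. The series `types` is made-up illustration data, not the Netflix column.

```python
import pandas as pd

# Hypothetical stand-in for df['type'].
types = pd.Series(["Movie", "Movie", "TV Show", "Movie"], name="type")

# Absolute counts per category.
counts = types.value_counts()

# Relative frequencies, converted to percentages with one decimal.
shares = types.value_counts(normalize=True).mul(100).round(1)

print(counts)
print(shares)
```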

Import the plotting libraries.

In [9]:
import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt
import plotly.graph_objects as go
from plotly.offline import iplot, plot
from plotly.subplots import make_subplots

In [10]:
# Palette reused by the charts below (hex codes must not contain stray spaces).
colors = ["#8c0404","#f25ed0","#000000","#16A085","#34495E",
           "#21618C","#512E5F","#45B39D","#AAB7B8","#20B2AA",
           "#FF69B4","#00CED1","#FF7F50","#7FFF00","#DA70D6"]

In [12]:
iplot(px.bar(count_types, text_auto=True, color= count_types.index, color_discrete_sequence = colors, title='Compare Between Two types', labels= dict(index='count_types', value='count')))

In [14]:
iplot(px.pie(values=count_types.values, names=count_types.index, color_discrete_sequence= colors[7:9], title='Movies vs TV Shows').update_traces(textinfo='value+percent'))

#4.Directors

In [15]:
directors = df['director'].value_counts()
directors

Not Given                         2588
Rajiv Chilaka                       20
Alastair Fothergill                 18
Raúl Campos, Jan Suter              18
Suhas Kadav                         16
                                  ... 
Matt D'Avella                        1
Parthiban                            1
Scott McAboy                         1
Raymie Muzquiz, Stu Livingston       1
Mozez Singh                          1
Name: director, Length: 4528, dtype: int64

In [17]:
# Look up the placeholder by label rather than by position.
given_directors = directors.sum() - directors['Not Given']
print(f'given directors = {given_directors}')

given directors = 6202
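
Because the missing directors are stored as the text placeholder 'Not Given', `isnull()` reports zero nulls for this column. One way to count them explicitly is a boolean mask, sketched below on a made-up series (`director` here is hypothetical illustration data).

```python
import pandas as pd

# Hypothetical stand-in for df['director'].
director = pd.Series(["Not Given", "A", "B", "Not Given", "A"])

# True where the placeholder is used instead of a real name.
missing = director.eq("Not Given")

n_missing = int(missing.sum())
n_given = int((~missing).sum())

# Optionally turn the placeholder into a real missing value so isna() sees it.
as_nulls = director.replace("Not Given", pd.NA)
n_nulls = int(as_nulls.isna().sum())

print(n_missing, n_given, n_nulls)
```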


In [18]:
# Values are listed in the same order as the names: given first, then not given.
iplot(px.pie(values=[given_directors, directors['Not Given']],
             names=['Given Directors', 'Not Given Directors'],
             title='Given Directors vs Not Given Directors',
             color_discrete_sequence=['#B81D24','#221F1F']).update_traces(textinfo='value+percent'))

In [19]:
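# Top 10 named directors: skip position 0, which is the 'Not Given' placeholder.
top_directors = directors.iloc[1:11]
px.bar(x=top_directors.values,
       y=top_directors.index,
       color=top_directors.index,
       text_auto = True,
       labels= dict(x='Number of titles', y='Directors', color='Directors'),
       orientation='h')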

#5.Countries

In [20]:
countries = df['country'].value_counts()[:10]
countries

United States     3240
India             1057
United Kingdom     638
Pakistan           421
Not Given          287
Canada             271
Japan              259
South Korea        214
France             213
Spain              182
Name: country, dtype: int64

In [21]:
country_type = df.groupby(['country','type']).size().unstack(fill_value=0).reset_index()
country_type['Total'] = country_type['Movie'] + country_type['TV Show']
country_type = country_type[country_type['country'] != 'Not Given']
country_type = country_type.sort_values(by='Total', ascending= False)
colors=['#B81d24','#221F1F']
fig = px.bar(country_type.head(10), x='country', y=['Movie','TV Show'],
             labels={'value': 'Count', 'variable':'type'},
             title= 'Top 10 countries and their movies and TV shows',
             barmode='group',
             # Keys must match the column names exactly ('TV Show', not 'Tv Show').
             color_discrete_map={key:value for key, value in zip(['Movie', 'TV Show'], colors)})
fig.update_traces(marker=dict(line=dict(width=4)))
fig.show()
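
The country-by-type table built with `groupby(...).size().unstack()` can also be produced with `pd.crosstab`. This is a sketch of that alternative on made-up rows (`toy` is hypothetical), not a replacement for the cell above.

```python
import pandas as pd

# Hypothetical toy data with the same two columns used above.
toy = pd.DataFrame({
    "country": ["US", "US", "India", "US", "India", "France"],
    "type": ["Movie", "TV Show", "Movie", "Movie", "Movie", "TV Show"],
})

# Rows are countries, columns are types, cells are counts (0 where absent).
table = pd.crosstab(toy["country"], toy["type"])

# Add a total per country and rank countries by it.
table["Total"] = table.sum(axis=1)
table = table.sort_values("Total", ascending=False)

print(table)
```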

#6.Date added

In [22]:
df['date_added']=pd.to_datetime(df["date_added"])
df['date_added'].head(10)

0   2021-09-25
1   2021-09-24
2   2021-09-24
3   2021-09-22
4   2021-09-24
5   2021-09-24
6   2021-09-24
7   2021-05-01
8   2021-09-23
9   2021-05-01
Name: date_added, dtype: datetime64[ns]

In [23]:
print(df['date_added'].min())
print(df['date_added'].max())

2008-01-01 00:00:00
2021-09-25 00:00:00
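
The dates in this file look like month/day/year strings. When that is known, passing an explicit `format` to `pd.to_datetime` avoids ambiguous parsing, and `errors='coerce'` turns unparseable values into `NaT` instead of raising. The sketch below assumes that format and uses a made-up series `raw`.

```python
import pandas as pd

# Hypothetical raw values; the last one is deliberately malformed.
raw = pd.Series(["9/25/2021", "1/1/2008", "not a date"])

# Parse as month/day/year; invalid entries become NaT.
parsed = pd.to_datetime(raw, format="%m/%d/%Y", errors="coerce")

# Count how many values could not be parsed.
n_bad = int(parsed.isna().sum())

print(parsed)
print(n_bad)
```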


In [24]:
release_year = df['release_year'].value_counts()
release_year.head(10)

2018    1146
2017    1030
2019    1030
2020     953
2016     901
2021     592
2015     555
2014     352
2013     286
2012     236
Name: release_year, dtype: int64

In [25]:
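# value_counts() orders by frequency; sort by year so the area chart runs chronologically.
release_year_sorted = release_year.sort_index()
iplot(px.area(x=release_year_sorted.index, y=release_year_sorted.values))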

In [26]:
shows_added_per_year = df.groupby(df['date_added'].dt.year)['type'].count()
shows_added_per_year

date_added
2008       2
2009       2
2010       1
2011      13
2012       3
2013      11
2014      24
2015      82
2016     426
2017    1185
2018    1648
2019    2016
2020    1879
2021    1498
Name: type, dtype: int64
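
The yearly totals above can be broken down by type with a two-key `groupby`. The sketch below is one possible way to do it, shown on made-up rows (`toy` is hypothetical).

```python
import pandas as pd

# Hypothetical toy data with parsed dates and a type column.
toy = pd.DataFrame({
    "date_added": pd.to_datetime(
        ["2019-01-05", "2019-06-01", "2020-03-10", "2020-07-21", "2020-12-31"]
    ),
    "type": ["Movie", "TV Show", "Movie", "Movie", "TV Show"],
})

# Group by calendar year and type, then pivot types into columns.
per_year_type = (
    toy.groupby([toy["date_added"].dt.year, "type"])
       .size()
       .unstack(fill_value=0)
)

print(per_year_type)
```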

In [27]:
iplot(px.line(shows_added_per_year, title='Number of shows added per year', x=shows_added_per_year.index,y=shows_added_per_year,markers=True, line_shape='linear'))

#7.Rating

In [29]:
rating = df['rating'].value_counts()
rating.head(10)

TV-MA    3205
TV-14    2157
TV-PG     861
R         799
PG-13     490
TV-Y7     333
TV-Y      306
PG        287
TV-G      220
NR         79
Name: rating, dtype: int64

In [31]:
iplot(px.bar(rating, title='Shows Rating on Netflix', color= rating.index, orientation='h', height=720, text_auto= True, labels = dict(index='Rating', value='Frequency')))

#8.Duration

In [32]:
duration= df['duration'].value_counts()
duration.head(10)

1 Season     1791
2 Seasons     421
3 Seasons     198
90 min        152
97 min        146
93 min        146
94 min        146
91 min        144
95 min        137
96 min        130
Name: duration, dtype: int64

In [33]:
seasons= df[df['duration'].str.contains('Season')]
seasons_count = seasons['duration'].value_counts()
seasons_count

1 Season      1791
2 Seasons      421
3 Seasons      198
4 Seasons       94
5 Seasons       64
6 Seasons       33
7 Seasons       23
8 Seasons       17
9 Seasons        9
10 Seasons       6
15 Seasons       2
13 Seasons       2
12 Seasons       2
17 Seasons       1
11 Seasons       1
Name: duration, dtype: int64
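
The season labels are strings, so they sort by frequency or alphabetically rather than by number of seasons. One way to get a numeric ordering is to extract the leading integer, as in this sketch on a made-up series (`duration` here is hypothetical).

```python
import pandas as pd

# Hypothetical stand-in for df['duration'].
duration = pd.Series(["1 Season", "10 Seasons", "2 Seasons", "90 min", "2 Seasons"])

# Keep only TV shows, whose duration mentions 'Season'.
is_show = duration.str.contains("Season")

# Pull out the leading number and convert it to an integer.
n_seasons = duration[is_show].str.extract(r"(\d+)", expand=False).astype(int)

# Count shows per season number, ordered numerically (1, 2, 10).
season_counts = n_seasons.value_counts().sort_index()

print(season_counts)
```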

In [34]:
iplot(px.bar(seasons_count, title='Seasons per TV show', color= seasons_count.index, orientation='h', height=720, text_auto=True, labels=dict(index='Seasons', value='Count')))

#9.Categories

In [35]:
categories = df['listed_in'].str.split(', ', expand=True)

categories = categories.melt(value_name='category').dropna()['category']

top_categories = categories.value_counts().head(10)

top_categories

International Movies        2752
Dramas                      2426
Comedies                    1674
International TV Shows      1349
Documentaries                869
Action & Adventure           859
TV Dramas                    762
Independent Movies           756
Children & Family Movies     641
Romantic Movies              616
Name: category, dtype: int64
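
An alternative to `split(expand=True)` followed by `melt` is `Series.explode`, which turns each list element into its own row. The sketch below assumes a pandas version that provides `explode` (0.25 or later) and uses a made-up series `listed_in`. The `str.strip()` call removes the space that follows each comma.

```python
import pandas as pd

# Hypothetical stand-in for df['listed_in'].
listed_in = pd.Series([
    "Dramas, International Movies",
    "Dramas",
    "Comedies, Dramas",
])

# Split into lists, give each category its own row, and trim whitespace.
categories = listed_in.str.split(",").explode().str.strip()

# Count titles per category.
top = categories.value_counts()

print(top)
```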

In [36]:
categories = df['listed_in'].str.split(',', expand=True)

categories

,0,1,2
0,Documentaries,,
1,Crime TV Shows,International TV Shows,TV Action & Adventure
2,TV Dramas,TV Horror,TV Mysteries
3,Children & Family Movies,Comedies,
4,Dramas,Independent Movies,International Movies
...,...,...,...
8785,International TV Shows,TV Dramas,
8786,Kids' TV,,
8787,International TV Shows,Romantic TV Shows,TV Dramas
8788,Kids' TV,,


In [37]:
categories = categories.melt(value_name='category').dropna()['category']
categories

0                   Documentaries
1                  Crime TV Shows
2                       TV Dramas
3        Children & Family Movies
4                          Dramas
                   ...           
26354                 TV Comedies
26355         Science & Nature TV
26356         Science & Nature TV
26358                TV Thrillers
26367                   TV Dramas
Name: category, Length: 19294, dtype: object

In [38]:
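# Splitting on ',' leaves a leading space on every category after the first,
# so strip whitespace before counting; otherwise ' Dramas' and 'Dramas' are tallied separately.
top_categories = categories.str.strip().value_counts().head(10)

top_categories

International Movies        2752
Dramas                      2426
Comedies                    1674
International TV Shows      1349
Documentaries                869
Action & Adventure           859
TV Dramas                    762
Independent Movies           756
Children & Family Movies     641
Romantic Movies              616
Name: category, dtype: int64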

In [39]:
top_categories_df = pd.DataFrame({'Category': top_categories.index, 'Count': top_categories.values})

fig = px.bar(top_categories_df, x='Count', y='Category', orientation='h',
             title='Top 10 Popular Categories for Movies & TV Shows',
             labels={'Count': 'Number of Shows', 'Category': 'Category'},
             # Color by category name; the integer index would give a continuous color scale.
             color='Category',
             text='Count')

fig.show()