# Final Project, Part 3.1

Group members: Phoebe Ling, River Liu, Boyu Zhang, and Shaojun Zheng

## Dataset Information

***What is the "name" of the dataset?***  
"raw titles" data in "Netflix Movies and Series"  

***Where did you obtain it?***  
We obtained it from data.world; its original source is Kaggle.  

***Where can we obtain it? (i.e., URL)***  
https://data.world/gonzandrobles/netflix-movies-and-series/workspace/file?filename=raw_titles.csv   

***What is the license of the dataset?***  
The license is Creative Commons Zero (Public Domain). Therefore, we can reuse, modify, and refine it without restriction, including for commercial purposes.  

***How big is it in file size and in items?***  
The whole dataset is 4.21 MB and contains 5,806 rows (one per title) and 12 columns. We can read it directly from its URL without downloading the file and uploading it to GitHub.

## Explore the dataset and remove the null data

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('https://query.data.world/s/7cdylzj6g2k4a7bahcloent6dhb3mf')

In [3]:
type(df)

pandas.core.frame.DataFrame

In [4]:
df.count()

id                      5806
title                   5805
type                    5806
release_year            5806
age_certification       3196
runtime                 5806
genres                  5806
production_countries    5806
seasons                 2047
imdb_id                 5362
imdb_score              5283
imdb_votes              5267
dtype: int64

In [5]:
df.isnull().sum()     # count the null values in each column

id                         0
title                      1
type                       0
release_year               0
age_certification       2610
runtime                    0
genres                     0
production_countries       0
seasons                 3759
imdb_id                  444
imdb_score               523
imdb_votes               539
dtype: int64

In [6]:
# drop the rows in which the 'title', 'age_certification', or 'imdb_score' column is null
no_null = df.dropna(subset=['title', 'age_certification', 'imdb_score'])    

In [7]:
no_null.isnull().sum()

id                         0
title                      0
type                       0
release_year               0
age_certification          0
runtime                    0
genres                     0
production_countries       0
seasons                 1328
imdb_id                    0
imdb_score                 0
imdb_votes                 9
dtype: int64

In [8]:
no_null.count()

id                      2998
title                   2998
type                    2998
release_year            2998
age_certification       2998
runtime                 2998
genres                  2998
production_countries    2998
seasons                 1670
imdb_id                 2998
imdb_score              2998
imdb_votes              2989
dtype: int64

In [9]:
# remove the rows whose genres or production_countries is an empty list (stored as the string "[]")
no_null = no_null[(no_null['genres'] != "[]") & (no_null['production_countries'] != "[]")]
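
The filter above is needed because an empty list is stored as the literal string "[]", which `isnull()` does not count as missing. The following is a minimal sketch on a made-up toy frame (the titles and values are illustrative assumptions, not rows from the dataset) showing one alternative: mask the "[]" strings as missing first, then use `dropna` as before.

```python
import pandas as pd

# Toy frame mimicking the layout of raw_titles.csv (values are made up).
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genres": ["['drama']", "[]", "['comedy', 'crime']"],
    "production_countries": ["['US']", "['GB']", "[]"],
})

# isnull() finds nothing, because "[]" is an ordinary string.
n_missing = int(toy.isnull().sum().sum())

# Mask the empty-list strings as NaN, then drop them like any other null.
cleaned = toy.mask(toy == "[]").dropna(subset=["genres", "production_countries"])
print(n_missing, cleaned["title"].tolist())
```

Either approach should keep the same rows; the boolean filter used in the notebook is the more direct of the two.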

In [10]:
no_null[no_null['genres'] == "[]"]

Unnamed: 0,id,title,type,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes


In [11]:
no_null[no_null['production_countries'] == "[]"]

Unnamed: 0,id,title,type,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes


In [12]:
no_null.count()

id                      2937
title                   2937
type                    2937
release_year            2937
age_certification       2937
runtime                 2937
genres                  2937
production_countries    2937
seasons                 1621
imdb_id                 2937
imdb_score              2937
imdb_votes              2928
dtype: int64

In [13]:
no_null.shape[0]

2937

In [14]:
no_null.shape[1]

12

In [15]:
no_null.count(numeric_only=True)

release_year    2937
runtime         2937
seasons         1621
imdb_score      2937
imdb_votes      2928
dtype: int64

## Clean Data 

Next, we deal with the multiple values stored in the genres and production_countries columns by stripping the list punctuation.

In [16]:
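# Note: these replacements apply to every string column, so apostrophes in titles are removed too.
no_null = no_null.replace(r'\[', '', regex=True)
no_null = no_null.replace(r'\]', '', regex=True)
no_null = no_null.replace(r"'", '', regex=True)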
no_null

Unnamed: 0,id,title,type,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes
1,tm84618,Taxi Driver,MOVIE,1976,R,113,"crime, drama",US,,tt0075314,8.3,795222.0
2,tm127384,Monty Python and the Holy Grail,MOVIE,1975,PG,91,"comedy, fantasy",GB,,tt0071853,8.2,530877.0
3,tm70993,Life of Brian,MOVIE,1979,R,94,comedy,GB,,tt0079470,8.0,392419.0
4,tm190788,The Exorcist,MOVIE,1973,R,133,horror,US,,tt0070047,8.1,391942.0
5,ts22164,Monty Pythons Flying Circus,SHOW,1969,TV-14,30,"comedy, european",GB,4.0,tt0063929,8.8,72895.0
...,...,...,...,...,...,...,...,...,...,...,...,...
5768,ts309235,Christmas Flow,SHOW,2021,TV-MA,50,"music, romance, comedy",FR,1.0,tt15340790,5.8,702.0
5770,ts307816,Korean Cold Noodle Rhapsody,SHOW,2021,TV-PG,49,documentation,KR,1.0,tt15772846,7.3,15.0
5773,tm982470,Stuck Apart,MOVIE,2021,R,96,"comedy, drama",TR,,tt11213372,6.0,10418.0
5785,ts273317,Pitta Kathalu,SHOW,2021,TV-MA,37,"drama, romance",IN,1.0,tt13879000,5.1,727.0
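
Because the regex replacements run over the whole DataFrame, they also strip apostrophes from titles (for example "Monty Pythons Flying Circus" in the output above). As a hedged alternative, not the approach used in this notebook, the list-like strings could be parsed with `ast.literal_eval` on the two affected columns only. The toy frame below is an illustrative assumption, and `genre_list` is a hypothetical column name.

```python
import ast
import pandas as pd

# Toy frame with one raw-style row (the genres value mimics the CSV format).
toy = pd.DataFrame({
    "title": ["Monty Python's Flying Circus"],
    "genres": ["['comedy', 'european']"],
})

# Parse the string into a real Python list; other columns are left untouched.
toy["genre_list"] = toy["genres"].apply(ast.literal_eval)

# The first list element plays the role of the primary genre.
toy["primary_genre"] = toy["genre_list"].str[0]
print(toy[["title", "genre_list", "primary_genre"]])
```

This keeps the title intact and yields real lists, at the cost of one `apply` call per column.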


### Get primary genre from multiple genres

In [17]:
no_null['primary_genre'] = no_null['genres'].str.partition(',')[0]

### Get primary country from multiple production_countries

In [18]:
no_null['primary_country'] = no_null['production_countries'].str.partition(',')[0]

In [19]:
no_null

Unnamed: 0,id,title,type,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,primary_genre,primary_country
1,tm84618,Taxi Driver,MOVIE,1976,R,113,"crime, drama",US,,tt0075314,8.3,795222.0,crime,US
2,tm127384,Monty Python and the Holy Grail,MOVIE,1975,PG,91,"comedy, fantasy",GB,,tt0071853,8.2,530877.0,comedy,GB
3,tm70993,Life of Brian,MOVIE,1979,R,94,comedy,GB,,tt0079470,8.0,392419.0,comedy,GB
4,tm190788,The Exorcist,MOVIE,1973,R,133,horror,US,,tt0070047,8.1,391942.0,horror,US
5,ts22164,Monty Pythons Flying Circus,SHOW,1969,TV-14,30,"comedy, european",GB,4.0,tt0063929,8.8,72895.0,comedy,GB
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5768,ts309235,Christmas Flow,SHOW,2021,TV-MA,50,"music, romance, comedy",FR,1.0,tt15340790,5.8,702.0,music,FR
5770,ts307816,Korean Cold Noodle Rhapsody,SHOW,2021,TV-PG,49,documentation,KR,1.0,tt15772846,7.3,15.0,documentation,KR
5773,tm982470,Stuck Apart,MOVIE,2021,R,96,"comedy, drama",TR,,tt11213372,6.0,10418.0,comedy,TR
5785,ts273317,Pitta Kathalu,SHOW,2021,TV-MA,37,"drama, romance",IN,1.0,tt13879000,5.1,727.0,drama,IN


# Dashboard

In [20]:
import altair as alt

In [21]:
brush = alt.selection_interval(encodings=['x','y'])

In [22]:
genres = alt.Chart(no_null).mark_bar().encode(
    alt.X('primary_genre', type='ordinal'),
    alt.Y('imdb_score',aggregate='mean',type='quantitative')
).properties(
    title='Average IMDb Scores of Each Genre',
)

text = genres.mark_text().encode(
    # the text channel takes alt.Text rather than a positional channel such as alt.Y
    text=alt.Text(field='imdb_score', aggregate='mean', type='quantitative', format='.2f')
)

genres+text

After processing the dataset, we have two new columns, primary_genre and primary_country. First, we explore the relationship between genre and imdb_score: are titles of some genres rated higher than others? We plotted a bar chart with the primary genre on the x-axis and the average IMDb score on the y-axis. The chart shows that some genres are indeed rated higher by audiences. For example, history and war titles have an average score above 7, while horror titles average below 6.
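
The per-genre averages that the bar chart displays can also be checked numerically. The sketch below uses a small made-up frame (the scores are illustrative assumptions, not values from the dataset); on the real data, `no_null` would take the place of `toy`.

```python
import pandas as pd

# Made-up scores for three genres.
toy = pd.DataFrame({
    "primary_genre": ["war", "war", "horror", "horror", "comedy"],
    "imdb_score": [7.5, 7.1, 5.5, 6.1, 6.4],
})

# Same aggregation as the chart: mean imdb_score per primary genre.
mean_by_genre = (
    toy.groupby("primary_genre")["imdb_score"]
    .mean()
    .round(2)
    .sort_values(ascending=False)
)
print(mean_by_genre)
```

Sorting the result makes the highest and lowest rated genres easy to read off without inspecting the chart labels.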

In [23]:
age = alt.Chart(no_null).mark_bar().encode(
    alt.X('age_certification', type='ordinal'),
    alt.Y('age_certification',aggregate='count',type='quantitative')
).properties(
    title='Count of Films of Each Rating',
)
text_age = age.mark_text().encode(
    # the text channel takes alt.Text rather than a positional channel such as alt.Y
    text=alt.Text(field='age_certification', aggregate='count', type='quantitative')
)
age+text_age

Next, we look at how titles are distributed across age certifications, to see which ratings producers favor. We drew a bar chart with the age certification on the x-axis and the number of titles with each certification on the y-axis. The counts for TV-MA and R are the largest, indicating that producers tend to make R-rated movies (under 17 requires an accompanying parent or adult guardian) and TV-MA shows (programs specifically designed to be viewed by adults). This suggests that content for children is not the main focus of the catalog.
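
Since the bar chart shows raw counts, the share of each certification can be computed separately if proportions are needed. This is a minimal sketch on a made-up Series (the values are illustrative assumptions); on the real data, `no_null['age_certification']` would take the place of `certs`.

```python
import pandas as pd

# Made-up certifications for eight titles.
certs = pd.Series(
    ["TV-MA", "R", "TV-MA", "PG-13", "TV-14", "R", "TV-MA", "G"],
    name="age_certification",
)

# Absolute counts, as in the bar chart, sorted from most to least common.
counts = certs.value_counts()

# Shares of the total, which the chart does not show directly.
shares = certs.value_counts(normalize=True).round(3)
print(counts)
print(shares)
```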

In [24]:
hist1 = alt.Chart(no_null).mark_rect().encode(
    alt.X("primary_genre", type='ordinal'),
    alt.Y("age_certification", type='ordinal'),
    alt.Color("count()")
).properties(
    width=600,
    title = 'Count of Age Certification and Genre'
).add_selection(
    brush
)

hist1

In addition, to understand the relationship between genre and age certification, we drew a heat map. The x-axis represents the primary genre, the y-axis represents the age certification, and the color of each cell encodes the number of titles with that combination of genre and certification. We can see that most drama titles are rated TV-MA, and most thriller titles are rated R.
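
The counts behind the heat map can be tabulated with `pd.crosstab`, which makes the most common certification per genre explicit. The frame below is a made-up illustration, not data from the notebook; on the real data, the two columns of `no_null` would be passed instead.

```python
import pandas as pd

# Made-up genre and certification pairs.
toy = pd.DataFrame({
    "primary_genre": ["drama", "drama", "thriller", "thriller", "comedy"],
    "age_certification": ["TV-MA", "TV-MA", "R", "R", "PG"],
})

# Rows are certifications, columns are genres, cells are title counts.
table = pd.crosstab(toy["age_certification"], toy["primary_genre"])

# For each genre (column), the certification with the largest count.
top_cert = table.idxmax()
print(table)
print(top_cert)
```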

In [25]:
rect = alt.Chart(no_null).mark_rect().encode(
    alt.Y("primary_genre", type='ordinal'),
    alt.X("primary_country", type='ordinal'),
    alt.Color("count()")
).properties(
    title='Count of Genre and Main Production Country',
).add_selection(
    brush
)
rect

In [26]:
# This version is not interactive; we could not determine the reason.
'''
hist = alt.Chart(no_null).mark_bar().transform_filter(
    brush
).encode(
    y = 'mean_score:Q',
    x = 'release_year:O'
).transform_aggregate(
    mean_score='mean(imdb_score)',
    groupby=["release_year"]
)
'''


In [27]:
hist = alt.Chart(no_null).mark_bar().transform_filter(
    brush
).encode(
    x = alt.X(field='imdb_score', aggregate='mean', type='quantitative'),
    y = 'release_year:O'
).properties(
    title='Average IMDb Score by Release Year',
)

In [28]:
text = hist.mark_text(
    align='left',
    baseline='middle',
    dx=3  # Nudges text to right so it doesn't appear on top of the bar
).encode(
    text=alt.Text(field='imdb_score', aggregate='mean', type='quantitative', format='.2f')
)

In [29]:
dashboard = alt.vconcat(rect,hist+text)
dashboard

In [30]:
dashboard.save('final_interactive.json')

## Explanation

### How to use the Dashboard?  
The dashboard shows the number of movies and shows available on Netflix, broken down by primary genre and primary production country. A viewer can drag the mouse over the heat map to select a range of countries and genres. The bar chart below the heat map then shows, for each release year, the average IMDb score of the movies and shows that match the selection.
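
What the brush selection computes can be approximated in pandas: keep the rows whose country and genre fall inside the selected ranges, then average the score per release year. The sketch below uses a made-up frame, and the names `selected_countries` and `selected_genres` are hypothetical stand-ins for whatever range the viewer drags over.

```python
import pandas as pd

# Made-up titles with the columns the dashboard uses.
toy = pd.DataFrame({
    "primary_country": ["US", "US", "GB", "KR", "US"],
    "primary_genre": ["drama", "comedy", "comedy", "drama", "drama"],
    "release_year": [2019, 2019, 2020, 2020, 2020],
    "imdb_score": [7.0, 6.0, 8.0, 7.5, 6.0],
})

# Hypothetical brush: the countries and genres covered by the selection.
selected_countries = ["US", "GB"]
selected_genres = ["comedy", "drama"]

# Equivalent of transform_filter(brush) followed by a mean per release year.
in_brush = (
    toy["primary_country"].isin(selected_countries)
    & toy["primary_genre"].isin(selected_genres)
)
mean_by_year = toy[in_brush].groupby("release_year")["imdb_score"].mean()
print(mean_by_year)
```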

### A list of contextual datasets  
The raw_credits.csv dataset (https://data.world/gonzandrobles/netflix-movies-and-series/workspace/file?filename=raw_credits.csv) may help us tell the story. It lists the actors and directors involved in each movie or show, so we could also plot the relationship between IMDb score and actors or directors.
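
A possible way to combine the two files is sketched below. It assumes that the credits file shares the `id` key with the titles file and has `name` and `role` columns; these column names, and all the rows shown, are assumptions made for illustration and should be checked against the actual file.

```python
import pandas as pd

# Made-up titles, keyed by the same id used in raw_titles.csv.
titles = pd.DataFrame({
    "id": ["tm1", "tm2", "ts3"],
    "title": ["Film One", "Film Two", "Show Three"],
    "imdb_score": [8.0, 6.0, 7.0],
})

# Made-up credits; the column names are assumed, not confirmed.
credits = pd.DataFrame({
    "id": ["tm1", "tm1", "tm2", "ts3", "ts3"],
    "name": ["Dir A", "Actor X", "Dir A", "Dir B", "Actor X"],
    "role": ["DIRECTOR", "ACTOR", "DIRECTOR", "DIRECTOR", "ACTOR"],
})

# Attach each title's score to its credits, then summarize per director.
merged = credits.merge(titles, on="id", how="inner")
director_scores = (
    merged[merged["role"] == "DIRECTOR"]
    .groupby("name")["imdb_score"]
    .agg(["mean", "count"])
)
print(director_scores)
```

Reporting the count alongside the mean helps avoid over-reading averages that rest on a single title.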

# Conclusion

Netflix is one of the most popular over-the-top streaming services in the world. It is interesting to find out what kinds of movies and TV shows it provides and to infer its target audiences. Customers can also judge whether Netflix's service suits them from the results of this analysis. We present the analysis by age certification, genre, and IMDb score.

In the first plot, we explored the quality of movies and shows by showing the average IMDb score of each genre in a bar chart. Certain genres are indeed rated higher by audiences. Audiences who love history and war films may be satisfied with what Netflix provides, but audiences who like horror may be disappointed: horror titles on Netflix have an average score below 6.

In the second plot, we showed the number of titles with each age certification. The counts for TV-MA and R are the largest, which suggests that children and teenagers are not the main target audience. Families with children may not be the target either.

In the third and fourth plots, we explored the relationship between genre and age certification and the relationship between genre and production country using heat maps. These give audiences an overview of genre, production country, and age certification.

Finally, we used a dashboard to show the relationship between primary genre, production country, release year, and the average IMDb score.

These figures can help audiences understand the features of the titles on Netflix and decide whether the service suits them.