## Netflix Movies Data Analysis: EDA and Recommender System

Netflix is the world's leading entertainment streaming service, with over 193 million paid memberships in over 190 countries. Its catalog consists of movies, TV shows or series, and documentaries across different genres and languages. This study seeks to understand the behavior and preferences of its members by utilizing the data that can be derived from the parameters in the dataset. With that, this notebook aims to analyze the data by identifying trends, patterns, and anomalies, and by applying data techniques that can be used to extract valuable conclusions.

The researchers will be using the __netflix dataset__ for our data analysis. The following techniques will be used for the data:
1. Exploratory Data Analysis
2. Confidence Intervals
3. Statistical Inference 
4. Recommender System

In addition, this notebook seeks to answer the following research questions: 
1. What behavior can we conclude from the subscribers of Netflix based on the activities of these users?
2. What kind of movies can entertain more subscribers based on the data?
3. What type of movie would a population prefer based on the popularity of genres in a certain country?
4. Is there a certain threshold of rating for a movie to be shown or produced in a country?
5. Is there a significant difference between the rating of critics and fans based on a span of 15 years?

In [None]:
%pip install plotly

In [2]:
import pandas as pd
import csv
import numpy
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.offline as py
from plotly.offline import iplot, init_notebook_mode
from scipy.stats import norm
import plotly.graph_objs as go
py.init_notebook_mode(connected = True)

The researchers will be using a total of 3 datasets. The first, __netflix_titles.csv__, was collected through Flixable, a third-party Netflix search engine, and contains the descriptive information for every title on Netflix. However, it does not contain user ratings. Therefore, the researchers used two other datasets to supply the information needed for the study, namely the average user rating of each movie and its number of votes.

__title.ratings.tsv__ is a dataset from IMDb, an Amazon-owned online database of ratings and of fan and critic reviews. It consists of the following parameters:

`tconst` - alphanumeric unique identifier of the title
<br> `averageRating` - weighted average of all the individual user ratings per movie
<br> `numVotes` - the total number of votes of a movie

### Read "title.ratings.tsv" file

In [4]:
title_ratings=pd.read_csv("title.ratings.tsv", sep='\t')

In [5]:
title_ratings.head()

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.6,1647
1,tt0000002,6.1,198
2,tt0000003,6.5,1345
3,tt0000004,6.2,120
4,tt0000005,6.2,2131


The other dataset is called __title.basics.tsv__. It is used to look up the title of each movie in the __title.ratings.tsv__ dataset by joining on the variable `tconst`, which appears in both datasets.

The dataset contains the following variables:
<br> `titleType` - type of film (movie)
<br> `tconst` - alphanumeric unique identifier of the title
<br> `primaryTitle` - the more popular title, as used on promotional materials at release
<br> `originalTitle` - the original title, in the original language
<br>`startYear` - release year of the title

### Read "title.basics.tsv" file

In [6]:
title_basics=pd.read_csv("title.basics.tsv", sep='\t')
title_basics=title_basics.drop_duplicates()


Columns (5) have mixed types.Specify dtype option on import or set low_memory=False.
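
The mixed-types warning above appears when pandas infers different types for different chunks of the same column. One way to avoid it is to read every column as text and convert the numeric columns explicitly. The following is a minimal sketch under stated assumptions: the three rows are a hypothetical stand-in for the real file, and `\N` is treated as the missing-value marker, which is how IMDb's dataset documentation describes it.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for title.basics.tsv; the real file is much larger.
sample_tsv = (
    "tconst\ttitleType\tprimaryTitle\toriginalTitle\tstartYear\n"
    "tt0000009\tmovie\tMiss Jerry\tMiss Jerry\t1894\n"
    "tt9999998\tmovie\tUntitled Project\tUntitled Project\t\\N\n"
    "tt9999999\tshort\tSample Short\tSample Short\t2001\n"
)

basics = pd.read_csv(
    io.StringIO(sample_tsv),
    sep="\t",
    dtype=str,        # read every column as text so no column ends up with mixed types
    na_values="\\N",  # assumption: \N marks a missing field
)

# Convert the year explicitly; unparseable or missing values become NaN.
basics["startYear"] = pd.to_numeric(basics["startYear"], errors="coerce")
print(basics[["tconst", "startYear"]])
```

With this approach the later `str(x).isnumeric()` filter could be replaced by a plain `dropna` on `startYear`, although that is a design choice rather than a requirement.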



In [7]:
title_basics=title_basics[['titleType','tconst','primaryTitle', 'originalTitle', 'startYear']]
title_basics=title_basics[title_basics.titleType=='movie']
title_basics=title_basics[title_basics.startYear.apply(lambda x: str(x).isnumeric())]
title_basics.head()

Unnamed: 0,titleType,tconst,primaryTitle,originalTitle,startYear
8,movie,tt0000009,Miss Jerry,Miss Jerry,1894
144,movie,tt0000147,The Corbett-Fitzsimmons Fight,The Corbett-Fitzsimmons Fight,1897
331,movie,tt0000335,Soldiers of the Cross,Soldiers of the Cross,1900
498,movie,tt0000502,Bohemios,Bohemios,1905
570,movie,tt0000574,The Story of the Kelly Gang,The Story of the Kelly Gang,1906


#### Merge "title.ratings.tsv" and "title.basics.tsv" 

In [8]:
ratings_and_titles=pd.merge(title_ratings.set_index('tconst'), title_basics.set_index('tconst'), left_index=True, right_index=True, how='inner')
ratings_and_titles=ratings_and_titles.drop_duplicates()
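
The inner join above keeps only the identifiers found in both tables. As an illustration, the sketch below performs the same kind of join on two tiny hypothetical tables (the values are made up for the example) and uses the `validate` argument of `pd.merge` to check that `tconst` is unique on both sides.

```python
import pandas as pd

# Hypothetical miniature versions of the two IMDb tables.
ratings = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002", "tt0000009"],
    "averageRating": [5.6, 6.1, 5.3],
    "numVotes": [1647, 198, 200],
})
basics = pd.DataFrame({
    "tconst": ["tt0000009", "tt0000147"],
    "primaryTitle": ["Miss Jerry", "The Corbett-Fitzsimmons Fight"],
    "startYear": ["1894", "1897"],
})

# An inner join keeps only identifiers present in both tables.
# validate="one_to_one" raises MergeError if tconst repeats on either side.
merged = pd.merge(ratings, basics, on="tconst", how="inner", validate="one_to_one")
print(merged)
```

Only `tt0000009` survives here, because it is the only identifier that has both a rating and a title record.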

After merging __title.ratings.tsv__ and __title.basics.tsv__, we now read __netflix_titles.csv__.

### Read netflix_titles.csv

In [9]:
netflix_titles=pd.read_csv("netflix_titles.csv", index_col="show_id")

Next, the data is cleaned before it is used for further analysis. The researchers drop the rows without a value in the column `release_year` and ensure that all values in the columns `release_year` and `startYear` are of integer type. Lastly, the movie titles are converted to lowercase so that they share a uniform format when the datasets are joined on title.

#### Drop rows without release_year

In [10]:
netflix_titles=netflix_titles.dropna(subset=['release_year'])

#### Change release_year column to integer

In [11]:
netflix_titles.release_year=netflix_titles.release_year.astype(numpy.int64)

#### Drop rows in ratings_and_titles with non-numeric values for startYear and convert to integer

In [12]:
ratings_and_titles=ratings_and_titles[ratings_and_titles.startYear.apply(lambda x: str(x).isnumeric())]

In [13]:
ratings_and_titles.startYear=ratings_and_titles.startYear.astype(numpy.int64)

#### Convert titles to lowercase

In [14]:
netflix_titles['title']=netflix_titles['title'].str.lower()
ratings_and_titles['originalTitle']=ratings_and_titles['originalTitle'].str.lower()
ratings_and_titles['primaryTitle']=ratings_and_titles['primaryTitle'].str.lower()
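
The cleaning steps above can be collected into one reusable function. This is a sketch rather than the notebook's own code: `clean_titles` is a hypothetical helper, the input rows are invented, and trimming whitespace is an extra step not performed in the cells above.

```python
import pandas as pd

def clean_titles(df, title_col, year_col):
    """Return a copy with integer years and lowercase, trimmed titles."""
    out = df.copy()
    # Non-numeric years (for example a missing-value marker) become NaN.
    out[year_col] = pd.to_numeric(out[year_col], errors="coerce")
    out = out.dropna(subset=[year_col])
    out[year_col] = out[year_col].astype("int64")
    # Lowercase and trim so that titles compare equal across datasets.
    out[title_col] = out[title_col].str.lower().str.strip()
    return out

# Hypothetical input with a mixed-type year column.
raw = pd.DataFrame({
    "title": ["Inception", " The Matrix ", "Unknown Year"],
    "release_year": ["2010", 1999, "\\N"],
})
cleaned = clean_titles(raw, "title", "release_year")
print(cleaned)
```

Working on a copy avoids modifying a filtered view of another DataFrame in place, which is a common source of `SettingWithCopyWarning`.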

After cleaning the data, the researchers can now merge all 3 datasets.

### Join netflix titles with IMDb ratings on title name and release year.

In [15]:
##subset movies
netflix_titles=netflix_titles[netflix_titles.type=='Movie']

In [16]:
netflix_titles_rating=pd.merge(netflix_titles, ratings_and_titles, left_on=['title', 'release_year'], right_on=['primaryTitle', 'startYear'], how='inner')
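
Because the join relies on an exact match of title and year, some Netflix movies will have no IMDb counterpart and are silently dropped by the inner join. A left join with `indicator=True` is one way to see how many rows matched. The sketch below uses small hypothetical tables, not the notebook's data.

```python
import pandas as pd

# Hypothetical miniature tables.
netflix = pd.DataFrame({
    "title": ["inception", "the matrix", "some netflix original"],
    "release_year": [2010, 1999, 2019],
})
imdb = pd.DataFrame({
    "primaryTitle": ["inception", "the matrix", "the matrix"],
    "startYear": [2010, 1999, 2003],
    "averageRating": [8.8, 8.7, 5.0],
})

# how="left" keeps every Netflix row; _merge records whether a match was found.
checked = pd.merge(
    netflix, imdb,
    left_on=["title", "release_year"],
    right_on=["primaryTitle", "startYear"],
    how="left",
    indicator=True,
)
print(checked["_merge"].value_counts())

# Keeping only the matched rows reproduces the result of an inner join.
matched = checked[checked["_merge"] == "both"].drop(columns="_merge")
```

Matching on the year as well as the title is what keeps the two hypothetical "the matrix" entries apart.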

#### Sort the obtained data frame by averageRating and number of votes

In [17]:
netflix_titles_rating.sort_values(by=['averageRating', 'numVotes'], inplace=True, ascending=False)

In [18]:
netflix_titles_rating_2000=netflix_titles_rating[netflix_titles_rating.numVotes>2000]

The merged dataset is named __netflix_titles_rating_2000__. The researchers opted to keep only movies with more than 2,000 votes so that each average rating is based on a reasonably large number of votes. This dataset will now be used for the data analysis.

The following are the parameters of the dataset: 
<br> `type` - a movie or tv show
<br> `title` - the title of a movie or tv show
<br> `director` - director of a movie or tv show
<br> `cast` - actors/actresses involved in the film
<br> `country` - the country where the movie/tv show was produced
<br> `date_added` - the date it was added in Netflix
<br> `release_year` - the year the movie/tv show was originally released
<br> `rating` - the maturity rating of a movie/tv show (e.g., TV-MA, TV-14, TV-PG, PG-13, R)
<br> `duration` - total duration of the movie/tv show in minutes or seasons
<br> `listed_in` - the genre of a movie/tv show
<br> `description` - brief description of a movie/tv show
<br> `averageRating` - weighted average of all the individual user ratings per movie
<br> `numVotes` - the total number of votes of a movie
<br> `titleType` - type of film (movie)
<br> `primaryTitle` - the more popular title, as used on promotional materials at release
<br> `originalTitle` - the original title, in the original language
<br>`startYear` - release year of the title

In [19]:
netflix_titles_rating_2000.head()

Unnamed: 0,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,averageRating,numVotes,titleType,primaryTitle,originalTitle,startYear
1894,Movie,pulp fiction,Quentin Tarantino,"John Travolta, Samuel L. Jackson, Uma Thurman,...",United States,"January 1, 2019",1994,R,154 min,"Classic Movies, Cult Movies, Dramas",This stylized crime caper weaves together stor...,8.9,1782352,movie,pulp fiction,pulp fiction,1994
1854,Movie,the lord of the rings: the return of the king,Peter Jackson,"Elijah Wood, Ian McKellen, Liv Tyler, Viggo Mo...","New Zealand, United States","January 1, 2020",2003,PG-13,201 min,"Action & Adventure, Sci-Fi & Fantasy",Aragorn is revealed as the heir to the ancient...,8.9,1605940,movie,the lord of the rings: the return of the king,the lord of the rings: the return of the king,2003
2836,Movie,schindler's list,Steven Spielberg,"Liam Neeson, Ben Kingsley, Ralph Fiennes, Caro...",United States,"April 1, 2018",1993,R,195 min,"Classic Movies, Dramas",Oskar Schindler becomes an unlikely humanitari...,8.9,1184746,movie,schindler's list,schindler's list,1993
1813,Movie,inception,Christopher Nolan,"Leonardo DiCaprio, Joseph Gordon-Levitt, Ellen...","United States, United Kingdom","January 1, 2020",2010,PG-13,148 min,"Action & Adventure, Sci-Fi & Fantasy, Thrillers","In this mind-bending sci-fi thriller, a man ru...",8.8,2006939,movie,inception,inception,2010
740,Movie,the matrix,"Lilly Wachowski, Lana Wachowski","Keanu Reeves, Laurence Fishburne, Carrie-Anne ...",United States,"November 1, 2019",1999,R,136 min,"Action & Adventure, Sci-Fi & Fantasy",A computer hacker learns that what most people...,8.7,1634375,movie,the matrix,the matrix,1999


Now that the merged dataset is established, the researchers check it for null values.

### Check for NaN values

In [20]:
netflix_titles_rating_2000.isnull().any()

type             False
title            False
director          True
cast              True
country           True
date_added       False
release_year     False
rating           False
duration         False
listed_in        False
description      False
averageRating    False
numVotes         False
titleType        False
primaryTitle     False
originalTitle    False
startYear        False
dtype: bool

In [21]:
nan_vars = netflix_titles_rating_2000.columns[netflix_titles_rating_2000.isnull().any()].tolist()
print(nan_vars)

['director', 'cast', 'country']


The data shows that the columns `director`, `cast`, and `country` contain null values. Before applying the necessary procedure, the researchers should first know the number of null values in the indicated columns.

In [22]:
for variable in nan_vars:
    print(variable, sum(netflix_titles_rating_2000[variable].isnull()))

director 8
cast 53
country 8


This shows that the number of null values in these columns is very low. With that, the researchers opted to drop the rows with null values.

In [23]:
# Drop rows with a missing director, cast, or country
netflix_titles_rating_2000 = netflix_titles_rating_2000.dropna(subset=['director', 'cast', 'country'])
netflix_titles_rating_2000['country']

1894                    United States
1854       New Zealand, United States
2836                    United States
1813    United States, United Kingdom
740                     United States
                    ...              
1478            Russia, United States
2525                            India
905                             India
765                             India
1915                    United States
Name: country, Length: 1532, dtype: object
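
Counting and dropping null values can also be done without an explicit loop. The sketch below uses a small hypothetical DataFrame to show the pattern: `isnull().sum()` gives the count per column, and a single `dropna` call with a list of columns removes every incomplete row.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few missing entries.
df = pd.DataFrame({
    "director": ["A", np.nan, "C", "D"],
    "cast": [np.nan, np.nan, "X", "Y"],
    "country": ["US", "IN", None, "US"],
})

# Number of nulls per column, keeping only columns that have any.
null_counts = df.isnull().sum()
null_counts = null_counts[null_counts > 0]
print(null_counts)

# Drop every row that is missing at least one of the listed columns.
complete = df.dropna(subset=["director", "cast", "country"])
print(len(df) - len(complete), "rows dropped")
```

Comparing the row counts before and after is a quick check that the number of dropped rows is as small as expected.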

Upon observation, the column `listed_in` in __netflix_titles_rating_2000__ consists of multiple genres. With that, the researchers intend to split the genres to be used in analyzing the data for the Exploratory Data Analysis.

### Split Genres 

In [24]:
from itertools import chain

# return list from series of comma-separated strings
def chainer(s):
    return list(chain.from_iterable(s.str.split(',')))

# calculate lengths of splits
lens = netflix_titles_rating_2000['listed_in'].str.split(',').map(len)


# create new dataframe, repeating or chaining as appropriate
res = pd.DataFrame({'title': numpy.repeat(netflix_titles_rating_2000['title'], lens),
                    'averageRating': numpy.repeat(netflix_titles_rating_2000['averageRating'], lens),
                    'listed_in': chainer(netflix_titles_rating_2000['listed_in']),
                    })
res['listed_in']=res['listed_in'].str.strip()

print(res)

                                              title  averageRating  \
1894                                   pulp fiction            8.9   
1894                                   pulp fiction            8.9   
1894                                   pulp fiction            8.9   
1854  the lord of the rings: the return of the king            8.9   
1854  the lord of the rings: the return of the king            8.9   
...                                             ...            ...   
765                                      himmatwala            1.7   
765                                      himmatwala            1.7   
765                                      himmatwala            1.7   
1915                 justin bieber: never say never            1.6   
1915                 justin bieber: never say never            1.6   

                 listed_in  
1894        Classic Movies  
1894           Cult Movies  
1894                Dramas  
1854    Action & Adventure  
1854      Sci-
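
The same one-row-per-genre table can be produced with `DataFrame.explode`, available in pandas 0.25 and later. This is an alternative sketch, not the method used in the cell above, shown on two rows taken from the merged dataset.

```python
import pandas as pd

movies = pd.DataFrame({
    "title": ["pulp fiction", "inception"],
    "averageRating": [8.9, 8.8],
    "listed_in": [
        "Classic Movies, Cult Movies, Dramas",
        "Action & Adventure, Sci-Fi & Fantasy, Thrillers",
    ],
})

# Split each genre string into a list, then give every list element its own row.
long_form = (
    movies.assign(listed_in=movies["listed_in"].str.split(","))
          .explode("listed_in")
)
# Remove the space left after each comma.
long_form["listed_in"] = long_form["listed_in"].str.strip()
print(long_form)
```

As with the `numpy.repeat` approach, the original index is repeated for each genre, so rows can still be traced back to their movie.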

### Exploratory Data Analysis
Exploratory Data Analysis is an approach used to maximize insight into a data set, uncover underlying structure, extract important variables, detect outliers and anomalies, and test underlying assumptions. 

The researchers aim to use this approach to extract the main characteristics of our data. This part contains visual aids to further illustrate significant patterns and trends from the data.

### Top genres
Here, the researchers would like to determine what type of content Netflix releases the most. To do that, a bar graph is used to show the number of movies in each genre and the size of the differences between genres.

In [25]:
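# value_counts() returns a Series: the index holds the genres, the values hold the counts.
# Using .index and .values avoids relying on the column names produced by reset_index(),
# which differ between pandas versions.
temp_df = res['listed_in'].value_counts()

# create trace1
top_genres = go.Bar(
                x = temp_df.index,
                y = temp_df.values,)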
fig = go.Figure(data = [top_genres])
fig.show()

This figure shows the number of movies per genre on Netflix. `Dramas` is the most common genre with 673 movies, followed by `International Movies` and `Comedies` with 526 and 482, respectively.

### Top Rated Movies

The researchers aim to know which genres people would most likely watch based on the number of votes that a movie has. To do this, the researchers used the following columns: `title`, `numVotes`, `listed_in`, and `release_year`.

In [27]:
df1 = netflix_titles_rating_2000.sort_values("numVotes", ascending = False)
display(df1[['title', "numVotes", 'listed_in', 'release_year']][:10])

Unnamed: 0,title,numVotes,listed_in,release_year
1813,inception,2006939,"Action & Adventure, Sci-Fi & Fantasy, Thrillers",2010
1894,pulp fiction,1782352,"Classic Movies, Cult Movies, Dramas",1994
740,the matrix,1634375,"Action & Adventure, Sci-Fi & Fantasy",1999
1854,the lord of the rings: the return of the king,1605940,"Action & Adventure, Sci-Fi & Fantasy",2003
1855,the lord of the rings: the two towers,1451316,"Action & Adventure, Sci-Fi & Fantasy",2002
1459,inglourious basterds,1231605,Action & Adventure,2009
2836,schindler's list,1184746,"Classic Movies, Dramas",1993
2398,the departed,1161114,"Dramas, Thrillers",2006
1781,american beauty,1050054,Dramas,1999
683,american history x,1015108,Dramas,1998


This table lists the 10 movies with the most votes. Inception received the highest number of votes with 2,006,939, followed by Pulp Fiction and The Matrix with 1,782,352 and 1,634,375, respectively. Every film in the top 10 is listed under either `Action & Adventure` or `Dramas`, with five films each. This suggests that users are most likely to watch and rate movies in these genres.
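
The genre tally for the ten most-voted movies can be checked directly. The sketch below re-enters the `title` and `listed_in` values from the table above and counts how often each genre occurs.

```python
import pandas as pd

# The ten most-voted movies and their genres, copied from the table above.
top10 = pd.DataFrame({
    "title": [
        "inception", "pulp fiction", "the matrix",
        "the lord of the rings: the return of the king",
        "the lord of the rings: the two towers",
        "inglourious basterds", "schindler's list", "the departed",
        "american beauty", "american history x",
    ],
    "listed_in": [
        "Action & Adventure, Sci-Fi & Fantasy, Thrillers",
        "Classic Movies, Cult Movies, Dramas",
        "Action & Adventure, Sci-Fi & Fantasy",
        "Action & Adventure, Sci-Fi & Fantasy",
        "Action & Adventure, Sci-Fi & Fantasy",
        "Action & Adventure",
        "Classic Movies, Dramas",
        "Dramas, Thrillers",
        "Dramas",
        "Dramas",
    ],
})

# One entry per (movie, genre) pair, then count the genres.
genre_counts = (
    top10["listed_in"].str.split(",").explode().str.strip().value_counts()
)
print(genre_counts)
```

`Action & Adventure` and `Dramas` each appear five times, and `Sci-Fi & Fantasy` appears four times.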

### Frequency of Rating

Ratings are used to rate a film's suitability for certain audiences based on its content. With this, the researchers aim to determine what kind of movies, based on its rating, Netflix releases for its audiences. To do this, a pie chart is used to classify the percentage of ratings. 

The following are the descriptions of each rating:
1. __G - General Audiences__
<br> All ages admitted. Nothing that would offend parents for viewing by children. 
2. __PG – Parental Guidance Suggested__
<br> Some material may not be suitable for children. Parents urged to give "parental guidance". 
3. __PG-13 – Parents Strongly Cautioned__
<br> Some material may be inappropriate for children under 13. Parents are urged to be cautious. 
4. __R – Restricted__
<br> Under 17 requires accompanying parent or adult guardian. Contains some adult material. 
5. __TV-Y7__
<br> This program is most appropriate for children age 7 and up.
6. __TV-G__
<br> This program is suitable for all ages.
7. __TV-PG__
<br> This program contains material that parents may find unsuitable for younger children. Parental guidance is recommended.
8. __TV-14__
<br> This program may be unsuitable for children under 14 years of age.
9. __TV-MA__
<br> This program is intended to be viewed by mature, adult audiences and may be unsuitable for children under 17.
10. __NR/UR - Not Rated/Unrated__
<br> This program may either have not been submitted for a rating or it is an uncut version. Unrated contains warnings stating that the uncut version of the program contains content different from original release and might not be suitable for minors.

In [28]:
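# The Series index holds the rating labels and the values hold the counts;
# this does not depend on the column names produced by reset_index().
temp_df1 = netflix_titles_rating_2000['rating'].value_counts()
trace = go.Pie(labels = temp_df1.index, values = temp_df1.values)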
data = [trace]
fig = go.Figure(data = data)
iplot(fig)

This figure shows that the largest share of the films is rated `R`, followed by `TV-MA`. Both ratings indicate adult material that is unsuitable for viewers under 17. This means that nearly half of the Netflix movies in this dataset cater to mature audiences.
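
The share of each rating can also be computed numerically instead of being read off the pie chart. The sketch below uses a hypothetical list of ten ratings, so the resulting percentages are illustrative only and do not describe the Netflix data.

```python
import pandas as pd

# Hypothetical ratings for ten movies.
ratings = pd.Series(
    ["R"] * 4 + ["TV-MA"] * 3 + ["PG-13"] * 2 + ["TV-14"],
    name="rating",
)

# normalize=True returns proportions instead of raw counts.
shares = ratings.value_counts(normalize=True)
print(shares)

# Fraction of movies aimed at mature audiences (R or TV-MA).
mature_share = ratings.isin(["R", "TV-MA"]).mean()
print(mature_share)
```

Applying the same two lines to the `rating` column of the merged dataset would give the exact figure behind the "nearly half" statement.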

### Production of films per year 
Here, the researchers want to know whether a large number of films produced in a year relates to the success of those films. To do this, a bar graph is used to illustrate the number of films released per year.

In [29]:
# The Series index holds the release years and the values hold the counts.
temp_df2 = netflix_titles_rating_2000['release_year'].value_counts()

# create trace1
rating_count = go.Bar(
                x = temp_df2.index,
                y = temp_df2.values,
                name="Movies",)

fig = go.Figure(data = [rating_count])
fig.show()

This figure shows that the largest number of films was released in the 2010s. However, half of the ten most-voted movies listed earlier were released in the 1990s. This suggests that releasing a large number of films in a year does not by itself lead to more successful films, although a chart of counts alone cannot establish whether the two are correlated.
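
One way to test this relationship more directly is to compute, for each release year, both the number of films and their mean rating, and then correlate the two. The sketch below does this on hypothetical data, so the resulting coefficient says nothing about the actual Netflix catalog.

```python
import pandas as pd

# Hypothetical movies: release year and average rating.
movies = pd.DataFrame({
    "release_year": [1994, 1994, 1999, 2010, 2010, 2010, 2015, 2015, 2015, 2015],
    "averageRating": [8.9, 7.0, 8.7, 8.8, 6.0, 5.5, 6.5, 6.0, 7.0, 5.0],
})

# Number of films and mean rating for each release year.
per_year = movies.groupby("release_year")["averageRating"].agg(["count", "mean"])
print(per_year)

# Pearson correlation between yearly output and yearly mean rating.
correlation = per_year["count"].corr(per_year["mean"])
print(round(correlation, 2))
```

A coefficient near zero would support the claim of no correlation, while a clearly positive or negative value would contradict it. Mean rating is only one possible measure of success; the number of votes is another.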

### Getting confidence interval of average ratings from three top genres
The data showed that the majority of the top rated movies came from the genres `Dramas`, `Action & Adventure`, and `Sci-Fi & Fantasy`. With that, the researchers would like to estimate the range in which the mean rating of movies in these genres most likely lies, using a confidence interval.

Here, the researchers selected the `title` and `averageRating` of `Dramas`, `Action & Adventure`, and `Sci-Fi & Fantasy` movies. This collection represents our __population__ of interest.

In [30]:
three_top_genres = res[(res['listed_in'] == "Dramas") | (res['listed_in'] == "Action & Adventure") | (res['listed_in'] == "Sci-Fi & Fantasy")]
three_top_genres

Unnamed: 0,title,averageRating,listed_in
1894,pulp fiction,8.9,Dramas
1854,the lord of the rings: the return of the king,8.9,Action & Adventure
1854,the lord of the rings: the return of the king,8.9,Sci-Fi & Fantasy
2836,schindler's list,8.9,Dramas
1813,inception,8.8,Action & Adventure
...,...,...,...
2675,bir baba hindu,2.8,Action & Adventure
2668,alien warfare,2.6,Action & Adventure
2668,alien warfare,2.6,Sci-Fi & Fantasy
1478,black rose,2.5,Action & Adventure


We need a certain number of samples to represent the population. With this, the researchers used a sample of 300 from the population, and extracted the summary statistics of the variable `averageRating`. 

In [44]:
sample_top_genres = three_top_genres.sample(300)
agg_top = sample_top_genres.agg({"averageRating": ["mean", "median", "std"]})
agg_top

Unnamed: 0,averageRating
mean,6.616667
median,6.7
std,1.00063


In [52]:
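# Select each statistic by row and column label rather than by position.
sample_mean = agg_top.loc["mean", "averageRating"]
sample_median = agg_top.loc["median", "averageRating"]
sample_std = agg_top.loc["std", "averageRating"]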
sample_mean

6.616666666666662

#### Confidence Interval

The researchers used a 95% confidence level, which corresponds to the middle 95% of the distribution. The critical value associated with this area is the 97.5th percentile of the standard normal distribution, since 2.5% is left in each tail.

The following is the formula to obtain the confidence interval of a population mean:
$$\bar{x} \pm z^* \frac{s}{\sqrt{n}}$$

In [46]:
z_star_95 = norm.ppf(0.975)
z_star_95

1.959963984540054

In [47]:
# n is the sample size (300), taken from the sample itself rather than hard-coded
margin_of_error = z_star_95 * (sample_std / numpy.sqrt(len(sample_top_genres)))
round(margin_of_error, 4)

0.1132

The 95% confidence interval is the sample mean $\pm$ the margin of error.

6.6167 ± 0.1132

The following is the confidence interval expressed as a range (minimum, maximum).

In [53]:
(round(sample_mean - margin_of_error, 4), round(sample_mean + margin_of_error, 4))

(6.5034, 6.7299)

Since we've obtained the confidence interval of the population mean, we would want to know if the true mean value of the population would belong to the given range. To do this, we used the `averageRating` from the three_top_genres and extracted its mean.

In [55]:
three_top_genres.agg({"averageRating": "mean"})

averageRating    6.523955
dtype: float64

##### Conclusion

The result shows that the true mean of the population (6.5240) belongs to the confidence interval. With that, we can say that we're 95% confident that the true average rating of Dramas, Action & Adventure, and Sci-Fi & Fantasy movies lies between the values __6.5034__ and __6.7299__.
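
The interval formula $\bar{x} \pm z^* \frac{s}{\sqrt{n}}$ can be wrapped in a small function. In this sketch `mean_confidence_interval` is a hypothetical helper, and the standard library's `statistics.NormalDist` stands in for `scipy.stats.norm` so that the block has no third-party dependencies. It is evaluated on the sample statistics reported above (mean 6.616667, standard deviation 1.00063) with n set to the sample size of 300.

```python
import math
from statistics import NormalDist

def mean_confidence_interval(sample_mean, sample_std, n, confidence=0.95):
    """Normal-approximation interval for a mean: x_bar ± z* · s / sqrt(n)."""
    # For 95% confidence this is the 97.5th percentile of the standard normal.
    z_star = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z_star * sample_std / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin, margin

low, high, margin = mean_confidence_interval(6.616667, 1.00063, 300)
print(round(low, 4), round(high, 4), round(margin, 4))
# prints: 6.5034 6.7299 0.1132
```

The divisor must be the size of the sample that produced the mean and standard deviation. A smaller n widens the interval and a larger n narrows it.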