## Netflix Movies Data Analysis: EDA and Recommender System
Netflix is the world’s leading entertainment streaming service, with over 193 million paid memberships in over 190 countries. Its catalog includes movies, TV series, and documentaries across many genres and languages. This study seeks to understand the behavior and preferences of its members using the variables available in the dataset. To that end, this notebook analyzes the data to identify trends, patterns, and anomalies, and applies statistical techniques to draw conclusions.

The researchers will be using the __netflix dataset__ for the data analysis. The following techniques will be applied to the data:
1. Exploratory Data Analysis
2. Confidence Intervals
3. Statistical Inference 
4. Recommender System

In addition, this notebook seeks to answer the following research questions: 
1. What behavior can we conclude from the subscribers of Netflix based on the activities of these users?
2. What kind of movies can entertain more subscribers based on the data?
3. What type of movie would a population prefer based on the popularity of genres in a certain country?
4. Is there a certain threshold of rating for a movie to be shown or produced in a country?
5. Is there a significant difference between the average ratings on a span of 15 years?
6. Which set of movies will a user most likely watch based on a movie title?

## Imports

Install **plotly** 

In [1]:
pip install plotly

Note: you may need to restart the kernel to use updated packages.


Import **pandas**, **csv**, **numpy**, **seaborn**, **matplotlib**, **plotly**, and **scipy**.

In [2]:
import pandas as pd
import csv
import numpy
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.offline as py
from plotly.offline import iplot, init_notebook_mode
from scipy.stats import norm
import plotly.graph_objs as go
py.init_notebook_mode(connected = True)

The researchers will be using a total of 3 datasets. The first is __netflix_titles.csv__, which was collected through a third-party Netflix search engine known as Flixable and contains the descriptive information of every title on Netflix. However, it does not contain user ratings. Therefore, the researchers used two other datasets (__title.ratings.tsv__ and __title.basics.tsv__) to add the information needed for the study, namely the average user rating of each movie and its number of votes.

### Read "title.ratings.tsv" file
The __title.ratings.tsv__ is a dataset from IMDb, an online database owned by Amazon used for ratings, and fan and critical reviews, which consists of the following parameters:

`tconst` - alphanumeric unique identifier of the title
<br> `averageRating` - weighted average of all the individual user ratings per movie
<br> `numVotes` - the total number of votes of a movie

In [3]:
title_ratings=pd.read_csv("title.ratings.tsv", sep='\t')

Show the head of `title_ratings`.

In [4]:
title_ratings.head()

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.6,1647
1,tt0000002,6.1,198
2,tt0000003,6.5,1345
3,tt0000004,6.2,120
4,tt0000005,6.2,2131


### Read "title.basics.tsv" file
The other dataset is called __title.basics.tsv__. It is used to look up the title of each movie in the __title.ratings.tsv__ dataset by matching the variable `tconst`, which appears in both datasets.

The dataset contains the following variables:
<br> `titleType` - type of film (movie)
<br> `tconst` - alphanumeric unique identifier of the title
<br> `primaryTitle` - string value of the title
<br> `originalTitle` - title of the movie
<br>`startYear` - year it was produced

In [5]:
title_basics=pd.read_csv("title.basics.tsv", sep='\t')
title_basics=title_basics.drop_duplicates()
title_basics=title_basics[['titleType','tconst','primaryTitle', 'originalTitle', 'startYear']]
title_basics=title_basics[title_basics.titleType=='movie']
title_basics=title_basics[title_basics.startYear.apply(lambda x: str(x).isnumeric())]


Columns (5) have mixed types.Specify dtype option on import or set low_memory=False.



Show the head of `title_basics`.

In [6]:
title_basics.head()

Unnamed: 0,titleType,tconst,primaryTitle,originalTitle,startYear
8,movie,tt0000009,Miss Jerry,Miss Jerry,1894
144,movie,tt0000147,The Corbett-Fitzsimmons Fight,The Corbett-Fitzsimmons Fight,1897
331,movie,tt0000335,Soldiers of the Cross,Soldiers of the Cross,1900
498,movie,tt0000502,Bohemios,Bohemios,1905
570,movie,tt0000574,The Story of the Kelly Gang,The Story of the Kelly Gang,1906


### Merge "title.ratings.tsv" and "title.basics.tsv" 
Create a new dataframe `ratings_and_titles` by merging the datasets `title.ratings.tsv` and `title.basics.tsv`.

In [7]:
ratings_and_titles=pd.merge(title_ratings.set_index('tconst'), title_basics.set_index('tconst'), left_index=True, right_index=True, how='inner')
ratings_and_titles=ratings_and_titles.drop_duplicates()

### Read netflix_titles.csv
We then read the dataset `netflix_titles.csv`.

In [8]:
netflix_titles=pd.read_csv("netflix_titles.csv", index_col="show_id")

Next, we clean the data before using the dataset for further analysis. The researchers do this by dropping the rows without a value in the column `release_year`. We also ensure that all values in the columns `release_year` and `startYear` are integers. Lastly, we want movie titles to follow a uniform format, so the researchers converted the titles to lowercase.

Drop observations from `netflix_titles` without `release_year`. 

In [9]:
netflix_titles=netflix_titles.dropna(subset=['release_year'])

Change the data type of `release_year` column to integer.

In [10]:
netflix_titles.release_year=netflix_titles.release_year.astype(numpy.int64)

Drop observations in `ratings_and_titles` with non-numeric values for `startYear` and convert to integer.

In [11]:
ratings_and_titles=ratings_and_titles[ratings_and_titles.startYear.apply(lambda x: str(x).isnumeric())]
ratings_and_titles.startYear=ratings_and_titles.startYear.astype(numpy.int64)
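As a side note, the `isnumeric` filter and the `astype` conversion above can be combined with `pd.to_numeric`. The sketch below runs on a small made-up frame (the rows are assumptions for illustration only); the `\N` placeholder mimics how IMDb's TSV files mark missing values.

```python
import pandas as pd

# Toy stand-in for ratings_and_titles; the rows are made up for illustration.
toy = pd.DataFrame({
    "primaryTitle": ["miss jerry", "unknown year film", "bohemios"],
    "startYear": ["1894", "\\N", "1905"],
})

# errors="coerce" turns anything that is not a number into NaN ...
toy["startYear"] = pd.to_numeric(toy["startYear"], errors="coerce")

# ... so the non-numeric rows can be dropped and the rest cast to integer.
toy = toy.dropna(subset=["startYear"]).astype({"startYear": "int64"})

print(toy)
```

This keeps the same rows as the `isnumeric` approach for year-like strings, while making the conversion and the filtering a single, explicit step.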

Convert the observations in the `title` column to lowercase.

In [12]:
netflix_titles['title']=netflix_titles['title'].str.lower()
ratings_and_titles['originalTitle']=ratings_and_titles['originalTitle'].str.lower()
ratings_and_titles['primaryTitle']=ratings_and_titles['primaryTitle'].str.lower()

After cleaning the data, the researchers can now merge all 3 datasets.

### Join netflix titles with IMDb ratings on title name and release year.
We will create a new dataframe `netflix_titles_rating` by merging `netflix_titles` and `ratings_and_titles`.

In [13]:
netflix_titles=netflix_titles[netflix_titles.type=='Movie']
netflix_titles_rating=pd.merge(netflix_titles, ratings_and_titles, left_on=['title', 'release_year'], right_on=['primaryTitle', 'startYear'], how='inner')
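To illustrate why the join uses both the title and the release year, here is a minimal sketch on made-up rows (the ratings are invented for the example). A title that exists for several years on the IMDb side only matches the row with the same year, and a title whose year does not match is dropped by the inner join.

```python
import pandas as pd

# Made-up rows: "carrie" exists for two different years on the IMDb side.
netflix_toy = pd.DataFrame({
    "title": ["carrie", "true grit"],
    "release_year": [1976, 1969],
})
imdb_toy = pd.DataFrame({
    "primaryTitle": ["carrie", "carrie", "true grit"],
    "startYear": [1976, 2013, 2010],
    "averageRating": [7.4, 5.8, 7.6],  # invented values
})

# Inner join on the (title, year) pair, as in the cell above.
merged = pd.merge(
    netflix_toy, imdb_toy,
    left_on=["title", "release_year"],
    right_on=["primaryTitle", "startYear"],
    how="inner",
)
print(merged)
```

Only the 1976 "carrie" row survives. Matching on the title alone would have attached both IMDb entries to the same Netflix movie.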

### Sort the obtained data frame by averageRating and number of votes
The filtered dataset is named `netflix_titles_rating_2000`. The researchers opted to keep only movies with more than 2,000 votes so that each average rating is based on a reasonably large number of votes. This dataset will be used for the data analysis.

The following are the parameters of the dataset: 
<br> `type` - a movie or tv show
<br> `title` - the title of a movie or tv show
<br> `director` - director of a movie or tv show
<br> `cast` - actors/actresses involved in the film
<br> `country` - the country where the movie/tv show was produced
<br> `date_added` - the date it was added in Netflix
<br> `release_year` - the actual release date of the movie/tv show
<br> `rating` - the rating of a movie/tv show (TV-MA, TV-14, TV-PG, TV-Y7-FV, TV-17, R)
<br> `duration` - total duration of the movie/tv show in minutes or seasons
<br> `listed_in` - the genre of a movie/tv show
<br> `description` - brief description of a movie/tv show
<br> `averageRating` - weighted average of all the individual user ratings per movie
<br> `numVotes` - the total number of votes of a movie
<br> `titleType` - type of film (movie)
<br> `primaryTitle` - string value of the title
<br> `originalTitle` - title of the movie
<br>`startYear` - year it was produced

In [14]:
netflix_titles_rating.sort_values(by=['averageRating', 'numVotes'], inplace=True, ascending=False)
netflix_titles_rating_2000=netflix_titles_rating[netflix_titles_rating.numVotes>2000]

Show the head of `netflix_titles_rating_2000`.

In [15]:
netflix_titles_rating_2000.head()

Unnamed: 0,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,averageRating,numVotes,titleType,primaryTitle,originalTitle,startYear
1894,Movie,pulp fiction,Quentin Tarantino,"John Travolta, Samuel L. Jackson, Uma Thurman,...",United States,"January 1, 2019",1994,R,154 min,"Classic Movies, Cult Movies, Dramas",This stylized crime caper weaves together stor...,8.9,1782352,movie,pulp fiction,pulp fiction,1994
1854,Movie,the lord of the rings: the return of the king,Peter Jackson,"Elijah Wood, Ian McKellen, Liv Tyler, Viggo Mo...","New Zealand, United States","January 1, 2020",2003,PG-13,201 min,"Action & Adventure, Sci-Fi & Fantasy",Aragorn is revealed as the heir to the ancient...,8.9,1605940,movie,the lord of the rings: the return of the king,the lord of the rings: the return of the king,2003
2836,Movie,schindler's list,Steven Spielberg,"Liam Neeson, Ben Kingsley, Ralph Fiennes, Caro...",United States,"April 1, 2018",1993,R,195 min,"Classic Movies, Dramas",Oskar Schindler becomes an unlikely humanitari...,8.9,1184746,movie,schindler's list,schindler's list,1993
1813,Movie,inception,Christopher Nolan,"Leonardo DiCaprio, Joseph Gordon-Levitt, Ellen...","United States, United Kingdom","January 1, 2020",2010,PG-13,148 min,"Action & Adventure, Sci-Fi & Fantasy, Thrillers","In this mind-bending sci-fi thriller, a man ru...",8.8,2006939,movie,inception,inception,2010
740,Movie,the matrix,"Lilly Wachowski, Lana Wachowski","Keanu Reeves, Laurence Fishburne, Carrie-Anne ...",United States,"November 1, 2019",1999,R,136 min,"Action & Adventure, Sci-Fi & Fantasy",A computer hacker learns that what most people...,8.7,1634375,movie,the matrix,the matrix,1999


### Check for NaN values
Since the merged dataset is established, the researchers would like to check for null values from the dataset.

In [16]:
netflix_titles_rating_2000.isnull().any()

type             False
title            False
director          True
cast              True
country           True
date_added       False
release_year     False
rating           False
duration         False
listed_in        False
description      False
averageRating    False
numVotes         False
titleType        False
primaryTitle     False
originalTitle    False
startYear        False
dtype: bool

Let's put the columns from `netflix_titles_rating_2000` with null values into a list.

In [17]:
nan_vars = netflix_titles_rating_2000.columns[netflix_titles_rating_2000.isnull().any()].tolist()
print(nan_vars)

['director', 'cast', 'country']


The list shows that the columns `director`, `cast`, and `country` contain null values.

Before applying necessary procedure, the researchers should first know the number of null values on the indicated columns.

In [18]:
for variable in nan_vars:
    print(variable, sum(netflix_titles_rating_2000[variable].isnull()))

director 8
cast 53
country 8
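The same per-column counts can also be obtained without a loop. A minimal sketch on a made-up frame (the column values are placeholders, not data from the notebook):

```python
import pandas as pd

# Toy frame with a few missing values.
toy = pd.DataFrame({
    "director": ["A", None, "C"],
    "cast": [None, None, "X"],
    "country": ["US", "PH", "NZ"],
})

# isnull() gives a boolean frame; summing counts the True values per column.
null_counts = toy.isnull().sum()

# Keep only the columns that actually contain nulls.
null_counts = null_counts[null_counts > 0]
print(null_counts)
```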


This shows that the number of null values in these columns is very low. With that, the researchers opted to drop the rows with null values.

In [19]:
netflix_titles_rating_2000.dropna(subset=['director'], inplace=True)
netflix_titles_rating_2000.dropna(subset=['cast'], inplace=True)
netflix_titles_rating_2000.dropna(subset=['country'], inplace=True)



A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy






After dropping the rows that have NaN values, we check again to verify that the columns are clear.

In [20]:
netflix_titles_rating_2000.isnull().any()

type             False
title            False
director         False
cast             False
country          False
date_added       False
release_year     False
rating           False
duration         False
listed_in        False
description      False
averageRating    False
numVotes         False
titleType        False
primaryTitle     False
originalTitle    False
startYear        False
dtype: bool
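The `SettingWithCopyWarning` printed after the `dropna` calls appears because `netflix_titles_rating_2000` was created by filtering another dataframe and was then modified in place. A minimal sketch of one common way to avoid it, shown on made-up data: take an explicit `.copy()` when filtering, and assign the result of `dropna` instead of using `inplace=True`.

```python
import pandas as pd

# Toy stand-in for netflix_titles_rating; the rows are made up.
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "numVotes": [5000, 100, 9000, 3000],
    "director": ["X", "Y", None, "Z"],
})

# .copy() makes the filtered frame independent of `toy`.
popular = toy[toy.numVotes > 2000].copy()

# Reassign rather than modify in place; subset can list several columns at once.
popular = popular.dropna(subset=["director"])

print(popular)
```

The original frame is left untouched, and the filtered frame can be modified later without triggering the warning.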

Upon observation, the column `listed_in` in `netflix_titles_rating_2000` consists of multiple genres. With that, the researchers intend to split the genres to be used in analyzing the data for the Exploratory Data Analysis.

### Split Genres 
The researchers utilized `chain` from `itertools`, which will split the genres and return a list from series of comma-separated strings.

In [21]:
from itertools import chain
def chainer(s):
    return list(chain.from_iterable(s.str.split(',')))

The series `lens` holds the number of genres listed for each movie, computed as the length of each split of `listed_in`.

In [22]:
lens = netflix_titles_rating_2000['listed_in'].str.split(',').map(len)

The dataframe `res` which contains the columns `title`, `averageRating`, and `listed_in`, will be created through repeating or chaining. Then the excess spaces at the beginning and end of strings in the `listed_in` column will be removed.

In [23]:
res = pd.DataFrame({'title': numpy.repeat(netflix_titles_rating_2000['title'], lens),
                    'averageRating': numpy.repeat(netflix_titles_rating_2000['averageRating'], lens),
                    'listed_in': chainer(netflix_titles_rating_2000['listed_in']),
                    })
res['listed_in']=res['listed_in'].str.strip()
res

Unnamed: 0,title,averageRating,listed_in
1894,pulp fiction,8.9,Classic Movies
1894,pulp fiction,8.9,Cult Movies
1894,pulp fiction,8.9,Dramas
1854,the lord of the rings: the return of the king,8.9,Action & Adventure
1854,the lord of the rings: the return of the king,8.9,Sci-Fi & Fantasy
...,...,...,...
765,himmatwala,1.7,Action & Adventure
765,himmatwala,1.7,Comedies
765,himmatwala,1.7,International Movies
1915,justin bieber: never say never,1.6,Documentaries
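For comparison, recent pandas versions offer `DataFrame.explode`, which produces the same one-row-per-genre layout without a helper function. This is an alternative sketch on made-up rows, not the approach used in the cells above.

```python
import pandas as pd

# Two made-up rows in the same shape as netflix_titles_rating_2000.
toy = pd.DataFrame({
    "title": ["pulp fiction", "inception"],
    "averageRating": [8.9, 8.8],
    "listed_in": ["Classic Movies, Cult Movies, Dramas",
                  "Action & Adventure, Sci-Fi & Fantasy, Thrillers"],
})

exploded = (
    # Split on commas, swallowing the surrounding spaces in the same step ...
    toy.assign(listed_in=toy["listed_in"].str.split(r"\s*,\s*"))
       # ... then emit one row per list element, repeating the other columns.
       .explode("listed_in")
       .reset_index(drop=True)
)
print(exploded)
```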


# Exploratory Data Analysis
Exploratory Data Analysis is an approach used to maximize insight into a data set, uncover underlying structure, extract important variables, detect outliers and anomalies, and test underlying assumptions. 

The researchers aim to use this approach to extract the main characteristics of the data. This part contains visual aids to further illustrate significant patterns and trends in the data.

### Top genres
In here, the researchers would like to determine what type of content Netflix releases the most. To do that, a bar graph is used to show the number of movies in a specific genre and the level of differences between genres.

In [24]:
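# Name the columns explicitly; the default column names produced by
# value_counts().reset_index() differ between pandas versions.
temp_df = res['listed_in'].value_counts().rename_axis('genre').reset_index(name='count')

# Bar trace: one bar per genre, height = number of movies
top_genres = go.Bar(
                x = temp_df['genre'],
                y = temp_df['count'],)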
fig = go.Figure(data = [top_genres])
fig.show()

This figure shows the number of movies per genre in the dataset. `Dramas` has the highest count at 673, followed by `International Movies` and `Comedies` at 526 and 482 respectively.

### Top Rated Movies

The researchers aim to know what type of genre people would most likely watch based on the number of votes that a movie has. To do this, the researchers used the following columns: `title`, `numVotes`, `listed_in`, and `release_year`.

In [25]:
df1 = netflix_titles_rating_2000.sort_values("numVotes", ascending = False)
display(df1[['title', "numVotes", 'listed_in', 'release_year']][:10])

Unnamed: 0,title,numVotes,listed_in,release_year
1813,inception,2006939,"Action & Adventure, Sci-Fi & Fantasy, Thrillers",2010
1894,pulp fiction,1782352,"Classic Movies, Cult Movies, Dramas",1994
740,the matrix,1634375,"Action & Adventure, Sci-Fi & Fantasy",1999
1854,the lord of the rings: the return of the king,1605940,"Action & Adventure, Sci-Fi & Fantasy",2003
1855,the lord of the rings: the two towers,1451316,"Action & Adventure, Sci-Fi & Fantasy",2002
1459,inglourious basterds,1231605,Action & Adventure,2009
2836,schindler's list,1184746,"Classic Movies, Dramas",1993
2398,the departed,1161114,"Dramas, Thrillers",2006
1781,american beauty,1050054,Dramas,1999
683,american history x,1015108,Dramas,1998


This table lists the 10 movies with the most votes. `inception` received the highest number of votes (2,006,939), followed by `pulp fiction` and `the matrix` with 1,782,352 and 1,634,375 respectively. The majority of these films belong to the `Action & Adventure` and `Dramas` genres. This suggests that users are most likely to watch movies of these genres.

### Frequency of Age-Based Rating

Ratings are used to rate a film's suitability for certain audiences based on its content. With this, the researchers aim to determine what kind of movies, based on its rating, Netflix releases for its audiences. To do this, a pie chart is used to classify the percentage of ratings. 

The following are the descriptions of each rating:
<br>`G - General Audiences` - All ages admitted. Nothing that would offend parents for viewing by children. 
<br>`PG – Parental Guidance Suggested` - Some material may not be suitable for children. Parents urged to give "parental guidance". 
<br>`PG-13 – Parents Strongly Cautioned` - Some material may be inappropriate for children under 13. Parents are urged to be cautious. 
<br>`R – Restricted` - Under 17 requires accompanying parent or adult guardian. Contains some adult material. 
<br>`TV-Y7` - This program is most appropriate for children age 7 and up.
<br>`TV-G` - This program is suitable for all ages.
<br>`TV-PG` - This program contains material that parents may find unsuitable for younger children. Parental guidance is recommended.
<br>`TV-14` - This program may be unsuitable for children under 14 years of age.
<br>`TV-MA` - This program is intended to be viewed by mature, adult audiences and may be unsuitable for children under 17.
<br>`NR/UR - Not Rated/Unrated` - This program may either have not been submitted for a rating or it is an uncut version. Unrated contains warnings stating that the uncut version of the program contains content different from original release and might not be suitable for minors.

In [26]:
# Explicit column names keep this working across pandas versions
temp_df1 = netflix_titles_rating_2000['rating'].value_counts().rename_axis('rating').reset_index(name='count')
trace = go.Pie(labels = temp_df1['rating'], values = temp_df1['count'])
data = [trace]
fig = go.Figure(data = data)
iplot(fig)

This figure shows that the largest share of the films is rated `R`, followed by `TV-MA`. Both ratings (not genres) mark content aimed at viewers aged 17 and above and indicate adult material. This means that nearly half of the movies in the dataset cater to mature audiences.

### Production of films per year 
In here, the researchers want to know if a large production of films per year correlates to the success of those films. To do this, a bar graph is used to illustrate the number of films released per year. 

In [27]:
temp_df2 = netflix_titles_rating_2000['release_year'].value_counts().rename_axis('release_year').reset_index(name='count')

# Bar trace: one bar per release year, height = number of movies
rating_count = go.Bar(
                x = temp_df2['release_year'],
                y = temp_df2['count'],
                name="Movies",)

fig = go.Figure(data = [rating_count])
fig.show()

This figure shows that most of the films in the dataset were released in the 2010s. However, half of the 10 most-voted movies were released in the 1990s. This suggests that the years with the most releases are not necessarily the years with the most popular films, although this comparison is only descriptive and does not measure correlation.

### Getting confidence interval of average ratings from three top genres
The data from the __Top Rated Movies__ section shows that the majority of the top rated movies come from the genres `Dramas`, `Action & Adventure`, and `Sci-Fi & Fantasy`. With that, the researchers would like to estimate the mean rating of movies in these genres using a confidence interval.

In here, the researchers selected the `title` and `averageRating` of `Dramas`, `Action & Adventure`, and `Sci-Fi & Fantasy` movies. This collection represents our __population__ of interest.

In [28]:
three_top_genres = res[(res['listed_in'] == "Dramas") | (res['listed_in'] == "Action & Adventure") | (res['listed_in'] == "Sci-Fi & Fantasy")]
three_top_genres

Unnamed: 0,title,averageRating,listed_in
1894,pulp fiction,8.9,Dramas
1854,the lord of the rings: the return of the king,8.9,Action & Adventure
1854,the lord of the rings: the return of the king,8.9,Sci-Fi & Fantasy
2836,schindler's list,8.9,Dramas
1813,inception,8.8,Action & Adventure
...,...,...,...
2675,bir baba hindu,2.8,Action & Adventure
2668,alien warfare,2.6,Action & Adventure
2668,alien warfare,2.6,Sci-Fi & Fantasy
1478,black rose,2.5,Action & Adventure


We need a certain number of samples to represent the population. With this, the researchers used a sample of 300 from the population, and extracted the summary statistics of the variable `averageRating`. 

In [29]:
sample_top_genres = three_top_genres.sample(300)
agg_top = sample_top_genres.agg({"averageRating": ["mean", "median", "std"]})
agg_top

Unnamed: 0,averageRating
mean,6.454
median,6.5
std,1.099415


The sample mean from `averageRating` is 6.454.

In [30]:
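# Select by row and column label; positional [0] on a labeled Series is deprecated
sample_mean = agg_top.loc["mean", "averageRating"]
sample_median = agg_top.loc["median", "averageRating"]
sample_std = agg_top.loc["std", "averageRating"]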
sample_mean

6.454000000000002

#### Confidence Interval

The researchers used a 95% confidence level, which corresponds to the middle 95% of the distribution. The critical value associated with this area is the 97.5th percentile of the standard normal distribution.

The following is the formula to obtain the confidence interval of a population mean:
$$\bar{x} \pm z^* \frac{s}{\sqrt{n}}$$

In [31]:
z_star_95 = norm.ppf(0.975)
z_star_95

1.959963984540054
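The formula above can be wrapped in a small helper that reads $n$ from the data itself instead of typing it by hand. This is a minimal sketch under the same normal approximation; the function name `mean_confidence_interval` and the ratings are made up for illustration.

```python
import numpy as np
from scipy.stats import norm

def mean_confidence_interval(values, confidence=0.95):
    """Return (lower, upper) for the mean using the normal approximation."""
    values = np.asarray(values, dtype=float)
    n = values.size
    mean = values.mean()
    std = values.std(ddof=1)  # sample standard deviation, as pandas computes it
    # Two-sided critical value, e.g. the 97.5th percentile for 95% confidence
    z_star = norm.ppf(1 - (1 - confidence) / 2)
    margin = z_star * std / np.sqrt(n)
    return mean - margin, mean + margin

ratings = [6.1, 7.3, 5.8, 6.9, 6.4, 7.0, 5.5, 6.6]  # made-up ratings
low, high = mean_confidence_interval(ratings)
print(low, high)
```

With only eight values the normal approximation is rough; the helper is meant for samples of a size like the 300 used in this notebook.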

The 95% confidence interval is the sample mean $\pm$ the margin of error.

In [32]:
# n is the size of the sample drawn above (300), not a hard-coded value
margin_of_error = z_star_95 * (sample_std / numpy.sqrt(len(sample_top_genres)))
margin_of_error

0.124408

The following is the confidence interval expressed as a range (minimum, maximum).

In [33]:
(sample_mean - margin_of_error, sample_mean + margin_of_error)

(6.329592, 6.578408)

Since we've obtained the confidence interval of the population mean, we would want to know if the true mean value of the population would belong to the given range. To do this, we used the `averageRating` from the `three_top_genres` and extracted its mean.

In [34]:
three_top_genres.agg({"averageRating": "mean"})

averageRating    6.526567
dtype: float64

The result shows that the true mean of the population (6.5266) falls inside the confidence interval. With that, we can say that we are 95% confident that the true average rating of `Dramas`, `Action & Adventure`, and `Sci-Fi & Fantasy` movies lies between the values __6.3296__ and __6.5784__. These values use $n = 300$, the size of the sample, and are computed from the rounded sample statistics shown above.

# Movie ratings across different timeframes

Based on the results from the EDA, the number of films produced does not appear to correlate with the success of the movies. With that, we want to know if there is a significant difference between the average user ratings across release periods.

To do this, we will group movies by release year into consecutive intervals of roughly 15 years, starting from __1941__ up to __2020__ (each label, such as 1941-1956, spans 16 calendar years). By using `pandas.cut()` we will be able to bin `release_year` into these intervals.

### Group `release_year` into an interval of 15 years

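In the dataset `netflix_titles_rating_2000`, bin the `release_year` column using the edges 1941, 1957, 1973, 1989, 2005, and 2020. Note that `pd.cut` uses right-closed bins by default, so with these edges the bin labeled `1989-2004` actually covers release years 1990 through 2005, and a movie released in 1941 would fall outside every bin.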

In [35]:
bins = [1941,1957,1973,1989,2005,2020]
labels = ['1941-1956','1957-1972','1973-1988','1989-2004','2005-2020']
netflix_titles_rating_2000['year_interval_15'] = pd.cut(netflix_titles_rating_2000['release_year'], bins=bins, labels=labels)
netflix_titles_rating_2000['year_interval_15'].value_counts()



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



2005-2020    1181
1989-2004     231
1973-1988      40
1957-1972      18
1941-1956       4
Name: year_interval_15, dtype: int64
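The effect of the bin edges can be checked on a handful of boundary years. The sketch below compares the default right-closed bins with a left-closed alternative that matches the labels; the variable names and the final edge of 2021 are assumptions introduced for this illustration, and the counts shown above were produced with the default behavior.

```python
import pandas as pd

edges = [1941, 1957, 1973, 1989, 2005, 2020]
labels = ['1941-1956', '1957-1972', '1973-1988', '1989-2004', '2005-2020']
years = pd.Series([1941, 1957, 1989, 2005, 2020])

# Default (right=True): bins are (1941, 1957], (1957, 1973], ...
default_bins = pd.cut(years, bins=edges, labels=labels)

# Left-closed alternative: [1941, 1957), ..., [2005, 2021)
left_closed = pd.cut(years, bins=edges[:-1] + [2021], labels=labels, right=False)

print(pd.DataFrame({"year": years,
                    "default": default_bins,
                    "left_closed": left_closed}))
```

With the default settings, 1941 is left unbinned and each boundary year (1957, 1989, 2005) lands in the earlier label. With `right=False`, every year falls in the bin its label describes.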

The data above shows the number of movies released on a specific range of years.

After grouping the `release_year`, we will now get the mean of `averageRating` per `year_interval_15`.

### Compare means of `averageRating` per year range
In here, we want to know if there is a significant difference between the ratings of recent movies and older movies across the year intervals.

To do this, we will get the point estimate of the group year range from __1989-2004__ and __2005-2020__.

In [36]:
netflix_titles_rating_2000.groupby("year_interval_15").agg({"averageRating": ["mean","std"]})

Unnamed: 0_level_0,averageRating,averageRating
Unnamed: 0_level_1,mean,std
year_interval_15,Unnamed: 1_level_2,Unnamed: 2_level_2
1941-1956,7.575,0.125831
1957-1972,7.405556,0.653022
1973-1988,7.255,0.879671
1989-2004,6.654545,0.969609
2005-2020,6.291702,1.063652


Get the difference between year range from 1989-2004 to 2005-2020

In [37]:
6.654545 - 6.291702

0.3628429999999998

Based on our data, we see a difference of about 0.36 between the mean ratings of movies from __1989-2004__ and __2005-2020__.


To see if there is a significant difference between the two grouped year intervals, we will use an unpaired (independent two-sample) t-test.


We set up our hypotheses as follows:

$H_0$ (null hypothesis): There is no true difference between the two grouped year intervals.

$H_A$ (alternative hypothesis): There is a true difference between the two grouped year intervals.

Now, we can use a $t$-test to compare the two means from the unpaired groups. We set the `equal_var` parameter to `False` (Welch's $t$-test) because we don't want to assume that the two populations have equal variances.

### Find the statistics of two groups (1989-2004 and 2005-2020)
Using t-test, we then compare the two groups:

In [38]:
from scipy.stats import ttest_ind
ttest_ind(netflix_titles_rating_2000[netflix_titles_rating_2000["year_interval_15"] == "2005-2020"]["averageRating"],
          netflix_titles_rating_2000[netflix_titles_rating_2000["year_interval_15"] == "1989-2004"]["averageRating"],
          equal_var = False)

Ttest_indResult(statistic=-5.117153927143217, pvalue=5.14440408750733e-07)
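As a cross-check, the same Welch statistic can be reproduced from the summary statistics alone with `scipy.stats.ttest_ind_from_stats`. The sketch below plugs in the group means, standard deviations, and counts reported in the tables above, and then applies the usual decision rule; the 0.05 threshold and the variable names are choices made for this example.

```python
from scipy.stats import ttest_ind_from_stats

# Summary statistics as reported in the groupby table and value counts above.
recent = dict(mean=6.291702, std=1.063652, n=1181)  # 2005-2020
older = dict(mean=6.654545, std=0.969609, n=231)    # 1989-2004

result = ttest_ind_from_stats(
    mean1=recent["mean"], std1=recent["std"], nobs1=recent["n"],
    mean2=older["mean"], std2=older["std"], nobs2=older["n"],
    equal_var=False,  # Welch's t-test, no equal-variance assumption
)

# Decision rule: reject the null hypothesis when the p-value is below alpha.
alpha = 0.05
reject_null = result.pvalue < alpha
print(result.statistic, result.pvalue, reject_null)
```

A p-value below `alpha` leads to rejecting the null hypothesis of equal means; a p-value above it means the data do not provide enough evidence to reject it.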

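At the 5% significance level, the p-value (about $5.1 \times 10^{-7}$) is far below 0.05. This means that we reject the null hypothesis: there is a statistically significant difference between the mean ratings of the two grouped year intervals, with movies from __1989-2004__ rated about 0.36 points higher on average than movies from __2005-2020__.

However, if we look back at the counts of the grouped year intervals, the two groups are very unequal in size (1,181 movies for 2005-2020 versus 231 for 1989-2004) because far more recent movies are in the dataset. The older titles available on Netflix may also be a selected set of well-regarded films, so the difference should be interpreted with caution.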

# Recommender Systems - Content-based Filtering on multiple factors
This part of the Notebook is mostly adapted from the [Recommendation System (Content Based)](https://www.kaggle.com/niharika41298/netflix-visualizations-recommendation-eda/notebook) Kaggle Notebook of Niharika Pandit, tweaked to match the needs and variables of our CSMODEL Case Study Notebook.

Content-based filtering uses item features to recommend other items similar to what the user likes, based on their previous actions or explicit feedback. 

With that, the researchers intend to use this technique to find movies similar to one that the user already likes.

First, we need to copy the dataframe `netflix_titles_rating_2000` into another variable named `recom_netflix_2000` in order to avoid disruption of data since we will be cleaning in a different way to manipulate the dataframe.

In [39]:
recom_netflix_2000 = netflix_titles_rating_2000.copy()

We define a function that cleans a value by converting it to a lowercase string and removing its spaces.

In [40]:
def clean_data(x):
    # str() handles integer columns such as release_year as well as text columns
    return str(x).replace(" ", "").lower()

We set up the factors to filter by the `title`, `release_year`, `director`, `cast`, `listed_in`, and `description` of the movies.

In [41]:
features=['title', 'release_year', 'director', 'cast', 'listed_in', 'description']
recom_netflix_2000=recom_netflix_2000[features]

Show the head of `recom_netflix_2000`.

In [42]:
recom_netflix_2000.head()

Unnamed: 0,title,release_year,director,cast,listed_in,description
1894,pulp fiction,1994,Quentin Tarantino,"John Travolta, Samuel L. Jackson, Uma Thurman,...","Classic Movies, Cult Movies, Dramas",This stylized crime caper weaves together stor...
1854,the lord of the rings: the return of the king,2003,Peter Jackson,"Elijah Wood, Ian McKellen, Liv Tyler, Viggo Mo...","Action & Adventure, Sci-Fi & Fantasy",Aragorn is revealed as the heir to the ancient...
2836,schindler's list,1993,Steven Spielberg,"Liam Neeson, Ben Kingsley, Ralph Fiennes, Caro...","Classic Movies, Dramas",Oskar Schindler becomes an unlikely humanitari...
1813,inception,2010,Christopher Nolan,"Leonardo DiCaprio, Joseph Gordon-Levitt, Ellen...","Action & Adventure, Sci-Fi & Fantasy, Thrillers","In this mind-bending sci-fi thriller, a man ru..."
740,the matrix,1999,"Lilly Wachowski, Lana Wachowski","Keanu Reeves, Laurence Fishburne, Carrie-Anne ...","Action & Adventure, Sci-Fi & Fantasy",A computer hacker learns that what most people...


We then clean the dataframe with the factors defined using the `clean_data` function defined earlier.

In [43]:
for feature in features:
    recom_netflix_2000[feature] = recom_netflix_2000[feature].apply(clean_data)

Show the head of the cleaned `description` column of `recom_netflix_2000`.

In [44]:
recom_netflix_2000['description'].head()

1894    thisstylizedcrimecaperweavestogetherstoriesfea...
1854    aragornisrevealedastheheirtotheancientkingsash...
2836    oskarschindlerbecomesanunlikelyhumanitarian,sp...
1813    inthismind-bendingsci-fithriller,amanrunsanesp...
740     acomputerhackerlearnsthatwhatmostpeopleperceiv...
Name: description, dtype: object

We define a function called `create_soup()` that concatenates the cleaned features into one string, separated by spaces.

In [45]:
def create_soup(x):
    return x['title']+ ' ' + x['release_year'] + ' ' + x['director'] + ' ' + x['cast'] + ' ' +x['listed_in']+' '+ x['description']

We apply the defined function `create_soup()`  to the dataframe.

In [46]:
recom_netflix_2000['soup'] = recom_netflix_2000.apply(create_soup, axis=1)
recom_netflix_2000['soup'].head()

1894    pulpfiction 1994 quentintarantino johntravolta...
1854    thelordoftherings:thereturnoftheking 2003 pete...
2836    schindler'slist 1993 stevenspielberg liamneeso...
1813    inception 2010 christophernolan leonardodicapr...
740     thematrix 1999 lillywachowski,lanawachowski ke...
Name: soup, dtype: object

We import `CountVectorizer` and `cosine_similarity` to compute the cosine similarity.

In [47]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

The variable `cosine_sim2` holds the matrix of pairwise cosine similarities between the soups of all movies.

In [48]:
count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(recom_netflix_2000['soup'])
cosine_sim2 = cosine_similarity(count_matrix, count_matrix)
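To make the computation more concrete, here is a from-scratch sketch of the idea: each soup becomes a vector of token counts, and the similarity of two movies is the cosine of the angle between their vectors. This is an illustration in plain Python, not scikit-learn's implementation (for example, it does no stop-word removal), and the soups and helper names are made up.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Count whitespace-separated tokens in a string."""
    return Counter(text.split())

def cosine(a, b):
    """Cosine similarity between two token-count dictionaries."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Made-up soups in the same style as the ones built above.
soups = {
    "movie_a": "quentintarantino dramas crime",
    "movie_b": "quentintarantino dramas western",
    "movie_c": "peterjackson fantasy adventure",
}
vectors = {name: bag_of_words(text) for name, text in soups.items()}

print(cosine(vectors["movie_a"], vectors["movie_b"]))  # two of three tokens shared
print(cosine(vectors["movie_a"], vectors["movie_c"]))  # no tokens shared
```

Because spaces were removed from names during cleaning, a director such as `quentintarantino` is a single token, so two movies only match on it when the full name is the same.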

Create a series `indices` from dataframe `recom_netflix_2000` which contains movie titles.

In [49]:
recom_netflix_2000=recom_netflix_2000.reset_index()
indices = pd.Series(recom_netflix_2000.index, index=recom_netflix_2000['title'])

We define a function `get_recommendations_new()` that returns the top 10 recommended movies based on the multiple factors that we used for content-based filtering.

In [50]:
def get_recommendations_new(title, cosine_sim=cosine_sim2):
    title=title.replace(' ','').lower()
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar movies
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return recom_netflix_2000[['title', 'release_year']].iloc[movie_indices]
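One detail worth noting: slicing `sim_scores[1:11]` assumes the queried movie is always ranked first. If another movie has an identical soup (similarity 1.0), the queried movie may not be in position 0. A minimal sketch of a ranking step that excludes the query explicitly, using a made-up similarity matrix and a hypothetical helper name `top_k_similar`:

```python
import numpy as np

def top_k_similar(sim_matrix, idx, k=10):
    """Return indices of the k rows most similar to row idx, excluding idx."""
    scores = np.asarray(sim_matrix[idx], dtype=float).copy()
    scores[idx] = -np.inf                        # never recommend the query itself
    order = np.argsort(-scores, kind="stable")   # highest score first
    return order[:k].tolist()

# Made-up 4x4 similarity matrix; item 1 is an exact duplicate of item 0.
sim = np.array([
    [1.0, 1.0, 0.2, 0.5],
    [1.0, 1.0, 0.2, 0.5],
    [0.2, 0.2, 1.0, 0.1],
    [0.5, 0.5, 0.1, 1.0],
])
print(top_k_similar(sim, idx=1, k=2))
```

In this toy case, sorting and dropping the first entry would return the query (item 1) as its own recommendation, whereas masking it keeps the duplicate (item 0) and the next-best match.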

Suppose that we want to get the top 10 movie recommendations for Inception. Let's use the `get_recommendations_new()` function.

In [51]:
get_recommendations_new('inception', cosine_sim2)

Unnamed: 0,title,release_year
696,dragonheart,1996
1037,clashofthetitans,2010
235,limitless,2011
271,bleachthemovie:hellverse,2010
465,thebookofeli,2010
1179,æonflux,2005
1038,residentevil:afterlife,2010
1410,skyline,2010
790,transcendence,2014
1450,singularity,2017


The function returned the following movie recommendations for Inception:
1. Dragonheart (1996)
2. Clash of the Titans (2010)
3. Limitless (2011)
4. Bleach the Movie: Hell Verse (2010)
5. The Book of Eli (2010)
6. Æon Flux (2005)
7. Resident Evil: Afterlife (2010)
8. Skyline (2010)
9. Transcendence (2014)
10. Singularity (2017)

Suppose that we want to get the top 10 movie recommendations for Pulp Fiction. Let's use the `get_recommendations_new()` function once again.

In [52]:
get_recommendations_new('pulp fiction', cosine_sim2)

Unnamed: 0,title,release_year
111,thehatefuleight,2015
210,jackiebrown,1997
287,coachcarter,2005
293,meanstreets,1973
304,alicedoesn'tlivehereanymore,1974
25,taxidriver,1976
244,carrie,1976
247,truegrit,1969
82,catonahottinroof,1958
411,theinterview,1998


The function returned the following movie recommendations for Pulp Fiction: 
1. The Hateful Eight (2015)
2. Jackie Brown (1997)
3. Coach Carter (2005)
4. Mean Streets (1973)
5. Alice Doesn't Live Here Anymore (1974)
6. Taxi Driver (1976)
7. Carrie (1976)
8. True Grit (1969)
9. Cat on a Hot Tin Roof (1958)
10. The Interview (1998)

# Conclusion
To sum the Notebook up, the researchers initially merged the three datasets and performed data cleaning. In the Exploratory Data Analysis portion, the researchers visualized the top genres, the most-voted movies, the frequency of age-based ratings, and the production of films per year through various graphs. In the confidence interval portion, the researchers estimated a 95% confidence interval for the mean rating of movies in the following genres: Dramas, Action & Adventure, and Sci-Fi & Fantasy. In the statistical inference portion, the researchers tested whether there is a significant difference between the average ratings in 1989-2004 and 2005-2020 by using a t-test. Lastly, in the recommender system portion, the researchers used a content-based approach to recommend movies for a given movie title, with similarity computed through cosine similarity.