#**Recommender System with the help of Correlation**

###We build a basic recommendation system that suggests the movies most similar to a particular movie. This is not a very robust recommendation system: it simply returns the movies most similar to your movie choice.

In [None]:

import numpy as np
import pandas as pd

**Import the datasets**

In [None]:
movies_data=pd.read_csv('../input/movies.csv')
movies_data.head()

In [None]:
ratings_data=pd.read_csv('../input/ratings.csv')
ratings_data.head()

####**Merge both the dataframes** 

In [None]:
movie_ratings_combined=pd.merge(ratings_data,movies_data,on='movieId')
movie_ratings_combined
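
To see what the merge does, here is a minimal sketch on toy data. The frames `toy_ratings` and `toy_movies` and their values are made up for illustration. They are not the real CSV contents, which may have more columns.

```python
import pandas as pd

# Toy stand-ins for ratings.csv and movies.csv (values are invented)
toy_ratings = pd.DataFrame({
    'userId': [1, 1, 2, 3],
    'movieId': [10, 20, 10, 30],
    'rating': [4.0, 3.5, 5.0, 2.0],
})
toy_movies = pd.DataFrame({
    'movieId': [10, 20, 40],
    'title': ['Movie A', 'Movie B', 'Movie D'],
})

# pd.merge defaults to how='inner': only movieIds present in BOTH frames survive.
# movieId 30 has no title and movieId 40 has no ratings, so both are dropped.
toy_combined = pd.merge(toy_ratings, toy_movies, on='movieId')
print(toy_combined)
```

Because the join is an inner join, a movie with no ratings never reaches the combined frame.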

##**Let's visualize the data we have**

In [None]:
import matplotlib.pyplot as plot
import seaborn as sns
sns.set_style('white')
%matplotlib inline

**Let's get the average rating for each movie, which will be useful later**

In [None]:
movie_ratings_combined.groupby('title')['rating'].mean().sort_values(ascending=False).head()

**The number of ratings each movie has received is the key quantity here: we will use it to filter out movies that have a high average rating but only a few voters.**

*A movie with 20 voters and an average rating of 4 out of 5 should be preferred over a 
movie with 2 voters and an average rating of 5 out of 5.*


---


We will perform the filtering in a later step. For now, let's focus on what matters more: 
combining the average ratings and the number of ratings into one table. 


---



In [None]:
movie_ratings_combined.groupby('title')['rating'].count().sort_values(ascending=False).head()

In [None]:
ratings=pd.DataFrame(movie_ratings_combined.groupby('title')['rating'].mean())
ratings.head()

##**Create a ratings dataframe that will keep a log of movie ratings and number of ratings for all the movies**

In [None]:
ratings['Number of Ratings']=movie_ratings_combined.groupby('title')['rating'].count()
ratings.head()
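
As an alternative sketch, the mean and the count can be computed in a single `groupby` pass. The toy frame below is invented for illustration, and this is only one possible way to build the same two columns.

```python
import pandas as pd

# Invented toy data: three ratings for Movie A, one for Movie B
toy = pd.DataFrame({
    'title': ['Movie A', 'Movie A', 'Movie A', 'Movie B'],
    'rating': [4.0, 5.0, 3.0, 5.0],
})

# Aggregate mean and count together, then rename to the column names used above
toy_summary = (
    toy.groupby('title')['rating']
       .agg(['mean', 'count'])
       .rename(columns={'mean': 'rating', 'count': 'Number of Ratings'})
)
print(toy_summary)
```

Here Movie B has the higher average but rests on a single vote, which is exactly the case the count column lets us filter out.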

**This is how our 'Number of Ratings' Distribution looks**

In [None]:
plot.figure(figsize=(10,5))
ratings['Number of Ratings'].hist(bins=30);

**Next, we plot the histogram of the average movie ratings.**

###We find that most movies have an average rating between 3 and 4.

---

The distribution is roughly bell-shaped, resembling a normal distribution curve.

In [None]:
plot.figure(figsize=(10,5))
ratings['rating'].hist(bins=70);

###We can draw a jointplot for our dataset, which gives both the scatter plot and the histograms

**Notice that the dense areas of the scatterplot show where most movies fall in terms of average rating and 'Number of Ratings'.**
Many movies have only a handful of ratings.

In [None]:
sns.jointplot(x='rating',y='Number of Ratings',data=ratings,alpha=0.5);

###**Let's create a matrix with the user ids as rows (Y-axis) and the movie titles as columns (X-axis). Each cell holds the rating that a user (row) gave to a movie (column). Movies that a user has not rated or seen are represented as NaN.**

In [None]:
movies_ratings_pivot=movie_ratings_combined.pivot_table(index='userId',columns='title',values='rating')
movies_ratings_pivot.head()
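
A minimal sketch of the same pivot on invented toy data shows where the NaN values come from. The `sparsity` variable is a hypothetical extra, added here only to quantify how empty the matrix is.

```python
import pandas as pd

# Invented toy ratings in long format: one row per (user, movie) rating
toy = pd.DataFrame({
    'userId': [1, 1, 2, 3],
    'title': ['Movie A', 'Movie B', 'Movie A', 'Movie B'],
    'rating': [4.0, 3.0, 5.0, 2.0],
})

# Wide format: one row per user, one column per title.
# pivot_table averages duplicate (userId, title) pairs by default.
toy_pivot = toy.pivot_table(index='userId', columns='title', values='rating')
print(toy_pivot)

# Fraction of user/movie cells with no rating
sparsity = toy_pivot.isna().to_numpy().mean()
print(sparsity)
```

User 2 never rated Movie B and user 3 never rated Movie A, so those two cells are NaN.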

##We get the movies with the most ratings by:

In [None]:
ratings.sort_values('Number of Ratings',ascending=False).head(10)

##Now let's get movie recommendations based on a movie of our choice, in this case 'Forrest Gump'.


---


You can get recommendations for a movie of your choice by replacing 'Forrest Gump (1994)' with its title, written exactly as it appears in the 'title' column.


---

**First we get the user ratings for Forrest Gump from the 'movies_ratings_pivot' matrix we created**


In [None]:
forrestgump_user_ratings=movies_ratings_pivot['Forrest Gump (1994)']
forrestgump_user_ratings.head(10)
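
The column lookup above needs the exact title string, including the year. If you are unsure of the exact spelling, a substring search over the titles can help. This is a small sketch on an invented `toy_movies` frame. With the real data you would search `movies_data['title']` instead.

```python
import pandas as pd

# Invented stand-in for movies_data
toy_movies = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['Toy Story (1995)', 'Forrest Gump (1994)', 'Heat (1995)'],
})

# Case-insensitive plain-text search (regex=False treats the pattern literally)
matches = toy_movies[
    toy_movies['title'].str.contains('forrest gump', case=False, regex=False)
]
print(matches['title'].tolist())
```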

###Then we use the **corrwith() method** to get the correlation between Forrest Gump and every other movie, based on the ratings in the matrix.

**DataFrame.corrwith() computes the pairwise correlation between the rows or columns of a DataFrame and those of a Series or another DataFrame.**

To understand more about correlation please check the [link](https://www.statisticshowto.com/probability-and-statistics/correlation-coefficient-formula/#:~:text=A%20correlation%20coefficient%20of%201,perfect%20correlation%20with%20foot%20length.&text=Zero%20means%20that%20for%20every,a%20positive%20or%20negative%20increase)
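
For reference, the Pearson correlation between two movies $a$ and $b$ can be written as below. The symbols are introduced here for illustration: $U_{ab}$ is the set of users who rated both movies, $r_{u,a}$ is the rating user $u$ gave to movie $a$, and $\bar{r}_a$ is the mean rating of movie $a$ over $U_{ab}$.

$$
\rho_{ab} = \frac{\sum_{u \in U_{ab}} (r_{u,a} - \bar{r}_a)(r_{u,b} - \bar{r}_b)}{\sqrt{\sum_{u \in U_{ab}} (r_{u,a} - \bar{r}_a)^2}\;\sqrt{\sum_{u \in U_{ab}} (r_{u,b} - \bar{r}_b)^2}}
$$

The value lies between -1 and 1. When $U_{ab}$ contains only two users with different ratings, $\rho_{ab}$ is always exactly 1 or -1, which is why movies with very few ratings can show misleadingly perfect correlations.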


---



####We correlate every movie column of the matrix with 'forrestgump_user_ratings'. The two are aligned on userId, so each correlation is computed over the users who rated both movies.

In [None]:
similar_movies_forrestgump=movies_ratings_pivot.corrwith(forrestgump_user_ratings)
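
To make the behaviour of `corrwith()` concrete, here is a small sketch on an invented matrix. The column names and ratings are made up, and 'Target' plays the role of Forrest Gump.

```python
import numpy as np
import pandas as pd

# Invented user x movie matrix; NaN means the user did not rate the movie
toy_pivot = pd.DataFrame(
    {
        'Target': [5.0, 4.0, 3.0, 2.0, np.nan],
        'Same taste': [4.0, 3.0, 2.0, 1.0, 5.0],
        'Opposite taste': [1.0, 2.0, 3.0, 4.0, 5.0],
        'Never co-rated': [np.nan, np.nan, np.nan, np.nan, 4.0],
    },
    index=pd.Index([1, 2, 3, 4, 5], name='userId'),
)

# Correlate each column with the target column, aligned on userId.
# Only users who rated both movies contribute to each correlation.
toy_corr = toy_pivot.corrwith(toy_pivot['Target'])
print(toy_corr)
```

'Same taste' moves with the target and gets a correlation close to 1, 'Opposite taste' gets close to -1, and 'Never co-rated' shares no users with the target, so its correlation is NaN. Such NaN entries are the ones dropped in the next step.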

###Convert the Series to a DataFrame and drop the NaN values

In [None]:
correlation_forrestgump=pd.DataFrame(similar_movies_forrestgump,columns=['Correlation'])
correlation_forrestgump.dropna(inplace=True)
correlation_forrestgump.head(10)

### From the above we get the correlation of different movies with 'Forrest Gump'. If we sort this dataframe by correlation, the most similar movies should appear at the top.


---


**But the sorted dataframe we obtain below doesn't help much, because many of its movies were rated by only a handful of the users who also rated Forrest Gump, which makes their correlations unreliable.**

**So now we will have to filter out the movies with lower number of ratings**


---


Note: We perform this filtering after viewing the sorted correlation data in the next step.

In [None]:
correlation_forrestgump.sort_values(by='Correlation', ascending=False).head(10)
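
The sketch below, on invented data, illustrates why the unfiltered ranking is misleading. The `co_raters` count is a hypothetical diagnostic added here. The notebook itself filters on the total 'Number of Ratings' instead.

```python
import numpy as np
import pandas as pd

# Invented matrix: 'Obscure' shares only two raters with the target
toy_pivot = pd.DataFrame(
    {
        'Target': [5.0, 3.0, 4.0, 2.0, 1.0],
        'Obscure': [4.0, 2.0, np.nan, np.nan, np.nan],
        'Popular': [4.0, 3.0, 5.0, 1.0, 2.0],
    },
    index=pd.Index([1, 2, 3, 4, 5], name='userId'),
)
target = toy_pivot['Target']

corr = toy_pivot.corrwith(target)

# Number of users who rated both the target and each movie:
# keep rows where the target is rated, then count non-NaN cells per column
co_raters = toy_pivot.loc[target.notna()].count()

print(pd.DataFrame({'Correlation': corr, 'Co-raters': co_raters}))
```

'Obscure' ranks above 'Popular' even though its correlation rests on only two users, while 'Popular' is backed by all five.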

**We get the number of ratings for movies in the correlation_forrestgump from the 'ratings' dataframe we had created in the earlier steps.**

In [None]:
correlation_forrestgump=correlation_forrestgump.join(ratings['Number of Ratings'])
correlation_forrestgump.head()

**We use '100' as the threshold value for the 'Number of Ratings' to filter out the movies. '100' is chosen based on the histogram we observed earlier.**

## **And we have the final list of movies similar to Forrest Gump.**

In [None]:
correlation_forrestgump[correlation_forrestgump['Number of Ratings']>100].sort_values('Correlation',ascending=False).head(10)
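
The steps above can be gathered into one reusable function. This is a sketch only: `similar_movies`, `min_ratings` and `top_n` are hypothetical names, and the toy data is invented. Unlike the cells above, the sketch also drops the query movie itself, which otherwise appears with a correlation of 1.

```python
import pandas as pd

def similar_movies(pivot, ratings, title, min_ratings=100, top_n=10):
    """Hypothetical helper: rank movies by rating correlation with `title`."""
    # Correlate every movie column with the chosen movie and drop NaN results
    corr = pivot.corrwith(pivot[title]).dropna().to_frame('Correlation')
    # Attach the number of ratings and keep only sufficiently rated movies
    corr = corr.join(ratings['Number of Ratings'])
    corr = corr[corr['Number of Ratings'] > min_ratings]
    # Departure from the notebook: remove the query movie from its own results
    corr = corr.drop(index=title, errors='ignore')
    return corr.sort_values('Correlation', ascending=False).head(top_n)

# Invented toy matrix: four users, three movies
toy_pivot = pd.DataFrame(
    {'A': [5.0, 4.0, 2.0, 1.0], 'B': [4.0, 5.0, 1.0, 2.0], 'C': [1.0, 2.0, 5.0, 4.0]},
    index=pd.Index([1, 2, 3, 4], name='userId'),
)
toy_ratings = pd.DataFrame({
    'rating': toy_pivot.mean(),
    'Number of Ratings': toy_pivot.count(),
})

# A low threshold is used here only because the toy data is tiny
result = similar_movies(toy_pivot, toy_ratings, 'A', min_ratings=2)
print(result)
```

With the real data, the call would look like `similar_movies(movies_ratings_pivot, ratings, 'Forrest Gump (1994)')`, keeping the threshold of 100 chosen above.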