# Recommendation System

If you choose the Recommendation System option, you will be making movie recommendations based on the [MovieLens](https://grouplens.org/datasets/movielens/latest/) dataset from the GroupLens research lab at the University of Minnesota.  Unless you are planning to run your analysis on a paid cloud platform, we recommend that you use the "small" dataset containing 100,000 user ratings (and potentially, only a particular subset of that dataset).

Your task is to:

> Build a model that provides top 5 movie recommendations to a user, based on their ratings of other movies.

The MovieLens dataset is a "classic" recommendation system dataset that is used in numerous academic papers and machine learning proofs of concept.  You will need to define how the user will provide their ratings of other movies, and formulate a more specific business problem within the general context of "recommending movies".

#### Collaborative Filtering

At minimum, your recommendation system must use collaborative filtering.  If you have time, consider implementing a hybrid approach, e.g. using collaborative filtering as the primary mechanism, but using content-based filtering to address the [cold start problem](https://en.wikipedia.org/wiki/Cold_start_(computing)).
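
To make the idea concrete, here is a minimal sketch of item-based collaborative filtering on a tiny, made-up rating matrix. The toy data, the function names `item_similarity` and `recommend`, and the choice of cosine similarity with missing ratings filled as 0 are all assumptions for illustration, not a required design.

```python
import numpy as np
import pandas as pd

# Toy user-by-movie rating matrix (hypothetical data); NaN means "not rated".
toy = pd.DataFrame(
    {
        "Movie A": [5.0, 4.0, np.nan, 1.0],
        "Movie B": [4.0, 5.0, 1.0, np.nan],
        "Movie C": [np.nan, 1.0, 5.0, 4.0],
        "Movie D": [1.0, np.nan, 4.0, 5.0],
    },
    index=pd.Index([1, 2, 3, 4], name="userId"),
)


def item_similarity(matrix):
    """Cosine similarity between movie columns, treating missing ratings as 0."""
    filled = matrix.fillna(0).to_numpy()
    norms = np.linalg.norm(filled, axis=0)
    norms[norms == 0] = 1.0  # avoid dividing by zero for movies nobody rated
    sim = (filled.T @ filled) / np.outer(norms, norms)
    return pd.DataFrame(sim, index=matrix.columns, columns=matrix.columns)


def recommend(matrix, user_id, n=5):
    """Score each unseen movie as a similarity-weighted average of the user's ratings."""
    sim = item_similarity(matrix)
    user_ratings = matrix.loc[user_id].dropna()
    unseen = matrix.columns.difference(user_ratings.index)
    weights = sim.loc[unseen, user_ratings.index]
    scores = (weights @ user_ratings) / weights.sum(axis=1).replace(0, np.nan)
    return scores.dropna().sort_values(ascending=False).head(n)


top = recommend(toy, user_id=1, n=5)
print(top)
```

User 1 has only one unseen movie in this toy matrix, so the result has a single row. On the real data the same function would return up to `n` titles, though a dense similarity matrix over every movie may be too large to build without first filtering to a subset.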

#### Evaluation

The MovieLens dataset has explicit ratings, so evaluating your model is relatively straightforward.  But you should give some thought to the question of metrics.  Since the ratings are ordinal values on a 0.5 to 5 scale, we can treat this like a regression problem.  Even so, there are several regression metrics to choose from: RMSE, MAE, etc.  [Here](http://fastml.com/evaluating-recommender-systems/) are some further ideas.
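
As a quick illustration of how the two metrics named above differ, the sketch below computes RMSE and MAE with NumPy on a handful of made-up ratings. The `actual` and `predicted` values are hypothetical; in practice they would come from a held-out test set and your model's predictions.

```python
import numpy as np


def rmse(actual, predicted):
    """Root mean squared error: penalizes large misses more heavily."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


def mae(actual, predicted):
    """Mean absolute error: the average size of a miss, in rating points."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(actual - predicted)))


# Hypothetical held-out ratings and model predictions.
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 3.0]

print("RMSE:", rmse(actual, predicted))  # 0.75
print("MAE: ", mae(actual, predicted))   # 0.625
```

RMSE is never smaller than MAE on the same errors, and the gap widens when a few predictions are far off, so reporting both gives a rough sense of how the errors are spread.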

### imports

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
movies= pd.read_csv('../../../data/movies.csv')

In [None]:
movies.head(10)

In [None]:
links= pd.read_csv('../../../data/links.csv')
links.head()

In [None]:
ratings= pd.read_csv('../../../data/ratings.csv')
ratings.head()

In [None]:
tags= pd.read_csv('../../../data/tags.csv')
tags.head()

### merge ratings and movie title/genre

In [None]:
rated_movies=pd.merge(ratings, movies, on='movieId')
rated_movies.head()

In [None]:
# drop timestamp
rated_movies = rated_movies.drop(['timestamp'],axis=1)
rated_movies.head()

In [None]:
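def proj_eda(df):
    """Summarize nulls, dtypes, and basic statistics for each column of df."""
    eda_df = {}
    eda_df['null_sum'] = df.isnull().sum()
    eda_df['null_pct'] = df.isnull().mean()
    eda_df['dtypes'] = df.dtypes
    eda_df['count'] = df.count()
    # numeric_only=True skips text columns such as title and genres,
    # which would otherwise raise a TypeError in recent pandas versions
    eda_df['mean'] = df.mean(numeric_only=True)
    eda_df['median'] = df.median(numeric_only=True)
    eda_df['min'] = df.min()
    eda_df['max'] = df.max()

    return pd.DataFrame(eda_df)

proj_eda(rated_movies)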

### get average rating and number of ratings

In [None]:
rated = pd.DataFrame(rated_movies.groupby('title')['rating'].mean())
rated.sort_values('rating', ascending=False)

In [None]:
rated['num_rating'] = rated_movies.groupby('title')['rating'].count()
rated.head()

In [None]:
top_20=rated.sort_values('num_rating', ascending=False)[:20]

top_20.head()

### some visualizing

In [None]:
# figsize must be passed to plt.subplots to take effect
fig, ax = plt.subplots(1, 1, figsize=(20, 16))
a = rated['rating']  # mean rating per movie
ax.hist(a, bins=10)
ax.set_xticks([0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5])
ax.set_title('Distribution of Average Movie Ratings')
ax.set_xlabel('average rating')
ax.set_ylabel('number of movies')
plt.show()

In [None]:
# figsize must be passed to plt.subplots to take effect
fig, ax = plt.subplots(1, 1, figsize=(30, 20))

ax.barh(top_20.index, top_20.num_rating)
ax.set_title('20 Most Rated Movies')

ax.set_xlabel('number of ratings')
plt.show()

### matrix 

In [None]:
movie_matrix = rated_movies.pivot_table(index='userId', columns='title', values='rating')
movie_matrix.head()
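
One possible way to use a user-by-title matrix like this is to look for titles whose ratings correlate with those of a chosen movie. The sketch below does so on a small made-up dataset; the toy ratings, the helper name `similar_titles`, and the `min_overlap` threshold are assumptions for illustration rather than part of the analysis above.

```python
import pandas as pd

# Hypothetical long-format ratings, shaped like rated_movies.
toy_ratings = pd.DataFrame(
    {
        "userId": [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4],
        "title": ["A", "B", "C", "D", "A", "B", "C", "D",
                  "A", "B", "C", "A", "B", "C"],
        "rating": [5, 4, 1, 5, 4, 5, 2, 3, 2, 1, 5, 1, 2, 4],
    }
)
toy_matrix = toy_ratings.pivot_table(index="userId", columns="title", values="rating")


def similar_titles(matrix, title, min_overlap=3):
    """Rank other titles by Pearson correlation with `title`.

    Titles sharing fewer than `min_overlap` raters with `title` are dropped,
    because a correlation computed from a couple of users is unreliable.
    """
    target = matrix[title]
    corr = matrix.corrwith(target)
    # Number of users who rated both the target and each other title.
    both_rated = matrix.notna().astype(int).mul(target.notna().astype(int), axis=0)
    overlap = both_rated.sum()
    result = pd.DataFrame({"correlation": corr, "overlap": overlap})
    result = result[result["overlap"] >= min_overlap].drop(index=title)
    return result.dropna().sort_values("correlation", ascending=False)


result = similar_titles(toy_matrix, "A")
print(result)
```

In the toy data, title D correlates perfectly with A but only two users rated both, so the overlap filter removes it. The real matrix is very sparse, so a threshold of this kind is likely to matter even more there.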