# Module 4 Final Project

### Problem 2: Recommendation System

If you choose the Recommendation System option, you will be making movie recommendations based on the MovieLens dataset from the GroupLens research lab at the University of Minnesota. Unless you are planning to run your analysis on a paid cloud platform, we recommend that you use the "small" dataset containing 100,000 user ratings (and potentially, only a particular subset of that dataset).

**Your task is to:**

Build a model that provides top 5 movie recommendations to a user, based on their ratings of other movies.

The MovieLens dataset is a "classic" recommendation system dataset that is used in numerous academic papers and machine learning proofs-of-concept. You will need to decide how the user will provide their ratings of other movies, and formulate a more specific business problem within the general context of "recommending movies".

**Collaborative Filtering**

At minimum, your recommendation system must use collaborative filtering. If you have time, consider implementing a hybrid approach, e.g. using collaborative filtering as the primary mechanism, but using content-based filtering to address the cold start problem.
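
One way to picture the hybrid idea is a simple router: users with too little rating history go to a content-based fallback, everyone else goes to the collaborative model. This is only a sketch under stated assumptions. The function `recommend_hybrid`, the two recommender callables, and the `min_ratings` threshold are all hypothetical names chosen for illustration, and the toy recommenders return placeholder strings rather than real movies.

```python
def recommend_hybrid(user_id, user_rating_counts, cf_recommender,
                     content_recommender, min_ratings=5, n=5):
    """Pick a recommender based on how much is known about the user.

    All names here are illustrative assumptions, not part of any library.
    user_rating_counts: dict mapping userId -> number of ratings on record.
    cf_recommender / content_recommender: callables (user_id, n) -> list.
    """
    n_ratings = user_rating_counts.get(user_id, 0)
    if n_ratings < min_ratings:
        # cold start: too little history for collaborative filtering
        return content_recommender(user_id, n)
    return cf_recommender(user_id, n)


# toy stand-ins for the two recommenders
def toy_cf(user_id, n):
    return [f"cf_movie_{i}" for i in range(n)]


def toy_content(user_id, n):
    return [f"content_movie_{i}" for i in range(n)]


# made-up rating counts: user 1 has plenty of history, user 2 has very little
counts = {1: 40, 2: 3}
warm = recommend_hybrid(1, counts, toy_cf, toy_content)
cold = recommend_hybrid(2, counts, toy_cf, toy_content)
unknown = recommend_hybrid(999, counts, toy_cf, toy_content)
```

A brand-new user (here `999`) has no entry in `counts`, so they are treated the same way as a user with too few ratings.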

**Evaluation**

The MovieLens dataset has explicit ratings, so evaluating your model is straightforward. Still, you should give some thought to the choice of metric. Since the ratings are ordinal, we can treat this as a regression problem, but regression offers several metrics to choose from: RMSE, MAE, etc.
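
To make the difference between the two metrics concrete, here is a minimal sketch that computes RMSE and MAE by hand with NumPy. The helper names `rmse` and `mae` and the toy ratings are assumptions made up for illustration; they are not taken from the dataset.

```python
import numpy as np


def rmse(actual, predicted):
    """Root mean squared error: squares errors, so large misses weigh more."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


def mae(actual, predicted):
    """Mean absolute error: every unit of error counts the same."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(actual - predicted)))


# toy ratings, made up for illustration
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 4.0]

rmse_value = rmse(actual, predicted)
mae_value = mae(actual, predicted)
print("RMSE:", rmse_value)
print("MAE: ", mae_value)
```

Because RMSE squares each error before averaging, the single two-star miss in the toy data pulls RMSE above MAE. If large misses matter more to the business problem than many small ones, RMSE is the more sensitive choice.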

# Movie Recommendation System

In [1]:
# importing libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
# loading csv's into dataframes
ratings = pd.read_csv('ratings.csv')
tags = pd.read_csv('tags.csv')
movies = pd.read_csv('movies.csv')
links = pd.read_csv('links.csv')

### Content of Files

**ratings:** 
- Each row represents one rating of one movie by one user
- Ratings are made on a 5-star scale, with half-star increments (0.5 stars to 5.0 stars)

**tags:** 
- Each row represents one tag applied to one movie by one user
- Tags are user-generated metadata about movies. Each tag is typically a single word or short phrase. The meaning, value, and purpose of a particular tag is determined by each user

**movies:** 
- Each row represents one movie
- Movie titles are entered manually or imported from https://www.themoviedb.org/, and include the year of release in parentheses. Errors and inconsistencies may exist in these titles

**links:** 
- Identifiers that can be used to link to other sources of movie data; each row represents one movie


In [3]:
ratings.head()

| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 1 | 4.0 | 964982703 |
| 1 | 1 | 3 | 4.0 | 964981247 |
| 2 | 1 | 6 | 4.0 | 964982224 |
| 3 | 1 | 47 | 5.0 | 964983815 |
| 4 | 1 | 50 | 5.0 | 964982931 |


In [4]:
type(ratings)

pandas.core.frame.DataFrame

In [None]:
tags.head()

In [None]:
movies.head()

In [None]:
links.head()

In [None]:
# merging movie and rating dataframes 
df = ratings.merge(movies,on='movieId', how='left')

In [None]:
# dropping timestamp column
df.drop(columns = 'timestamp',inplace=True)

In [None]:
# creating year column
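
A minimal sketch of how the year column could be created, assuming titles end with the release year in parentheses as described under Content of Files. It runs on a toy frame rather than the merged `df`, and the regular expression is an assumption about the title format, so titles that deviate from it end up with a missing year.

```python
import pandas as pd

# toy frame standing in for the merged `df`; the last title has no year on purpose
toy = pd.DataFrame({'title': ['Toy Story (1995)', 'Heat (1995)', 'Untitled Movie']})

# pull the four-digit year from a trailing "(YYYY)"; non-matching titles become NaN
year_text = toy['title'].str.extract(r'\((\d{4})\)\s*$', expand=False)
toy['year'] = pd.to_numeric(year_text, errors='coerce')

print(toy)
```

The same two lines could be applied to `df['title']`. Checking `df['year'].isna().sum()` afterwards would show how many titles did not match the assumed pattern.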


### EDA

In [None]:
df.head()

##### genre categories

In [None]:
# counting how often each genre appears across all rated movies
# (a movie with several genres, e.g. "Action|Comedy", counts once per genre)
genre_counts = df['genres'].str.split('|').explode().value_counts()
# creating df sorted by count, most common genre first
sorted_genres = genre_counts.rename_axis('Genre').reset_index(name='Count')
sorted_genres

In [None]:
# barplot of genres (drawn once, on the axes created below)
fig, ax = plt.subplots(figsize=(20, 10))
# keyword arguments are required by recent seaborn versions
sns.barplot(x='Genre', y='Count', data=sorted_genres, ax=ax)

ax.tick_params(axis='x', rotation=80)
ax.set_ylabel("Count of Genre")
ax.set_title("Count of Genres in Dataset");


##### ratings

In [None]:
# counting ratings per star value; a separate name keeps the `ratings` dataframe intact
rating_counts = df['rating'].value_counts().sort_index(ascending=False)

In [None]:
fig, ax = plt.subplots()
rating_counts.plot(kind="bar", ax=ax)
ax.set_ylabel('Count')
ax.set_xlabel('Ratings')
ax.set_title('Frequency of Ratings by Users');

### Collaborative Filtering 

In [None]:
!pip install scikit-surprise

In [5]:
# dropping timestamp column so that ratings holds only userId, movieId and rating,
# the three columns (in that order) that surprise's load_from_df expects
ratings = ratings.drop(columns='timestamp')

In [6]:
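# reading in the ratings dataset using surprise
from surprise import Reader, Dataset
from surprise.model_selection import train_test_split
# note: Reader() defaults to rating_scale=(1, 5), while MovieLens ratings start at 0.5;
# Reader(rating_scale=(0.5, 5)) would match the data, but the results below used the default
reader = Reader()
data = Dataset.load_from_df(ratings, reader)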

In [7]:
# checking how many users and items in our dataset 
dataset = data.build_full_trainset()
print('Number of users: ', dataset.n_users, '\n')
print('Number of items: ', dataset.n_items)

Number of users:  610 

Number of items:  9724


In [8]:
# importing libraries for modeling (numpy is already imported as np above)
from surprise.model_selection import cross_validate, GridSearchCV
from surprise.prediction_algorithms import SVD, KNNWithMeans, KNNBasic, KNNBaseline

In [9]:
# performing gridsearch with SVD
params = {'n_factors': [20, 50, 100],
         'reg_all': [0.02, 0.05, 0.1]}
g_s_svd = GridSearchCV(SVD,param_grid=params,n_jobs=-1)
g_s_svd.fit(data)

In [10]:
# determining best score and params
print(g_s_svd.best_score)
print(g_s_svd.best_params)

{'rmse': 0.868711086651938, 'mae': 0.6675606029080111}
{'rmse': {'n_factors': 100, 'reg_all': 0.05}, 'mae': {'n_factors': 100, 'reg_all': 0.05}}
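
Once a model has been chosen, the top 5 recommendations from the task statement could be produced along the following lines. This is a sketch, not the notebook's method: `top_n_recommendations` is a hypothetical helper, `predict` is any callable that returns an estimated rating, and the toy scores are made up so the example runs without a fitted model.

```python
def top_n_recommendations(user_id, candidate_movie_ids, already_rated, predict, n=5):
    """Return the n unseen movies with the highest predicted rating.

    predict is any callable (user_id, movie_id) -> estimated rating.
    Returns a list of (movie_id, estimated_rating) pairs, best first.
    """
    # skip movies the user has already rated
    unseen = [m for m in candidate_movie_ids if m not in already_rated]
    scored = [(m, predict(user_id, m)) for m in unseen]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


# toy predictor standing in for a fitted model (scores are made up)
toy_scores = {10: 4.5, 20: 3.0, 30: 4.9, 40: 2.5, 50: 4.0, 60: 3.8, 70: 4.2}
recs = top_n_recommendations(
    user_id=1,
    candidate_movie_ids=toy_scores.keys(),
    already_rated={30},
    predict=lambda uid, mid: toy_scores[mid],
)
print(recs)
```

With Surprise, `predict` could wrap a fitted algorithm, for example `lambda uid, mid: algo.predict(uid, mid).est`, where `algo` is `g_s_svd.best_estimator['rmse']` after calling `algo.fit(data.build_full_trainset())`.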


In [11]:
# cross validating with KNN basic 
knn_basic = KNNBasic(sim_options={'name':'pearson', 'user_based':True})
cv_knn_basic = cross_validate(knn_basic, data, n_jobs=-1)

In [12]:
for i in cv_knn_basic.items():
    print(i)
print('-----------------------')
print(np.mean(cv_knn_basic['test_rmse']))

('test_rmse', array([0.97361981, 0.96966963, 0.97440021, 0.9823685 , 0.96740846]))
('test_mae', array([0.75218065, 0.7478337 , 0.75167886, 0.75956303, 0.746431  ]))
('fit_time', (0.351093053817749, 0.3658719062805176, 0.3702709674835205, 0.36009883880615234, 0.3330380916595459))
('test_time', (1.1610181331634521, 1.1606900691986084, 1.2329440116882324, 1.1385619640350342, 1.1787469387054443))
-----------------------
0.9734933220784328
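
The KNN models above use `'name': 'pearson'` with `'user_based': True`, meaning users are compared by how their ratings move together on the movies they have both rated. The sketch below illustrates that idea with NumPy. It is not Surprise's implementation; `pearson_similarity` and the toy users are assumptions made up for illustration.

```python
import numpy as np


def pearson_similarity(ratings_u, ratings_v):
    """Pearson correlation over the movies both users rated.

    ratings_u, ratings_v: dicts mapping movieId -> rating.
    Returns 0.0 when there are fewer than two co-rated movies
    or when either user's co-rated ratings have no variance.
    """
    common = sorted(set(ratings_u) & set(ratings_v))
    if len(common) < 2:
        return 0.0
    u = np.array([ratings_u[m] for m in common], dtype=float)
    v = np.array([ratings_v[m] for m in common], dtype=float)
    u_centered = u - u.mean()
    v_centered = v - v.mean()
    denom = np.sqrt((u_centered ** 2).sum() * (v_centered ** 2).sum())
    if denom == 0:
        return 0.0
    return float((u_centered * v_centered).sum() / denom)


# toy users; ratings are made up
alice = {1: 5.0, 2: 3.0, 3: 4.0, 4: 1.0}
bob = {1: 4.0, 2: 2.0, 3: 3.0, 5: 5.0}    # rates one star below alice on shared movies
carol = {1: 1.0, 2: 5.0, 3: 2.0}          # tends to disagree with alice

sim_alice_bob = pearson_similarity(alice, bob)
sim_alice_carol = pearson_similarity(alice, carol)
print(sim_alice_bob, sim_alice_carol)
```

Because the ratings are mean-centered, a user who is consistently harsher by one star (like `bob`) is still treated as a close neighbor. Plain cosine similarity on raw ratings would not correct for that offset.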


In [13]:
# cross validating with KNN baseline 
knn_baseline = KNNBaseline(sim_options={'name':'pearson', 'user_based':True})
cv_knn_baseline = cross_validate(knn_baseline,data)

Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.


In [14]:
for i in cv_knn_baseline.items():
    print(i)

np.mean(cv_knn_baseline['test_rmse'])

('test_rmse', array([0.87824802, 0.88788667, 0.8770327 , 0.87187538, 0.87506507]))
('test_mae', array([0.67235431, 0.67386793, 0.67114579, 0.66666312, 0.66795117]))
('fit_time', (0.7323458194732666, 0.6480069160461426, 0.6585133075714111, 0.6702239513397217, 0.6465158462524414))
('test_time', (1.6693470478057861, 1.729867935180664, 1.8060507774353027, 1.7526829242706299, 1.7236599922180176))


0.8780215682451843

### Content-Based Filtering