# Rate 5, Get 5 - Recommendation System Project 

by Leana Critchell, Jacob Prebys and Dann Morr


<img src="../../src/figures/movielens_logo.png" alt="drawing" width="250"/>

## Table of Contents
- [Overview](#Overview)
- [Data Cleaning and Exploratory Data Analysis](#Data-Cleaning-and-Exploratory-Data-Analysis)
- [Models](#Models)
  - [Collaborative Filtering Model](#Collaborative-Filtering-Model)
  - [Content-Based Model](#Content-Based-Model)
- [Final Results](#Final-Results)
- [Future Work](#Future-Work)


## Overview

We aim to create a recommendation system based on the MovieLens dataset from the GroupLens research lab at the University of Minnesota. Furthermore, we would like to deploy a web app that will allow a user to enter ratings for movies they have seen and then, based on the model we have implemented, recommend movies that align with their interests.

## Data Cleaning and Exploratory Data Analysis

In [None]:
%load_ext autoreload
%autoreload 2

### Imports

In [None]:
# standard imports 

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

In [None]:
# modelling and processing imports

from surprise import Dataset, Reader
from surprise import accuracy

from surprise.model_selection import train_test_split, cross_validate

from surprise.prediction_algorithms import SVD
from surprise.prediction_algorithms import SVDpp
from surprise.prediction_algorithms import SlopeOne
from surprise.prediction_algorithms import NMF
from surprise.prediction_algorithms import NormalPredictor
from surprise.prediction_algorithms import KNNBaseline
from surprise.prediction_algorithms import KNNBasic
from surprise.prediction_algorithms import KNNWithMeans
from surprise.prediction_algorithms import KNNWithZScore
from surprise.prediction_algorithms import BaselineOnly
from surprise.prediction_algorithms import CoClustering

# aliased so it does not shadow surprise's train_test_split imported above
from sklearn.model_selection import train_test_split as sk_train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity

from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer

In [None]:
# plot parameters
plt.rcParams['axes.labelsize'] = 20
plt.rcParams['axes.titlesize'] = 25
plt.rcParams['xtick.labelsize'] = 18
plt.rcParams['ytick.labelsize'] = 18
plt.rcParams['axes.edgecolor'] = 'white'
plt.rcParams['axes.facecolor'] = 'white' # or EAEAF2
plt.rcParams['font.size'] = 16

In [None]:
import os
import sys

module_path = os.path.abspath(os.path.join(os.pardir, os.pardir))
if module_path not in sys.path:
    sys.path.append(module_path)

parent_dir = '../../'

from src import recommender as rec
from src import content_rec as cr

## Get the Data

The data used for this project is from GroupLens and is called the MovieLens Dataset.  You can find all the details of this dataset and download the appropriate data files yourself [here](https://grouplens.org/datasets/movielens/latest/).  

Alternatively, you can click [this link](http://files.grouplens.org/datasets/movielens/ml-latest-small.zip) to download the zip file of the data files used in this project (1MB).  This zip file contains 4 csv files:  `movies`, `ratings`, `tags` and `links`.  See the README.md in the [data](../../data) folder for more info on how this data is formatted.  The website above also offers the 'large' dataset (256MB), which was not used in this project; download it from there if you want to explore it.  

The four csv files were downloaded to this repo, where you can find them [here](../../data), labelled `movies.csv`, `links.csv`, `ratings.csv` and `tags.csv`.  If you're following along in this notebook, the cells below will run as we import these csv files using pandas.  Let's get to it!

In [None]:
# load in 4 datasets:
ratings = pd.read_csv(parent_dir + 'data/ratings.csv')

movies = pd.read_csv(parent_dir + 'data/movies.csv')

tags = pd.read_csv(parent_dir + 'data/tags.csv')

links = pd.read_csv(parent_dir + 'data/links.csv')

## Exploratory Data Analysis

It should be noted that each group member performed their own EDA in their own way.  Please refer to each member's individual exploratory notebooks which you can find [here](../exploratory), for more details on individual findings and explorations.  What will be detailed here is a summary of all of our EDA efforts combined.  

We'll start by exploring each dataset and then aggregate as necessary as we go. 

### Ratings Dataset

Let's first start by looking into the `ratings` dataset:

In [None]:
ratings.head()

Check how many unique movies we have:

In [None]:
len(ratings.movieId.unique())

Check how many unique ratings we have to make sure we don't have any weird values:

In [None]:
print(f"Number of distinct rating values: {len(ratings.rating.unique())}")
print(f"Possible rating values:  {ratings.rating.unique()}")

All these numbers are reasonable and expected.  Ratings run from 0.5 to 5 in half-point steps, giving 10 possible values; 0 is not a valid rating. 

Let's check data types and null values:

In [None]:
ratings.info()

We do not appear to have any missing values and all data types seem reasonable.  While we won't use the timestamp column for modelling (this will be dropped), we may want to investigate timeseries information later so we will transform this column to a datetime object:

In [None]:
# timestamps are stored as seconds since the epoch, so the unit must be given explicitly
ratings['timestamp'] = pd.to_datetime(ratings['timestamp'], unit='s')

In [None]:
ratings.head()

As mentioned in the documentation, `timestamp` is seconds since 1970.  
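
As a quick sanity check on the units, the sketch below converts one epoch value to a UTC datetime using only the standard library. The sample value is illustrative rather than quoted from `ratings.csv`. In pandas the equivalent is passing `unit='s'` to `pd.to_datetime`, because by default integers are read as nanoseconds.

```python
from datetime import datetime, timezone

# Illustrative epoch value in seconds (an assumption, not quoted from ratings.csv)
sample_ts = 964982703

# Interpret the integer as seconds since 1970-01-01 00:00:00 UTC
as_datetime = datetime.fromtimestamp(sample_ts, tz=timezone.utc)
print(as_datetime.isoformat())  # 2000-07-30T18:45:03+00:00
```

If converted dates all land in January 1970, the values were most likely read with the wrong unit.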

Let's investigate some more details of the dataframe.  How many users do we have in this dataset?

In [None]:
print(f"Number of users: {len(ratings.userId.unique())}")

What's the average rating?

In [None]:
print(f"Average rating:  {ratings.rating.mean()}")

So the average rating sits above the midpoint of the 0.5 to 5 rating scale (2.75).  Let's have a look at the distribution of the ratings:

In [None]:
plt.subplots(figsize = (10, 8))
plt.hist(ratings.rating, bins = 10, color = '#789698')
plt.title('Distribution of Ratings')
plt.xlabel('Rating')
plt.ylabel('Num. of Ratings')
plt.savefig(parent_dir + 'reports/figures/dist_ratings.png')
plt.show();

We see again here that the average rating is around 3.5 and the data is left-skewed.  This shows us that there aren't many low ratings between 0.5 and 2.  Perhaps this says something about the motivation for people to rate movies - perhaps people don't bother if the movie is bad...

Let's have a look at the average rating per movie and view this distribution:

In [None]:
rated = pd.DataFrame(ratings.groupby(['movieId'])['rating'].mean())

In [None]:
rated.sort_values('rating', ascending = False, inplace = True)

Let's also find the number of ratings for each movie and add it to our new dataframe:

In [None]:
rated['num_rating'] = pd.DataFrame(ratings.groupby(['movieId'])['rating'].count())
rated.head()

In [None]:
fig = plt.subplots(figsize=(10,8))
plt.hist(rated.rating, bins = 10, color = '#789698')
plt.xticks([0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5])
plt.title('Distribution of Mean Ratings')
plt.xlabel('Rating Scale')
plt.ylabel('Number of Movies')
plt.savefig(parent_dir + 'reports/figures/dist_mean_ratings.png')
plt.show();

We see a similar shape here again. 

### Investigating the Longtail Problem:

Something that comes up a lot in recommendation system problems is the long tail problem.  This is where the vast majority of users and/or items have very few ratings associated with them (often just 1), while a small number of items/users have a lot.  Let's first look into the number of ratings per movie:

In [None]:
# group ratings by movie and count the number of ratings per movie
num_ratings = ratings.groupby('movieId').count().drop('userId', axis = 1)

In [None]:
# sort these ratings
sorted_num_ratings = num_ratings.sort_values(by = 'rating', axis = 0, ascending = False)

In [None]:
sorted_num_ratings

In [None]:
plt.subplots(figsize = (10, 8))
# plot the rating counts themselves rather than the movieId index
sns.distplot(sorted_num_ratings['rating'], bins = 500, color = '#789698')
plt.title('Distribution of Number of Ratings per Movie')
plt.xlabel('Num. of Ratings per Movie')
plt.savefig(parent_dir + 'reports/figures/ratings_by_movie.png')
plt.show();

As you can see, we do have a long tail problem here: the majority of movies have fewer than 25 ratings and very few have more than that.  
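
To put a number on the long tail rather than reading it off the plot, one option is to compute the share of movies that fall below a chosen rating count. The sketch below does this in plain Python on made-up data; `toy_ratings` and `share_below` are hypothetical names, not part of the project code.

```python
from collections import Counter

# Toy (userId, movieId) pairs; hypothetical data, not the MovieLens ratings
toy_ratings = [
    (1, 'A'), (2, 'A'), (3, 'A'), (4, 'A'), (5, 'A'),
    (1, 'B'), (2, 'B'), (3, 'B'),
    (1, 'C'),
    (2, 'D'),
    (3, 'E'),
]

def share_below(ratings, threshold):
    """Fraction of movies that received fewer than `threshold` ratings."""
    per_movie = Counter(movie for _, movie in ratings)
    below = sum(1 for n in per_movie.values() if n < threshold)
    return below / len(per_movie)

print(share_below(toy_ratings, threshold=2))  # 0.6
```

On the notebook's data, something like `(num_ratings['rating'] < 25).mean()` should give the same kind of figure.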

Let's now look into the number of ratings per user to investigate this long tail problem further:

In [None]:
movies.head()

In [None]:
users = pd.DataFrame(ratings.groupby(['userId'])['rating'].count())

In [None]:
users.shape

In [None]:
users.sort_values('rating', ascending=False)[:20]

The "top 12" users have each rated over 1000 movies.

In [None]:
users.sort_values('rating', ascending=True)[:75]

On the flip side, around 75 users have rated 25 movies or fewer.

In [None]:
fig = plt.subplots(figsize=(10,8))
plt.hist(users.rating, bins = 200, color = '#789698')
plt.title('Number of Ratings by User')
plt.xlabel('Number of Ratings')
plt.ylabel('Count of Users')
plt.savefig(parent_dir + 'reports/figures/ratings_by_user.png')
plt.show();

Again, we can see the long tail problem playing out here.  This will have to be addressed with regularisation in our modelling.  

Let's now look into the movies dataset:

### Movies Dataset

Let's begin by looking at a preview of our data as always.

In [None]:
movies.head()

And let's inspect the datatypes and null values:

In [None]:
movies.info()

So from here we assume there are 9742 unique movies. But let's check the unique titles:

In [None]:
print(f"Number of unique movie titles:  {len(movies.title.unique())}")

This doesn't agree with the 9742 we saw earlier.  

In [None]:
print(f"Number of unique movie IDs:  {len(movies['movieId'].unique())}")

So there are 9742 unique movieId's but only 9737 unique titles.  This means some movies have 2 different movieIds.  Let's see if we can isolate these movies (there are only 5).  

In [None]:
count_movies = {}
for title in movies['title']:
    count_movies[title] = count_movies.get(title, 0) + 1
len(count_movies)

So let's now see which movies have a count greater than 1:

In [None]:
double_movies = []
for title in count_movies:
    if count_movies[title] > 1:
        print(title, count_movies[title])
        double_movies.append(title)

We've found the duplicates in disguise.  Let's find these in our dataframe to find their movieIds.  

In [None]:
movies[movies['title'].isin(double_movies)]

In [None]:
for title in double_movies:
    print(movies[movies['title'] == title])

I'm going to drop the rows where the genre is only a subset of the duplicate's list of genres.  E.g., I'll drop row 5601 because it only has 'Romance' whereas Romance is included in row 650 of the 'Emma' movie.  

Since there are only 5 rows to drop, I'll manually make a list of their indices to drop them. 

In [None]:
rows_to_drop = [5601, 9468, 4169, 5854, 6932]
movies.drop(rows_to_drop, axis = 0, inplace = True)
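
The indices above were picked by hand. As an alternative, the "drop the row whose genres are a subset of its duplicate's" rule can be written down explicitly. The sketch below is plain Python over made-up rows; the index values and the longer genre string are illustrative assumptions, and `redundant_duplicates` is a hypothetical helper.

```python
# Toy rows mimicking the duplicate-title case; indices and genres are illustrative
toy_movies = [
    {'index': 0, 'title': 'Emma (1996)', 'genres': 'Comedy|Drama|Romance'},
    {'index': 1, 'title': 'Emma (1996)', 'genres': 'Romance'},
    {'index': 2, 'title': 'Heat (1995)', 'genres': 'Action|Crime|Thriller'},
]

def redundant_duplicates(rows):
    """Return indices of rows whose genre set is a strict subset of a same-title row's."""
    to_drop = []
    for row in rows:
        genres = set(row['genres'].split('|'))
        for other in rows:
            if other is row or other['title'] != row['title']:
                continue
            if genres < set(other['genres'].split('|')):  # strict subset
                to_drop.append(row['index'])
                break
    return to_drop

print(redundant_duplicates(toy_movies))  # [1]
```

If neither duplicate's genres are a subset of the other's, this rule keeps both rows, so such cases would still need a manual decision.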

Test that it worked as expected:

In [None]:
count_movies_again = {}
for title in movies['title']:
#     print(movie)
    count_movies_again[title] = count_movies_again.get(title, 0) + 1
len(count_movies_again)

In [None]:
double_movies_again = []
for title in count_movies_again:
    if count_movies_again[title] > 1:
        print(title, count_movies_again[title])
        double_movies_again.append(title)

In [None]:
len(double_movies_again)

Awesome.  We now don't have any doubled up movies.  

Let's investigate how many unique genre combinations we have:

In [None]:
print(f"Unique genre combinations:  {len(movies['genres'].unique())}")

That's a lot of genres - let's get a dictionary containing the count for each genre:

In [None]:
count_genres = {}
for genre in movies['genres']:
    count_genres[genre] = count_genres.get(genre, 0) + 1

So there are 951 unique genre combinations.  Let's see how many of these only have 1 movie classified as this combination of genres.  Perhaps these are 'less common' or more 'out-there' movies.  Or perhaps their genre could be reduced to be more generalisable.  

This is important because we lose information about people who like movies of the same genre: if two people are classified as 'not alike' just because one person's favourite movie has the genre combination 'Adventure|Children|Romance' and the other's has 'Adventure|Children|Romance|IMAX', we lose valuable information about those people.  

Perhaps we'll need to make sure 'genre' is handled appropriately and that our model features include the individual genres that make up each combination. 
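
One way to handle this is to split each combination into its individual genres and compare sets instead of whole strings. The sketch below uses the two combinations from the example above. The Jaccard overlap is just one possible similarity measure, chosen here for illustration.

```python
from collections import Counter

def genre_set(combo):
    """Split a pipe-delimited genre string into a set of individual genres."""
    return set(combo.split('|'))

def jaccard(a, b):
    """Overlap of two genre sets: |intersection| / |union|."""
    return len(a & b) / len(a | b)

first = genre_set('Adventure|Children|Romance')
second = genre_set('Adventure|Children|Romance|IMAX')

print(first == second)         # False: exact matching treats them as different
print(jaccard(first, second))  # 0.75: three shared genres out of four

# Counting individual genres instead of whole combinations
combos = ['Adventure|Children|Romance', 'Adventure|Children|Romance|IMAX', 'Drama']
per_genre = Counter(g for combo in combos for g in combo.split('|'))
print(per_genre['Adventure'])  # 2
```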


In [None]:
only_one = []
for genre in count_genres:
    if count_genres[genre] == 1:
#         print(genre, count_genres[genre])
        only_one.append(genre)
print(f"Number of genres with only 1 movie of this genre combination:  {len(only_one)}")

Let's look into the most common genres and find the top ten genre combinations (that is, the combinations with the most movies listed under them).

In [None]:
popular_genre = []
for genre in count_genres:
    if count_genres[genre] > 100:
        print(genre, count_genres[genre])
        popular_genre.append(genre)
print(f"\nNumber of genre combinations with more than 100 movies:  {len(popular_genre)}")

In [None]:
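# sort by the number of movies per combination (x[1] would only compare the second character of each name)
sorted_pop_genres = sorted(popular_genre, key=lambda g: count_genres[g], reverse=True)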

In [None]:
sorted_pop_genres[:10]

From this list, we can see that the top movies pretty much boil down to 3 genres:
- Drama
- Crime/Thriller
- Comedy 

We'll need to make sure we're filtering by unique combinations and maybe we can extract the single-use genre combinations and get rid of their unique extra genre?  These are things we will need to consider for our content-based models.

Let's visualise these top 10 genres:

In [None]:
genre = pd.DataFrame(movies.groupby('genres')['title'].count())

In [None]:
most_rated_genre = genre.sort_values('title', ascending=False)[:10]

In [None]:
most_rated_genre

In [None]:
fig = plt.subplots(figsize=(10, 8))
plt.barh(most_rated_genre.index, most_rated_genre.title, color = '#789698')
plt.title('10 Most Common Genre Combinations')
plt.xlabel('Number of Movies')
plt.ylabel('Genre')
plt.savefig(parent_dir + 'reports/figures/top_10_genres.png')
plt.show();

So we can see here that `Drama` is the most common genre combination by number of movies, second is `Comedy` and third `Comedy|Drama`.  This alone suggests that these combinations could be aggregated somehow, which should be considered in future investigations.  

For now, I'll write out the df with those duplicate rows dropped:

In [None]:
# to_csv returns None when given a path, so there is nothing to assign
movies.to_csv(parent_dir + 'data/mod_movies')

### Time Series with `movies` and `ratings` dataframe:

As mentioned earlier, we kept the timestamp column so we could investigate information about the timing of this data.  Let's explore that now by combining the ratings and movies dataframes.

In [None]:
movies['release_year'] = movies.title.apply(lambda x: x.strip()[-5:-1])
movies['release_year'] = pd.to_numeric(movies['release_year'], errors='coerce')
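
The slice above assumes every title ends in `(YYYY)`. For titles that do not, it picks up arbitrary characters, which `errors='coerce'` then turns into missing values. A slightly stricter alternative is a regular expression anchored at the end of the title. The sketch below is one possible version; `release_year` is a hypothetical helper and the title without a year is made up.

```python
import re

# Matches a four-digit year in parentheses at the very end of the title
YEAR_AT_END = re.compile(r'\((\d{4})\)\s*$')

def release_year(title):
    """Return the trailing '(YYYY)' year as an int, or None if the title has none."""
    match = YEAR_AT_END.search(title)
    return int(match.group(1)) if match else None

print(release_year('Thor (2011)'))           # 2011
print(release_year('Thor (2011) '))          # 2011 (trailing space tolerated)
print(release_year('A Title With No Year'))  # None
```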

In [None]:
movies_ratings_joined = ratings.join(movies.set_index('movieId'), on='movieId').dropna()

In [None]:
movies_ratings_joined.head()

In [None]:
grouped = movies_ratings_joined.groupby('release_year')['rating'].mean()

In [None]:
grouproll = grouped.rolling(10).mean()
grouproll

In [None]:
fig, ax = plt.subplots(figsize=(10, 8))
ax.bar(grouproll.loc[1940:].index, grouproll.loc[1940:], linewidth=2, color='#789698')
plt.title('Ratings by Release Year')
ax.set_ylim([3.2,4])
ax.set_ylabel('Rating')
ax.set_xlabel('Release Year')
ax.set_xlim([1941, 2018])
plt.savefig(parent_dir + 'reports/figures/ratings_by_release_date.png')
plt.show();

From this graph we can see that movies released before 1990 tend to have a higher average rating.  From roughly 1990 onwards, the average movie rating trends downwards towards the average rating of the dataset (3.5).  Since the ratings themselves have all been made since 1993, one possible explanation is that people who watched and rated older movies did so because those movies had already been recommended to them as good.  Movies released from 1993 onwards, by contrast, are more likely to have been watched and rated on people's own initiative rather than through personal recommendations.  This is only a hypothesis, but it would be consistent with what we see in the data here.   
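
For reference, the smoothing behind this plot is a trailing rolling mean. The sketch below reproduces the idea in plain Python with a window of 3 on made-up values; the notebook itself uses `rolling(10).mean()` from pandas, and `rolling_mean` here is a hypothetical stand-in.

```python
def rolling_mean(values, window):
    """Trailing mean over `window` points; None until a full window is available."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            chunk = values[i + 1 - window:i + 1]
            out.append(sum(chunk) / window)
    return out

# Hypothetical yearly mean ratings, for illustration only
yearly_means = [3.0, 4.0, 5.0, 3.0, 1.0]
print(rolling_mean(yearly_means, window=3))  # [None, None, 4.0, 4.0, 3.0]
```

Two things follow from this. The first `window - 1` points have no value, which is one reason to start the plot well after the earliest release year. The window also counts rows rather than calendar years, so a gap in release years makes a window span more than ten years.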

Since we didn't end up using the timestamp, let's drop it from our dataframe:

In [None]:
ratings.drop('timestamp', axis = 1, inplace = True)

Write this df out so it's accessible later:

In [None]:
# to_csv returns None when given a path, so there is nothing to assign
ratings.to_csv(parent_dir + 'data/mod_ratings')

### Links Dataset

Let's get acquainted with the links dataset:

In [None]:
links.head()

In [None]:
links.info()

It looks like there are going to be duplicates again given that there are the same number of `movieId`'s that the movies df had... so let's see if there are duplicate `imdbIds`:

In [None]:
len(links.movieId.unique())

In [None]:
len(links.imdbId.unique())

Ok no, doesn't look like there are duplicates.

Check the na's for tmdbId:

In [None]:
len(links.tmdbId.unique())

In [None]:
links.isna().sum()

In [None]:
links[links['tmdbId'].isna()]

I think it's fine that we're missing tmdbIds since they shouldn't add much value to our modelling.  

Overall, the `links` dataset will be useful for webscraping if we want to get images from IMDB for the movies to add to deployment methods but it won't add any value to our models.  

### Tags Dataset

Let's get acquainted with the tags dataset:

In [None]:
tags.head()

In [None]:
tags.info()

No null rows.

In [None]:
print(f"Number of users who provided tags:  {len(tags.userId.unique())}")

In [None]:
print(f"Number of unique movies with tags:  {len(tags.movieId.unique())}")

In [None]:
print(f"Number of unique tags:  {len(tags.tag.unique())}")

So while there are 3683 entries, there are only 1589 unique tags so we do have some common tags - might be worth finding the top 10-20 most common tags perhaps?

Only 1572 movies have been tagged so if we join these dfs, most will have na values (which is fine).

Only 58 users actually added tags.  This is quite a small subset of our overall users.

We probably don't need the timestamp column for this dataset.

In [None]:
tags.drop('timestamp', axis = 1, inplace = True)

Let's look into the most common tags and find the top 20 tags.

In [None]:
count_tags = {}
for tag in tags['tag']:
    count_tags[tag] = count_tags.get(tag, 0) + 1

In [None]:
popular_tag = []
for tag in count_tags:
    if count_tags[tag] > 0:
        popular_tag.append(tag)

In [None]:
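# sort by how often each tag was used (x[1] would only compare the second character of each tag)
sorted_pop_tags = sorted(popular_tag, key=lambda t: count_tags[t], reverse=True)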

In [None]:
sorted_pop_tags[:20]

There are a lot of double-ups in the tags, such as tags that differ only in capitalisation ('Ryan Reynolds' vs. 'ryan reynolds'), as well as closely related tags such as 'myth' and 'mythology', or even 'mystery'.  

Perhaps we could perform some NLP pre-processing on this data to make the tags more consistent.  This might not be completely necessary, since only a small part of the data is tagged (only 3000 amongst 100k movie ratings), but it is something we could experiment with.  
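
As a first step in that direction, simply lower-casing and trimming the tags already merges the capitalisation variants. The sketch below shows this on a handful of made-up tags; `raw_tags` and `normalise` are hypothetical names.

```python
from collections import Counter

# Hypothetical raw tags showing the case and whitespace variants described above
raw_tags = ['Ryan Reynolds', 'ryan reynolds', 'RYAN REYNOLDS',
            'myth', 'Myth ', 'mythology', 'mystery']

def normalise(tag):
    """Lower-case and trim a tag so trivially different spellings collapse."""
    return tag.strip().lower()

counts = Counter(normalise(t) for t in raw_tags)
print(counts.most_common(2))  # [('ryan reynolds', 3), ('myth', 2)]
```

Note that this does not merge 'myth' with 'mythology'. Grouping related words would need something like the stemmer or lemmatizer imported at the top of the notebook, and the results would need checking by eye.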

I won't do any of the NLP processing now, but we know it exists and I will export the csv without the timestamp.

In [None]:
# to_csv returns None when given a path, so there is nothing to assign
tags.to_csv(parent_dir + 'data/mod_tags')

## Join the dataframes:

The `ratings` df has over 100k rows, and the `movies` df has just under 10k rows.  So let's first try joining the `ratings` and `movies` dfs together using `movieId` as the key.  We will left join on `ratings`.

First, we'll check the shape of both dfs to be able to compare the joined result.  

In [None]:
ratings.shape

In [None]:
movies.shape

In [None]:
movie_ratings = ratings.set_index('movieId').join(movies.set_index('movieId'))

In [None]:
movie_ratings.head()

This is looking like our desired result.  Let's check the shape:

In [None]:
movie_ratings.shape

So we have not lost rows - this is what we expect. 

In [None]:
movie_ratings.info()

From here we can see that there are 20 rows for which we do not know the title or genre.  Let's see which movies these are:

In [None]:
movie_ratings[movie_ratings['title'].isna()]

These 20 rows without a title or genre actually come from only 5 movies.  Five duplicate rows were also dropped from `movies` earlier, so it is worth checking whether these are the same `movieId`s; if they are, their ratings could be remapped to the surviving duplicate.  Failing that, we could pair this with the links df, look up the title and genre on IMDB, and add them manually since there are only 5 records.  This is probably worth doing, since movie 6003 has 15 ratings that we cannot attach a name to (and hence cannot recommend by name in our app). 
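
The behaviour seen here is what a left join does: every rating row is kept, and ratings whose `movieId` has no match simply get an empty title. The sketch below imitates this in plain Python on made-up data (all names and values are hypothetical) and counts the affected ratings per movie.

```python
from collections import Counter

# Hypothetical stand-ins for the two dataframes
toy_movies = {1: 'Toy Story (1995)', 2: 'Jumanji (1995)'}   # movieId -> title
toy_ratings = [(10, 1, 4.0), (11, 1, 5.0), (10, 2, 3.0),     # (userId, movieId, rating)
               (12, 99, 4.5), (13, 99, 2.0)]

# Left join: every rating row is kept, and unmatched movieIds get title None
joined = [(user, movie, rating, toy_movies.get(movie))
          for user, movie, rating in toy_ratings]

# Count rating rows per movieId that found no title
missing = Counter(movie for _, movie, _, title in joined if title is None)

print(len(joined) == len(toy_ratings))  # True: no rows lost
print(dict(missing))                    # {99: 2}
```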

Let's now join the links df with this df, again with the movie id as the key:

In [None]:
links.shape

In [None]:
movie_rating_links = movie_ratings.join(links.set_index('movieId'))

In [None]:
movie_rating_links.head()

In [None]:
movie_rating_links.shape

No info loss! 

Let's have a look at those movies that didn't have titles:

### Data Summary 

Overall, our datasets are pretty clean.  There are definitely areas that will need to be addressed in our modelling such as adding regularisation to account for the long tail problem as well as doing some NLP processing to deal with the genres data for doing content-based models.  

## Models

### Collaborative Filtering Model

The key idea behind collaborative filtering is that similar users share similar interests and that users tend to like items that are similar to one another. We plan to use this for our recommendation system: a user will rate 5 movies, and that new data will be used to generate recommendations based on the ratings from users in our dataset. 



 1. **Determine the model to use**
   - We performed a train-test split on our data, then compared several models with their default settings to see which returned the best (lowest) RMSE. In this test, the best-performing model was `SVDpp`, the SVD++ algorithm, an extension of SVD that takes implicit ratings into account as well as explicit ones.
   
   
 2. **Iterating and tuning the model**
  - After the model was chosen, we ran several iterations, tuning the hyperparameters each time to see if we could improve the score.
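
Since RMSE is the yardstick for every comparison that follows, here is a minimal sketch of how it is computed. The numbers are made up, and in the notebook the value comes from `accuracy.rmse` in Surprise rather than from this helper.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating sequences."""
    if len(predicted) != len(actual):
        raise ValueError('sequences must have the same length')
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical predictions: three exact hits and one miss by two points
print(rmse([4.0, 3.5, 2.0, 5.0], [4.0, 3.5, 2.0, 3.0]))  # 1.0
```

The mean absolute error of the same predictions is only 0.5, which illustrates that RMSE penalises a few large misses more heavily than many small ones.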
  


### First Model

Read in the joined dataframe

In [None]:
df = pd.read_csv(parent_dir + 'data/joined_dfs_lc')

In [None]:
# instantiate the Reader with the rating scale (ratings run from 0.5 to 5)
reader = Reader(rating_scale=(0.5, 5))

# Load the dataset 
data = Dataset.load_from_df(df[['userId', 'movieId', 'rating']], reader)

# sample random trainset and testset
# test set is made of 25% of the ratings.

trainset, testset = train_test_split(data, test_size=.25, random_state=15)


#### Find the best algorithm to use

Research led me to an article by Susan Li (see references), who provided a method for testing a variety of algorithms at once to determine the best option.

This will instantiate 11 different models, cross validate the results, then save them all in a dataframe called `benchmark` to compare the RMSE.

I'm going to iterate over all the algorithms to see which one returns the best RMSE value.
This will take a while to run...

In [None]:
# thank you to Susan Li for this helpful code
benchmark = []
# Iterate over all algorithms
for algorithm in [SVD(), SVDpp(), SlopeOne(), NMF(), NormalPredictor(), 
                  KNNBaseline(), KNNBasic(), KNNWithMeans(), KNNWithZScore(), 
                  BaselineOnly(), CoClustering()]:
    # Perform cross validation
    results = cross_validate(algorithm, data, measures=['RMSE'], cv=3, verbose=False)
    
    # Get results & append algorithm name
    tmp = pd.DataFrame.from_dict(results).mean(axis=0)
    # Series.append was removed in pandas 2.0, so use pd.concat instead
    tmp = pd.concat([tmp, pd.Series([str(algorithm).split(' ')[0].split('.')[-1]], index=['Algorithm'])])
    benchmark.append(tmp)
    
pd.DataFrame(benchmark).set_index('Algorithm').sort_values('test_rmse')    

RESULT: SVDpp has the lowest RMSE. This will be the model we use.

The SVD++ algorithm is an extension of SVD that takes into account implicit ratings.
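
For readers who want the formula, the prediction rule documented for Surprise's `SVDpp` is sketched below. Here $\mu$ is the global mean rating, $b_u$ and $b_i$ are the user and item biases, $p_u$ and $q_i$ are the user and item factor vectors, $I_u$ is the set of items rated by user $u$, and $y_j$ is an extra item factor capturing the implicit signal that $u$ rated item $j$ at all, regardless of the value:

$$
\hat{r}_{ui} = \mu + b_u + b_i + q_i^{T}\left(p_u + |I_u|^{-\frac{1}{2}} \sum_{j \in I_u} y_j\right)
$$

Dropping the sum over $y_j$ gives back the plain SVD prediction $\hat{r}_{ui} = \mu + b_u + b_i + q_i^{T} p_u$.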

#### FSM
Running SVDpp at default settings and cross-validating

In [None]:
# Let's pick the algorithm and run the first model on its own
algo = SVDpp()

# Train the algorithm on the trainset, and predict ratings for the testset
algo.fit(trainset)
predictions = algo.test(testset)

# Then compute RMSE
accuracy.rmse(predictions)

# Run 5-fold cross-validation and print results
cross_validate(algo, data, measures=['RMSE'], cv=5, verbose=True)


#### Iteration and hypertuning
Adjusted `n_factors` to 50 and regularization (`reg_all`) to 0.05.

In [None]:
# Let's tune
algo3 = SVDpp(n_factors=50, reg_all=0.05, verbose=False)

# Train the algorithm on the trainset, and predict ratings for the testset
algo3.fit(trainset)
predictions = algo3.test(testset)

# Then compute RMSE
accuracy.rmse(predictions)

Slight improvement in the RMSE score. The final model used was the 5th iteration. 

#### Final Model
Adding an adjusted learning rate (`lr_all`) of 0.01.

In [None]:
# Let's tune
algo5 = SVDpp(n_factors=50, reg_all=0.05, lr_all=0.01, verbose=False)

# Train the algorithm on the trainset, and predict ratings for the testset
algo5.fit(trainset)
predictions = algo5.test(testset)

# Then compute RMSE
accuracy.rmse(predictions)

# Run 5-fold cross-validation and print results
cross_validate(algo5, data, measures=['RMSE'], cv=5, verbose=True)

The final model returned an RMSE score of approximately 0.855. 

### Content-Based Model

The next type of recommendation system we wanted to explore was a content-based version. Our previous model would look at other users that have similar interests, and it would recommend other titles that they have liked. This system goes the other direction and it takes movies that you like, and, having learned some information about the film, recommends titles that are similar to it.

To do this, we gathered descriptions and genre tags for each film, and then utilized some of Python's natural language processing tools to turn this text information into numerical information. We used the following process:

 1. **TF-IDF Vectorization**
   - Short for Term Frequency - Inverse Document Frequency, this is a method for assigning a value to each word based on how often it appears in the documents. The value takes into account both the number of times a word appears in a single description and how commonly it appears across all descriptions. A word gets a high TF-IDF score in a description if it appears many times there but is relatively uncommon across all descriptions. This is partly meant to filter out words that are common to movies in general.
   
   
 2. **Cosine Similarity**
  - Once each film is represented by a many-dimensional vector, a common way to measure how 'similar' two films are is to calculate the cosine of the angle between their vectors: the closer it is to 1, the more similar the films.
  
  
 3. **Sorting**
  - Now that we have a measure of similarity between every pair of movies, we can take in a single movie, sort the rest of the movies by how similar they are to our chosen film, and then return the top 10 most similar films.
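
The three steps above can be sketched end to end in plain Python. This is a simplified illustration, not the code in `content_rec.py`: it uses a basic `tf * log(N / df)` weighting rather than scikit-learn's smoothed and normalised variant, and the film titles and descriptions are made up.

```python
import math
from collections import Counter

# Hypothetical one-line descriptions; the real model uses longer text per film
docs = {
    'Space Quest': 'astronaut crew explores distant planet',
    'Planet Fall': 'astronaut stranded on hostile planet',
    'Wedding Bells': 'couple plans chaotic wedding',
}

def tfidf_vectors(texts):
    """Map each title to {term: tf * idf}, with idf = log(N / document frequency)."""
    tokens = {title: text.lower().split() for title, text in texts.items()}
    n_docs = len(tokens)
    doc_freq = Counter(term for words in tokens.values() for term in set(words))
    vectors = {}
    for title, words in tokens.items():
        counts = Counter(words)
        vectors[title] = {term: (count / len(words)) * math.log(n_docs / doc_freq[term])
                          for term, count in counts.items()}
    return vectors

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def recommend(title, vectors, top_n=10):
    """Other titles sorted from most to least similar to `title`."""
    scores = [(other, cosine(vectors[title], vec))
              for other, vec in vectors.items() if other != title]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]

vectors = tfidf_vectors(docs)
ranked = recommend('Space Quest', vectors)
print([name for name, _ in ranked])  # ['Planet Fall', 'Wedding Bells']
```

The two space films share the terms 'astronaut' and 'planet', so they score above zero, while the wedding film shares no terms and scores exactly 0.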
  
  
We have put together a Python class to demonstrate our content-based recommender; the source code can be found in the src folder as [content_rec.py](../../src/content_rec.py). Below we initialize the `ContentRecommender` object and provide some examples of recommendations.

In [None]:
content = cr.ContentRecommender()

In [None]:
content.recommend('Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)')

In [None]:
content.recommend('Thor (2011)')

In [None]:
content.recommend('Journey 2: The Mysterious Island (2012)')

If you would like to see some random recommendations, we have included the following method to generate suggestions based on random titles.

In [None]:
random_film = content.random_title()
content.recommend(random_film)

Our system seems to be working out well! We could further improve the recommendations by including more descriptive information. Cast and crew names, for example, might be useful additions.



## Final Results

We had good success with both the collaborative and content-based recommendation systems, as well as our Flask deployment. Our final collaborative model ended up with an RMSE of approximately 0.855, which is not bad on a 5-point rating scale. Our content-based model shows good variety while still picking movies that are similar in genre and description.

## Future Work

A good place to direct our efforts in the future would be speeding up our model training process so that our deployed app can respond faster. We should also consider combining parts of our content-based and collaborative systems into a hybrid recommender that makes even better recommendations.