## Introduction

People are usually open to recommendations, whether for restaurants, movies, or places to visit. A good recommendation system represents a large business opportunity, because customers are more likely to increase their spending when recommendations are done right. Recommendation systems are prevalent in daily life: item recommendations when shopping on Amazon, movie recommendations on Netflix and on platforms like Rotten Tomatoes, and song recommendations on music streaming services like Spotify.

<img src="compiled.png"> 
*Recommendation systems are commonly used in everyday services.*

## Motivation

As consumers ourselves, we are intrigued by the different approaches used in recommendation systems. The often-cited \$1 million Netflix Prize is an example of the relentless drive towards better recommendation systems for customers. We would like to examine user reviews and ratings and build a recommendation system that can predict user preferences for unrated items. We are particularly interested in movie reviews because the available data are rich and contain both numeric ratings and textual reviews.

## Goal

As such, our goal is to develop a recommendation system to predict ratings that users will assign to movies based on past rating history, and subsequently recommend movies to these users.

## Literature Review

From our research, the algorithms used in recommender systems fall into several categories: 1) Collaborative Filtering (examples: Nearest Neighbors, Matrix Factorization, Restricted Boltzmann Machines), 2) Content-aware Recommendations (examples: Tensor Factorization, Factorization Machines), 3) Deep Learning, and 4) Hybrid approaches.

According to this step-by-step [guide](http://www.salemmarafi.com/) by S. Marafi, item-based filtering in Collaborative Filtering (CF) measures the similarity between items based on their consumption histories. Once the item similarity matrix from item-based filtering is available, user-based filtering can be applied, which also takes the users' consumption histories into account. However, on very large datasets CF might not be viable because of the high time complexity of generating the similarity matrix.
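As a rough illustration of the item-based step, the sketch below builds an item-item cosine similarity matrix from a small users x items rating matrix and scores items for one user. The function names (`item_similarity`, `score_items`), the toy matrix, and the convention that 0 means "unrated" are assumptions made for this example, not details taken from the guide.

```python
import numpy as np

def item_similarity(ratings: np.ndarray) -> np.ndarray:
    """Cosine similarity between the item columns of a users x items matrix."""
    norms = np.linalg.norm(ratings, axis=0)
    norms[norms == 0] = 1.0  # avoid division by zero for items with no ratings
    normalized = ratings / norms
    return normalized.T @ normalized  # shape: items x items

def score_items(ratings: np.ndarray, sim: np.ndarray, user: int) -> np.ndarray:
    """Score every item as a similarity-weighted average of the user's ratings."""
    user_ratings = ratings[user]
    rated = user_ratings > 0  # assumption: 0 marks an unrated item
    weights = np.abs(sim[:, rated]).sum(axis=1)
    weights[weights == 0] = 1.0
    return (sim[:, rated] @ user_ratings[rated]) / weights

# Toy data (hypothetical): 4 users, 4 items
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

sim = item_similarity(ratings)
scores = score_items(ratings, sim, user=0)
print(np.round(sim, 3))
print(np.round(scores, 3))
```

The similarity matrix has one entry per pair of items, and each entry needs a pass over all users, which is where the cost on large datasets comes from.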

This [article](www.analyticsvidhya.com/blog/2015/08/beginnersguidelearncontent) introduces content-based recommendation systems. The goal of a content-based recommendation system is to determine the relative importance of an item. Term Frequency (TF) is the frequency of a word in a document. Inverse Document Frequency (IDF) is the inverse of the document frequency across the whole corpus of documents. TF-IDF weighting is used to negate the effect of high-frequency words in determining the importance of an item (document).
The Vector Space Model is often used to compute the proximity between items based on the angle between their vectors. Each item is stored as a vector of its attributes (which are also vectors) in an n-dimensional space. The vectors are normalized and their sum-product is taken, which gives the cosine of the angle between them. As the cosine increases, the angle decreases, which signifies greater similarity.
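To make the TF-IDF and cosine steps concrete, here is a minimal pure-Python sketch. It assumes whitespace tokenization and the common `log(N / DF)` form of IDF, which may differ from the exact variant used in the article. The function names and the three toy descriptions are made up for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, using TF * log(N / DF)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical item descriptions
docs = [
    "space adventure with aliens",
    "space adventure with robots",
    "romantic comedy in paris",
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]))  # shares terms with the first description
print(cosine(vecs[0], vecs[2]))  # shares no terms with the first description
```

Dividing the dot product by both norms is equivalent to normalizing the vectors first and then taking their sum-product.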

This [article](http://netflixprize.com/assets/GrandPrize2009_BPC_BellKor.pdf) on the BellKor solution to the Netflix Grand Prize offers a glimpse into the winning entry in the Netflix competition. In addition to CF, it models the temporal dynamics of baseline predictors (predictors that capture effects associated with either users or movies, independent of their interaction).

Another effect captured by the baseline predictors is the changing scale of user ratings over time.
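For intuition, a static baseline predictor can be written as the global mean plus a user offset plus a movie offset. The sketch below estimates those offsets with simple averages. It is a deliberate simplification: it has no regularization and no time dependence, whereas the BellKor paper lets these offsets vary over time. The function names and toy ratings are hypothetical.

```python
from collections import defaultdict

def fit_baseline(ratings):
    """ratings: list of (user, item, rating). Returns (mu, user_bias, item_bias)."""
    mu = sum(r for _, _, r in ratings) / len(ratings)

    # Item bias: average deviation of an item's ratings from the global mean
    item_dev = defaultdict(list)
    for _, item, r in ratings:
        item_dev[item].append(r - mu)
    item_bias = {i: sum(d) / len(d) for i, d in item_dev.items()}

    # User bias: average deviation left over after removing the item bias
    user_dev = defaultdict(list)
    for user, item, r in ratings:
        user_dev[user].append(r - mu - item_bias[item])
    user_bias = {u: sum(d) / len(d) for u, d in user_dev.items()}
    return mu, user_bias, item_bias

def predict_baseline(mu, user_bias, item_bias, user, item):
    """Unknown users or items fall back to a bias of zero."""
    return mu + user_bias.get(user, 0.0) + item_bias.get(item, 0.0)

# Toy ratings (hypothetical)
toy = [("a", "x", 5), ("a", "y", 3), ("b", "x", 4), ("b", "y", 2)]
mu, user_bias, item_bias = fit_baseline(toy)
print(predict_baseline(mu, user_bias, item_bias, "a", "x"))
```

A time-aware version would replace the constant offsets with functions of the rating date, for example one offset per time bin.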

The paper highlights that one important metric, frequency (e.g. the number of ratings a user gives on a specific day), is more helpful when used with movie-related parameters than with user-related ones. In the case of bulk ratings, made long after watching, only users who feel positively about an all-time favourite movie will mark it as a favourite, while those who disliked it will not mention it. The same holds for not-so-good movies.

The paper also discusses sophisticated blending schemes, which combine multiple predictors into a single final solution, as a key to very accurate results. Blending works on many predictors at once rather than improving one at a time, and it handles skewed variables well without transformations. Clustering similar users or movies also proves beneficial.
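The simplest form of blending is a linear combination of predictors whose weights are fitted on held-out ratings. The sketch below shows only that simplest form, using least squares. It is not the blending scheme used in the BellKor solution, and the function names and toy numbers are assumptions for illustration.

```python
import numpy as np

def fit_blend_weights(predictions: np.ndarray, targets: np.ndarray) -> np.ndarray:
    """predictions: (n_samples, n_models) held-out predictions, one column per model."""
    weights, *_ = np.linalg.lstsq(predictions, targets, rcond=None)
    return weights

def blend(predictions: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Combine the per-model predictions into a single prediction per sample."""
    return predictions @ weights

# Toy example (hypothetical): two models whose errors cancel each other out
targets = np.array([4.0, 3.0, 5.0, 2.0, 1.0])
noise = np.array([0.5, -0.5, 0.5, -0.5, 0.5])
predictions = np.column_stack([targets + noise, targets - noise])

weights = fit_blend_weights(predictions, targets)
blended = blend(predictions, weights)
print(weights)
print(blended)
```

In this constructed case each model alone is off by 0.5 on every rating, while the blend recovers the targets, which is the effect blending aims for when the predictors make different errors.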


## Data Exploration

<img src="combinedwordcloud.png"> 
Word frequencies in the textual reviews were inspected and grouped by rating. The leftmost panel is for ratings of 1, and the rating increases to the right, up to 5.

We see that highly rated reviews contain words with highly positive sentiment (great, good), while low-rated reviews contain words with mostly negative sentiment (bad, boring, disappointing, worst).
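A word-frequency inspection of this kind can be sketched as follows. The tokenization rule, the small stopword list, and the four toy reviews are assumptions for this example and are not taken from the dataset or from the code that produced the word clouds.

```python
import re
from collections import Counter, defaultdict

# Assumption: a tiny illustrative stopword list; a real one would be much longer
STOPWORDS = frozenset({"the", "a", "was", "is", "and", "it", "this"})

def top_words_by_rating(reviews, n=3):
    """reviews: iterable of (rating, text). Returns {rating: [top n words]}."""
    counts = defaultdict(Counter)
    for rating, text in reviews:
        words = re.findall(r"[a-z']+", text.lower())
        counts[rating].update(w for w in words if w not in STOPWORDS)
    return {
        rating: [word for word, _ in counter.most_common(n)]
        for rating, counter in sorted(counts.items())
    }

# Made-up reviews for illustration only
toy_reviews = [
    (5, "Great movie, great cast"),
    (5, "A great and good story"),
    (1, "Boring and bad, the worst"),
    (1, "Bad acting, boring plot"),
]
top = top_words_by_rating(toy_reviews)
print(top)
```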


## Models Overview

Collaborative filtering is implemented first, following the steps outlined in the first article mentioned in the Literature Review. Matrix factorization is also implemented and compared against the CF model.
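For readers unfamiliar with matrix factorization, the sketch below learns user and movie factor vectors with plain stochastic gradient descent, so that their dot product approximates the observed ratings. The hyperparameters, function names, and toy ratings are assumptions chosen for illustration and do not describe the model actually trained in this project.

```python
import numpy as np

def factorize(ratings, n_factors=2, lr=0.01, reg=0.02, epochs=200, seed=0):
    """ratings: list of (user_idx, item_idx, rating). Returns factor matrices (P, Q)."""
    rng = np.random.default_rng(seed)
    n_users = max(u for u, _, _ in ratings) + 1
    n_items = max(i for _, i, _ in ratings) + 1
    P = rng.normal(scale=0.1, size=(n_users, n_factors))  # user factors
    Q = rng.normal(scale=0.1, size=(n_items, n_factors))  # item factors
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - P[u] @ Q[i]
            p_old = P[u].copy()  # keep the old user vector for the item update
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * p_old - reg * Q[i])
    return P, Q

def predict(P, Q, user, item):
    """Predicted rating is the dot product of the user and item factors."""
    return float(P[user] @ Q[item])

# Toy ratings (hypothetical): 3 users, 3 movies
toy_ratings = [
    (0, 0, 5.0), (0, 1, 3.0),
    (1, 0, 4.0), (1, 2, 1.0),
    (2, 1, 2.0), (2, 2, 5.0),
]
P, Q = factorize(toy_ratings)
print(predict(P, Q, 0, 2))  # estimate for a movie that user 0 has not rated
```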

Additional movie features are extracted via the Amazon API so that the movie similarity matrix can be built from either movie ratings or movie attributes. The two resulting movie similarity matrices are then used in the CF model, and their performance is compared.

The selected performance metric is RMSE (Sonu, please confirm).
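If RMSE (root mean squared error) is the metric used, it can be computed as the square root of the mean squared difference between predicted and actual ratings. The helper below is a minimal sketch, and the name `rmse` is chosen here for illustration.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences of ratings."""
    if len(predicted) != len(actual) or len(actual) == 0:
        raise ValueError("predicted and actual must be non-empty and the same length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

print(rmse([3, 4, 5], [3, 4, 5]))  # perfect predictions
print(rmse([2, 4], [4, 4]))        # one prediction off by 2
```

Lower values are better, and because the errors are squared before averaging, large misses are penalized more heavily than small ones.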

Sonu: please help to explain your thoughts here - also do remember to cite them.
<img src="Image1.png"> 
<img src="Image2.png"> 
<img src="Image3.png"> 

### References

1. S. Marafi, "Python Collaborative Filtering Example", www.salemmarafi.com/code/collaborativefilteringwithpython/

2. Analytics Vidhya, "High Level Overview of Content Based Recommendations", www.analyticsvidhya.com/blog/2015/08/beginnersguidelearncontentbasedrecommendersystems/

3. Y. Koren, "Netflix Prize", http://netflixprize.com/assets/GrandPrize2009_BPC_BellKor.pdf