### Business Understanding

Our dataset comes from a GitHub gist that collects datasets for recommendation and ratings.  It can be found here: https://gist.github.com/entaroadun/1653794.  The dataset was created for collaborative filtering on a Czech dating site.  It is technically two datasets: one contains a userID, the userID of the person they rated, and the rating itself; the other contains the gender information for each userID.  

A good algorithm for recommender systems should have high precision and recall with a low RMSE.  For our dataset we will compare four recommender models: a popularity model, a ranked factorization model, an item-item matrix model, and a standard matrix factorization model.  We will measure each model's effectiveness by comparing its precision/recall and RMSE values, and we will also run a model comparison to compare the models directly against one another.  The popularity model will act as the baseline for the other three models.  
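
As a rough illustration of the metrics named above, the following is a minimal sketch in plain Python.  The function names `rmse` and `precision_recall_at_k` and the toy inputs are our own assumptions for illustration; they are not the GraphLab Create evaluation API.

```python
import math


def rmse(actual, predicted):
    """Root-mean-square error between two equal-length lists of ratings."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall of the top-k recommendations for one user.

    recommended: ranked list of recommended item IDs (best first).
    relevant:    collection of item IDs the user actually liked.
    """
    top_k = recommended[:k]
    relevant = set(relevant)
    hits = len(set(top_k) & relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical toy data: held-out ratings versus model predictions.
print(rmse([10, 7, 3], [9, 7, 5]))
# Hypothetical ranked recommendations versus the profiles a user rated highly.
print(precision_recall_at_k([10, 20, 30, 40], {20, 40, 50}, k=2))
```

A lower RMSE means predicted ratings are closer to the observed ones, while precision and recall measure how many of the top-ranked recommendations were actually relevant.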
 
Recommender systems normally use a set of items, or users and items, along with optional features, as part of a linear model to predict ratings or calculate similarities between items. Our training set contains a user ID, the ID of another user rated by that user, and a rating or score. This is an unusual situation that is not common in recommenders for music or movies: the "items" being rated are themselves users, so we have essentially created a user-user matrix. The relationship between the users doing the rating and the users being rated suggests that an item-item (here, user-user) recommender may be appropriate. We will investigate multiple recommender models to determine which works best.

For the stakeholder, the Czech dating site, the item-item matrix model should work well because it finds recommendations for users based on the ratings that they give to other users.  While there are not many side features that we can include in the model, we assume that users rate other users based on the information contained in their profiles.  If this is true, the scores they give should be enough to support a recommender model based only on ratings.  The dating site's goal is most likely to present the best possible matches to its users, so a recommender model that produces more successful matches would help achieve that goal.


## Item-Item Matrix Model

The item-item recommender model, also known as item-based collaborative filtering, is an algorithm that compares pairs of items and determines the similarity between them.  These similarities can then be used to recommend item-item or user-item pairs in the future.  In GraphLab Create there are three different similarity measures that can be used in the item similarity recommender: Jaccard, Cosine, and Pearson.  A brief summary of each is below. 

#### Jaccard Similarity

The Jaccard similarity (default for GraphLab) measures the similarity between two items and is calculated with the following equation:

$$\mbox{JS}(i,j)
= \frac{|U_i \cap U_j|}{|U_i \cup U_j|}$$

Here $U_i$ is the set of users who rated item $i$.  The Jaccard similarity is best used when you only care whether or not an item has been rated, since it does not take the score itself into account.
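
The Jaccard formula above can be sketched in a few lines of plain Python.  This is only an illustration: the function name `jaccard_similarity` and the toy user IDs are assumptions of ours, and the code is not how GraphLab Create computes the measure internally.

```python
def jaccard_similarity(raters_i, raters_j):
    """Jaccard similarity between two items.

    raters_i, raters_j: collections of user IDs who rated item i and item j.
    Only the presence of a rating matters; the scores are ignored.
    """
    raters_i, raters_j = set(raters_i), set(raters_j)
    union = raters_i | raters_j
    if not union:
        return 0.0  # neither item has been rated by anyone
    return len(raters_i & raters_j) / len(union)


# Hypothetical example: users 2 and 3 rated both profiles,
# and four distinct users rated at least one of them.
print(jaccard_similarity({1, 2, 3}, {2, 3, 4}))
```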

#### Cosine Similarity

The Cosine similarity measures the similarity between two items for users that have either rated one or both of the items.  The equation for Cosine similarity is as follows:

$$\mbox{CS}(i,j)
= \frac{\sum_{u\in U_{ij}} r_{ui}r_{uj}}
    {\sqrt{\sum_{u\in U_{i}} r_{ui}^2}
     \sqrt{\sum_{u\in U_{j}} r_{uj}^2}}$$

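Here $U_i$ is the set of users who rated item $i$, $U_{ij}$ is the set of users who rated both items $i$ and $j$, and $r_{ui}$ is the rating user $u$ gave to item $i$.  An issue with the Cosine similarity is that the mean and variance of the scores/ratings are not taken into account in the calculation.  If the means and/or variances differ greatly between items, the Cosine similarity metric can become skewed.  This is where the Pearson similarity comes in. 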
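
For comparison, the Cosine formula above can be sketched as follows.  The function name `cosine_similarity`, the dictionary layout (rater ID mapped to rating), and the toy ratings are assumptions made for illustration, not GraphLab Create's implementation.

```python
import math


def cosine_similarity(ratings_i, ratings_j):
    """Cosine similarity between two items.

    ratings_i, ratings_j: dicts mapping user ID -> rating for item i and item j.
    The numerator sums over users who rated both items; each norm in the
    denominator sums over all users who rated that item.
    """
    common_users = set(ratings_i) & set(ratings_j)
    numerator = sum(ratings_i[u] * ratings_j[u] for u in common_users)
    norm_i = math.sqrt(sum(r * r for r in ratings_i.values()))
    norm_j = math.sqrt(sum(r * r for r in ratings_j.values()))
    if norm_i == 0 or norm_j == 0:
        return 0.0  # an item with no (or all-zero) ratings has no direction
    return numerator / (norm_i * norm_j)


# Hypothetical ratings: only user "a" rated both profiles.
profile_i = {"a": 3, "b": 4}
profile_j = {"a": 3, "c": 4}
print(cosine_similarity(profile_i, profile_j))
```

Because the raw ratings are used directly, two raters who agree in their ordering but use different parts of the scale will pull this value in different directions.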

#### Pearson Similarity 

The Pearson similarity, like the Cosine similarity, measures the similarity between two items for users having rated both or just one of the items.  It is calculated using the following equation:

$$\mbox{PS}(i,j)
= \frac{\sum_{u\in U_{ij}} (r_{ui} - \bar{r}_i)
                            (r_{uj} - \bar{r}_j)}
    {\sqrt{\sum_{u\in U_{ij}} (r_{ui} - \bar{r}_i)^2}
     \sqrt{\sum_{u\in U_{ij}} (r_{uj} - \bar{r}_j)^2}}$$

Here $\bar{r}_i$ is the mean rating of item $i$.  Unlike the Cosine similarity metric, the Pearson similarity removes the effect of the mean and variance from its calculation.  
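
The Pearson formula above can be sketched in the same style.  As before, the function name `pearson_similarity`, the dictionary layout, and the toy ratings are illustrative assumptions; in particular, we assume $\bar{r}_i$ is the mean over all ratings of item $i$, which may differ in detail from GraphLab Create's implementation.

```python
import math


def pearson_similarity(ratings_i, ratings_j):
    """Pearson similarity between two items.

    ratings_i, ratings_j: dicts mapping user ID -> rating for item i and item j.
    Ratings are centered on each item's mean (assumed here to be the mean of
    all ratings the item received), and every sum runs over users who rated both.
    """
    common_users = set(ratings_i) & set(ratings_j)
    if not common_users:
        return 0.0
    mean_i = sum(ratings_i.values()) / len(ratings_i)
    mean_j = sum(ratings_j.values()) / len(ratings_j)
    numerator = sum(
        (ratings_i[u] - mean_i) * (ratings_j[u] - mean_j) for u in common_users
    )
    spread_i = math.sqrt(sum((ratings_i[u] - mean_i) ** 2 for u in common_users))
    spread_j = math.sqrt(sum((ratings_j[u] - mean_j) ** 2 for u in common_users))
    if spread_i == 0 or spread_j == 0:
        return 0.0  # constant ratings carry no correlation information
    return numerator / (spread_i * spread_j)


# Hypothetical ratings: profile_j receives the same pattern as profile_i,
# shifted five points up the scale.
profile_i = {"a": 1, "b": 2, "c": 3}
profile_j = {"a": 6, "b": 7, "c": 8}
print(pearson_similarity(profile_i, profile_j))
```

In this toy case the two profiles are rated with the same pattern at different levels of the scale; centering on the item means is what lets the Pearson measure treat them as similar despite the offset.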

Because our dataset contains explicit ratings rather than just a record of who rated whom, the Cosine and Pearson similarity metrics will most likely work best, although which of the two performs better will depend on the mean and variance of the ratings in the dataset.  

Information was obtained from the following sources: http://www.cs.carleton.edu/cs_comps/0607/recommend/recommender/itembased.html & https://turi.com/products/create/docs/generated/graphlab.recommender.item_similarity_recommender.ItemSimilarityRecommender.html