## Content

**Similarities**<br>
[Cosine](#Cosine)<br>
[Pearson](#Pearson)<br>
[Jaccard](#Jaccard)<br>

**Graphlab**<br>
[Factorization Recommender](#FactorizationRecommender)<br>

Imports

In [46]:
from __future__ import division
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import graphlab

%matplotlib inline

Data

In [2]:
from sklearn.datasets import load_iris
data = load_iris()
y_all = data.target
X_all = data.data

Train test split

In [3]:
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import train_test_split

X, X_test, y, y_test = train_test_split(X_all, y_all, test_size=0.25, random_state=42)

<a id='Cosine'></a>
### Cosine Similarity
[cosine_similarity](http://scikit-learn.org/dev/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html) - sklearn, pairwise<br>
[cosine](http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.spatial.distance.cosine.html) - scipy, two datapoints

Defined as: `dot(u, v) / (||u|| * ||v||)`

*When applied to ratings, missing ratings are treated as 0.*<br>
*Cosine similarity on mean-centered (demeaned) vectors equals the Pearson correlation.*
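As a minimal sketch of the "missing ratings as 0" convention, the block below fills unrated entries with 0 before applying the definition. The two rating vectors are hypothetical illustration data, not taken from the iris set used in this notebook.

```python
import numpy as np

# Hypothetical ratings for two users over five items; NaN marks "not rated".
u = np.array([5.0, np.nan, 3.0, 1.0, np.nan])
v = np.array([4.0, 2.0, np.nan, 1.0, np.nan])

# Replace missing ratings with 0 so they add nothing to the dot product.
u0 = np.nan_to_num(u)
v0 = np.nan_to_num(v)

# Cosine similarity: dot(u, v) / (||u|| * ||v||)
cos_sim = u0.dot(v0) / (np.linalg.norm(u0) * np.linalg.norm(v0))
print(cos_sim)
```

Only items rated by both users contribute to the numerator, while every rated item still contributes to that user's norm.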

In [18]:
# wrap the whole expression so it prints the same under Python 2 and Python 3
print(X[0].dot(X[1]) / (np.linalg.norm(X[0]) * np.linalg.norm(X[1])))

0.998951666649


In [10]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_similarity(X[0:2])

array([[ 1.        ,  0.99895167],
       [ 0.99895167,  1.        ]])

In [11]:
from scipy import spatial

1 - spatial.distance.cosine(X[0], X[1])

0.99895166664881352

<a id='Pearson'></a>
### Pearson Similarity
[pearsonr](http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.pearsonr.html) - scipy<br>
[corrcoef](http://docs.scipy.org/doc/numpy/reference/generated/numpy.corrcoef.html) - numpy, pairwise

Defined as: `cov(x,y) / (std(x)*std(y))`<br>
`cov(x,y) = mean(x*y) - mean(x)*mean(y)`

*Equivalent to cosine similarity computed on mean-centered (demeaned) vectors.*
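The equivalence with cosine similarity can be checked numerically. This is a small sketch with two made-up vectors (`x` and `y` are illustrative values, and `cosine` is a helper defined here, not a library function).

```python
import numpy as np

# Illustrative vectors; any two non-constant vectors of equal length would do.
x = np.array([5.1, 3.5, 1.4, 0.2])
y = np.array([6.2, 2.9, 4.3, 1.3])

def cosine(u, v):
    """Cosine similarity of two 1-D arrays."""
    return u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Subtract each vector's own mean, then take the cosine.
centered_cosine = cosine(x - x.mean(), y - y.mean())

# Pearson correlation from NumPy for comparison.
pearson = np.corrcoef(x, y)[0, 1]

print(np.isclose(centered_cosine, pearson))
```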

In [28]:
((X[0]*X[1]).mean() - X[0].mean() * X[1].mean()) / (X[0].std() * X[1].std())

0.9970960693815728

In [25]:
np.corrcoef(X[0:2])

array([[ 1.        ,  0.99709607],
       [ 0.99709607,  1.        ]])

In [22]:
from scipy.stats import pearsonr
pearsonr(X[0], X[1])[0]

0.99709606938157247

<a id='Jaccard'></a>
### Jaccard Similarity
[jaccard_similarity_score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.jaccard_similarity_score.html)

Defined as the size of the intersection divided by the size of the union of two label sets.

In [44]:
# binary indicator vectors: 1 = element present, 0 = absent
X0 = [0,1,1,0]
X1 = [1,1,1,0]

In [45]:
from sklearn.metrics import jaccard_similarity_score

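# For binary or multiclass input, jaccard_similarity_score is equivalent to accuracy:
# 3 of 4 positions match, hence 0.75. The set-based Jaccard of X0 and X1 is 2/3.
print(jaccard_similarity_score(X0, X1))
print(jaccard_similarity_score(X[0] > 1, X[1] > 1))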

0.75
1.0
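
To compare against the set-based definition, here is a minimal hand-rolled version. `jaccard_binary` is a hypothetical helper written for this note, not part of scikit-learn, and the empty-set convention is an assumption.

```python
X0 = [0, 1, 1, 0]
X1 = [1, 1, 1, 0]

def jaccard_binary(a, b):
    """Intersection over union of the positions where each vector is nonzero."""
    set_a = {i for i, value in enumerate(a) if value}
    set_b = {i for i, value in enumerate(b) if value}
    union = set_a | set_b
    if not union:
        return 1.0  # assumption: two empty sets count as identical
    return len(set_a & set_b) / float(len(union))

jaccard = jaccard_binary(X0, X1)
print(jaccard)
```

Here the sets are {1, 2} and {0, 1, 2}, so the intersection has 2 elements and the union has 3.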


### Collaborative Filtering

User-based
- computes similarities between users
- slow when there are many users
- less stable over time (user preferences change)

Item-based
- computes similarities between items
- faster, because item-item similarities can be pre-computed
- more stable over time (an item usually stays in one category, its properties do not change, and it accumulates more ratings)
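
The two approaches above differ mainly in which axis of the ratings matrix the similarity is computed over. The sketch below uses a tiny hypothetical ratings matrix and a hand-written `cosine_matrix` helper (both are assumptions for illustration) and makes a simple item-based prediction as a similarity-weighted average.

```python
import numpy as np

# Hypothetical ratings: rows are users, columns are items, 0 means "not rated".
ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [1.0, 0.0, 5.0, 4.0],
])

def cosine_matrix(rows):
    """Pairwise cosine similarity between the rows of a 2-D array."""
    norms = np.linalg.norm(rows, axis=1, keepdims=True)
    unit = rows / norms
    return unit.dot(unit.T)

user_sim = cosine_matrix(ratings)    # user-based: (n_users, n_users)
item_sim = cosine_matrix(ratings.T)  # item-based: (n_items, n_items), can be pre-computed

# Item-based prediction for user 0 on item 2, which that user has not rated:
# average the user's existing ratings, weighted by similarity to item 2.
user, item = 0, 2
rated = ratings[user] > 0
weights = item_sim[item, rated]
prediction = weights.dot(ratings[user, rated]) / weights.sum()
print(prediction)
```

This toy version assumes every user and item has at least one rating, so no norm is zero.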

<a id='Graphlab'></a>
### Graphlab
[recommender](https://dato.com/products/create/docs/graphlab.toolkits.recommender.html)

Data

In [60]:
ratings_contents = pd.read_table("data/u.data",
                                names=["user", "movie", "rating", "timestamp"])
data = graphlab.SFrame(ratings_contents)
data.remove_column('timestamp')
data = data[:1000]

<a id='FactorizationRecommender'></a>
### Factorization Recommender
[FactorizationRecommender](https://dato.com/products/create/docs/generated/graphlab.recommender.factorization_recommender.FactorizationRecommender.html#graphlab.recommender.factorization_recommender.FactorizationRecommender)

In [69]:
matrix_fact = graphlab.factorization_recommender.create(data, user_id='user',
                                                        item_id='movie', target='rating',
                                                        regularization=1e-4)

How prediction works

In [73]:
one_datapoint_sf = graphlab.SFrame({'user': [1], 'movie': [100]})
matrix_fact.predict(one_datapoint_sf)

dtype: float
Rows: 1
[5.012780416600406]

In [74]:
# get the latent factors for user 1
user = matrix_fact.get('coefficients')['user']
user_1_fact = user[user['user'] == 1]['factors']

# get the latent factors for movie 100
movie = matrix_fact.get('coefficients')['movie']
movie_100_fact = movie[movie['movie'] == 100]['factors']

# dot product of the two factor vectors (8 latent factors in this model)
dot = np.dot(np.array(user_1_fact), np.array(movie_100_fact).reshape(-1, 1))[0][0]

# add the 3 intercepts to the dot product (overall, user, movie)
# the overall intercept is just the mean of the data: data['rating'].mean()
pred = dot + movie[movie['movie'] == 100]['linear_terms'] \
    + user[user['user'] == 1]['linear_terms'] \
    + matrix_fact.get('coefficients')['intercept']
print(pred[0])

5.01278041656
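
The same arithmetic can be written without GraphLab as prediction = overall intercept + user term + item term + dot(user factors, item factors). The numbers below are made-up placeholders, not coefficients from the model trained above, and `predict_rating` is a hypothetical helper.

```python
import numpy as np

# Placeholder coefficients for one (user, item) pair; all values are hypothetical.
global_intercept = 3.5                      # overall mean rating
user_bias = 0.4                             # user linear term
item_bias = 0.6                             # item linear term
user_factors = np.array([0.1, -0.2, 0.3])   # user latent factors
item_factors = np.array([0.5, 0.1, -0.4])   # item latent factors

def predict_rating(mu, b_user, b_item, p_user, q_item):
    """Intercepts plus the dot product of the latent factor vectors."""
    return mu + b_user + b_item + np.dot(p_user, q_item)

pred = predict_rating(global_intercept, user_bias, item_bias,
                      user_factors, item_factors)
print(pred)
```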
