## Content

**Similarities**<br>
[Cosine](#Cosine)<br>
[Pearson](#Pearson)<br>
[Jaccard](#Jaccard)<br>

**Graphlab**<br>
[Factorization Recommender](#FactorizationRecommender)<br>
[Item Similarity Recommender](#ItemSimilarityRecommender)<br>
[Popularity Recommender](#PopularityRecommender)<br>
[Ranking Factorization Recommender](#RankingFactorizationRecommender)

Imports

In [1]:
from __future__ import division
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import graphlab as gl

%matplotlib inline

[INFO] graphlab.cython.cy_server: GraphLab Create v1.10.1 started. Logging: /tmp/graphlab_server_1467864405.log


Data

In [2]:
from sklearn.datasets import load_iris
data = load_iris()
y_all = data.target
X_all = data.data

Train test split

In [3]:
# note: in scikit-learn >= 0.18 train_test_split lives in sklearn.model_selection
from sklearn.cross_validation import train_test_split

X, X_test, y, y_test = train_test_split(X_all, y_all, test_size=0.25, random_state=42)

<a id='Cosine'></a>
### Cosine Similarity
[cosine_similarity](http://scikit-learn.org/dev/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html) - sklearn, pairwise<br>
[cosine](http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.spatial.distance.cosine.html) - scipy, two datapoints

Defined as: `(u·v) / (||u||*||v||)`, where `u·v` is the dot product

*missing ratings are treated as 0*<br>
*cosine similarity on demeaned data is the same as Pearson correlation*

In [4]:
print (X[0].dot(X[1])) / (np.linalg.norm(X[0]) * np.linalg.norm(X[1]))

0.998951666649


In [5]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_similarity(X[0:2])

array([[ 1.        ,  0.99895167],
       [ 0.99895167,  1.        ]])

In [6]:
from scipy import spatial

1 - spatial.distance.cosine(X[0], X[1])

0.99895166664881352
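
The note about treating missing ratings as 0 can be illustrated with a minimal sketch. It assumes each user's ratings are stored as a `{item: rating}` dict; the helper `cosine_sparse` and the two example users are hypothetical and not part of the notebook's data. It uses only the standard library.

```python
from math import sqrt

def cosine_sparse(a, b):
    """Cosine similarity of two {item: rating} dicts; missing items count as 0."""
    # Only items rated by both users contribute to the dot product,
    # because a missing rating is treated as 0.
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_a * norm_b)

# Hypothetical ratings, not taken from the notebook's data.
alice = {"m1": 5, "m2": 3}
bob = {"m2": 4, "m3": 2}
sim = cosine_sparse(alice, bob)
print(sim)
```

Only the shared item `m2` adds to the numerator, while every rated item adds to the norms, so users with few items in common get a low similarity even if they agree on those items.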

<a id='Pearson'></a>
### Pearson Similarity
[pearsonr](http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.pearsonr.html) - scipy<br>
[corrcoef](http://docs.scipy.org/doc/numpy/reference/generated/numpy.corrcoef.html) - numpy, pairwise

Defined as: `cov(x,y) / (std(x)*std(y))`<br>
`cov(x,y) = mean(x*y) - mean(x)*mean(y)`

*same as cosine similarity on demeaned data*

In [7]:
((X[0]*X[1]).mean() - X[0].mean() * X[1].mean()) / (X[0].std() * X[1].std())

0.9970960693815728

In [8]:
np.corrcoef(X[0:2])

array([[ 1.        ,  0.99709607],
       [ 0.99709607,  1.        ]])

In [9]:
from scipy.stats import pearsonr
pearsonr(X[0], X[1])[0]

0.99709606938157247
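
The claim that Pearson equals cosine similarity on demeaned data can be checked with a small sketch. The helper names (`cosine`, `demean`, `pearson`) and the two four-element vectors are illustrative assumptions; the code uses only the standard library.

```python
from math import sqrt

def cosine(u, v):
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def demean(x):
    """Subtract the mean of x from every element."""
    m = sum(x) / float(len(x))
    return [a - m for a in x]

def pearson(x, y):
    """cov(x, y) / (std(x) * std(y)) with population (1/n) statistics."""
    n = float(len(x))
    mx, my = sum(x) / n, sum(y) / n
    cov = sum(a * b for a, b in zip(x, y)) / n - mx * my
    sx = sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = sqrt(sum((b - my) ** 2 for b in y) / n)
    return cov / (sx * sy)

# Illustrative vectors chosen for this sketch.
x = [5.1, 3.5, 1.4, 0.2]
y = [7.0, 3.2, 4.7, 1.4]

print(pearson(x, y))
print(cosine(demean(x), demean(y)))  # agrees with Pearson up to rounding
print(cosine(x, y))                  # raw cosine is a different number
```

The `1/n` factors in the covariance and the standard deviations cancel, which is why the demeaned cosine and Pearson coincide.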

<a id='Jaccard'></a>
### Jaccard Similarity
[jaccard_similarity_score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.jaccard_similarity_score.html)

**Technically defined on sets, be careful about using it on vectors.**<br>
Defined as the size of the intersection divided by the size of the union of two label sets.

In [10]:
#for 0s and 1s
X0 = np.array([0,1,1,1,0])
X1 = np.array([1,1,1,0,0])

sum([x0 == x1 and x0 == 1 for x0, x1 in zip(X0, X1)]) / sum([x0 == 1 or x1 == 1 for x0, x1 in zip(X0, X1)])

0.5

In [11]:
from sklearn.metrics import jaccard_similarity_score

print jaccard_similarity_score(X0, X1)  # this counts two zeros as a match, which is not what we want

0.6
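
Since Jaccard is defined on sets, one way to avoid the double-zero problem is to convert each 0/1 vector into the set of positions that hold a 1 and compare those sets. The sketch below assumes that encoding; the helper `jaccard` is hypothetical and written with the standard library only. (In later scikit-learn releases `jaccard_similarity_score` was deprecated in favor of `jaccard_score`.)

```python
def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # convention chosen here: two empty sets are identical
    return len(a & b) / float(len(union))

X0 = [0, 1, 1, 1, 0]
X1 = [1, 1, 1, 0, 0]

# Turn each 0/1 vector into the set of positions holding a 1.
s0 = {i for i, v in enumerate(X0) if v == 1}
s1 = {i for i, v in enumerate(X1) if v == 1}

print(jaccard(s0, s1))  # 0.5
```

Positions where both vectors are 0 never enter either set, so they cannot inflate the score.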


### Collaborative Filtering

User-based
- similarities between users
- slow for many users
- less stable over time (change in preferences)

Item-based
- similarities between items
- faster with pre-computed item-item similarity
- more stable over time (items often in one category, no changes of preferences, more ratings)
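
The item-based approach can be sketched as follows: compute the item-item similarity table once, then predict a rating as the similarity-weighted average of the user's own ratings. This is a minimal illustration under stated assumptions, not how any particular library implements it; the toy ratings and every function name are hypothetical.

```python
from math import sqrt

# Hypothetical {user: {item: rating}} data, not taken from the notebook.
ratings = {
    "u1": {"A": 5, "B": 4, "C": 1},
    "u2": {"A": 4, "B": 5},
    "u3": {"A": 1, "B": 2, "C": 5},
}

def item_vectors(ratings):
    """Invert user -> item ratings into item -> user ratings."""
    items = {}
    for user, row in ratings.items():
        for item, r in row.items():
            items.setdefault(item, {})[user] = r
    return items

def cosine(a, b):
    """Cosine similarity of two sparse dict vectors (missing entries are 0)."""
    dot = sum(a[u] * b[u] for u in a if u in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def item_similarities(ratings):
    """Pre-compute the item-item similarity table once."""
    vecs = item_vectors(ratings)
    return {i: {j: cosine(vecs[i], vecs[j]) for j in vecs if j != i} for i in vecs}

def predict(user, item, ratings, sims):
    """Similarity-weighted average of the ratings the user has already given."""
    pairs = [(sims[item][j], r) for j, r in ratings[user].items() if j != item]
    total = sum(s for s, _ in pairs)
    if total == 0:
        return None  # no similar rated items to base a prediction on
    return sum(s * r for s, r in pairs) / total

sims = item_similarities(ratings)
pred = predict("u2", "C", ratings, sims)
print(pred)
```

Because `sims` depends only on the items, it can be reused for every user, which is where the speed advantage over the user-based variant comes from.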

<a id='Graphlab'></a>
### Graphlab Recommenders
[recommender](https://dato.com/products/create/docs/graphlab.toolkits.recommender.html)

Data

In [12]:
ratings_contents = pd.read_table("data/u.data",
                                names=["user", "movie", "rating", "timestamp"])
data = gl.SFrame(ratings_contents)
data.remove_column('timestamp')
data = data[:1000]

one_datapoint_sf = gl.SFrame({'user': [1], 'movie': [100]})

<a id='FactorizationRecommender'></a>
### Factorization Recommender
[FactorizationRecommender](https://dato.com/products/create/docs/generated/graphlab.recommender.factorization_recommender.FactorizationRecommender.html#graphlab.recommender.factorization_recommender.FactorizationRecommender)<br>
[create()](https://dato.com/products/create/docs/generated/graphlab.recommender.factorization_recommender.create.html#graphlab.recommender.factorization_recommender.create)

Learns latent factors for each user and item and uses them to make rating predictions.

*side features may be provided via the user_data and item_data options when the model is created*<br>
*observation-specific information, such as the time of day when the user rated the item, can also be included*<br>
*strings are treated as categorical variables and integers and floats are treated as numeric variables*

In [13]:
matrix_fact = gl.factorization_recommender.create(data, target='rating',
                                                  user_id='user', item_id='movie',
                                                  user_data=None, item_data=None,
                                                  regularization = 1e-4,
                                                  verbose=False)
print matrix_fact.predict(one_datapoint_sf)
matrix_fact.recommend(users=np.array([1]), k=5)

[4.984307093389332]


user,movie,score,rank
1,566,6.08912532854,1
1,4,5.83309262323,2
1,209,5.61531310606,3
1,220,5.60774915743,4
1,203,5.5997412734,5


In [14]:
#print matrix_fact.list_fields()
print matrix_fact.get('coefficients')

{'movie': Columns:
	movie	int
	linear_terms	float
	factors	array

Rows: 551

Data:
+-------+-----------------+-------------------------------+
| movie |   linear_terms  |            factors            |
+-------+-----------------+-------------------------------+
|  242  |  0.594195842743 | [0.0743168666959, -0.19293... |
|  302  |  0.463811606169 | [-0.117401659489, 0.171064... |
|  377  |  -2.58234286308 | [0.000271603814326, 0.0015... |
|   51  |  -0.94407081604 | [0.0123480930924, 0.013804... |
|  346  |  -1.30766832829 | [-0.25694668293, -0.527391... |
|  474  |  1.49613690376  | [0.0396084226668, 0.210686... |
|  265  |  0.41582262516  | [0.236695408821, 0.3987476... |
|  465  |  1.45149409771  | [0.000364542764146, 0.0014... |
|  451  | -0.799559414387 | [-0.0188290663064, 0.03579... |
|   86  |  0.697341501713 | [-0.174927383661, -0.04498... |
+-------+-----------------+-------------------------------+
[551 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can 

How predicting works

In [15]:
#get user 1 coefficients
user = matrix_fact.get('coefficients')['user']
user_1_fact = user[user['user'] == 1]['factors']

#get movie 100 coefficients
movie = matrix_fact.get('coefficients')['movie']
movie_100_fact = movie[movie['movie'] == 100]['factors']

#make a dot product
dot = np.dot(np.array(user_1_fact), np.array(movie_100_fact).reshape(8,1))[0][0]

#add 3 intercepts to the dot product (overall, user, movie)
#overall intercept is just mean of data: data['rating'].mean()
pred = dot + movie[movie['movie'] == 100]['linear_terms'] \
    + user[user['user'] == 1]['linear_terms'] \
    + matrix_fact.get('coefficients')['intercept']
print pred[0]

4.98430709306
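
The same arithmetic can be written without GraphLab objects: the prediction is the overall intercept plus the user and movie linear terms plus the dot product of the two factor vectors. The function name and all coefficient values below are made up for illustration and are not the fitted values from the model above.

```python
def predict_rating(intercept, user_bias, item_bias, user_factors, item_factors):
    """Intercept + user and item linear terms + dot product of latent factors."""
    dot = sum(p * q for p, q in zip(user_factors, item_factors))
    return intercept + user_bias + item_bias + dot

# Made-up coefficients (two latent factors) purely for illustration.
pred = predict_rating(
    intercept=3.5,
    user_bias=0.2,
    item_bias=-0.1,
    user_factors=[0.5, -1.0],
    item_factors=[0.4, 0.3],
)
print(pred)  # approximately 3.5
```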


<a id='ItemSimilarityRecommender'></a>
### Item Similarity Recommender
[ItemSimilarityRecommender](https://dato.com/products/create/docs/generated/graphlab.recommender.item_similarity_recommender.ItemSimilarityRecommender.html#graphlab.recommender.item_similarity_recommender.ItemSimilarityRecommender)<br>
[create()](https://dato.com/products/create/docs/generated/graphlab.recommender.item_similarity_recommender.create.html#graphlab.recommender.item_similarity_recommender.create)

Ranks an item according to its similarity to other items observed for the user in question. Computes the similarity between items using the observations of users who have interacted with both items.

*side features currently ignored*<br>
*three similarity metrics are available: ‘jaccard’, ‘cosine’ and ‘pearson’ (cosine does not work well here because the ratings are not demeaned)*

In [16]:
item_sim = gl.recommender.item_similarity_recommender.create(data, target='rating',
                                                             user_id='user', item_id='movie',
                                                             #user_data=None, item_data=None, #currently ignored
                                                             similarity_type='pearson',
                                                             verbose=False)
print item_sim.predict(one_datapoint_sf)
item_sim.recommend(users=np.array([1]), k=5)

[4.333333333333333]


user,movie,score,rank
1,1137,5.0,1
1,327,5.0,2
1,603,5.0,3
1,95,5.0,4
1,465,5.0,5


In [17]:
#print item_sim.list_fields()
print item_sim.get('training_rmse')

None


In [18]:
item_sim.get_similar_items()

movie,similar,score,rank
242,289,0.192450106144,1
242,1067,0.166666686535,2
302,250,0.375,1
302,286,0.125,2
302,742,0.1062682271,3
346,132,0.251976311207,1
346,96,0.0325300097466,2
474,486,0.333333313465,1
474,281,0.25,2
474,484,0.25,3


<a id='PopularityRecommender'></a>
### Popularity Recommender
[PopularityRecommender](https://dato.com/products/create/docs/generated/graphlab.recommender.popularity_recommender.PopularityRecommender.html#graphlab.recommender.popularity_recommender.PopularityRecommender)<br>
[create()](https://dato.com/products/create/docs/generated/graphlab.recommender.popularity_recommender.create.html#graphlab.recommender.popularity_recommender.create)

Ranks items according to their overall popularity. When making recommendations, each item is scored by the number of times it is seen in the training set. The item scores are the same for all users, so the recommendations are not tailored to individuals.

*simple and fast, provides a reasonable baseline, can be used as a “background” model for new users*

In [19]:
popularity = gl.recommender.popularity_recommender.create(data, target='rating',
                                                          user_id='user', item_id='movie',
                                                          #user_data=None, item_data=None, #useless
                                                          verbose=False)
print popularity.predict(one_datapoint_sf)
popularity.recommend(users=np.array([1]), k=5)

[4.333333333333333]


user,movie,score,rank
1,1137,5.0,1
1,327,5.0,2
1,603,5.0,3
1,95,5.0,4
1,465,5.0,5


In [20]:
#print popularity.list_fields()
print 'training RMSE:', popularity.get('training_rmse')
print popularity.get('item_predictions')

training RMSE: 0.700375749492
+-------+---------------+
| movie |   prediction  |
+-------+---------------+
|  242  | 3.66666666667 |
|  302  |      3.75     |
|  377  |      1.0      |
|   51  |      2.0      |
|  346  | 2.66666666667 |
|  474  |      4.5      |
|  265  |      3.4      |
|  465  |      5.0      |
|  451  |      3.0      |
|   86  |      4.0      |
+-------+---------------+
[551 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
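
The `item_predictions` values above look like per-item averages of the target. Assuming that is how the score is formed when a target column is supplied, a popularity baseline can be sketched in a few lines. The function names, the toy rows, and the choice to skip items the user has already rated are assumptions of this sketch.

```python
from collections import defaultdict

def item_means(rows):
    """Mean rating per item from (user, item, rating) rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for _user, item, rating in rows:
        totals[item] += rating
        counts[item] += 1
    return {item: totals[item] / counts[item] for item in totals}

def recommend(user, rows, k=2):
    """Top-k items by mean rating, skipping items the user already rated."""
    seen = {item for u, item, _r in rows if u == user}
    means = item_means(rows)
    # Sort by descending mean; break ties by item id for a stable order.
    ranked = sorted((i for i in means if i not in seen),
                    key=lambda i: (-means[i], i))
    return [(i, means[i]) for i in ranked[:k]]

# Hypothetical (user, item, rating) rows, not taken from u.data.
rows = [(1, "A", 5), (2, "A", 4), (2, "B", 2), (3, "B", 3), (3, "C", 5)]

print(item_means(rows))
print(recommend(1, rows))
```

The scores do not depend on the user; only the filter on already-rated items does, which is why this works as a fallback for users with little or no history.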


<a id='RankingFactorizationRecommender'></a>
### Ranking Factorization Recommender
[RankingFactorizationRecommender](https://dato.com/products/create/docs/generated/graphlab.recommender.ranking_factorization_recommender.RankingFactorizationRecommender.html#graphlab.recommender.ranking_factorization_recommender.RankingFactorizationRecommender)<br>
[create()](https://dato.com/products/create/docs/generated/graphlab.recommender.ranking_factorization_recommender.create.html#graphlab.recommender.ranking_factorization_recommender.create)

Learns latent factors for each user and item and uses them to rank recommended items according to the likelihood of observing those (user, item) pairs. This is commonly desired when performing collaborative filtering for implicit feedback datasets or datasets with explicit ratings for which ranking prediction is desired.

*from experimenting with this, it appears to be a very unstable model (at least on this dataset), so choose parameters carefully!*

In [21]:
rank_fact = gl.recommender.ranking_factorization_recommender.create(data, target='rating',
                                                                    user_id='user', item_id='movie',
                                                                    user_data=None, item_data=None,
                                                                    num_factors=32, regularization=1e-09,
                                                                    linear_regularization=1e-09, #solver='sgd',
                                                                    verbose=False)
print rank_fact.predict(one_datapoint_sf)
rank_fact.recommend(users=np.array([1]), k=5)

[2.981244371116161]


user,movie,score,rank
1,79,4.07778113413,1
1,328,3.99390357065,2
1,209,3.97808187532,3
1,197,3.96886139441,4
1,23,3.92106645632,5


In [22]:
#rank_fact.list_fields()
rank_fact.get('coefficients')

{'intercept': 3.518, 'movie': Columns:
 	movie	int
 	linear_terms	float
 	factors	array
 
 Rows: 551
 
 Data:
 +-------+-----------------+-------------------------------+
 | movie |   linear_terms  |            factors            |
 +-------+-----------------+-------------------------------+
 |  242  |  0.156346768141 | [-0.0105757359415, -0.0197... |
 |  302  | 0.0414082817733 | [-0.0993492081761, -0.1024... |
 |  377  |  -1.98437452316 | [0.00470882328227, -0.0116... |
 |   51  | -0.588530600071 | [-0.0338789150119, 0.03737... |
 |  346  |  -1.17732822895 | [0.00348259904422, -0.2153... |
 |  474  | -0.149066910148 | [0.16456541419, -0.2462467... |
 |  265  | -0.240351870656 | [-0.118175491691, 0.244684... |
 |  465  |  0.132390648127 | [-0.000184644013643, 0.068... |
 |  451  | -0.298329472542 | [-0.026411825791, -0.03690... |
 |   86  |  0.191182792187 | [-0.252721220255, -0.01252... |
 +-------+-----------------+-------------------------------+
 [551 rows x 3 columns]
 Note: Only 