# Exercise D: Pairwise Recommender (mixing ML models)

In this part of the exercise, we'll try to re-create the [GraphLab Create Pathways Demo](http://pathways-demo.herokuapp.com/).

A reminder: in this demo, two users each select one item. GraphLab then finds a 'path' between the two items: a sequence of items that ranges from the first to the second, so that the two users can agree on something in the middle that would satisfy both of them. The similarities between items come from a trained recommender model.

The data is the same as in the basic training exercise:
<ul>
<li>`business.csv.gz`: https://s3.amazonaws.com/dato-datasets/dato-training/business.csv.gz (640.2 KB)
<li>`review.csv.gz`: https://s3.amazonaws.com/dato-datasets/dato-training/review.csv.gz (80.4 MB)
<li>`user.csv.gz`: https://s3.amazonaws.com/dato-datasets/dato-training/user.csv.gz (1.1 MB)
</ul>

## Solution Outline
1. Load the data.
2. Create a recommender model.
3. Get the similarity graph from the recommender model.
4. Using GraphLab's shortest paths model (from the `graph_analytics` toolkit), write a function that accepts two item IDs as parameters, and returns the shortest path between them in the similarity graph.

# Try solving this exercise yourself!
## If you get stuck, our proposed solution appears below (with each cell's execution results).

# Proposed Solution

In [1]:
# load the data
import graphlab as gl

review = gl.SFrame.read_csv('review.csv.gz')
review.head(3)

2016-05-20 00:11:40,884 [INFO] graphlab.cython.cy_server, 176: GraphLab Create v1.9 started. Logging: /tmp/graphlab_server_1463695898.log


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,int,str,str,str,dict,int,int,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


business_id,date,review_id,stars,text,type
9yKzy9PApeiPPOUJEtnvkg,2011-01-26,fWKvX83p0-ka4JS3dc6E5A,5,My wife took me here on my birthday for break ...,review
ZRJwVLyzEJq1VAihDhYiow,2011-07-27,IjZ33sJrzXqU-0X6U8NwyA,5,I have no idea why some people give bad reviews ...,review
6oRAC4uyJCsJl1X0WZpVSA,2012-06-14,IESLBzqUCLdSzSqm0eCSxQ,4,love the gyro plate. Rice is so good and I also ...,review

user_id,votes,year,month,day
rLtl8ZkDX5vH5nAx9C3q5Q,"{'funny': 0, 'useful': 5, 'cool': 2} ...",2011,1,26
0a2KyEL0d3Yb1V6aivbIuQ,"{'funny': 0, 'useful': 0, 'cool': 0} ...",2011,7,27
0hT2KtfLiobPvh6cDC8JQg,"{'funny': 0, 'useful': 1, 'cool': 0} ...",2012,6,14


In [2]:
# train and evaluate a recommender model
train, test = review.random_split(0.8)
model = gl.ranking_factorization_recommender.create(observation_data=train, item_id="business_id", target='stars')

In [3]:
e = model.evaluate(test)


Precision and recall summary statistics by cutoff
+--------+------------------+------------------+
| cutoff |  mean_precision  |   mean_recall    |
+--------+------------------+------------------+
|   1    | 0.00737607115739 | 0.00269133591959 |
|   2    | 0.00829808005207 | 0.00637712730621 |
|   3    | 0.00790035072495 | 0.00890106838288 |
|   4    | 0.00755233756373 | 0.0113478720666  |
|   5    | 0.00689879596486 | 0.0128399362111  |
|   6    | 0.00660773041183 | 0.0149578794967  |
|   7    | 0.00635333860196 | 0.0163172250346  |
|   8    | 0.00614220631305 | 0.0180235024672  |
|   9    | 0.00605030673368 | 0.0205076337028  |
|   10   | 0.00599305781538 | 0.0225982194086  |
+--------+------------------+------------------+
[10 rows x 3 columns]

('\nOverall RMSE: ', 1.513615993884949)

Per User RMSE (best)
+------------------------+-------+------------------+
|        user_id         | count |       rmse       |
+------------------------+-------+------------------+
| ClPqrnBK2P3m-h

In [4]:
e['rmse_overall']

1.513615993884949
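For reference, the overall RMSE reported above is the square root of the mean squared difference between the actual and the predicted star ratings, so a value of about 1.5 on a 1-to-5 star scale is a sizeable error. The following is a minimal plain-Python sketch of that computation, not GraphLab's implementation; the ratings in it are made up for illustration and are not taken from the dataset.

```python
import math

def rmse(actual, predicted):
    """Root-mean-square error between two equal-length sequences of ratings."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / float(len(squared_errors)))

# Hypothetical star ratings and model predictions, for illustration only.
actual_stars = [5, 3, 4, 1]
predicted_stars = [4, 3, 2, 2]

# Squared errors are 1, 0, 4, 1, so the result is sqrt(6 / 4) = sqrt(1.5).
toy_rmse = rmse(actual_stars, predicted_stars)
```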

In [5]:
# create a similarity graph
similar_items = model.get_similar_items()
similarity_graph = gl.SGraph().add_edges(similar_items, src_field='business_id', dst_field='similar')
similarity_graph = similarity_graph.add_edges(similar_items, dst_field='business_id', src_field='similar')

similarity_graph.summary()

{'num_edges': 230140, 'num_vertices': 11507}
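The two `add_edges` calls insert every similarity row twice, once in each direction, so the graph can be walked from either business of a pair. The summary is consistent with this: 230140 = 11507 × 10 × 2, which matches ten similar items per business, each added in both directions. The sketch below mirrors the idea in plain Python with an adjacency dictionary instead of an `SGraph`; the rows and single-letter IDs are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows shaped like the similar-items table: (item, similar item, rank).
# The IDs are made up; the real table holds Yelp business IDs.
similar_rows = [
    ("A", "B", 1), ("A", "C", 2),
    ("B", "C", 1), ("B", "A", 2),
    ("C", "D", 1), ("C", "A", 2),
    ("D", "C", 1), ("D", "B", 2),
]

def build_symmetric_graph(rows):
    """Return {vertex: [(neighbor, rank), ...]} with every row added in both directions."""
    graph = defaultdict(list)
    for item, similar, rank in rows:
        graph[item].append((similar, rank))   # item -> similar
        graph[similar].append((item, rank))   # similar -> item (the reversed copy)
    return dict(graph)

toy_graph = build_symmetric_graph(similar_rows)

# Each row contributes two directed edges, as in the summary above.
num_edges = sum(len(neighbors) for neighbors in toy_graph.values())
```

Note that a pair can end up with parallel edges carrying different ranks (here `B` reaches `A` with rank 1 and with rank 2), because similarity rankings are not symmetric.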

In [6]:
# pick 2 businesses and find a shortest path between them.
business = gl.SFrame.read_csv('business.csv.gz', verbose=False)
# A 4-star business with 100 reviews is a safer pick than a 5-star business with 1 review.
# Let's look up the two most-reviewed businesses among those rated 4 stars or higher.
top2 = business.filter_by([5, 4.5, 4], "stars").topk("review_count", k=2)
top2

business_id,categories,city,full_address,latitude,longitude
VVeogjZya58oiTxK7qUjAQ,"[Pizza, Restaurants]",Phoenix,"623 E Adams St\nPhoenix, AZ 85004 ...",33.4492,-112.065
JokKtdXU7zXHcr20Lrk29A,"[Bars, Food, Breweries, Pubs, Nightlife, Amer ...",Tempe,"1340 E 8th St\nSte 104\nTempe, AZ 85281 ...",33.4195,-111.916

name,open,review_count,stars,state,type
Pizzeria Bianco,1,803,4.0,AZ,business
Four Peaks Brewing Co,1,735,4.5,AZ,business


In [7]:
business_id1 = top2['business_id'][0]
business_id2 = top2['business_id'][1]

sp_tree = gl.graph_analytics.shortest_path.create(similarity_graph, business_id1)
path = sp_tree.get_path(business_id2)
path

[('VVeogjZya58oiTxK7qUjAQ', 0.0),
 ('-4A5xmN21zi_TXnUESauUQ', 1.0),
 ('FCJHirFzEtj4M1VcuaKieg', 2.0),
 ('JokKtdXU7zXHcr20Lrk29A', 3.0)]

In [8]:
business_id_to_name = {b_id: name for b_id, name in zip(business["business_id"], business["name"])}

In [9]:
for (b_id, step) in path:
    print "%d) %s (id=%s)" % (step, business_id_to_name[b_id], b_id)

0) Pizzeria Bianco (id=VVeogjZya58oiTxK7qUjAQ)
1) D'lish (id=-4A5xmN21zi_TXnUESauUQ)
2) Children's Museum Of Phoenix (id=FCJHirFzEtj4M1VcuaKieg)
3) Four Peaks Brewing Co (id=JokKtdXU7zXHcr20Lrk29A)
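Because no `weight_field` was passed, every edge counts as one step, so the number printed next to each business is simply the hop count from the source. A breadth-first search yields the same kind of fewest-hops path. The following is a minimal plain-Python sketch on a made-up graph; the function name `bfs_path` and the vertex names are hypothetical, and this is not how GraphLab is known to implement it.

```python
from collections import deque

def bfs_path(graph, source, target):
    """Return [(vertex, hops), ...] along a fewest-hops path, or [] if unreachable."""
    parent = {source: None}          # also serves as the visited set
    queue = deque([source])
    while queue:
        vertex = queue.popleft()
        if vertex == target:
            break
        for neighbor in graph.get(vertex, ()):
            if neighbor not in parent:
                parent[neighbor] = vertex
                queue.append(neighbor)
    if target not in parent:
        return []
    # Walk back from the target to the source, then reverse.
    path = []
    vertex = target
    while vertex is not None:
        path.append(vertex)
        vertex = parent[vertex]
    path.reverse()
    return [(v, float(hops)) for hops, v in enumerate(path)]

# Hypothetical undirected similarity graph (each edge listed in both directions).
toy = {
    "pizza": ["deli", "cafe"],
    "deli": ["pizza", "museum"],
    "cafe": ["pizza"],
    "museum": ["deli", "brewery"],
    "brewery": ["museum"],
}
toy_path = bfs_path(toy, "pizza", "brewery")
```

The result has the same shape as the `path` list above: pairs of a vertex and its distance from the source.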


In [10]:
weighted_sp_tree = gl.shortest_path.create(similarity_graph, business_id1, weight_field='rank', verbose=False)
weighted_path = weighted_sp_tree.get_path(business_id2)

for (b_id, step) in weighted_path:
    print "%d) %s (id=%s)" % (step, business_id_to_name[b_id], b_id) 

0) Pizzeria Bianco (id=VVeogjZya58oiTxK7qUjAQ)
3) Scottsdale Fashion Square (id=Hdi7jkB7pHiM1nyPHcqSdw)
4) Hob Nobs Cafe & Spirits (id=4sW8Z6NLXLRkruSKSKUEUw)
6) Bandera (id=7QSYBp2-AOdyUJXEaLnbgA)
7) Solo Caf Gourmet Coffee & Tea House (id=ABC57h7Dh1Vy9q3cztAm5A)
9) Grazie Pizzeria & Wine Bar (id=k6Si433-EJrY4J7SZxsnjA)
10) Pita Jungle (id=qwmHm3s8p7J12AIY6Co8HQ)
11) Olive & Ivy (id=53YGfwmbW73JhFiemNeyzQ)
12) Four Peaks Brewing Co (id=JokKtdXU7zXHcr20Lrk29A)
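With `weight_field='rank'`, the number printed next to each business is no longer a hop count but the cumulative sum of edge ranks from the source, which is why it jumps from 0 to 3 on the first step. The weighted path has more hops than the unweighted one, but each hop connects closely ranked neighbors. One standard way to compute such a path is Dijkstra's algorithm; the sketch below is a plain-Python illustration on a made-up graph, with hypothetical names, and makes no claim about GraphLab's internals.

```python
import heapq

def dijkstra_path(graph, source, target):
    """graph: {vertex: [(neighbor, weight), ...]} with non-negative weights.

    Returns [(vertex, cumulative_weight), ...] along a minimum-weight path,
    or [] if the target cannot be reached.
    """
    best = {source: 0.0}             # best known distance per vertex
    parent = {source: None}
    heap = [(0.0, source)]
    done = set()                     # vertices whose distance is final
    while heap:
        dist, vertex = heapq.heappop(heap)
        if vertex in done:
            continue                 # stale heap entry
        done.add(vertex)
        if vertex == target:
            break
        for neighbor, weight in graph.get(vertex, ()):
            candidate = dist + weight
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                parent[neighbor] = vertex
                heapq.heappush(heap, (candidate, neighbor))
    if target not in done:
        return []
    path = []
    vertex = target
    while vertex is not None:
        path.append((vertex, best[vertex]))
        vertex = parent[vertex]
    path.reverse()
    return path

# Hypothetical graph: the direct pizza-brewery edge has a poor rank (9),
# so the three-hop detour with total rank 4 wins.
toy_weighted = {
    "pizza": [("brewery", 9), ("cafe", 1)],
    "cafe": [("pizza", 1), ("bar", 2)],
    "bar": [("cafe", 2), ("brewery", 1)],
    "brewery": [("bar", 1), ("pizza", 9)],
}
toy_weighted_path = dijkstra_path(toy_weighted, "pizza", "brewery")
```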


In [11]:
# Same output as above, but build the name lookup only for the businesses on the path
# instead of for the whole business table.
filtered_business = business.filter_by([_[0] for _ in weighted_path], "business_id")
small_business_id_to_name = {b_id: name for b_id, name in zip(filtered_business["business_id"], filtered_business["name"])}
for (b_id, step) in weighted_path:
    print "%d) %s (id=%s)" % (step, small_business_id_to_name[b_id], b_id)

0) Pizzeria Bianco (id=VVeogjZya58oiTxK7qUjAQ)
3) Scottsdale Fashion Square (id=Hdi7jkB7pHiM1nyPHcqSdw)
4) Hob Nobs Cafe & Spirits (id=4sW8Z6NLXLRkruSKSKUEUw)
6) Bandera (id=7QSYBp2-AOdyUJXEaLnbgA)
7) Solo Caf Gourmet Coffee & Tea House (id=ABC57h7Dh1Vy9q3cztAm5A)
9) Grazie Pizzeria & Wine Bar (id=k6Si433-EJrY4J7SZxsnjA)
10) Pita Jungle (id=qwmHm3s8p7J12AIY6Co8HQ)
11) Olive & Ivy (id=53YGfwmbW73JhFiemNeyzQ)
12) Four Peaks Brewing Co (id=JokKtdXU7zXHcr20Lrk29A)


In [12]:
# wrap the path search into a function

def pathways(business1, business2, verbose=True):
    """Given two business IDs, return the rank-weighted shortest path between them."""
    weighted_sp_tree = gl.shortest_path.create(similarity_graph, business1, weight_field='rank', verbose=False)
    weighted_path = weighted_sp_tree.get_path(business2)
    if verbose:
        # Look up names only for the businesses on this path.
        filtered_business = business.filter_by([b_id for (b_id, _) in weighted_path], "business_id")
        small_business_id_to_name = {b_id: name for b_id, name in zip(filtered_business["business_id"], filtered_business["name"])}
        for (b_id, step) in weighted_path:
            print "%d) %s (id=%s)" % (step, small_business_id_to_name[b_id], b_id)
    return weighted_path

In [13]:
# Assign the result so the notebook does not also echo the returned list.
result = pathways(business_id1, business_id2)

0) Pizzeria Bianco (id=VVeogjZya58oiTxK7qUjAQ)
3) Scottsdale Fashion Square (id=Hdi7jkB7pHiM1nyPHcqSdw)
4) Hob Nobs Cafe & Spirits (id=4sW8Z6NLXLRkruSKSKUEUw)
6) Bandera (id=7QSYBp2-AOdyUJXEaLnbgA)
7) Solo Caf Gourmet Coffee & Tea House (id=ABC57h7Dh1Vy9q3cztAm5A)
9) Grazie Pizzeria & Wine Bar (id=k6Si433-EJrY4J7SZxsnjA)
10) Pita Jungle (id=qwmHm3s8p7J12AIY6Co8HQ)
11) Olive & Ivy (id=53YGfwmbW73JhFiemNeyzQ)
12) Four Peaks Brewing Co (id=JokKtdXU7zXHcr20Lrk29A)
