#  Tutorial: Building Recommendation engines with Graphlab

In this tutorial we will be building movie recommendation engines using GraphLab. GraphLab is a parallel framework for machine learning developed here at CMU. It has a set of libraries for data transformation, manipulation and visualization.

Some benefits of using GraphLab:
 - Can handle large datasets
 - GraphLab Canvas allows easy data exporation and visualization
 - Supports various data sources (JSON, CSV, DB, s3 etc.)
 - Easy capability to create features to enhance model performance 
 
In this tutorial, we will walk through the steps to create a movie recommendation engine based on amazon movie review data. We will be exploring multiple approaches and methods to generate recommendations using various features and functions in GraphLab. I have included screenshots of any important output in case you are unable to run any of the cells.

## Installing the Libraries

In order to use graphlab, you should first create a new anaconda environment with Python 2.7.x

 $ conda create -n py27 python=2.7 anaconda=4.0.0

Graphlab is free to use for 1 year for acedemic use. You can register for a free trial using the following link: https://turi.com/download/academic.html. You should recieve a product key. You can now install your licensed copy of GraphLab Create using pip in the anaconda environment you just created.

 $ pip install --upgrade --no-cache-dir https://get.graphlab.com/GraphLab-Create/2.1/registered_email_address/product_key/GraphLab-Create-License.tar.gz
    
Now that Graphlab has been installed, make sure the following command works for you:

In [1]:
import graphlab as gl
import os.path
import json
import csv
import ast
from exceptions import ValueError
from time import sleep
from lxml import html 
import requests

[INFO] graphlab.cython.cy_server: GraphLab Create v2.1 started. Logging: /tmp/graphlab_server_1522536551.log


This non-commercial license of GraphLab Create for academic use is assigned to kdodhia@andrew.cmu.edu and will expire on March 28, 2019.


## Loading the data 

Now that we've installed and loaded the libraries, let's load some movie review data to build our recommendation engine. We are going to load data in json format. Download the 5-core Movies and TV dataset from http://jmcauley.ucsd.edu/data/amazon/, which contains reviews and metadata from Amazon for Movies and TV shows. Then unzip the reviews_Movies_and_TV_5.json.gz file to create the reviews_Movies_and_TV_5.json file, which you should move to this directory (the smae folder as this notebook).

If you open the reviews_Movies_and_TV_5.json file, you will see that each line is in its own individual dictionary format, and is not connected to the rest of the lines. Thus a simple json.loads(fname) or Sframe.read_json(fname) will not work. Instead, we will need to loop through each line and process it individually. In order to make our lives much easier, before loading the data into an SFrame we will first load data from the json file into a csv file and save this. After the first run, we can directly use the csv file we previously created without doing any other processing. 

In [2]:
json_fname = 'reviews_Movies_and_TV_5.json'
csv_fname = 'reviews_Movies_and_TV_5.csv'
if (not os.path.isfile(csv_fname)):
    json = open(json_fname, "r")
    csv_file = open(csv_fname, 'w')
    csvwriter = csv.writer(csv_file)
    count = 0
    for row in json:
        data = ast.literal_eval(row)
        if (count == 0):
            csvwriter.writerow(data.keys())
        csvwriter.writerow(data.values())
        count+=1
        
    json.close()
    csv_file.close()
    
reviews = gl.SFrame.read_csv(csv_fname, column_type_hints=[str,str,str,str,str,int,str,int,str])

At this point, we should remove movies that have been rated fewer than 5 times as these are likely to behave unpredictably.

In [4]:
rare_movies = reviews.groupby('asin', gl.aggregate.COUNT) #group by asin value
rare_movies = rare_movies.sort('Count') #sort by count
rare_movies = rare_movies[rare_movies.apply(lambda x: x['Count'] <= 5)] 
reviews = reviews.filter_by(rare_movies['asin'], 'asin', exclude = True)

In [5]:
reviews.head(3)

reviewerID,asin,reviewerName,helpful,reviewText,overall
ADZPIG9QOCDG5,5019281,"Alice L. Larson ""alice- loves-books"" ...","[0, 0]",This is a charming version of the classic ...,4
A35947ZP82G7JH,5019281,Amarah Strack,"[0, 0]",It was good but not as emotionally moving as ...,3
A3UORV8A9D5L2E,5019281,Amazon Customer,"[0, 0]","Don't get me wrong, Winkler is a wonderful ...",3

summary,unixReviewTime,reviewTime
good version of a classic,1203984000,"02 26, 2008"
Good but not as moving,1388361600,"12 30, 2013"
Winkler's Performance was ok at best! ...,1388361600,"12 30, 2013"


This should yield something like the following:

<img src='reviews_head.png'>

GraphLab has this exteremely useful feature that allows you to visualize properties and relationships in the data. With just one SFrame.show() command, a GraphLab Canvas will open up in a new window. The Canvas contains three tabs - summary, table and plot.

In [6]:
reviews.show()

Canvas is updated and available in a tab in the default browser.


The summary tab gives a summary of the data, variables and columns. It looks like the following:

<img src='summary.png'>

The Table tab provides an interactive tabular view of the data inside SFrame. It looks like the following:
    
<img src='table.png'>

The Plot tab allows you to plot relationships between the data. It looks like the following:
    
<img src='plot.png'>

#### Side note: Scraping Movie/TV show titles from the Amazon website

When looking at the headers in the dataset, you will notice that the dataset does not contain a title field for each item but rather only has an asin value, which is the Amazom product ID for the item. In order to build a recommendation engine, it is useful to have an asin value to title matching. In order to get the titles from the asin field we will need to scape data from the Amazon website. You could use the Amazon Product Advertizement API, however, this requires setting up an account and registering as an Associate (which requires you to have a website). For those that are interested, the code that uses the Amazon Product Advertizement API is as follows:
            
            
            from amazon.api import AmazonAPI
            
            def get_product_title(asin):
                amazon = AmazonAPI(AMAZON_ACCESS_KEY, AMAZON_SECRET_KEY, AMAZON_ASSOC_TAG, region="US")

                try:
                    product_info = amazon.lookup(ItemId=asin)        
                except amazon.api.AsinNotFound:
                    print(asin)
                    return ""

                return product_info.title
            
 
Below is the code to scape data straight from the amazon website (be careful not to get blocked though!). We will store the data in a json file and will use this later on in the tutorial.

In [6]:
def get_product_title(asin):
    url = "http://www.amazon.com/dp/"+asin
    page = requests.get(url)
    try:
        doc = html.fromstring(page.content)
        xpath_title = '//h1[@id="title"]//text()'
        raw_name = doc.xpath(xpath_title)
        name = ' '.join(''.join(raw_name).split())
    except Exception:
        print Exception
        name = "ERROR"

    return name

In [None]:
title_asin_json = 'title_asin.json'
def add_title_asin(reviews):
    asin_col = reviews.select_column('asin')
    scapped_data  = {}
    if (os.path.isfile(title_asin_json)):
        scapped_data = json.load(open(title_asin_json))
    count = 1
    for key in asin_col:
        if ((key not in scapped_data) or ((key in scapped_data) and (scapped_data[key] == ""))):
            name = get_product_title(key)
            scapped_data[key] = name
            print(name)
            # sleep to avoid getting blocked from the amazon website
            sleep(20)
            if (count % 20 == 0):
                #dump the data into json file for future use
                f=open(title_asin_json,'w')
                json.dump(scapped_data, f)
            count += 1
    return scapped_data

            
scapped_data = add_title_asin(reviews)
#dump the data into json file for future use
f=open(title_asin_json,'w')
json.dump(scapped_data, f)

## Building the Models 
Now that we have loaded and pre-processed the dataset, we can build the model. We must specify the user ids column and item ids column. We could also pass in an optional target column, however, we will not do this in our case. All other columns are used by the underlying model as side features.

In [8]:
model = gl.recommender.create(reviews, user_id="reviewerID", item_id="asin")

In [9]:
similarities = model.get_similar_items()

Here similarities is just another SFrame, where you have the title of the item (in our case the asin value), and the similar item with a particular score.

In [10]:
similarities.head(5)

asin,similar,score,rank
5019281,780623746,0.0597015023232,1
5019281,6302993687,0.0409356951714,2
5019281,6303824358,0.0315789580345,3
5019281,6301175239,0.021505355835,4
5019281,6300251004,0.0204081535339,5


This should yield the following:

<img src='similarities.png'>

Now we can generate a graph for the model. Each row of the table will be one graph edge between the asin value and the similar item. Esentially, the SGraph is a connection between items, where related items have an edge. SGraphs are generally extremely useful in GraphLab, as it contains an abundance of operations that you can use to do really interesting analytics. The SGraph is generated as follows:

In [11]:
graph = gl.SGraph().add_edges(similarities['asin','similar','score'], src_field = "asin", dst_field = "similar")
graph.summary()

{'num_edges': 442576, 'num_vertices': 44266}

Now that we have constructed the graph data structure, we can proceed to generate recommendations. There are multiple approaches we could take here. I will go through two possible methods:

   1. We can construct a ShortestPath struct from the SGraph, and then find a path between two items. You start with one item, and build a spanning tree to all other items using the shortest_path.create method. Then we can query the tree using get_path() to find the shortest path, and this returns a list of items along this shortest path. These items are the recommendations generated from the two inputs. 

   2. Since we know that the SGraph is a connection between items where related items have an edge, we can simply find the neighbors of a given item to find similar items to it. This is a simplistic approach but still yeilds good results.

#### First SGraph approach to generate recommendations:

In [12]:
def get_recommendations_sp(graph, asin_1, asin_2):
    title_asin_json = 'title_asin.json'
    sp = gl.shortest_path.create(graph, asin_1)
    recommendations = sp.get_path(asin_2)
    print(recommendations)
    scapped_data  = {}
    res = []
    if (os.path.isfile(title_asin_json)):
        scapped_data = json.load(open(title_asin_json))
    # extract titles from asin values
    for recommendation, idx in recommendations:
        # skip the first and last element, as they are the inputted items
        if not (idx == 0 or idx == (len(recommendations)-1)):
            if ((recommendation not in scapped_data) or \
                ((recommendation in scapped_data) and (scapped_data[recommendation] == ""))):
                for i in range(5):
                    name = get_product_title(recommendation)
                    if (not name == ""):
                        break;
                    sleep(20)

                scapped_data[recommendation] = name
            else:
                name = scapped_data[recommendation]
            res.append(name)
    f=open(title_asin_json,'w')
    json.dump(scapped_data, f)
    return res
'''
B000HCO87C - The Omen
B000FAOC2M - The Hills Have Eyes
'''
print(get_recommendations_sp(graph, 'B000HCO87C', 'B000FAOC2M'))
'''
6300251004 - Scrooge VHS
0005019281 - An American Christmas Carol VHS
'''
print(get_recommendations_sp(graph, '6300251004', '0005019281'))

[('B000HCO87C', 0.0), ('0780656946', 1.0), ('B000FAOC2M', 2.0)]
[u'The Texas Chainsaw Massacre: The Beginning']


[('6300251004', 0.0), ('0780623746', 1.0), ('0005019281', 2.0)]
[u'A Christmas Carol']


When you run the above code, you get the following recommendation for the first example:

    [u'The Texas Chainsaw Massacre: The Beginning']
    
And the following recommendation for the second example:

    ['A Christmas Carol']
    
As you can notice, these recommendations are really good. It is important to note that when we use this shortest path method, it is not necessary that we will get only one recommendation. For instance, if the inputted items are more disimilar, we will likely get more recommendations as the shortest path may contain more vertices.

#### Second SGraph approach to generate recommendations:

In [14]:
def get_recommendations_graph(graph, asin_1):
    title_asin_json = 'title_asin.json'
    subgraph = graph.get_neighborhood(ids=[asin_1], radius=1)
    recommendations = subgraph.get_vertices()
    print(recommendations)
    scapped_data  = {}
    res = []
    if (os.path.isfile(title_asin_json)):
        scapped_data = json.load(open(title_asin_json))
    idx = 0
    # extract titles from asin values
    for row in recommendations:
        recommendation = row['__id']
        # skip the first item as it is the inputted item
        if not (idx == 0):
            if ((recommendation not in scapped_data) or \
                ((recommendation in scapped_data) and (scapped_data[recommendation] == ""))):
                for i in range(5):
                    name = get_product_title(recommendation)
                    if (not name == ""):
                        break;
                    sleep(20)

                scapped_data[recommendation] = name
            else:
                name = scapped_data[recommendation]
            res.append(name)
        idx+=1
    f=open(title_asin_json,'w')
    json.dump(scapped_data, f)
    return res
    
'''
B000HCO87C - The Omen
B000FAOC2M - The Hills Have Eyes
6300251004 - Scrooge VHS
0005019281 - An American Christmas Carol VHS
'''
print(get_recommendations_graph(graph, 'B000HCO87C'))

print(get_recommendations_graph(graph, '6300251004'))

+------------+
|    __id    |
+------------+
| B000HCO87C |
| 6302814391 |
| 6302643627 |
| 6300247333 |
| 0780656946 |
| B000CCBC9E |
| 6300247147 |
| 6300247104 |
| B000AYEL4W |
| B000AA4JKW |
+------------+
[16 rows x 1 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
[u'Exorcist II - The Heretic VHS', u'Jesus of Nazareth VHS', u'Omen 3: The Final Conflict VHS', u'The Texas Chainsaw Massacre: The Beginning', u'The Fog', u'Omen 2 VHS', u'The Omen VHS', u'Dominion - Prequel to the Exorcist', u'The Amityville Horror', u'Freedomland', u'The Black Dahlia', u'Hulk', u'When a Stranger Calls', u'Two for the Money', u"Mary Shelley's Frankenstein"]
+------------+
|    __id    |
+------------+
| 0792839129 |
| 6300251004 |
| 0005019281 |
| 6302794331 |
| 6300215695 |
| 0782002064 |
| 6305609756 |
| 6303824358 |
| 6301967275 |
| 6301442962 |
+------------+
[13 rows x 1 columns]
Note: Only the head of the SFr

When you run the above code, you get the following recommendations for the first example:

        ['Exorcist II - The Heretic VHS', 'Jesus of Nazareth VHS', 'Omen 3: The Final Conflict VHS', 
        u'The Texas    Chainsaw Massacre: The Beginning', 'The Fog', 'Omen 2 VHS', 'The Omen VHS', 
        'Dominion - Prequelto the Exorcist', 'The Amityville Horror', 'Freedomland', 'The Black Dahlia', 
        'Two for the Money',u"Mary Shelley's Frankenstein"]
 
And the following recommendation for the second example:
        
        [u'Scrooge VHS', u'An American Christmas Carol VHS', 'The Muppet Christmas Carol VHS', 
        'White Christmas VHS',"It's a Wonderful Life Original Uncut Version VHS", u'Scrooged',
        u'A Christmas Carol VHS', 'Christmas Carol VHS', u'Miracle on 34th Street VHS',
        u'A Christmas Carol', u'Surviving Christmas', u'A Christmas Carol Colorized VHS']

Again these are very strong results, and there is the added benefit that this approach usually returns many more recommendations than the first approach. We could also easily modify the code to return the k-best recommendations.

#### Side note: You can also simply visualize the recommender model without using SGraphs.

Once we create the recommender model, we do not necessarily need to create a SGraph struct if we are simply trying to explore and visualize the recommendation model. GraphLab allows you interactively evaluate and explore recommendations model using the views.overview() command. However, here you can only manually search for recommendations, and wouldn't make sense to use if the recommendation engine is part of a larger program you are building.

In [15]:
view = model.views.overview(validation_set=reviews)
view.show()

This launches an interactive web-based view for exploring the model. This looks like the following:

<img src='explore.png'>

We can click on the first link (B003EYVXV4) in the popular items. This will take you to the following page:

<img src='focus.png'> 

As you can see, there is a ranked drop down list of similar items. To make this easier to understand, I will list  the titles of the focus item and top 9 similar items below:

Focus item: The Hunger Games

Top Similar Items:

        1) The avengers
        2) Marvel's: The Avengers
        3) Prometheus
        4) Snow White and the Huntsman
        5) The Hobbit: An Unexpected Journey
        6) Mission: Impossible Ghost Protocol
        7) The Amazing Spider-Man
        8) Brave
        9) Captain America: The First Avenger

This approach allows you to interactively and easily browse through the recommendations, and also gives specific recommendations to particular users based on their previous preferences:

<img src='users.png'> 

## Summary and References

As you can see, GraphLab makes it extremely easy and quick to try out different models and approaches. It does most of the heavy duty work behind the scenes, allowing the user to focus on the high level details.  

In addition to recommendation engines, GraphLab is extremely useful for data visualization, generating machine learning models, deep learning, feature engineering, working with large datasets and so much more! The following links will be super useful if you want to learn more about graphlab:

    1. GraphLab documentation: https://turi.com/products/create/docs/index.html
    2. Sample code and more tutorials on GraphLab: https://github.com/turi-code
    
References:
    1. https://www.analyticsvidhya.com/blog/2015/12/started-graphlab-python/
    2. https://turi.com/products/create/docs/index.html
    3. https://github.com/turi-code