# GraphLab Create Training Day

## Installation & preparation

1. Sign up for a [free academic license here](https://dato.com/download/academic.html). Collect your GraphLab Create license key.

2. [Download the Dato Launcher from here](https://dato.com/download/install-graphlab-create.html) and follow the installation instructions. We have installers for Windows, Linux and Mac (64 bit). 

3. Please download the following input files and save them locally:
<ul>
<li>`business.csv.gz`: https://s3.amazonaws.com/dato-datasets/dato-training/business.csv.gz (640.2 KB)
<li>`review.csv.gz`: https://s3.amazonaws.com/dato-datasets/dato-training/review.csv.gz (80.4 MB)
<li>`user.csv.gz`: https://s3.amazonaws.com/dato-datasets/dato-training/user.csv.gz (1.1 MB)
</ul>

Unzipping these files is optional: GraphLab can read gzipped files directly. Alternatively, run the cell below to download any files that are missing:

In [None]:
import os
from urllib import urlretrieve
url_template = 'https://s3.amazonaws.com/dato-datasets/dato-training/%s'
targets = ["business.csv.gz", "review.csv.gz", "user.csv.gz"]

for target in targets:
    if not os.path.exists(target):
        urlretrieve(url_template % target, target)

## Testing your setup

If you installed via Dato Launcher, open `ipython notebook`.

Otherwise if you installed using pip, run the command `ipython notebook` in a terminal window.

Create a new notebook (pressing "New notebook").

In the command box, write `import graphlab`, click the "run" button, and make sure it runs without error.

Run the following commands and verify that the output matches the picture below.

In [None]:
import graphlab
graphlab.canvas.set_target('ipynb')
graphlab.SArray(["1","2","3","4"]).show()

#### Expected Output:
<img src="images/graphlab.SArray1234.show.png" width="50%">


### Technical support
For any questions, please email <a href="mailto:guy@dato.com">guy@dato.com</a>, <a href="mailto:alon@dato.com">alon@dato.com</a> or <a href="mailto:bickson@dato.com">bickson@dato.com</a>.


### One last note...
Isn't `graphlab` a bit too long to type? In many notebooks you will see this shorthand alias:

In [None]:
import graphlab as gl

## Getting Started with GraphLab Create: 5-Line Recommender

Let's [build a recommender system with 5 lines of Python code](https://dato.com/learn/gallery/notebooks/five_line_recommender.html).

### Loading a dataset
Load the reviews dataset (a table stored in a gzipped CSV file) into a GraphLab SFrame. Since we're only demonstrating how to get started, we'll load just 100 rows:

In [None]:
review = graphlab.SFrame.read_csv('review.csv.gz', nrows=100)

### Visualizing a dataset
Get some statistics using the GraphLab Canvas:

In [None]:
gl.canvas.set_target('browser')
review.show()

Or look directly at the beginning of the data using `review.head()`:

In [None]:
review.head(3)

In [None]:
model = graphlab.recommender.create(review, user_id="user_id", item_id="business_id", target="stars")

In [None]:
# You can now make recommendations for all the users you've just trained on
results = model.recommend(users=None, k=5)
results.head()

In [None]:
# Save the model for later use
model.save("model_100rows")

## Exercise A - Data Engineering

We'll do the data engineering exercise over the users data. Let's load it into an SFrame:

In [None]:
user = graphlab.SFrame.read_csv('user.csv.gz')

### Accessing a column
Use `user['name']` to access the column `name`.

In [None]:
user['name']

### Computing statistics of a column
Use `user['average_stars'].mean()` to compute the mean of the `average_stars` column.

In [None]:
user['average_stars'].mean()

## Before you continue: a primer on operator overloading in Python
**Intermediate-level Python programmers can skip this section**.

You are about to discover the SFrame, the basic data structure used by GraphLab Create. If you are also new to Python, some of the syntax you are about to see may look weird to you. Don't worry! It is just a result of a really cool feature in Python called **operator overloading**. While Python is not the only language that has this feature, it certainly made programming operator overloading super-easy, so we'll be using it a lot to write shorter, more beautiful code.

What is operator overloading? An operator is just what you know from math class:
```
a + b ==> the + operator applied to a and b ==> +(a, b)
```

Since Python is an object-oriented language, when it sees `a + b`, what happens behind the scenes is a method call on `a`, and any class can define (overload) that method for its own objects:
```
a.__add__(b) # __add__ stands for +
```

So whenever you see some unfamiliar syntax, think: perhaps it's only operator overloading, and I should look for the documentation of that operator?

Here's a quick example:

In [None]:
class Pacman():
    """An object with some customized operators in it."""
    
    def __init__(self):
        """Constructor"""
        self.stomach = []
    
    def eat(self, food):
        """Method"""
        self.stomach.append(food)
        return self
    
    def __add__(self, food):
        """+ operator"""
        return self.eat(food)
    
    def __str__(self):
        """print statement operator"""
        return "(%s<)" % ("".join(map(str, self.stomach)))
    
    def __repr__(self):
        """REPL printout operator"""
        return "Pacman(%s)" % (str(self.stomach))

In [None]:
pac = Pacman()
print pac

In [None]:
pac.eat(1)
pac.eat('*')
print pac

In [None]:
pac = pac + 3
print pac

In [None]:
pac

Even what's printed on the screen may get manipulated by objects... *You have been warned!*

And now...

## Back to the Exercise

### Accessing a row
Print out the first 3 rows:

In [None]:
user.head(3) # or user[:3]

Print out the last 3 rows:

In [None]:
user.tail(3) # or user[-3:]

### Accessing a row column combination
Print the `name` and `average_stars` columns of the first 10 rows:

In [None]:
user[0:10][['name','average_stars']]

### Logical operators
Test each user for whether they are named Jim:

In [None]:
selection = user['name'] == 'Jim'

Select all data rows for users named Jim:

In [None]:
user[selection]

Shorthand notation (AKA one-liner):

In [None]:
user[user['name'] == 'Jim']

### How many users named Jim have more than 2 `average_stars`:

In [None]:
user[(user['name'] == 'Jim') & (user['average_stars'] > 2)]
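The filtering syntax above is operator overloading at work. Here is a minimal plain-Python sketch of the idea, with no GraphLab required. The `Column` class is a hypothetical stand-in for a column object and is not how GraphLab implements it; it only shows how `==`, `>`, `&` and `[...]` can be made to work element-wise.

```python
# Hypothetical stand-in for a column; NOT GraphLab's implementation.
class Column(object):
    def __init__(self, values):
        self.values = list(values)

    def __eq__(self, other):
        """col == x: element-wise comparison, returns a 0/1 mask column"""
        return Column([int(v == other) for v in self.values])

    def __gt__(self, other):
        """col > x: element-wise comparison, returns a 0/1 mask column"""
        return Column([int(v > other) for v in self.values])

    def __and__(self, other):
        """mask1 & mask2: element-wise logical AND of two masks"""
        return Column([a & b for a, b in zip(self.values, other.values)])

    def __getitem__(self, mask):
        """col[mask]: keep only the values where the mask is 1"""
        return Column([v for v, keep in zip(self.values, mask.values) if keep])


names = Column(["Jim", "Ann", "Jim", "Bob"])
stars = Column([4.5, 3.0, 1.5, 5.0])

mask = (names == "Jim") & (stars > 2)
print(mask.values)         # [1, 0, 0, 0]
print(stars[mask].values)  # [4.5]
```

The parentheses around each comparison matter: in Python, `&` binds more tightly than `==` and `>`, so without them the expression would be grouped differently.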

### SQL primitives

Count users with the same name:

In [None]:
user.groupby(['name'], graphlab.aggregate.COUNT())

Count users that share both the same name and the same `votes_funny` value:

In [None]:
user.groupby(['name', 'votes_funny'], graphlab.aggregate.COUNT())

Calculate the average number of reviews per name:

In [None]:
user.groupby(['name'], graphlab.aggregate.AVG('review_count'))
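If you are unsure what these aggregations compute, the following plain-Python sketch reproduces COUNT and AVG per group by hand. The `rows` list is made-up toy data standing in for the `user` SFrame, and the loop only illustrates the semantics, not how GraphLab executes a `groupby`.

```python
from collections import defaultdict

# Hypothetical toy rows standing in for the user SFrame.
rows = [
    {"name": "Jim", "review_count": 10},
    {"name": "Ann", "review_count": 4},
    {"name": "Jim", "review_count": 20},
]

counts = defaultdict(int)    # name -> number of rows (COUNT)
totals = defaultdict(float)  # name -> sum of review_count

for row in rows:
    counts[row["name"]] += 1
    totals[row["name"]] += row["review_count"]

# AVG is the per-group sum divided by the per-group count.
averages = dict((name, totals[name] / counts[name]) for name in counts)

print(sorted(counts.items()))    # [('Ann', 1), ('Jim', 2)]
print(sorted(averages.items()))  # [('Ann', 4.0), ('Jim', 15.0)]
```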

### Transformations
Add 12 to every user's `review_count`. This returns a new column; the SFrame itself is not modified, and the result is not stored:

In [None]:
user['review_count'].apply(lambda x: x + 12)

You can also use Python libraries and your own user-defined functions with the `apply()` method.

In [None]:
from string import capitalize

def cap_string(s):
    """First letter gets uppercased, the rest get lowercased."""
    return capitalize(s)

user['user_id'].apply(cap_string)[:3]

### Generate a new column which is the sum of votes_cool and votes_funny

In [None]:
user['funny_cool'] = user['votes_cool'] + user['votes_funny']

Assigning the result to a new column name modifies the SFrame.

### Load business and review data:

In [None]:
business = graphlab.SFrame.read_csv('business.csv.gz')
review = graphlab.SFrame.read_csv('review.csv.gz')

### Visualize the business data:

In [None]:
business.show()

### Show the reviews with the highest stars rating:

In [None]:
review.topk("stars")

### Remove all the fields in the business table that have only a single unique value:

In [None]:
business.remove_columns(["type"])

### Find all reviews of all users named Jim:

Review writers are identified by their `user_id` in the `review` SFrame, but their names are in a different SFrame (`user`). We'll join the two SFrames in order to find Jim's reviews (and demonstrate `join()`, of course).

In [None]:
joined = review.join(user, on='user_id')
joined[joined['name'] == 'Jim']
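To see what the join does, here is a plain-Python sketch of an inner join on `user_id`, which is the default kind of join for `SFrame.join()`. The two lists are made-up toy data, and the dictionary lookup is just one simple way to express the idea; it assumes `user_id` is unique in the users table.

```python
# Hypothetical toy tables standing in for the review and user SFrames.
reviews = [
    {"user_id": "u1", "stars": 5},
    {"user_id": "u2", "stars": 2},
    {"user_id": "u1", "stars": 3},
    {"user_id": "u9", "stars": 4},  # no matching user: dropped by an inner join
]
users = [
    {"user_id": "u1", "name": "Jim"},
    {"user_id": "u2", "name": "Ann"},
]

# Index the users table by the join key (assumes user_id is unique there).
user_by_id = dict((u["user_id"], u) for u in users)

joined_rows = []
for r in reviews:
    u = user_by_id.get(r["user_id"])
    if u is not None:
        row = dict(r)   # copy the review columns
        row.update(u)   # add the user columns
        joined_rows.append(row)

jim_stars = [row["stars"] for row in joined_rows if row["name"] == "Jim"]
print(jim_stars)  # [5, 3]
```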

## Exercise B: Graph Analytics

### Construct a graph out of the first 500 rows, with an edge from each year to each name:

In [None]:
joined['year'] = joined['year'].astype(str) # vertices should have the same data type
graph = graphlab.SGraph().add_edges(joined[:500], src_field='year', dst_field='name')

### Visualize the graph:

In [None]:
highlight_color = [0.69, 0.0, 0.498]
highlight = {year: highlight_color for year in joined['year'].unique()}
graph.show(vlabel_hover=True, vlabel='__id', highlight=highlight)

### Calculate PageRank:

In [None]:
pr = graphlab.pagerank.create(graph) 
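For intuition about what PageRank measures, here is a small plain-Python sketch of the classic iterative algorithm. It is not GraphLab's implementation: the `pagerank` function, the `damping` value and the toy edge list are assumptions made for illustration, and GraphLab's defaults and score scaling may differ.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Textbook PageRank by repeated redistribution of rank along edges."""
    nodes = sorted(set(n for edge in edges for n in edge))
    out_links = dict((n, []) for n in nodes)
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = dict((node, 1.0 / n) for node in nodes)
    for _ in range(iterations):
        # Rank held by nodes with no outgoing edges is spread evenly.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = dict((node, base) for node in nodes)
        # Every node passes its rank in equal shares along its outgoing edges.
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


# Hypothetical toy graph with edges from year to name, as in the exercise.
edges = [("2014", "Jim"), ("2014", "Ann"), ("2015", "Jim")]
ranks = pagerank(edges)
for node in sorted(ranks, key=ranks.get, reverse=True):
    print("%s %.3f" % (node, ranks[node]))
```

In this toy graph Jim receives edges from both years, so he ends up with the highest score, while the year vertices, which have no incoming edges, end up with the lowest.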

### Find the 10 most important nodes in the graph:

In [None]:
print pr['pagerank'].topk('pagerank')

### Add the pagerank information into the graph vertices:

In [None]:
graph = graph.add_vertices(pr['pagerank'])

### Show the pagerank values on the graph:

In [None]:
graph.show(vlabel_hover=True, vlabel='pagerank')

### Define a new label for the visualization

In [None]:
def give_label(row):
    label = "%s (%f)" % (row["__id"], row["pagerank"])
    return label
    
graph.vertices['label'] = graph.vertices.apply(give_label)
graph.vertices

In [None]:
graph.show(vlabel_hover=True, vlabel='label', highlight=highlight)

## Exercise C: Recommenders (User modeling)

### Split the data into 90% train and 10% test sets:

In [None]:
train, test = review.random_split(0.9)
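A random split assigns each row to one side independently, so the 90/10 proportions are approximate rather than exact. The sketch below shows the idea in plain Python; the helper name `random_split_rows` is hypothetical and this is not GraphLab's code. Fixing a seed makes the split repeatable, which helps when comparing models.

```python
import random


def random_split_rows(rows, fraction, seed=None):
    """Send each row to the first list with probability `fraction`."""
    rng = random.Random(seed)
    first, second = [], []
    for row in rows:
        if rng.random() < fraction:
            first.append(row)
        else:
            second.append(row)
    return first, second


rows = list(range(1000))  # hypothetical stand-in for the review rows
train_rows, test_rows = random_split_rows(rows, 0.9, seed=42)
print(len(train_rows) + len(test_rows))  # 1000
```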

### Build a matrix factorization recommender to predict star rating:

In [None]:
model = graphlab.factorization_recommender.create(train, target='stars',
                                                  item_id='business_id',
                                                  side_data_factorization=False)
print 'Training RMSE', model.get('training_rmse')
print 'Test RMSE', graphlab.evaluation.rmse(test['stars'], model.predict(test))
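To get a feel for what a matrix factorization model learns, here is a deliberately tiny plain-Python sketch. It assumes the simplest form of the model (predicted rating = global mean + dot product of a user vector and an item vector) and trains it with stochastic gradient descent. The function names, hyperparameters and toy ratings are all made up for illustration; GraphLab's solver and model terms are more elaborate.

```python
import math
import random


def predict(mean, P, Q, user, item):
    """Predicted rating: global mean plus dot(user vector, item vector)."""
    return mean + sum(a * b for a, b in zip(P[user], Q[item]))


def rmse_of(ratings, mean, P, Q):
    errors = [(r - predict(mean, P, Q, u, i)) ** 2 for u, i, r in ratings]
    return math.sqrt(sum(errors) / float(len(errors)))


def train_mf(ratings, num_factors=2, epochs=200, lr=0.05, reg=0.01, seed=0):
    rng = random.Random(seed)
    mean = sum(r for _, _, r in ratings) / float(len(ratings))
    users = sorted(set(u for u, _, _ in ratings))
    items = sorted(set(i for _, i, _ in ratings))
    # Small random starting vectors; all-zero vectors would never move.
    P = dict((u, [rng.uniform(-0.1, 0.1) for _ in range(num_factors)]) for u in users)
    Q = dict((i, [rng.uniform(-0.1, 0.1) for _ in range(num_factors)]) for i in items)
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - predict(mean, P, Q, u, i)
            for f in range(num_factors):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step on squared error with L2 regularization.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return mean, P, Q


# Hypothetical toy ratings: (user_id, business_id, stars).
ratings = [("u1", "b1", 5), ("u1", "b2", 1), ("u2", "b1", 4),
           ("u2", "b2", 2), ("u3", "b1", 1), ("u3", "b2", 5)]

mean, P, Q = train_mf(ratings)
zero_P = dict((u, [0.0, 0.0]) for u in P)
zero_Q = dict((i, [0.0, 0.0]) for i in Q)
baseline = rmse_of(ratings, mean, zero_P, zero_Q)  # always predict the mean
trained = rmse_of(ratings, mean, P, Q)
print("RMSE predicting the mean: %f" % baseline)
print("RMSE after training: %f" % trained)
```

Note that both numbers here are training errors on six ratings; as in the exercise, a held-out test set is what tells you whether a model generalizes.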

### Now we will improve the model using user and business side features:

In [None]:
model_with_side_data_factorization = graphlab.factorization_recommender.create(train,
                                                                               target='stars',
                                                                               item_id='business_id',
                                                                               side_data_factorization=True,
                                                                               user_data=user, 
                                                                               item_data=business)

print 'Training RMSE', model_with_side_data_factorization.get('training_rmse')
print 'Test RMSE', graphlab.evaluation.rmse(test['stars'], model_with_side_data_factorization.predict(test))

### Is there an improvement?

In [None]:
print "Test RMSE without side features: %f" % graphlab.evaluation.rmse(test['stars'], model.predict(test))
print "Test RMSE with side features: %f" % graphlab.evaluation.rmse(test['stars'], model_with_side_data_factorization.predict(test))

In [None]:
print "Max error without side features: %f" % graphlab.evaluation.max_error(test['stars'], model.predict(test))
print "Max error with side features: %f" % graphlab.evaluation.max_error(test['stars'], model_with_side_data_factorization.predict(test))
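For reference, the two metrics compared above are simple to compute by hand. This plain-Python sketch follows the standard definitions (root of the mean squared error, and the largest absolute error); the function names mirror the GraphLab ones only for readability, and the sample numbers are made up.

```python
import math


def rmse(targets, predictions):
    """Root mean squared error: penalizes large misses more than small ones."""
    squared = [(t - p) ** 2 for t, p in zip(targets, predictions)]
    return math.sqrt(sum(squared) / float(len(squared)))


def max_error(targets, predictions):
    """Largest absolute difference between a target and its prediction."""
    return max(abs(t - p) for t, p in zip(targets, predictions))


actual = [5, 3, 4, 1]     # hypothetical true star ratings
predicted = [4, 3, 2, 1]  # hypothetical model predictions

print(rmse(actual, predicted))       # sqrt((1 + 0 + 4 + 0) / 4) = sqrt(1.25)
print(max_error(actual, predicted))  # 2
```

RMSE summarizes typical error across the whole test set, while max error reports only the single worst prediction, so the two can move in different directions when a model changes.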