# GraphLab Create Training Day

## Installation & preparation

1. Sign up for a [free academic license here](https://dato.com/download/academic.html). Collect your GraphLab Create license key.

2. [Download the Dato Launcher from here](https://dato.com/download/install-graphlab-create.html) and follow the installation instructions. Installers are available for Windows, Linux, and Mac (64-bit).

3. Please download the following input files and save them locally:
<ul>
<li><code>business.csv.gz</code>: https://s3.amazonaws.com/dato-datasets/dato-training/business.csv.gz (640.2 KB)</li>
<li><code>review.csv.gz</code>: https://s3.amazonaws.com/dato-datasets/dato-training/review.csv.gz (80.4 MB)</li>
<li><code>user.csv.gz</code>: https://s3.amazonaws.com/dato-datasets/dato-training/user.csv.gz (1.1 MB)</li>
</ul>

You do not need to unzip these files; GraphLab can read gzipped files directly. The cell below downloads any of the three files that are not already in the working directory.

In [None]:
import os
from urllib import urlretrieve
url_template = 'https://s3.amazonaws.com/dato-datasets/dato-training/%s'
targets = ["business.csv.gz", "review.csv.gz", "user.csv.gz"]

for target in targets:
    if not os.path.isfile(target):  # skip files that were already downloaded
        urlretrieve(url_template % target, target)
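
To check that a download is intact before loading it, it can help to peek at the first few lines of the gzipped file. The following is a minimal sketch that uses only the Python standard library; the helper name `peek_gzipped_lines` and the tiny demo file are made up for illustration and are not part of GraphLab or the training data.

```python
import gzip
import os
import tempfile


def peek_gzipped_lines(path, n_lines=3):
    """Return up to n_lines decoded text lines from the start of a gzipped file."""
    lines = []
    with gzip.open(path, "rb") as raw:
        for _ in range(n_lines):
            line = raw.readline()
            if not line:  # end of file reached
                break
            lines.append(line.decode("utf-8").rstrip("\r\n"))
    return lines


# Self-contained demo on a tiny made-up file (not the real training data).
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv.gz")
with gzip.open(demo_path, "wb") as out:
    out.write(b"user_id,name\nu1,Jim\nu2,Ann\n")

demo_lines = peek_gzipped_lines(demo_path, n_lines=2)
print(demo_lines)
```

Calling `peek_gzipped_lines('user.csv.gz')` after the download should show the CSV header followed by the first data rows.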

## Testing your setup

If you installed via Dato Launcher, open `ipython notebook`.

Otherwise, if you installed using pip, run the command `ipython notebook` in a terminal window.

Create a new notebook (pressing "New notebook").

In the command box write: `import graphlab` and then click the "run" button and make sure it runs without error.

Run the following commands and verify the output is as seen in the picture:

```
import graphlab
graphlab.canvas.set_target('ipynb')
graphlab.SArray(["1","2","3","4"]).show()
```

<img src="images/graphlab.SArray1234.show.png" width="50%"></img>

Run the following command, and make sure you get the output:

```
>>> graphlab.SArray(["1","2","3","4"]).apply(lambda x: x + "_")
dtype: str
Rows: 4
['1_', '2_', '3_', '4_']
```

### Technical support
For any questions, please email <a href="mailto:guy@dato.com">guy@dato.com</a>, <a href="mailto:alon@dato.com">alon@dato.com</a> or <a href="mailto:bickson@dato.com">bickson@dato.com</a>.

### Testing your setup - Results

In [None]:
import graphlab
graphlab.canvas.set_target('ipynb')
graphlab.SArray(["1","2","3","4"]).show()

In [None]:
graphlab.SArray(["1","2","3","4"]).apply(lambda x: x + "_")

## Exercise A - Data Engineering

### Loading a dataset
Load users dataset:

```
user = graphlab.SFrame.read_csv('user.csv.gz')
```

### Visualizing a dataset
Use `user.show()` to get some statistics.

Use `user.head(3)` to view the first 3 rows.


### Accessing a column
Use `user['name']` to access the column `name`.

### Computing statistics of a column
Use `user['average_stars'].mean()` to compute the mean of the `average_stars` column.

### Accessing a row
Print out the first 3 rows:

```
user.head(3) # or user[:3]
```

Print out the last 3 rows:
```
user.tail(3) # or user[-3:]
```

### Accessing a row column combination
Print the name and average_stars column of the first 10 rows:

```
user[0:10][['name','average_stars']]
```

### Logical operators
Test, row by row, whether the user is named Jim:

```
selection = user['name'] == 'Jim'
```

Select all data rows for users named Jim:

```
user[selection]
```

Shorthand notation:

```
user[user['name'] == 'Jim']
```

### How many users named Jim have more than 2 average_stars:

```
user[(user['name'] == 'Jim') & (user['average_stars'] > 2)]
```
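
To make the masking logic concrete, here is a minimal plain-Python sketch of what the comparison, the `&`, and the indexing compute. It does not use GraphLab; the three rows are made-up stand-ins for the `user` table.

```python
# Made-up rows standing in for the user table.
rows = [
    {"name": "Jim", "average_stars": 3.5},
    {"name": "Ann", "average_stars": 4.0},
    {"name": "Jim", "average_stars": 1.5},
]

# Each comparison yields one boolean per row, like user['name'] == 'Jim'.
is_jim = [row["name"] == "Jim" for row in rows]
above_two = [row["average_stars"] > 2 for row in rows]

# `&` between two masks is an element-wise AND.
mask = [a and b for a, b in zip(is_jim, above_two)]

# Indexing with a mask keeps only the rows whose mask entry is True.
selected = [row for row, keep in zip(rows, mask) if keep]

# The answer to "how many" is the number of selected rows.
print(len(selected))
```

On an SFrame, the analogous count is the number of rows of the filtered result, for example via `len(...)` or `.num_rows()`.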

### SQL primitives
Count users with the same name:

```
user.groupby(['name'], graphlab.aggregate.COUNT())
```

Count users with the same name and number of votes_funny:

```
user.groupby(['name', 'votes_funny'], graphlab.aggregate.COUNT())
```

Calculate the average number of reviews per name:

```
user.groupby(['name'], graphlab.aggregate.AVG('review_count'))
```
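
As a rough illustration of what the `COUNT` and `AVG` aggregations compute per group, the following plain-Python sketch groups made-up rows by `name`. It is not GraphLab code, and the data is invented for the example.

```python
from collections import defaultdict

# Made-up rows standing in for the user table.
rows = [
    {"name": "Jim", "review_count": 10},
    {"name": "Ann", "review_count": 4},
    {"name": "Jim", "review_count": 20},
]

counts = defaultdict(int)    # rows per name, like COUNT()
totals = defaultdict(float)  # running sum of review_count per name

for row in rows:
    counts[row["name"]] += 1
    totals[row["name"]] += row["review_count"]

# The per-group average is the group sum divided by the group size, like AVG().
averages = {name: totals[name] / counts[name] for name in counts}

print(dict(counts))
print(averages)
```

Grouping by two columns, as in the `['name', 'votes_funny']` example, works the same way with the pair of values as the group key.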

### Transformations
Add 12 reviews to all users. This does not modify the SFrame, and the result is not stored:

```
user['review_count'].apply(lambda x: x + 12)
```

### Generate a new column which is the sum of votes_cool and votes_funny

```
user['funny_cool'] = user['votes_cool'] + user['votes_funny']
```

By storing the resulting column into a new column name, the SFrame is modified.

### Load business and review data:
```
business = graphlab.SFrame.read_csv('business.csv.gz')
review = graphlab.SFrame.read_csv('review.csv.gz')
```

### Visualize the business data:
```
business.show()
```

### Show reviews with the max stars rating:
```
review.topk("stars")
```
	
### Remove all the fields in the business table which have only a single unique value:
```
business.remove_columns(["type"])
```

### Find all reviews of all users named Jim:
```
joined = review.join(user, on='user_id')
joined[joined['name'] == 'Jim']
```
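
The join above matches each review with the user who wrote it. A minimal plain-Python sketch of an inner join on `user_id` follows; the rows are made up, and the sketch assumes every `user_id` appears at most once in the user table.

```python
# Made-up rows standing in for the user and review tables.
users = [
    {"user_id": "u1", "name": "Jim"},
    {"user_id": "u2", "name": "Ann"},
]
reviews = [
    {"user_id": "u1", "stars": 5},
    {"user_id": "u2", "stars": 3},
    {"user_id": "u1", "stars": 2},
    {"user_id": "u9", "stars": 4},  # no matching user, dropped by an inner join
]

# Index users by the join key for constant-time lookups.
users_by_id = {u["user_id"]: u for u in users}

joined_rows = []
for r in reviews:
    u = users_by_id.get(r["user_id"])
    if u is not None:
        merged = dict(r)   # copy the review columns
        merged.update(u)   # add the user columns
        joined_rows.append(merged)

# Filter the joined rows exactly like joined[joined['name'] == 'Jim'].
jim_reviews = [row for row in joined_rows if row["name"] == "Jim"]
print(len(jim_reviews))
```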

## Exercise A - Data Engineering - Results

In [None]:
## Exercise A - Data Engineering

### Loading a dataset
# Load users dataset:
user = graphlab.SFrame.read_csv('user.csv.gz')

### Visualizing a dataset
# get some statistics
user.show()

# view the first 3 rows
user.head(3)

### Accessing a column
# access the column 'name'
user['name']

### Computing statistics of a column
# Calculating the mean average_star rating:
# to compute the mean star rating.
user['average_stars'].mean()

### Accessing a row
# Print out the first 3 rows:
user.head(3) # or user[:3]

# Print out the last 3 rows:
user.tail(3) # or user[-3:]

### Accessing a row column combination
# Print the name and average_stars column of the first 10 rows:
user[0:10][['name','average_stars']]

### Logical operators
# Test, row by row, whether the user is named Jim:
selection = user['name'] == 'Jim'

# Select all data rows for users named Jim:
user[selection]

# Shorthand notation:
user[user['name'] == 'Jim']

### How many users named Jim have more than 2 average_stars:
user[(user['name'] == 'Jim') & (user['average_stars'] > 2)]

### SQL primitives
# Count users with the same name:
user.groupby(['name'], graphlab.aggregate.COUNT())

# Count users with the same name and number of votes_funny:
user.groupby(['name', 'votes_funny'], graphlab.aggregate.COUNT())

# Calculate the average number of reviews per name:
user.groupby(['name'], graphlab.aggregate.AVG('review_count'))

### Transformations
# Add 12 reviews to all users. This does not modify the SFrame, and the result is not stored:
user['review_count'].apply(lambda x: x + 12)

### Generate a new column which is the sum of votes_cool and votes_funny
user['funny_cool'] = user['votes_cool'] + user['votes_funny']
# By storing the resulting column into a new column name, the SFrame is modified.

### Load business and review data:
business = graphlab.SFrame.read_csv('business.csv.gz')
review = graphlab.SFrame.read_csv('review.csv.gz')

### Visualize the business data:
business.show()

### Show reviews with the max stars rating:
review.topk("stars")

### Remove all the fields in the business table which have only a single unique value:
business.remove_columns(["type"])

### Find all reviews of all users named Jim:
joined = review.join(user, on='user_id')
joined[joined['name'] == 'Jim']

## Exercise B: Graph Analytics

### Construct a graph out of the first 500 rows with edges between year and name:
```
joined['year'] = joined['year'].astype(str)
graph = graphlab.SGraph().add_edges(joined[:500], src_field='year', dst_field='name')
```

### Visualize the graph:
```
graph.show(vlabel_hover=True, vlabel='__id')
```

### Calculate PageRank:
```
pr = graphlab.pagerank.create(graph) 
```

### Find the 10 most important nodes in the graph:
```
print pr['pagerank'].topk('pagerank')
```
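
For intuition about what a PageRank score measures, here is a minimal sketch of the textbook power-iteration formulation in plain Python. It is an illustration only: GraphLab's implementation may normalize scores and treat nodes without outgoing edges differently, and the `damping` value and the edge list are assumptions made for this example.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Textbook power-iteration PageRank over a list of (src, dst) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without out-links is spread evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n
                    for node in nodes}
        # Every node passes an equal share of its rank along each out-link.
        for node in nodes:
            targets = out_links[node]
            for dst in targets:
                new_rank[dst] += damping * rank[node] / len(targets)
        rank = new_rank
    return rank


# Made-up edges in the same shape as the exercise: year -> user name.
edges = [("2012", "Jim"), ("2012", "Ann"), ("2013", "Jim"), ("2014", "Jim")]
ranks = pagerank(edges)

# The most important nodes are the ones with the highest score.
top = sorted(ranks, key=ranks.get, reverse=True)[:2]
print(top)
```

In this toy graph the name that receives edges from the most years ends up with the highest score, which mirrors what the top-k query surfaces on the real data.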

### Add the pagerank information into the graph vertices:
```
graph = graph.add_vertices(pr['pagerank'])
```

### Show the pagerank values on the graph:
```
graph.show(vlabel_hover=True, vlabel='pagerank')
```

## Exercise B - Graph Analytics - Results

In [None]:
joined['year'] = joined['year'].astype(str)
graph = graphlab.SGraph().add_edges(joined[:500], src_field='year', dst_field='name')
graph.show(vlabel_hover=True, vlabel='__id')

In [None]:
pr = graphlab.pagerank.create(graph) 
print pr['pagerank'].topk('pagerank')
graph = graph.add_vertices(pr['pagerank'])
graph.show(vlabel_hover=True, vlabel='pagerank')

## Exercise C: Recommenders (User modeling)

### Split the data into 90% train and 10% test sets:
```
train, test = review.random_split(0.9)
```
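
A random split of this kind can be pictured as assigning each row independently to the first set with the given probability. The sketch below shows that idea in plain Python; it is an assumption about the general technique rather than a description of GraphLab internals, and the helper name and data are made up. Passing a fixed seed makes the split reproducible.

```python
import random


def random_split(rows, fraction, seed=None):
    """Put each row in the first list with probability `fraction`, else the second."""
    rng = random.Random(seed)
    first, second = [], []
    for row in rows:
        (first if rng.random() < fraction else second).append(row)
    return first, second


rows = list(range(1000))  # made-up stand-in for the review rows
train_rows, test_rows = random_split(rows, 0.9, seed=0)

# The sizes are only approximately 90% / 10%, because each row is drawn independently.
print(len(train_rows) + len(test_rows))
```

Because the assignment is random, the train set holds roughly, not exactly, 90% of the rows, and RMSE values will vary slightly between runs unless the seed is fixed.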

### Build a matrix factorization recommender to predict star rating:
```
model = graphlab.factorization_recommender.create(train, target='stars', 
side_data_factorization=False, item_id='business_id')
```
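
To give a feel for what a factorization recommender learns, here is a small plain-Python sketch of matrix factorization with bias terms, trained by stochastic gradient descent. It is a generic textbook variant under stated assumptions: the function `train_mf`, its hyperparameters, and the ratings are all made up for illustration, and GraphLab's model and solver may differ.

```python
import math
import random


def train_mf(ratings, n_factors=2, epochs=200, lr=0.05, reg=0.02, seed=0):
    """Fit rating ~ mean + b_user + b_item + dot(p_user, q_item) with SGD.

    Returns a predict(user, item) function and the training RMSE per epoch.
    """
    rng = random.Random(seed)
    mean = sum(r for _, _, r in ratings) / float(len(ratings))
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    b_user = {u: 0.0 for u in users}
    b_item = {i: 0.0 for i in items}
    p = {u: [rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for i in items}

    def predict(u, i):
        # Only users and items seen during training can be scored here.
        dot = sum(a * b for a, b in zip(p[u], q[i]))
        return mean + b_user[u] + b_item[i] + dot

    def training_rmse():
        squared = sum((r - predict(u, i)) ** 2 for u, i, r in ratings)
        return math.sqrt(squared / float(len(ratings)))

    history = [training_rmse()]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - predict(u, i)
            # Gradient steps with L2 regularization on every parameter.
            b_user[u] += lr * (err - reg * b_user[u])
            b_item[i] += lr * (err - reg * b_item[i])
            for k in range(n_factors):
                pu, qi = p[u][k], q[i][k]
                p[u][k] += lr * (err * qi - reg * pu)
                q[i][k] += lr * (err * pu - reg * qi)
        history.append(training_rmse())
    return predict, history


# Made-up (user_id, business_id, stars) triples.
ratings = [
    ("u1", "b1", 5), ("u1", "b2", 3), ("u1", "b3", 2),
    ("u2", "b1", 4), ("u2", "b3", 1),
    ("u3", "b2", 2), ("u3", "b3", 1),
]
predict, history = train_mf(ratings)
print(history[0] > history[-1])
```

The first entry of `history` is the error of predicting the global mean for everyone; the decrease over the epochs is what the learned biases and factors contribute.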

### Compute the train and test RMSE:
```
print 'Training RMSE', model.get('training_rmse')
print 'Test RMSE', graphlab.evaluation.rmse(test['stars'], model.predict(test)) 
```
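
RMSE (root mean squared error) is the square root of the average squared difference between the true and predicted ratings, so lower is better. A minimal plain-Python sketch of the formula, with made-up numbers:

```python
import math


def rmse(targets, predictions):
    """Root mean squared error between two equally long sequences of numbers."""
    if len(targets) != len(predictions):
        raise ValueError("targets and predictions must have the same length")
    squared = [(t - p) ** 2 for t, p in zip(targets, predictions)]
    return math.sqrt(sum(squared) / float(len(squared)))


actual = [5, 3, 4, 1]               # made-up true star ratings
predicted = [4.5, 3.5, 4.0, 2.0]    # made-up model predictions

# Squared errors are 0.25, 0.25, 0.0 and 1.0, so the mean is 0.375.
error = rmse(actual, predicted)
print(error)
```

A training RMSE that is much lower than the test RMSE is a common sign that the model is overfitting the training set.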

### Now we will improve the model using user and business side features:
```
model = graphlab.factorization_recommender.create(train, target='stars',
side_data_factorization=True, item_id='business_id', user_data=user,
item_data=business)
```

### See the improvement:
```
print 'Training RMSE', model.get('training_rmse')
print 'Test RMSE', graphlab.evaluation.rmse(test['stars'], model.predict(test))
```

## Exercise C: Recommenders (User modeling) - Results

In [None]:
train, test = review.random_split(0.9)

model = graphlab.factorization_recommender.create(train, target='stars',
                                                  item_id='business_id',
                                                  side_data_factorization=False)
print 'Training RMSE', model.get('training_rmse')
print 'Test RMSE', graphlab.evaluation.rmse(test['stars'], model.predict(test))

model_with_side_data_factorization = graphlab.factorization_recommender.create(train,
                                                                               target='stars',
                                                                               item_id='business_id',
                                                                               side_data_factorization=True,
                                                                               user_data=user, 
                                                                               item_data=business)

print 'Training RMSE', model_with_side_data_factorization.get('training_rmse')
print 'Test RMSE', graphlab.evaluation.rmse(test['stars'], model_with_side_data_factorization.predict(test))

In [None]:
print "Test RMSE without side features: %f" % graphlab.evaluation.rmse(test['stars'], model.predict(test))
print "Test RMSE with side features: %f" % graphlab.evaluation.rmse(test['stars'], model_with_side_data_factorization.predict(test))

In [None]:
print "Max error without side features: %f" % graphlab.evaluation.max_error(test['stars'], model.predict(test))
print "Max error with side features: %f" % graphlab.evaluation.max_error(test['stars'], model_with_side_data_factorization.predict(test))
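
While RMSE summarizes the typical error, the max error reports the single worst prediction, that is, the largest absolute difference between a true and a predicted rating. A minimal plain-Python sketch with made-up numbers:

```python
def max_error(targets, predictions):
    """Largest absolute difference between paired true and predicted values."""
    return max(abs(t - p) for t, p in zip(targets, predictions))


actual = [5, 3, 4, 1]               # made-up true star ratings
predicted = [4.5, 3.5, 4.0, 2.5]    # made-up model predictions

# Absolute errors are 0.5, 0.5, 0.0 and 1.5, so the worst case is 1.5.
worst = max_error(actual, predicted)
print(worst)
```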