# About This Notebook

This notebook demonstrates how to use ML Workbench to create a regression model that accepts numeric and categorical data. It shows "cloud run" mode, in which each step runs on Google Cloud Platform using various services. Cloud runs can be distributed, so they can handle large data without hitting memory, computation, or disk limits. The notebook is similar to the previous one (Taxi Fare Model (small data)), but it uses the full data (about 77M instances).

There are only a few things that need to change between "local run" and "cloud run":

* All data sources or file paths must be on GCS.
* The --cloud flag must be set for each step.
* "cloud_config" can be set for cloud-specific settings such as project_id and machine_type. In some cases it is required.

Other than this, nothing else changes from local to cloud!
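
One way to catch the first requirement early is to check paths before submitting a job. The helper below is a hypothetical sketch and is not part of ML Workbench; the function names and the `settings` dictionary are assumptions made for illustration. It only checks the `gs://` prefix and does not verify that the bucket or object exists.

```python
def is_gcs_path(path):
    """Return True if `path` looks like a Google Cloud Storage URL."""
    return path.startswith("gs://") and len(path) > len("gs://")


def find_local_paths(paths):
    """Return the entries of `paths` that are not on GCS."""
    return [p for p in paths if not is_gcs_path(p)]


# Hypothetical job settings; the bucket name is the one used later in this notebook.
settings = {
    "output": "gs://datalab-chicago-taxi-demo/analysis",
    "analysis": "./analysis",  # a local path that would need to move to GCS
}
bad = find_local_paths(settings.values())
print(bad)
```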

Note: "Run all cells" does not work for this notebook because the steps are asynchronous. Many steps submit a cloud job, and you should track its status by following the job link.

Execution of this notebook requires Google Datalab (see [setup instructions](https://cloud.google.com/datalab/docs/quickstarts)).



# The Data

We will use [Chicago Taxi Trip Data](https://cloud.google.com/bigquery/public-data/chicago-taxi). The model we build predicts the trip fare from the pickup time, pickup location, drop-off location, and taxi company.

## Split Data Into Train/Eval Sets

Use BigQuery to select the features we need and to assign 5% of the rows to eval and 95% to training. The split is based on a hash of unique_key, so it is pseudo-random but repeatable.

In [27]:
%%bq query --name texi_query_eval
SELECT
  unique_key,
  fare,
  CAST(EXTRACT(DAYOFWEEK FROM trip_start_timestamp) AS STRING) as weekday,
  CAST(EXTRACT(DAYOFYEAR FROM trip_start_timestamp) AS STRING) as day,
  CAST(EXTRACT(HOUR FROM trip_start_timestamp) AS STRING) as hour,
  pickup_latitude,
  pickup_longitude,
  dropoff_latitude,
  dropoff_longitude,
  company
FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
WHERE 
  fare > 2.0 AND fare < 200.0 AND
  pickup_latitude IS NOT NULL AND
  pickup_longitude IS NOT NULL AND
  dropoff_latitude IS NOT NULL AND
  dropoff_longitude IS NOT NULL AND
  MOD(ABS(FARM_FINGERPRINT(unique_key)), 100) < 5
  

In [28]:
%%bq query --name texi_query_train
SELECT
  unique_key,
  fare,
  CAST(EXTRACT(DAYOFWEEK FROM trip_start_timestamp) AS STRING) as weekday,
  CAST(EXTRACT(DAYOFYEAR FROM trip_start_timestamp) AS STRING) as day,
  CAST(EXTRACT(HOUR FROM trip_start_timestamp) AS STRING) as hour,
  pickup_latitude,
  pickup_longitude,
  dropoff_latitude,
  dropoff_longitude,
  company
FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
WHERE 
  fare > 2.0 AND fare < 200.0 AND
  pickup_latitude IS NOT NULL AND
  pickup_longitude IS NOT NULL AND
  dropoff_latitude IS NOT NULL AND
  dropoff_longitude IS NOT NULL AND
  MOD(ABS(FARM_FINGERPRINT(unique_key)), 100) >= 5
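
The two queries above split rows by hashing the key and comparing the result modulo 100 against a threshold. The same idea can be sketched in plain Python. This sketch substitutes SHA-256 for FARM_FINGERPRINT, so it will not assign the same rows to the same split as BigQuery does; the function names and the synthetic keys are assumptions made for illustration.

```python
import hashlib


def bucket(key, num_buckets=100):
    """Map a string key to a stable bucket in [0, num_buckets)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then reduce it.
    return int.from_bytes(digest[:8], "big") % num_buckets


def assign_split(key, eval_percent=5):
    """Return "eval" for roughly eval_percent% of keys, else "train"."""
    return "eval" if bucket(key) < eval_percent else "train"


# Synthetic keys for illustration only; the real keys are the unique_key values.
keys = ["trip-%d" % i for i in range(100000)]
splits = [assign_split(k) for k in keys]
eval_fraction = splits.count("eval") / len(keys)
print("eval fraction: %.3f" % eval_fraction)
```

Because the assignment depends only on the key, rerunning the split never moves a row between train and eval.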

Create the "chicago_taxi.train" and "chicago_taxi.eval" BigQuery tables to store the results.

In [29]:
%%bq datasets create --name chicago_taxi

In [30]:
%%bq execute
query: texi_query_eval
table: chicago_taxi.eval
mode: overwrite

unique_key,fare,weekday,day,hour,pickup_latitude,pickup_longitude,dropoff_latitude,dropoff_longitude,company
2bc572255bcaa2389a211f282a7291916ed3da07,5.45,1,313,7,41.934762456,-87.639853859,41.936310131,-87.651562592,
5d5fcecab5b21f369daab6c08f909fec5eef39f7,3.25,6,3,21,41.946294536,-87.654298084,41.946294536,-87.654298084,
abb241ae6ed453a54fa9761f99b109272c0af770,10.65,7,74,13,41.89967018,-87.669837798,41.921778356,-87.641459759,Dispatch Taxi Affiliation
de3b3be03f4428b542c5abe8afd52545e72d9cbe,4.65,4,184,19,41.906025969,-87.675311622,41.906025969,-87.675311622,Taxi Affiliation Services
583aeed939da4b9cd39746eed19e96a1c8c0b185,17.25,3,41,12,41.808916283,-87.596183344,41.706587882,-87.623366512,Taxi Affiliation Services
c60a2ab89b7d57fec060b5a6b8fa1c141e1540bd,7.65,7,130,3,41.946294536,-87.654298084,41.921854911,-87.646210977,Taxi Affiliation Services
4ec0e84280d13fce4efb3a035297ac0aeb1fa5bb,5.05,1,321,0,41.921854911,-87.646210977,41.93057857,-87.642206313,Taxi Affiliation Services
7cd32a624a19e513615a142e6e485da0814c9022,7.05,7,53,3,41.965445784,-87.66319585,41.946294536,-87.654298084,
34746e341940dce3f90e98873ed2d54cd5d1515d,28.05,1,54,2,41.946294536,-87.654298084,41.945069205,-87.67606274,
549a108a9579646b1d1fb674866d93873bfa4875,5.45,6,291,19,41.965445784,-87.66319585,41.972667956,-87.663865496,


In [31]:
%%bq execute
query: texi_query_train
table: chicago_taxi.train
mode: overwrite

unique_key,fare,weekday,day,hour,pickup_latitude,pickup_longitude,dropoff_latitude,dropoff_longitude,company
7bc601797a07c11ac351a49d02850789acda94b1,35.25,4,203,22,41.97907082,-87.903039661,41.890922026,-87.618868355,Taxi Affiliation Services
a2c7b99420515793e18c2cf896963edc1dbaafff,38.05,1,230,13,41.785998518,-87.750934289,41.949139771,-87.656803909,Taxi Affiliation Services
82a1dfb248c4cb6bdede9fbe6a50d2fd2d6043dc,8.25,4,16,8,41.899602111,-87.633308037,41.953582125,-87.72345239,
d3131be93213ce27b6f5c895b7400b2b80bb48a5,8.65,5,276,20,41.904935302,-87.649907226,41.880994471,-87.632746489,Taxi Affiliation Services
b95cff84b984569fdab6a91ec8b0233656aafa3b,9.85,4,225,17,41.880994471,-87.632746489,41.849246754,-87.624135298,Taxi Affiliation Services
cd1641a6062ada8f3019500e5b71e8a6f5edca2d,46.65,5,127,13,42.001571027,-87.695012589,41.983636307,-87.723583185,Taxi Affiliation Services
94977a09ce63010d0edb5692beb4643161fb4a87,6.85,4,71,21,41.884987192,-87.620992913,41.867902418,-87.642958665,Taxi Affiliation Services
98fce58c13ae616b787a93a5a1df7156077a7306,24.45,5,220,20,41.785998518,-87.750934289,41.89321636,-87.63784421,Northwest Management LLC
d6288f216a2d82fbd43a311d1cf70260940dcd78,5.05,4,99,22,41.942691844,-87.651770507,41.93057857,-87.642206313,
58fd7078712ed5a1cea6a87289212af5747c20ea,36.05,3,218,12,41.880994471,-87.632746489,41.97907082,-87.903039661,Taxi Affiliation Services


Sanity check on the data.

In [32]:
%%bq query
SELECT count(*) FROM chicago_taxi.train

f0_
68126775


In [10]:
%%bq query
SELECT count(*) FROM chicago_taxi.eval

f0_
3585149


## Explore Data

See the previous notebook (Taxi Fare Model (small data)) for data exploration.

# Create Model with ML Workbench


The MLWorkbench magics are a set of Datalab commands that let you train, deploy, and run predictions with ML models without writing model code. This notebook takes the data in the BigQuery tables and builds a regression model. There is a magic command for each step in the ML workflow: analyzing input data to build transforms, transforming data, training a model, evaluating a model, and deploying a model.

For details of each command, run with --help. For example, "%%ml train --help".

This notebook runs the analyze, transform, and training steps in the cloud using Google Cloud services. Notice that the "--cloud" flag is set for each step.

In [3]:
import google.datalab.contrib.mlworkbench.commands # this loads the %%ml commands

In [35]:
%%ml dataset create
name: taxi_data_full
format: bigquery
train: chicago_taxi.train
eval: chicago_taxi.eval

In [None]:
!gsutil mb gs://datalab-chicago-taxi-demo # Create a Storage Bucket to store results.

## Step 1: Analyze

The first step in the MLWorkbench workflow is to analyze the data for the requested transformations. In this case, analysis builds vocabularies for the categorical features and computes statistics for the numeric features.

In [None]:
!gsutil rm -r -f gs://datalab-chicago-taxi-demo/analysis # Remove previous analysis results if any

In [38]:
%%ml analyze --cloud
output: gs://datalab-chicago-taxi-demo/analysis
data: taxi_data_full
features:
  unique_key:
    transform: key
  fare:
    transform: target         
  company:
    transform: embedding
    embedding_dim: 10
  weekday:
    transform: one_hot
  day:
    transform: one_hot
  hour:
    transform: one_hot
  pickup_latitude:
    transform: scale    
  pickup_longitude:
    transform: scale
  dropoff_latitude:
    transform: scale
  dropoff_longitude:
    transform: scale

Analyzing column fare...
Updated property [core/project].
column fare analyzed.
Analyzing column hour...
Updated property [core/project].
column hour analyzed.
Analyzing column company...
Updated property [core/project].
column company analyzed.
Analyzing column pickup_longitude...
Updated property [core/project].
column pickup_longitude analyzed.
Analyzing column day...
Updated property [core/project].
column day analyzed.
Analyzing column dropoff_longitude...
Updated property [core/project].
column dropoff_longitude analyzed.
Analyzing column weekday...
Updated property [core/project].
column weekday analyzed.
Analyzing column pickup_latitude...
Updated property [core/project].
column pickup_latitude analyzed.
Analyzing column dropoff_latitude...
Updated property [core/project].
column dropoff_latitude analyzed.
Updated property [core/project].
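
To make the analysis step more concrete, here is a minimal single-machine sketch of the two kinds of output it produces: a vocabulary for each categorical column and summary statistics for each numeric column. The `analyze` function and the layout of its results are assumptions made for illustration. The cloud step runs against BigQuery, and the exact statistics and file format it writes are not shown in this notebook.

```python
from collections import Counter


def analyze(rows, categorical, numeric):
    """Collect a vocabulary per categorical column and stats per numeric column."""
    vocab = {c: Counter() for c in categorical}
    stats = {c: {"min": float("inf"), "max": float("-inf"), "sum": 0.0, "count": 0}
             for c in numeric}
    for row in rows:
        for c in categorical:
            vocab[c][row[c]] += 1
        for c in numeric:
            v = float(row[c])
            s = stats[c]
            s["min"] = min(s["min"], v)
            s["max"] = max(s["max"], v)
            s["sum"] += v
            s["count"] += 1
    for s in stats.values():
        s["mean"] = s["sum"] / s["count"] if s["count"] else None
    return vocab, stats


# Three rows taken from the eval sample shown earlier (subset of columns).
rows = [
    {"hour": "7", "company": "", "pickup_latitude": 41.934762456},
    {"hour": "21", "company": "", "pickup_latitude": 41.946294536},
    {"hour": "13", "company": "Dispatch Taxi Affiliation", "pickup_latitude": 41.89967018},
]
vocab, stats = analyze(rows, ["hour", "company"], ["pickup_latitude"])
print(sorted(vocab["hour"]))
print(stats["pickup_latitude"]["min"], stats["pickup_latitude"]["max"])
```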


## Step 2: Transform

The transform step applies the transformations to the input data and saves the results as TFRecord files containing tf.Example protocol buffers. This allows training to start from preprocessed data. Without this step, training would have to preprocess each row of CSV data every time the row is read. Because TensorFlow reads the same row multiple times during training, the same row would be preprocessed multiple times. Writing the preprocessed data to disk once speeds up training.

The transform step is required if your source data is in a BigQuery table.

We run the transform step for the training and eval data.
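
As a rough picture of the per-row work that the transform step saves, the sketch below applies a one-hot encoding and a min/max scaling to a single row. It is an illustration under stated assumptions, not the ML Workbench implementation: mapping the observed range onto [-1, 1] is an assumed convention, and the vocabulary and latitude bounds are hypothetical stand-ins for the analysis output.

```python
def scale(value, min_value, max_value):
    """Linearly map [min_value, max_value] onto [-1, 1] (an assumed convention)."""
    if max_value == min_value:
        return 0.0
    return 2.0 * (value - min_value) / (max_value - min_value) - 1.0


def one_hot(value, vocabulary):
    """Return a 0/1 list with a single 1 at the position of `value`."""
    return [1 if v == value else 0 for v in vocabulary]


# Hypothetical analysis output; real values come from the analyze step.
hour_vocab = [str(h) for h in range(24)]
lat_min, lat_max = 41.6, 42.1

row = {"hour": "7", "pickup_latitude": 41.85}
features = {
    "hour": one_hot(row["hour"], hour_vocab),
    "pickup_latitude": scale(row["pickup_latitude"], lat_min, lat_max),
}
print(features["pickup_latitude"])
```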

In [None]:
!gsutil -m rm -r -f gs://datalab-chicago-taxi-demo/transform # Remove previous transform results if any.

Transform takes about 6 hours in the cloud. The data is fairly big (33GB), and processing it locally on a single VM would take much longer.

In [40]:
%%ml transform --cloud
output: gs://datalab-chicago-taxi-demo/transform
analysis: gs://datalab-chicago-taxi-demo/analysis
data: taxi_data_full

running sdist
running egg_info
writing requirements to trainer.egg-info/requires.txt
writing trainer.egg-info/PKG-INFO
writing top-level names to trainer.egg-info/top_level.txt
writing dependency_links to trainer.egg-info/dependency_links.txt
reading manifest file 'trainer.egg-info/SOURCES.txt'
writing manifest file 'trainer.egg-info/SOURCES.txt'

running check

creating trainer-1.0.0
creating trainer-1.0.0/trainer
creating trainer-1.0.0/trainer.egg-info
copying files to trainer-1.0.0...
copying setup.py -> trainer-1.0.0
copying trainer/__init__.py -> trainer-1.0.0/trainer
copying trainer/feature_analysis.py -> trainer-1.0.0/trainer
copying trainer/feature_transforms.py -> trainer-1.0.0/trainer
copying trainer/task.py -> trainer-1.0.0/trainer
copying trainer.egg-info/PKG-INFO -> trainer-1.0.0/trainer.egg-info
copying trainer.egg-info/SOURCES.txt -> trainer-1.0.0/trainer.egg-info
copying trainer.egg-info/dependency_links.txt -> trainer-1.0.0/trainer.egg-info
copying trainer.egg-info/req

In [5]:
!gsutil list gs://datalab-chicago-taxi-demo/transform/eval-*

gs://datalab-chicago-taxi-demo/transform/eval-00000-of-00056.tfrecord.gz
gs://datalab-chicago-taxi-demo/transform/eval-00001-of-00056.tfrecord.gz
gs://datalab-chicago-taxi-demo/transform/eval-00002-of-00056.tfrecord.gz
gs://datalab-chicago-taxi-demo/transform/eval-00003-of-00056.tfrecord.gz
gs://datalab-chicago-taxi-demo/transform/eval-00004-of-00056.tfrecord.gz
gs://datalab-chicago-taxi-demo/transform/eval-00005-of-00056.tfrecord.gz
gs://datalab-chicago-taxi-demo/transform/eval-00006-of-00056.tfrecord.gz
gs://datalab-chicago-taxi-demo/transform/eval-00007-of-00056.tfrecord.gz
gs://datalab-chicago-taxi-demo/transform/eval-00008-of-00056.tfrecord.gz
gs://datalab-chicago-taxi-demo/transform/eval-00009-of-00056.tfrecord.gz
gs://datalab-chicago-taxi-demo/transform/eval-00010-of-00056.tfrecord.gz
gs://datalab-chicago-taxi-demo/transform/eval-00011-of-00056.tfrecord.gz
gs://datalab-chicago-taxi-demo/transform/eval-00012-of-00056.tfrecord.gz
gs://datalab-chicago-taxi-demo/transfo

In [5]:
%%ml dataset create
name: taxi_data_transformed
format: transformed
train: gs://datalab-chicago-taxi-demo/transform/train-*
eval: gs://datalab-chicago-taxi-demo/transform/eval-*

## Step 3: Training

MLWorkbench helps you build standard TensorFlow models without writing any TensorFlow code. We already know from the previous notebook that a DNN regression model works better for this data.

In [None]:
!gsutil -m rm -r -f gs://datalab-chicago-taxi-demo/train # Remove previous training results.

Training takes about 30 minutes with the "STANDARD_1" scale_tier. Note that we will perform 1M steps. Running this locally on Datalab's VM would take much longer. Cloud ML Engine distributes training across multiple VMs, so it runs much faster.

In [6]:
%%ml train --cloud
output: gs://datalab-chicago-taxi-demo/train
analysis: gs://datalab-chicago-taxi-demo/analysis
data: taxi_data_transformed
model_args:
    model: dnn_regression
    hidden-layer-size1: 400
    hidden-layer-size2: 200
    train-batch-size: 1000
    max-steps: 1000000
cloud_config:
    region: us-east1
    scale_tier: STANDARD_1
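
To get a feel for what 1M steps means at this data size, the arithmetic below estimates how many passes over the training set the job makes. It assumes that each step consumes exactly one batch of train-batch-size examples, which is a simplification for distributed training.

```python
max_steps = 1000000        # max-steps from the training cell
batch_size = 1000          # train-batch-size from the training cell
train_rows = 68126775      # row count of chicago_taxi.train from the sanity check

examples_seen = max_steps * batch_size
epochs = examples_seen / train_rows
print("examples seen: %d" % examples_seen)
print("approximate passes over the training data: %.1f" % epochs)
```

Under that assumption the job reads each training row roughly 15 times, which is why preprocessing the data once in the transform step pays off.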

## Step 4: Evaluation using batch prediction

Below, we use the evaluation model to run batch prediction in the cloud. For demo purposes, we use the evaluation data again.

In [None]:
# Delete previous results
!gsutil -m rm -r gs://datalab-chicago-taxi-demo/batch_prediction

Currently, the batch prediction service does not work with BigQuery data, so we export the eval data to a CSV file.

In [9]:
%%bq extract
table: chicago_taxi.eval
format: csv
path: gs://datalab-chicago-taxi-demo/eval.csv

Run batch prediction. Note that we use evaluation_model because it accepts input data that includes a target (truth) column.

In [8]:
%%ml batch_predict --cloud
model: gs://datalab-chicago-taxi-demo/train/evaluation_model
output: gs://datalab-chicago-taxi-demo/batch_prediction
format: csv
data:
  csv: gs://datalab-chicago-taxi-demo/eval.csv
cloud_config:
  region: us-east1

Once batch prediction is done, check the result files. The batch prediction service writes its output as JSON files.

In [14]:
!gsutil list -l -h gs://datalab-chicago-taxi-demo/batch_prediction

       0 B  2017-10-28T04:23:12Z  gs://datalab-chicago-taxi-demo/batch_prediction/prediction.errors_stats-00000-of-00001
       0 B  2017-10-28T04:10:44Z  gs://datalab-chicago-taxi-demo/batch_prediction/prediction.results-00000-of-00001
 19.74 MiB  2017-10-28T04:23:11Z  gs://datalab-chicago-taxi-demo/batch_prediction/prediction.results-00000-of-00022
 19.76 MiB  2017-10-28T04:23:11Z  gs://datalab-chicago-taxi-demo/batch_prediction/prediction.results-00001-of-00022
 19.79 MiB  2017-10-28T04:23:11Z  gs://datalab-chicago-taxi-demo/batch_prediction/prediction.results-00002-of-00022
 19.76 MiB  2017-10-28T04:23:11Z  gs://datalab-chicago-taxi-demo/batch_prediction/prediction.results-00003-of-00022
 19.86 MiB  2017-10-28T04:23:11Z  gs://datalab-chicago-taxi-demo/batch_prediction/prediction.results-00004-of-00022
 19.81 MiB  2017-10-28T04:23:11Z  gs://datalab-chicago-taxi-demo/batch_prediction/prediction.results-00005-of-00022
 19.87 MiB  2017-10-28T04:23:11Z  gs://datalab-chicago-taxi

We can load the results back to BigQuery.

In [10]:
%%bq load
format: json
mode: overwrite  
table: chicago_taxi.eval_results
path: gs://datalab-chicago-taxi-demo/batch_prediction/prediction.results*
schema:
  - name: unique_key
    type: STRING
  - name: predicted
    type: FLOAT
  - name: target
    type: FLOAT
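
If you want to inspect the prediction files outside BigQuery, they can be parsed line by line. The sketch below assumes that each line is one JSON object carrying the three fields named in the load schema above. The two sample records reuse values from this notebook's evaluation results, and `parse_results` is a hypothetical helper.

```python
import json

sample = "\n".join([
    '{"unique_key": "357449c0af2cc90b5c49fb04087183ba7f90cea5", '
    '"predicted": 7.06353187561, "target": 197.770004272}',
    '{"unique_key": "e2f0acebdc1d8ac7d998818e399e4b76505247e2", '
    '"predicted": 8.17792129517, "target": 197.050003052}',
])


def parse_results(text):
    """Parse newline-delimited JSON records, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


records = parse_results(sample)
for r in records:
    # Print a short key prefix and the absolute error of the prediction.
    print(r["unique_key"][:8], abs(r["predicted"] - r["target"]))
```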

With the data in BigQuery, we can run some query analysis, for example computing RMSE.

In [11]:
%%ml evaluate regression
bigquery: chicago_taxi.eval_results

Unnamed: 0,metric,value
0,Root Mean Square Error,3.474433
1,Mean Absolute Error,1.531116
2,50 Percentile Absolute Error,0.911206
3,90 Percentile Absolute Error,2.88205
4,99 Percentile Absolute Error,12.204504
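
For reference, the metrics in the table above can be computed from (predicted, target) pairs as sketched below. This is an illustrative reimplementation, not the code behind `%%ml evaluate regression`. In particular, the linear-interpolation percentile is an assumed convention, and the input pairs are made up.

```python
import math


def percentile(sorted_values, p):
    """Linear-interpolation percentile of an ascending list, p in [0, 100]."""
    if not sorted_values:
        raise ValueError("no values")
    rank = (len(sorted_values) - 1) * p / 100.0
    lo = int(math.floor(rank))
    hi = int(math.ceil(rank))
    frac = rank - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def regression_metrics(pairs):
    """Compute RMSE, MAE, and absolute-error percentiles for (predicted, target) pairs."""
    abs_errors = sorted(abs(p - t) for p, t in pairs)
    n = len(abs_errors)
    return {
        "rmse": math.sqrt(sum(e * e for e in abs_errors) / n),
        "mae": sum(abs_errors) / n,
        "p50": percentile(abs_errors, 50),
        "p90": percentile(abs_errors, 90),
        "p99": percentile(abs_errors, 99),
    }


# Made-up (predicted, target) pairs, for illustration only.
pairs = [(10.0, 12.0), (5.0, 5.0), (20.0, 17.0), (8.0, 9.0), (30.0, 26.0)]
metrics = regression_metrics(pairs)
print(metrics)
```

RMSE squares each error before averaging, so a few large misses raise it well above MAE. This matches the table, where RMSE is more than twice MAE.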


The results above are better than those of the local run with sampled data. RMSE is reduced by about 2.5%, and MAE by around 20%. Average absolute error is reduced by around 30%.

Select top results sorted by error.

In [12]:
%%bq query
SELECT
  predicted, 
  target,
  ABS(predicted-target) as error,
  s.* 
FROM `chicago_taxi.eval_results` as r 
JOIN `chicago_taxi.eval` as s 
ON r.unique_key = s.unique_key 
ORDER BY error DESC
LIMIT 10

predicted,target,error,unique_key,fare,weekday,day,hour,pickup_latitude,pickup_longitude,dropoff_latitude,dropoff_longitude,company
7.06353187561,197.770004272,190.706472397,357449c0af2cc90b5c49fb04087183ba7f90cea5,197.77,4,280,4,41.899602111,-87.633308037,41.899602111,-87.633308037,
8.17792129517,197.050003052,188.872081757,e2f0acebdc1d8ac7d998818e399e4b76505247e2,197.05,7,150,17,41.9867118,-87.663416405,42.009622881,-87.670166857,Taxi Affiliation Services
8.41388988495,195.850006104,187.436116219,3b03f0108c84bc2b4c42ba874b0cfdccd72e575e,195.85,6,318,20,41.97907082,-87.903039661,41.97907082,-87.903039661,Dispatch Taxi Affiliation
7.15682601929,192.850006104,185.693180084,76266a055557deb60189004a10d5a43f48843c86,192.85,1,222,3,41.9867118,-87.663416405,41.96581197,-87.655878786,Blue Ribbon Taxi Association Inc.
13.7977991104,199.050003052,185.252203941,a759a0491eef2661c12027d5dc5283b64124a18c,199.05,2,342,14,41.944226601,-87.655998182,42.009622881,-87.670166857,
5.72312879562,189.669998169,183.946869373,b0fb64578ac47490df8e94a223f0ec82613fa1f0,189.67,7,250,11,41.879066994,-87.657005027,41.879255084,-87.642648998,
12.2571325302,194.649993896,182.392861366,18136afd91cf91f58bb664b5fba4c9f33ebe46b4,194.65,2,327,9,41.97907082,-87.903039661,41.97907082,-87.903039661,
5.31241512299,187.0,181.687584877,1bc27016f87f8de32dfead8b282c996e4ff0ce15,187.0,1,136,3,41.892072635,-87.628874157,41.892507781,-87.626214906,Chicago Medallion Leasing INC
9.88418579102,188.050003052,178.165817261,a5c3dd3f3f8228eb0ca208e5e949d1e9401e7de5,188.05,3,181,23,41.922686284,-87.649488729,41.947791586,-87.683834942,
18.1057472229,196.050003052,177.944255829,1ac25e0730bbefa234752b6a2e6824296936cc82,196.05,4,134,21,41.878865584,-87.625192142,41.96581197,-87.655878786,Taxi Affiliation Services


There is also a feature slice visualization component designed for viewing evaluation results. It shows correlation between features and prediction results.

In [40]:
%%bq query --name error_by_hour
SELECT
  COUNT(*) as count,
  hour as feature,
  AVG(ABS(predicted - target)) as avg_error,
  STDDEV(ABS(predicted - target)) as stddev_error
FROM `chicago_taxi.eval_results` as r
JOIN `chicago_taxi.eval` as s 
ON r.unique_key = s.unique_key 
GROUP BY hour
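
The aggregation in the query above can be mirrored in plain Python, which may help when checking a small sample by hand. The sketch assumes rows that already combine predictions with features. The `error_by_feature` helper and the sample rows are made up for illustration. `statistics.stdev` computes the sample standard deviation, which is also what BigQuery's STDDEV returns.

```python
from collections import defaultdict
from statistics import mean, stdev


def error_by_feature(rows, feature):
    """Group absolute errors by a feature value and summarize each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature]].append(abs(row["predicted"] - row["target"]))
    summary = []
    for value, errors in sorted(groups.items()):
        summary.append({
            "feature": value,
            "count": len(errors),
            "avg_error": mean(errors),
            # Sample standard deviation needs at least two values.
            "stddev_error": stdev(errors) if len(errors) > 1 else None,
        })
    return summary


# Made-up joined rows, for illustration only.
rows = [
    {"hour": "5", "predicted": 10.0, "target": 14.0},
    {"hour": "5", "predicted": 20.0, "target": 18.0},
    {"hour": "12", "predicted": 9.0, "target": 10.0},
]
for line in error_by_feature(rows, "hour"):
    print(line)
```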

In [44]:
# Note: the interactive output is replaced with a static image so it displays well on GitHub.
# Please execute this cell to see the interactive component.

from google.datalab.ml import FeatureSliceView

FeatureSliceView().plot(error_by_hour)

<IPython.core.display.Image object>

In [42]:
%%bq query --name error_by_weekday
SELECT
  COUNT(*) as count,
  weekday as feature,
  AVG(ABS(predicted - target)) as avg_error,
  STDDEV(ABS(predicted - target)) as stddev_error
FROM `chicago_taxi.eval_results` as r
JOIN `chicago_taxi.eval` as s 
ON r.unique_key = s.unique_key 
GROUP BY weekday

In [45]:
# Note: the interactive output is replaced with a static image so it displays well on GitHub.
# Please execute this cell to see the interactive component.

from google.datalab.ml import FeatureSliceView

FeatureSliceView().plot(error_by_weekday)

<IPython.core.display.Image object>

The charts above show that the model performs worst at hours 5 and 6 (why?) and best on Sundays (less traffic?).

# Model Deployment and Online Prediction

Model deployment works the same way for locally trained and cloud-trained models. Please see the previous notebook (Taxi Fare Model (small data)).

# Cleanup

In [None]:
!gsutil -m rm -rf gs://datalab-chicago-taxi-demo