<h1> Scaling up ML using Cloud ML Engine </h1>

In this notebook, we take a previously developed TensorFlow model that predicts taxi fares and package it up so that it can be run in Cloud ML Engine (Cloud MLE). For now, we'll run this on a small dataset. The model is rather simplistic, so its accuracy is not great. However, this notebook illustrates *how* to package up a TensorFlow model to run it within Cloud MLE.

Later in the course, we will look at ways to make a more effective machine learning model.

<h2> Environment variables for project and bucket </h2>

Note that:
<ol>
<li> Your project id is the *unique* string that identifies your project (not the project name). You can find this from the GCP Console dashboard's Home page.  My dashboard reads:  <b>Project ID:</b> cloud-training-demos </li>
<li> Cloud training often involves saving and restoring model files. If you don't have a bucket already, I suggest that you create one from the GCP console (because it will dynamically check whether the bucket name you want is available). A common pattern is to prefix the bucket name with the project id, so that it is unique. Also, for cost reasons, you might want to use a single-region bucket. </li>
</ol>
<b>Change the cell below</b> to reflect your Project ID and bucket name.


In [1]:
import os
PROJECT = 'qwiklabs-gcp-4ec8a2116b8b4a75' # REPLACE WITH YOUR PROJECT ID
REGION = 'us-central1' # Choose an available region for Cloud MLE from https://cloud.google.com/ml-engine/docs/regions.
BUCKET = 'qwiklabs-gcp-4ec8a2116b8b4a75' # REPLACE WITH YOUR BUCKET NAME. Use a regional bucket in the region you selected.

In [2]:
# for bash
os.environ['PROJECT'] = PROJECT
os.environ['BUCKET'] = BUCKET
os.environ['REGION'] = REGION
os.environ['TFVERSION'] = '1.4'  # TensorFlow version

In [3]:
%bash
gcloud config set project $PROJECT
gcloud config set compute/region $REGION

Updated property [core/project].
Updated property [compute/region].


Allow the Cloud ML Engine service account to read/write to the bucket containing training data.

In [4]:
%bash
PROJECT_ID=$PROJECT
AUTH_TOKEN=$(gcloud auth print-access-token)
SVC_ACCOUNT=$(curl -X GET -H "Content-Type: application/json" \
    -H "Authorization: Bearer $AUTH_TOKEN" \
    https://ml.googleapis.com/v1/projects/${PROJECT_ID}:getConfig \
    | python -c "import json; import sys; response = json.load(sys.stdin); \
    print(response['serviceAccount'])")

echo "Authorizing the Cloud ML Service account $SVC_ACCOUNT to access files in $BUCKET"
gsutil -m defacl ch -u $SVC_ACCOUNT:R gs://$BUCKET
gsutil -m acl ch -u $SVC_ACCOUNT:R -r gs://$BUCKET  # error message (if bucket is empty) can be ignored
gsutil -m acl ch -u $SVC_ACCOUNT:W gs://$BUCKET

Authorizing the Cloud ML Service account service-940366452726@cloud-ml.google.com.iam.gserviceaccount.com to access files in qwiklabs-gcp-4ec8a2116b8b4a75


Updated default ACL on gs://qwiklabs-gcp-4ec8a2116b8b4a75/
Encountered a problem: CommandException: No URLs matched: gs://qwiklabs-gcp-4ec8a2116b8b4a75/*
Updated ACL on gs://qwiklabs-gcp-4ec8a2116b8b4a75/
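
The cell above extracts the service account with an inline `python -c` one-liner, which fails with a bare traceback if the API returns an error body instead of a config. A minimal sketch of the same extraction with an explicit check follows. The sample response body and the account value in it are placeholders I made up for illustration; only the `serviceAccount` field name comes from the cell above.

```python
import json

# Hypothetical getConfig response body; the account value is a placeholder.
sample_body = '{"serviceAccount": "service-123@example.iam.gserviceaccount.com"}'


def extract_service_account(body):
    """Return the serviceAccount field from a getConfig JSON response."""
    config = json.loads(body)
    if "serviceAccount" not in config:
        # Surface the keys that did come back, to make API errors easier to spot.
        raise KeyError("no 'serviceAccount' field; got keys {}".format(sorted(config)))
    return config["serviceAccount"]


print(extract_service_account(sample_body))
```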


<h2> Packaging up the code </h2>

Take your code and put it into a standard Python package structure.  <a href="taxifare/trainer/model.py">model.py</a> and <a href="taxifare/trainer/task.py">task.py</a> contain the TensorFlow code from earlier (explore the <a href="taxifare/trainer/">directory structure</a>).

In [5]:
!find taxifare

taxifare
taxifare/setup.cfg
taxifare/trainer
taxifare/trainer/model.py
taxifare/trainer/__init__.py
taxifare/trainer/task.py
taxifare/setup.py
taxifare/PKG-INFO
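
The listing above shows the files the trainer package needs. As a rough sketch, a layout check like the following can catch a missing `__init__.py` before a job is submitted. The helper name `missing_package_files` and the list of required files are my own choices for illustration, and the demo runs against a throwaway directory rather than the real `taxifare` folder.

```python
import tempfile
from pathlib import Path


def missing_package_files(root, package="trainer",
                          required=("__init__.py", "task.py", "model.py")):
    """Return the required files that are absent from root/package."""
    pkg_dir = Path(root) / package
    return [name for name in required if not (pkg_dir / name).is_file()]


# Demonstrate on a temporary directory that mimics the taxifare layout.
with tempfile.TemporaryDirectory() as tmp:
    trainer = Path(tmp) / "trainer"
    trainer.mkdir()
    for name in ("__init__.py", "task.py"):
        (trainer / name).touch()
    demo_missing = missing_package_files(tmp)  # model.py was deliberately left out
    print(demo_missing)
```

Against the real package, `missing_package_files('taxifare')` should return an empty list.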


In [6]:
!cat taxifare/trainer/model.py

#!/usr/bin/env python

# Copyright 2017 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import tensorflow as tf
import numpy as np
import shutil

tf.logging.set_verbosity(tf.logging.INFO)

# List the CSV columns
CSV_COLUMNS = ['fare_amount', 'pickuplon','pickuplat','dropofflon','dropofflat','passengers', 'key']

#Choose which column is your label
LABEL_COLUMN = 'fare_amount'

# Set the default values for each CSV column in case there is a missing value
DEFAULTS =

<h2> Find absolute paths to your data </h2>

Note the absolute paths below. In Datalab, /content is mapped to the location that the home icon takes you to.

In [7]:
%bash
echo $PWD
rm -rf $PWD/taxi_trained
cp $PWD/../tensorflow/taxi-train.csv .
cp $PWD/../tensorflow/taxi-valid.csv .
head -1 $PWD/taxi-train.csv
head -1 $PWD/taxi-valid.csv

/content/datalab/training-data-analyst/courses/machine_learning/cloudmle
9.0,-73.93219757080078,40.79558181762695,-73.93547058105469,40.80010986328125,1,0
21.0,-73.975305,40.790067,-73.996612,40.733275,1,0


<h2> Running the Python module from the command-line </h2>

In [8]:
%bash
rm -rf taxifare.tar.gz taxi_trained
export PYTHONPATH=${PYTHONPATH}:${PWD}/taxifare
python -m trainer.task \
   --train_data_paths="${PWD}/taxi-train*" \
   --eval_data_paths=${PWD}/taxi-valid.csv  \
   --output_dir=${PWD}/taxi_trained \
   --train_steps=1000 --job-dir=./tmp

  from ._conv import register_converters as _register_converters
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_evaluation_master': '', '_save_summary_steps': 100, '_keep_checkpoint_max': 5, '_task_type': 'worker', '_task_id': 0, '_is_chief': True, '_log_step_count_steps': 100, '_model_dir': '/content/datalab/training-data-analyst/courses/machine_learning/cloudmle/taxi_trained', '_global_id_in_cluster': 0, '_keep_checkpoint_every_n_hours': 10000, '_save_checkpoints_secs': 600, '_train_distribute': None, '_save_checkpoints_steps': None, '_num_worker_replicas': 1, '_tf_random_seed': None, '_session_config': None, '_num_ps_replicas': 0, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fa9f1835ba8>, '_master': ''}
INFO:tensorflow:Running training and evaluation locally (non-distributed).
INFO:tensorflow:Start train and evaluate loop. The evaluate will happen after 300 secs (eval_spec.throttle_secs) or training is fin

In [9]:
%bash
ls $PWD/taxi_trained/export/exporter/

1559849759


In [10]:
%writefile ./test.json
{"pickuplon": -73.885262,"pickuplat": 40.773008,"dropofflon": -73.987232,"dropofflat": 40.732403,"passengers": 2}

Writing ./test.json
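
The instances file holds one JSON object per line. When there are several instances to score, it can be easier to generate that file than to type it. Here is a small sketch of the serialization, using the same instance as above; the helper name `to_json_lines` is hypothetical, and the result is kept in a string rather than written to disk.

```python
import json

instances = [
    {"pickuplon": -73.885262, "pickuplat": 40.773008,
     "dropofflon": -73.987232, "dropofflat": 40.732403, "passengers": 2},
]


def to_json_lines(rows):
    """Serialize each instance as one JSON object per line."""
    return "\n".join(json.dumps(row) for row in rows) + "\n"


payload = to_json_lines(instances)
print(payload, end="")
```

Writing `payload` to `./test.json` would reproduce the file created by the cell above, up to whitespace.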


In [None]:
## local predict doesn't work with Python 3 yet
#%bash
#model_dir=$(ls ${PWD}/taxi_trained/export/exporter)
#gcloud ml-engine local predict \
#    --model-dir=${PWD}/taxi_trained/export/exporter/${model_dir} \
#    --json-instances=./test.json

<h2> Running locally using gcloud </h2>

In [11]:
%bash
rm -rf taxifare.tar.gz taxi_trained
gcloud ml-engine local train \
   --module-name=trainer.task \
   --package-path=${PWD}/taxifare/trainer \
   -- \
   --train_data_paths=${PWD}/taxi-train.csv \
   --eval_data_paths=${PWD}/taxi-valid.csv  \
   --train_steps=1000 \
   --output_dir=${PWD}/taxi_trained 

  from ._conv import register_converters as _register_converters
INFO:tensorflow:TF_CONFIG environment variable: {'job': {'args': ['--train_data_paths=/content/datalab/training-data-analyst/courses/machine_learning/cloudmle/taxi-train.csv', '--eval_data_paths=/content/datalab/training-data-analyst/courses/machine_learning/cloudmle/taxi-valid.csv', '--train_steps=1000', '--output_dir=/content/datalab/training-data-analyst/courses/machine_learning/cloudmle/taxi_trained'], 'job_name': 'trainer.task'}, 'task': {}, 'environment': 'cloud', 'cluster': {}}
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_train_distribute': None, '_model_dir': '/content/datalab/training-data-analyst/courses/machine_learning/cloudmle/taxi_trained', '_tf_random_seed': None, '_service': None, '_session_config': None, '_evaluation_master': '', '_save_checkpoints_secs': 600, '_save_summary_steps': 100, '_is_chief': True, '_log_step_count_steps': 100, '_task_id': 0, '_num_ps_replicas': 0, '_num_

When I ran it (due to random seeds, your results will be different), the `average_loss` (Mean Squared Error) on the evaluation dataset was 187, meaning that the RMSE was around 13.7.
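
The RMSE is simply the square root of the reported `average_loss`. A quick check, assuming the value of 187 from my run:

```python
import math

average_loss = 187.0  # mean squared error reported on the evaluation dataset
rmse = math.sqrt(average_loss)
print("RMSE = {:.2f}".format(rmse))
```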

In [12]:
from google.datalab.ml import TensorBoard
TensorBoard().start('{}/taxi_trained'.format(os.environ['PWD']))

  from ._conv import register_converters as _register_converters


4369

In [13]:
for pid in TensorBoard.list()['pid']:
  TensorBoard().stop(pid)
  print('Stopped TensorBoard with pid {}'.format(pid))

Stopped TensorBoard with pid 4369


If the above step (to stop TensorBoard) appears stalled, just move on to the next step. You don't need to wait for it to return.

In [14]:
!ls $PWD/taxi_trained

checkpoint				     model.ckpt-1000.index
eval					     model.ckpt-1000.meta
events.out.tfevents.1559849827.3d03893d7992  model.ckpt-1.data-00000-of-00001
export					     model.ckpt-1.index
graph.pbtxt				     model.ckpt-1.meta
model.ckpt-1000.data-00000-of-00001


<h2> Submit training job using gcloud </h2>

First copy the training data to the cloud.  Then, launch a training job.

After you submit the job, go to the cloud console (http://console.cloud.google.com) and select <b>Machine Learning | Jobs</b> to monitor progress.  

<b>Note:</b> Don't be concerned if the notebook stalls (with a blue progress bar) or returns with an error about being unable to refresh auth tokens. This is a long-lived Cloud job and work is going on in the cloud.  Use the Cloud Console link (above) to monitor the job.

In [21]:
%bash
echo $BUCKET
gsutil -m rm -rf gs://${BUCKET}/taxifare/smallinput/
gsutil -m cp ${PWD}/*.csv gs://${BUCKET}/taxifare/smallinput/

qwiklabs-gcp-4ec8a2116b8b4a75


Removing gs://qwiklabs-gcp-4ec8a2116b8b4a75/taxifare/smallinput/taxi-train.csv#1559850041676263...
Removing gs://qwiklabs-gcp-4ec8a2116b8b4a75/taxifare/smallinput/taxi-valid.csv#1559850041641499...
/ [2/2 objects] 100% Done
Operation completed over 2 objects.                                              
Copying file:///content/datalab/training-data-analyst/courses/machine_learning/cloudmle/taxi-train.csv [Content-Type=text/csv]...
Copying file:///content/datalab/training-data-analyst/courses/machine_learning/cloudmle/taxi-valid.csv [Content-Type=text/csv]...
/ [2/2 files][477.3 KiB/477.3 KiB] 100% Done

In [23]:
%%bash
OUTDIR=gs://${BUCKET}/taxifare/smallinput/taxi_trained
JOBNAME=lab3a_$(date -u +%y%m%d_%H%M%S)
echo $OUTDIR $REGION $JOBNAME
gsutil -m rm -rf $OUTDIR
gcloud ml-engine jobs submit training $JOBNAME \
   --region=$REGION \
   --module-name=trainer.task \
   --package-path=${PWD}/taxifare/trainer \
   --job-dir=$OUTDIR \
   --staging-bucket=gs://$BUCKET \
   --scale-tier=BASIC \
   --runtime-version=$TFVERSION \
   -- \
   --train_data_paths="gs://${BUCKET}/taxifare/smallinput/taxi-train*" \
   --eval_data_paths="gs://${BUCKET}/taxifare/smallinput/taxi-valid*"  \
   --output_dir=$OUTDIR \
   --train_steps=10000

gs://qwiklabs-gcp-4ec8a2116b8b4a75/taxifare/smallinput/taxi_trained us-central1 lab3a_190606_194649
jobId: lab3a_190606_194649
state: QUEUED


Removing gs://qwiklabs-gcp-4ec8a2116b8b4a75/taxifare/smallinput/taxi_trained/#1559850361993772...
Removing gs://qwiklabs-gcp-4ec8a2116b8b4a75/taxifare/smallinput/taxi_trained/checkpoint#1559850363370354...
Removing gs://qwiklabs-gcp-4ec8a2116b8b4a75/taxifare/smallinput/taxi_trained/eval/#1559850251251369...
Removing gs://qwiklabs-gcp-4ec8a2116b8b4a75/taxifare/smallinput/taxi_trained/eval/events.out.tfevents.1559850251.cmle-training-17203795334990632502#1559850251942305...
Removing gs://qwiklabs-gcp-4ec8a2116b8b4a75/taxifare/smallinput/taxi_trained/eval/events.out.tfevents.1559850365.cmle-training-14569149004279592833#1559850366407699...
Removing gs://qwiklabs-gcp-4ec8a2116b8b4a75/taxifare/smallinput/taxi_trained/events.out.tfevents.1559850242.cmle-training-17203795334990632502#1559850368213644...
Removing gs://qwiklabs-gcp-4ec8a2116b8b4a75/taxifare/smallinput/taxi_trained/events.out.tfevents.1559850356.cmle-training-14569149004279592833#1559850357239139...
/ [1/26 objects]   3% Done   

Don't be concerned if the notebook appears stalled (with a blue progress bar) or returns with an error about being unable to refresh auth tokens. This is a long-lived Cloud job and work is going on in the cloud. 

<b>Use the Cloud Console link to monitor the job and do NOT proceed until the job is done.</b>
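
If you would rather wait programmatically than watch the console, the idea is to poll the job state until it reaches a terminal value. The sketch below shows only the polling loop. The `get_state` callable is a stand-in: in a real notebook it might wrap `gcloud ml-engine jobs describe`, but here it replays a fixed sequence so the example runs without any cloud access. The function name, the polling interval, and the set of terminal states are my assumptions.

```python
import time

# Job states assumed to mean the job will not change any further.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def wait_for_job(get_state, poll_seconds=60, max_polls=120, sleep=time.sleep):
    """Poll get_state() until it returns a terminal state or polls run out."""
    state = None
    for _ in range(max_polls):
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)
    raise TimeoutError("job still in state {} after {} polls".format(state, max_polls))


# Stand-in for a real status lookup: replays a fixed sequence of states.
fake_states = iter(["QUEUED", "PREPARING", "RUNNING", "SUCCEEDED"])
final_state = wait_for_job(lambda: next(fake_states), sleep=lambda seconds: None)
print(final_state)
```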

<h2> Deploy model </h2>

Find out the actual name of the subdirectory where the model is stored and use it to deploy the model.  Deploying the model can take up to <b>5 minutes</b>.

In [27]:
%bash
gsutil ls gs://${BUCKET}/taxifare/smallinput/taxi_trained/export/exporter

gs://qwiklabs-gcp-4ec8a2116b8b4a75/taxifare/smallinput/taxi_trained/export/exporter/
gs://qwiklabs-gcp-4ec8a2116b8b4a75/taxifare/smallinput/taxi_trained/export/exporter/1559850541/
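
The export subdirectory is named with a timestamp, so when several exports exist the largest number is the most recent one. A minimal sketch of that selection, applied to the listing above (the helper name `latest_export` is hypothetical):

```python
def latest_export(paths):
    """Return the listed export directory with the largest numeric name."""
    candidates = []
    for path in paths:
        name = path.rstrip("/").rsplit("/", 1)[-1]
        if name.isdigit():  # skips the parent 'exporter/' entry
            candidates.append((int(name), path))
    if not candidates:
        raise ValueError("no timestamped export directories found")
    return max(candidates)[1]


listing = [
    "gs://qwiklabs-gcp-4ec8a2116b8b4a75/taxifare/smallinput/taxi_trained/export/exporter/",
    "gs://qwiklabs-gcp-4ec8a2116b8b4a75/taxifare/smallinput/taxi_trained/export/exporter/1559850541/",
]
print(latest_export(listing))
```

Comparing the names as integers avoids relying on the order in which the listing happens to be returned.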


In [28]:
%bash
MODEL_NAME="taxifare"
MODEL_VERSION="v1"
MODEL_LOCATION=$(gsutil ls gs://${BUCKET}/taxifare/smallinput/taxi_trained/export/exporter | tail -1)
echo "Run these commands one-by-one (the very first time, you'll create a model and then create a version)"
#gcloud ml-engine versions delete ${MODEL_VERSION} --model ${MODEL_NAME}
#gcloud ml-engine models delete ${MODEL_NAME}
gcloud ml-engine models create ${MODEL_NAME} --regions $REGION
gcloud ml-engine versions create ${MODEL_VERSION} --model ${MODEL_NAME} --origin ${MODEL_LOCATION} --runtime-version $TFVERSION

Run these commands one-by-one (the very first time, you'll create a model and then create a version)


ERROR: (gcloud.ml-engine.models.create) Resource in project [qwiklabs-gcp-4ec8a2116b8b4a75] is the subject of a conflict: Field: model.name Error: A model with the same name already exists.
- '@type': type.googleapis.com/google.rpc.BadRequest
  fieldViolations:
  - description: A model with the same name already exists.
    field: model.name
Creating version (this might take a few minutes)......
..................................................................................................................................................................................................................................................................................done.


<h2> Prediction </h2>

In [29]:
%bash
gcloud ml-engine predict --model=taxifare --version=v1 --json-instances=./test.json

PREDICTIONS
[-142.9744110107422]


In [30]:
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials
import json

credentials = GoogleCredentials.get_application_default()
api = discovery.build('ml', 'v1', credentials=credentials,
            discoveryServiceUrl='https://storage.googleapis.com/cloud-ml/discovery/ml_v1_discovery.json')

request_data = {'instances':
  [
      {
        'pickuplon': -73.885262,
        'pickuplat': 40.773008,
        'dropofflon': -73.987232,
        'dropofflat': 40.732403,
        'passengers': 2,
      }
  ]
}

parent = 'projects/%s/models/%s/versions/%s' % (PROJECT, 'taxifare', 'v1')
response = api.projects().predict(body=request_data, name=parent).execute()
print("response={0}".format(response))

response={'predictions': [{'predictions': [-142.9744110107422]}]}
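
The response nests each prediction inside a list of per-instance dictionaries. A small sketch of flattening it into plain numbers, using the response printed above; the helper name `extract_fares` is hypothetical, and the check for an `error` key is an assumption about how failures are reported.

```python
response = {'predictions': [{'predictions': [-142.9744110107422]}]}


def extract_fares(response):
    """Return one predicted fare per instance in the response."""
    if 'error' in response:  # assumed shape of a failed prediction call
        raise RuntimeError(response['error'])
    return [row['predictions'][0] for row in response['predictions']]


fares = extract_fares(response)
print(fares)
```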


<h2> Train on larger dataset </h2>

I have already followed the steps below and the files are already available. <b> You don't need to do the steps in this section. </b> In the next chapter (on feature engineering), we will avoid all this manual processing by using Cloud Dataflow.

Go to http://bigquery.cloud.google.com/ and type the query:
<pre>
SELECT
  (tolls_amount + fare_amount) AS fare_amount,
  pickup_longitude AS pickuplon,
  pickup_latitude AS pickuplat,
  dropoff_longitude AS dropofflon,
  dropoff_latitude AS dropofflat,
  passenger_count*1.0 AS passengers,
  'nokeyindata' AS key
FROM
  [nyc-tlc:yellow.trips]
WHERE
  trip_distance > 0
  AND fare_amount >= 2.5
  AND pickup_longitude > -78
  AND pickup_longitude < -70
  AND dropoff_longitude > -78
  AND dropoff_longitude < -70
  AND pickup_latitude > 37
  AND pickup_latitude < 45
  AND dropoff_latitude > 37
  AND dropoff_latitude < 45
  AND passenger_count > 0
  AND ABS(HASH(pickup_datetime)) % 1000 == 1
</pre>

This query now returns 1,000,000 rows (i.e., 100x the original dataset).  Export the result to CSV using the following steps (<b>I have already done this and made the resulting GCS data publicly available</b>, so you don't need to do it):
<ol>
<li> Click on the "Save As Table" button and note down the name of the dataset and table. </li>
<li> On the BigQuery console, find the newly exported table in the left-hand-side menu, and click on the name. </li>
<li> Click on "Export Table". </li>
<li> Supply your bucket name and give it the name train.csv (for example: gs://cloud-training-demos-ml/taxifare/ch3/train.csv). Note down what this is.  Wait for the job to finish (look at the "Job History" on the left-hand-side menu). </li>
<li> In the query above, change the final "== 1" to "== 2" and export this to Cloud Storage as valid.csv (e.g.  gs://cloud-training-demos-ml/taxifare/ch3/valid.csv). </li>
<li> Download the two files, remove the header line from each, and upload them back to GCS. </li>
</ol>
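
The query samples rows by hashing the pickup timestamp and keeping one bucket out of 1000; changing "== 1" to "== 2" selects a different, non-overlapping bucket for validation. The sketch below illustrates the same idea in Python. It uses MD5 from `hashlib` as the hash, which is not the function BigQuery's `HASH` uses, so the rows selected would differ; the timestamps are made up for the demonstration.

```python
import hashlib


def bucket_of(key, num_buckets=1000):
    """Map a string key to a repeatable bucket in [0, num_buckets)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


def split_for(key):
    """Assign a key to 'train', 'valid', or neither (None)."""
    bucket = bucket_of(key)
    if bucket == 1:
        return "train"
    if bucket == 2:
        return "valid"
    return None


# Hypothetical pickup timestamps, one per second for an hour.
timestamps = ["2014-01-01 00:{:02d}:{:02d}".format(m, s)
              for m in range(60) for s in range(60)]
assigned = [split_for(t) for t in timestamps]
print(assigned.count("train"), assigned.count("valid"))
```

Because the bucket depends only on the key, a given timestamp always lands in the same split, which keeps the training and validation sets disjoint across reruns.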

<p/>
<p/>

<h2> Run Cloud training on 1-million row dataset </h2>

This run took 60 minutes and used 1 million rows as input.  The model is exactly the same as above. The only changes are to the input (to use the larger dataset) and to the Cloud MLE tier (to use STANDARD_1 instead of BASIC; STANDARD_1 is approximately 10x more powerful than BASIC).  At the end of training, the loss was 32, but the RMSE (calculated on the validation dataset) was stubbornly at 9.03. So, simply adding more data doesn't help.

In [None]:
%%bash

XXXXX  this takes 60 minutes. if you are sure you want to run it, then remove this line.

OUTDIR=gs://${BUCKET}/taxifare/ch3/taxi_trained
JOBNAME=lab3a_$(date -u +%y%m%d_%H%M%S)
CRS_BUCKET=cloud-training-demos # use the already exported data
echo $OUTDIR $REGION $JOBNAME
gsutil -m rm -rf $OUTDIR
gcloud ml-engine jobs submit training $JOBNAME \
   --region=$REGION \
   --module-name=trainer.task \
   --package-path=${PWD}/taxifare/trainer \
   --job-dir=$OUTDIR \
   --staging-bucket=gs://$BUCKET \
   --scale-tier=STANDARD_1 \
   --runtime-version=$TFVERSION \
   -- \
   --train_data_paths="gs://${CRS_BUCKET}/taxifare/ch3/train.csv" \
   --eval_data_paths="gs://${CRS_BUCKET}/taxifare/ch3/valid.csv"  \
   --output_dir=$OUTDIR \
   --train_steps=100000

Copyright 2016 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License