# MLWorkbench Magics

This notebook does the same thing as the previous notebook, but uses cloud services for each step. The goal is to show how the MLWorkbench magics are used differently with ML Engine and other GCP products. This notebook does not cover MLWorkbench in detail (see the previous notebook for that), but points out what changes when running on the cloud.

If you changed the WORKSPACE_PATH variable in the previous notebook, you must also change it here. If you made no modifications, there is no need to update the next cell. The previous notebook must be executed before this one.

In [1]:
WORKSPACE_PATH = '/content/datalab/workspace/text_classification_20newsgroup'

## What changes from local to cloud usage of the MLWorkbench magics?

Generally, a few things need to change:

* all data sources or file paths must be on GCS
* the --cloud flag must be set
* optional cloud_config values can be set

Other than this, nothing else changes from local to cloud!
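
As a quick sanity check before adding --cloud to a cell, the first rule in the list above can be tested in plain Python. This is only a sketch: the helper name `non_gcs_paths` and the example paths are hypothetical and are not part of MLWorkbench.

```python
def non_gcs_paths(paths):
    """Return the entries that do not start with the gs:// scheme."""
    return [p for p in paths if not p.startswith('gs://')]

# Hypothetical example values; in this notebook the real paths are built from gcs_bucket.
candidate_paths = [
    'gs://my-bucket/news_clean_train.csv',
    'gs://my-bucket/analyze_output',
    '/content/datalab/workspace/news_clean_test.csv',  # local path, would need copying to GCS
]

offenders = non_gcs_paths(candidate_paths)
print(offenders)
```

An empty result means every path is already a GCS path.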

# Step 1: Move the data to GCS

The csv files, and every other input file to the MLWorkbench magics, must be on GCS. The first step is therefore to make a new GCS bucket and copy the local csv files to it.


In [2]:
# Make a bucket name. This bucket name should not exist.
gcs_bucket = 'gs://' + datalab_project_id() + '-mlworkbench-20news-lab' # Feel free to change this
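
If you change the bucket name, it may help to check it before running `gsutil mb`. The sketch below covers only a simplified subset of the GCS naming rules (3 to 63 characters; lowercase letters, digits, dashes, underscores and dots; a letter or digit at each end). The function name is hypothetical, and rules such as reserved prefixes or IP-like names are deliberately not checked.

```python
import re

# Simplified subset of the GCS bucket naming rules (assumption: dotted-name
# length limits, IP-like names and reserved words are not checked here).
_BUCKET_RE = re.compile(r'^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]\Z')

def bucket_name_looks_valid(gcs_url):
    """Return True if the bucket part of a gs:// URL passes the basic checks."""
    if not gcs_url.startswith('gs://'):
        return False
    # Keep only the bucket name, dropping any object path after it.
    bucket = gcs_url[len('gs://'):].split('/', 1)[0]
    return bool(_BUCKET_RE.match(bucket))

print(bucket_name_looks_valid('gs://cloud-ml-dev-mlworkbench-20news-lab'))
print(bucket_name_looks_valid('gs://My_Bucket'))
```

A name that fails this check will not work as a bucket name; a name that passes may still be rejected by GCS, for example because it is already taken.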

The next two cells make the bucket and copy the clean train and test csv files to it.

In [3]:
!gsutil mb $gcs_bucket

Creating gs://cloud-ml-dev-mlworkbench-20news-lab/...


In [4]:
!gsutil -m cp $WORKSPACE_PATH/news_clean_train.csv $WORKSPACE_PATH/news_clean_test.csv $gcs_bucket

Copying file:///content/datalab/workspace/text_classification_20newsgroup/news_clean_train.csv [Content-Type=text/csv]...
Copying file:///content/datalab/workspace/text_classification_20newsgroup/news_clean_test.csv [Content-Type=text/csv]...
- [2/2 files][ 10.7 MiB/ 10.7 MiB] 100% Done                                    
Operation completed over 2 objects/10.7 MiB.                                     


Check that the files are on GCS.

In [5]:
!gsutil ls $gcs_bucket

gs://cloud-ml-dev-mlworkbench-20news-lab/news_clean_test.csv
gs://cloud-ml-dev-mlworkbench-20news-lab/news_clean_train.csv


In [6]:
import google.datalab.contrib.mlworkbench.commands  # This loads the '%%ml' magics
import os

In [7]:
# Make some constant file paths

# Input files
train_csv_file = os.path.join(gcs_bucket, 'news_clean_train.csv')
eval_csv_file = os.path.join(gcs_bucket, 'news_clean_test.csv')

# For analyze step
analyze_output = os.path.join(gcs_bucket, 'analyze_output')

# For the transform step
transform_output = os.path.join(gcs_bucket, 'transform_output')
transformed_train_pattern = os.path.join(transform_output, 'features_train*')
transformed_eval_pattern = os.path.join(transform_output, 'features_eval*')

# For the training step
training_output = os.path.join(gcs_bucket, 'training_output')

# For the prediction steps
batch_predict_output = os.path.join(gcs_bucket, 'batch_predict_output')
evaluation_model = os.path.join(training_output, 'evaluation_model')
regular_model = os.path.join(training_output, 'model')

When we deploy the model, we will create an ML Engine model and two versions. Change the names below if desired. The model and version names should not already exist.

In [8]:
mlengine_model_name = 'datalab_mlworkbench_20news_model'
mlengine_evaluation_version_name = 'evaluation_version'
mlengine_regular_version_name = 'v1'

full_evaluation_model_name = mlengine_model_name + '.' + mlengine_evaluation_version_name
full_regular_model_name = mlengine_model_name + '.' + mlengine_regular_version_name
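
The full names built above follow a `<model>.<version>` convention. A minimal sketch of the reverse operation is shown below; the helper `split_full_model_name` is hypothetical, and it assumes the model name itself contains no dot.

```python
def split_full_model_name(full_name):
    """Split '<model>.<version>' into a (model, version) tuple."""
    model, sep, version = full_name.partition('.')
    if not sep or not model or not version:
        raise ValueError('expected "<model>.<version>", got %r' % (full_name,))
    return model, version

model, version = split_full_model_name('datalab_mlworkbench_20news_model.v1')
print('%s %s' % (model, version))
```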

# Step 2: Analyze the csv file
To run analyze in the cloud, the csv file must be on GCS (we copied it there in the cells above) and the --cloud flag must be used. Cloud analyze uses BigQuery as the backend.

In [9]:
%%ml analyze --cloud
output: $analyze_output
training_data:
  csv: $train_csv_file
  schema:
    - name: news_label
      type: STRING
    - name: text
      type: STRING
features:
  news_label:
    transform: target
  text:
    transform: bag_of_words

Analyzing column news_label...
column news_label analyzed.
Analyzing column text...
Updated property [core/project].
column text analyzed.
Updated property [core/project].


In [None]:
!gsutil ls $analyze_output

# Step 3: Transforming the input data

The output, analysis, and csv parameters must all be GCS paths. Unlike analyze, the transform step supports cloud options when run with cloud services; these are passed to the Dataflow job. Run '%%ml transform --help' for a list of cloud options.

In [10]:
%%ml transform --shuffle --cloud
output: $transform_output
analysis: $analyze_output
prefix: features_train
training_data:
  csv: $train_csv_file
cloud_config:
  num_workers: 2

  super(GcsIO, cls).__new__(cls, storage_client))
running sdist
running egg_info
creating trainer.egg-info
writing requirements to trainer.egg-info/requires.txt
writing trainer.egg-info/PKG-INFO
writing top-level names to trainer.egg-info/top_level.txt
writing dependency_links to trainer.egg-info/dependency_links.txt
writing manifest file 'trainer.egg-info/SOURCES.txt'
reading manifest file 'trainer.egg-info/SOURCES.txt'
writing manifest file 'trainer.egg-info/SOURCES.txt'

running check

creating trainer-1.0.0
creating trainer-1.0.0/trainer
creating trainer-1.0.0/trainer.egg-info
copying files to trainer-1.0.0...
copying setup.py -> trainer-1.0.0
copying trainer/__init__.py -> trainer-1.0.0/trainer
copying trainer/feature_transforms.py -> trainer-1.0.0/trainer
copying trainer/task.py -> trainer-1.0.0/trainer
copying trainer.egg-info/PKG-INFO -> trainer-1.0.0/trainer.egg-info
copying trainer.egg-info/SOURCES.txt -> trainer-1.0.0/trainer.egg-info
copying trainer.egg-info/dependency_link

Click the above link to see the Dataflow job. Note that control has returned to the notebook, so you can run other cells, but the Dataflow job is still running. The job takes about 5-10 minutes. It is up to you to wait for the job to finish before continuing this notebook.
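
Because the cell returns before the job finishes, one option is to poll for completion from the notebook. The helper below is a generic sketch and not an MLWorkbench or Dataflow API: `wait_until` and its parameters are hypothetical, and the `is_done` callable is something you would have to supply yourself, for example a function that checks whether the expected output files exist.

```python
import time

def wait_until(is_done, timeout_secs=900, poll_secs=30,
               sleep=time.sleep, clock=time.time):
    """Poll is_done() until it returns True or timeout_secs have elapsed.

    Returns True if is_done() succeeded, False if the timeout was reached.
    The sleep and clock arguments exist so the loop can be tested quickly.
    """
    deadline = clock() + timeout_secs
    while True:
        if is_done():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_secs)

# Toy usage: a check that succeeds on its third call, with sleeping disabled.
attempts = {'count': 0}

def fake_check():
    attempts['count'] += 1
    return attempts['count'] >= 3

finished = wait_until(fake_check, timeout_secs=5, poll_secs=0, sleep=lambda secs: None)
print(finished)
```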

We have to run transform on the eval set too. Because the dataset is small, Dataflow's startup time is longer than the transformation itself, so we run the next cell locally. If you wish, add --cloud to the next cell to run another Dataflow job. As all paths are on GCS, the output will be on GCS either way.

In [11]:
%%ml transform 
output: $transform_output
analysis: $analyze_output
prefix: features_eval
training_data:
  csv: $eval_csv_file

  super(GcsIO, cls).__new__(cls, storage_client))
2017-07-25 23:43:40.452248: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-25 23:43:40.452282: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-25 23:43:40.452309: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-07-25 23:43:40.452330: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-25 23:43:40.452345: W tensorflow/core/platform/cpu_

Note that more than one file may have been written for each of the training and eval sets. Sharding an input file can improve TensorFlow training performance, especially when running on a distributed cluster.

In [12]:
!gsutil ls $transform_output

gs://cloud-ml-dev-mlworkbench-20news-lab/transform_output/errors_features_eval-00000-of-00001.txt
gs://cloud-ml-dev-mlworkbench-20news-lab/transform_output/errors_features_train-00000-of-00001.txt
gs://cloud-ml-dev-mlworkbench-20news-lab/transform_output/features_eval-00000-of-00008.tfrecord.gz
gs://cloud-ml-dev-mlworkbench-20news-lab/transform_output/features_eval-00001-of-00008.tfrecord.gz
gs://cloud-ml-dev-mlworkbench-20news-lab/transform_output/features_eval-00002-of-00008.tfrecord.gz
gs://cloud-ml-dev-mlworkbench-20news-lab/transform_output/features_eval-00003-of-00008.tfrecord.gz
gs://cloud-ml-dev-mlworkbench-20news-lab/transform_output/features_eval-00004-of-00008.tfrecord.gz
gs://cloud-ml-dev-mlworkbench-20news-lab/transform_output/features_eval-00005-of-00008.tfrecord.gz
gs://cloud-ml-dev-mlworkbench-20news-lab/transform_output/features_eval-00006-of-00008.tfrecord.gz
gs://cloud-ml-dev-mlworkbench-20news-lab/transform_output/features_eval-00007-of-00008.tfrecord.gz
g

An errors file is always written, even if there are no errors. Let's check that the error files are empty.

In [13]:
!gsutil ls -lh $transform_output/errors*

       0 B  2017-07-25T23:43:57Z  gs://cloud-ml-dev-mlworkbench-20news-lab/transform_output/errors_features_eval-00000-of-00001.txt
       0 B  2017-07-25T23:41:17Z  gs://cloud-ml-dev-mlworkbench-20news-lab/transform_output/errors_features_train-00000-of-00001.txt
TOTAL: 2 objects, 0 bytes (0 B)
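
The wildcard patterns defined earlier (`features_train*`, `features_eval*`) are what select the shards and skip the `errors_` files. The sketch below illustrates this with Python's `fnmatch` on a hand-written listing. The bucket name is hypothetical, the train shard name is assumed to follow the same scheme as the eval shards, and `fnmatch` only approximates how GCS and TensorFlow expand wildcards.

```python
import fnmatch
import posixpath

transform_output = 'gs://my-bucket/transform_output'  # hypothetical bucket

# Hand-written listing in the style of the gsutil output above.
listing = [
    transform_output + '/errors_features_eval-00000-of-00001.txt',
    transform_output + '/errors_features_train-00000-of-00001.txt',
    transform_output + '/features_eval-00000-of-00008.tfrecord.gz',
    transform_output + '/features_eval-00001-of-00008.tfrecord.gz',
    transform_output + '/features_train-00000-of-00008.tfrecord.gz',  # assumed name
]

eval_pattern = posixpath.join(transform_output, 'features_eval*')

# fnmatchcase does a case-sensitive match on every platform.
eval_shards = [p for p in listing if fnmatch.fnmatchcase(p, eval_pattern)]
for shard in eval_shards:
    print(shard)
```

The `errors_features_eval` file is not selected because the pattern requires the file name to begin with `features_eval`.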


# Step 4: Training a TensorFlow model
Again, see '%%ml train --help' for a list of cloud options. The cell below will run with default cloud options. Note that every file path must be a GCS path. You may want to change the cloud_config region value. Because the dataset is small, the cloud training will take more time than local training because of startup costs. It should take about 10 minutes.

In [14]:
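# Training should use an empty output folder. So if you run training multiple times,
# use different folders or remove the output from the previous run.
# If there is nothing to remove yet, gsutil reports that the objects could not be
# removed; that message can be ignored.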
!gsutil -m rm -fr $training_output

CommandException: 1 files/objects could not be removed.


In [15]:
%%ml train --cloud
output: $training_output
analysis: $analyze_output
training_data:
  transformed: $transformed_train_pattern
evaluation_data:
  transformed: $transformed_eval_pattern
model_args:
  l2-regularization: 5
  model: linear_classification
  top-n: 4
  learning-rate: 1
  max-steps: 5000
  train-batch-size: 500
  eval-batch-size: 500
  save-checkpoints-secs: 60
cloud_config:
  scale_tier: STANDARD_1
  region: us-central1
  runtime_version: '1.2'
    

It is up to you to wait for the training job to finish before continuing this notebook.

In [17]:
!gsutil ls  $training_output

gs://cloud-ml-dev-mlworkbench-20news-lab/training_output/schema_without_target.json
gs://cloud-ml-dev-mlworkbench-20news-lab/training_output/evaluation_model/
gs://cloud-ml-dev-mlworkbench-20news-lab/training_output/model/
gs://cloud-ml-dev-mlworkbench-20news-lab/training_output/staging/
gs://cloud-ml-dev-mlworkbench-20news-lab/training_output/train/


# Step 5: Deploying the model

See the previous notebook about the output models of training and the naming of ML Engine models. 

Below, we create a new ML Engine model and two ML Engine model versions, one for each TensorFlow model.

In [18]:
from google.datalab.ml import Models, ModelVersions

In [19]:
# Create an ML Engine model.
# If the model already exists, comment out this line.
Models().create(mlengine_model_name)  

{u'name': u'projects/cloud-ml-dev/models/datalab_mlworkbench_20news_model',
 u'regions': [u'us-central1']}

In [20]:
# Deploy the regular model as an ML Engine version.
ModelVersions(mlengine_model_name).deploy(
    version_name=mlengine_regular_version_name,
    path=regular_model,
    runtime_version='1.2')  

Waiting for operation "projects/cloud-ml-dev/operations/create_datalab_mlworkbench_20news_model_v1-1501026791078"
Done.


In [21]:
# Deploy the evaluation model as an ML Engine version.
ModelVersions(mlengine_model_name).deploy(
    version_name=mlengine_evaluation_version_name,
    path=evaluation_model,
    runtime_version='1.2')  

Waiting for operation "projects/cloud-ml-dev/operations/create_datalab_mlworkbench_20news_model_evaluation_version-1501027275238"
Done.


# Step 6: Evaluation using batch prediction

In the example below, we run evaluation on the deployed evaluation model. Note that the output and input file paths are on GCS. Also, model is not a path; it is the name of the deployed model.

In [23]:
%%ml batch_predict --cloud
model: $full_evaluation_model_name
output: $batch_predict_output
format: json
prediction_data:
  csv: $eval_csv_file
cloud_config:
  job_id: mlworkbench_batch_prediction_job_name_3
  region: us-central1

In [26]:
!gsutil ls -lh $batch_predict_output

       0 B  2017-07-26T00:08:16Z  gs://cloud-ml-dev-mlworkbench-20news-lab/batch_predict_output/prediction.errors_stats-00000-of-00001
  2.27 MiB  2017-07-26T00:07:54Z  gs://cloud-ml-dev-mlworkbench-20news-lab/batch_predict_output/prediction.results-00000-of-00001
TOTAL: 2 objects, 2383983 bytes (2.27 MiB)


In [30]:
!gsutil cat $batch_predict_output/prediction.results* | head -n 1

{"target": "rec.autos", "probability": 0.35624340176582336, "probability_4": 0.06863237172365189, "predicted": "rec.autos", "probability_3": 0.07430016994476318, "probability_2": 0.1993453949689865, "predicted_2": "comp.sys.mac.hardware", "predicted_3": "rec.sport.baseball", "predicted_4": "soc.religion.christian"}
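
Each line of the results file is a JSON object with the target and the top predictions, so accuracy can be computed with the standard library. The following is a sketch under assumptions: the function name is hypothetical, the field names are taken from the sample line above, and the second sample record is made up for illustration. In practice the lines would come from the prediction results files rather than from a literal list.

```python
import json

def accuracy_from_json_lines(lines, top_n=4):
    """Return (top-1 accuracy, top-n accuracy) for an iterable of JSON lines."""
    total = top1_hits = topn_hits = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        row = json.loads(line)
        # 'predicted' is the best guess; 'predicted_2' onwards are the runners-up.
        guesses = [row['predicted']]
        guesses += [row.get('predicted_%d' % i) for i in range(2, top_n + 1)]
        total += 1
        if row['target'] == guesses[0]:
            top1_hits += 1
        if row['target'] in guesses:
            topn_hits += 1
    if total == 0:
        raise ValueError('no prediction rows found')
    return top1_hits / float(total), topn_hits / float(total)

sample_lines = [
    # First record copied from the output above.
    '{"target": "rec.autos", "probability": 0.35624340176582336, '
    '"probability_4": 0.06863237172365189, "predicted": "rec.autos", '
    '"probability_3": 0.07430016994476318, "probability_2": 0.1993453949689865, '
    '"predicted_2": "comp.sys.mac.hardware", "predicted_3": "rec.sport.baseball", '
    '"predicted_4": "soc.religion.christian"}',
    # Second record is made up: the target is only the second guess.
    '{"target": "sci.space", "predicted": "rec.autos", "predicted_2": "sci.space", '
    '"predicted_3": "sci.med", "predicted_4": "misc.forsale"}',
]

top1, top4 = accuracy_from_json_lines(sample_lines)
print('top-1: %.2f, top-4: %.2f' % (top1, top4))
```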


# Step 7: Instant prediction



## Prediction within MLWorkbench

The MLWorkbench also supports running prediction on the deployed model directly.

In [31]:
%%ml predict --cloud
model: $full_regular_model_name
headers: text
prediction_data:
  - nasa
  - windows xp

predicted,predicted_2,predicted_3,predicted_4,probability,probability_2,probability_3,probability_4,text
sci.space,rec.motorcycles,rec.sport.baseball,rec.autos,0.129927,0.06909,0.068524,0.060403,nasa
comp.os.ms-windows.misc,comp.graphics,rec.motorcycles,comp.windows.x,0.200934,0.064302,0.064078,0.062777,windows xp
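
The output above is plain CSV with a header row, so it can be parsed with the `csv` module if you want to work with the predictions programmatically. This sketch assumes the cell output has already been captured as text; how to capture it is not shown here, and the variable names are hypothetical.

```python
import csv
import io

# Text copied from the prediction output above.
prediction_csv = u"""predicted,predicted_2,predicted_3,predicted_4,probability,probability_2,probability_3,probability_4,text
sci.space,rec.motorcycles,rec.sport.baseball,rec.autos,0.129927,0.06909,0.068524,0.060403,nasa
comp.os.ms-windows.misc,comp.graphics,rec.motorcycles,comp.windows.x,0.200934,0.064302,0.064078,0.062777,windows xp
"""

# DictReader uses the first line as the column names.
rows = list(csv.DictReader(io.StringIO(prediction_csv)))

for row in rows:
    print('%s -> %s (%.3f)' % (row['text'], row['predicted'], float(row['probability'])))
```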


## Prediction from a python client

See the previous notebook in this sequence for the example. 

# Step 8: Clean up

This section is optional. We will delete all the GCP resources and local files created in this sequence of notebooks. If you are not ready to delete anything, don't run any of the following cells.


Delete the deployed versions and model.

In [33]:
ModelVersions(mlengine_model_name).delete(mlengine_evaluation_version_name)

Waiting for operation "projects/cloud-ml-dev/operations/delete_datalab_mlworkbench_20news_model_evaluation_version-1501027934074"
Done.


In [34]:
ModelVersions(mlengine_model_name).delete(mlengine_regular_version_name)

Waiting for operation "projects/cloud-ml-dev/operations/delete_datalab_mlworkbench_20news_model_v1-1501027965090"
Done.


In [35]:
Models().delete(mlengine_model_name)

Waiting for operation "projects/cloud-ml-dev/operations/delete_model_datalab_mlworkbench_20news_model-1501027983"
Done.


Delete the files in the GCS bucket, and then delete the bucket itself.

In [37]:
!gsutil -m rm -r $gcs_bucket

Removing gs://cloud-ml-dev-mlworkbench-20news-lab/news_clean_test.csv#1501025712935123...
Removing gs://cloud-ml-dev-mlworkbench-20news-lab/news_clean_train.csv#1501025713034551...
Removing gs://cloud-ml-dev-mlworkbench-20news-lab/analyze_output/features.json#1501025749387412...
Removing gs://cloud-ml-dev-mlworkbench-20news-lab/analyze_output/stats.json#1501025747335926...
Removing gs://cloud-ml-dev-mlworkbench-20news-lab/analyze_output/vocab_news_label.csv#1501025735943598...
Removing gs://cloud-ml-dev-mlworkbench-20news-lab/batch_predict_output/prediction.errors_stats-00000-of-00001#1501027696147452...
Removing gs://cloud-ml-dev-mlworkbench-20news-lab/training_output/evaluation_model/#1501026698479947...
Removing gs://cloud-ml-dev-mlworkbench-20news-lab/batch_predict_output/prediction.results-00000-of-00001#1501027674964638...
Removing gs://cloud-ml-dev-mlworkbench-20news-lab/analyze_output/schema.json#1501025748239368...
Removing gs://cloud-ml-dev-mlworkbench-20news-lab/analyze_outp

Remove the local files created by the previous notebooks.

In [38]:
!rm -fr $WORKSPACE_PATH