# MLWorkbench Magics

This notebook repeats the steps of the previous notebook, but runs each one on cloud services. The goal is to show how the MLWorkbench magics are used differently with ML Engine and other GCP products. Cloud services run each step in a distributed way, which helps with large datasets. Most cloud commands have a fixed startup cost, however, so on the small dataset used here the steps may run slower than they did locally in the previous notebook.

If you changed the WORKSPACE_PATH variable in the previous notebook, make the same change here; otherwise leave the next cell as it is. The previous notebook must be run before this one.

In [None]:
WORKSPACE_PATH = '/content/datalab/workspace/structured_data_classification_stackoverflow'

# What changes from local to cloud usage of the MLWorkbench magics?

Generally, a few things need to change:
* all data sources or file paths must be on GCS
* the --cloud flag must be set
* optional cloud_config values can be set

Other than this, nothing else changes from local to cloud!
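
The first requirement in the list above can be checked before a cloud job is launched. The following is a minimal sketch: `find_non_gcs_paths` is a hypothetical helper, not part of MLWorkbench, and it only looks at the `gs://` prefix. It does not check that the object actually exists.

```python
def find_non_gcs_paths(paths):
    """Return the entries of `paths` that are not GCS URIs (hypothetical helper)."""
    return [p for p in paths if not p.startswith('gs://')]


# Example with made-up paths: only the local one is reported.
candidates = ['gs://my-bucket/clean_input/train.csv', '/content/datalab/train.csv']
print(find_non_gcs_paths(candidates))
```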

# Step 1: Move the data to GCS

The csv files and all other input files for the MLWorkbench magics must be on GCS. The first step is therefore to create a new GCS bucket and copy the local csv files to it.

Because we will deploy a model to ML Engine, we also need a GCS location to store files. Bucket names must be unique, so change the name below if it is already taken.

In [None]:
# Make a bucket name. This bucket name should not exist.
# If the bucket does exist, skip the next cell.
gcs_bucket = 'gs://' + datalab_project_id() + '-mlworkbench-stackoverflow-lab3' # Feel free to change this

In [None]:
# Make the bucket
!gsutil mb $gcs_bucket
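
Bucket names must also follow the GCS naming rules, which matters here because the name is derived from the project id. The sketch below checks a name against a simplified subset of those rules (3 to 63 characters; lowercase letters, digits, dashes, underscores and dots; a letter or digit at both ends). `looks_like_valid_bucket` is a hypothetical helper, and the full rules contain cases it does not cover.

```python
import re

# Simplified pattern: 3-63 characters, lowercase letters, digits, '-', '_' or '.',
# starting and ending with a letter or digit. The full GCS rules have more cases.
_BUCKET_RE = re.compile(r'^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$')


def looks_like_valid_bucket(gcs_url):
    """Check a 'gs://name' URL against the simplified pattern (hypothetical helper)."""
    if not gcs_url.startswith('gs://'):
        return False
    return bool(_BUCKET_RE.match(gcs_url[len('gs://'):]))


# 'my-project' is a made-up project id used only for illustration.
print(looks_like_valid_bucket('gs://my-project-mlworkbench-stackoverflow-lab3'))
print(looks_like_valid_bucket('gs://My_Project'))
```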

In [None]:
import google.datalab.contrib.mlworkbench.commands # this loads the %%ml commands

In [None]:
import os
import csv
import re
import pandas as pd
import six
import string
import random
import numpy as np
import json
from tensorflow.python.lib.io import file_io

In [None]:
# Clean local data files
local_clean_folder = os.path.join(WORKSPACE_PATH, 'clean_input')
local_train_data_path = os.path.join(local_clean_folder, 'train.csv')
local_eval_data_path = os.path.join(local_clean_folder, 'eval.csv')
local_schema_path = os.path.join(local_clean_folder, 'schema.json')
local_transform_path = os.path.join(local_clean_folder, 'transforms.json')

# Clean GCS data files
clean_folder = os.path.join(gcs_bucket, 'clean_input')
train_data_path = os.path.join(clean_folder, 'train.csv')
eval_data_path = os.path.join(clean_folder, 'eval.csv')
schema_path = os.path.join(clean_folder, 'schema.json')
transform_path = os.path.join(clean_folder, 'transforms.json')


# For analyze step
analyze_output = os.path.join(gcs_bucket, 'analyze_output')

# For the transform step
transform_output = os.path.join(gcs_bucket, 'transform_output')
transformed_train_pattern = os.path.join(transform_output, 'features_train*')
transformed_eval_pattern = os.path.join(transform_output, 'features_eval*')

# For the training step
training_output = os.path.join(gcs_bucket, 'training_output')

# For the prediction steps
batch_predict_output = os.path.join(gcs_bucket, 'batch_predict_output')
evaluation_model = os.path.join(training_output, 'evaluation_model')
regular_model = os.path.join(training_output, 'model')

# For deploying the model
mlengine_model_name = 'stackoverflowmodel'
mlengine_evaluation_version_name = 'evaluation_version'
mlengine_regular_version_name = 'example'

full_evaluation_model_name = mlengine_model_name + '.' + mlengine_evaluation_version_name
full_regular_model_name = mlengine_model_name + '.' + mlengine_regular_version_name
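
The cell above builds GCS paths with `os.path.join`, which produces forward slashes on the Linux environment Datalab runs in. As a sketch of a variant that does not depend on the operating system, `posixpath.join` always uses `/`. `gcs_join` is a hypothetical helper name and `my-bucket` is a made-up bucket.

```python
import posixpath


def gcs_join(bucket_url, *parts):
    """Join path components under a 'gs://' URL with forward slashes (hypothetical helper)."""
    return posixpath.join(bucket_url, *parts)


print(gcs_join('gs://my-bucket', 'clean_input', 'train.csv'))
```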

In [None]:
# Assert the local files exist before we copy them.
assert(os.path.isfile(local_train_data_path) 
    and os.path.isfile(local_eval_data_path) 
    and os.path.isfile(local_schema_path) 
    and os.path.isfile(local_transform_path))
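
A bare `assert` does not say which file is missing. One possible alternative, sketched here with a hypothetical `missing_files` helper and a temporary directory standing in for the workspace, is to collect the missing paths and report them.

```python
import os
import tempfile


def missing_files(paths):
    """Return the paths that do not point at an existing file (hypothetical helper)."""
    return [p for p in paths if not os.path.isfile(p)]


# Demonstration in a temporary directory: one file is created, one is not.
tmp_dir = tempfile.mkdtemp()
present = os.path.join(tmp_dir, 'train.csv')
with open(present, 'w') as f:
    f.write('a,b\n')
absent = os.path.join(tmp_dir, 'eval.csv')

print(missing_files([present, absent]))
```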

In [None]:
!gsutil -m cp -r $local_clean_folder  $gcs_bucket

In [None]:
# Look at the copied files
!gsutil ls -R $gcs_bucket

# Step 2: Analyze the csv file

To run analyze in the cloud, the csv file must be on GCS (we copied it there in the cells above) and the --cloud flag must be set. Cloud analyze uses BigQuery as its backend.

In [None]:
# Load the features and schema into memory
with open(local_schema_path) as f:
    schema = json.load(f)

with open(local_transform_path) as f:
    transforms = json.load(f)

In [None]:
%%ml analyze --cloud
output: $analyze_output
training_data:
    csv: $train_data_path
    schema: $schema
features: $transforms

In [None]:
!gsutil ls $analyze_output

# Step 3: Transform the input data

The output, analysis, and csv parameters must all be GCS paths. Unlike analyze, the cloud transform step accepts cloud options, which are passed to the Dataflow job. Run '%%ml transform --help' for a list of cloud options.

In [None]:
!gsutil rm -r -f $transform_output

In [None]:
%%ml transform --shuffle --cloud
output: $transform_output
analysis: $analyze_output
prefix: features_train
training_data:
    csv: $train_data_path
cloud_config:
    num_workers: 5        

Click the link above to see the Dataflow job. Control returns to the notebook right away, so you can run other cells, but the Dataflow job is still running. The job takes about 10-20 minutes. It is up to you to wait for the job to finish before continuing with this notebook.
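
Waiting for a background job can be automated with a polling loop. The sketch below is generic and makes no claim about how MLWorkbench or Dataflow report completion: `wait_until` is a hypothetical helper, and the predicate here is a stand-in that succeeds on its third call.

```python
import time


def wait_until(predicate, timeout_secs, poll_secs):
    """Call `predicate` until it returns True or `timeout_secs` elapse (hypothetical helper)."""
    deadline = time.time() + timeout_secs
    while True:
        if predicate():
            return True
        if time.time() >= deadline:
            return False
        time.sleep(poll_secs)


# Stand-in predicate that succeeds on the third call. In the notebook, the
# predicate would instead check whatever signals that the job has finished.
calls = {'count': 0}


def job_finished():
    calls['count'] += 1
    return calls['count'] >= 3


print(wait_until(job_finished, timeout_secs=5, poll_secs=0.01))
```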

We have to run transform on the eval set too. Because that dataset is small, Dataflow's startup time is longer than the transformation itself, so we run the next cell locally. If you wish, add --cloud to the next cell to run another Dataflow job. Because all paths are on GCS, the output is written to GCS either way.

In [None]:
%%ml transform
output: $transform_output
analysis: $analyze_output
prefix: features_eval
training_data:
    csv: $eval_data_path

In [None]:
# Let's look at the output
!gsutil ls $transform_output

In [None]:
# Error files are written even if there are no errors.
# Check that they are empty
!gsutil cat $transform_output/errors* | wc
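
The same emptiness check can be expressed in Python. The sketch below runs against local files with the standard `glob` module as a stand-in for GCS; the file names are made up for the demonstration, and `count_error_lines` is a hypothetical helper.

```python
import glob
import os
import tempfile


def count_error_lines(pattern):
    """Count non-blank lines across all files matching `pattern` (hypothetical helper)."""
    total = 0
    for path in glob.glob(pattern):
        with open(path) as f:
            total += sum(1 for line in f if line.strip())
    return total


# Create two empty error files (made-up names) in a temporary directory.
demo_dir = tempfile.mkdtemp()
for name in ('errors_train-00000-of-00001.txt', 'errors_eval-00000-of-00001.txt'):
    open(os.path.join(demo_dir, name), 'w').close()

print(count_error_lines(os.path.join(demo_dir, 'errors*')))
```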

# Step 4: Training

Again, see '%%ml train --help' for a list of cloud options. The cell below runs with the default cloud options. Note that every file path must be a GCS path. You may want to change the cloud_config region value. Because the dataset is small, startup costs make cloud training slower than local training; it should take about 10 minutes.

Unlike the previous notebook, we train on the transformed output, although the csv files could have been used instead.

In [None]:
# Training should use an empty output folder. So if you run training multiple times,
# use different folders or remove the output from the previous run.
!gsutil rm -fr $training_output

In [None]:
%%ml train --cloud
output: $training_output
analysis: $analyze_output
training_data:
    transformed: $transformed_train_pattern
evaluation_data:
    transformed: $transformed_eval_pattern
model_args:
    model: dnn_classification
    hidden-layer-size1: 100
    max-steps: 5000
    top-n: 2
    save-checkpoints-secs: 60
cloud_config:
    scale_tier: STANDARD_1
    region: us-central1
    runtime_version: '1.2'        

It is up to you to wait for the training job to finish before continuing this notebook.

In [None]:
!gsutil ls  $training_output

# Step 5: Deploying the model

See the previous notebook for details on the models that training outputs and on how ML Engine models are named.
Below, we create a new ML Engine model and two ML Engine model versions, one for each TensorFlow model.

In [None]:
from google.datalab.ml import Models, ModelVersions

In [None]:
# Create an ML Engine model.
# If the model already exists, comment out this line.
Models().create(mlengine_model_name)

In [None]:
# Create an ML Engine version for the regular model
ModelVersions(mlengine_model_name).deploy(
    version_name=mlengine_regular_version_name,
    path=regular_model,
    runtime_version='1.2')

In [None]:
# Create an ML Engine version for the evaluation model
ModelVersions(mlengine_model_name).deploy(
    version_name=mlengine_evaluation_version_name,
    path=evaluation_model,
    runtime_version='1.2')

# Step 6: Evaluation using batch prediction

In the example below, we run evaluation on the deployed evaluation model. Note that the input and output file paths are on GCS. Also, model is not a path; it is the name of the deployed model, written as the model name and the version name joined by a dot.

In [None]:
%%ml batch_predict --cloud
model: $full_evaluation_model_name
output: $batch_predict_output
format: json
prediction_data:
  csv: $eval_data_path
cloud_config:
  job_id: mlworkbench_batch_prediction_job_name_4
  region: us-central1    
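
The `model` value used above joins the model name and the version name with a dot, as the setup cell does for `full_evaluation_model_name`. A small sketch of building and splitting such names follows; both function names are hypothetical.

```python
def full_model_name(model_name, version_name):
    """Build a 'model.version' name the same way the setup cell does."""
    return model_name + '.' + version_name


def split_model_name(full_name):
    """Split a 'model.version' name at the first dot."""
    model_name, version_name = full_name.split('.', 1)
    return model_name, version_name


name = full_model_name('stackoverflowmodel', 'evaluation_version')
print(name)
print(split_model_name(name))
```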

In [None]:
!gsutil ls $batch_predict_output

In [None]:
!gsutil cat $batch_predict_output/prediction.errors* | wc -l

In [None]:
!gsutil cat $batch_predict_output/prediction.results* | head -n 1
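
With `format: json`, each line of the results is a JSON object, so the output can be scored with a few lines of Python. The sketch below assumes the objects carry `target` and `predicted` keys; check the first line printed above for the actual key names. The sample records are made up for illustration, and `accuracy_from_json_lines` is a hypothetical helper.

```python
import json


def accuracy_from_json_lines(lines, target_key='target', predicted_key='predicted'):
    """Fraction of records whose predicted value equals the target (hypothetical helper)."""
    correct = 0
    total = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        total += 1
        if record[target_key] == record[predicted_key]:
            correct += 1
    return float(correct) / total if total else 0.0


# Made-up records; the key names are an assumption about the output format.
sample_lines = [
    '{"target": "A", "predicted": "A"}',
    '{"target": "B", "predicted": "A"}',
    '{"target": "B", "predicted": "B"}',
    '{"target": "A", "predicted": "A"}',
]
print(accuracy_from_json_lines(sample_lines))
```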

# Step 7: Instant prediction

## Prediction within MLWorkbench
MLWorkbench also supports running prediction directly against the deployed model.

In [None]:
headers_string = ','.join([col['name'] for col in schema if col['name'] != schema[1]['name']])
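
The expression above joins every column name in the schema except the second one (index 1), which is presumably the column being predicted. The toy schema below illustrates the effect; its column names and the `type` field are made up for the example.

```python
# A made-up schema: a list of dicts, each with at least a 'name' key.
toy_schema = [
    {'name': 'id', 'type': 'INTEGER'},
    {'name': 'label', 'type': 'STRING'},
    {'name': 'country', 'type': 'STRING'},
]

# Drop the column at index 1 and join the remaining names with commas.
excluded_name = toy_schema[1]['name']
toy_headers = ','.join(col['name'] for col in toy_schema if col['name'] != excluded_name)
print(toy_headers)
```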

In [None]:
%%ml predict --cloud
model: $full_regular_model_name
headers: $headers_string
prediction_data:
    - 1,"Yes, both",United States,No,"Not employed, and not looking for work",Secondary school,,,,,2 to 3 years,,,,,,,,,,,,,,,,,"With a soft ""g,"" like ""jiff""",Strongly agree,Strongly agree,Agree,Disagree,Strongly agree,Agree,Agree,Disagree,Somewhat agree,Disagree,Strongly agree,Strongly agree,Strongly disagree,Agree,Agree,Disagree,Agree,"I'm not actively looking, but I am open to new opportunities",0.0,Not applicable/ never,Very important,Very important,Important,Very important,Very important,Very important,Important,Very important,Very important,Very important,Very important,Very important,Somewhat important,Not very important,Somewhat important,Stock_options Vacation/days_off Remote_options,Yes,Other,,,Important,Important,Important,Somewhat important,Important,Not very important,Not very important,Not at all important,Somewhat important,Very important,,,Tabs,,Online_course Open_source_contributions,,,,6:00 AM,Swift,Swift,,,,,iOS,iOS,Atom Xcode,Turn on some music,,,,,,,,,,,,Somewhat satisfied,Not very satisfied,Not at all satisfied,Very satisfied,Satisfied,Not very satisfied,,,,,,,,,,,,I have created a CV or Developer Story on Stack Overflow,9.0,Desktop iOS_app,At least once each week,Haven't done at all,Once or twice,Haven't done at all,Haven't done at all,Several times,Several times,Once or twice,Somewhat agree,Strongly disagree,Strongly disagree,Strongly agree,Agree,Strongly agree,Strongly agree,Strongly disagree,Male,High school,White_or_of_European_descent,Strongly disagree,Strongly agree,Disagree,Strongly agree,,
    - 7,"Yes, both",United States,No,Employed full-time,Master's degree,A non-computer-focused engineering discipline,"Less than half the time, but at least one day each week",20 to 99 employees,Government agency or public school/university,9 to 10 years,8 to 9 years,,,,,Data_scientist,7.0,6.0,,,,,,,,,"With a hard ""g,"" like ""gift""",,,,,,,,,,,,,,,,,,"I'm not actively looking, but I am open to new opportunities",1.0,More than 4 years ago,Somewhat important,Very important,Not very important,Important,Important,Very important,Important,Important,Important,Very important,Very important,Very important,Somewhat important,Not very important,Very important,Health_benefits Equipment Professional_development_sponsorship Education_sponsorship Remote_options,Yes,,,"A friend, family member, or former colleague told me",Very important,Important,Important,Somewhat important,Somewhat important,Somewhat important,Somewhat important,Somewhat important,Not very important,Very important,,,Spaces,,Online_course Part-time/evening_course On-the-job_training Self-taught Open_source_contributions,Official_documentation Trade_book Textbook Stack_Overflow_Q&A Friends_network Built-in_help,,,7:00 AM,Matlab Python,JavaScript Julia Matlab Python R SQL,,Hadoop Node.js,SQLite,MongoDB SQL_Server PostgreSQL SQLite,Windows_Desktop,Arduino Raspberry_Pi,Sublime_Text IPython_/_Jupyter Visual_Studio_Code,Turn on some music,,,,,,,,,,,,Satisfied,Very satisfied,Very satisfied,Satisfied,Satisfied,Very satisfied,Some influence,No influence at all,Not much influence,Not much influence,A lot of influence,A lot of influence,Some influence,No influence at all,No influence at all,No influence at all,Not much influence,I have created a CV or Developer Story on Stack Overflow,8.0,Desktop iOS_browser iOS_app,Several times,Once or twice,Haven't done at all,Haven't done at all,Haven't done at all,Haven't done at all,Several times,At least once each day,Somewhat agree,Disagree,Disagree,Agree,Agree,Strongly agree,Agree,Disagree,Male,A doctoral degree,White_or_of_European_descent,Disagree,Agree,Disagree,Agree,,
    - 14,"Yes, both",Germany,No,Employed full-time,Some college/university study without earning a bachelor's degree,Computer science or software engineering,"Less than half the time, but at least one day each week",Fewer than 10 employees,Venture-funded startup,15 to 16 years,15 to 16 years,,Web_developer,Full stack Web developer,,,8.0,6.0,,,,,,,,,"With a hard ""g,"" like ""gift""",,,,,,,,,,,,,,,,,,I am actively looking for a job,3.0,Between 1 and 2 years ago,Somewhat important,Important,Important,Somewhat important,Important,Somewhat important,Important,Somewhat important,Important,Important,Important,Important,Not very important,Important,Important,Stock_options Vacation/days_off Equipment Professional_development_sponsorship Remote_options,Yes,LinkedIn Xing,I was just giving it a regular update,"A friend, family member, or former colleague told me",Somewhat important,Somewhat important,Important,Somewhat important,Somewhat important,Somewhat important,Somewhat important,Not very important,Somewhat important,Important,,,Spaces,Not at all important,Part-time/evening_course On-the-job_training Self-taught Coding_competition Hackathon Open_source_contributions,Official_documentation Trade_book Stack_Overflow_Q&A,,,10:00 AM,Java JavaScript Ruby SQL,JavaScript Ruby Rust Swift,React,React,Redis MySQL PostgreSQL,Redis PostgreSQL,Amazon_Web_Services_(AWS),Amazon_Web_Services_(AWS),Vim,Turn on some music,Agile Lean Scrum Extreme Pair Kanban,Git,Multiple times a day,Somewhat agree,Disagree,Disagree,Somewhat agree,Agree,Disagree,Somewhat agree,Customer_satisfaction Benchmarked_product_performance On_time/in_budget Revenue_performance Manager's_rating Peers'_rating Self-rating,Satisfied,Satisfied,Satisfied,Satisfied,Satisfied,Not very satisfied,A lot of influence,Some influence,I am the final decision maker,I am the final decision maker,I am the final decision maker,A lot of influence,I am the final decision maker,I am the final decision maker,Some influence,A lot of influence,Some influence,I have created a CV or Developer Story on Stack Overflow,10.0,Desktop,Several times,Several times,Haven't done at all,Haven't done at all,Haven't done at all,Haven't done at all,Several times,Haven't done at all,Somewhat agree,Somewhat agree,Disagree,Agree,Strongly agree,Agree,Somewhat agree,Disagree,Female,A master's degree,Hispanic_or_Latino/Latina,Somewhat agree,Agree,Disagree,Strongly agree,,
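
Each prediction row above is a csv line in which fields containing commas are wrapped in double quotes and embedded quotes are doubled. As a sketch of how such a line splits into fields, the standard `csv` module can parse a shortened excerpt of the first row; the excerpt keeps only four fields for brevity.

```python
import csv

# Shortened excerpt of the first prediction row: two fields contain commas,
# and the last one also contains doubled quotes.
row_text = '1,"Yes, both",United States,"With a soft ""g,"" like ""jiff"""'

fields = next(csv.reader([row_text]))
print(len(fields))
for field in fields:
    print(field)
```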

## Prediction from a python client

See the previous notebook in this sequence for the example.

# Step 8: Clean up

This section is optional. We will delete all the GCP resources and local files created in this sequence of notebooks. If you are not ready to delete anything, don't run any of the following cells.

In [None]:
# Delete the eval version
ModelVersions(mlengine_model_name).delete(mlengine_evaluation_version_name)

In [None]:
# Delete the regular version
ModelVersions(mlengine_model_name).delete(mlengine_regular_version_name)

In [None]:
# Delete the model
Models().delete(mlengine_model_name)

In [None]:
# Delete the GCS bucket
!gsutil -m rm -r $gcs_bucket

In [None]:
# Delete the local files
!rm -fr $WORKSPACE_PATH