## Structured data prediction using Cloud ML Engine 

This notebook illustrates:

1. Exploring a BigQuery dataset using Datalab
2. Creating datasets for Machine Learning using Dataflow
3. Creating a model using the high-level Estimator API 
4. Training on Cloud ML Engine
5. Deploying the model
6. Predicting with the model

### Housekeeping 

In [None]:
BUCKET = 'ksalama-gcs-cloudml'
PROJECT = 'ksalama-gcp-playground'
REGION = 'europe-west1'

In [None]:
import os
os.environ['BUCKET'] = BUCKET
os.environ['PROJECT'] = PROJECT
os.environ['REGION'] = REGION

In [None]:
gcs_data_dir = 'gs://{0}/data/babyweight/'.format(BUCKET)
gcs_model_dir = 'gs://{0}/ml-models/babyweight/'.format(BUCKET)

In [None]:
%%bash

# BUCKET is exported to the environment in the cell above
gsutil -m rm -rf gs://${BUCKET}/ml-models/babyweight_estimator/*
gsutil -m rm -rf gs://${BUCKET}/data/babyweight/big_data/*

## Query data in BigQuery

The dataset is natality data (a record of births in the US). The goal is to predict a baby's weight from a number of factors about the pregnancy and the baby's mother. Later, the data is split into training and eval datasets using a hash of the year and month.
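
The hash-based split can be illustrated with a minimal sketch in plain Python. The helper names `hash_bucket` and `assign_split` are hypothetical, and MD5 is only a stand-in for BigQuery's `FARM_FINGERPRINT`, so the bucket values will differ from what BigQuery produces. The point is that every row from the same year-month lands in the same split on every run.

```python
import hashlib

def hash_bucket(year, month, num_buckets=4):
    """Map a (year, month) pair to a stable bucket in [0, num_buckets)."""
    key = '{}{}'.format(year, month).encode('utf-8')
    # MD5 is only a stand-in for BigQuery's FARM_FINGERPRINT.
    return int(hashlib.md5(key).hexdigest(), 16) % num_buckets

def assign_split(year, month):
    """Send buckets 0-2 to training and bucket 3 to evaluation."""
    return 'train' if hash_bucket(year, month) < 3 else 'eval'

# Made-up (year, month) pairs; the first two are identical on purpose.
rows = [(2001, 1), (2001, 1), (2003, 7), (2005, 12)]
splits = [assign_split(y, m) for y, m in rows]
print(splits)
```

Because the split depends only on the year and month, births from the same month never appear in both the training and the eval data.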

In [None]:
%%bq query --name data

SELECT
  CAST(mother_race AS string) race_index,
  AVG(weight_pounds) avg_weight,
  COUNT(weight_pounds) instance_Count
FROM
  `publicdata.samples.natality`
WHERE 
    year > 2000
AND weight_pounds > 0
AND mother_age > 0
AND plurality > 0
AND gestation_weeks > 0
AND month > 0
AND mother_race is not null
GROUP BY
  mother_race
ORDER BY
  avg_weight DESC

## Visualise with Datalab commands 
http://googledatalab.github.io/pydatalab/google.datalab%20Commands.html

In [None]:
%chart columns --data data --fields race_index,avg_weight
title: Mother Race Index vs Average Baby Weight
height: 400
width: 900
hAxis:
  title: Race Index
vAxis:
  title: Average Weight

### Fetch data from BigQuery as a pandas dataframe

In [None]:
step_size = 10000

In [None]:
%sql --module query 

SELECT
  weight_pounds,
  is_male,
  mother_age,
  mother_race,
  plurality,
  gestation_weeks,
  mother_married,
  ever_born,
  cigarette_use,
  alcohol_use
FROM
  `publicdata.samples.natality`
WHERE 
    year > 2000
AND weight_pounds > 0
AND mother_age > 0
AND plurality > 0
AND gestation_weeks > 0
AND month > 0
AND
  MOD(ABS(FARM_FINGERPRINT(CONCAT(CAST(YEAR AS STRING), CAST(month AS STRING)))), 2) = 0
LIMIT $STEP_SIZE

In [None]:
import datalab.bigquery as bq
import sys
data = bq.Query(query, STEP_SIZE = step_size).to_dataframe(dialect='standard')
print('Row count:{}'.format(len(data)))
data.head(5)

In [None]:
data.describe()

### Explore & Visualise

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

In [None]:
plt.close('all')
#plt.figure(figsize=(45, 25))
plt.figure(figsize=(30, 15))

# Baby Weight Distribution
plt.subplot(2,3,1)
plt.title("Baby Weight Histogram")
plt.hist(data.weight_pounds, bins=150)
#plt.axis([0, 50, 0, 3500])
plt.xlabel("Baby Weight Ranges")
plt.ylabel("Frequency")
# ---------------------------

# Mother Age vs Baby Weight
plt.subplot(2,3,2)
plt.title("Mother Age vs Baby Weight")
plt.scatter(data.mother_age,data.weight_pounds)
plt.xlabel("Mother Age")
plt.ylabel("Baby Weight")
# ---------------------------

# Gestation Weeks vs Baby Weight
plt.subplot(2,3,3)
plt.title("Gestation Weeks vs Baby Weight")
fit = np.polyfit(data.gestation_weeks,data.weight_pounds, deg=1)
plt.plot(data.gestation_weeks, fit[0] * data.gestation_weeks + fit[1], color='red')
plt.scatter(data.gestation_weeks, data.weight_pounds)
plt.xlabel("Gestation Weeks")
plt.ylabel("Baby Weight")

#---------------------------

# Is Male vs Baby Weight Boxplot
plt.subplot(2,3,4)
plt.title("Is Male vs Baby Weight")

is_male_values = list(data.is_male.value_counts().index.values)
is_male_data = []
for i in is_male_values:
    is_male_data = is_male_data + [data.weight_pounds[data.is_male == i].values]

plt.boxplot(is_male_data)
plt.axis([0, 3, 4, 11])
plt.xlabel("Is Male")
plt.ylabel("Baby Weight")
# ---------------------------

# Mother Race vs Baby Weight Boxplot
plt.subplot(2,3,5)
plt.title("Mother Race vs Baby Weight")

race_values = list(data.mother_race.value_counts().index.values)
race_data = []
for i in race_values:
    race_data = race_data + [data.weight_pounds[data.mother_race == i].values]

plt.boxplot(race_data)
plt.axis([0, 16, 4, 11])
plt.xlabel("Mother Race")
plt.ylabel("Baby Weight")

# # ---------------------------

plt.subplot(2,3,6)
plt.title("Alcohol & Cigarette Use")

alch_use_values = list(data.alcohol_use.value_counts().index.values)
cig_use_values = list(data.cigarette_use.value_counts().index.values)

use_data = []
labels = []

for i in alch_use_values:
    for j in cig_use_values:
        condition = (data.alcohol_use == i) & (data.cigarette_use == j)
        values = data.weight_pounds[condition].values
        # add a label only when the slice exists, so labels stay aligned with the pie slices
        if len(values) > 0:
            labels = labels + ['alch-use:{} & cig-use:{}'.format(i,j)]
            use_data = use_data + [len(values)]

plt.pie(use_data)
plt.legend(labels, loc="lower center")

plt.show()

### Average Weight as a Baseline Estimator

In [None]:
import numpy as np

avg_weight = data.weight_pounds.mean()
print("Average Weight: {}".format(round(avg_weight,3)))
rmse = np.sqrt(data.weight_pounds.map(lambda value: (value-avg_weight)**2).mean())
print("RMSE: {}".format(round(rmse,3)))
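
As a worked illustration of the baseline, here is the same calculation in plain Python on a handful of made-up weights (not taken from the dataset). Predicting the mean for every row gives an RMSE equal to the population standard deviation of the target, which is the number any trained model has to beat.

```python
import math

weights = [6.0, 7.0, 8.0, 9.0]  # made-up weights in pounds, not real data

# The baseline "model" always predicts the mean weight.
mean_weight = sum(weights) / len(weights)
squared_errors = [(w - mean_weight) ** 2 for w in weights]
baseline_rmse = math.sqrt(sum(squared_errors) / len(squared_errors))

print('Mean: {}, RMSE: {:.3f}'.format(mean_weight, baseline_rmse))
```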

### Create ML dataset using Dataflow

Let's use Cloud Dataflow to read in the BigQuery data and write it out as CSV files. 


In [None]:
import apache_beam as beam
import datetime

dataset_size = 100000
train_size = dataset_size * 0.7
eval_size = dataset_size * 0.3

query = """
        SELECT
          weight_pounds,
          is_male,
          mother_age,
          mother_race,
          plurality,
          gestation_weeks,
          mother_married,
          ever_born,
          cigarette_use,
          alcohol_use,
          FARM_FINGERPRINT(CONCAT(CAST(YEAR AS STRING), CAST(month AS STRING))) AS hashmonth
        FROM
          publicdata.samples.natality
        WHERE year > 2000
        AND weight_pounds > 0
        AND mother_age > 0
        AND plurality > 0
        AND gestation_weeks > 0
        AND month > 0
    """

out_dir = gcs_data_dir + "big_data"

def to_csv(rowdict):
    # pull columns from BQ and create a line
    import hashlib
    import copy
    CSV_COLUMNS = 'weight_pounds,is_male,mother_age,mother_race,plurality,gestation_weeks,mother_married,cigarette_use,alcohol_use'.split(',')
    # modify opaque numeric race code into human-readable data
    races = dict(zip([1,2,3,4,5,6,7,18,28,39,48],
                     ['White', 'Black', 'American Indian', 'Chinese', 
                      'Japanese', 'Hawaiian', 'Filipino',
                      'Asian Indian', 'Korean', 'Samaon', 'Vietnamese']))
    result = copy.deepcopy(rowdict)
    if 'mother_race' in rowdict and rowdict['mother_race'] in races:
        result['mother_race'] = races[rowdict['mother_race']]
    else:
        result['mother_race'] = 'Unknown'
    
    data = ','.join([str(result[k]) if k in result else 'None' for k in CSV_COLUMNS])
    key = hashlib.sha224(data.encode('utf-8')).hexdigest()  # hash the columns to form a key (encode so this also works on Python 3)
    return str('{},{}'.format(data, key))
  
def run_pipeline():
    
    job_name = 'preprocess-babyweight-data' + '-' + datetime.datetime.now().strftime('%y%m%d-%H%M%S')
    print('Launching Dataflow job {} ... hang on'.format(job_name))

    options = {
        'staging_location': os.path.join(out_dir, 'tmp', 'staging'),
        'temp_location': os.path.join(out_dir, 'tmp'),
        'job_name': job_name,
        'project': PROJECT,
        'teardown_policy': 'TEARDOWN_ALWAYS',
        'no_save_main_session': True
    }
  
    opts = beam.pipeline.PipelineOptions(flags=[], **options)
    RUNNER = 'DataflowRunner'
  
    pipeline = beam.Pipeline(RUNNER, options=opts)
    
    for step in ['train', 'eval']:
        if step == 'train':
            # FARM_FINGERPRINT can be negative, so take ABS before comparing the remainder
            source_query = 'SELECT * FROM ({}) WHERE ABS(MOD(hashmonth,4)) < 3 LIMIT {}'.format(query,int(train_size))
        else:
            source_query = 'SELECT * FROM ({}) WHERE ABS(MOD(hashmonth,4)) = 3 LIMIT {}'.format(query,int(eval_size))
            
        sink_location = os.path.join(out_dir, '{}-data'.format(step))

        (pipeline 
           | '{} - Read from BigQuery'.format(step) >> beam.io.Read(beam.io.BigQuerySource(query=source_query, use_standard_sql=True))
           | '{} - Process to CSV'.format(step) >> beam.Map(to_csv)
           | '{} - Write to GCS '.format(step) >> beam.io.Write(beam.io.WriteToText(sink_location,
                                                                file_name_suffix='.csv',
                                                                num_shards=5))
        )
    
   
    job = pipeline.run()
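
To make the output format concrete, the sketch below mimics the row-to-CSV step on a single invented row. It is a simplified stand-in rather than the pipeline code itself: `COLUMNS` is a shortened, hypothetical column list and `row_to_csv_line` is a hypothetical helper. It shows that missing columns are written as the string `None`, booleans as `True`/`False`, and that the trailing key is a SHA-224 hex digest of the data columns.

```python
import hashlib

# Hypothetical subset of the columns, for illustration only.
COLUMNS = ['weight_pounds', 'is_male', 'mother_age', 'cigarette_use']

def row_to_csv_line(row):
    """Join the selected columns and append a SHA-224 digest as the key."""
    # Columns absent from the row are written as the string 'None'.
    values = [str(row[c]) if c in row else 'None' for c in COLUMNS]
    data = ','.join(values)
    key = hashlib.sha224(data.encode('utf-8')).hexdigest()
    return '{},{}'.format(data, key)

sample_row = {'weight_pounds': 7.5, 'is_male': True, 'mother_age': 30}  # invented row
line = row_to_csv_line(sample_row)
print(line)
```

This is why the categorical vocabularies used later in the notebook contain the strings `'True'`, `'False'`, and `'None'`.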

## Run Dataflow Preprocessing Pipeline

In [None]:
run_pipeline()

In [None]:
%%bash

gsutil ls gs://ksalama-gcs-cloudml/data/babyweight/big_data

## Create TensorFlow Models using Estimator API

In [None]:
%%bash

pip install -U tensorflow

In [None]:
import tensorflow as tf
from tensorflow import data

print(tf.__version__)

## Train Linear Regression Model

1. Define dataset metadata + input function (to read and parse the data files)

2. Create feature columns based on metadata

3. Instantiate the model with feature columns 

4. Train, evaluate, and predict using the model and the data input function


### 1 - Define Metadata &  Input Function

In [None]:
HEADER = 'weight_pounds,is_male,mother_age,mother_race,plurality,gestation_weeks,mother_married,cigarette_use,alcohol_use,key'.split(',')
TARGET_NAME = 'weight_pounds'
KEY_COLUMN = 'key'
DEFAULTS = [[0.0], ['null'], [0.0], ['null'], [0.0], [0.0], ['null'], ['null'], ['null'], ['nokey']]

In [None]:
def parse_csv_row(csv_row):
    
    columns = tf.decode_csv(tf.expand_dims(csv_row, -1), record_defaults=DEFAULTS)
    features = dict(zip(HEADER, columns))
    features.pop(KEY_COLUMN)
    target = features.pop(TARGET_NAME)
    return features, target
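
The role of `DEFAULTS` can be sketched without TensorFlow. The example below uses Python's `csv` module and a shortened, hypothetical header to show the two things the record defaults provide: the value used when a field is empty, and the type each column is cast to. It is an approximation of the behaviour for illustration, not a replacement for `tf.decode_csv`.

```python
import csv

# Shortened header and defaults, for illustration only.
HEADER = ['weight_pounds', 'is_male', 'mother_age', 'key']
DEFAULTS = [0.0, 'null', 0.0, 'nokey']

def parse_line(line):
    """Parse one CSV line, substituting the default for any empty field."""
    fields = next(csv.reader([line]))
    row = {}
    for name, default, raw in zip(HEADER, DEFAULTS, fields):
        if raw == '':
            row[name] = default
        else:
            # Cast to the type of the default value.
            row[name] = type(default)(raw)
    row.pop('key')                   # the key is not a model feature
    target = row.pop('weight_pounds')
    return row, target

features, target = parse_line('7.5,,31,abc123')
print(features, target)
```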

In [None]:
def csv_input_fn(file_name, mode=tf.estimator.ModeKeys.EVAL, 
                 skip_header_lines=0, 
                 num_epochs=1, 
                 batch_size=500):
    
    shuffle = True if mode == tf.estimator.ModeKeys.TRAIN else False
    
    file_names = tf.matching_files(file_name)

    dataset = data.TextLineDataset(filenames=file_names)
    dataset = dataset.skip(skip_header_lines)
    
    if shuffle:
        dataset = dataset.shuffle(buffer_size=2 * batch_size + 1)

    dataset = dataset.batch(batch_size)
    dataset = dataset.map(lambda csv_row: parse_csv_row(csv_row))
    dataset = dataset.repeat(num_epochs)
    iterator = dataset.make_one_shot_iterator()
    
    features, target = iterator.get_next()
    return features, target
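
The order of the dataset operations matters. The plain-Python sketch below (the `batches` generator is hypothetical) imitates the shuffle, batch, repeat order used in the input function. One simplification is labelled in the code: it shuffles a whole epoch, whereas `dataset.shuffle` shuffles within a fixed-size buffer.

```python
import random

def batches(lines, batch_size, num_epochs, shuffle=False, seed=0):
    """Yield lists of lines: optionally shuffle, then batch, then repeat."""
    rng = random.Random(seed)
    for _ in range(num_epochs):
        epoch = list(lines)
        if shuffle:
            rng.shuffle(epoch)  # full shuffle; tf.data shuffles within a buffer
        for start in range(0, len(epoch), batch_size):
            yield epoch[start:start + batch_size]

lines = ['row{}'.format(i) for i in range(5)]
result = list(batches(lines, batch_size=2, num_epochs=2))
print(result)
```

Because batching happens before repeating, the last batch of each epoch can be smaller than `batch_size`, and no batch mixes rows from two epochs.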

### 2 - Create Feature Columns

In [None]:
def create_feature_columns():

    is_male=tf.feature_column.categorical_column_with_vocabulary_list('is_male', ['True', 'False'])
    mother_age=tf.feature_column.numeric_column('mother_age')
    mother_race=tf.feature_column.categorical_column_with_vocabulary_list('mother_race', ['White', 'Black', 'American Indian', 'Chinese', 
               'Japanese', 'Hawaiian', 'Filipino', 'Unknown', 'Asian Indian', 'Korean', 'Samaon', 'Vietnamese'])
    plurality=tf.feature_column.numeric_column('plurality')
    gestation_weeks=tf.feature_column.numeric_column('gestation_weeks')
    mother_married=tf.feature_column.categorical_column_with_vocabulary_list('mother_married', ['True', 'False'])
    cigarette_use=tf.feature_column.categorical_column_with_vocabulary_list('cigarette_use', ['True', 'False', 'None'])
    alcohol_use=tf.feature_column.categorical_column_with_vocabulary_list('alcohol_use', ['True', 'False', 'None'])
    
    feature_columns = [is_male, mother_age, mother_race, plurality, gestation_weeks, mother_married, cigarette_use, alcohol_use]
    
    return feature_columns

### 3 - Instantiate a Regression Estimator

In [None]:
local_model_dir = "trained_models/babyweight_estimator_lr"

feature_columns = create_feature_columns()

lr_estimator = tf.estimator.LinearRegressor(feature_columns=feature_columns,
                                            model_dir=local_model_dir)


### 4 - Train, Evaluate, and Predict

In [None]:
%%bash

ls data/babyweight

### a. train the model

In [None]:
import shutil

train_data_files = "data/babyweight/train.csv"

train_input_fn = lambda: csv_input_fn(train_data_files, 
                                              mode=tf.estimator.ModeKeys.TRAIN, 
                                              num_epochs=10,
                                              batch_size = 200
                                         )

# remove the following line of code to resume training
shutil.rmtree(local_model_dir, ignore_errors=True)

lr_estimator.train(train_input_fn, max_steps=1000)

In [None]:
%%bash

ls trained_models/babyweight_estimator_lr

### b. evaluate the trained model

In [None]:
eval_data_files = "data/babyweight/eval.csv"

eval_input_fn =lambda: csv_input_fn(eval_data_files)

lr_estimator.evaluate(eval_input_fn)

### c. predict using the trained model

In [None]:
import itertools

predictions = lr_estimator.predict(eval_input_fn)
values = list(map(lambda item: item["predictions"][0],list(itertools.islice(predictions, 5))))
print("")
print("Predicted Values: {}".format(values))

## Train a DNN Linear Combined Regression Model + Feature Engineering

1. Define dataset metadata + input function (to read and parse the data files, + **process features**) 

2. Create feature columns based on metadata + **Extended Feature Columns**

3. Initialise the Estimator + **Wide & Deep Columns for the combined DNN model**

4. Run **train_and_evaluate** experiment: Supply TrainSpec, EvalSpec, config, and params


### 1. Define input function with process features

In [None]:
HEADER = 'weight_pounds,is_male,mother_age,mother_race,plurality,gestation_weeks,mother_married,cigarette_use,alcohol_use,key'.split(',')
TARGET_NAME = 'weight_pounds'
KEY_COLUMN = 'key'
DEFAULTS = [[0.0], ['null'], [0.0], ['null'], [0.0], [0.0], ['null'], ['null'], ['null'], ['nokey']]

In [None]:
def parse_csv_row(csv_row):
    
    columns = tf.decode_csv(tf.expand_dims(csv_row, -1), record_defaults=DEFAULTS)
    features = dict(zip(HEADER, columns))
    features.pop(KEY_COLUMN)
    target = features.pop(TARGET_NAME)
    return features, target

In [None]:
# to be applied in training and serving
def process_features(features):
    return features

In [None]:
def csv_input_fn(file_name, mode=tf.estimator.ModeKeys.EVAL, 
                 skip_header_lines=0, 
                 num_epochs=1, 
                 batch_size=500):
    
    shuffle = True if mode == tf.estimator.ModeKeys.TRAIN else False
    
    file_names = tf.matching_files(file_name)

    dataset = data.TextLineDataset(filenames=file_names)
    dataset = dataset.skip(skip_header_lines)
    
    if shuffle:
        dataset = dataset.shuffle(buffer_size=2 * batch_size + 1)

    dataset = dataset.batch(batch_size)
    dataset = dataset.map(lambda csv_row: parse_csv_row(csv_row))
    dataset = dataset.map(lambda features, target: (process_features(features), target))
    dataset = dataset.repeat(num_epochs)
    iterator = dataset.make_one_shot_iterator()
    
    features, target = iterator.get_next()
    return features, target

### 2. Create Feature Columns with Extensions

In [None]:
def get_deep_and_wide_columns():

    is_male=tf.feature_column.categorical_column_with_vocabulary_list('is_male', ['True', 'False'])
    mother_age=tf.feature_column.numeric_column('mother_age')
    mother_race=tf.feature_column.categorical_column_with_vocabulary_list('mother_race', ['White', 'Black', 'American Indian', 'Chinese', 
               'Japanese', 'Hawaiian', 'Filipino', 'Unknown', 'Asian Indian', 'Korean', 'Samaon', 'Vietnamese'])
    plurality=tf.feature_column.numeric_column('plurality')
    gestation_weeks=tf.feature_column.numeric_column('gestation_weeks')
    mother_married=tf.feature_column.categorical_column_with_vocabulary_list('mother_married', ['True', 'False'])
    cigarette_use=tf.feature_column.categorical_column_with_vocabulary_list('cigarette_use', ['True', 'False', 'None'])
    alcohol_use=tf.feature_column.categorical_column_with_vocabulary_list('alcohol_use', ['True', 'False', 'None'])
    
    # extended feature columns
    cigarette_use_X_alcohol_use = tf.feature_column.crossed_column([cigarette_use, alcohol_use], 9)
    
    mother_age_bucketized = tf.feature_column.bucketized_column(mother_age, boundaries=[18, 22, 28, 32, 36, 40, 42, 45, 50])
    
    mother_race_X_mother_age_bucketized = tf.feature_column.crossed_column( [mother_age_bucketized,mother_race],  120)
    
    mother_race_X_mother_age_bucketized_embedded = tf.feature_column.embedding_column(mother_race_X_mother_age_bucketized, 5)
    
    # wide and deep columns
    wide_columns = [is_male, mother_race, plurality, mother_married, cigarette_use, alcohol_use, cigarette_use_X_alcohol_use, mother_age_bucketized, mother_race_X_mother_age_bucketized] 
    deep_columns = [mother_age, gestation_weeks, mother_race_X_mother_age_bucketized_embedded]
    
    return wide_columns, deep_columns

#get_deep_and_wide_columns()
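
The two extended column types can be sketched in plain Python. `bucketize` reproduces how boundaries turn a number into a bucket index, and `cross` hashes a combination of values into a fixed number of buckets. Both helpers are hypothetical, and SHA-1 is only a stand-in for TensorFlow's internal fingerprint function, so the crossed ids will not match what TensorFlow computes.

```python
import bisect
import hashlib

AGE_BOUNDARIES = [18, 22, 28, 32, 36, 40, 42, 45, 50]

def bucketize(value, boundaries):
    """Return 0 below the first boundary, up to len(boundaries) at or above the last."""
    return bisect.bisect_right(boundaries, value)

def cross(values, hash_bucket_size):
    """Hash a combination of categorical values into a fixed number of buckets."""
    # SHA-1 is a stand-in; TensorFlow uses its own fingerprint function.
    joined = '_X_'.join(str(v) for v in values).encode('utf-8')
    return int(hashlib.sha1(joined).hexdigest(), 16) % hash_bucket_size

age_bucket = bucketize(26, AGE_BOUNDARIES)
crossed_id = cross([age_bucket, 'White'], 120)
print(age_bucket, crossed_id)
```

Nine boundaries produce ten age buckets, and the race vocabulary has twelve entries, so the hash bucket size of 120 equals the number of possible age-race combinations. Likewise, 9 equals the 3 x 3 combinations of cigarette and alcohol use. Hashing can still map two combinations to the same bucket.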

### 3 - Create a DNN Regression Estimator

In [None]:
def create_DNNLinearCombinedRegressor(run_config, hparams):
  
    wide_columns, deep_columns = get_deep_and_wide_columns()

    dnn_optimizer = tf.train.AdamOptimizer(learning_rate=hparams.learning_rate)
    
    estimator = tf.estimator.DNNLinearCombinedRegressor(
                linear_feature_columns = wide_columns,
                dnn_feature_columns = deep_columns,
                dnn_optimizer=dnn_optimizer,
                dnn_hidden_units=hparams.hidden_units,
                config = run_config
                )
    
    return estimator

### 4. Run Local Experiment

### a. RunConfig and Hyper-params

In [None]:
# Hyper-parameters
hparams  = tf.contrib.training.HParams(num_epochs = 10,
                                       batch_size = 500,
                                       hidden_units=[32, 16],
                                       max_steps = 100,
                                       learning_rate = 0.1,
                                       evaluate_after_sec=10)

# RunConfig
local_model_dir = "trained_models/babyweight_estimator_dnn"
run_config = tf.estimator.RunConfig(
    tf_random_seed=19830610,
    model_dir=local_model_dir
)

### b. Serving Function

In [None]:
def csv_serving_input_fn():
  
  SERVING_HEADER = 'is_male,mother_age,mother_race,plurality,gestation_weeks,mother_married,cigarette_use,alcohol_use'.split(',')
  SERVING_HEADER_DEFAULTS = [['null'], [0.0], ['null'], [0.0], [0.0], ['null'], ['null'], ['null']]

  rows_string_tensor = tf.placeholder(dtype=tf.string,
                                         shape=[None],
                                         name='csv_rows')
    
  receiver_tensor = {'csv_rows': rows_string_tensor}

  row_columns = tf.expand_dims(rows_string_tensor, -1)
  columns = tf.decode_csv(row_columns, record_defaults=SERVING_HEADER_DEFAULTS)
  features = dict(zip(SERVING_HEADER, columns))
  
  # apply the same feature preprocessing used in the input_fn
  features = process_features(features)
  
  return tf.estimator.export.ServingInputReceiver(
        features, receiver_tensor)
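
This serving function expects each instance as a single CSV string with the columns in `SERVING_HEADER` order and without the target or the key. A minimal sketch of building such a row from a feature dictionary follows; `instance_to_csv_row` is a hypothetical helper and the instance values are only an example.

```python
SERVING_HEADER = ['is_male', 'mother_age', 'mother_race', 'plurality',
                  'gestation_weeks', 'mother_married', 'cigarette_use', 'alcohol_use']

def instance_to_csv_row(instance):
    """Serialise a feature dict into one CSV row in SERVING_HEADER order."""
    return ','.join(str(instance[name]) for name in SERVING_HEADER)

# Example values only.
instance = {
    'is_male': 'True',
    'mother_age': 26.0,
    'mother_race': 'Asian Indian',
    'plurality': 1.0,
    'gestation_weeks': 39,
    'mother_married': 'True',
    'cigarette_use': 'False',
    'alcohol_use': 'False',
}

csv_row = instance_to_csv_row(instance)
print(csv_row)
```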

### c. TrainSpec and EvalSpec

In [None]:
train_data_files = "data/babyweight/train.csv"
eval_data_files = "data/babyweight/eval.csv"

# TrainSpec
train_spec = tf.estimator.TrainSpec(
  input_fn = lambda: csv_input_fn(
    train_data_files,
    mode=tf.estimator.ModeKeys.TRAIN,
    num_epochs= hparams.num_epochs,
    batch_size = hparams.batch_size
  ),
  max_steps=hparams.max_steps,
)

# EvalSpec
eval_spec = tf.estimator.EvalSpec(
  input_fn =lambda: csv_input_fn(eval_data_files),
  exporters=[tf.estimator.LatestExporter(
      name="estimate",  # name of the folder, under export/, to which the model is exported
      serving_input_receiver_fn=csv_serving_input_fn,
      exports_to_keep=1,
      as_text=True)],
  steps = None,
  throttle_secs = hparams.evaluate_after_sec # evaluate after every 10 seconds of training
)

### d. Run train_and_evaluate

In [None]:
import shutil

# remove the following line of code to resume training
shutil.rmtree(local_model_dir, ignore_errors=True)

dnn_estimator = create_DNNLinearCombinedRegressor(run_config, hparams)

# run train and evaluate experiment
tf.estimator.train_and_evaluate(
  dnn_estimator,
  train_spec,
  eval_spec
)



In [None]:
%%bash

ls trained_models/babyweight_estimator_dnn/

### >> TensorBoard

In [None]:
from google.datalab.ml import TensorBoard
TensorBoard().start("trained_models/babyweight_estimator_dnn")
TensorBoard().list()

In [None]:
# to stop TensorBoard
# TensorBoard().stop(23002)
# print('stopped TensorBoard')
# TensorBoard().list()

## Train the Model on Cloud ML Engine

In [None]:
%%bash

gsutil -m cp -r gs://ksalama-gcs-cloudml/ml-packages/babyweight ml-packages
ls ml-packages/babyweight

In [None]:
%%bash

echo "Submitting a Cloud ML Engine job..."

REGION=europe-west1
TIER=BASIC # BASIC | BASIC_GPU | STANDARD_1 | PREMIUM_1
BUCKET=ksalama-gcs-cloudml

MODEL_NAME="babyweight_estimator"

PACKAGE_PATH=ml-packages/babyweight-tf1.4/trainer
TRAIN_FILES=gs://${BUCKET}/data/babyweight/train-data.csv
VALID_FILES=gs://${BUCKET}/data/babyweight/eval-data.csv
MODEL_DIR=gs://${BUCKET}/models/${MODEL_NAME}

#remove model directory, if you don't want to resume training, or if you have changed the model structure
#gsutil -m rm -r ${MODEL_DIR}

CURRENT_DATE=`date +%Y%m%d_%H%M%S`
JOB_NAME=train_${MODEL_NAME}_${TIER}_${CURRENT_DATE}

gcloud ml-engine jobs submit training ${JOB_NAME} \
        --job-dir=${MODEL_DIR} \
        --runtime-version=1.4 \
        --region=${REGION} \
        --scale-tier=${TIER} \
        --module-name=trainer.task \
        --package-path=${PACKAGE_PATH} \
        -- \
        --train-files=${TRAIN_FILES} \
        --num-epochs=100 \
        --train-batch-size=500 \
        --eval-files=${VALID_FILES} \
        --eval-batch-size=500 \
        --learning-rate=0.01 \
        --hidden-units="64,0,0" \
        --layer-sizes-scale-factor=0.5 \
        --num-layers=3 \
        --job-dir=${MODEL_DIR}
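
The trainer package is not shown in this notebook, so how it combines `--hidden-units`, `--layer-sizes-scale-factor`, and `--num-layers` is not known here. One common convention, given purely as an assumption and not as a description of `trainer.task`, is to take the first layer size and shrink each following layer by the scale factor. The `layer_sizes` helper below is hypothetical.

```python
def layer_sizes(first_layer_size, scale_factor, num_layers):
    """Shrink each successive layer by scale_factor, keeping at least 2 units."""
    # Assumed convention only; the real trainer may interpret its flags differently.
    return [max(2, int(first_layer_size * scale_factor ** i))
            for i in range(num_layers)]

sizes = layer_sizes(first_layer_size=64, scale_factor=0.5, num_layers=3)
print(sizes)
```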

## Train the Model on Cloud ML Engine + GPUs

In [None]:
%%bash

echo "Submitting a Cloud ML Engine job..."

REGION=europe-west1
TIER=BASIC_GPU # BASIC | BASIC_GPU | STANDARD_1 | PREMIUM_1
BUCKET=ksalama-gcs-cloudml

MODEL_NAME="babyweight_estimator"

PACKAGE_PATH=ml-packages/babyweight-tf1.4/trainer
TRAIN_FILES=gs://${BUCKET}/data/babyweight/train-*.csv
VALID_FILES=gs://${BUCKET}/data/babyweight/eval-*.csv
MODEL_DIR=gs://${BUCKET}/models/${MODEL_NAME}_${TIER}

CURRENT_DATE=`date +%Y%m%d_%H%M%S`
JOB_NAME=train_${MODEL_NAME}_${TIER}_${CURRENT_DATE}

gcloud ml-engine jobs submit training ${JOB_NAME} \
        --job-dir=${MODEL_DIR} \
        --runtime-version=1.4 \
        --region=${REGION} \
        --scale-tier=${TIER} \
        --module-name=trainer.task \
        --package-path=${PACKAGE_PATH} \
        -- \
        --train-files=${TRAIN_FILES} \
        --num-epochs=10 \
        --train-batch-size=1000 \
        --eval-files=${VALID_FILES} \
        --eval-batch-size=1000 \
        --learning-rate=0.01 \
        --hidden-units="64,0,0" \
        --layer-sizes-scale-factor=0.5 \
        --num-layers=3 \
        --job-dir=${MODEL_DIR}

## Train the Model on Cloud ML Engine + Custom GPUs Cluster

In [None]:
%%bash

echo "Submitting a Cloud ML Engine job..."

REGION=europe-west1
TIER=CUSTOM # used only in the job and model names here; the cluster is defined in custom.yaml
BUCKET=ksalama-gcs-cloudml

MODEL_NAME="babyweight_estimator"

PACKAGE_PATH=ml-packages/babyweight-tf1.4/trainer
TRAIN_FILES=gs://${BUCKET}/data/babyweight/big_data/train-*.csv
VALID_FILES=gs://${BUCKET}/data/babyweight/big_data/eval-*.csv
MODEL_DIR=gs://${BUCKET}/models/${MODEL_NAME}_${TIER}

CURRENT_DATE=`date +%Y%m%d_%H%M%S`
JOB_NAME=train_${MODEL_NAME}_${TIER}_${CURRENT_DATE}

gcloud ml-engine jobs submit training ${JOB_NAME} \
        --job-dir=${MODEL_DIR} \
        --runtime-version=1.4 \
        --region=${REGION} \
        --module-name=trainer.task \
        --package-path=${PACKAGE_PATH} \
        --config=ml-packages/babyweight-tf1.4/custom.yaml \
        -- \
        --train-files=${TRAIN_FILES} \
        --num-epochs=100 \
        --train-batch-size=1000 \
        --eval-files=${VALID_FILES} \
        --eval-batch-size=1000 \
        --learning-rate=0.001 \
        --hidden-units="64,0,0" \
        --layer-sizes-scale-factor=0.5 \
        --num-layers=3 \
        --job-dir=${MODEL_DIR}

## Hyper-parameters Tuning on Cloud ML Engine

In [None]:
%%bash

echo "Submitting a Cloud ML Engine job..."

REGION=europe-west1
BUCKET=ksalama-gcs-cloudml

MODEL_NAME="babyweight_estimator"

PACKAGE_PATH=ml-packages/babyweight-tf1.4/trainer
TRAIN_FILES=gs://${BUCKET}/data/babyweight/big_data/train-*.csv
VALID_FILES=gs://${BUCKET}/data/babyweight/big_data/eval-*.csv
MODEL_DIR=gs://${BUCKET}/models/${MODEL_NAME}_tune

CURRENT_DATE=`date +%Y%m%d_%H%M%S`
JOB_NAME=tune_${MODEL_NAME}_${CURRENT_DATE}

gcloud ml-engine jobs submit training ${JOB_NAME} \
        --job-dir=${MODEL_DIR} \
        --runtime-version=1.4 \
        --region=${REGION} \
        --module-name=trainer.task \
        --package-path=${PACKAGE_PATH} \
        --config=ml-packages/babyweight-tf1.4/hyperparams.yaml \
        -- \
        --train-files=${TRAIN_FILES} \
        --num-epochs=100 \
        --train-batch-size=1000 \
        --eval-files=${VALID_FILES} \
        --eval-batch-size=1000 \
        --job-dir=${MODEL_DIR}

## Deploy the Model

In [None]:
%%bash

REGION=europe-west1
BUCKET=ksalama-gcs-cloudml

MODEL_NAME="babyweight_estimator"
MODEL_VERSION="v1"

#MODEL_BINARIES=$(gsutil ls gs://${BUCKET}/models/${MODEL_NAME}/export/Servo | tail -1)

#gsutil ls ${MODEL_BINARIES}

# delete model version
#gcloud ml-engine versions delete ${MODEL_VERSION} --model=${MODEL_NAME}

# delete model
#gcloud ml-engine models delete ${MODEL_NAME}

# deploy model to GCP
#gcloud ml-engine models create ${MODEL_NAME} --regions=${REGION}

#deploy model version
#gcloud ml-engine versions create ${MODEL_VERSION} --model=${MODEL_NAME} --origin=${MODEL_BINARIES} --runtime-version=1.4

#echo  ${MODEL_NAME} ${MODEL_VERSION} 
# invoke deployed model to make prediction given new data instances
#gcloud ml-engine predict --model=${MODEL_NAME} --version=${MODEL_VERSION} --json-instances=data/babyweight/new-data.json

## Consume the Model as API

In [None]:
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials

def estimate(project, model_name, version, instances):

    credentials = GoogleCredentials.get_application_default()
    api = discovery.build('ml', 'v1', credentials=credentials,
                discoveryServiceUrl='https://storage.googleapis.com/cloud-ml/discovery/ml_v1_discovery.json')

    request_data = {'instances': instances}

    model_url = 'projects/{}/models/{}/versions/{}'.format(project, model_name, version)
    response = api.projects().predict(body=request_data, name=model_url).execute()

    estimates = list(map(lambda item: round(item["scores"],2)
        ,response["predictions"]
    ))

    return estimates
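
The two pieces the API call needs, the version resource name and the JSON request body, can be inspected without credentials or a network connection. The sketch below builds both; `build_predict_request` is a hypothetical helper, and `my-project` and the feature values are placeholders.

```python
import json

def build_predict_request(project, model_name, version, instances):
    """Return the resource name and JSON body for an online prediction call."""
    name = 'projects/{}/models/{}/versions/{}'.format(project, model_name, version)
    body = {'instances': instances}
    return name, body

# Placeholder project id and a shortened example instance.
name, body = build_predict_request(
    'my-project', 'babyweight_estimator', 'v1',
    [{'is_male': 'True', 'mother_age': 26.0}])

print(name)
print(json.dumps(body, sort_keys=True))
```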

In [None]:
PROJECT='ksalama-gcp-playground'
MODEL_NAME='babyweight_estimator'
VERSION='v1'

instances = [
      {
        'is_male': 'True',
        'mother_age': 26.0,
        'mother_race': 'Asian Indian',
        'plurality': 1.0,
        'gestation_weeks': 39,
        'mother_married': 'True',
        'cigarette_use': 'False',
        'alcohol_use': 'False'
      },
      {
        'is_male': 'False',
        'mother_age': 29.0,
        'mother_race': 'Asian Indian',
        'plurality': 1.0,
        'gestation_weeks': 38,
        'mother_married': 'True',
        'cigarette_use': 'False',
        'alcohol_use': 'False'
      },
      {
        'is_male': 'True',
        'mother_age': 26.0,
        'mother_race': 'White',
        'plurality': 1.0,
        'gestation_weeks': 39,
        'mother_married': 'True',
        'cigarette_use': 'False',
        'alcohol_use': 'False'
      },
      {
        'is_male': 'True',
        'mother_age': 26.0,
        'mother_race': 'White',
        'plurality': 2.0,
        'gestation_weeks': 37,
        'mother_married': 'True',
        'cigarette_use': 'False',
        'alcohol_use': 'True'
      }
  ]

estimates = estimate(instances=instances
                     ,project=PROJECT
                     ,model_name=MODEL_NAME
                     ,version=VERSION)

print(estimates)

### the end ...