# ML with Structured Data using Google Cloud

This notebook is adapted from [this tutorial](https://docs.google.com/presentation/d/e/2PACX-1vR-d6ztE9pkRS1L0pKInaaGMsBf7d_bMETr3Mx0uFYng2Y22zexg0ZaPRWbmmc497EMBeRgg8xvLLfI/pub?start=false&loop=false&delayms=3000&slide=id.g3444070087_0_0) on end-to-end ML with TensorFlow on GCP, created by **Lak Lakshmanan**. The original [codelabs](https://codelabs.developers.google.com/codelabs/end-to-end-ml/#0) are also available.

This notebook illustrates:

1. Exploring a BigQuery dataset using Datalab
2. Creating datasets for Machine Learning using Dataflow
3. Creating a model using the high-level Estimator API 
4. Training on Cloud ML Engine
5. Deploying the model
6. Predicting with the model

### Housekeeping 

In [8]:
BUCKET = 'ksalama-gcs-cloudml'
PROJECT = 'ksalama-gcp-playground'
REGION = 'europe-west1'

In [9]:
import os
os.environ['BUCKET'] = BUCKET
os.environ['PROJECT'] = PROJECT
os.environ['REGION'] = REGION

In [4]:
gcs_data_dir = 'gs://{0}/data/babyweight/'.format(BUCKET)
gcs_model_dir = 'gs://{0}/models/babyweight/'.format(BUCKET)

local_data_dir = 'data/babyweight'
local_models_dir= 'models/babyweight'

In [7]:
%%bash

gsutil -m rm -rf gs://ksalama-gcs-cloudml/models/babyweight/*
gsutil -m rm -rf gs://ksalama-gcs-cloudml/data/babyweight/big_data/*

CommandException: 1 files/objects could not be removed.
CommandException: 1 files/objects could not be removed.


## 1. Explore Data in BigQuery

The data is natality data (a record of births in the US). The goal is to predict a baby's weight from a number of factors about the pregnancy and the baby's mother. Later, we will split the data into training and eval datasets. A hash computed for each record (the `key` column in the Dataflow query of section 2) will be used for that.
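
The idea behind a hash-based split can be sketched in plain Python: hash a stable key for each record, take the result modulo 100, and route the record to train or eval depending on the bucket. This is only an illustration. `hashlib.md5` stands in for BigQuery's `FARM_FINGERPRINT`, and the function names, the sample keys, and the 75/25 threshold are assumptions.

```python
import hashlib

def split_bucket(key, num_buckets=100):
    # Map a string key to a stable bucket in [0, num_buckets).
    digest = hashlib.md5(key.encode('utf-8')).hexdigest()
    return int(digest, 16) % num_buckets

def assign_split(key, train_percent=75):
    # Buckets below the threshold go to training, the rest to evaluation.
    return 'train' if split_bucket(key) < train_percent else 'eval'

# Hypothetical keys; any stable string derived from a record would do.
keys = ['2001-01', '2001-02', '2002-07', '2003-11']
assignments = {k: assign_split(k) for k in keys}
print(assignments)
```

Because the assignment depends only on the key, the same record always lands in the same split, no matter how many times the query is re-run.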

In [None]:
%%bq query --name data

SELECT
  CAST(mother_race AS string) race_index,
  AVG(weight_pounds) avg_weight,
  COUNT(weight_pounds) instance_Count
FROM
  `publicdata.samples.natality`
WHERE 
    year > 2000
AND weight_pounds > 0
AND mother_age > 0
AND plurality > 0
AND gestation_weeks > 0
AND month > 0
AND mother_race IS NOT NULL
GROUP BY
  mother_race
ORDER BY
  avg_weight DESC

### Visualise with Datalab commands 
http://googledatalab.github.io/pydatalab/google.datalab%20Commands.html

In [None]:
%chart columns --data data --fields race_index,avg_weight
title: Mother Race Index vs Average Baby Weight
height: 400
width: 900
hAxis:
  title: Race Index
vAxis:
  title: Average Weight

### Fetch data from BigQuery as a pandas dataframe

In [None]:
data_size = 10000

In [None]:
%sql --module query 

SELECT
  ROUND(weight_pounds,1) AS weight_pounds,
  is_male,
  mother_age,
  mother_race,
  plurality,
  gestation_weeks,
  mother_married,
  cigarette_use,
  alcohol_use
FROM
  `publicdata.samples.natality`
WHERE 
        year > 2000
    AND weight_pounds > 0
    AND mother_age > 0
    AND plurality > 0
    AND gestation_weeks > 0
    AND month > 0
    AND mother_race IS NOT NULL
LIMIT $DATA_SIZE

In [None]:
import datalab.bigquery as bq
import sys
data = bq.Query(query, DATA_SIZE = data_size).to_dataframe(dialect='standard')
print('Row count:{}'.format(len(data)))
data.head(5)

In [None]:
data.describe()

### Explore & Visualise

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

In [None]:
plt.close('all')
#plt.figure(figsize=(45, 25))
plt.figure(figsize=(30, 15))

# Baby Weight Distribution
plt.subplot(2,3,1)
plt.title("Baby Weight Histogram")
plt.hist(data.weight_pounds, bins=150)
#plt.axis([0, 50, 0, 3500])
plt.xlabel("Baby Weight Ranges")
plt.ylabel("Frequency")
# ---------------------------

# Mother Age vs Baby Weight
plt.subplot(2,3,2)
plt.title("Mother Age vs Baby Weight")
plt.scatter(data.mother_age,data.weight_pounds)
plt.xlabel("Mother Age")
plt.ylabel("Baby Weight")
# ---------------------------

# Gestation Weeks vs Baby Weight
plt.subplot(2,3,3)
fit = np.polyfit(data.gestation_weeks,data.weight_pounds, deg=1)
plt.plot(data.gestation_weeks, fit[0] * data.gestation_weeks + fit[1], color='red')
plt.scatter(data.gestation_weeks, data.weight_pounds)
plt.xlabel("Gestation Weeks")
plt.ylabel("Baby Weight")

#---------------------------

# Is Male vs Baby Weight Boxplot
plt.subplot(2,3,4)
plt.title("Is Male vs Baby Weight")

is_male_values = list(data.is_male.value_counts().index.values)
is_male_data = []
for i in is_male_values:
    is_male_data = is_male_data + [data.weight_pounds[data.is_male == i].values]

plt.boxplot(is_male_data)
plt.axis([0, 3, 4, 11])
plt.xlabel("Is Male")
plt.ylabel("Baby Weight")
# ---------------------------

# Mother Race vs Baby Weight Boxplot
plt.subplot(2,3,5)
plt.title("Mother Race vs Baby Weight")

race_values = list(data.mother_race.value_counts().index.values)
race_data = []
for i in race_values:
    race_data = race_data + [data.weight_pounds[data.mother_race == i].values]

plt.boxplot(race_data)
plt.axis([0, 16, 4, 11])
plt.xlabel("Mother Race")
plt.ylabel("Baby Weight")

# # ---------------------------

plt.subplot(2,3,6)
plt.title("Alcohol & Cigarette Use")

alch_use_values = list(data.alcohol_use.value_counts().index.values)
cig_use_values = list(data.cigarette_use.value_counts().index.values)

use_data = []
labels = []

for i in alch_use_values:
    for j in cig_use_values:
        condition = (data.alcohol_use == i) & (data.cigarette_use == j)
        values = data.weight_pounds[condition].values
        # only add a label when the slice exists, so labels and slices stay aligned
        if len(values) > 0:
            labels.append('alch-use:{} & cig-use:{}'.format(i, j))
            use_data.append(len(values))

plt.pie(use_data)
plt.legend(labels, loc="lower center")

plt.show()

### Average Weight as a Baseline Estimator

In [None]:
import numpy as np

avg_weight = data.weight_pounds.mean()
print("Average Weight: {}".format(round(avg_weight,3)))
rmse = np.sqrt(data.weight_pounds.map(lambda value: (value-avg_weight)**2).mean())
print("RMSE: {}".format(round(rmse,3)))
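
To make the baseline concrete, here is the same computation worked on four made-up weights in plain Python. The values are hypothetical and chosen so the arithmetic is easy to follow. Predicting the mean for every record gives an RMSE equal to the population standard deviation of the target.

```python
import math

weights = [6.0, 7.0, 8.0, 9.0]  # hypothetical baby weights in pounds

# The baseline predicts the mean for every record.
mean_weight = sum(weights) / len(weights)

# Squared error of that constant prediction, averaged, then square-rooted.
squared_errors = [(w - mean_weight) ** 2 for w in weights]
baseline_rmse = math.sqrt(sum(squared_errors) / len(squared_errors))

print(mean_weight, round(baseline_rmse, 3))
```

Any trained model should achieve a lower RMSE than this constant predictor on the same data to be worth deploying.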

## 2. Create ML dataset using Dataflow


Let's use Cloud Dataflow to preprocess the data. The pipeline should do the following steps:
1. Read the data from BigQuery 
2. Clean, process, and transform the data to CSV
3. Write the results to files in GCS

### 2.1 Define the Pipeline

In [None]:
import apache_beam as beam
import datetime

dataset_size = 100000
train_size = dataset_size * 0.75
eval_size = dataset_size * 0.25

query = """
    SELECT
      ROUND(weight_pounds,1) AS weight_pounds ,
      is_male,
      mother_age,
      mother_race,
      plurality,
      gestation_weeks,
      mother_married,
      cigarette_use,
      alcohol_use,
      ABS(FARM_FINGERPRINT( 
        CONCAT(
          COALESCE(CAST(weight_pounds AS STRING), 'NA'),
          COALESCE(CAST(is_male AS STRING),'NA'),
          COALESCE(CAST(mother_age AS STRING),'NA'),
          COALESCE(CAST(mother_race AS STRING),'NA'),
          COALESCE(CAST(plurality AS STRING), 'NA'),
          COALESCE(CAST(gestation_weeks AS STRING),'NA'),
          COALESCE(CAST(mother_married AS STRING), 'NA'),
          COALESCE(CAST(cigarette_use AS STRING),'NA'),
          COALESCE(CAST(alcohol_use AS STRING),'NA')
          )
        )) AS key
        FROM
          publicdata.samples.natality
        WHERE year > 2000
        AND weight_pounds > 0
        AND mother_age > 0
        AND plurality > 0
        AND gestation_weeks > 0
        AND month > 0
    """

out_dir = gcs_data_dir + "big_data"

def to_csv(bq_row):

    import copy
    CSV_COLUMNS = 'weight_pounds,is_male,mother_age,mother_race,plurality,gestation_weeks,mother_married,cigarette_use,alcohol_use,key'.split(',')
    
    # modify opaque numeric race code into human-readable data
    races = dict(zip([1,2,3,4,5,6,7,18,28,39,48],
                     ['White', 'Black', 'American Indian', 'Chinese', 
                      'Japanese', 'Hawaiian', 'Filipino',
                      'Asian Indian', 'Korean', 'Samaon', 'Vietnamese']))
    
    result = copy.deepcopy(bq_row)
    
    if 'mother_race' in bq_row and bq_row['mother_race'] in races:
        result['mother_race'] = races[bq_row['mother_race']]
    else:
        result['mother_race'] = 'Unknown'
    
    csv_data = ','.join([str(result[k]) if k in result else 'None' for k in CSV_COLUMNS])
    return csv_data
  
def run_pipeline(runner, opts):
  
    pipeline = beam.Pipeline(runner, options=opts)
    
    for step in ['train', 'eval']:
        
        # MOD(key,100) yields 0..99: buckets 0-74 go to train, 75-99 go to eval (no overlap)
        if step == 'train':
            source_query = 'SELECT * FROM ({}) WHERE MOD(key,100) < 75 LIMIT {}'.format(query,int(train_size))
        else:
            source_query = 'SELECT * FROM ({}) WHERE MOD(key,100) >= 75 LIMIT {}'.format(query,int(eval_size))
            
        sink_location = os.path.join(out_dir, '{}-data'.format(step))

        (
            pipeline 
           | '{} - Read from BigQuery'.format(step) >> beam.io.Read(beam.io.BigQuerySource(query=source_query, use_standard_sql=True))
           | '{} - Process to CSV'.format(step) >> beam.Map(to_csv)
           | '{} - Write to GCS '.format(step) >> beam.io.Write(beam.io.WriteToText(sink_location,
                                                                file_name_suffix='.csv',
                                                                num_shards=5
                                                                                   ))
        )
        
    job = pipeline.run()
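
Before launching a Dataflow job, it can help to exercise the row-to-CSV mapping locally on a hand-written dictionary. The sketch below is a simplified restatement of `to_csv`, not the pipeline code itself: `row_to_csv` is a hypothetical name, and the column list and race lookup are shortened for brevity.

```python
# Shortened column list and race lookup (assumptions for illustration only).
CSV_COLUMNS = ['weight_pounds', 'is_male', 'mother_race', 'key']
RACES = {1: 'White', 2: 'Black'}

def row_to_csv(row):
    # Work on a copy so the input row is left untouched.
    result = dict(row)
    # Replace the numeric race code with a label, or 'Unknown' if unmapped.
    result['mother_race'] = RACES.get(row.get('mother_race'), 'Unknown')
    # Missing columns are written as the string 'None'.
    return ','.join(str(result[k]) if k in result else 'None' for k in CSV_COLUMNS)

complete_row = {'weight_pounds': 7.5, 'is_male': True, 'mother_race': 1, 'key': 42}
partial_row = {'weight_pounds': 6.1, 'mother_race': 99, 'key': 7}

print(row_to_csv(complete_row))
print(row_to_csv(partial_row))
```

The second row shows both fallbacks at once: an unmapped race code becomes `Unknown` and the absent `is_male` field becomes `None`.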

### 2.2. Run the Pipeline on Dataflow

In [None]:
job_name = 'preprocess-babyweight-data' + '-' + datetime.datetime.now().strftime('%y%m%d-%H%M%S')

options = {
    'region': 'europe-west1',
    'staging_location': os.path.join(out_dir, 'tmp', 'staging'),
    'temp_location': os.path.join(out_dir, 'tmp'),
    'job_name': job_name,
    'project': PROJECT
}

opts = beam.pipeline.PipelineOptions(flags=[], **options)
RUNNER = 'DataflowRunner'

print('Launching Dataflow job {} ... hang on'.format(job_name))

run_pipeline(RUNNER, opts)

In [None]:
%%bash

gsutil ls gs://ksalama-gcs-cloudml/data/babyweight/big_data

## 3. Create TensorFlow Models using Estimator API

In [15]:
import tensorflow as tf
from tensorflow import data

print(tf.__version__)

1.7.0


## Train a DNN Linear Combined Regression Model + Feature Engineering

1. Define dataset metadata + input function (to read and parse the data files, + **process features**) 

2. Create feature columns based on metadata + **Extended Feature Columns**

3. Initialise the Estimator + **Wide & Deep Columns for the combined DNN model**

4. Set up and experiment with TrainSpec, EvalSpec, config, and params

5. Run **train_and_evaluate** experiment


### 3.1 Define input function with process features

In [18]:
HEADER = 'weight_pounds,is_male,mother_age,mother_race,plurality,gestation_weeks,mother_married,cigarette_use,alcohol_use,key'.split(',')
TARGET_NAME = 'weight_pounds'
KEY_COLUMN = 'key'
DEFAULTS = [[0.0], ['null'], [0.0], ['null'], [0.0], [0.0], ['null'], ['null'], ['null'], ['nokey']]

In [19]:
def parse_csv_row(csv_row):
    
    columns = tf.decode_csv(tf.expand_dims(csv_row, -1), record_defaults=DEFAULTS)
    features = dict(zip(HEADER, columns))
    features.pop(KEY_COLUMN)
    target = features.pop(TARGET_NAME)
    return features, target
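
To see what the parsed output looks like without building a TensorFlow graph, the same logic can be mimicked in plain Python. This is not `tf.decode_csv`; it is a rough analogue that assumes each default's type decides the column type and that an empty field falls back to the default. `parse_row` and the sample line are hypothetical.

```python
HEADER = ['weight_pounds', 'is_male', 'mother_age', 'mother_race', 'plurality',
          'gestation_weeks', 'mother_married', 'cigarette_use', 'alcohol_use', 'key']
DEFAULTS = [0.0, 'null', 0.0, 'null', 0.0, 0.0, 'null', 'null', 'null', 'nokey']

def parse_row(line):
    parsed = {}
    for name, raw, default in zip(HEADER, line.split(','), DEFAULTS):
        # Empty fields take the default; others are cast to the default's type.
        parsed[name] = default if raw == '' else type(default)(raw)
    parsed.pop('key')                      # the key is not a model input
    target = parsed.pop('weight_pounds')   # the label is returned separately
    return parsed, target

# Hypothetical CSV line with an empty cigarette_use field.
features, target = parse_row('7.5,True,26,White,1,39,True,,None,1234')
print(target, features)
```

The result is a feature dictionary with eight entries plus a separate target value, which is the shape the Estimator input function is expected to return.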

In [20]:
# to be applied in training and serving
def process_features(features):
    return features

In [21]:
def csv_input_fn(file_name, mode=tf.estimator.ModeKeys.EVAL, 
                 skip_header_lines=0, 
                 num_epochs=1, 
                 batch_size=500):
    
    shuffle = (mode == tf.estimator.ModeKeys.TRAIN)
    
    file_names = tf.matching_files(file_name)

    dataset = data.TextLineDataset(filenames=file_names)
    dataset = dataset.skip(skip_header_lines)
    
    if shuffle:
        dataset = dataset.shuffle(buffer_size=2 * batch_size + 1)

    dataset = dataset.batch(batch_size)
    dataset = dataset.map(lambda csv_row: parse_csv_row(csv_row))
    dataset = dataset.map(lambda features, target: (process_features(features), target))
    dataset = dataset.repeat(num_epochs)
    iterator = dataset.make_one_shot_iterator()
    
    features, target = iterator.get_next()
    return features, target

### 3.2 Create Feature Columns with Extensions

In [22]:
def get_deep_and_wide_columns():

    is_male=tf.feature_column.categorical_column_with_vocabulary_list('is_male', ['True', 'False'])
    mother_age=tf.feature_column.numeric_column('mother_age')
    mother_race=tf.feature_column.categorical_column_with_vocabulary_list('mother_race', ['White', 'Black', 'American Indian', 'Chinese', 
               'Japanese', 'Hawaiian', 'Filipino', 'Unknown', 'Asian Indian', 'Korean', 'Samaon', 'Vietnamese'])
    plurality=tf.feature_column.numeric_column('plurality')
    gestation_weeks=tf.feature_column.numeric_column('gestation_weeks')
    mother_married=tf.feature_column.categorical_column_with_vocabulary_list('mother_married', ['True', 'False'])
    cigarette_use=tf.feature_column.categorical_column_with_vocabulary_list('cigarette_use', ['True', 'False', 'None'])
    alcohol_use=tf.feature_column.categorical_column_with_vocabulary_list('alcohol_use', ['True', 'False', 'None'])
    
    # extended feature columns
    cigarette_use_X_alcohol_use = tf.feature_column.crossed_column([cigarette_use, alcohol_use], 9)
    
    mother_age_bucketized = tf.feature_column.bucketized_column(mother_age, boundaries=[18, 22, 28, 32, 36, 40, 42, 45, 50])
    
    mother_race_X_mother_age_bucketized = tf.feature_column.crossed_column( [mother_age_bucketized,mother_race],  120)
    
    mother_race_X_mother_age_bucketized_embedded = tf.feature_column.embedding_column(mother_race_X_mother_age_bucketized, 5)
    
    # wide and deep columns
    wide_columns = [is_male, mother_race, plurality, mother_married, cigarette_use, alcohol_use, cigarette_use_X_alcohol_use, mother_age_bucketized, mother_race_X_mother_age_bucketized] 
    deep_columns = [mother_age, gestation_weeks, mother_race_X_mother_age_bucketized_embedded]
    
    return wide_columns, deep_columns

#get_deep_and_wide_columns()
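
The bucket boundaries above split mother age into ten ranges. A plain-Python sketch of the bucketing rule, assuming the left-closed intervals that `bucketized_column` documents (a value equal to a boundary falls into the bucket that starts at that boundary); `age_bucket` is a hypothetical helper:

```python
import bisect

BOUNDARIES = [18, 22, 28, 32, 36, 40, 42, 45, 50]

def age_bucket(age):
    # Index 0 is (-inf, 18), index 1 is [18, 22), ..., index 9 is [50, +inf).
    return bisect.bisect_right(BOUNDARIES, age)

num_buckets = len(BOUNDARIES) + 1
for age in (17, 18, 26, 50):
    print(age, age_bucket(age))
```

Nine boundaries give ten buckets, and crossing them with the twelve race labels yields 120 combinations, which matches the hash bucket size of the race-by-age cross. Likewise, three values each for cigarette and alcohol use give the 9 used for that cross. Since crossed columns are hashed, distinct combinations can still collide.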

### 3.3 Create a DNN Regression Estimator

In [23]:
def create_DNNLinearCombinedRegressor(run_config, hparams):
  
    wide_columns, deep_columns = get_deep_and_wide_columns()

    dnn_optimizer = tf.train.AdamOptimizer(learning_rate=hparams.learning_rate)
    
    estimator = tf.estimator.DNNLinearCombinedRegressor(
                linear_feature_columns = wide_columns,
                dnn_feature_columns = deep_columns,
                dnn_optimizer=dnn_optimizer,
                dnn_hidden_units=hparams.hidden_units,
                config = run_config
                )
    
    return estimator

### 3.4 Setup an experiment

##### a) RunConfig and Hyper-params

In [24]:
# Hyper-parameters
hparams  = tf.contrib.training.HParams(num_epochs = 10,
                                       batch_size = 500,
                                       hidden_units=[32, 16],
                                       max_steps = 100,
                                       learning_rate = 0.1,
                                       evaluate_after_sec=10)


model_name = 'dnn_estimator'
model_dir = os.path.join(local_models_dir, model_name)

# RunConfig
run_config = tf.estimator.RunConfig(
    tf_random_seed=19830610,
    model_dir=model_dir
)
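
How `num_epochs`, `batch_size`, and `max_steps` interact can be checked with a little arithmetic. Training stops when either `max_steps` is reached or the input function runs out of data, whichever comes first. The sketch below assumes the input pipeline batches first and then repeats (as `csv_input_fn` does); the example counts are hypothetical.

```python
def effective_train_steps(num_examples, batch_size, num_epochs, max_steps):
    # Ceiling division: a final partial batch still counts as one step.
    batches_per_epoch = -(-num_examples // batch_size)
    steps_available = batches_per_epoch * num_epochs
    return min(max_steps, steps_available)

# Hypothetical dataset sizes, with the hyper-parameters used in this notebook.
print(effective_train_steps(7500, batch_size=500, num_epochs=10, max_steps=100))
print(effective_train_steps(2000, batch_size=500, num_epochs=10, max_steps=100))
```

With a larger file the step limit is the binding constraint, while a small file is exhausted before `max_steps` is reached.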



##### b) Serving Function

In [None]:
def csv_serving_input_fn():
  
  SERVING_HEADER = 'is_male,mother_age,mother_race,plurality,gestation_weeks,mother_married,cigarette_use,alcohol_use'.split(',')
  SERVING_HEADER_DEFAULTS = [['null'], [0.0], ['null'], [0.0], [0.0], ['null'], ['null'], ['null']]

  rows_string_tensor = tf.placeholder(dtype=tf.string,
                                         shape=[None],
                                         name='csv_rows')
    
  receiver_tensor = {'csv_rows': rows_string_tensor}

  row_columns = tf.expand_dims(rows_string_tensor, -1)
  columns = tf.decode_csv(row_columns, record_defaults=SERVING_HEADER_DEFAULTS)
  features = dict(zip(SERVING_HEADER, columns))
  
  # apply the same feature preprocessing used in input_fn
  features = process_features(features)
  
  return tf.estimator.export.ServingInputReceiver(
        features, receiver_tensor)
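
Since this serving function expects raw CSV strings, a client has to send the eight feature values in exactly the order of `SERVING_HEADER`. A minimal sketch of building such a row from a dictionary follows. `to_csv_row` is a hypothetical helper, and wrapping the row under the `csv_rows` key is an assumption based on the receiver tensor name.

```python
SERVING_HEADER = ['is_male', 'mother_age', 'mother_race', 'plurality',
                  'gestation_weeks', 'mother_married', 'cigarette_use', 'alcohol_use']

def to_csv_row(instance):
    # Emit values in header order; the label and key columns are not included.
    return ','.join(str(instance[name]) for name in SERVING_HEADER)

instance = {
    'is_male': 'True', 'mother_age': 26.0, 'mother_race': 'Asian Indian',
    'plurality': 1.0, 'gestation_weeks': 39, 'mother_married': 'True',
    'cigarette_use': 'False', 'alcohol_use': 'False',
}

row = to_csv_row(instance)
request_instance = {'csv_rows': row}  # assumed request shape for this signature
print(request_instance)
```

This simple join does no quoting, so it only works while no feature value contains a comma.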

##### c) TrainSpec and EvalSpec

In [None]:
train_data_files = "data/babyweight/train.csv"
eval_data_files = "data/babyweight/eval.csv"

# TrainSpec
train_spec = tf.estimator.TrainSpec(
  input_fn = lambda: csv_input_fn(
    train_data_files,
    mode=tf.estimator.ModeKeys.TRAIN,
    num_epochs= hparams.num_epochs,
    batch_size = hparams.batch_size
  ),
  max_steps=hparams.max_steps,
)

# EvalSpec
eval_spec = tf.estimator.EvalSpec(
  input_fn =lambda: csv_input_fn(eval_data_files),
  exporters=[tf.estimator.LatestExporter(
      name="estimate",  # name of the sub-folder under export/ where the model is written
      serving_input_receiver_fn=csv_serving_input_fn,
      exports_to_keep=1,
      as_text=True)],
  steps = None,
  throttle_secs = hparams.evaluate_after_sec # wait at least 10 seconds between evaluations
)

### >> Start TensorBoard

In [None]:
from google.datalab.ml import TensorBoard
TensorBoard().start(model_dir)
TensorBoard().list()

### 3.5 Run train_and_evaluate

In [None]:
import shutil

# remove the following line of code to resume training
shutil.rmtree(model_dir, ignore_errors=True)

dnn_estimator = create_DNNLinearCombinedRegressor(run_config, hparams)

tf.logging.set_verbosity(tf.logging.INFO)

# run train and evaluate experiment
tf.estimator.train_and_evaluate(
  dnn_estimator,
  train_spec,
  eval_spec
)

In [None]:
%%bash

ls models/babyweight/dnn_estimator/

### >> Stop TensorBoard

In [None]:
# to stop TensorBoard, pass the pid reported by TensorBoard().list() (23002 in this run)
TensorBoard().stop(23002)
print('stopped TensorBoard')
TensorBoard().list()

## 4. Train the Model on Cloud ML Engine

### Train the Model on Cloud ML Engine with Single Node

In [None]:
%%bash

echo "Submitting a Cloud ML Engine job..."

REGION=europe-west1
TIER=BASIC # BASIC | BASIC_GPU | STANDARD_1 | PREMIUM_1
BUCKET=ksalama-gcs-cloudml

MODEL_NAME="babyweight_estimator"

PACKAGE_PATH=packages/babyweight-tf1.4/trainer
TRAIN_FILES=gs://${BUCKET}/data/babyweight/train-data.csv
VALID_FILES=gs://${BUCKET}/data/babyweight/eval-data.csv
MODEL_DIR=gs://${BUCKET}/models/babyweight/${MODEL_NAME}

#remove model directory, if you don't want to resume training, or if you have changed the model structure
#gsutil -m rm -r ${MODEL_DIR}

CURRENT_DATE=`date +%Y%m%d_%H%M%S`
JOB_NAME=train_${MODEL_NAME}_${TIER}_${CURRENT_DATE}

gcloud ml-engine jobs submit training ${JOB_NAME} \
        --job-dir=${MODEL_DIR} \
        --runtime-version=1.4 \
        --region=${REGION} \
        --scale-tier=${TIER} \
        --module-name=trainer.task \
        --package-path=${PACKAGE_PATH} \
        -- \
        --train-files=${TRAIN_FILES} \
        --num-epochs=100 \
        --train-batch-size=500 \
        --eval-files=${VALID_FILES} \
        --eval-batch-size=500 \
        --learning-rate=0.01 \
        --hidden-units="64,0,0" \
        --layer-sizes-scale-factor=0.5 \
        --num-layers=3 \
        --job-dir=${MODEL_DIR}
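
The job name built in the shell cell above can also be produced and sanity-checked in Python before submitting. The sketch mirrors the `train_<model>_<tier>_<timestamp>` pattern. The validation rule (start with a letter, then only letters, digits, and underscores) reflects my understanding of the Cloud ML Engine job ID constraints and should be treated as an assumption. `make_job_name` is a hypothetical helper.

```python
import re
from datetime import datetime

# Assumed naming rule: a letter first, then letters, digits, or underscores.
JOB_NAME_PATTERN = re.compile(r'^[A-Za-z][A-Za-z0-9_]*$')

def make_job_name(model_name, tier, now=None):
    now = now or datetime.now()
    name = 'train_{}_{}_{}'.format(model_name, tier, now.strftime('%Y%m%d_%H%M%S'))
    if not JOB_NAME_PATTERN.match(name):
        raise ValueError('invalid job name: {}'.format(name))
    return name

example = make_job_name('babyweight_estimator', 'BASIC', datetime(2018, 4, 1, 12, 30, 0))
print(example)
```

The timestamp suffix keeps job names unique across repeated submissions, and it uses underscores rather than hyphens so the name passes the check.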

### Train the Model on Cloud ML Engine + GPUs

In [None]:
%%bash

echo "Submitting a Cloud ML Engine job..."

REGION=europe-west1
TIER=BASIC_GPU # BASIC | BASIC_GPU | STANDARD_1 | PREMIUM_1
BUCKET=ksalama-gcs-cloudml

MODEL_NAME="babyweight_estimator"

PACKAGE_PATH=packages/babyweight-tf1.4/trainer
TRAIN_FILES=gs://${BUCKET}/data/babyweight/train-*.csv
VALID_FILES=gs://${BUCKET}/data/babyweight/eval-*.csv
MODEL_DIR=gs://${BUCKET}/models/babyweight/${MODEL_NAME}_${TIER}

CURRENT_DATE=`date +%Y%m%d_%H%M%S`
JOB_NAME=train_${MODEL_NAME}_${TIER}_${CURRENT_DATE}

gcloud ml-engine jobs submit training ${JOB_NAME} \
        --job-dir=${MODEL_DIR} \
        --runtime-version=1.4 \
        --region=${REGION} \
        --scale-tier=${TIER} \
        --module-name=trainer.task \
        --package-path=${PACKAGE_PATH} \
        -- \
        --train-files=${TRAIN_FILES} \
        --num-epochs=10 \
        --train-batch-size=1000 \
        --eval-files=${VALID_FILES} \
        --eval-batch-size=1000 \
        --learning-rate=0.01 \
        --hidden-units="64,0,0" \
        --layer-sizes-scale-factor=0.5 \
        --num-layers=3 \
        --job-dir=${MODEL_DIR}

### Train the Model on Cloud ML Engine + Custom GPUs Cluster

In [None]:
%%bash

echo "Submitting a Cloud ML Engine job..."

REGION=europe-west1
TIER=CUSTOM # BASIC | BASIC_GPU | STANDARD_1 | PREMIUM_1
BUCKET=ksalama-gcs-cloudml

MODEL_NAME="babyweight_estimator"

PACKAGE_PATH=packages/babyweight-tf1.4/trainer
TRAIN_FILES=gs://${BUCKET}/data/babyweight/big_data/train-*.csv
VALID_FILES=gs://${BUCKET}/data/babyweight/big_data/eval-*.csv
MODEL_DIR=gs://${BUCKET}/models/babyweight/${MODEL_NAME}_${TIER}

CURRENT_DATE=`date +%Y%m%d_%H%M%S`
JOB_NAME=train_${MODEL_NAME}_${TIER}_${CURRENT_DATE}

gcloud ml-engine jobs submit training ${JOB_NAME} \
        --job-dir=${MODEL_DIR} \
        --runtime-version=1.4 \
        --region=${REGION} \
        --module-name=trainer.task \
        --package-path=${PACKAGE_PATH} \
        --config=packages/babyweight-tf1.4/custom.yaml \
        -- \
        --train-files=${TRAIN_FILES} \
        --num-epochs=100 \
        --train-batch-size=1000 \
        --eval-files=${VALID_FILES} \
        --eval-batch-size=1000 \
        --learning-rate=0.001 \
        --hidden-units="64,0,0" \
        --layer-sizes-scale-factor=0.5 \
        --num-layers=3 \
        --job-dir=${MODEL_DIR}

### Hyper-parameters Tuning on Cloud ML Engine

In [None]:
%%bash

echo "Submitting a Cloud ML Engine job..."

REGION=europe-west1
BUCKET=ksalama-gcs-cloudml

MODEL_NAME="babyweight_estimator"

PACKAGE_PATH=packages/babyweight-tf1.4/trainer
TRAIN_FILES=gs://${BUCKET}/data/babyweight/big_data/train-*.csv
VALID_FILES=gs://${BUCKET}/data/babyweight/big_data/eval-*.csv
MODEL_DIR=gs://${BUCKET}/models/babyweight/${MODEL_NAME}_tune

CURRENT_DATE=`date +%Y%m%d_%H%M%S`
JOB_NAME=tune_${MODEL_NAME}_${CURRENT_DATE}

gcloud ml-engine jobs submit training ${JOB_NAME} \
        --job-dir=${MODEL_DIR} \
        --runtime-version=1.4 \
        --region=${REGION} \
        --module-name=trainer.task \
        --package-path=${PACKAGE_PATH} \
        --config=packages/babyweight-tf1.4/hyperparams.yaml \
        -- \
        --train-files=${TRAIN_FILES} \
        --num-epochs=100 \
        --train-batch-size=1000 \
        --eval-files=${VALID_FILES} \
        --eval-batch-size=1000 \
        --job-dir=${MODEL_DIR}

## 5. Deploy the Model on Cloud ML Engine

In [13]:
%%bash

REGION=europe-west1
BUCKET=ksalama-gcs-cloudml

MODEL_NAME="babyweight_estimator"
MODEL_VERSION="v1"

SAVEDMODEL_LOCATION=$(gsutil ls gs://${BUCKET}/models/babyweight/${MODEL_NAME}/export/estimate | tail -1)

echo ${SAVEDMODEL_LOCATION}

gsutil ls ${SAVEDMODEL_LOCATION}




CommandException: One or more URLs matched no objects.


In [None]:
%%bash

REGION=europe-west1
BUCKET=ksalama-gcs-cloudml

MODEL_NAME="babyweight_estimator"
MODEL_VERSION="v1"

## delete model version
#gcloud ml-engine versions delete ${MODEL_VERSION} --model=${MODEL_NAME}

## delete model
#gcloud ml-engine models delete ${MODEL_NAME}

# deploy model to GCP
gcloud ml-engine models create ${MODEL_NAME} --regions=${REGION}

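# deploy model version (each %%bash cell is a fresh shell, so resolve the latest export here)
MODEL_BINARIES=$(gsutil ls gs://${BUCKET}/models/babyweight/${MODEL_NAME}/export/estimate | tail -1)
gcloud ml-engine versions create ${MODEL_VERSION} --model=${MODEL_NAME} --origin=${MODEL_BINARIES} --runtime-version=1.4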

echo  ${MODEL_NAME} ${MODEL_VERSION} 
# invoke deployed model to make prediction given new data instances
gcloud ml-engine predict --model=${MODEL_NAME} --version=${MODEL_VERSION} --json-instances=data/babyweight/new-data.json
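
The `--json-instances` flag expects a file with one JSON instance per line. The contents of `new-data.json` are not shown in this notebook, so the sketch below writes a hypothetical file of that shape to a temporary directory. The field names follow the feature names used elsewhere in this notebook, and `write_json_instances` is an illustrative helper.

```python
import json
import os
import tempfile

# Hypothetical instances; adjust the fields to match the deployed signature.
instances = [
    {'is_male': 'True', 'mother_age': 26.0, 'mother_race': 'Asian Indian',
     'plurality': 1.0, 'gestation_weeks': 39, 'mother_married': 'True',
     'cigarette_use': 'False', 'alcohol_use': 'False'},
    {'is_male': 'False', 'mother_age': 29.0, 'mother_race': 'White',
     'plurality': 1.0, 'gestation_weeks': 38, 'mother_married': 'True',
     'cigarette_use': 'False', 'alcohol_use': 'False'},
]

def write_json_instances(path, rows):
    # Newline-delimited JSON: one complete object per line, no enclosing list.
    with open(path, 'w') as f:
        for row in rows:
            f.write(json.dumps(row) + '\n')

path = os.path.join(tempfile.mkdtemp(), 'new-data.json')
write_json_instances(path, instances)

with open(path) as f:
    lines = f.read().splitlines()
print(len(lines), 'instances written to', path)
```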

## 6. Consume the Model as API

In [None]:
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials

def estimate(project, model_name, version, instances):

    credentials = GoogleCredentials.get_application_default()
    api = discovery.build('ml', 'v1', credentials=credentials,
                discoveryServiceUrl='https://storage.googleapis.com/cloud-ml/discovery/ml_v1_discovery.json')

    request_data = {'instances': instances}

    model_url = 'projects/{}/models/{}/versions/{}'.format(project, model_name, version)
    response = api.projects().predict(body=request_data, name=model_url).execute()

    estimates = list(map(lambda item: round(item["scores"],2)
        ,response["predictions"]
    ))

    return estimates
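
The post-processing step can be tried offline against a mocked response. The response below is hypothetical. It assumes each prediction carries a `scores` value, as the function above expects, and that a failed call reports an `error` key. The actual output keys depend on the exported model signature.

```python
def extract_estimates(response):
    # Surface service-side failures instead of raising a KeyError later.
    if 'error' in response:
        raise RuntimeError(response['error'])
    return [round(item['scores'], 2) for item in response['predictions']]

# Hypothetical response shaped like the one the estimate() function parses.
mock_response = {'predictions': [{'scores': 7.4567}, {'scores': 6.9012}]}
estimates = extract_estimates(mock_response)
print(estimates)
```

Testing the parsing separately makes it easier to tell a malformed response apart from a credentials or networking problem.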

In [None]:
PROJECT='ksalama-gcp-playground'
MODEL_NAME='babyweight_estimator'
VERSION='v1'

instances = [
      {
        'is_male': 'True',
        'mother_age': 26.0,
        'mother_race': 'Asian Indian',
        'plurality': 1.0,
        'gestation_weeks': 39,
        'mother_married': 'True',
        'cigarette_use': 'False',
        'alcohol_use': 'False'
      },
      {
        'is_male': 'False',
        'mother_age': 29.0,
        'mother_race': 'Asian Indian',
        'plurality': 1.0,
        'gestation_weeks': 38,
        'mother_married': 'True',
        'cigarette_use': 'False',
        'alcohol_use': 'False'
      },
      {
        'is_male': 'True',
        'mother_age': 26.0,
        'mother_race': 'White',
        'plurality': 1.0,
        'gestation_weeks': 39,
        'mother_married': 'True',
        'cigarette_use': 'False',
        'alcohol_use': 'False'
      },
      {
        'is_male': 'True',
        'mother_age': 26.0,
        'mother_race': 'White',
        'plurality': 2.0,
        'gestation_weeks': 37,
        'mother_married': 'True',
        'cigarette_use': 'False',
        'alcohol_use': 'True'
      }
  ]

estimates = estimate(instances=instances
                     ,project=PROJECT
                     ,model_name=MODEL_NAME
                     ,version=VERSION)

print(estimates)

### the end ...