# ML with Structured Data using Google Cloud

This tutorial is adapted from [this awesome tutorial](https://docs.google.com/presentation/d/e/2PACX-1vR-d6ztE9pkRS1L0pKInaaGMsBf7d_bMETr3Mx0uFYng2Y22zexg0ZaPRWbmmc497EMBeRgg8xvLLfI/pub?start=false&loop=false&delayms=3000&slide=id.g3444070087_0_0) created by **Lak Lakshmanan** for end-to-end ML with TensorFlow on GCP, which includes the original [codelabs](https://codelabs.developers.google.com/codelabs/end-to-end-ml/#0). It extends the original by also covering Facets, BQML, TFT, and TFMA.

This notebook illustrates:

1. Exploring a BigQuery dataset using Datalab & Facets
2. Linear Regression with BQML
3. Creating datasets for Machine Learning using Dataflow & tf.Transform
4. Creating a model using the high-level Estimator API 
5. Evaluating model quality using TensorFlow Model Analysis
6. Training & Tuning using Cloud ML Engine
7. Deploying the model
8. Predicting with the model

### Housekeeping 

In [None]:
BUCKET = 'ksalama-gcs-cloudml'
PROJECT = 'ksalama-gcp-playground'
REGION = 'europe-west1'

In [None]:
import os
os.environ['BUCKET'] = BUCKET
os.environ['PROJECT'] = PROJECT
os.environ['REGION'] = REGION

In [None]:
gcs_data_dir = 'gs://{0}/data/babyweight/'.format(BUCKET)
gcs_model_dir = 'gs://{0}/models/babyweight/'.format(BUCKET)

local_data_dir = 'data/babyweight'
local_models_dir= 'models/babyweight'

In [None]:
%%bash

# BUCKET is exported to the environment in the cell above
gsutil -m rm -rf gs://${BUCKET}/models/babyweight/*
gsutil -m rm -rf gs://${BUCKET}/data/babyweight/big_data/*

## 1. Exploring data in BigQuery

The dataset is natality data (a record of births in the US). The goal is to predict a baby's weight given a number of factors about the pregnancy and the baby's mother. Later, we will split the data into training and eval datasets, using a hash computed over each record's fields so that the split is repeatable.
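
The hash-based split can be sketched in plain Python. This is only an illustration of the idea: the later queries in this notebook use `FARM_FINGERPRINT` and `MOD(key, 4)` in BigQuery, whereas the sketch below substitutes MD5, and `assign_split` and the `row-N` keys are hypothetical names.

```python
import hashlib

def assign_split(record_key, num_buckets=4, eval_buckets=1):
    """Deterministically map a record key to 'train' or 'eval'."""
    # A stable hash (unlike Python's built-in hash()) yields the same bucket on every run.
    digest = hashlib.md5(record_key.encode('utf-8')).hexdigest()
    bucket = int(digest, 16) % num_buckets
    # The last `eval_buckets` buckets go to eval, the rest to train.
    return 'eval' if bucket >= num_buckets - eval_buckets else 'train'

# Hypothetical record keys, for illustration only.
keys = ['row-{}'.format(i) for i in range(1000)]
splits = [assign_split(k) for k in keys]
print('train: {}, eval: {}'.format(splits.count('train'), splits.count('eval')))
```

Because the bucket depends only on the record's content, the same record always lands in the same split, no matter how often the query or pipeline is rerun.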

In [None]:
%%bq query --name data

SELECT
  CAST(mother_race AS string) race_index,
  AVG(weight_pounds) avg_weight,
  COUNT(weight_pounds) instance_Count
FROM
  `publicdata.samples.natality`
WHERE 
    year > 2000
AND weight_pounds > 0
AND mother_age > 0
AND plurality > 0
AND gestation_weeks > 0
AND month > 0
AND mother_race is not null
GROUP BY
  mother_race
ORDER BY
  avg_weight DESC

### Visualise with Datalab commands 
http://googledatalab.github.io/pydatalab/google.datalab%20Commands.html

In [None]:
%chart columns --data data --fields race_index,avg_weight
title: Mother Race Index vs Average Baby Weight
height: 400
width: 900
hAxis:
  title: Race Index
vAxis:
  title: Average Weight

### Fetch data from BigQuery as a pandas dataframe

In [None]:
data_size = 10000

In [None]:
%sql --module query 

SELECT
  ROUND(weight_pounds,1) AS weight_pounds,
  is_male,
  mother_age,
  mother_race,
  plurality,
  gestation_weeks,
  mother_married,
  cigarette_use,
  alcohol_use
FROM
  `publicdata.samples.natality`
WHERE 
        year > 2000
    AND weight_pounds > 0
    AND mother_age > 0
    AND plurality > 0
    AND gestation_weeks > 0
    AND month > 0
    AND mother_race IS NOT NULL
LIMIT $DATA_SIZE

In [None]:
import datalab.bigquery as bq
import sys
data = bq.Query(query, DATA_SIZE = data_size).to_dataframe(dialect='standard')
print('Row count:{}'.format(len(data)))
data.head(5)

In [None]:
data.describe()

### Explore & Visualise

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

In [None]:
plt.close('all')
#plt.figure(figsize=(45, 25))
plt.figure(figsize=(30, 15))

# Baby Weight Distribution
plt.subplot(2,3,1)
plt.title("Baby Weight Histogram")
plt.hist(data.weight_pounds, bins=150)
#plt.axis([0, 50, 0, 3500])
plt.xlabel("Baby Weight Ranges")
plt.ylabel("Frequency")
# ---------------------------

# Mother Age vs Baby Weight
plt.subplot(2,3,2)
plt.title("Mother Age vs Baby Weight")
plt.scatter(data.mother_age,data.weight_pounds)
plt.xlabel("Mother Age")
plt.ylabel("Baby Weight")
# ---------------------------

# Gestation Weeks vs Baby Weight
plt.subplot(2,3,3)
fit = np.polyfit(data.gestation_weeks,data.weight_pounds, deg=1)
plt.plot(data.gestation_weeks, fit[0] * data.gestation_weeks + fit[1], color='red')
plt.scatter(data.gestation_weeks, data.weight_pounds)
plt.xlabel("Gestation Weeks")
plt.ylabel("Baby Weight")

#---------------------------

# Is Male vs Baby Weight Boxplot
plt.subplot(2,3,4)
plt.title("Is Male vs Baby Weight")

is_male_values = list(data.is_male.value_counts().index.values)
is_male_data = []
for i in is_male_values:
    is_male_data = is_male_data + [data.weight_pounds[data.is_male == i].values]

plt.boxplot(is_male_data)
plt.axis([0, 3, 4, 11])
plt.xlabel("Is Male")
plt.ylabel("Baby Weight")
# ---------------------------

# Mother Race vs Baby Weight Boxplot
plt.subplot(2,3,5)
plt.title("Mother Race vs Baby Weight")

race_values = list(data.mother_race.value_counts().index.values)
race_data = []
for i in race_values:
    race_data = race_data + [data.weight_pounds[data.mother_race == i].values]

plt.boxplot(race_data)
plt.axis([0, 16, 4, 11])
plt.xlabel("Mother Race")
plt.ylabel("Baby Weight")

# # ---------------------------

plt.subplot(2,3,6)
plt.title("Alcohol & Cigarette Use")

alch_use_values = list(data.alcohol_use.value_counts().index.values)
cig_use_values = list(data.cigarette_use.value_counts().index.values)

use_data = []
labels = []

for i in alch_use_values:
    for j in cig_use_values:
        labels = labels + ['alch-use:{} & cig-use:{}'.format(i,j)]
        condition = (data.alcohol_use == i) & (data.cigarette_use == j)
        values = data.weight_pounds[condition].values
        if (len(values) > 0):
            use_data = use_data + [len(values)]

plt.pie(use_data)
plt.legend(labels, loc="lower center")

plt.show()

## Visualise Dataset using Facets - Big Picture
visit: https://research.google.com/bigpicture/

* Use Stacked with categorical features to inspect their distribution. Numerical features are bucketised first.
* Use Scatter with numerical values to test correlations.
* Use Facets to slice and dice (vertically and horizontally).
* Use colour with the target feature.

In [None]:
from IPython.core.display import display, HTML

jsonstr = data.to_json(orient='records')

#HTML_TEMPLATE = """<link rel="import" href="/nbextensions/facets-dist/facets-jupyter.html">
HTML_TEMPLATE = """<link rel="import" href="/nbextensions/facets-jupyter.html">
        <facets-dive id="elem" height="600"></facets-dive>
        <script>
          var data = {jsonstr};
          document.querySelector("#elem").data = data;
        </script>"""
html = HTML_TEMPLATE.format(jsonstr=jsonstr)
#display(HTML(html))

# Save the page next to the notebook so the iframe in the next cell can load it
with open("babyweight-facest.html", "w") as html_file:
    html_file.write(html)

In [None]:
%%HTML
<iframe width="100%" height="600" src="babyweight-facest.html"></iframe>

### Average Weight as a Baseline Estimator
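
Predicting the mean for every record gives a floor that any trained model should beat. As a minimal sketch, with a hypothetical handful of weights rather than the BigQuery sample, the RMSE of this constant predictor is the population standard deviation of the target:

```python
import numpy as np

# Hypothetical baby weights in pounds (not taken from the dataset).
weights = np.array([6.5, 7.2, 8.1, 5.9, 7.7])

baseline = weights.mean()  # the constant prediction
rmse = np.sqrt(np.mean((weights - baseline) ** 2))

# For a mean predictor, RMSE coincides with the population standard deviation.
print(np.isclose(rmse, weights.std(ddof=0)))
```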

In [None]:
import numpy as np

avg_weight = data.weight_pounds.mean()
print("Average Weight: {}".format(round(avg_weight,3)))
rmse = np.sqrt(data.weight_pounds.map(lambda value: (value-avg_weight)**2).mean())
print("RMSE: {}".format(round(rmse,3)))

## Linear Regression with BigQuery

### 1- BQML: Create BigQuery dataset for ML models

In [None]:
from google.cloud import bigquery

BQML_DATASET = 'bqml_playground'
BQML_ESTOMATPR_NAME = 'babyweight_estimator'
BQML_DATASET_LOCATION = 'US'

bq_client = bigquery.Client(PROJECT)
dataset_ref = bq_client.dataset(BQML_DATASET)

dataset = bigquery.Dataset(dataset_ref)


if BQML_DATASET in list(map(lambda dataset: dataset.dataset_id,bq_client.list_datasets())):
    print('Deleting BQ Dataset {}...'.format(BQML_DATASET))
    bq_client.delete_dataset(dataset=dataset, delete_contents=True)
    
print('Creating BQ Dataset {}...'.format(BQML_DATASET))
dataset.location = BQML_DATASET_LOCATION
bq_client.create_dataset(dataset=dataset)

print('BQ Dataset {} is up and running'.format(BQML_DATASET))
print("")

### 2- BQML: Create and train the Linear Regression Model

In [None]:
from datetime import datetime
import time

bqml_train_query = (
'''
CREATE MODEL {}.{} 
  OPTIONS( model_type='linear_reg',
    learn_rate=0.1, 
    #l1_reg=0.001,
    max_iteration=1000,
    labels=['weight_pounds']
  ) AS
SELECT
  ROUND(weight_pounds,1) AS weight_pounds,
  COALESCE(CAST(is_male AS STRING),'NA') is_male,
  mother_age,
  COALESCE(CAST(mother_race AS STRING),'NA') mother_race,
  plurality,
  gestation_weeks,
  mother_married,
  COALESCE(CAST(cigarette_use AS STRING),'NA') cigarette_use,
  COALESCE(CAST(alcohol_use AS STRING),'NA') alcohol_use
FROM
  publicdata.samples.natality
WHERE
  year = 2000
  AND weight_pounds > 0
  AND mother_age > 0
  AND plurality > 0
  AND gestation_weeks > 0
  AND month > 0
LIMIT
  10000;
'''.format(BQML_DATASET, BQML_ESTOMATPR_NAME)
)

#print bqml_train_query

time_start = datetime.utcnow() 
print("Training started at {}".format(time_start.strftime("%H:%M:%S")))
print(".......................................") 

query_job = bq_client.query(
    query=bqml_train_query,
    location=BQML_DATASET_LOCATION
) 
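print("Status: {}".format(query_job.state))

try:
    results = query_job.result()  # blocks until the training job finishes
    print(results)
except Exception as error:
    # Report the failure instead of silently swallowing it
    print("Training query failed: {}".format(error))

print("Status: {}".format(query_job.state))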
time_end = datetime.utcnow() 
print(".......................................")
print("Training finished at {}".format(time_end.strftime("%H:%M:%S")))
print("")
time_elapsed = time_end - time_start
print("Training elapsed time: {} seconds".format(time_elapsed.total_seconds()))

### 3- BQML: Get Predictions using the Linear Regression Model

In [None]:
from datetime import datetime
import time

bqml_predict_query = (
'''
SELECT 
    ROUND(predicted_label,1) estimated_weight,
    weight_pounds
FROM ml.PREDICT(
  MODEL {}.{}, 
  (
      SELECT
          ROUND(weight_pounds,1) AS weight_pounds,
          COALESCE(CAST(is_male AS STRING),'NA') is_male,
          mother_age,
          COALESCE(CAST(mother_race AS STRING),'NA') mother_race,
          plurality,
          gestation_weeks,
          mother_married,
          COALESCE(CAST(cigarette_use AS STRING),'NA') cigarette_use,
          COALESCE(CAST(alcohol_use AS STRING),'NA') alcohol_use
      FROM
        publicdata.samples.natality
      WHERE
        year = 2000
        AND weight_pounds > 0
        AND mother_age > 0
        AND plurality > 0
        AND gestation_weeks > 0
          AND month > 0
     LIMIT 10
   )
);

'''.format(BQML_DATASET, BQML_ESTOMATPR_NAME)
)

#print bqml_predict_query

query_job = bq_client.query(
    query=bqml_predict_query,
    location=BQML_DATASET_LOCATION
) 
print("Status: {}".format(query_job.state))

results = query_job.result()
for row in results:
    print("Predicted:{},  Actual: {}".format(row.estimated_weight, row.weight_pounds))

print("Status: {}".format(query_job.state))


## 2. Create ML dataset using Dataflow

Let's use Cloud Dataflow to preprocess the data. The pipeline should do the following steps:
1. Read the data from BigQuery 
2. Clean, process, and transform the data to CSV
3. Write the results to files in GCS 
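
The same three stages can be sketched without Beam, which may help when reading the pipeline definition. Everything here is a stand-in: `rows` is a hypothetical in-memory replacement for the BigQuery results, `RACES` is a truncated lookup, and the output is a list of strings instead of files in GCS.

```python
# Stage 1: "read" (hypothetical rows standing in for BigQuery results).
rows = [
    {'weight_pounds': 7.1, 'is_male': True, 'mother_race': 1},
    {'weight_pounds': 6.3, 'is_male': False, 'mother_race': 99},
]

RACES = {1: 'White', 2: 'Black'}  # truncated lookup, for illustration only
COLUMNS = ['weight_pounds', 'is_male', 'mother_race']

def clean(row):
    # Stage 2: replace the opaque numeric race code with a readable label.
    cleaned = dict(row)
    cleaned['mother_race'] = RACES.get(row.get('mother_race'), 'Unknown')
    return cleaned

def to_line(row):
    # Stage 2 (continued): serialise the cleaned row as one CSV line.
    return ','.join(str(row[column]) for column in COLUMNS)

# Stage 3: "write" (a list of lines stands in for the output files).
lines = [to_line(clean(row)) for row in rows]
print('\n'.join(lines))
```

In the Beam version each stage becomes a labelled transform, which lets Dataflow distribute the per-row work across workers.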


### 2.1 Define the Pipeline

In [None]:
import apache_beam as beam
import datetime

dataset_size = 100000
train_size = dataset_size * 0.7
eval_size = dataset_size * 0.3

query = """
    SELECT
      ROUND(weight_pounds,1) AS weight_pounds ,
      is_male,
      mother_age,
      mother_race,
      plurality,
      gestation_weeks,
      mother_married,
      cigarette_use,
      alcohol_use,
      FARM_FINGERPRINT( 
        CONCAT(
          COALESCE(CAST(weight_pounds AS STRING), 'NA'),
          COALESCE(CAST(is_male AS STRING),'NA'),
          COALESCE(CAST(mother_age AS STRING),'NA'),
          COALESCE(CAST(mother_race AS STRING),'NA'),
          COALESCE(CAST(plurality AS STRING), 'NA'),
          COALESCE(CAST(gestation_weeks AS STRING),'NA'),
          COALESCE(CAST(mother_married AS STRING), 'NA'),
          COALESCE(CAST(cigarette_use AS STRING),'NA'),
          COALESCE(CAST(alcohol_use AS STRING),'NA')
          )
        ) AS key
        FROM
          publicdata.samples.natality
        WHERE year > 2000
        AND weight_pounds > 0
        AND mother_age > 0
        AND plurality > 0
        AND gestation_weeks > 0
        AND month > 0
    """

out_dir = gcs_data_dir + "big_data"

def to_csv(bq_row):
    # pull columns from BQ and create a line
    import hashlib
    import copy
    CSV_COLUMNS = 'weight_pounds,is_male,mother_age,mother_race,plurality,gestation_weeks,mother_married,cigarette_use,alcohol_use,key'.split(',')
    # modify opaque numeric race code into human-readable data
    races = dict(zip([1,2,3,4,5,6,7,18,28,39,48],
                     ['White', 'Black', 'American Indian', 'Chinese', 
                      'Japanese', 'Hawaiian', 'Filipino',
                      'Asian Indian', 'Korean', 'Samaon', 'Vietnamese']))
    result = copy.deepcopy(bq_row)
    if 'mother_race' in bq_row and bq_row['mother_race'] in races:
        result['mother_race'] = races[bq_row['mother_race']]
    else:
        result['mother_race'] = 'Unknown'
    
    csv_data = ','.join([str(result[k]) if k in result else 'None' for k in CSV_COLUMNS])
    return csv_data
  
def run_pipeline(runner, opts):
  
    # Use the runner passed in rather than the notebook-level RUNNER global
    pipeline = beam.Pipeline(runner, options=opts)
    
    for step in ['train', 'eval']:
        
        if step == 'train':
            source_query = 'SELECT * FROM ({}) WHERE MOD(key,4) < 3 LIMIT {}'.format(query,int(train_size))
        else:
            source_query = 'SELECT * FROM ({}) WHERE MOD(key,4) = 3 LIMIT {}'.format(query,int(eval_size))
            
        sink_location = os.path.join(out_dir, '{}-data'.format(step))

        (
            pipeline 
           | '{} - Read from BigQuery'.format(step) >> beam.io.Read(beam.io.BigQuerySource(query=source_query, use_standard_sql=True))
           | '{} - Process to CSV'.format(step) >> beam.Map(to_csv)
           | '{} - Write to GCS'.format(step) >> beam.io.WriteToText(sink_location,
                                                                     file_name_suffix='.csv',
                                                                     num_shards=5)
        )
        
    job = pipeline.run()

### 2.2. Run the Pipeline on Dataflow

In [None]:
job_name = 'preprocess-babyweight-data' + '-' + datetime.datetime.now().strftime('%y%m%d-%H%M%S')

options = {
    'region': REGION,
    'staging_location': os.path.join(out_dir, 'tmp', 'staging'),
    'temp_location': os.path.join(out_dir, 'tmp'),
    'job_name': job_name,
    'project': PROJECT
}

opts = beam.pipeline.PipelineOptions(flags=[], **options)
RUNNER = 'DataflowRunner'

print('Launching Dataflow job {} ... hang on'.format(job_name))

run_pipeline(RUNNER, opts)


In [None]:
%%bash

gsutil ls gs://${BUCKET}/data/babyweight/big_data

## 3. Create TensorFlow Models using Estimator APIs

In [None]:
import tensorflow as tf
from tensorflow import data

print(tf.__version__)

## Experiment A: Train Linear Regression Model

1. Define dataset metadata + data input function

2. Create feature columns based on metadata

3. Instantiate the model with feature columns 

4. Train, evaluate, and predict using the model


### 1- Define Metadata &  Input Function

In [None]:
HEADER = 'weight_pounds,is_male,mother_age,mother_race,plurality,gestation_weeks,mother_married,cigarette_use,alcohol_use,key'.split(',')
TARGET_FEATURE_NAME = 'weight_pounds'
KEY_COLUMN = 'key'
DEFAULTS = [[0.0], ['null'], [0.0], ['null'], [0.0], [0.0], ['null'], ['null'], ['null'], ['nokey']]

In [None]:
def parse_csv_row(csv_row):
    
    columns = tf.decode_csv(tf.expand_dims(csv_row, -1), record_defaults=DEFAULTS)
    features = dict(zip(HEADER, columns))
    features.pop(KEY_COLUMN)
    target = features.pop(TARGET_FEATURE_NAME)
    return features, target

In [None]:
def csv_input_fn(file_name, mode=tf.estimator.ModeKeys.EVAL, 
                 skip_header_lines=0, 
                 num_epochs=1, 
                 batch_size=500):
    
    shuffle = True if mode == tf.estimator.ModeKeys.TRAIN else False
    
    file_names = tf.matching_files(file_name)

    dataset = data.TextLineDataset(filenames=file_names)
    dataset = dataset.skip(skip_header_lines)
    
    if shuffle:
        dataset = dataset.shuffle(buffer_size=2 * batch_size + 1)

    dataset = dataset.batch(batch_size)
    dataset = dataset.map(lambda csv_row: parse_csv_row(csv_row))
    dataset = dataset.repeat(num_epochs)
    iterator = dataset.make_one_shot_iterator()
    
    features, target = iterator.get_next()
    return features, target

### 2- Create Feature Columns

In [None]:
def create_feature_columns():

    is_male=tf.feature_column.categorical_column_with_vocabulary_list('is_male', ['True', 'False'])
    mother_age=tf.feature_column.numeric_column('mother_age')
    mother_race=tf.feature_column.categorical_column_with_vocabulary_list('mother_race', ['White', 'Black', 'American Indian', 'Chinese', 
               'Japanese', 'Hawaiian', 'Filipino', 'Unknown', 'Asian Indian', 'Korean', 'Samaon', 'Vietnamese'])
    plurality=tf.feature_column.numeric_column('plurality')
    gestation_weeks=tf.feature_column.numeric_column('gestation_weeks')
    mother_married=tf.feature_column.categorical_column_with_vocabulary_list('mother_married', ['True', 'False'])
    cigarette_use=tf.feature_column.categorical_column_with_vocabulary_list('cigarette_use', ['True', 'False', 'None'])
    alcohol_use=tf.feature_column.categorical_column_with_vocabulary_list('alcohol_use', ['True', 'False', 'None'])
    
    feature_columns = [is_male, mother_age, mother_race, plurality, gestation_weeks, mother_married, cigarette_use, alcohol_use]
    
    return feature_columns

### 3- Instantiate a Linear Regression Estimator

In [None]:
model_dir = os.path.join(local_models_dir,"lr_estimator")

feature_columns = create_feature_columns()

lr_estimator = tf.estimator.LinearRegressor(feature_columns=feature_columns,
                                            model_dir=model_dir)


### 4- Train, Evaluate, and Predict

In [None]:
%%bash

ls data/babyweight

##### a) Train the model

In [None]:
import shutil

train_data_files = "data/babyweight/train.csv"

train_input_fn = lambda: csv_input_fn(train_data_files, 
                                              mode=tf.estimator.ModeKeys.TRAIN, 
                                              num_epochs=10,
                                              batch_size = 200
                                         )

# remove the following line of code to resume training
shutil.rmtree(model_dir, ignore_errors=True)

lr_estimator.train(train_input_fn, max_steps=1000)

In [None]:
%%bash

ls models/babyweight/lr_estimator

##### b) Evaluate the trained model

In [None]:
eval_data_files = "data/babyweight/eval.csv"

eval_input_fn = lambda: csv_input_fn(eval_data_files)

lr_estimator.evaluate(eval_input_fn)

##### c) Predict using the trained model

In [None]:
import itertools

predictions = lr_estimator.predict(eval_input_fn)
values = list(map(lambda item: item["predictions"][0],list(itertools.islice(predictions, 5))))
print("")
print("Predicted Values: {}".format(values))

## Experiment B:  Wide and Deep DNN Model 

In this experiment, we are going to:
1. Preprocess the data using tf.transform
2. Build a DNNLinearCombinedRegressor

## B.1 Prepare the data using tf.transform

In [None]:
import apache_beam as beam

import tensorflow as tf
import tensorflow_transform as tft
import tensorflow_transform.coders as tft_coders

from tensorflow.contrib.learn.python.learn.utils import input_fn_utils

from tensorflow_transform.beam import impl
from tensorflow_transform.beam.tft_beam_io import transform_fn_io
from tensorflow_transform.tf_metadata import metadata_io
from tensorflow_transform.tf_metadata import dataset_schema
from tensorflow_transform.tf_metadata import dataset_metadata
from tensorflow_transform.saved import saved_transform_io

### 1- Raw Data Metadata

In [None]:
RAW_FEATURE_NAMES = 'weight_pounds,is_male,mother_age,mother_race,plurality,gestation_weeks,mother_married,cigarette_use,alcohol_use,key'.split(',')
CATEGORICAL_FEATURE_NAMES = 'is_male,mother_race,mother_married,cigarette_use,alcohol_use'.split(',')
NUMERIC_FEATURE_NAMES = 'mother_age,plurality,gestation_weeks'.split(',')
TARGET_FEATURE_NAME = 'weight_pounds'
KEY_COLUMN = 'key'

def create_raw_metadata():  
    
    raw_data_schema = {}
    
    raw_data_schema[KEY_COLUMN]= dataset_schema.ColumnSchema(tf.float32, [], dataset_schema.FixedColumnRepresentation())
    
    raw_data_schema[TARGET_FEATURE_NAME]= dataset_schema.ColumnSchema(tf.float32, [], dataset_schema.FixedColumnRepresentation())
    
    raw_data_schema.update(
        { column_name : dataset_schema.ColumnSchema(tf.string, [], dataset_schema.FixedColumnRepresentation())
                   for column_name in CATEGORICAL_FEATURE_NAMES
        }
    )
    
    raw_data_schema.update(
        { column_name : dataset_schema.ColumnSchema(tf.float32, [], dataset_schema.FixedColumnRepresentation())
                   for column_name in NUMERIC_FEATURE_NAMES
        }
    )
    
    raw_metadata = dataset_metadata.DatasetMetadata(dataset_schema.Schema(raw_data_schema))
    
    return raw_metadata

print(create_raw_metadata().schema.as_feature_spec())
#print(create_raw_metadata().schema._column_schemas.keys())

### 2- Source Query

In [None]:
dataset_size = 1000  # small sample for a quick local run; raise to 100000 for the full dataset
train_size = dataset_size * 0.7
eval_size = dataset_size * 0.3

def get_source_query(step):
    
    query = """
    SELECT
      ROUND(weight_pounds,1) AS weight_pounds,
      is_male,
      mother_age,
      mother_race,
      plurality,
      gestation_weeks,
      mother_married,
      cigarette_use,
      alcohol_use,
      FARM_FINGERPRINT( 
        CONCAT(
          COALESCE(CAST(weight_pounds AS STRING), 'NA'),
          COALESCE(CAST(is_male AS STRING),'NA'),
          COALESCE(CAST(mother_age AS STRING),'NA'),
          COALESCE(CAST(mother_race AS STRING),'NA'),
          COALESCE(CAST(plurality AS STRING), 'NA'),
          COALESCE(CAST(gestation_weeks AS STRING),'NA'),
          COALESCE(CAST(mother_married AS STRING), 'NA'),
          COALESCE(CAST(cigarette_use AS STRING),'NA'),
          COALESCE(CAST(alcohol_use AS STRING),'NA')
          )
        ) AS key
        FROM
          publicdata.samples.natality
        WHERE year > 2000
        AND weight_pounds > 0
        AND mother_age > 0
        AND plurality > 0
        AND gestation_weeks > 0
        AND month > 0
    """
    
    if step == 'train':
        source_query = 'SELECT * FROM ({}) WHERE MOD(key,4) < 3 LIMIT {}'.format(query,int(train_size))
    else:
        source_query = 'SELECT * FROM ({}) WHERE MOD(key,4) = 3 LIMIT {}'.format(query,int(eval_size))
    
    return source_query

### 3- Data Processing Functions
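
The `preprocess_tft` function in the next cell relies on tf.Transform analyzers. As a rough guide to what those transformations compute, here is a NumPy sketch on hypothetical `mother_age` and `plurality` values. It is an approximation for intuition only: tf.Transform computes quantiles approximately (see the `epsilon` argument), so its bucket boundaries may differ from the exact percentiles used here.

```python
import numpy as np

# Hypothetical feature columns, for illustration only.
mother_age = np.array([18.0, 22.0, 25.0, 30.0, 35.0, 41.0])
plurality = np.array([1, 1, 2, 1, 3, 1])

# z-score: subtract the mean, divide by the standard deviation
age_z = (mother_age - mother_age.mean()) / mother_age.std()

# 0-1 scaling: map the minimum to 0 and the maximum to 1
age_01 = (mother_age - mother_age.min()) / (mother_age.max() - mother_age.min())

# quantile bucketing: four boundaries from the data give five buckets (0..4)
boundaries = np.percentile(mother_age, [20, 40, 60, 80])
age_bucket = np.digitize(mother_age, boundaries)

# indicator flag: 1 for twins or more, 0 for single births
is_multiple = (plurality > 1).astype(np.int64)

print(age_bucket, is_multiple)
```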

In [None]:
def cleanup(bq_row):
    
    RAW_FEATURE_NAMES = 'weight_pounds,is_male,mother_age,mother_race,plurality,gestation_weeks,mother_married,cigarette_use,alcohol_use,key'.split(',')
    
    # modify opaque numeric race code into human-readable data
    races = dict(zip([1,2,3,4,5,6,7,18,28,39,48],
                     ['White', 'Black', 'American Indian', 'Chinese', 
                      'Japanese', 'Hawaiian', 'Filipino',
                      'Asian Indian', 'Korean', 'Samaon', 'Vietnamese']))
    result = {} 
    
    for feature_name in RAW_FEATURE_NAMES:
        result[feature_name] = str(bq_row[feature_name])

    if 'mother_race' in bq_row and bq_row['mother_race'] in races:
        result['mother_race'] = races[bq_row['mother_race']]
    else:
        result['mother_race'] = 'Unknown'

    return result

def preprocess_tft(input_features):
    
    output_features = {}
    
    output_features['key'] = input_features['key']
    output_features['weight_pounds'] = input_features['weight_pounds']

    # normalisation
    output_features['mother_age_normalized'] = tft.scale_to_z_score(input_features['mother_age'])
    
    # bucketisation based on quantiles
    age_buckets = tft.quantiles(input_features['mother_age'], num_buckets=5, epsilon=0.01)
    output_features['mother_age_bucketized'] = tft.apply_buckets(input_features['mother_age'], age_buckets)
    
    # scaling between 0 and 1
    output_features['gestation_weeks_scaled'] =  tft.scale_to_0_1(input_features['gestation_weeks'])
    
    # you can compute new features based on custom formulas
    output_features['mother_age_log'] = tf.log(input_features['mother_age'])
    
    # or create flags/indicators
    output_features['is_multiple'] = tf.cast(input_features['plurality'] > [1], dtype=tf.int64)
    
    # extract vocab from categorical columns
    for feature_name in CATEGORICAL_FEATURE_NAMES:
        tft.uniques(input_features[feature_name], vocab_filename=feature_name)
        output_features[feature_name] = input_features[feature_name]
        
    return output_features

### 4- Transformation Beam Pipeline
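
The pipeline in the next cell analyzes only the training data and then reuses the resulting `transform_fn` on the eval data. The reason for that separation can be sketched in a few lines of NumPy. `analyze`, `transform`, and the sample arrays are hypothetical names and values, not part of tf.Transform.

```python
import numpy as np

def analyze(train_values):
    # "Analyze" phase: a full pass over the training data to collect statistics.
    return {'mean': float(np.mean(train_values)), 'std': float(np.std(train_values))}

def transform(values, stats):
    # "Transform" phase: a per-row function that only uses the saved statistics.
    return (np.asarray(values) - stats['mean']) / stats['std']

train_age = np.array([20.0, 25.0, 30.0, 35.0])  # hypothetical
eval_age = np.array([22.0, 40.0])               # hypothetical

stats = analyze(train_age)                  # computed once, on training data only
train_scaled = transform(train_age, stats)
eval_scaled = transform(eval_age, stats)    # reuses the training statistics

print(stats, eval_scaled)
```

Applying the training statistics to eval (and later to serving) data keeps the features on the same scale the model was trained on, and avoids leaking information from the eval set into preprocessing.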

In [None]:
TEMP_DIR = 'tmp'
TRANSFORM_ARTEFACTS_DIR = 'transform'
TRANSFORMED_DATA_DIR = 'transformed'

def run_transformation_pipeline(runner='DirectRunner', options=None):
    
    # Tolerate a missing options dict so the default argument is usable
    options = options or {}
    
    sink_transformed_data_location = os.path.join(gcs_data_dir if runner=='DataflowRunner' else local_data_dir, 
                                                  TRANSFORMED_DATA_DIR)
    
    sink_transform_dir = os.path.join(gcs_model_dir if runner=='DataflowRunner' else local_models_dir,
                                      TRANSFORM_ARTEFACTS_DIR)
    
    temporary_dir = os.path.join(gcs_data_dir if runner=='DataflowRunner' else local_data_dir,
                                      TEMP_DIR)
    
    print("Sink data files prefix: {}".format(sink_transformed_data_location))
    print("Sink transformation directory: {}".format(sink_transform_dir))
    print("Temporary directory: {}".format(temporary_dir))
    
    opts = beam.pipeline.PipelineOptions(flags=[], **options)
    
    with beam.Pipeline(runner, options=opts) as pipeline:
        with impl.Context(temporary_dir):
            
            raw_metadata = create_raw_metadata()

            ###### analyze & transform train #########################################################
            if(runner=='DirectRunner'):
                print("")
                print("Transform training data....")
                print("")
            
            step = 'train'
            source_query = get_source_query(step)
            train_transformed_data_location = os.path.join(sink_transformed_data_location,'{}-data'.format(step))
            
            # Read raw train data from BQ and cleanup
            raw_train_data = (
              pipeline
              | '{} - Read Data from BigQuery'.format(step) >> beam.io.Read(beam.io.BigQuerySource(query=source_query, use_standard_sql=True))
              | '{} - Clean up Data'.format(step) >> beam.Map(cleanup)
            )
            
            # create a train dataset from the data and schema
            raw_train_dataset = (raw_train_data, raw_metadata)
            
            # analyze and transform raw_train_dataset to produce transformed_train_dataset and transform_fn
            transformed_train_dataset, transform_fn = (
                raw_train_dataset 
                | '{} - Analyze & Transform'.format(step) >> impl.AnalyzeAndTransformDataset(preprocess_tft)
            )
            
            # get data and schema separately from the transformed_train_dataset
            transformed_train_data, transformed_metadata = transformed_train_dataset

            # write transformed train data to sink
            _ = (
                transformed_train_data 
                | '{} - Write Transformed Data'.format(step) >> beam.io.tfrecordio.WriteToTFRecord(
                    file_path_prefix=train_transformed_data_location,
                    file_name_suffix=".tfrecords",
                    coder=tft_coders.example_proto_coder.ExampleProtoCoder(transformed_metadata.schema))
            )
            
            ###### transform eval ##################################################################
            
            if(runner=='DirectRunner'):
                print("")
                print("Transform eval data....")
                print("")
            
            step = 'eval'
            source_query = get_source_query(step)
            eval_transformed_data_location = os.path.join(sink_transformed_data_location,'{}-data-'.format(step))
            
            # Read raw eval data from BQ and cleanup
            raw_eval_data = (
              pipeline
              | '{} - Read Data from BigQuery'.format(step) >> beam.io.Read(beam.io.BigQuerySource(query=source_query, use_standard_sql=True))
              | '{} - Clean up Data'.format(step) >> beam.Map(cleanup)
            )
            
            # create an eval dataset from the data and schema
            raw_eval_dataset = (raw_eval_data, raw_metadata)
            
            # transform eval data based on produced transform_fn (from analyzing train_data)
            transformed_eval_dataset = (
                (raw_eval_dataset, transform_fn) 
                | '{} - Transform'.format(step) >> impl.TransformDataset()
            )
            
            # get data from the transformed_eval_dataset
            transformed_eval_data, _ = transformed_eval_dataset
            
            # write transformed eval data to sink
            _ = (
                transformed_eval_data 
                | '{} - Write Transformed Data'.format(step) >> beam.io.tfrecordio.WriteToTFRecord(
                    file_path_prefix=eval_transformed_data_location,
                    file_name_suffix=".tfrecords",
                    coder=tft_coders.example_proto_coder.ExampleProtoCoder(transformed_metadata.schema))
            )
        
            ###### write transformation metadata #######################################################
            if(runner=='DirectRunner'):
                print("")
                print("Saving transformation artefacts ....")
                print("")
            
            # write transform_fn as tf.graph
            _ = (
                transform_fn 
                | 'Write Transform Artefacts' >> transform_fn_io.WriteTransformFn(sink_transform_dir)
            )

    # No explicit pipeline.run() is needed here: leaving the `with` block above
    # already runs the pipeline and waits for it to finish.

### 5- Run Transformation Pipeline

In [None]:
%%writefile requirements.txt
tensorflow-transform==0.6.0

In [None]:
%%bash

gsutil -m rm -r gs://${BUCKET}/data/babyweight/transformed
gsutil -m rm -r gs://${BUCKET}/models/babyweight/transform

In [None]:
import shutil
from datetime import datetime

tf.logging.set_verbosity(tf.logging.WARN)

runner = 'DirectRunner' # DirectRunner | DataflowRunner

job_name = 'preprocess-babyweight-data-tft-{}'.format(datetime.utcnow().strftime('%y%m%d-%H%M%S'))
print('Launching {} job {} ... hang on'.format(runner, job_name))
print("")

options = {
    'region': REGION,
    'staging_location': os.path.join(gcs_data_dir, 'tmp', 'staging'),
    'temp_location': os.path.join(gcs_data_dir, 'tmp'),
    'job_name': job_name,
    'project': PROJECT,
    'worker_machine_type': 'n1-standard-1',
    #'num_workers': 1,
    'max_num_workers': 20,
    'requirements_file': 'requirements.txt'
}

if runner == 'DirectRunner':
    
    shutil.rmtree(os.path.join(local_models_dir,TRANSFORM_ARTEFACTS_DIR), ignore_errors=True)
    shutil.rmtree(os.path.join(local_data_dir,TRANSFORMED_DATA_DIR), ignore_errors=True)
    shutil.rmtree(os.path.join(local_data_dir,TEMP_DIR), ignore_errors=True)

run_transformation_pipeline(runner, options)
print("Done!")

In [None]:
%%bash

echo "***Transformed Data:"
ls data/babyweight/transformed
echo ""

echo "***Transform Artefacts:"
ls models/babyweight/transform
echo ""

echo "***Transform function:"
ls models/babyweight/transform/transform_fn
echo ""

echo "***Transform assets:"
ls models/babyweight/transform/transform_fn/assets

## B.2: Train a DNN Linear Combined Regression Model + Feature Engineering

1. Define dataset metadata + input function (to read and parse the data files, + **process features**) 

2. Create feature columns based on metadata + **Extended Feature Columns**

3. Initialise the Estimator + **Wide & Deep Columns for the combined DNN model**

4. Set up an experiment with TrainSpec, EvalSpec, Serving_fn, run_config, and params

5. Run **train_and_evaluate** experiment 

6. Use the SavedModel for predictions


In [None]:
import tensorflow as tf
from tensorflow import data

print(tf.__version__)

### 1- Define input function with process features

In [None]:
transformed_metadata = metadata_io.read_metadata(
    os.path.join(local_models_dir,TRANSFORM_ARTEFACTS_DIR,"transformed_metadata"))

transformed_feature_spec = transformed_metadata.schema.as_feature_spec()

print(transformed_feature_spec)

In [None]:
def parse_tf_example(example_proto):
    
    parsed_features = tf.parse_example(serialized=example_proto, features=transformed_feature_spec)
    parsed_features.pop(KEY_COLUMN)
    target = parsed_features.pop(TARGET_FEATURE_NAME)
    
    return parsed_features, target

In [None]:
# to be applied in training and serving
# ideally, put this logic in preprocess_tft to avoid re-transforming the records on every training pass

def process_features(features):
    return features

In [None]:
def tfrecords_input_fn(files_name_pattern, mode=tf.estimator.ModeKeys.EVAL,  
                 num_epochs=1, 
                 batch_size=500):
    
    shuffle = (mode == tf.estimator.ModeKeys.TRAIN)
    
    file_names = data.Dataset.list_files(files_name_pattern)

    dataset = data.TFRecordDataset(filenames=file_names)
    if shuffle:
        dataset = dataset.shuffle(buffer_size=2 * batch_size + 1)

    dataset = dataset.batch(batch_size)
    dataset = dataset.map(lambda tf_example: parse_tf_example(tf_example))
    dataset = dataset.map(lambda features, target: (process_features(features), target))
    dataset = dataset.repeat(num_epochs)
    iterator = dataset.make_one_shot_iterator()
    
    features, target = iterator.get_next()
    return features, target

### 2- Create Feature Columns with Extensions

In [None]:
def get_deep_and_wide_columns():

    assets_dir = os.path.join(local_models_dir, TRANSFORM_ARTEFACTS_DIR, 'transform_fn/assets')
    
    categorical_feature_columns = {feature_name: 
      tf.feature_column.categorical_column_with_vocabulary_file(feature_name, vocabulary_file=os.path.join(assets_dir,feature_name ))
      for feature_name in CATEGORICAL_FEATURE_NAMES}
    
    is_multiple = tf.feature_column.categorical_column_with_identity('is_multiple', num_buckets=2)
    gestation_weeks_scaled =  tf.feature_column.numeric_column('gestation_weeks_scaled')
    mother_age_log = tf.feature_column.numeric_column('mother_age_log')
    mother_age_normalized = tf.feature_column.numeric_column('mother_age_normalized')
    
    # extended feature columns
    cigarette_use_X_alcohol_use = tf.feature_column.crossed_column(
      [categorical_feature_columns['cigarette_use'], categorical_feature_columns['alcohol_use']], 9)
    
    #mother_age_bucketized = tf.feature_column.bucketized_column(mother_age, boundaries=[18, 22, 28, 32, 36, 40, 42, 45, 50])
    mother_age_bucketized = tf.feature_column.categorical_column_with_identity('mother_age_bucketized', num_buckets=5)
    
    mother_race_X_mother_age_bucketized = tf.feature_column.crossed_column( [mother_age_bucketized,categorical_feature_columns['mother_race']],  120)
    
    mother_race_X_mother_age_bucketized_embedded = tf.feature_column.embedding_column(mother_race_X_mother_age_bucketized, 5)
    
    # wide and deep columns
    # list(...) keeps the concatenation valid on Python 3, where dict.values() is a view
    wide_columns = list(categorical_feature_columns.values()) + [is_multiple, cigarette_use_X_alcohol_use, mother_age_bucketized, mother_race_X_mother_age_bucketized]
    deep_columns = [mother_age_log, gestation_weeks_scaled, mother_race_X_mother_age_bucketized_embedded]
    
    return wide_columns, deep_columns

# w,d = get_deep_and_wide_columns()
# print w
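
The crossed columns above hash each combination of input values into a fixed number of buckets (9 for `cigarette_use` × `alcohol_use`, 120 for `mother_age_bucketized` × `mother_race`). The sketch below illustrates the idea in plain Python. It is not TensorFlow's implementation: the `cross_bucket` helper and the MD5-based hash are assumptions chosen only to make the example deterministic, and the vocabularies are illustrative.

```python
import hashlib
from itertools import product


def cross_bucket(values, hash_bucket_size):
    """Map a tuple of categorical values to a bucket index (illustrative only)."""
    key = "_X_".join(str(v) for v in values).encode("utf-8")
    digest = hashlib.md5(key).hexdigest()
    return int(digest, 16) % hash_bucket_size


# Hypothetical vocabularies; the real ones come from the tf.Transform assets.
cigarette_use = ["True", "False", "None"]
alcohol_use = ["True", "False", "None"]

buckets = {
    pair: cross_bucket(pair, hash_bucket_size=9)
    for pair in product(cigarette_use, alcohol_use)
}

for pair, bucket in sorted(buckets.items()):
    print("{} -> {}".format(pair, bucket))

# Distinct combinations may share a bucket when their hashes collide.
print("distinct buckets used: {} of 9".format(len(set(buckets.values()))))
```

Because the bucket is chosen by hashing, two different combinations can land in the same bucket. A larger `hash_bucket_size` makes collisions less likely at the cost of more weights to learn.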

### 3- Create a DNN Regression Estimator

In [None]:
def metric_fn(labels, predictions):

    metrics = {}
    
    pred_values = predictions['predictions']
    
    metrics['rmse'] = tf.metrics.root_mean_squared_error(
      labels=labels,
      predictions=pred_values)
    
    metrics['mae'] = tf.metrics.mean_absolute_error(
      labels=labels,
      predictions=pred_values)
    
    
    return metrics


def create_DNNLinearCombinedRegressor(run_config, hparams):
  
    wide_columns, deep_columns = get_deep_and_wide_columns()

    dnn_optimizer = tf.train.AdamOptimizer(learning_rate=hparams.learning_rate)
    
    estimator = tf.estimator.DNNLinearCombinedRegressor(
                linear_feature_columns = wide_columns,
                dnn_feature_columns = deep_columns,
                dnn_optimizer=dnn_optimizer,
                dnn_hidden_units=hparams.hidden_units,
                config = run_config
                )
    
    
    estimator = tf.contrib.estimator.add_metrics(estimator, metric_fn)
    
    return estimator
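
`metric_fn` attaches two extra evaluation metrics to the estimator. As a reminder of what they measure, here is a plain-Python sketch of both formulas. The `rmse` and `mae` helpers and the sample values are hypothetical and independent of TensorFlow's streaming implementations.

```python
import math


def rmse(labels, predictions):
    """Root mean squared error: square root of the mean squared difference."""
    squared = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared) / len(squared))


def mae(labels, predictions):
    """Mean absolute error: mean of the absolute differences."""
    absolute = [abs(y - p) for y, p in zip(labels, predictions)]
    return sum(absolute) / len(absolute)


# Made-up baby weights in pounds, for illustration only.
labels = [7.5, 6.0, 8.25, 5.0]
predictions = [7.0, 6.5, 8.25, 7.0]

print("RMSE: {:.3f}".format(rmse(labels, predictions)))
print("MAE:  {:.3f}".format(mae(labels, predictions)))
```

In this sample the single 2-pound miss pulls RMSE above MAE, which is why reporting both helps to spot a few large errors hiding behind a reasonable average.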

### 4- Setup Local Experiment

##### a) RunConfig and Hyper-params

In [None]:
# Hyper-parameters
hparams  = tf.contrib.training.HParams(
    num_epochs=10,
    batch_size=500,
    hidden_units=[32, 16],
    max_steps=100,
    learning_rate=0.1,
    evaluate_after_sec=10
)

# RunConfig
model_dir = os.path.join(local_models_dir,"dnn_estimator")

run_config = tf.estimator.RunConfig(
    tf_random_seed=19830610,
    model_dir=model_dir
)

##### b) Serving Function

In [None]:
def generate_serving_input_fn():
    
    def _serving_fn():
        
        # get the feature_spec of raw data
        raw_metadata = create_raw_metadata()
        raw_placeholder_spec = raw_metadata.schema.as_batched_placeholders()
        raw_placeholder_spec.pop(TARGET_FEATURE_NAME)
    
        raw_input_fn = tf.estimator.export.build_raw_serving_input_receiver_fn(raw_placeholder_spec)
        raw_features, receiver_tensors, _ = raw_input_fn()
        
        # apply transform_fn on raw features
        _, transformed_features = (
            saved_transform_io.partially_apply_saved_transform(
                os.path.join(local_models_dir,TRANSFORM_ARTEFACTS_DIR,transform_fn_io.TRANSFORM_FN_DIR),
                raw_features)
        )
        
        # apply the process_features function to transformed features
        transformed_features = process_features(transformed_features)
        
        return tf.estimator.export.ServingInputReceiver(
            transformed_features, raw_features)
    
    return _serving_fn

##### c) TrainSpec and EvalSpec

In [None]:
train_data_files = os.path.join(local_data_dir,TRANSFORMED_DATA_DIR)+"/train-*.tfrecords"
eval_data_files = os.path.join(local_data_dir,TRANSFORMED_DATA_DIR)+"/eval-*.tfrecords"

# TrainSpec
train_spec = tf.estimator.TrainSpec(
  input_fn = lambda: tfrecords_input_fn(
    train_data_files,
    mode=tf.estimator.ModeKeys.TRAIN,
    num_epochs= hparams.num_epochs,
    batch_size = hparams.batch_size
  ),
  max_steps=hparams.max_steps,
)

# EvalSpec
eval_spec = tf.estimator.EvalSpec(
  input_fn =lambda: tfrecords_input_fn(eval_data_files),
  exporters=[tf.estimator.LatestExporter(
      name="estimate",  # sub-folder under export/ to which the model is exported
      serving_input_receiver_fn=generate_serving_input_fn(),
      exports_to_keep=1,
      as_text=True)],
  steps = None,
  throttle_secs = hparams.evaluate_after_sec # evaluate at most once every 10 seconds
)
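
Training stops when the global step reaches `max_steps` or when the input function runs out of data, whichever comes first. With the input pipeline defined earlier (batch, then repeat), the data runs out after `num_epochs * ceil(num_examples / batch_size)` steps. The sketch below does that arithmetic, assuming training starts from step 0. `effective_train_steps` is a hypothetical helper and `num_examples` is an assumed value, since the size of the transformed training set depends on the sample produced earlier.

```python
import math


def effective_train_steps(num_examples, batch_size, num_epochs, max_steps):
    """Number of steps actually run when starting from global step 0."""
    batches_per_epoch = int(math.ceil(num_examples / float(batch_size)))
    steps_until_data_ends = batches_per_epoch * num_epochs
    return min(steps_until_data_ends, max_steps)


# batch_size, num_epochs and max_steps mirror hparams; num_examples is assumed.
steps = effective_train_steps(
    num_examples=10000, batch_size=500, num_epochs=10, max_steps=100)
print(steps)
```

With these numbers `max_steps` is the binding limit, so the model sees only half of the ten requested epochs. Raising `max_steps` or setting it to `None` lets `num_epochs` take over.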

### >> TensorBoard - Start

In [None]:
from google.datalab.ml import TensorBoard
TensorBoard().start(model_dir)
TensorBoard().list()

### 5- Run train_and_evaluate

In [None]:
import shutil
from datetime import datetime

# remove the following line of code to resume training
shutil.rmtree(model_dir, ignore_errors=True)

dnn_estimator = create_DNNLinearCombinedRegressor(run_config, hparams)

tf.logging.set_verbosity(tf.logging.INFO)

time_start = datetime.utcnow() 
print("")
print("Experiment started at {}".format(time_start.strftime("%H:%M:%S")))
print(".......................................") 

# run train and evaluate experiment
tf.estimator.train_and_evaluate(
  dnn_estimator,
  train_spec,
  eval_spec
)


time_end = datetime.utcnow() 
print(".......................................")
print("Experiment finished at {}".format(time_end.strftime("%H:%M:%S")))
print("")
time_elapsed = time_end - time_start
print("Experiment elapsed time: {} seconds".format(time_elapsed.total_seconds()))
    


In [None]:
%%bash

ls models/babyweight/dnn_estimator/export/estimate


### >> TensorBoard - Stop

In [None]:
# to stop TensorBoard, replace 23002 with the pid shown by TensorBoard().list()
TensorBoard().stop(23002)
print('stopped TensorBoard')
TensorBoard().list()

### 6- Use SavedModel for Predictions

In [None]:
saved_model_base_dir=os.path.join(model_dir,'export/estimate')
SAVED_MODEL_DIR=os.path.join(saved_model_base_dir, os.listdir(saved_model_base_dir)[0])

def estimate_local(instance):
 
    predictor_fn = tf.contrib.predictor.from_saved_model(
        export_dir=SAVED_MODEL_DIR,
        signature_def_key="predict"
    )
    
    instance = dict((k, [v]) for k, v in instance.items())
    value = predictor_fn(instance)['predictions'][0][0]
    return value

instance = {
        'is_male': 'True',
        'mother_age': 26.0,
        'mother_race': 'Asian Indian',
        'plurality': 1.0,
        'gestation_weeks': 39,
        'mother_married': 'True',
        'cigarette_use': 'False',
        'alcohol_use': 'False'
}

prediction = estimate_local(instance)
print(prediction)
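
`os.listdir` returns entries in arbitrary order, so taking the first entry is only safe while a single export exists (as with `exports_to_keep=1`). Each SavedModel is exported to a sub-directory named with a numeric timestamp, so the most recent one can be selected explicitly. The `latest_export_dir` helper below is a hypothetical sketch, demonstrated on a scratch directory with made-up timestamps rather than on the real export path.

```python
import os
import shutil
import tempfile


def latest_export_dir(export_base_dir):
    """Return the sub-directory with the largest numeric (timestamp) name."""
    candidates = [
        name for name in os.listdir(export_base_dir)
        if name.isdigit() and os.path.isdir(os.path.join(export_base_dir, name))
    ]
    if not candidates:
        raise ValueError("No exported model found in {}".format(export_base_dir))
    return os.path.join(export_base_dir, max(candidates, key=int))


# Demonstration with made-up timestamps in a temporary directory.
base = tempfile.mkdtemp()
for stamp in ["1530000000", "1530000500", "1529999000"]:
    os.mkdir(os.path.join(base, stamp))

latest = latest_export_dir(base)
print(os.path.basename(latest))
shutil.rmtree(base)
```

The same helper could be pointed at `export/estimate` or `export/evaluate` under `model_dir`, where repeated runs may leave several exports side by side.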

## 5. Evaluate the model using TFMA

In [None]:
import tensorflow_model_analysis as tfma

### 5.1 Evaluate input function

In [None]:
def generate_eval_receiver_fn(transform_artefacts_dir):
    
    transformed_metadata = metadata_io.read_metadata(transform_artefacts_dir+"/transformed_metadata")
    transformed_feature_spec = transformed_metadata.schema.as_feature_spec()
    
    def _eval_receiver_fn():
        
        serialized_tf_example = tf.placeholder(
            dtype=tf.string, shape=[None], name='input_example_placeholder')

        receiver_tensors = {'examples': serialized_tf_example}
        transformed_features = tf.parse_example(serialized_tf_example, transformed_feature_spec)

        return tfma.export.EvalInputReceiver(
            features=transformed_features,
            receiver_tensors=receiver_tensors,
            labels=transformed_features[TARGET_FEATURE_NAME])

    return _eval_receiver_fn

### 5.2 Export Evaluation Saved Model

In [None]:
eval_model_dir = model_dir +"/export/evaluate"

transform_artefacts_dir = os.path.join(local_models_dir,TRANSFORM_ARTEFACTS_DIR)

tfma.export.export_eval_savedmodel(
        estimator=dnn_estimator,
        export_dir_base=eval_model_dir,
        eval_input_receiver_fn=generate_eval_receiver_fn(transform_artefacts_dir)
)

### 5.3 Produce Evaluation Results using the Saved Model

In [None]:
slice_spec = [tfma.SingleSliceSpec()]
for feature_name, feature_spec in transformed_feature_spec.items():
    if feature_name not in [KEY_COLUMN] + [TARGET_FEATURE_NAME] and feature_spec.dtype == tf.string:
        slice_spec += [tfma.SingleSliceSpec(columns=[feature_name])]

# print slice_spec
# print ""

saved_model_base_dir=os.path.join(model_dir,'export/evaluate')
model_location=os.path.join(saved_model_base_dir, os.listdir(saved_model_base_dir)[0])
data_location = os.path.join(local_data_dir, TRANSFORMED_DATA_DIR)+"/eval-*.tfrecords"

tf.logging.set_verbosity(tf.logging.ERROR)

eval_results = tfma.run_model_analysis(
    model_location=model_location , 
    data_location=data_location, 
    file_format='tfrecords', 
    slice_spec=slice_spec, 
#     output_path=None
)

print("Evaluation results are ready!")

### 5.4 Visualise the Results

In [None]:
print(eval_results.slicing_metrics)

In [None]:
tfma.view.render_slicing_metrics(
    eval_results,
    slicing_column='mother_age_bucketized'
)

## 6. Train the Model on Cloud ML Engine

In [None]:
%%bash

echo "Submitting a Cloud ML Engine job..."

REGION=europe-west1
TIER=BASIC # BASIC | BASIC_GPU | STANDARD_1 | PREMIUM_1
BUCKET=ksalama-gcs-cloudml

MODEL_NAME="babyweight_estimator"

PACKAGE_PATH=packages/babyweight-tf1.4/trainer
TRAIN_FILES=gs://${BUCKET}/data/babyweight/train-data.csv
VALID_FILES=gs://${BUCKET}/data/babyweight/eval-data.csv
MODEL_DIR=gs://${BUCKET}/models/babyweight/${MODEL_NAME}

#remove model directory, if you don't want to resume training, or if you have changed the model structure
#gsutil -m rm -r ${MODEL_DIR}

CURRENT_DATE=`date +%Y%m%d_%H%M%S`
JOB_NAME=train_${MODEL_NAME}_${TIER}_${CURRENT_DATE}

gcloud ml-engine jobs submit training ${JOB_NAME} \
        --job-dir=${MODEL_DIR} \
        --runtime-version=1.4 \
        --region=${REGION} \
        --scale-tier=${TIER} \
        --module-name=trainer.task \
        --package-path=${PACKAGE_PATH} \
        -- \
        --train-files=${TRAIN_FILES} \
        --num-epochs=100 \
        --train-batch-size=500 \
        --eval-files=${VALID_FILES} \
        --eval-batch-size=500 \
        --learning-rate=0.01 \
        --hidden-units="64,0,0" \
        --layer-sizes-scale-factor=0.5 \
        --num-layers=3 \
        --job-dir=${MODEL_DIR}

### Train the Model on Cloud ML Engine + GPUs

In [None]:
%%bash

echo "Submitting a Cloud ML Engine job..."

REGION=europe-west1
TIER=BASIC_GPU # BASIC | BASIC_GPU | STANDARD_1 | PREMIUM_1
BUCKET=ksalama-gcs-cloudml

MODEL_NAME="babyweight_estimator"

PACKAGE_PATH=packages/babyweight-tf1.4/trainer
TRAIN_FILES=gs://${BUCKET}/data/babyweight/train-*.csv
VALID_FILES=gs://${BUCKET}/data/babyweight/eval-*.csv
MODEL_DIR=gs://${BUCKET}/models/babyweight/${MODEL_NAME}_${TIER}

CURRENT_DATE=`date +%Y%m%d_%H%M%S`
JOB_NAME=train_${MODEL_NAME}_${TIER}_${CURRENT_DATE}

gcloud ml-engine jobs submit training ${JOB_NAME} \
        --job-dir=${MODEL_DIR} \
        --runtime-version=1.4 \
        --region=${REGION} \
        --scale-tier=${TIER} \
        --module-name=trainer.task \
        --package-path=${PACKAGE_PATH} \
        -- \
        --train-files=${TRAIN_FILES} \
        --num-epochs=10 \
        --train-batch-size=1000 \
        --eval-files=${VALID_FILES} \
        --eval-batch-size=1000 \
        --learning-rate=0.01 \
        --hidden-units="64,0,0" \
        --layer-sizes-scale-factor=0.5 \
        --num-layers=3 \
        --job-dir=${MODEL_DIR}

### Train the Model on Cloud ML Engine + Custom GPUs Cluster

In [None]:
%%bash

echo "Submitting a Cloud ML Engine job..."

REGION=europe-west1
TIER=CUSTOM # used only in MODEL_DIR and JOB_NAME here; the cluster is configured via --config
BUCKET=ksalama-gcs-cloudml

MODEL_NAME="babyweight_estimator"

PACKAGE_PATH=packages/babyweight-tf1.4/trainer
TRAIN_FILES=gs://${BUCKET}/data/babyweight/big_data/train-*.csv
VALID_FILES=gs://${BUCKET}/data/babyweight/big_data/eval-*.csv
MODEL_DIR=gs://${BUCKET}/models/babyweight/${MODEL_NAME}_${TIER}

CURRENT_DATE=`date +%Y%m%d_%H%M%S`
JOB_NAME=train_${MODEL_NAME}_${TIER}_${CURRENT_DATE}

gcloud ml-engine jobs submit training ${JOB_NAME} \
        --job-dir=${MODEL_DIR} \
        --runtime-version=1.4 \
        --region=${REGION} \
        --module-name=trainer.task \
        --package-path=${PACKAGE_PATH} \
        --config=ml-packages/babyweight-tf1.4/custom.yaml \
        -- \
        --train-files=${TRAIN_FILES} \
        --num-epochs=100 \
        --train-batch-size=1000 \
        --eval-files=${VALID_FILES} \
        --eval-batch-size=1000 \
        --learning-rate=0.001 \
        --hidden-units="64,0,0" \
        --layer-sizes-scale-factor=0.5 \
        --num-layers=3 \
        --job-dir=${MODEL_DIR}

### Hyper-parameters Tuning on Cloud ML Engine

In [None]:
%%bash

echo "Submitting a Cloud ML Engine job..."

REGION=europe-west1
BUCKET=ksalama-gcs-cloudml

MODEL_NAME="babyweight_estimator"

PACKAGE_PATH=packages/babyweight-tf1.4/trainer
TRAIN_FILES=gs://${BUCKET}/data/babyweight/big_data/train-*.csv
VALID_FILES=gs://${BUCKET}/data/babyweight/big_data/eval-*.csv
MODEL_DIR=gs://${BUCKET}/models/babyweight/${MODEL_NAME}_tune

CURRENT_DATE=`date +%Y%m%d_%H%M%S`
JOB_NAME=tune_${MODEL_NAME}_${CURRENT_DATE}

gcloud ml-engine jobs submit training ${JOB_NAME} \
        --job-dir=${MODEL_DIR} \
        --runtime-version=1.4 \
        --region=${REGION} \
        --module-name=trainer.task \
        --package-path=${PACKAGE_PATH} \
        --config=ml-packages/babyweight-tf1.4/hyperparams.yaml \
        -- \
        --train-files=${TRAIN_FILES} \
        --num-epochs=100 \
        --train-batch-size=1000 \
        --eval-files=${VALID_FILES} \
        --eval-batch-size=1000 \
        --job-dir=${MODEL_DIR}

## 7. Deploy the Model on Cloud ML Engine

In [None]:
%%bash

REGION=europe-west1
BUCKET=ksalama-gcs-cloudml

MODEL_NAME="babyweight_estimator"
MODEL_VERSION="v1"

MODEL_BINARIES=$(gsutil ls gs://${BUCKET}/models/babyweight/${MODEL_NAME}/export/estimate | tail -1)

gsutil ls ${MODEL_BINARIES}

# #delete model version
# gcloud ml-engine versions delete ${MODEL_VERSION} --model=${MODEL_NAME}

# #delete model
# gcloud ml-engine models delete ${MODEL_NAME}

# #deploy model to GCP
# gcloud ml-engine models create ${MODEL_NAME} --regions=${REGION}

# #deploy model version
# gcloud ml-engine versions create ${MODEL_VERSION} --model=${MODEL_NAME} --origin=${MODEL_BINARIES} --runtime-version=1.4

# echo  ${MODEL_NAME} ${MODEL_VERSION} 
# #invoke deployed model to make prediction given new data instances
# gcloud ml-engine predict --model=${MODEL_NAME} --version=${MODEL_VERSION} --json-instances=data/babyweight/new-data.json
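
The commented `gcloud ml-engine predict` command reads `data/babyweight/new-data.json`, a file that this section does not create. `--json-instances` expects newline-delimited JSON, with one instance per line. A minimal sketch for producing such a file follows. The `write_json_instances` helper is hypothetical, and the example writes to a temporary path so that it does not touch the real data directory.

```python
import json
import os
import tempfile


def write_json_instances(instances, path):
    """Write one JSON object per line, the layout --json-instances expects."""
    with open(path, "w") as f:
        for instance in instances:
            f.write(json.dumps(instance) + "\n")


# Same raw feature names as the prediction examples in this notebook.
instances = [
    {
        "is_male": "True",
        "mother_age": 26.0,
        "mother_race": "Asian Indian",
        "plurality": 1.0,
        "gestation_weeks": 39,
        "mother_married": "True",
        "cigarette_use": "False",
        "alcohol_use": "False",
    },
]

path = os.path.join(tempfile.mkdtemp(), "new-data.json")
write_json_instances(instances, path)

with open(path) as f:
    lines = f.read().splitlines()
print(len(lines))
```

Note that the file is not a JSON array: wrapping the instances in `[...]` or pretty-printing them across several lines would break the one-instance-per-line layout.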

## 8. Consume the Deployed Model as an API

In [None]:
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials

def estimate(project, model_name, version, instances):

    credentials = GoogleCredentials.get_application_default()
    api = discovery.build('ml', 'v1', credentials=credentials,
                discoveryServiceUrl='https://storage.googleapis.com/cloud-ml/discovery/ml_v1_discovery.json')

    request_data = {'instances': instances}

    model_url = 'projects/{}/models/{}/versions/{}'.format(project, model_name, version)
    response = api.projects().predict(body=request_data, name=model_url).execute()

    estimates = [round(item["scores"], 2) for item in response["predictions"]]

    return estimates

In [None]:
PROJECT='ksalama-gcp-playground'
MODEL_NAME='babyweight_estimator'
VERSION='v1'

instances = [
      {
        'is_male': 'True',
        'mother_age': 26.0,
        'mother_race': 'Asian Indian',
        'plurality': 1.0,
        'gestation_weeks': 39,
        'mother_married': 'True',
        'cigarette_use': 'False',
        'alcohol_use': 'False'
      },
      {
        'is_male': 'False',
        'mother_age': 29.0,
        'mother_race': 'Asian Indian',
        'plurality': 1.0,
        'gestation_weeks': 38,
        'mother_married': 'True',
        'cigarette_use': 'False',
        'alcohol_use': 'False'
      },
      {
        'is_male': 'True',
        'mother_age': 26.0,
        'mother_race': 'White',
        'plurality': 1.0,
        'gestation_weeks': 39,
        'mother_married': 'True',
        'cigarette_use': 'False',
        'alcohol_use': 'False'
      },
      {
        'is_male': 'True',
        'mother_age': 26.0,
        'mother_race': 'White',
        'plurality': 2.0,
        'gestation_weeks': 37,
        'mother_married': 'True',
        'cigarette_use': 'False',
        'alcohol_use': 'True'
      }
  ]

estimates = estimate(instances=instances
                     ,project=PROJECT
                     ,model_name=MODEL_NAME
                     ,version=VERSION)

print(estimates)

# The End :-)