<h1> 2d. Distributed training and monitoring </h1>

In this notebook, we refactor the code to call `train_and_evaluate` instead of hand-coding our ML pipeline. This lets us run evaluation as part of the training loop rather than as a separate step. It also adds the failure handling that distributed training requires.

We also use TensorBoard to monitor the training.

In [11]:
import shutil
import numpy as np
import tensorflow as tf
print(tf.__version__)

1.8.0


<h2> Input </h2>

Read the data created in Lab1a, but this time in a more general way, so that we read it in batches. Instead of Pandas, we use the `tf.data` Dataset API.
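
To make "reading in batches" concrete, here is a minimal plain-Python sketch of the idea. It is not the TensorFlow implementation: the helper name `batches` and the toy rows are assumptions made for illustration. The generator repeats the data for a fixed number of epochs (or indefinitely when `num_epochs` is `None`) and yields fixed-size chunks, with a final partial chunk at end of input.

```python
import itertools

def batches(rows, batch_size, num_epochs=1):
    """Yield lists of up to batch_size rows; num_epochs=None repeats forever."""
    epochs = itertools.count() if num_epochs is None else range(num_epochs)
    stream = (row for _ in epochs for row in rows)
    while True:
        batch = list(itertools.islice(stream, batch_size))
        if not batch:
            return  # end of input
        yield batch

rows = ['r0', 'r1', 'r2', 'r3', 'r4']  # toy stand-ins for CSV lines

# Evaluation-style: one pass over the data, then end of input.
eval_batches = list(batches(rows, batch_size=2, num_epochs=1))
print(eval_batches)

# Training-style: repeat indefinitely, so the caller decides when to stop.
train_batches = list(itertools.islice(batches(rows, batch_size=2, num_epochs=None), 4))
print(train_batches)
```

Because repetition happens before batching in this sketch, a training batch can straddle the boundary between two passes over the data.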

In [12]:
CSV_COLUMNS = ['fare_amount', 'pickuplon','pickuplat','dropofflon','dropofflat','passengers', 'key']
LABEL_COLUMN = 'fare_amount'
DEFAULTS = [[0.0], [-74.0], [40.0], [-74.0], [40.7], [1.0], ['nokey']]

def read_dataset(filename, mode, batch_size = 512):
  def _input_fn():
    def decode_csv(value_column):
      columns = tf.decode_csv(value_column, record_defaults = DEFAULTS)
      features = dict(zip(CSV_COLUMNS, columns))
      label = features.pop(LABEL_COLUMN)
      return features, label
    
    # Create list of files that match pattern
    file_list = tf.gfile.Glob(filename)

    # Create dataset from file list
    dataset = tf.data.TextLineDataset(file_list).map(decode_csv)
    
    if mode == tf.estimator.ModeKeys.TRAIN:
      num_epochs = None  # repeat indefinitely
      dataset = dataset.shuffle(buffer_size = 10 * batch_size)
    else:
      num_epochs = 1  # end-of-input after one pass

    dataset = dataset.repeat(num_epochs).batch(batch_size)
    return dataset.make_one_shot_iterator().get_next()
  return _input_fn
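
As a rough illustration of what `decode_csv` does to each line, the following plain-Python sketch mimics its behavior on a single row. It is an analogue, not TensorFlow code: the helper name `decode_line` and the example row are made up, and the defaults are written as scalars rather than one-element lists. The point it shows is that each default both fixes the column's type and fills in an empty field, and that the label is split off from the feature dictionary.

```python
import csv

CSV_COLUMNS = ['fare_amount', 'pickuplon', 'pickuplat', 'dropofflon',
               'dropofflat', 'passengers', 'key']
LABEL_COLUMN = 'fare_amount'
DEFAULTS = [0.0, -74.0, 40.0, -74.0, 40.7, 1.0, 'nokey']

def decode_line(line):
    """Plain-Python analogue of decode_csv for a single CSV line."""
    fields = next(csv.reader([line]))
    values = []
    for raw, default in zip(fields, DEFAULTS):
        if raw == '':
            values.append(default)  # empty field: fall back to the default
        else:
            values.append(type(default)(raw))  # the default's type sets the column type
    features = dict(zip(CSV_COLUMNS, values))
    label = features.pop(LABEL_COLUMN)
    return features, label

# Made-up example row with the passengers field left empty.
features, label = decode_line('12.5,-73.98,40.75,-73.99,40.76,,row1')
print(label)
print(features['passengers'], features['key'])
```

Note that `key` stays in the feature dictionary here, as it does in the notebook's `decode_csv`; it is simply not used by any feature column.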

<h2> Create features out of input data </h2>

For now, pass the input columns through unchanged (same as in the previous lab).

In [13]:
INPUT_COLUMNS = [
    tf.feature_column.numeric_column('pickuplon'),
    tf.feature_column.numeric_column('pickuplat'),
    tf.feature_column.numeric_column('dropofflat'),
    tf.feature_column.numeric_column('dropofflon'),
    tf.feature_column.numeric_column('passengers'),
]

def add_more_features(feats):
  # Nothing to add (yet!)
  return feats

feature_cols = add_more_features(INPUT_COLUMNS)

<h2> train_and_evaluate </h2>

In [14]:
def serving_input_fn():
  feature_placeholders = {
    'pickuplon' : tf.placeholder(tf.float32, [None]),
    'pickuplat' : tf.placeholder(tf.float32, [None]),
    'dropofflat' : tf.placeholder(tf.float32, [None]),
    'dropofflon' : tf.placeholder(tf.float32, [None]),
    'passengers' : tf.placeholder(tf.float32, [None]),
  }
  features = {
      key: tf.expand_dims(tensor, -1)
      for key, tensor in feature_placeholders.items()
  }
  return tf.estimator.export.ServingInputReceiver(features, feature_placeholders)
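
In `serving_input_fn`, `tf.expand_dims(tensor, -1)` adds a trailing axis of size 1, so each placeholder of shape `[None]` becomes a feature of shape `[None, 1]`. The plain-Python sketch below shows the same reshaping on nested lists. The helper name `expand_last_dim` and the request values are hypothetical and serve only as illustration.

```python
def expand_last_dim(values):
    """Turn a flat batch [v0, v1, ...] into a column [[v0], [v1], ...]."""
    return [[v] for v in values]

# Hypothetical request carrying two trips, one flat list per input name.
request = {
    'pickuplon': [-73.98, -73.95],
    'pickuplat': [40.75, 40.78],
    'dropofflon': [-73.99, -73.97],
    'dropofflat': [40.76, 40.74],
    'passengers': [1.0, 2.0],
}

# Mirrors the dict comprehension in serving_input_fn.
features = {key: expand_last_dim(vals) for key, vals in request.items()}
print(features['passengers'])
```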

In [15]:
def train_and_evaluate(output_dir, num_train_steps):
  estimator = tf.estimator.LinearRegressor(
                       model_dir = output_dir,
                       feature_columns = feature_cols)
  train_spec=tf.estimator.TrainSpec(
                       input_fn = read_dataset('./taxi-train.csv', mode = tf.estimator.ModeKeys.TRAIN),
                       max_steps = num_train_steps)
  exporter = tf.estimator.LatestExporter('exporter', serving_input_fn)
  eval_spec=tf.estimator.EvalSpec(
                       input_fn = read_dataset('./taxi-valid.csv', mode = tf.estimator.ModeKeys.EVAL),
                       steps = None,
                       start_delay_secs = 1, # start evaluating after N seconds
                       throttle_secs = 10,  # evaluate every N seconds
                       exporters = exporter)
  tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
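
The role of `throttle_secs` in a local run can be pictured with a small simulation. This is a simplified model, not TensorFlow code, and it is based on the log message this TensorFlow version prints ("The evaluate will happen after 10 secs (eval_spec.throttle_secs) or training is finished"): train for one time budget, evaluate, and repeat until `max_steps` is reached. The function name `simulate_eval_steps` and the training rate of 40 steps per second are assumptions, and the model ignores graph-building and checkpointing overhead.

```python
def simulate_eval_steps(max_steps, steps_per_sec, throttle_secs):
    """Return the global step reached before each evaluation round."""
    steps_per_round = int(steps_per_sec * throttle_secs)
    if steps_per_round <= 0:
        raise ValueError('steps_per_sec * throttle_secs must be at least 1')
    eval_steps = []
    step = 0
    while step < max_steps:
        # Train until the time budget or max_steps is hit, whichever comes first.
        step = min(max_steps, step + steps_per_round)
        eval_steps.append(step)  # then evaluate the checkpoint just written
    return eval_steps

eval_steps = simulate_eval_steps(max_steps=5000, steps_per_sec=40, throttle_secs=10)
print(len(eval_steps), eval_steps[:3], eval_steps[-1])
```

Under these assumptions, a smaller `throttle_secs` means more frequent evaluations and exports, at the cost of more time spent outside training.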

In [16]:
# Run training    
OUTDIR = 'taxi_trained'
shutil.rmtree(OUTDIR, ignore_errors = True) # start fresh each time
train_and_evaluate(OUTDIR, num_train_steps = 5000)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_master': '', '_keep_checkpoint_every_n_hours': 10000, '_global_id_in_cluster': 0, '_is_chief': True, '_task_type': 'worker', '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f8488b5cb38>, '_num_worker_replicas': 1, '_tf_random_seed': None, '_save_summary_steps': 100, '_session_config': None, '_evaluation_master': '', '_model_dir': 'taxi_trained', '_task_id': 0, '_service': None, '_num_ps_replicas': 0, '_keep_checkpoint_max': 5, '_save_checkpoints_steps': None, '_log_step_count_steps': 100, '_save_checkpoints_secs': 600, '_train_distribute': None}
INFO:tensorflow:Running training and evaluation locally (non-distributed).
INFO:tensorflow:Start train and evaluate loop. The evaluate will happen after 10 secs (eval_spec.throttle_secs) or training is finished.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow

INFO:tensorflow:'regression' : Regression input must be a single string Tensor; got {'dropofflon': <tf.Tensor 'Placeholder_3:0' shape=(?,) dtype=float32>, 'pickuplat': <tf.Tensor 'Placeholder_1:0' shape=(?,) dtype=float32>, 'passengers': <tf.Tensor 'Placeholder_4:0' shape=(?,) dtype=float32>, 'dropofflat': <tf.Tensor 'Placeholder_2:0' shape=(?,) dtype=float32>, 'pickuplon': <tf.Tensor 'Placeholder:0' shape=(?,) dtype=float32>}
INFO:tensorflow:'serving_default' : Regression input must be a single string Tensor; got {'dropofflon': <tf.Tensor 'Placeholder_3:0' shape=(?,) dtype=float32>, 'pickuplat': <tf.Tensor 'Placeholder_1:0' shape=(?,) dtype=float32>, 'passengers': <tf.Tensor 'Placeholder_4:0' shape=(?,) dtype=float32>, 'dropofflat': <tf.Tensor 'Placeholder_2:0' shape=(?,) dtype=float32>, 'pickuplon': <tf.Tensor 'Placeholder:0' shape=(?,) dtype=float32>}
INFO:tensorflow:Restoring parameters from taxi_trained/model.ckpt-1120
INFO:tensorflow:Assets added to graph.
INFO:tensorflow:No asse

INFO:tensorflow:Saving dict for global step 2214: average_loss = 94.46533, global_step = 2214, loss = 37101.258
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Signatures INCLUDED in export for Regress: None
INFO:tensorflow:Signatures INCLUDED in export for Classify: None
INFO:tensorflow:Signatures INCLUDED in export for Predict: ['predict']
INFO:tensorflow:Signatures EXCLUDED from export because they cannot be be served via TensorFlow Serving APIs:
INFO:tensorflow:'regression' : Regression input must be a single string Tensor; got {'dropofflon': <tf.Tensor 'Placeholder_3:0' shape=(?,) dtype=float32>, 'pickuplat': <tf.Tensor 'Placeholder_1:0' shape=(?,) dtype=float32>, 'passengers': <tf.Tensor 'Placeholder_4:0' shape=(?,) dtype=float32>, 'dropofflat': <tf.Tensor 'Placeholder_2:0' shape=(?,) dtype=float32>, 'pickuplon': <tf.Tensor 'Placeholder:0' shape=(?,) dtype=float32>}
INFO:tensorflow:'serving_default' : Regression input must be a single stri

INFO:tensorflow:Loss for final step: 41612.195.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2019-07-17-20:22:27
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from taxi_trained/model.ckpt-3333
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2019-07-17-20:22:27
INFO:tensorflow:Saving dict for global step 3333: average_loss = 94.69978, global_step = 3333, loss = 37193.34
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Signatures INCLUDED in export for Regress: None
INFO:tensorflow:Signatures INCLUDED in export for Classify: None
INFO:tensorflow:Signatures INCLUDED in export for Predict: ['predict']
INFO:tensorflow:Signatures EXCLUDED from export because they cannot be be served via TensorFlow Serving APIs:
INFO:tensorflow:'regression' : Regression input must be a single string Tens

INFO:tensorflow:Saving checkpoints for 4067 into taxi_trained/model.ckpt.
INFO:tensorflow:loss = 33137.59, step = 4067
INFO:tensorflow:global_step/sec: 39.7572
INFO:tensorflow:loss = 47440.008, step = 4167 (2.524 sec)
INFO:tensorflow:global_step/sec: 41.7199
INFO:tensorflow:loss = 50062.902, step = 4267 (2.395 sec)
INFO:tensorflow:global_step/sec: 40.599
INFO:tensorflow:loss = 45026.508, step = 4367 (2.463 sec)
INFO:tensorflow:Saving checkpoints for 4439 into taxi_trained/model.ckpt.
INFO:tensorflow:Loss for final step: 53195.02.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2019-07-17-20:23:01
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from taxi_trained/model.ckpt-4439
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2019-07-17-20:23:01
INFO:tensorflow:Saving dict for global step 4439: average_loss = 94.457214, global
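
The evaluation lines in this output can be turned into a more readable metric. `LinearRegressor` trains with squared error by default, so `average_loss` is a mean squared error and its square root is the RMSE in the units of `fare_amount`. The sketch below parses two evaluation lines copied from the output above. The regular expression and the helper name `eval_rmse` are assumptions made for illustration, not part of any TensorFlow API.

```python
import math
import re

# Two evaluation lines copied from the output above.
log_lines = [
    'INFO:tensorflow:Saving dict for global step 2214: average_loss = 94.46533, global_step = 2214, loss = 37101.258',
    'INFO:tensorflow:Saving dict for global step 3333: average_loss = 94.69978, global_step = 3333, loss = 37193.34',
]

PATTERN = re.compile(r'Saving dict for global step (\d+): average_loss = ([0-9.]+)')

def eval_rmse(lines):
    """Return {global_step: RMSE} for every evaluation line found."""
    results = {}
    for line in lines:
        match = PATTERN.search(line)
        if match:
            step, avg_loss = int(match.group(1)), float(match.group(2))
            results[step] = math.sqrt(avg_loss)  # RMSE = sqrt(mean squared error)
    return results

rmse_by_step = eval_rmse(log_lines)
for step, rmse in sorted(rmse_by_step.items()):
    print('step {}: RMSE = {:.2f}'.format(step, rmse))
```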

<h2> Monitoring with TensorBoard </h2>

In [17]:
from google.datalab.ml import TensorBoard
TensorBoard().start('./taxi_trained')
TensorBoard().list()

           logdir    pid   port
0  ./taxi_trained   7826  50361
1  ./taxi_trained  11780  41303


In [18]:
# to stop TensorBoard
for pid in TensorBoard.list()['pid']:
    TensorBoard().stop(pid)
    print('Stopped TensorBoard with pid {}'.format(pid))

Stopped TensorBoard with pid 7826
Stopped TensorBoard with pid 11780


Copyright 2017 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.