## Training

Training requires a Python package, in tarball form, that contains your TensorFlow-based training program. Although CloudML provides several general-purpose model trainers, this sample uses a package written to train the Iris sample.



### Training Package

You can use an existing tarball package (stored locally or in GCS) or build your own. To define a Python module in the notebook, use "%%ml module". The following two cells define two Python modules: "iris" and "task".

In [5]:
%%ml module --name iris

import google.cloud.ml.features as features
import json
import math
import tensorflow as tf

def read_examples(input_files, batch_size, shuffle, num_epochs=None):
  # The minimum number of instances in a queue from which examples are drawn
  # randomly. The larger this number, the more randomness at the expense of
  # higher memory requirements.
  MIN_AFTER_DEQUEUE = 100
  # When batching data, the queue's capacity will be larger than the batch_size
  # by some factor. The recommended formula is (num_threads + a small safety
  # margin). For now, we use a single thread for reading, so this can be small.
  QUEUE_SIZE_MULTIPLIER = 3

  # Convert num_epochs == 0 -> num_epochs is None, if necessary
  num_epochs = num_epochs or None

  # Build a queue of the filenames to be read.
  filename_queue = tf.train.string_input_producer(input_files, num_epochs,
                                                  shuffle)
  options = tf.python_io.TFRecordOptions(
      compression_type=tf.python_io.TFRecordCompressionType.ZLIB)
  example_id, encoded_example = tf.TFRecordReader(options=options).read(
      filename_queue)

  if shuffle:
    capacity = MIN_AFTER_DEQUEUE + QUEUE_SIZE_MULTIPLIER * batch_size
    return tf.train.shuffle_batch([example_id, encoded_example], batch_size,
                                  capacity, MIN_AFTER_DEQUEUE)
  else:
    capacity = QUEUE_SIZE_MULTIPLIER * batch_size
    return tf.train.batch([example_id, encoded_example],
                          batch_size,
                          capacity=capacity)

def create_inputs(metadata, input_data=None):
  with tf.name_scope('inputs'):
    if input_data is None:
      input_data = tf.placeholder(tf.string, name='input', shape=(None,))
    parsed = features.FeatureMetadata.parse_features(metadata, input_data)
    return (input_data, parsed['measurements'], tf.squeeze(parsed['species']),
            tf.identity(parsed['key']))

def inference(measurements, metadata, hidden_layer_size):
  input_size = metadata.features['measurements']['size']
  output_size = hidden_layer_size

  with tf.name_scope('hidden_layer'):
    initial_weights = tf.truncated_normal([ input_size, output_size ],
                                          stddev = 1.0 / math.sqrt(input_size))
    weights = tf.Variable(initial_weights, name = 'weights')

    initial_biases = tf.zeros([ output_size ])
    biases = tf.Variable(initial_biases, name = 'biases')

    xw = tf.matmul(measurements, weights)

    xwb = tf.nn.bias_add(xw, biases)
    hidden_layer = tf.nn.relu(xwb)

  input_size = hidden_layer_size
  output_size = metadata.labels

  with tf.name_scope('logits_layer'):
    initial_weights = tf.truncated_normal([ input_size, output_size ],
                                          stddev = 1.0 / math.sqrt(input_size))
    weights = tf.Variable(initial_weights, name = 'weights')

    initial_biases = tf.zeros([ output_size ])
    biases = tf.Variable(initial_biases, name = 'biases')

    return tf.nn.bias_add(tf.matmul(hidden_layer, weights), biases, name='logits')

def loss(logits, labels):
  labels = tf.to_int64(labels)
  cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(
      logits, labels, name='xentropy')
  loss = tf.reduce_mean(cross_entropy, name='loss')
  return loss

def create_outputs(logits):
  with tf.name_scope('outputs'):
    scores = tf.nn.softmax(logits, name = 'scores')
    predictions = tf.arg_max(logits, 1, name = 'prediction')
    return scores, predictions

def training(loss, learning_rate):
  with tf.name_scope('train'):
    tf.scalar_summary(loss.op.name, loss)
    global_step = tf.Variable(0, name='global_step', trainable=False)
    optimizer = tf.train.AdagradOptimizer(learning_rate)
    train_op = optimizer.minimize(loss, global_step)
    return train_op, global_step

def evaluation(logits, labels):
  # For a classifier model, we can use the in_top_k Op.
  # It returns a bool tensor with shape [batch_size] that is true for
  # the examples where the label is in the top k (here k=1)
  # of all logits for that example.
  correct = tf.nn.in_top_k(logits, labels, 1)
  # Return the number of true entries.
  return tf.reduce_sum(tf.cast(correct, tf.int32))
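
The `loss` and `evaluation` functions above rely on TensorFlow ops. As a rough illustration of the arithmetic they perform, here is a minimal pure-Python sketch of mean softmax cross-entropy and a top-1 correct count. The helper names are hypothetical and not part of the iris module, and tie handling may differ from `tf.nn.in_top_k`.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def mean_cross_entropy(logits, labels):
    """Average of -log(p[label]) over the batch (integer class labels)."""
    losses = [-math.log(softmax(row)[label])
              for row, label in zip(logits, labels)]
    return sum(losses) / len(losses)

def count_correct(logits, labels):
    """Number of examples whose highest logit sits at the label index."""
    return sum(1 for row, label in zip(logits, labels)
               if row.index(max(row)) == label)

# Made-up logits for three examples and three classes.
batch_logits = [[2.0, 1.0, 0.1], [0.1, 0.2, 3.0], [1.0, 5.0, 0.0]]
batch_labels = [0, 2, 0]
batch_loss = mean_cross_entropy(batch_logits, batch_labels)
batch_correct = count_correct(batch_logits, batch_labels)
```

With uniform logits the loss for a single example equals the natural log of the number of classes, which is a handy sanity check when a model has not learned anything yet.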

Define the "task" module. The "--main" flag marks it as the entry point of the program.

In [2]:
%%ml module --name task --main

import argparse
import itertools
import json
import logging
import os
import random
import sys
import subprocess
import tempfile
import time

import iris
import tensorflow as tf
from tensorflow.contrib.metrics.python.ops import metric_ops

import google.cloud.ml.features as features
import google.cloud.ml.util as cloudml_util

flags = tf.app.flags
FLAGS = flags.FLAGS


def main():
  config = json.loads(os.environ.get('TF_CONFIG', '{}'))
  cluster = config.get('cluster', None)
  # Default to an empty dict so task.get() below works when TF_CONFIG is unset.
  task = config.get('task') or {}
  job = config.get('job', None)
  trial_id = task.get('trial', '')
  
  parser = argparse.ArgumentParser()
  parser.add_argument("--train_data_paths", type=str, action='append')
  parser.add_argument("--eval_data_paths", type=str, action='append')
  parser.add_argument("--metadata_path", type=str)
  parser.add_argument("--output_path", type=str)
  parser.add_argument("--learning_rate", type=float, default=0.01)
  parser.add_argument("--hidden", type=int, default=10)
  parser.add_argument("--max_steps", type=int, default=2000)
  args = parser.parse_args()
  dispatch(args, cluster, task, job, trial_id)

EXPORT_SUBDIRECTORY = 'model'
BATCH_SIZE = 16
EVAL_SET_SIZE = 30
EVAL_INTERVAL_SECS = 10

def write_label_file(dir, args, trial_id):
  cloudml_util._file.create_directory(dir)
  label_file = os.path.join(dir, 'label')
  data = 'trial:' + trial_id + '/' if trial_id else ''
  data += 'learning_rate:%0.5f/hidden:%d' % (args.learning_rate, args.hidden)
  with tempfile.NamedTemporaryFile() as temp:
    # Write the label string built above, not the literal text 'data'.
    temp.write(data)
    temp.flush()
    subprocess.check_call(['gsutil', 'cp', temp.name, label_file])


def print_to_console(msg):
  print msg
  sys.stdout.flush()


def start_server(cluster, task):
  # Create and start a server.
  return tf.train.Server(cluster,
                         protocol="grpc",
                         job_name=task['type'],
                         task_index=task['index'])


def dispatch(args, cluster, task, job, trial_id):
  if not cluster:
    # Run locally.
    # run_training requires task_index; a local run has a single task.
    run_training(args, target="", is_chief=True, device_fn="",
                 task_index=0, trial_id=trial_id)
    return

  if task['type'] == "ps":
    server = start_server(cluster, task)
    server.join()
  elif task['type'] == "worker":
    server = start_server(cluster, task)
    is_chief = False
    device_fn = tf.train.replica_device_setter(
        ps_device="/job:ps",
        worker_device="/job:worker/task:%d" % task['index'],
        cluster=cluster)
    run_training(args, server.target, is_chief, device_fn, task['index'], trial_id)
  elif task['type'] == "master":
    server = start_server(cluster, task)
    is_chief = (task['index'] == 0)
    device_fn = tf.train.replica_device_setter(
        ps_device="/job:ps",
        worker_device="/job:master/task:%d" % task['index'],
        cluster=cluster)
    run_training(args, server.target, is_chief, device_fn, task['index'], trial_id)
  else:
    raise ValueError("invalid job_type %s" % task['type'])


def run_training(args, target, is_chief, device_fn, task_index, trial_id):

  output_path = os.path.join(args.output_path, trial_id)
  # Get the sets of examples and labels for training, validation, and
  # test on Iris.
  training_data = args.train_data_paths

  if is_chief:
    # A generator over accuracies. Each call to next(accuracies) forces an
    # evaluation of the model.
    accuracies = evaluate(args, trial_id)

  # Tell TensorFlow that the model will be built into the default Graph.
  with tf.Graph().as_default() as graph:
    # Assigns ops to the local worker by default.
    with tf.device(device_fn):

      metadata = features.FeatureMetadata.get_metadata(args.metadata_path)

      _, train_examples = iris.read_examples(
          training_data, BATCH_SIZE, shuffle=False)

      # Generate placeholders for the examples.
      placeholder, measurements, labels, _ = (
          iris.create_inputs(metadata, input_data=train_examples))

      # Build a Graph that computes predictions from the inference model.
      logits = iris.inference(measurements, metadata, args.hidden)

      # Add to the Graph the Ops for loss calculation.
      loss = iris.loss(logits, labels)

      # Add to the Graph the Ops that calculate and apply gradients.
      train_op, global_step = iris.training(loss, args.learning_rate)

      # Add the Op to compare the logits to the labels during evaluation.
      eval_correct = iris.evaluation(logits, labels)

      # Build the summary operation based on the TF collection of Summaries.
      summary_op = tf.merge_all_summaries()

      # Add the variable initializer Op.
      init_op = tf.initialize_all_variables()

      # Create a saver for writing training checkpoints.
      saver = tf.train.Saver()

      # Instantiate a SummaryWriter to output summaries and the Graph.
      summary_writer = tf.train.SummaryWriter(os.path.join(
          output_path, 'summaries'), graph)

      if is_chief:
        write_label_file(os.path.join(output_path, 'summaries'), args, trial_id)

      # Create a "supervisor", which oversees the training process.
      sv = tf.train.Supervisor(is_chief=is_chief,
                               logdir=os.path.join(output_path, 'logdir'),
                               init_op=init_op,
                               saver=saver,
                               summary_op=None,
                               global_step=global_step,
                               save_model_secs=60)

    print_to_console("Starting the loop.")
    device_filters = ["/job:ps", "/job:worker/task:%d" % (task_index)]
    config = tf.ConfigProto(device_filters=device_filters)

    should_retry = True
    while should_retry:
      try:
        should_retry = False
        with sv.managed_session(target, config=config) as sess:
          start_time = time.time()
          last_save = start_time

          # Loop until the supervisor shuts down or max_steps steps have completed.
          step = 0
          while not sv.should_stop() and step < args.max_steps:
            start_time = time.time()

            _, step, loss_value = sess.run([train_op, global_step, loss])

            duration = time.time() - start_time
            if is_chief and time.time() - last_save > EVAL_INTERVAL_SECS:
              last_save = time.time()
              saver.save(sess, sv.save_path, global_step)
              accuracy = next(accuracies)
              logging.info("Eval, step %d: accuracy = %0.3f", step, accuracy)
              print_to_console("Eval, step %d: accuracy = %0.3f" % (step, accuracy))

            # Write the summaries and log an overview fairly often.
            if step % 100 == 0 and is_chief:
              # Log status.
              logging.info("Step %d: loss = %.2f (%.3f sec)",
                           step, loss_value, duration)
              print_to_console("Step %d: loss = %.2f (%.3f sec)" % (step, loss_value, duration))

              # Update the events file.
              summary_str = sess.run(summary_op)
              summary_writer.add_summary(summary_str, step)
              summary_writer.flush()

          if is_chief:
            # Force a save at the end of our loop.
            sv.saver.save(sess, sv.save_path, global_step=global_step,
                          write_meta_graph=False)
            accuracy_value = next(accuracies)
            logging.info("Final accuracy after %d steps = %0.3f", step, accuracy_value)
            print_to_console("Final accuracy after %d steps = %0.3f" % (step, accuracy_value))

            # Save the model for inference
            export_model(args, sess, sv.saver, trial_id)
      except tf.errors.AbortedError:
        should_retry = True

    # Ask for all the services to stop.
    sv.stop()
    print_to_console("Done training.")


def export_model(args, sess, training_saver, trial_id):
  output_path = os.path.join(args.output_path, trial_id)
  with tf.Graph().as_default() as inference_graph:
    metadata = features.FeatureMetadata.get_metadata(args.metadata_path)
    placeholder, measurements, _, keys = iris.create_inputs(metadata)
    logits = iris.inference(measurements, metadata, args.hidden)
    scores, predictions = iris.create_outputs(logits)

    inference_saver = tf.train.Saver()

    # Mark the inputs and the outputs
    tf.add_to_collection("inputs",
                         json.dumps({"examples": placeholder.name}))
    tf.add_to_collection("outputs",
                         json.dumps({"score": scores.name,
                                     "predictions": predictions.name}))
    tf.add_to_collection("keys", json.dumps({"key": keys.name}))

    model_dir = os.path.join(output_path, EXPORT_SUBDIRECTORY)

    # Save a copy of the metadata file used for this model with the exported
    # model, so that online and batch prediction can use it.
    subprocess.check_call(['gsutil', 'cp', args.metadata_path,
                           os.path.join(model_dir, "metadata.yaml")])

    # We need to save the variables from the training session, but we need
    # to serialize the serving graph.

    # Serialize the graph (MetaGraphDef)
    inference_saver.export_meta_graph(
        filename=os.path.join(model_dir, "export.meta"))

    # Save the variables. Don't write the MetaGraphDef, because that is
    # actually the training graph.
    training_saver.save(sess,
                        os.path.join(model_dir, "export"),
                        write_meta_graph=False)


def evaluate(args, trial_id):
  """Run one round of evaluation, yielding accuracy."""
  output_path = os.path.join(args.output_path, trial_id)
  eval_data = args.eval_data_paths

  with tf.Graph().as_default() as g:
    metadata = features.FeatureMetadata.get_metadata(args.metadata_path)

    _, examples = iris.read_examples(
        eval_data, BATCH_SIZE,
        shuffle=False, num_epochs=1)

    # Generate placeholders for the examples.
    placeholder, measurements, labels, _ = (
        iris.create_inputs(metadata, input_data=examples))

    # Build a Graph that computes predictions from the inference model.
    logits = iris.inference(measurements, metadata, args.hidden)

    # Add to the Graph the Ops for loss calculation.
    loss = iris.loss(logits, labels)

    # Add the Op to compute accuracy.
    accuracy_op, eval_op = metric_ops.streaming_accuracy(
        tf.argmax(logits, 1), labels)

    # The global step is useful for summaries.
    with tf.name_scope('train'):
      global_step = tf.Variable(0, name="global_step", trainable=False)

    summary = tf.scalar_summary("accuracy", accuracy_op)
    saver = tf.train.Saver()

  num_eval_batches = float(EVAL_SET_SIZE) / BATCH_SIZE
  summary_writer = tf.train.SummaryWriter(os.path.join(
      output_path, 'eval'))

  write_label_file(os.path.join(output_path, 'eval'), args, trial_id)

  sv = tf.train.Supervisor(graph=g,
                           logdir=os.path.join(output_path, 'eval'),
                           summary_op=summary,
                           summary_writer=summary_writer,
                           global_step=None,
                           saver=saver)
  step = 0
  while step < args.max_steps:
    last_checkpoint = tf.train.latest_checkpoint(os.path.join(
        output_path, 'logdir'))
    with sv.managed_session(master="",
                            start_standard_services=False) as session:
      sv.start_queue_runners(session)
      sv.saver.restore(session, last_checkpoint)
      accuracy = tf_evaluation(session,
                               num_evals=num_eval_batches,
                               eval_op=eval_op,
                               final_op=accuracy_op,
                               summary_op=summary,
                               summary_writer=summary_writer,
                               global_step=global_step)

      step = tf.train.global_step(session, global_step)
      yield accuracy


def tf_evaluation(sess,
                  num_evals=1,
                  init_op=None,
                  init_op_feed_dict=None,
                  eval_op=None,
                  eval_op_feed_dict=None,
                  final_op=None,
                  final_op_feed_dict=None,
                  summary_op=None,
                  summary_op_feed_dict=None,
                  summary_writer=None,
                  global_step=None):
  if init_op is not None:
    sess.run(init_op, init_op_feed_dict)

  if eval_op is not None:
    for i in range(int(num_evals)):
      sess.run(eval_op, eval_op_feed_dict)

  if final_op is not None:
    final_op_value = sess.run(final_op, final_op_feed_dict)
  else:
    final_op_value = None

  if summary_op is not None:
    if global_step is None:
      raise ValueError("must specify global step")

    global_step = tf.train.global_step(sess, global_step)
    summary = sess.run(summary_op, summary_op_feed_dict)
    hptuning_summary = tf.Summary(value=[
      tf.Summary.Value(tag='training/hptuning/metric', simple_value=float(final_op_value))
    ])
    summary_writer.add_summary(summary, global_step)
    summary_writer.add_summary(hptuning_summary, global_step)
    summary_writer.flush()
  
  return final_op_value

if __name__ == "__main__":
  main()
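
The task module decides whether to run locally or as a cluster member by reading the `TF_CONFIG` environment variable. Below is a minimal sketch of that parsing with defaults for the case where the variable is unset. The function name, the example value, and the host addresses are hypothetical; only the keys (`cluster`, `task`, `type`, `index`, `trial`) come from `main()` and `dispatch()` above.

```python
import json
import os

def read_tf_config(environ=None):
    """Return (cluster, task, trial_id) with safe defaults for a local run."""
    environ = os.environ if environ is None else environ
    config = json.loads(environ.get("TF_CONFIG", "{}"))
    cluster = config.get("cluster")   # None means "run locally"
    task = config.get("task") or {}   # empty dict avoids calling .get on None
    trial_id = task.get("trial", "")
    return cluster, task, trial_id

# Hypothetical value as a worker process might see it.
example = {
    "cluster": {
        "master": ["localhost:2222"],
        "worker": ["localhost:2223"],
        "ps": ["localhost:2224"],
    },
    "task": {"type": "worker", "index": 0},
}

local = read_tf_config({})
distributed = read_tf_config({"TF_CONFIG": json.dumps(example)})
```

Passing the environment as a plain dict keeps the function easy to exercise without touching the real process environment.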

### Local Training


Run "%ml train" to generate the training cell template.

In [None]:
%%ml train

Fill in the required fields and run. <br><br>
Datalab simulates the CloudML service by creating master, worker, and ps processes to perform distributed training. In the cloud these run on different VMs; here they all run in the local container VM.<br>
To skip a job type, such as ps, set its replica_count to 0. The master is always required.<br>
The output contains links to each process's output log and refreshes every 3 seconds to show the last few lines of the logs. Use a local run to quickly validate your training program and parameters before submitting the job to the cloud for large-scale training.<br>
If training gets stuck for any reason, click "Reset Session" to reset the kernel. All training processes will be cleaned up. <br>

There are two ways to specify a trainer program. You can set "package_uris" and "python_module" in the input cell to use an existing tarball package. If these are absent, Datalab collects all "%%ml module" cells and builds a temporary tarball package to run. <br>

Since we have already defined our training modules, let's run the training program without explicitly specifying a package. Datalab will create a temporary package and run the entry-point module marked with the "--main" flag. <br>


In [9]:
%%ml train
worker_count: 1
parameter_server_count: 1
args:
  train_data_paths:
    - /content/datalab/tmp/ml/iris/preprocessed/features_train
  eval_data_paths:
    - /content/datalab/tmp/ml/iris/preprocessed/features_eval
  metadata_path: /content/datalab/tmp/ml/iris/preprocessed/metadata.yaml
  output_path: /content/datalab/tmp/ml/iris/model
  max_steps: 2000

Note that after training completes, you can increase "max_steps" and run it again. Training will resume from the previous checkpoint.
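
Resuming works because the global step is restored from the checkpoint and the training loop stops once it reaches `max_steps`. The following is a toy stand-in for that behavior, not the Supervisor's actual mechanism: the JSON checkpoint file and the `train` helper are hypothetical, and a counter increment replaces the real training step.

```python
import json
import os
import tempfile

def train(logdir, max_steps):
    """Resume from the step saved in logdir and run until max_steps."""
    ckpt = os.path.join(logdir, "checkpoint.json")  # hypothetical file name
    step = 0
    if os.path.exists(ckpt):
        with open(ckpt) as f:
            step = json.load(f)["global_step"]
    start = step
    while step < max_steps:
        step += 1  # stands in for one training step
    with open(ckpt, "w") as f:
        json.dump({"global_step": step}, f)
    return start, step

logdir = tempfile.mkdtemp()
first = train(logdir, 2000)    # fresh run: steps 0 to 2000
second = train(logdir, 3000)   # resumes at 2000, continues to 3000
repeat = train(logdir, 3000)   # nothing left to do at the same max_steps
```

In this simplified model, rerunning with an unchanged `max_steps` does no further work, which is why the value has to be raised before the second run.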

Check the output of the training. The "model" dir contains the model files (last checkpoint, graph metadata, etc.). The "summaries" dir contains summary events.

In [3]:
!ls /content/datalab/tmp/ml/iris/model

eval  logdir  model  summaries


You can start TensorBoard to view training results.

In [11]:
%tensorboard start --logdir /content/datalab/tmp/ml/iris/model/

Shut down the TensorBoard server.

In [5]:
%tensorboard stop --pid 142438

Let's train another model, this time with learning_rate set to 0.001. learning_rate is an argument defined by the training program in the package; its default value is 0.01.

Instead of running the modules defined by "%%ml module" cells directly, we will package them first.

In [12]:
%%ml package --out /content/datalab/tmp/ml/iris/ --name trainer

Package created at /content/datalab/tmp/ml/iris/trainer-0.1.tar.gz.
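
To check what went into a package, the standard library `tarfile` module can list the archive members. This is a sketch under assumptions: `list_package` and `make_demo_package` are hypothetical helpers, and the `trainer/` member layout is only a guess based on the `trainer.task` module name used later. The demo builds a stand-in archive so the snippet runs anywhere; in the notebook you would pass the path reported above instead.

```python
import io
import os
import tarfile
import tempfile

def list_package(path):
    """Return the sorted names of the regular files inside a .tar.gz package."""
    with tarfile.open(path, "r:gz") as tar:
        return sorted(m.name for m in tar.getmembers() if m.isfile())

def make_demo_package(directory):
    """Build a stand-in archive; the member layout is an assumption."""
    path = os.path.join(directory, "trainer-0.1.tar.gz")
    with tarfile.open(path, "w:gz") as tar:
        for name in ("trainer/__init__.py", "trainer/iris.py", "trainer/task.py"):
            payload = b"# placeholder\n"
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return path

demo_path = make_demo_package(tempfile.mkdtemp())
members = list_package(demo_path)
```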


Specify the package explicitly with package_uris. Since we specify neither 'parameter_server_count' nor 'worker_count', training uses a single master only.

In [14]:
%%ml train
package_uris: /content/datalab/tmp/ml/iris/trainer-0.1.tar.gz
python_module: trainer.task
args:
  train_data_paths:
    - /content/datalab/tmp/ml/iris/preprocessed/features_train
  eval_data_paths:
    - /content/datalab/tmp/ml/iris/preprocessed/features_eval
  metadata_path: /content/datalab/tmp/ml/iris/preprocessed/metadata.yaml
  output_path: /content/datalab/tmp/ml/iris/model_lr
  max_steps: 3000
  learning_rate: 0.001
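
The entries under `args` correspond to the flags that `main()` in the task module parses with `argparse`. How Datalab hands them to the trainer is not shown here, so the sketch below assumes one `--name value` pair per entry; `to_argv` is a hypothetical helper. It mainly shows that list-valued entries map onto flags declared with `action='append'` and that omitted flags fall back to their defaults.

```python
import argparse

def build_parser():
    """Same flags as main() in the task module."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--train_data_paths", type=str, action="append")
    parser.add_argument("--eval_data_paths", type=str, action="append")
    parser.add_argument("--metadata_path", type=str)
    parser.add_argument("--output_path", type=str)
    parser.add_argument("--learning_rate", type=float, default=0.01)
    parser.add_argument("--hidden", type=int, default=10)
    parser.add_argument("--max_steps", type=int, default=2000)
    return parser

def to_argv(args):
    """Hypothetical helper: flatten a dict of args into command-line flags."""
    argv = []
    for name, value in args.items():
        values = value if isinstance(value, list) else [value]
        for v in values:
            argv += ["--" + name, str(v)]
    return argv

base = "/content/datalab/tmp/ml/iris"
cell_args = {
    "train_data_paths": [base + "/preprocessed/features_train"],
    "eval_data_paths": [base + "/preprocessed/features_eval"],
    "metadata_path": base + "/preprocessed/metadata.yaml",
    "output_path": base + "/model_lr",
    "max_steps": 3000,
    "learning_rate": 0.001,
}
parsed = build_parser().parse_args(to_argv(cell_args))
```

Here `parsed.hidden` keeps its default of 10 because the cell does not set it.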

### Cloud Training

Cloud training is similar, but adds the "--cloud" flag and uses GCS paths instead of local paths. <br>
We will use the preprocessed files created by cloud preprocessing in the previous "Preprocess" notebook.

Define variables that will be used later.

In [1]:
import os

bucket = 'gs://' + datalab_project_id() + '-sampledata'
package_path = os.path.join(bucket, 'iris', 'model', 'trainer-0.1.tar.gz')
train_data_path = os.path.join(bucket, 'iris', 'preprocessed', 'features_train')
eval_data_path = os.path.join(bucket, 'iris', 'preprocessed', 'features_eval')
metadata_path = os.path.join(bucket, 'iris', 'preprocessed', 'metadata.yaml')
output_path = os.path.join(bucket, 'iris', 'trained')
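
`os.path.join` produces valid GCS paths here because Datalab runs in a Linux container, where the path separator is a forward slash. If the same code might run on Windows, `posixpath.join` keeps forward slashes on every platform. A minimal sketch, with a hypothetical helper name and a made-up bucket:

```python
import posixpath

def gcs_join(bucket, *parts):
    """Join GCS path components with forward slashes on every platform."""
    return posixpath.join(bucket, *parts)

demo_bucket = "gs://my-project-sampledata"  # hypothetical bucket name
demo_package_path = gcs_join(demo_bucket, "iris", "model", "trainer-0.1.tar.gz")
demo_output_path = gcs_join(demo_bucket, "iris", "trained")
```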

Copy trainer package to a GCS path.

In [16]:
!gsutil cp /content/datalab/tmp/ml/iris/trainer-0.1.tar.gz $package_path

Copying file:///content/datalab/tmp/ml/iris/trainer-0.1.tar.gz [Content-Type=application/x-tar]...
/ [1 files][  5.6 KiB/  5.6 KiB]                                                
Operation completed over 1 objects/5.6 KiB.                                      


Cloud training requires a few extra fields. It needs an explicit trainer package, so 'package_uris' and 'python_module' are required. 'scale_tier' and 'region' are also required; they specify the training scale and the location.

In [2]:
%%ml train --cloud
package_uris: $package_path
python_module: trainer.task
scale_tier: BASIC
region: us-central1
args:
  train_data_paths:
    - $train_data_path
  eval_data_paths:
    - $eval_data_path
  metadata_path: $metadata_path
  output_path: $output_path

View the job status as described in the output. You can also run "%ml jobs --filter state!=SUCCEEDED" to list every ML job in the project that has not succeeded, which includes the active ones.

In [15]:
%ml jobs --name trainer_task_160921_225339

View the trained model once the state is 'SUCCEEDED':

In [13]:
!gsutil ls $output_path

gs://cloud-ml-test-automated-sampledata/iris/trained/eval/
gs://cloud-ml-test-automated-sampledata/iris/trained/logdir/
gs://cloud-ml-test-automated-sampledata/iris/trained/model/
gs://cloud-ml-test-automated-sampledata/iris/trained/summaries/


TensorBoard accepts GCS paths, so it works with cloud training output too.