## Training

Training requires a tarball Python package that contains your TensorFlow-based training program. Although CloudML provides several general-purpose model trainers, this sample uses a package written specifically to train the Census sample.

### Training Package

You can use an existing tarball package (stored locally or in GCS) or build your own. To define a Python module, use "%%ml module". The following two cells define two Python modules: "census" and "task".

In [1]:
%%ml module --name census

import google.cloud.ml.features as features
import json
import math
import tensorflow as tf

def read_examples(input_files, batch_size, shuffle, num_epochs=None):
  # The minimum number of instances in a queue from which examples are drawn
  # randomly. The larger this number, the more randomness at the expense of
  # higher memory requirements.
  MIN_AFTER_DEQUEUE = 100

  # When batching data, the queue's capacity will be larger than the batch_size
  # by some factor. The recommended formula is (num_threads + a small safety
  # margin). For now, we use a single thread for reading, so this can be small.
  QUEUE_SIZE_MULTIPLIER = 3

  # Convert num_epochs == 0 -> num_epochs is None, if necessary
  num_epochs = num_epochs or None

  # Build a queue of the filenames to be read.
  filename_queue = tf.train.string_input_producer(input_files, num_epochs,
                                                  shuffle)
  options = tf.python_io.TFRecordOptions(
      compression_type=tf.python_io.TFRecordCompressionType.ZLIB)
  example_id, encoded_example = tf.TFRecordReader(options=options).read(
      filename_queue)

  if shuffle:
    capacity = MIN_AFTER_DEQUEUE + QUEUE_SIZE_MULTIPLIER * batch_size
    return tf.train.shuffle_batch([example_id, encoded_example], batch_size,
                                  capacity, MIN_AFTER_DEQUEUE)
  else:
    capacity = QUEUE_SIZE_MULTIPLIER * batch_size
    return tf.train.batch([example_id, encoded_example],
                          batch_size,
                          capacity=capacity)

def create_inputs(metadata, input_data=None):
  with tf.name_scope('inputs'):
    if input_data is None:
      input_data = tf.placeholder(tf.string, name='input', shape=(None,))
    parsed = features.FeatureMetadata.parse_features(metadata, input_data)
    return (input_data, parsed['inputs'], tf.squeeze(parsed['target']),
            tf.identity(parsed['key']))

def _create_layer(inputs, input_size, output_size):
  with tf.name_scope('layer'):
    initial_weights = tf.truncated_normal([input_size, output_size],
                                          stddev = 1.0 / math.sqrt(input_size))
    weights = tf.Variable(initial_weights, name = 'weights')

    initial_biases = tf.zeros([ output_size ])
    biases = tf.Variable(initial_biases, name = 'biases')

    xw = tf.matmul(inputs, weights)

    return tf.nn.bias_add(xw, biases)

def inference(inputs, metadata, hyperparams):
  input_size = metadata.features['inputs']['size']
  output_size = metadata.features['target']['size']

  hidden_layer1 = tf.nn.relu(_create_layer(inputs, input_size,
                                           hyperparams['hidden_layer1_size']))
  hidden_layer2 = tf.nn.relu(_create_layer(hidden_layer1,
                                           hyperparams['hidden_layer1_size'],
                                           hyperparams['hidden_layer2_size']))
  hidden_layer3 = tf.nn.relu(_create_layer(hidden_layer2,
                                           hyperparams['hidden_layer2_size'],
                                           hyperparams['hidden_layer3_size']))
  output = _create_layer(hidden_layer3, hyperparams['hidden_layer3_size'],
                         output_size)
  return output


def loss(output, targets):
  """Calculates the loss from the output and the labels.
  Args:
    output: output layer tensor, float - [batch_size].
    targets: Target value tensor, float - [batch_size].
  Returns:
    loss: Loss tensor of type float.
  """
  loss = tf.reduce_mean(tf.abs(output - targets), name = 'loss')
  return loss

def training(loss, learning_rate):
  with tf.name_scope('train'):
    tf.scalar_summary(loss.op.name, loss)
    global_step = tf.Variable(0, name='global_step', trainable=False)
    optimizer = tf.train.AdagradOptimizer(learning_rate)
    train_op = optimizer.minimize(loss, global_step)
    return train_op, global_step
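
To make the shape of the model easier to follow, here is a minimal NumPy sketch of what `_create_layer`, `inference`, and `loss` compute: three ReLU hidden layers, a linear output layer, and a mean absolute error loss. It is an illustration only, not the TensorFlow code the trainer runs. The batch of 64 examples with 20 features, the output size of 1, and the use of clipping in place of `tf.truncated_normal` are all assumptions made for the example.

```python
import math

import numpy as np


def create_layer(rng, inputs, input_size, output_size):
    """Affine layer inputs.W + b, initialised like `_create_layer` above."""
    stddev = 1.0 / math.sqrt(input_size)
    # tf.truncated_normal re-draws samples beyond two standard deviations;
    # clipping is a simpler stand-in used only for this illustration.
    weights = np.clip(rng.normal(0.0, stddev, (input_size, output_size)),
                      -2 * stddev, 2 * stddev)
    biases = np.zeros(output_size)
    return inputs @ weights + biases


def inference(rng, inputs, input_size, hidden_sizes, output_size):
    """ReLU hidden layers followed by a linear output layer."""
    activations, size = inputs, input_size
    for hidden in hidden_sizes:
        activations = np.maximum(
            create_layer(rng, activations, size, hidden), 0.0)
        size = hidden
    return create_layer(rng, activations, size, output_size)


def loss(output, targets):
    """Mean absolute error, as in `census.loss`."""
    return np.mean(np.abs(output - targets))


rng = np.random.default_rng(0)
batch = rng.normal(size=(64, 20))   # hypothetical: 64 examples, 20 features
targets = rng.normal(size=64)       # hypothetical regression targets
# Squeeze the [64, 1] output to [64] so it lines up with the targets.
output = np.squeeze(inference(rng, batch, 20, [300, 200, 100], 1), axis=1)
print(output.shape, loss(output, targets))
```

The hidden sizes 300, 200, and 100 match the defaults of the `--hidden1`, `--hidden2`, and `--hidden3` flags defined in the "task" module.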

In [2]:
%%ml module --name task --main

import argparse
import itertools
import json
import logging
import os
import random
import subprocess
import sys
import time

import census
import tensorflow as tf
from tensorflow.contrib.metrics.python.ops import metric_ops

import google.cloud.ml.features as features
import google.cloud.ml.util as cloudml_util

EXPORT_SUBDIRECTORY = 'model'
HYPERPARAMS = {
  'batch_size': 64,
  'learning_rate': 0.003,
}
EVAL_SET_SIZE = 2767
EVAL_INTERVAL_SECS = 15


def print_to_console(msg):
  print(msg)
  sys.stdout.flush()


def main():
  config = json.loads(os.environ.get('TF_CONFIG', '{}'))
  cluster = config.get('cluster', None)
  task = config.get('task', None)
  job = config.get('job', None)
  # TF_CONFIG may be unset for a plain local run, leaving task as None.
  trial_id = task.get('trial', '') if task else ''
  logging.info("start trial %s.", trial_id)

  parser = argparse.ArgumentParser()
  parser.add_argument("--train_data_paths", type=str, action='append')
  parser.add_argument("--eval_data_paths", type=str, action='append')
  parser.add_argument("--metadata_path", type=str)
  parser.add_argument("--output_path", type=str)
  parser.add_argument("--max_steps", type=int, default=2000)
  parser.add_argument("--hidden1", type=int, default=300)
  parser.add_argument("--hidden2", type=int, default=200)
  parser.add_argument("--hidden3", type=int, default=100)
  args = parser.parse_args()

  dispatch(args, cluster, task, job, trial_id)


def start_server(cluster, task):
  # Create and start a server.
  return tf.train.Server(cluster,
                         protocol="grpc",
                         job_name=task['type'],
                         task_index=task['index'])


def dispatch(args, cluster, task, job, trial_id):
  if not cluster:
    # Run locally.
    run_training(args, target="", is_chief=True, device_fn="", trial_id=trial_id)
    return

  if task['type'] == "ps":
    server = start_server(cluster, task)
    server.join()
  elif task['type'] == "worker":
    server = start_server(cluster, task)
    is_chief = False
    device_fn = tf.train.replica_device_setter(
        ps_device="/job:ps",
        worker_device="/job:worker/task:%d" % task['index'],
        cluster=cluster)
    run_training(args, server.target, is_chief, device_fn, trial_id)
  elif task['type'] == "master":
    server = start_server(cluster, task)
    is_chief = (task['index'] == 0)
    device_fn = tf.train.replica_device_setter(
        ps_device="/job:ps",
        worker_device="/job:master/task:%d" % task['index'],
        cluster=cluster)
    run_training(args, server.target, is_chief, device_fn, trial_id)
  else:
    raise ValueError("invalid job_type %s" % task['type'])


def run_training(args, target, is_chief, device_fn, trial_id):
  """Train Census for a number of steps."""
  output_path = os.path.join(args.output_path, trial_id)
  # Paths to the preprocessed Census training examples.
  training_data = args.train_data_paths

  if is_chief:
    # A generator over accuracies. Each call to next(accuracies) forces an
    # evaluation of the model.
    accuracies = evaluate(args, trial_id)

  # Tell TensorFlow that the model will be built into the default Graph.
  with tf.Graph().as_default() as graph:
    # Assigns ops to the local worker by default.
    with tf.device(device_fn):

      metadata = features.FeatureMetadata.get_metadata(args.metadata_path)

      _, train_examples = census.read_examples(
          training_data, HYPERPARAMS['batch_size'], shuffle=False)

      # Parse the serialized training examples into inputs and targets.
      placeholder, inputs, targets, _ = (
          census.create_inputs(metadata, input_data=train_examples))

      # Build a Graph that computes predictions from the inference model.
      layer_sizes = {
        'hidden_layer1_size': args.hidden1,
        'hidden_layer2_size': args.hidden2,
        'hidden_layer3_size': args.hidden3,
      }
      output = census.inference(inputs, metadata, layer_sizes)

      # Add to the Graph the Ops for loss calculation.
      loss = census.loss(output, targets)

      # Add to the Graph the Ops that calculate and apply gradients.
      train_op, global_step = census.training(loss,
                                              HYPERPARAMS['learning_rate'])

      # Build the summary operation based on the TF collection of Summaries.
      summary_op = tf.merge_all_summaries()

      # Add the variable initializer Op.
      init_op = tf.initialize_all_variables()

      # Create a saver for writing training checkpoints.
      saver = tf.train.Saver()

      # Instantiate a SummaryWriter to output summaries and the Graph.
      summary_writer = tf.train.SummaryWriter(os.path.join(
          output_path, 'summaries'), graph)

      # Create a "supervisor", which oversees the training process.
      sv = tf.train.Supervisor(is_chief=is_chief,
                               logdir=os.path.join(output_path, 'logdir'),
                               init_op=init_op,
                               saver=saver,
                               summary_op=None,
                               global_step=global_step,
                               save_model_secs=60)

      # The supervisor takes care of session initialization, restoring from
      # a checkpoint, and closing when done or an error occurs.
      print_to_console("Starting the loop.")
      with sv.managed_session(target) as sess:
        start_time = time.time()
        last_save = start_time

        # Loop until the supervisor shuts down or max_steps have completed.
        step = 0
        while not sv.should_stop() and step < args.max_steps:
          start_time = time.time()

          # Run one step of the model. The fetched values are the result of
          # `train_op` (discarded), the updated `global_step`, and the `loss`
          # value. To inspect other Ops or variables, add them to the list
          # passed to sess.run(); their values are returned in the same order.
          _, step, loss_value = sess.run([train_op, global_step, loss])

          duration = time.time() - start_time
          if is_chief and time.time() - last_save > EVAL_INTERVAL_SECS:
            last_save = time.time()
            saver.save(sess, sv.save_path, global_step)
            accuracy = next(accuracies)
            logging.info("Eval, step %d: error = %0.3f", step, accuracy)
            print_to_console("Eval, step %d: error = %0.3f" % (step, accuracy))

          # Write the summaries and log an overview fairly often.
          if step % 200 == 0 and is_chief:
            logging.info("Step %d: loss = %.2f (%.3f sec)",
                         step, loss_value, duration)
            print_to_console("Step %d: loss = %.2f (%.3f sec)" % (step, loss_value, duration))

            # Update the events file.
            summary_str = sess.run(summary_op)
            summary_writer.add_summary(summary_str, step)
            summary_writer.flush()

        if is_chief:
          # Force a save at the end of our loop.
          sv.saver.save(sess, sv.save_path, global_step=global_step,
                        write_meta_graph=False)
          accuracy_value = next(accuracies)
          logging.info("Final error after %d steps = %0.3f", step, accuracy_value)
          print_to_console("Final error after %d steps = %0.3f" % (step, accuracy_value))

          # Save the model for inference
          export_model(args, sess, sv.saver, trial_id)

      # Ask for all the services to stop.
      sv.stop()
      print_to_console("Done training.")


def export_model(args, sess, training_saver, trial_id):
  output_path = os.path.join(args.output_path, trial_id)
  with tf.Graph().as_default() as inference_graph:
    metadata = features.FeatureMetadata.get_metadata(args.metadata_path)
    placeholder, inputs, _, keys = census.create_inputs(metadata)
    layer_sizes = {
      'hidden_layer1_size': args.hidden1,
      'hidden_layer2_size': args.hidden2,
      'hidden_layer3_size': args.hidden3,
    }
    output = census.inference(inputs, metadata, layer_sizes)

    inference_saver = tf.train.Saver()

    # Mark the inputs and the outputs
    tf.add_to_collection("inputs",
                         json.dumps({"examples": placeholder.name}))
    tf.add_to_collection("outputs",
                         json.dumps({"score": output.name}))
    tf.add_to_collection("keys", json.dumps({"key": keys.name}))

    model_dir = os.path.join(output_path, EXPORT_SUBDIRECTORY)

    # Save a copy of the metadata file used for this model with the exported
    # model, so that online and batch prediction can use it.
    subprocess.check_call(['gsutil', 'cp', args.metadata_path,
                           os.path.join(model_dir, "metadata.yaml")])
    
    inference_saver.export_meta_graph(
        filename=os.path.join(model_dir, "export.meta"))

    # Save the variables. Don't write the MetaGraphDef, because that is
    # actually the training graph.
    training_saver.save(sess,
                        os.path.join(model_dir, "export"),
                        write_meta_graph=False)


def evaluate(args, trial_id):
  """Generator that runs one evaluation round per next() call, yielding the error."""
  output_path = os.path.join(args.output_path, trial_id)
  eval_data = args.eval_data_paths

  with tf.Graph().as_default() as g:
    metadata = features.FeatureMetadata.get_metadata(args.metadata_path)

    _, examples = census.read_examples(
        eval_data, HYPERPARAMS['batch_size'],
        shuffle=False)

    # Parse the serialized eval examples into inputs and targets.
    placeholder, inputs, targets, _ = (
        census.create_inputs(metadata, input_data=examples))

    # Build a Graph that computes predictions from the inference model.
    layer_sizes = {
      'hidden_layer1_size': args.hidden1,
      'hidden_layer2_size': args.hidden2,
      'hidden_layer3_size': args.hidden3,
    }
    output = census.inference(inputs, metadata, layer_sizes)

    # Add to the Graph the Ops for loss calculation.
    loss = census.loss(output, targets)

    # Add the Op that computes the streaming mean relative error.
    error, eval_op = metric_ops.streaming_mean_relative_error(
        output, targets, tf.ones(HYPERPARAMS['batch_size']))

    # The global step is useful for summaries.
    with tf.name_scope('train'):
      global_step = tf.Variable(0, name="global_step", trainable=False)

    summary = tf.scalar_summary("error", error)
    saver = tf.train.Saver()

  num_eval_batches = EVAL_SET_SIZE // HYPERPARAMS['batch_size']
  summary_writer = tf.train.SummaryWriter(os.path.join(
      output_path, 'eval'))

  sv = tf.train.Supervisor(graph=g,
                           logdir=os.path.join(output_path, 'eval'),
                           summary_op=summary,
                           summary_writer=summary_writer,
                           global_step=None,
                           saver=saver)

  step = 0
  while step < args.max_steps:
    last_checkpoint = tf.train.latest_checkpoint(os.path.join(
        output_path, 'logdir'))
    with sv.managed_session(master="",
                            start_standard_services=False) as session:
      sv.start_queue_runners(session)
      sv.saver.restore(session, last_checkpoint)
      accuracy = tf_evaluation(session,
                               max_num_evals=num_eval_batches,
                               eval_op=eval_op,
                               final_op=error,
                               summary_op=summary,
                               summary_writer=summary_writer,
                               global_step=global_step)

      step = tf.train.global_step(session, global_step)
      yield accuracy


def tf_evaluation(sess,
                  max_num_evals=1000,
                  eval_op=None,
                  final_op=None,
                  summary_op=None,
                  summary_writer=None,
                  global_step=None):
  if eval_op is not None:
    try:
      for i in range(int(max_num_evals)):
        (_, final_op_value) = sess.run((eval_op, final_op))
    except tf.errors.OutOfRangeError:
      # We've hit the end of our epoch.  Unfortunately, if we hit this
      # tensorflow has already logged a warning to stderr, so we try to avoid
      # hitting it in this sample.
      pass

  if summary_op is not None:
    if global_step is None:
      raise ValueError("must specify global step")

    global_step = tf.train.global_step(sess, global_step)
    summary = sess.run(summary_op)
    hptuning_summary = tf.Summary(value=[
      tf.Summary.Value(tag='training/hptuning/metric', simple_value=float(final_op_value))
    ])
    summary_writer.add_summary(summary, global_step)
    summary_writer.add_summary(hptuning_summary, global_step)
    summary_writer.flush()

  return final_op_value

if __name__ == "__main__":
  main()
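
The evaluation path above relies on a streaming metric: each `eval_op` run folds one batch into a running total, and `final_op` reads the mean so far. The following NumPy sketch illustrates that idea under the assumption that the metric is a running mean of `|label - prediction| / normalizer`, with zero-normalizer entries counted as zero error. The class name and its methods are hypothetical and are not part of TensorFlow.

```python
import numpy as np


class StreamingMeanRelativeError(object):
    """Running mean of |label - prediction| / normalizer across batches."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, predictions, labels, normalizer):
        predictions = np.asarray(predictions, dtype=float)
        labels = np.asarray(labels, dtype=float)
        normalizer = np.asarray(normalizer, dtype=float)
        # Avoid dividing by zero; those entries contribute zero error.
        safe = np.where(normalizer == 0, 1.0, normalizer)
        errors = np.where(normalizer == 0, 0.0,
                          np.abs(labels - predictions) / safe)
        self.total += errors.sum()
        self.count += errors.size
        return self.result()

    def result(self):
        return self.total / self.count if self.count else 0.0


metric = StreamingMeanRelativeError()
metric.update([1.0, 2.0], [2.0, 2.0], np.ones(2))  # batch errors: 1, 0
metric.update([0.0, 5.0], [3.0, 5.0], np.ones(2))  # batch errors: 3, 0
print(metric.result())

# Number of full batches that one evaluation round consumes.
EVAL_SET_SIZE = 2767
BATCH_SIZE = 64
num_eval_batches = EVAL_SET_SIZE // BATCH_SIZE
leftover = EVAL_SET_SIZE - num_eval_batches * BATCH_SIZE
print(num_eval_batches, leftover)
```

Under this assumption, a normalizer of all ones (as passed in `evaluate`) makes the relative error the same as the mean absolute error. The batch arithmetic also shows that only full batches are evaluated, so a few examples at the end of the eval set are left out of each round.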

### Local Training

Run "%ml train" to generate the training cell template.

In [None]:
%%ml train

Fill in the required fields and run the cell. <br>
Datalab simulates the CloudML service by creating master, worker, and ps processes to perform distributed training. In the cloud these run on different VMs; here they all run in the local container VM.<br>
You can set replica_count to 0 to disable a job type, such as ps, but master is required. In this case, we only enable master.<br>
The training output contains links to each process's output log and refreshes every 3 seconds to show the last few lines of the logs. Use a local run to quickly validate your training program and parameters before submitting it to the cloud for large-scale training.<br>
If training gets stuck for any reason, click "Reset Session" to reset the kernel. All training processes will be cleaned up.<br><br>
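
Each of these processes learns its role from the `TF_CONFIG` environment variable, which the "task" module reads in `main` and acts on in `dispatch`. The sketch below mirrors that branching in plain Python so it can run without TensorFlow. The `describe_role` helper and the example cluster addresses are hypothetical; only the `cluster` and `task` keys and the `type` and `index` fields are taken from the code above.

```python
import json


def describe_role(tf_config_json):
    """Summarise what a process would do for a given TF_CONFIG value."""
    config = json.loads(tf_config_json or '{}')
    cluster = config.get('cluster')
    task = config.get('task') or {}
    if not cluster:
        return 'local: single process, acts as chief'
    task_type = task.get('type')
    index = task.get('index', 0)
    if task_type == 'ps':
        return 'ps %d: hosts variables and blocks in server.join()' % index
    if task_type == 'worker':
        return 'worker %d: trains, never chief' % index
    if task_type == 'master':
        return 'master %d: trains, chief=%s' % (index, index == 0)
    raise ValueError('invalid job_type %s' % task_type)


# Hypothetical TF_CONFIG for a worker in a three-process local cluster.
example = json.dumps({
    'cluster': {
        'master': ['localhost:2222'],
        'worker': ['localhost:2223'],
        'ps': ['localhost:2224'],
    },
    'task': {'type': 'worker', 'index': 0},
})
print(describe_role(example))
print(describe_role(''))
```

Only the master with index 0 acts as chief, which is why evaluation, checkpointing, and model export all happen in that process.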

There are two ways to specify a trainer program. You can set "package_uris" and "python_module" in the input cell to use an existing tarball package. If these are absent, Datalab looks for all "%%ml module" cells and creates a temporary tarball package to run. <br>

Since we already defined our training modules, let's run the training program without explicitly specifying a package. Datalab will create a temporary package and run the entry-point module marked with the "--main" flag. <br>

In [3]:
%ml train
worker_count: 1
parameter_server_count: 1
args:
  train_data_paths:
    - /content/datalab/tmp/ml/census/preprocessed/features_train
  eval_data_paths:
    - /content/datalab/tmp/ml/census/preprocessed/features_eval
  metadata_path: /content/datalab/tmp/ml/census/preprocessed/metadata.yaml
  output_path: /content/datalab/tmp/ml/census/model
  hidden1: 100
  hidden2: 60
  hidden3: 30
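
The keys under `args` correspond to the command-line flags that `main` in the "task" module declares. As a rough sketch, and assuming each key is passed to the trainer as a `--key value` pair with list entries repeated (the `to_argv` helper is hypothetical, not Datalab's actual implementation), the parser would see the cell above like this:

```python
import argparse


def build_parser():
    """Same flags as the parser in the "task" module."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--train_data_paths", type=str, action='append')
    parser.add_argument("--eval_data_paths", type=str, action='append')
    parser.add_argument("--metadata_path", type=str)
    parser.add_argument("--output_path", type=str)
    parser.add_argument("--max_steps", type=int, default=2000)
    parser.add_argument("--hidden1", type=int, default=300)
    parser.add_argument("--hidden2", type=int, default=200)
    parser.add_argument("--hidden3", type=int, default=100)
    return parser


def to_argv(args):
    """Hypothetical conversion of the cell's `args` mapping into flags."""
    argv = []
    for key, value in sorted(args.items()):
        values = value if isinstance(value, list) else [value]
        for item in values:
            argv.extend(['--' + key, str(item)])
    return argv


cell_args = {
    'train_data_paths': [
        '/content/datalab/tmp/ml/census/preprocessed/features_train'],
    'eval_data_paths': [
        '/content/datalab/tmp/ml/census/preprocessed/features_eval'],
    'metadata_path': '/content/datalab/tmp/ml/census/preprocessed/metadata.yaml',
    'output_path': '/content/datalab/tmp/ml/census/model',
    'hidden1': 100,
    'hidden2': 60,
    'hidden3': 30,
}
parsed = build_parser().parse_args(to_argv(cell_args))
print(parsed.train_data_paths, parsed.hidden1, parsed.max_steps)
```

Because the data path flags use `action='append'`, they always arrive as lists, and any flag left out of the cell (here `max_steps`) falls back to its default.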

Check the training output.

In [7]:
!ls /content/datalab/tmp/ml/census/model

eval  logdir  model  summaries


You can start TensorBoard to view training results.

In [4]:
%tensorboard start --logdir /content/datalab/tmp/ml/census/model

Shut down the TensorBoard server when you are done with it.

In [13]:
%tensorboard stop --pid 115536

Let's train another model with larger hidden layer sizes.
Instead of running the modules defined by "%%ml module" cells directly, we will package the modules first.

In [3]:
%%ml package --out /content/datalab/tmp/ml/census/ --name trainer

Package created at /content/datalab/tmp/ml/census/trainer-0.1.tar.gz.


Specify the package explicitly with package_uris. Since we don't specify 'parameter_server_count' or 'worker_count', training uses a single master.

In [4]:
%ml train
package_uris: /content/datalab/tmp/ml/census/trainer-0.1.tar.gz
python_module: trainer.task
args:
  train_data_paths:
    - /content/datalab/tmp/ml/census/preprocessed/features_train
  eval_data_paths:
    - /content/datalab/tmp/ml/census/preprocessed/features_eval
  metadata_path: /content/datalab/tmp/ml/census/preprocessed/metadata.yaml
  output_path: /content/datalab/tmp/ml/census/largermodel
  hidden1: 200
  hidden2: 100
  hidden3: 50
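
To get a feel for how much larger this model is than the first one, the sketch below counts the weights and biases in a network shaped like `census.inference`. The input size of 100 and output size of 1 are hypothetical placeholders; the real values come from the metadata file produced by preprocessing.

```python
def count_parameters(input_size, hidden_sizes, output_size):
    """Total weights plus biases for a stack of fully connected layers."""
    sizes = [input_size] + list(hidden_sizes) + [output_size]
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(sizes[:-1], sizes[1:]))


INPUT_SIZE = 100   # hypothetical; read from metadata in the real trainer
OUTPUT_SIZE = 1    # hypothetical; read from metadata in the real trainer

small = count_parameters(INPUT_SIZE, [100, 60, 30], OUTPUT_SIZE)
large = count_parameters(INPUT_SIZE, [200, 100, 50], OUTPUT_SIZE)
print(small, large, float(large) / small)
```

With these placeholder sizes, doubling the first hidden layer and widening the others more than doubles the number of trainable parameters.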

### Cloud Training

Cloud training is similar, but adds the "--cloud" flag and uses GCS paths instead of local paths. <br>
You also need a project that is whitelisted for CloudML; use "%projects set project-id" to select it.

Define variables that will be used later.

In [5]:
import os

bucket = 'gs://' + datalab_project_id() + '-sampledata'
package_path = os.path.join(bucket, 'census', 'model', 'trainer-0.1.tar.gz')
train_data_path = os.path.join(bucket, 'census', 'preprocessed', 'features_train')
eval_data_path = os.path.join(bucket, 'census', 'preprocessed', 'features_eval')
metadata_path = os.path.join(bucket, 'census', 'preprocessed', 'metadata.yaml')
output_path = os.path.join(bucket, 'census', 'trained')
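
As a quick check of what these variables expand to, the following sketch builds the same paths for a made-up project ID. The `gcs_paths` helper and the `my-project` ID are hypothetical, and `posixpath.join` is used here so the result always has forward slashes regardless of the operating system.

```python
import posixpath


def gcs_paths(project_id):
    """Build the GCS locations used by the cloud training cell."""
    bucket = 'gs://' + project_id + '-sampledata'
    return {
        'package_path': posixpath.join(
            bucket, 'census', 'model', 'trainer-0.1.tar.gz'),
        'train_data_path': posixpath.join(
            bucket, 'census', 'preprocessed', 'features_train'),
        'eval_data_path': posixpath.join(
            bucket, 'census', 'preprocessed', 'features_eval'),
        'metadata_path': posixpath.join(
            bucket, 'census', 'preprocessed', 'metadata.yaml'),
        'output_path': posixpath.join(bucket, 'census', 'trained'),
    }


paths = gcs_paths('my-project')  # hypothetical project ID
for name in sorted(paths):
    print(name, paths[name])
```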

In [6]:
!gsutil cp /content/datalab/tmp/ml/census/trainer-0.1.tar.gz $package_path

Copying file:///content/datalab/tmp/ml/census/trainer-0.1.tar.gz [Content-Type=application/x-tar]...
/ [1 files][  5.4 KiB/  5.4 KiB]                                                
Operation completed over 1 objects/5.4 KiB.                                      


Start training using the Cloud Dataflow output from the "2. Preprocessing" notebook. We choose one set of hidden layer sizes here; later we will show how to sweep hyperparameter values with the CloudML hyperparameter tuning feature.

In [9]:
%ml train --cloud
package_uris: $package_path
python_module: trainer.task
scale_tier: BASIC
region: us-west1
args:
  train_data_paths:
    - $train_data_path
  eval_data_paths:
    - $eval_data_path
  metadata_path: $metadata_path
  output_path: $output_path
  hidden1: 200
  hidden2: 100
  hidden3: 50

View the job status as described in the output. You can also run "%ml jobs --filter state!=SUCCEEDED" to list all ML jobs in that project that have not succeeded, which includes the active ones.

In [19]:
%ml jobs --name trainer_task_160922_012016

View the trained model:

In [18]:
!gsutil ls gs://cloud-ml-test-automated-sampledata/census/trained

gs://cloud-ml-test-automated-sampledata/census/trained/eval/
gs://cloud-ml-test-automated-sampledata/census/trained/logdir/
gs://cloud-ml-test-automated-sampledata/census/trained/model/
gs://cloud-ml-test-automated-sampledata/census/trained/summaries/
