## Training

Training requires a Python package, distributed as a tarball, that contains your TensorFlow-based training program. In this example, we will use <a href="http://tflearn.org/">tf.learn</a> to implement the model.

### Training Package

You can use an existing tarball package (stored locally or in GCS) or build your own. To define a Python module in the notebook, use "%%mlalpha module". In the following two cells, we will define two Python modules: "iris" and "task".

In [1]:
%%mlalpha module --name iris

import multiprocessing
import tensorflow as tf
from tensorflow.python.lib.io import file_io
import google.cloud.ml.features as features

def read_examples(input_files, batch_size, shuffle, num_epochs=None):
  """Creates readers and queues for reading example protos."""
  files = []
  for e in input_files:
    for path in e.split(','):
      files.extend(file_io.get_matching_files(path))
  thread_count = multiprocessing.cpu_count()

  # The minimum number of instances in a queue from which examples are drawn
  # randomly. The larger this number, the more randomness at the expense of
  # higher memory requirements.
  min_after_dequeue = 1000

  # When batching data, the queue's capacity will be larger than the batch_size
  # by some factor. The recommended formula is (num_threads + a small safety
  # margin), so the multiplier scales with the number of reader threads.
  queue_size_multiplier = thread_count + 3

  # Convert num_epochs == 0 -> num_epochs is None, if necessary
  num_epochs = num_epochs or None

  # Build a queue of the filenames to be read.
  filename_queue = tf.train.string_input_producer(files, num_epochs, shuffle)

  options = tf.python_io.TFRecordOptions(
      compression_type=tf.python_io.TFRecordCompressionType.GZIP)
  example_id, encoded_example = tf.TFRecordReader(options=options).read_up_to(
      filename_queue, batch_size)

  if shuffle:
    capacity = min_after_dequeue + queue_size_multiplier * batch_size
    return tf.train.shuffle_batch(
        [example_id, encoded_example],
        batch_size,
        capacity,
        min_after_dequeue,
        enqueue_many=True,
        num_threads=thread_count)

  else:
    capacity = queue_size_multiplier * batch_size
    return tf.train.batch(
        [example_id, encoded_example],
        batch_size,
        capacity=capacity,
        enqueue_many=True,
        num_threads=thread_count)
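
As a rough illustration of the path handling and queue sizing in `read_examples`, the sketch below mirrors the same arithmetic in plain Python. It is not part of the trainer package: `expand_paths` and `shuffle_queue_capacity` are hypothetical helper names, and the standard-library `glob` stands in for `file_io.get_matching_files`, so it only handles local paths, not GCS.

```python
import glob
import multiprocessing


def expand_paths(input_files):
    """Split comma-separated entries and expand each glob pattern (local paths only)."""
    files = []
    for entry in input_files:
        for pattern in entry.split(','):
            files.extend(sorted(glob.glob(pattern)))
    return files


def shuffle_queue_capacity(batch_size, thread_count=None, min_after_dequeue=1000):
    """Shuffle-queue capacity: min_after_dequeue + (threads + 3) * batch_size."""
    if thread_count is None:
        thread_count = multiprocessing.cpu_count()
    return min_after_dequeue + (thread_count + 3) * batch_size


# With 4 reader threads and a batch size of 30, the queue holds up to
# 1000 + (4 + 3) * 30 = 1210 examples.
print(shuffle_queue_capacity(batch_size=30, thread_count=4))
```

A larger `min_after_dequeue` gives better shuffling at the cost of memory, which is the trade-off described in the comments of `read_examples`.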


Define the "task" module. "--main" indicates that this module is the entry point of the program.

In [2]:
%%mlalpha module --name task --main

# Copyright 2016 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Example implementation of code to run on the Cloud ML service.
"""

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import argparse
import json
import os
import re
import subprocess
import tempfile

from . import iris
import tensorflow as tf

from tensorflow.contrib import metrics as metrics_lib
from tensorflow.contrib.learn.python.learn import learn_runner
from tensorflow.contrib.session_bundle import gc
from tensorflow.contrib.session_bundle import manifest_pb2

import google.cloud.ml as ml

NUM_CLASSES = 3

# This determines a single column that is used to obtain features
# after parsing tf.Examples.
FEATURES_KEY = 'measurements'

# The following keys determine columns from the parsed Examples to be included
# in the output.
TARGET_KEY = 'species'
KEY_ = 'key'

# This is used to map to unparsed tf.Examples so we can output them.
EXAMPLES_KEY = '_EXAMPLES_'


def _get_input_fn(args, mode):
  """Input function used by the experiment."""

  def input_fn():
    """Estimator `input_fn`.

    Returns:
      A tuple of:
      - Dictionary of string feature name to `Tensor`.
      - `Tensor` of target labels.
    """
    metadata = ml.features.FeatureMetadata.get_metadata(args.metadata_path)

    if mode == tf.contrib.learn.ModeKeys.INFER:
      # Generate placeholders for the examples.
      examples = tf.placeholder(
          dtype=tf.string,
          shape=(None,),
          name='input_example')

      parsed = ml.features.FeatureMetadata.parse_features(metadata, examples)
      parsed[EXAMPLES_KEY] = examples
      parsed[TARGET_KEY] = tf.ones_like(parsed[TARGET_KEY]) * (-1)
      return parsed, None

    if mode == tf.contrib.learn.ModeKeys.TRAIN:
      _, examples = iris.read_examples(
          args.train_data_paths, args.batch_size, shuffle=True)
    else:
      _, examples = iris.read_examples(
          args.eval_data_paths, args.eval_batch_size, shuffle=False)

    parsed = ml.features.FeatureMetadata.parse_features(metadata, examples)

    s = parsed.pop(TARGET_KEY)
    return parsed, s
  return input_fn


def _generic_signature_fn(examples, features, predictions):
  """Create a generic signature function with input and output signatures."""
  # Mark the outputs.
  predicted_index = tf.argmax(predictions, 1)
  predicted_label = tf.contrib.lookup.index_to_string(predicted_index,
                                                      mapping=["setosa", "versicolor", "virginica"],
                                                      default_value="UNKNOWN")
  outputs = {'scores': predictions.name,
             'key': tf.squeeze(features[KEY_]).name,
             'predicted_index': predicted_index.name,
             'predicted_label': predicted_label.name}

  inputs = {'examples': features[EXAMPLES_KEY].name}

  tf.add_to_collection('outputs', json.dumps(outputs))
  tf.add_to_collection('inputs', json.dumps(inputs))

  input_signature = manifest_pb2.Signature()
  output_signature = manifest_pb2.Signature()

  for name, tensor_name in outputs.items():
    output_signature.generic_signature.map[name].tensor_name = tensor_name

  for name, tensor_name in inputs.items():
    input_signature.generic_signature.map[name].tensor_name = tensor_name

  return None, {'inputs': input_signature,
                'outputs': output_signature}


def _get_export_monitor(args, output_dir):
  """Create an export monitor."""
  export_input_fn = _get_input_fn(args, tf.contrib.learn.ModeKeys.INFER)

  class ExportWithMetadataMonitor(tf.contrib.learn.monitors.ExportMonitor):

    def every_n_step_end(self, step, outputs):
      # Don't export every n steps.
      pass

    def end(self, session=None):
      super(ExportWithMetadataMonitor, self).end(session=session)
      # Save a copy of the metadata file used for this model with the exported
      # model, so that online and batch prediction can use it.

      subprocess.check_call([
          'gsutil', '-q', 'cp', args.metadata_path,
          os.path.join(output_dir, 'model', 'metadata.yaml')])
      # Copy the final model to the model/ directory, so that we can find it.
      def parser(path):
        match = re.match('^' + self.export_dir + '/(\\d+)$', path.path)
        if not match:
          return None
        return path._replace(export_version=int(match.group(1)))

      dir_list = gc.get_paths(self.export_dir, parser=parser)
      subprocess.check_call(['gsutil', '-q', 'cp', '-r',
                             dir_list[-1].path + '/*',
                             os.path.join(output_dir, 'model')])

  local_output_dir = tempfile.mkdtemp()
  return ExportWithMetadataMonitor(
      input_fn=export_input_fn,
      input_feature_key=EXAMPLES_KEY,
      every_n_steps=400,
      export_dir=os.path.join(local_output_dir, 'export'),
      signature_fn=_generic_signature_fn)


def _get_experiment_fn(args):
  """Create the experiment function."""

  def _experiment_fn(output_dir):
    """Experiment function used by learn_runner to run training/eval/etc.

    Args:
      output_dir: String path of directory to use for outputs (model
        checkpoints, summaries, etc).

    Returns:
      tf.learn `Experiment`.
    """
    config = tf.contrib.learn.RunConfig()
    # Write checkpoints more often for more granular evals, since the toy data
    # set is so small and simple. Most normal use cases should not set this and
    # just use the default (600).
    config.save_checkpoints_secs = 120

    # Specify that all features have real-valued data
    feature_columns = [tf.contrib.layers.real_valued_column(FEATURES_KEY,
                                                            dimension=4)]

    train_dir = os.path.join(output_dir, 'train')

    classifier = tf.contrib.learn.DNNClassifier(
        feature_columns=feature_columns,
        hidden_units=[args.hidden],
        n_classes=NUM_CLASSES,
        config=config,
        model_dir=train_dir,
        optimizer=tf.train.AdamOptimizer(
            args.learning_rate, epsilon=args.epsilon))

    train_monitors = [_get_export_monitor(args, output_dir)]

    streaming_accuracy = metrics_lib.streaming_accuracy
    return tf.contrib.learn.Experiment(
        estimator=classifier,
        train_input_fn=_get_input_fn(args, tf.contrib.learn.ModeKeys.TRAIN),
        eval_input_fn=_get_input_fn(args, tf.contrib.learn.ModeKeys.EVAL),
        train_steps=args.max_steps,
        train_monitors=train_monitors,
        eval_metrics={
            ('accuracy', 'classes'): streaming_accuracy,
            ('training/hptuning/metric', 'classes'): streaming_accuracy
        })
  return _experiment_fn


def parse_arguments():
  parser = argparse.ArgumentParser()
  parser.add_argument('--train_data_paths', type=str, action='append')
  parser.add_argument('--eval_data_paths', type=str, action='append')
  parser.add_argument('--metadata_path', type=str)
  parser.add_argument('--output_path', type=str)
  parser.add_argument('--max_steps', type=int, default=5000)
  parser.add_argument('--hidden', type=int, default=20)
  parser.add_argument('--learning_rate', type=float, default=0.01)
  parser.add_argument('--epsilon', type=float, default=0.0005)
  parser.add_argument('--batch_size', type=int, default=30)
  parser.add_argument('--eval_batch_size', type=int, default=30)
  return parser.parse_args()


def main():
  env = json.loads(os.environ.get('TF_CONFIG', '{}'))
  # First find out if there's a task value on the environment variable.
  # If there is none or it is empty define a default one.
  task_data = env.get('task', None) or {'type': 'master', 'index': 0}

  args = parse_arguments()


  trial = task_data.get('trial')
  if trial is not None:
    output_dir = os.path.join(args.output_path, trial)
  else:
    output_dir = args.output_path

  learn_runner.run(
      experiment_fn=_get_experiment_fn(args),
      output_dir=output_dir)


if __name__ == '__main__':
  tf.logging.set_verbosity(tf.logging.INFO)
  main()
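
To make the output directory logic in `main()` easier to follow, here is a minimal standalone sketch of the same steps. `resolve_output_dir` is a hypothetical helper name, and the sketch assumes that the `trial` value, when present in `TF_CONFIG`, is a string.

```python
import json
import os


def resolve_output_dir(output_path, tf_config_json):
    """Return the output directory, adding a per-trial subdirectory if needed."""
    env = json.loads(tf_config_json or '{}')
    # Fall back to a single master task when no task is described.
    task_data = env.get('task') or {'type': 'master', 'index': 0}
    trial = task_data.get('trial')
    if trial is not None:
        return os.path.join(output_path, trial)
    return output_path


# No TF_CONFIG: the output path is used unchanged.
print(resolve_output_dir('/tmp/iris/model', ''))

# A trial id in the task description (assumed to be a string) adds a subdirectory.
print(resolve_output_dir(
    '/tmp/iris/model',
    '{"task": {"type": "master", "index": 0, "trial": "3"}}'))
```

Writing each trial to its own subdirectory keeps the checkpoints of different trials from overwriting each other.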

Start training. Note that we specify only the program args in the cell. This is equivalent to starting a single role (master) for training. If you want to simulate distributed training, you can specify parameter_server_count and worker_count, and Datalab will start a separate process for each role (master, worker, ps) to simulate cloud training.

In [9]:
%%mlalpha train
args:
  train_data_paths: /content/datalab/tmp/ml/iris/preprocessed/features_train.tfrecord.gz
  eval_data_paths: /content/datalab/tmp/ml/iris/preprocessed/features_eval.tfrecord.gz
  metadata_path: /content/datalab/tmp/ml/iris/preprocessed/metadata.yaml
  output_path: /content/datalab/tmp/ml/iris/model
  max_steps: 2000
  hidden: 20
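
The entries under `args` correspond to the flags declared in `parse_arguments()` in the "task" module. The sketch below shows how a subset of those flags parses, assuming each entry reaches the program as a `--name value` pair (the exact conversion done by the magic is an assumption here). `parse_trainer_args` is a hypothetical name used only for this illustration.

```python
import argparse


def parse_trainer_args(argv):
    """Parse a subset of the trainer flags from an explicit argument list."""
    parser = argparse.ArgumentParser()
    # action='append' collects repeated flags into a list.
    parser.add_argument('--train_data_paths', type=str, action='append')
    parser.add_argument('--max_steps', type=int, default=5000)
    parser.add_argument('--hidden', type=int, default=20)
    parser.add_argument('--learning_rate', type=float, default=0.01)
    return parser.parse_args(argv)


args = parse_trainer_args([
    '--train_data_paths', 'a.tfrecord.gz',
    '--train_data_paths', 'b.tfrecord.gz,c.tfrecord.gz',
    '--max_steps', '2000',
])
print(args.train_data_paths)  # ['a.tfrecord.gz', 'b.tfrecord.gz,c.tfrecord.gz']
print(args.max_steps, args.hidden, args.learning_rate)  # 2000 20 0.01
```

Flags that are left out fall back to their defaults, and comma-separated entries are split later by `read_examples` in the "iris" module.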

Note that after training completes, you can increase "max_steps" and run the cell again. Training will resume from the previous checkpoint.

Check the output of the training. The "model" subdirectory contains the exported model, and the "train" directory contains the TensorFlow event files.

In [12]:
!ls /content/datalab/tmp/ml/iris/model/

model  train


You can start TensorBoard to view training results. Check the "loss" event.

In [39]:
%tensorboard start --logdir /content/datalab/tmp/ml/iris/model/train

Shut down the TensorBoard server.

In [42]:
%tensorboard stop --pid 11606

Let's train another model for fun, this time with learning_rate set to 0.001. learning_rate is an argument defined by the training program in the package; its default value is 0.01.

Instead of running the modules defined by "%%mlalpha module" directly, we will package the modules first.

In [3]:
%%mlalpha package --out /content/datalab/tmp/ml/iris/ --name trainer

Package created at /content/datalab/tmp/ml/iris/trainer-0.1.tar.gz.
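
If you want to check what went into a package before training with it, one option is to list the archive members. This is a small sketch using only the standard-library `tarfile` module; `list_package_members` is a hypothetical helper and is not part of Datalab.

```python
import tarfile


def list_package_members(package_path):
    """Return the sorted member names of a .tar.gz package."""
    with tarfile.open(package_path, 'r:gz') as tar:
        return sorted(tar.getnames())


# Example call with the package created above:
# list_package_members('/content/datalab/tmp/ml/iris/trainer-0.1.tar.gz')
```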


Specify the package explicitly with package_uris. This time we also specify 'parameter_server_count' and 'worker_count', so the run is distributed. In a cloud run, this is equivalent to the ps, worker, and master each running in their own VM. <br>
Note that the training output contains three log links.

In [18]:
%%mlalpha train
package_uris: /content/datalab/tmp/ml/iris/trainer-0.1.tar.gz
python_module: trainer.task
worker_count: 1
parameter_server_count: 1
args:
  train_data_paths:
    - /content/datalab/tmp/ml/iris/preprocessed/features_train.tfrecord.gz
  eval_data_paths:
    - /content/datalab/tmp/ml/iris/preprocessed/features_eval.tfrecord.gz
  metadata_path: /content/datalab/tmp/ml/iris/preprocessed/metadata.yaml
  output_path: /content/datalab/tmp/ml/iris/model_lr
  max_steps: 2500
  learning_rate: 0.001
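
For reference, a distributed TensorFlow process typically learns its role from the `TF_CONFIG` environment variable, which `main()` in the "task" module reads. The sketch below builds one such value per role for a setup with one master, one worker, and one parameter server. The host names and ports are placeholders, `build_tf_configs` is a hypothetical helper, and the exact contents that Datalab or the cloud service sets are an assumption here.

```python
import json


def build_tf_configs(worker_count=1, parameter_server_count=1):
    """Build an illustrative TF_CONFIG JSON string for each process."""
    # Placeholder addresses; a real cluster provides its own host:port values.
    cluster = {
        'master': ['localhost:2222'],
        'worker': ['localhost:%d' % (2300 + i) for i in range(worker_count)],
        'ps': ['localhost:%d' % (2400 + i)
               for i in range(parameter_server_count)],
    }
    configs = {}
    for role, hosts in cluster.items():
        for index in range(len(hosts)):
            # Every process sees the same cluster but a different task entry.
            configs['%s-%d' % (role, index)] = json.dumps(
                {'cluster': cluster, 'task': {'type': role, 'index': index}},
                sort_keys=True)
    return configs


for name, value in sorted(build_tf_configs().items()):
    print(name, value)
```

Each process would receive its own value, so the same trainer code can act as master, worker, or parameter server depending on the `task` entry.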

### Cloud Training

Cloud training is similar, but it uses the "--cloud" flag and GCS paths instead of local paths. <br>
We will use the preprocessed files created by cloud preprocessing in the previous "Preprocess" notebook.

Define variables that will be used later.

In [4]:
import os

bucket = 'gs://' + datalab_project_id() + '-sampledata'
package_path = os.path.join(bucket, 'iris', 'model', 'trainer-0.1.tar.gz')
train_data_path = os.path.join(bucket, 'iris', 'preprocessed', 'features_train.tfrecord.gz')
eval_data_path = os.path.join(bucket, 'iris', 'preprocessed', 'features_eval.tfrecord.gz')
metadata_path = os.path.join(bucket, 'iris', 'preprocessed', 'metadata.yaml')
output_dir = os.path.join(bucket, 'iris', 'trained')
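
The cell above builds GCS URIs with `os.path.join`, which produces forward slashes on the Linux environment that Datalab runs in. As a hedged alternative, the sketch below uses `posixpath.join`, which uses forward slashes on any platform. `gcs_path` is a hypothetical helper and the bucket name is a placeholder.

```python
import posixpath


def gcs_path(bucket, *parts):
    """Join a bucket URI and path components with forward slashes."""
    return posixpath.join(bucket, *parts)


bucket = 'gs://my-project-sampledata'  # placeholder bucket name
print(gcs_path(bucket, 'iris', 'preprocessed', 'metadata.yaml'))
# gs://my-project-sampledata/iris/preprocessed/metadata.yaml
```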

Copy trainer package to a GCS path.

In [37]:
!gsutil cp /content/datalab/tmp/ml/iris/trainer-0.1.tar.gz $package_path

Copying file:///content/datalab/tmp/ml/iris/trainer-0.1.tar.gz [Content-Type=application/x-tar]...
/ [1 files][  4.4 KiB/  4.4 KiB]                                                
Operation completed over 1 objects/4.4 KiB.                                      


Cloud training has additional required fields. It needs an explicit trainer package, so 'package_uris' and 'python_module' are required. 'scale_tier' and 'region' are also required; they specify the training scale and the location. 'scale_tier: BASIC' means a single master only. <br>
Note that "--cloud" makes the job run in the cloud instead of locally.

In [38]:
%%mlalpha train --cloud
package_uris: $package_path
python_module: trainer.task
scale_tier: BASIC
region: us-central1
args:
  train_data_paths:
    - $train_data_path
  eval_data_paths:
    - $eval_data_path
  metadata_path: $metadata_path
  output_path: $output_dir
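
The required fields for a cloud run can be checked before submitting. The following is only an illustrative sketch: `REQUIRED_CLOUD_FIELDS` and `missing_cloud_fields` are hypothetical names, the bucket is a placeholder, and the real validation is done by the service, not by this code.

```python
# Fields described above as required for cloud training.
REQUIRED_CLOUD_FIELDS = ('package_uris', 'python_module', 'scale_tier', 'region')


def missing_cloud_fields(request):
    """Return the required fields that are absent or empty in the request dict."""
    return [field for field in REQUIRED_CLOUD_FIELDS if not request.get(field)]


request = {
    'package_uris': 'gs://my-project-sampledata/iris/model/trainer-0.1.tar.gz',
    'python_module': 'trainer.task',
    'scale_tier': 'BASIC',
}
print(missing_cloud_fields(request))  # ['region']
```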

View the job status as described in the output. You can also run "%mlalpha jobs --filter state!=SUCCEEDED" to list all ML jobs in that project that have not succeeded, which includes the active ones.

In [34]:
%mlalpha jobs --name trainer_task_160929_034432

View the trained model once the state is 'SUCCEEDED':

In [40]:
!gsutil ls $output_dir

gs://cloud-ml-test-automated-sampledata/iris/trained/
gs://cloud-ml-test-automated-sampledata/iris/trained/model/
gs://cloud-ml-test-automated-sampledata/iris/trained/train/


TensorBoard also works with a GCS path.

In [41]:
%%tensorboard start --logdir $output_dir