# Hello World notebook for Amazon SageMaker pipe mode with TensorFlow

New data scientists and machine learning engineers have a treasure trove of examples available on the internet to help them get started. These examples tend to use small public datasets and demonstrate common use cases and approaches. The data can be downloaded quickly to a training instance, and training then runs from the local copy. However, many customers have large-scale machine learning datasets for which downloading the full dataset up front is prohibitively slow. Imagine your training algorithm waiting for a download of 100 GB of medical images, or 100 TB of video, to complete.

Amazon SageMaker provides Pipe mode for exactly this purpose. Pipe mode opens a channel to your dataset in Amazon S3 and streams the data to your training algorithm as it is consumed, instead of downloading everything first. Training can start quickly, and the dataset can be much larger than the storage attached to the training instance.
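To make the streaming idea concrete, here is a minimal sketch in plain Python. It is not how SageMaker implements Pipe mode; it only illustrates consuming a stream in fixed-size batches without ever holding the whole dataset in memory or on disk. The `stream_batches` helper, the 4-byte record size, and the in-memory `BytesIO` stand-in for a channel are all assumptions made for illustration.

```python
import io
from typing import BinaryIO, Iterator, List


def stream_batches(stream: BinaryIO, record_size: int,
                   batch_size: int) -> Iterator[List[bytes]]:
    """Yield lists of fixed-size records read sequentially from a stream."""
    batch: List[bytes] = []
    while True:
        record = stream.read(record_size)
        if len(record) < record_size:
            break  # end of stream (a trailing partial record is dropped)
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller, batch


# Hypothetical stand-in for a streamed channel: 10 records of 4 bytes each.
fake_channel = io.BytesIO(b"".join(i.to_bytes(4, "little") for i in range(10)))
batch_sizes = [len(batch)
               for batch in stream_batches(fake_channel, record_size=4, batch_size=4)]
print(batch_sizes)
```

Only one batch is held at a time, which is why the total size of the stream does not matter to the consumer.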

While several examples of Pipe mode are available, it can be difficult to get it working for your specific scenario given the vast number of possible use cases. This notebook provides an end-to-end example of training with a fairly common combination:

1. Custom TensorFlow neural network.
2. Script mode using SageMaker's TensorFlow container.
3. Pipe mode to incrementally stream data to the neural network.
4. Data stored in TFRecords format.

## Simple classification dataset

For this example, we use a simple numeric dataset for binary classification. Because the focus of this notebook is on demonstrating Pipe mode quickly and easily, we use a handful of features and limit the number of samples. Feel free to scale it up to see the approach in action on large datasets. To get started, we create a synthetic dataset and split it into train, test, and validation sets.

In [18]:
NUM_FEATURES = 5
NUM_SAMPLES  = 1000
NUM_FILES    = 10

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np

X1, Y1 = make_classification(n_samples=NUM_SAMPLES, n_features=NUM_FEATURES, n_redundant=0, 
                             n_informative=1, n_classes=2, n_clusters_per_class=1, 
                             shuffle=True, class_sep=2.0)

# split data into train and test sets
seed = 7
val_size  = 0.20
test_size = 0.10

# Give 70% to train
X_train, X_test, y_train, y_test = \
    train_test_split(X1, Y1, test_size=(test_size + val_size), random_state=seed)
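# Split the remaining 30%. The first returned part is the complement of test_size,
# so as written X_test receives about 2/3 of it and X_val about 1/3. Swap the
# targets on the left-hand side to get the 20% validation / 10% test split that
# val_size and test_size imply.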
X_test, X_val, y_test, y_val     = \
    train_test_split(X_test, y_test, test_size=(test_size / (test_size + val_size)), 
                     random_state=seed)

print('Train shape: {}, Test shape: {}, Val shape: {}'.format(X_train.shape, 
                                                              X_test.shape, X_val.shape))
print('Train target: {}, Test target: {}, Val target: {}'.format(y_train.shape, 
                                                                 y_test.shape, y_val.shape))
print('\nSample observation: {}\nSample target: {}'.format(X_test[0], y_test[0]))


Train shape: (699, 5), Test shape: (200, 5), Val shape: (101, 5)
Train target: (699,), Test target: (200,), Val target: (101,)

Sample observation: [-0.40218888 -2.29887905 -0.20157193  0.52203548 -1.09193016]
Sample target: 0
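The shapes printed above (699, 200, and 101 rather than a round 700, 200, and 100) follow from how fractional split sizes are rounded. The sketch below reproduces the arithmetic in plain Python. It assumes that scikit-learn rounds the held-out count up with `ceil` and gives the remainder to the first returned part; `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math


def split_sizes(n_samples: int, test_fraction: float):
    """Return (n_first, n_second) for a fractional hold-out split.

    Assumes the held-out (second) part is ceil(test_fraction * n_samples)
    and the first part receives whatever is left.
    """
    n_second = math.ceil(test_fraction * n_samples)
    return n_samples - n_second, n_second


val_size, test_size = 0.20, 0.10

# First split: 0.10 + 0.20 is slightly above 0.3 in floating point,
# so the held-out part rounds up to 301 instead of 300.
n_train, n_holdout = split_sizes(1000, test_size + val_size)

# Second split: the first returned part (assigned to X_test in the notebook)
# keeps about 2/3 of the hold-out; the second part (X_val) gets about 1/3.
n_test, n_val = split_sizes(n_holdout, test_size / (test_size + val_size))

print(n_train, n_test, n_val)
```

Passing integer sample counts instead of fractions to `train_test_split` is one way to avoid this kind of rounding surprise.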


In [103]:
num_train_samples = X_train.shape[0]
num_val_samples = X_val.shape[0]
num_test_samples = X_test.shape[0]

## Saving data to TFRecord files

Pipe mode supports the RecordIO, TFRecord, and TextLine record formats. Here we use TFRecord. For each of the train, test, and validation splits, we write a set of files so we can see how Pipe mode handles a channel made up of multiple files. Note that each file in a set holds a full copy of its split.

In [21]:
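import os
import tensorflow as tf

# Helpers that wrap a single scalar in a tf.train.Feature.
# They are not used below; convert_to_tfr fills the features directly.
def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def _float_feature(value):
    return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

def convert_to_tfr(x, y, out_file):
    """Write each (features, label) pair as one tf.train.Example record."""
    # tf.python_io is the TensorFlow 1.x API (later versions use tf.io).
    with tf.python_io.TFRecordWriter(out_file) as record_writer:
        for i in range(len(x)):
            example = tf.train.Example()
            example.features.feature['features'].float_list.value.extend(x[i])
            example.features.feature['label'].int64_list.value.append(int(y[i]))
            record_writer.write(example.SerializeToString())

# Create the output directories if they do not exist yet.
for split in ('train', 'test', 'val'):
    os.makedirs('./data/{}'.format(split), exist_ok=True)

# Every file receives a full copy of its split.
for i in range(NUM_FILES):
    convert_to_tfr(X_train, y_train, './data/train/train{}.tfrecords'.format(i))
    convert_to_tfr(X_test,  y_test,  './data/test/test{}.tfrecords'.format(i))
    convert_to_tfr(X_val,   y_val,   './data/val/val{}.tfrecords'.format(i))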
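
When checking files like the ones written above, it can help to know how a TFRecord file is laid out: each record is a little-endian `uint64` length, a 4-byte checksum of that length, the payload bytes, and a 4-byte checksum of the payload. The sketch below walks that framing without TensorFlow so that records can be counted. It deliberately skips checksum validation, and `iter_tfrecord_payloads` and `frame` are hypothetical helpers written for this illustration.

```python
import io
import struct
from typing import BinaryIO, Iterator


def iter_tfrecord_payloads(stream: BinaryIO) -> Iterator[bytes]:
    """Yield raw record payloads from a TFRecord byte stream.

    Checksums are read and ignored, so corrupted data is not detected.
    """
    while True:
        header = stream.read(12)  # 8-byte length + 4-byte length checksum
        if not header:
            return  # clean end of stream
        if len(header) < 12:
            raise ValueError("truncated record header")
        (length,) = struct.unpack("<Q", header[:8])
        payload = stream.read(length)
        footer = stream.read(4)  # payload checksum, ignored here
        if len(payload) < length or len(footer) < 4:
            raise ValueError("truncated record")
        yield payload


def frame(payload: bytes) -> bytes:
    """Build a TFRecord-style frame with placeholder (invalid) checksums."""
    return struct.pack("<Q", len(payload)) + b"\x00" * 4 + payload + b"\x00" * 4


# Hypothetical in-memory data; for a real file, pass open(path, "rb") instead.
fake_file = io.BytesIO(frame(b"first") + frame(b"second record"))
payloads = list(iter_tfrecord_payloads(fake_file))
print(len(payloads), payloads[0])
```

The frames built by `frame` carry zeroed checksums, so TensorFlow's own reader would reject them; they exist only to exercise the sketch. Each payload in a real file from this notebook would be a serialized `tf.train.Example`.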

## Upload the data to S3

In [22]:
import sagemaker
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()
role = get_execution_role()

bucket = sagemaker_session.default_bucket()
prefix = 'sagemaker/DEMO-hello-pipe-mode'

inputs = sagemaker_session.upload_data(path='data', key_prefix='data/DEMO-hello-pipe-mode')
print(inputs)

s3://sagemaker-us-east-1-355151823911/data/DEMO-hello-pipe-mode


## Create a training job using the `TensorFlow` estimator

The `sagemaker.tensorflow.TensorFlow` estimator handles locating the script mode container, uploading your script to an S3 location, and creating a SageMaker training job.

In [110]:
from sagemaker.tensorflow import TensorFlow

train_instance_type = 'ml.c5.xlarge' 
serve_instance_type = 'ml.m4.xlarge'

hyperparameters = {'epochs': 5, 'batch_size': 8, 'model_dir': '/opt/ml/model',
                   'num_train_samples': num_train_samples,
                   'num_val_samples': num_val_samples,
                   'num_test_samples': num_test_samples}

pipe_estimator = TensorFlow(entry_point='pipe_train.py',
                            source_dir='scripts',
                            input_mode='Pipe',  # use 'File' to download the data instead
                            train_instance_type=train_instance_type,
                            train_instance_count=1,
                            metric_definitions=[
                                {'Name' : 'validation:acc',
                                 'Regex': '- val_acc: (.*?$)'},
                                {'Name' : 'validation:loss',
                                 'Regex': '- val_loss: (.*?) '}],
                            hyperparameters=hyperparameters,
                            role=role,
                            framework_version='1.12',
                            py_version='py3',
                            script_mode=True)
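
In script mode, the entries of `hyperparameters` reach the entry point as command-line arguments. The contents of `pipe_train.py` are not shown here, so the sketch below is only an assumption about how such a script might read them; `parse_args` is a hypothetical helper, and deriving `steps_per_epoch` from the sample count is one plausible use of the `num_*_samples` values, since a streamed dataset has no known length.

```python
import argparse


def parse_args(argv=None):
    """Parse the hyperparameters that script mode passes as CLI flags."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--batch_size", type=int, default=8)
    parser.add_argument("--model_dir", type=str, default="/opt/ml/model")
    parser.add_argument("--num_train_samples", type=int, required=True)
    parser.add_argument("--num_val_samples", type=int, required=True)
    parser.add_argument("--num_test_samples", type=int, required=True)
    # Ignore any extra flags the container may add.
    args, _unknown = parser.parse_known_args(argv)
    return args


# Simulated command line mirroring the hyperparameters dictionary above.
args = parse_args([
    "--epochs", "5", "--batch_size", "8", "--model_dir", "/opt/ml/model",
    "--num_train_samples", "699", "--num_val_samples", "101",
    "--num_test_samples", "200",
])

# Number of full batches in one pass over the training samples.
steps_per_epoch = args.num_train_samples // args.batch_size
print(args.epochs, steps_per_epoch)
```

Using `parse_known_args` rather than `parse_args` keeps the script from failing if it receives flags it does not declare.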

In [None]:
remote_inputs = {'train' : inputs+'/train', 
                 'val'   : inputs+'/val', 
                 'test'  : inputs+'/test'}
pipe_estimator.fit(remote_inputs, wait=True)

2019-05-24 21:12:02 Starting - Starting the training job..
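
As training proceeds, the `metric_definitions` patterns passed to the estimator are applied to the job's log output to extract metric values. The sketch below checks the two patterns locally with Python's `re` module. The log line is a hypothetical Keras-style progress line, not output captured from this job, and SageMaker's own matching may differ in detail from `re.search`.

```python
import re

# Patterns copied from the metric_definitions passed to the estimator.
metric_definitions = [
    {"Name": "validation:acc", "Regex": "- val_acc: (.*?$)"},
    {"Name": "validation:loss", "Regex": "- val_loss: (.*?) "},
]

# Hypothetical Keras-style progress line; real logs may differ.
log_line = (" - 1s 2ms/step - loss: 0.2100 - acc: 0.9300"
            " - val_loss: 0.1800 - val_acc: 0.9500")

metrics = {}
for definition in metric_definitions:
    match = re.search(definition["Regex"], log_line)
    if match:
        metrics[definition["Name"]] = float(match.group(1))

print(metrics)
```

Note that the `val_loss` pattern relies on a trailing space after the value, so it only matches when another field follows it on the same line.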