# Amazon SageMaker pipe mode with TensorFlow

New data scientists and machine learning engineers have a treasure trove of examples available on the internet to help them get started. These examples typically use small public datasets and demonstrate common use cases and approaches. The data in these examples can be downloaded quickly to a training instance, and training can be completed from there. However, many customers have large-scale machine learning datasets for which downloading the full dataset up front is impractical. Imagine your training algorithm waiting for a download of 100GB of medical images, or 100TB of video.

Amazon SageMaker provides Pipe mode for exactly this purpose. Pipe mode lets you establish a channel to your dataset and feed your training algorithm batches of that data incrementally. Your training can start quickly, and the dataset no longer has to fit on the training instance's disk.
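
To make the incremental consumption pattern concrete, here is a minimal sketch in plain Python. It is not how SageMaker implements Pipe mode: `record_stream` and `batches` are hypothetical names, and the generator only stands in for a channel that yields records as they arrive.

```python
from itertools import islice

def record_stream(num_records):
    """Hypothetical stand-in for a data channel: yields records one at a time."""
    for i in range(num_records):
        yield {"id": i}

def batches(stream, batch_size):
    """Group an iterator into lists of at most batch_size items without materializing it."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

# Only one batch is held in memory at a time, however long the stream is.
batch_sizes = [len(b) for b in batches(record_stream(10), batch_size=4)]
print(batch_sizes)
```

The consumer never needs to know the total number of records, which is why training can begin before the full dataset has been transferred.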

While there are several examples available on the use of Pipe mode, not all possible scenarios and use cases are covered. This notebook provides an end-to-end example for training with a fairly common combination:

1. Custom TensorFlow neural network.
2. Script mode using SageMaker's TensorFlow container.
3. Pipe mode to incrementally stream data to the neural network.
4. Data stored in TFRecords format.
5. Data channels containing multiple files.

## Simple synthetic classification dataset

For this example, we use a simple numeric dataset for binary classification. Because this notebook focuses on demonstrating Pipe mode quickly and easily, the synthetic dataset has a configurable number of features and samples. Feel free to scale it up to see the approach in action on large datasets. To get started, we create the dataset and split it into train, test, and validation sets.

Our training script uses the number of features to define the input shape for a simple TensorFlow neural network. Here we use a `sed` script to ensure the training script is consistent with our generated dataset.

In [122]:
import os
import shutil

In [123]:
NUM_SAMPLES  = 5000
NUM_FILES    = 50
NUM_EPOCHS   = 25
BATCH_SIZE   = 32
INPUT_MODE   = 'Pipe' # Can try it with 'File' mode as well

NUM_FEATURES = 7499
!sed 's/NUM_FEATURES = /NUM_FEATURES = {NUM_FEATURES} \#/' scripts/train.py > scripts/tmp.py
!mv scripts/tmp.py scripts/train.py
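
The `sed` substitution above keeps whatever followed `NUM_FEATURES = ` on the line and pushes it behind a `#`, so rerunning the cell keeps appending commented-out values. As a hedged alternative, the sketch below rewrites the whole assignment line in Python so that repeated runs give the same result. `set_num_features` is a hypothetical helper, and the two-line `script` string is only a stand-in for `scripts/train.py`, which is assumed to contain a top-level line of the form `NUM_FEATURES = ...`.

```python
import re

def set_num_features(source, num_features):
    """Rewrite the `NUM_FEATURES = ...` assignment line in a script's source text."""
    pattern = re.compile(r"^NUM_FEATURES\s*=.*$", flags=re.MULTILINE)
    return pattern.sub("NUM_FEATURES = {}".format(num_features), source)

# Hypothetical two-line excerpt standing in for scripts/train.py.
script = "import os\nNUM_FEATURES = 10\n"
once = set_num_features(script, 7499)
twice = set_num_features(once, 7499)  # applying it again changes nothing
print(once)
```

To apply this to the real file, read `scripts/train.py` into a string, pass it through the helper, and write the result back.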

In [124]:
!pygmentize scripts/train.py

#     Copyright 2019 Amazon.com, Inc. or its affiliates. All Rights Reserved.
#
#     Licensed under the Apache License, Version 2.0 (the "License").
#     You may not use this file except in compliance with the License.
#     A copy of the License is located at
#
#         https://aws.amazon.com/apache-2-0/
#
#     or in the "license" file accompanying this file. This file is distributed
#     on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either
#     express or implied. See the License for the specific language governing
#     permissions and limitations under the License.

import os
import argparse
from os import listdir
from

In [125]:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np

X1, Y1 = make_classification(n_samples=NUM_SAMPLES, n_features=NUM_FEATURES, n_redundant=0, 
                             n_informative=1, n_classes=2, n_clusters_per_class=1, 
                             shuffle=True, class_sep=2.0)

# split data into train and test sets
seed = 7
val_size  = 0.20
test_size = 0.10

# Give 70% to train
X_train, X_test, y_train, y_test = \
    train_test_split(X1, Y1, test_size=(test_size + val_size), random_state=seed)
# Of the remaining 30%, the first returned part (2/3) becomes X_test and the second (1/3) becomes X_val
X_test, X_val, y_test, y_val     = \
    train_test_split(X_test, y_test, test_size=(test_size / (test_size + val_size)), 
                     random_state=seed)

print('Train shape: {}, Test shape: {}, Val shape: {}'.format(X_train.shape, 
                                                              X_test.shape, X_val.shape))
print('Train target: {}, Test target: {}, Val target: {}'.format(y_train.shape, 
                                                                 y_test.shape, y_val.shape))
print('\nSample observation: {}\nSample target: {}'.format(X_test[0], y_test[0]))

Train shape: (3499, 7499), Test shape: (1000, 7499), Val shape: (501, 7499)
Train target: (3499,), Test target: (1000,), Val target: (501,)

Sample observation: [ 0.10480941  0.20289936 -0.35678267 ...  0.89810205 -0.59653065
 -0.59281759]
Sample target: 0
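
The printed shapes are slightly uneven (3499 rather than 3500 training rows). The sketch below reproduces them with plain arithmetic, under the assumption that the held-out count of each split is rounded up, which matches the output above. `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math

def split_sizes(n_samples, test_fraction):
    """Return (n_first, n_second), assuming the held-out count is rounded up."""
    n_second = math.ceil(test_fraction * n_samples)
    return n_samples - n_second, n_second

val_size, test_size = 0.20, 0.10

# First split: 0.10 + 0.20 is slightly above 0.3 in floating point,
# so the held-out part rounds up to 1501 rather than 1500.
n_train, n_holdout = split_sizes(5000, test_size + val_size)

# Second split: the first returned part (2/3) is bound to X_test,
# the second (1/3) to X_val.
n_test, n_val = split_sizes(n_holdout, test_size / (test_size + val_size))

print(n_train, n_test, n_val)
```

Note that with this unpacking order the test set ends up with roughly 20% of the data and the validation set with roughly 10%, the reverse of what the `val_size` and `test_size` names suggest.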


In [126]:
num_train_samples = X_train.shape[0]
num_val_samples   = X_val.shape[0]
num_test_samples  = X_test.shape[0]

## Saving data to TFRecord files

Pipe mode supports the RecordIO, TFRecord, and TextLine formats. Here we use TFRecord, and for each of the train, test, and val channels we generate a set of files so we can see how Pipe mode handles multiple files per channel. We divide each dataset into a configurable number of slices and save each slice to a separate file. With a massive dataset, splitting the data into separate files makes it easier to feed to your training algorithm and makes it possible to spread training across a cluster of machines to reduce training time.

In [127]:
import tensorflow as tf
from sagemaker.tensorflow import TensorFlow

In [128]:
def convert_to_tfr(x, y, out_file):
    """Write feature rows x and labels y to a TFRecord file, one Example per sample."""
    with tf.python_io.TFRecordWriter(out_file) as record_writer:
        num_samples = len(x)
        for i in range(num_samples):
            example = tf.train.Example()
            # Features go in a float list; the label goes in a single-element int64 list
            example.features.feature['features'].float_list.value.extend(x[i])
            example.features.feature['label'].int64_list.value.append(int(y[i]))
            record_writer.write(example.SerializeToString())
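
Each `record_writer.write` call appends one length-prefixed record, which is what lets a reader consume a file, or a stream of concatenated files, one record at a time. The sketch below illustrates that framing in plain Python, following the TFRecord layout described in the TensorFlow documentation (8-byte little-endian length, 4-byte checksum, payload, 4-byte checksum). `frame_record` and `iter_records` are hypothetical helpers, and the checksum fields are zero placeholders, so this is an illustration of the layout rather than a compatible TFRecord writer.

```python
import io
import struct

def frame_record(payload):
    """Wrap payload in TFRecord-style framing.

    The two checksum fields are zero placeholders (an assumption made to keep
    the sketch short), so a real TFRecord reader would likely reject this output.
    """
    header = struct.pack("<Q", len(payload))
    return header + b"\x00" * 4 + payload + b"\x00" * 4

def iter_records(stream):
    """Yield payloads one at a time from a binary stream, skipping checksum checks."""
    while True:
        header = stream.read(8)
        if len(header) < 8:
            return  # end of stream
        (length,) = struct.unpack("<Q", header)
        stream.read(4)  # checksum of the length field (not verified here)
        payload = stream.read(length)
        stream.read(4)  # checksum of the payload (not verified here)
        yield payload

stream = io.BytesIO(b"".join(frame_record(p) for p in [b"first", b"second"]))
records = list(iter_records(stream))
print(records)
```

Because the reader only ever needs the next header and payload, the same loop works whether the bytes come from a local file or from a stream that is still being filled.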

Remove old data directories and files if they exist. Recreate a data folder with subfolders for each of the three channels.

In [129]:
shutil.rmtree('data', ignore_errors=True)
os.makedirs('data/train')
os.makedirs('data/test')
os.makedirs('data/val')

Save each of the datasets into their own folder of files based on the configurable number of files. The data will be split as evenly as possible across that number of files in each channel.

In [130]:
def save_to_n_files(x, y, n_files, channel):
    _split_x = np.array_split(x, n_files)
    _split_y = np.array_split(y, n_files)
    for i in range(n_files):
        convert_to_tfr(_split_x[i], _split_y[i], 
                       './data/{}/{}{}.tfrecords'.format(channel, channel, i))

In [131]:
save_to_n_files(X_train, y_train, NUM_FILES, 'train')
save_to_n_files(X_test,  y_test,  NUM_FILES, 'test')
save_to_n_files(X_val,   y_val,   NUM_FILES, 'val')
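
To see what "as evenly as possible" means for the file sizes, the sketch below computes the number of samples per file in plain Python, following the documented rule for `np.array_split`: the first `n % k` slices get one extra item. `split_lengths` is a hypothetical helper used only for illustration.

```python
def split_lengths(n_items, n_files):
    """Sizes of the slices when n_items are split as evenly as possible:
    the first n_items % n_files slices get one extra item."""
    base, extra = divmod(n_items, n_files)
    return [base + 1 if i < extra else base for i in range(n_files)]

# 3499 training samples across 50 files: 49 files of 70 samples and 1 file of 69.
train_lengths = split_lengths(3499, 50)
print(train_lengths[0], train_lengths[-1], sum(train_lengths))
```

If `NUM_FILES` exceeds the number of samples in a channel, some slices are empty and the corresponding files contain no records.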

## Upload the input data to S3
Upload the entire data folder hierarchy to S3. For channels configured with `Pipe` mode, the data will be piped to the training job as the training algorithm progresses. If using `File` mode, the entire set of files for each data channel will be downloaded to the training instance at the start of the job.

In [132]:
%%time
import boto3
import sagemaker
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()
role = get_execution_role()

bucket = sagemaker_session.default_bucket()
data_prefix = 'data/DEMO-hello-pipe-mode'

CPU times: user 110 ms, sys: 12.6 ms, total: 123 ms
Wall time: 246 ms


In [133]:
# clear out any old data
s3 = boto3.resource('s3')
s3_bucket = s3.Bucket(bucket)
s3_bucket.objects.filter(Prefix=data_prefix + '/').delete()

# upload the entire set of data for all three channels
inputs = sagemaker_session.upload_data(path='data', key_prefix=data_prefix)
print('Data was uploaded to s3 at: {}'.format(inputs))

Data was uploaded to s3 at: s3://sagemaker-us-east-1-355151823911/data/DEMO-hello-pipe-mode


## Create a training job using the `TensorFlow` estimator

The `sagemaker.tensorflow.TensorFlow` estimator handles locating the script mode container, uploading your script to an S3 location, and creating a SageMaker training job.

In [134]:
train_instance_type = 'ml.c5.xlarge' 
serve_instance_type = 'ml.m4.xlarge'

In [135]:
from sagemaker.tensorflow import TensorFlow

hyperparameters = {'epochs'    : NUM_EPOCHS,
                   'batch_size': BATCH_SIZE,
                   'num_train_samples': num_train_samples,
                   'num_val_samples'  : num_val_samples,
                   'num_test_samples' : num_test_samples}

estimator = TensorFlow(entry_point='train.py',
                       source_dir='scripts',
                       input_mode=INPUT_MODE,
                       train_instance_type=train_instance_type,
                       train_instance_count=1,
                       metric_definitions=[
                           {'Name': 'validation:acc',  'Regex': '- val_acc: (.*?$)'},
                           {'Name': 'validation:loss', 'Regex': '- val_loss: (.*?) '}],
                       hyperparameters=hyperparameters,
                       role=sagemaker.get_execution_role(),
                       framework_version='1.12',
                       py_version='py3',
                       base_job_name='hello-pipe-mode')
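
The `metric_definitions` regexes are applied to the training job's log output, and the captured group becomes the metric value. The sketch below checks the two patterns locally against a sample line. The log line is hypothetical, written in the style of a Keras end-of-epoch summary; the exact format emitted by `train.py` is an assumption.

```python
import re

# Regexes copied from the metric_definitions above.
VAL_ACC = re.compile(r"- val_acc: (.*?$)")
VAL_LOSS = re.compile(r"- val_loss: (.*?) ")

# Hypothetical log line in the style of a Keras end-of-epoch summary.
log_line = " - 2s - loss: 0.4100 - acc: 0.8300 - val_loss: 0.3900 - val_acc: 0.8500"

val_loss = float(VAL_LOSS.search(log_line).group(1))
val_acc = float(VAL_ACC.search(log_line).group(1))
print(val_loss, val_acc)
```

Note the implicit assumptions in the patterns: the `val_acc` regex captures everything to the end of the line, so it expects `val_acc` to be the last field, while the `val_loss` regex needs a space after the value.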

In [136]:
%%time
remote_inputs = {'train' : inputs+'/train', 
                 'val'   : inputs+'/val', 
                 'test'  : inputs+'/test'}
estimator.fit(remote_inputs, wait=True)

2019-05-28 12:22:30 Starting - Starting the training job...
2019-05-28 12:22:31 Starting - Launching requested ML instances......
2019-05-28 12:23:36 Starting - Preparing the instances for training...
2019-05-28 12:24:10 Downloading - Downloading input data..
2019-05-28 12:24:36,108 sagemaker-containers INFO     Imported framework sagemaker_tensorflow_container.training
2019-05-28 12:24:36,121 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2019-05-28 12:24:36,473 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2019-05-28 12:24:36,487 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2019-05-28 12:24:36,497 sagemaker-containers INFO     Invoking user script

Training Env:

{
    "additional_framework_parameters": {},
    "channel_input_dirs": {
        "val": "/opt/ml/input/data/val",
        "test": "/opt/ml/input/data/test",
        "train": 

## Deploy and make predictions

In [141]:
%%time
predictor = estimator.deploy(initial_instance_count=1,
                             instance_type=serve_instance_type)

Using already existing model: hello-pipe-mode-2019-05-28-12-22-29-552


--------------------------------------------------------------------------------------!
CPU times: user 334 ms, sys: 36.3 ms, total: 370 ms
Wall time: 7min 16s


Make a handful of predictions to ensure the model is being served properly and is making accurate predictions. The endpoint should yield similar accuracy to that reported at the end of the training job, as it evaluates the model using the same test dataset.

In [142]:
total_to_test = 100
num_accurate  = 0

for i in range(total_to_test):
    result = predictor.predict(X_test[i])
    predicted_prob = result['predictions'][0][0]
    predicted_label = round(predicted_prob)
    if y_test[i] == predicted_label:
        num_accurate += 1
        print('PASS. Actual: {:.0f}, Prob: {:.4f}'.format(y_test[i], predicted_prob))
    else:
        print('FAIL. Actual: {:.0f}, Prob: {:.4f}'.format(y_test[i], predicted_prob))
print('Acc: {:.2%}'.format(num_accurate/total_to_test))

PASS. Actual: 0, Prob: 0.0005
PASS. Actual: 1, Prob: 1.0000
PASS. Actual: 0, Prob: 0.0000
PASS. Actual: 1, Prob: 0.9999
PASS. Actual: 0, Prob: 0.0000
PASS. Actual: 0, Prob: 0.0201
PASS. Actual: 0, Prob: 0.0007
FAIL. Actual: 1, Prob: 0.4353
PASS. Actual: 0, Prob: 0.0000
PASS. Actual: 1, Prob: 1.0000
FAIL. Actual: 0, Prob: 0.9999
PASS. Actual: 1, Prob: 0.9977
FAIL. Actual: 1, Prob: 0.0000
PASS. Actual: 0, Prob: 0.0000
FAIL. Actual: 1, Prob: 0.0001
FAIL. Actual: 0, Prob: 0.9926
FAIL. Actual: 1, Prob: 0.0011
FAIL. Actual: 0, Prob: 0.9961
PASS. Actual: 0, Prob: 0.0105
PASS. Actual: 0, Prob: 0.0005
PASS. Actual: 1, Prob: 1.0000
PASS. Actual: 1, Prob: 0.9995
PASS. Actual: 1, Prob: 0.8669
PASS. Actual: 0, Prob: 0.0000
FAIL. Actual: 0, Prob: 0.5256
FAIL. Actual: 1, Prob: 0.1539
PASS. Actual: 0, Prob: 0.0106
FAIL. Actual: 1, Prob: 0.0000
PASS. Actual: 1, Prob: 0.9957
PASS. Actual: 1, Prob: 1.0000
PASS. Actual: 0, Prob: 0.0000
PASS. Actual: 0, Prob: 0.0922
FAIL. Actual: 1, Prob: 0.0000
PASS. Actu
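
The loop above turns each probability into a label with `round`, which in Python 3 rounds exact halves to the nearest even integer (so `round(0.5)` is `0`). As a hedged alternative, the sketch below makes the decision threshold explicit. `accuracy` is a hypothetical helper, and the five label and probability pairs are copied from the first lines of the output above.

```python
def accuracy(labels, probabilities, threshold=0.5):
    """Fraction of samples whose thresholded probability equals the label."""
    predictions = [1 if p >= threshold else 0 for p in probabilities]
    correct = sum(int(y == y_hat) for y, y_hat in zip(labels, predictions))
    return correct / len(labels)

# A few (label, probability) pairs taken from the output above.
labels = [0, 1, 0, 1, 1]
probs = [0.0005, 1.0000, 0.0000, 0.9999, 0.4353]
print('Acc: {:.2%}'.format(accuracy(labels, probs)))
```

An explicit threshold also makes it easy to trade precision against recall by moving the cut-off away from 0.5.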

## Clean up

In [139]:
sagemaker_session.delete_endpoint(predictor.endpoint)