- Summary: Keras Sequential NN model on SageMaker
- Source: https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-python-sdk/keras_script_mode_pipe_mode_horovod/tensorflow_keras_CIFAR10.ipynb
- Dataset: CIFAR-10, 10 classes, 6,000 images per class (60,000 images total)

In [31]:

import os
import sagemaker
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()

role = get_execution_role()

In [12]:
# !python generate_cifar10_tfrecords.py --data-dir ./data

In [35]:
from sagemaker.tensorflow import TensorFlow


# Local

In [41]:
import subprocess

instance_type = 'local'

if subprocess.call('nvidia-smi') == 0:
    instance_type = 'local_gpu'
    
local_hyperparameters = {
    'epochs': 2,
    'batch-size': 64
}

source_dir = os.path.join(os.getcwd(), 'source_dir')

estimator = TensorFlow(
    entry_point='cifar10_keras_main.py',
    source_dir=source_dir,
    role=role,
    framework_version='1.12.0',
    train_instance_count=1,
    train_instance_type=instance_type,  # use the detected mode ('local' or 'local_gpu')
    py_version='py3',
    hyperparameters=local_hyperparameters
)
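
The GPU check in this cell calls `nvidia-smi` directly, which raises `FileNotFoundError` on machines where the tool is not installed. A more defensive sketch follows, assuming the same two local modes (`local` and `local_gpu`); the helper name `detect_local_instance_type` is hypothetical and not part of the SageMaker SDK.

```python
import shutil
import subprocess


def detect_local_instance_type():
    """Return 'local_gpu' if nvidia-smi exists and exits with 0, else 'local'."""
    # shutil.which returns None when the executable is not on PATH
    if shutil.which('nvidia-smi') is None:
        return 'local'
    result = subprocess.run(
        ['nvidia-smi'],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return 'local_gpu' if result.returncode == 0 else 'local'


instance_type = detect_local_instance_type()
print(instance_type)
```

The result can be passed as `train_instance_type` in place of a hard-coded string.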


In [44]:
local_inputs = {
    'train': 'file://' + os.getcwd() + '/data/train',
    'validation': 'file://' + os.getcwd() + '/data/validation',
    'eval': 'file://' + os.getcwd() + '/data/eval'
}

estimator.fit(local_inputs)

Creating tmprg0dav80_algo-1-jpgbf_1 ... done
Attaching to tmprg0dav80_algo-1-jpgbf_1
algo-1-jpgbf_1  | 2020-03-29 10:46:31,029 sagemaker-containers INFO     Imported framework sagemaker_tensorflow_container.training
algo-1-jpgbf_1  | 2020-03-29 10:46:31,035 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
algo-1-jpgbf_1  | 2020-03-29 10:46:31,211 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
algo-1-jpgbf_1  | 2020-03-29 10:46:31,229 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
algo-1-jpgbf_1  | 2020-03-29 10:46:31,242 sagemaker-containers INFO     Invoking user script
algo-1-jpgbf_1  | 
algo-1-jpgbf_1  | Training Env:
algo-1-jpgbf_1  | 
algo-1-jpgbf_1  | {
algo-1-jpgbf_1  |     "additional_framework_parameters": {},
algo-1-jpgbf_1  |     "channel_input_dirs": {
algo-1-jpgbf_1  | 

tmprg0dav80_algo-1-jpgbf_1 exited with code 0
Aborting on container exit...
===== Job Complete =====
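
In script mode the entries of the `hyperparameters` dict are handed to the entry point as command-line arguments (`--epochs 2 --batch-size 64` for the local run above). The contents of `cifar10_keras_main.py` are not shown here, so the following is only a sketch of how such a script might parse them; the function name and the default values are placeholders.

```python
import argparse


def parse_hyperparameters(argv=None):
    """Parse the two hyperparameters used in this notebook from CLI arguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--epochs', type=int, default=1)        # placeholder default
    parser.add_argument('--batch-size', type=int, default=32)   # placeholder default
    # parse_known_args ignores any extra arguments this sketch does not declare
    args, _unknown = parser.parse_known_args(argv)
    return args


args = parse_hyperparameters(['--epochs', '2', '--batch-size', '64'])
print(args.epochs, args.batch_size)
```

Note that argparse exposes `--batch-size` as `args.batch_size`, with the hyphen replaced by an underscore.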


# Run on SageMaker Cloud

In [None]:
dataset_location = sagemaker_session.upload_data(
    path='data',
    key_prefix='data/DEMO-cifar10-tf'
)
display(dataset_location)

In [None]:
# SageMaker can extract training metrics from the job logs with these regexes and publish them to CloudWatch.
keras_metric_definition = [
    {'Name': 'train:loss', 'Regex': '.*loss: ([0-9\\.]+) - acc: [0-9\\.]+.*'},
    {'Name': 'train:accuracy', 'Regex': '.*loss: [0-9\\.]+ - acc: ([0-9\\.]+).*'},
    {'Name': 'validation:accuracy', 'Regex': '.*step - loss: [0-9\\.]+ - acc: [0-9\\.]+ - val_loss: [0-9\\.]+ - val_acc: ([0-9\\.]+).*'},
    {'Name': 'validation:loss', 'Regex': '.*step - loss: [0-9\\.]+ - acc: [0-9\\.]+ - val_loss: ([0-9\\.]+) - val_acc: [0-9\\.]+.*'},
    {'Name': 'sec/steps', 'Regex': '.* - \\d+s (\\d+)[mu]s/step - loss: [0-9\\.]+ - acc: [0-9\\.]+ - val_loss: [0-9\\.]+ - val_acc: [0-9\\.]+'}
]
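
Before launching a paid training job it can help to check the regexes locally. The sketch below applies each pattern to a single log line with Python's `re` module. The sample line is hypothetical, written to resemble the Keras progress output these patterns target, and `re.search` is only an approximation of how the service matches log lines.

```python
import re

keras_metric_definition = [
    {'Name': 'train:loss', 'Regex': '.*loss: ([0-9\\.]+) - acc: [0-9\\.]+.*'},
    {'Name': 'train:accuracy', 'Regex': '.*loss: [0-9\\.]+ - acc: ([0-9\\.]+).*'},
    {'Name': 'validation:accuracy', 'Regex': '.*step - loss: [0-9\\.]+ - acc: [0-9\\.]+ - val_loss: [0-9\\.]+ - val_acc: ([0-9\\.]+).*'},
    {'Name': 'validation:loss', 'Regex': '.*step - loss: [0-9\\.]+ - acc: [0-9\\.]+ - val_loss: ([0-9\\.]+) - val_acc: [0-9\\.]+.*'},
    {'Name': 'sec/steps', 'Regex': '.* - \\d+s (\\d+)[mu]s/step - loss: [0-9\\.]+ - acc: [0-9\\.]+ - val_loss: [0-9\\.]+ - val_acc: [0-9\\.]+'},
]

# Hypothetical end-of-epoch line, not taken from a real run
sample_line = (
    '50000/50000 [==============================] - 12s 250us/step'
    ' - loss: 1.2345 - acc: 0.5678 - val_loss: 1.1111 - val_acc: 0.6000'
)

extracted = {}
for metric in keras_metric_definition:
    match = re.search(metric['Regex'], sample_line)
    if match:
        # Each regex has exactly one capture group holding the metric value
        extracted[metric['Name']] = float(match.group(1))

print(extracted)
```

A metric missing from `extracted` means its pattern did not match the line, which usually points to a change in the log format.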

In [None]:
estimator = TensorFlow(
    base_job_name='cifar10-tf',
    entry_point='cifar10_keras_main.py',
    source_dir=source_dir,
    role=role,
    framework_version='1.12.0',
    py_version='py3',
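    # NOTE: `hyperparameters` is not defined in this excerpt; define a dict
    # with the same keys as `local_hyperparameters` before running this cell.
    hyperparameters=hyperparameters,
    train_instance_count=1,
    train_instance_type='ml.p3.2xlarge',
    tags=[{'Key': 'Project', 'Value': 'cifar10'}, {'Key': 'TensorBoard', 'Value': 'file'}],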
    metric_definitions=keras_metric_definition
)

remote_inputs = {
    'train': dataset_location + '/train',
    'validation': dataset_location + '/validation',
    'eval': dataset_location + '/eval'
}

# To start the training job, pass these channels to fit():
# estimator.fit(remote_inputs)

In [46]:
# view metrics
from IPython.display import Markdown, display

link = 'https://console.aws.amazon.com/cloudwatch/home?region=' \
    + sagemaker_session.boto_region_name \
    + '#metricsV2:query=%7B/aws/sagemaker/TrainingJobs,TrainingJobName%7D%20' \
    + estimator.latest_training_job.job_name
display(Markdown('CloudWatch metrics: [link]('+link+')'))
display(Markdown('After you choose a metric, change the period to 1 Minute (Graphed Metrics -> Period)'))

CloudWatch metrics: [link](https://console.aws.amazon.com/cloudwatch/home?region=eu-west-1#metricsV2:query=%7B/aws/sagemaker/TrainingJobs,TrainingJobName%7D%20sagemaker-tensorflow-scriptmode-2020-03-29-10-46-28-016)

After you choose a metric, change the period to 1 Minute (Graphed Metrics -> Period)
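
The link above is the console URL followed by a percent-encoded metrics query. A small sketch that builds the same string with `urllib.parse.quote` instead of hand-written escapes; the function name is hypothetical, and the region and job name are the ones shown in the output above.

```python
from urllib.parse import quote


def cloudwatch_metrics_link(region, job_name):
    """Build the CloudWatch console link for a training job's metrics."""
    # The query has the form "{namespace,dimension} job_name"; the braces and
    # the space are percent-encoded, while '/' and ',' are left as-is.
    query = quote('{/aws/sagemaker/TrainingJobs,TrainingJobName} ', safe='/,')
    return (
        'https://console.aws.amazon.com/cloudwatch/home?region=' + region
        + '#metricsV2:query=' + query + quote(job_name)
    )


link = cloudwatch_metrics_link(
    'eu-west-1',
    'sagemaker-tensorflow-scriptmode-2020-03-29-10-46-28-016',
)
print(link)
```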

## Pipe mode

In [None]:

estimator_pipe = TensorFlow(
    base_job_name='pipe-cifar10-tf',
    entry_point='cifar10_keras_main.py',
    source_dir=source_dir,
    role=role,
    framework_version='1.12.0',
    py_version='py3',
    hyperparameters=hyperparameters,
    train_instance_count=1,
    train_instance_type='ml.p3.2xlarge',
    tags=[{'Key': 'Project', 'Value': 'cifar10'}, {'Key': 'TensorBoard', 'Value': 'pipe'}],
    metric_definitions=keras_metric_definition,
    input_mode='Pipe'
)
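
With `input_mode='Pipe'` the training data is streamed to the container instead of being downloaded before training starts, so the entry point has to read it differently. One way a script could find out which mode each channel uses is the `SM_INPUT_DATA_CONFIG` environment variable, which SageMaker training containers set to a JSON description of the channels. The sketch below assumes that variable and its `TrainingInputMode` field; the sample JSON is made up for illustration and the function name is hypothetical.

```python
import json
import os


def channel_input_modes(environ=os.environ):
    """Map each input channel to its input mode ('File' or 'Pipe')."""
    config = json.loads(environ.get('SM_INPUT_DATA_CONFIG', '{}'))
    # Assumption: a channel without an explicit mode is treated as 'File'
    return {
        name: spec.get('TrainingInputMode', 'File')
        for name, spec in config.items()
    }


# Made-up channel description for the three channels used in this notebook
sample_env = {
    'SM_INPUT_DATA_CONFIG': json.dumps({
        'train': {'TrainingInputMode': 'Pipe'},
        'validation': {'TrainingInputMode': 'Pipe'},
        'eval': {'TrainingInputMode': 'Pipe'},
    })
}

modes = channel_input_modes(sample_env)
print(modes)
```

A script could branch on this mapping to choose between reading files from disk and reading from the channel's stream.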

## Distributed training with Horovod

In [None]:
distributions = {
    'mpi': {
        'enabled': True,
        'processes_per_host': 1
    }
}

estimator_dist = TensorFlow(
    base_job_name='dist-cifar10-tf',
    entry_point='cifar10_keras_main.py',
    source_dir=source_dir,
    role=role,
    framework_version='1.12.0',
    py_version='py3',
    hyperparameters=hyperparameters,
    train_instance_count=2,
    train_instance_type='ml.p3.2xlarge',
    tags=[{'Key': 'Project', 'Value': 'cifar10'}, {'Key': 'TensorBoard', 'Value': 'dist'}],
    metric_definitions=keras_metric_definition,
    distributions=distributions
)
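
With MPI enabled, the number of Horovod workers is the number of instances multiplied by `processes_per_host`. For the configuration above that is 2 × 1 = 2, one process per `ml.p3.2xlarge` instance, each of which has a single GPU. A small sketch of that arithmetic; the helper name is hypothetical.

```python
def mpi_world_size(instance_count, distributions):
    """Total number of MPI processes for a given estimator configuration."""
    mpi = distributions.get('mpi', {})
    if not mpi.get('enabled'):
        raise ValueError('MPI is not enabled in this configuration')
    # Assumption: one process per host when 'processes_per_host' is omitted
    return instance_count * mpi.get('processes_per_host', 1)


distributions = {'mpi': {'enabled': True, 'processes_per_host': 1}}
world_size = mpi_world_size(2, distributions)
print(world_size)
```

On instance types with several GPUs, `processes_per_host` would typically be raised to match the GPU count so that each GPU gets its own worker.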

--------------------------------------------------------------------------------------------------