# Using TensorFlow with Amazon SageMaker's training and hosting services

1. [Introduction](#Introduction)
2. [Setup](#Setup)
3. [Train the model](#Train-the-model)
4. [Host the model](#Host-the-model)
5. [Clean up](#Clean-up)

## Introduction

The previous lab performed training and prediction directly in the Jupyter notebook environment. With this lab, we transition to leveraging SageMaker's managed training and hosting services. To accomplish this, we use [Amazon SageMaker's TensorFlow container](https://sagemaker.readthedocs.io/en/stable/using_tf.html), which lets you provide your training code as a Python script. The container also provides a flexible way for you to customize how inference inputs and outputs are handled over a REST interface. Here is a [blog post](https://aws.amazon.com/blogs/machine-learning/using-tensorflow-eager-execution-with-amazon-sagemaker-script-mode/) describing how TensorFlow eager execution is supported by the container.

## Setup

Before preparing the data, some initial setup is required. To train the image classification model on Amazon SageMaker, we need to set up and authenticate the use of AWS services. To begin with, we need an AWS account role with SageMaker access. Here we use the execution role that the current notebook instance was given when it was created. This role has the necessary permissions, including access to your data in S3.

In [2]:
import sagemaker
import os
from sagemaker import get_execution_role

role = get_execution_role()
print(role)
sess = sagemaker.Session()

arn:aws:iam::152804913371:role/sma-ms1-noauth-uw2-SageMakerExecutionRole-SDHI2Z7292TI


We also need to identify the S3 bucket that you want to use for providing training and validation datasets. It will be used to store the trained model artifacts as well. In this notebook, we use the default SageMaker bucket in your account. Alternatively, you could use any bucket you like. We use an object prefix to help organize the bucket content.

In [3]:
bucket = sess.default_bucket() # or use your own custom bucket name
s3_prefix = 'DEMO-TF-image-classification-birds'

# Data Preparation

This notebook assumes you have already downloaded and unpacked the dataset into your notebook instance as part of the first lab. If you have not already done so, please run the first notebook in this lab.

## Set some parameters for the rest of the notebook to use

Here we define a few parameters that help drive the rest of the notebook. For example, when `SAMPLE_ONLY` is `True`, the notebook trains on only a handful of species (`NUM_CLASSES` of them). In the cell below it is set to `False`, so the notebook works with the entire dataset of 200 bird species. This makes training a more difficult challenge, and you will need to tune parameters and run more epochs.

An `EXCLUDE_IMAGE_LIST` is defined as a mechanism to address any corrupt images from the dataset and ensure they do not disrupt the process.

In [4]:
import pandas as pd
import numpy as np
import random
from itertools import chain
from pathlib import Path
import os
from datetime import datetime

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

np.random.seed(12345)  # call the function; assigning to it would replace it without seeding
random.seed(12345)

BASE_PATH = Path("CUB_200_2011/")

if not BASE_PATH.exists():
    print("Looks like you've not downloaded the data yet. Please run 1_tf_image_classification_birds.ipynb first")
    
# To speed up training and experimenting, you can use a small handful of species.
# To see the full list of the classes available, look at the content of CLASSES_FILE.

SAMPLE_ONLY  = False # Let's try training on the entire dataset
NUM_CLASSES = 10 # specify the number of species to sample

df_classes = pd.read_csv((BASE_PATH / "classes.txt"), sep=" ", header=None, names=["class_num", "class_id"])

if SAMPLE_ONLY:
    df_classes = df_classes.sample(n=NUM_CLASSES)
SELECTED_CLASSES = df_classes["class_id"].values.tolist() 
IMAGE_FILES = list(chain(*[(BASE_PATH / f"images/{specie}/").glob("*.jpg") for specie in SELECTED_CLASSES]))
    

df_data = pd.DataFrame([img.as_posix() for img in IMAGE_FILES], columns=["file_name"])
df_data["class_id"] = df_data["file_name"].apply(lambda x: x.split("/")[-2])
# df_data = df_data.merge(df_classes, on="class_id")
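
The cell above does not show how `EXCLUDE_IMAGE_LIST` is applied. A minimal sketch of one way to do it, assuming the list holds image paths in the same form as `IMAGE_FILES`. The helper name `drop_excluded` and the file names below are hypothetical placeholders, not entries known to be corrupt.

```python
from pathlib import Path


def drop_excluded(image_files, exclude_list):
    """Return image_files without any path listed in exclude_list."""
    excluded = {Path(p).as_posix() for p in exclude_list}
    return [f for f in image_files if Path(f).as_posix() not in excluded]


# Hypothetical placeholder paths, not real entries from the dataset.
EXCLUDE_IMAGE_LIST = ["CUB_200_2011/images/001.Example_Species/bad_0001.jpg"]
image_files = [
    Path("CUB_200_2011/images/001.Example_Species/bad_0001.jpg"),
    Path("CUB_200_2011/images/001.Example_Species/good_0002.jpg"),
]

kept = drop_excluded(image_files, EXCLUDE_IMAGE_LIST)
print([f.name for f in kept])
```

Filtering before the train/val/test split keeps the excluded files out of every data channel.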

# Create train/val/test dataframes from our dataset

In [5]:
from sklearn.model_selection import train_test_split
train_size = 0.6
train_data, test_val_data = train_test_split(df_data, train_size=train_size, stratify=df_data["class_id"])

# split the testing and validation files into their respective sets
test_val_ratio = 0.6
test_data = test_val_data.sample(frac=test_val_ratio)
val_data = test_val_data[~test_val_data.index.isin(test_data.index)]
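
To see what these two ratios mean for the final dataset sizes, the arithmetic can be sketched as follows. The helper name `split_fractions` is introduced here for illustration and is not part of the notebook.

```python
def split_fractions(train_size, test_val_ratio):
    """Return the share of the full dataset that lands in each split."""
    remainder = 1.0 - train_size       # share left after taking the training set
    test = remainder * test_val_ratio  # share of the whole dataset used for test
    val = remainder - test             # whatever is left becomes validation
    return {"train": train_size, "test": test, "val": val}


fractions = split_fractions(train_size=0.6, test_val_ratio=0.6)
for name, share in fractions.items():
    print(f"{name}: {share:.0%}")
```

With the values used above, this works out to roughly 60% train, 24% test, and 16% validation. Note that the second split uses `DataFrame.sample`, which is not stratified, so rare classes may be divided unevenly between test and validation.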

# Prepare the data channels for Amazon SageMaker
When using Amazon SageMaker's managed training service, you need to provide the datasets to the training algorithm. This is primarily handled via populating S3 buckets, and by indicating the location of data channels such as train, test, and validation. You also need to consider the data format. In our case, to keep things simple, we will populate the data channels with folders containing the original JPG images organized by class folders.

In [6]:
# delete any existing data
!aws s3 rm s3://{bucket}/{s3_prefix}/data --recursive > /dev/null

In [7]:
train_s3_prefix = f"{s3_prefix}/data/train"
test_s3_prefix = f"{s3_prefix}/data/test"
val_s3_prefix = f"{s3_prefix}/data/val"
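
The upload step places each image under its class folder within the channel prefix. A small sketch of the resulting object key, assuming the upload call appends the file name to the key prefix it is given. The helper `s3_key_for` is hypothetical.

```python
from pathlib import PurePosixPath


def s3_key_for(local_path, channel_prefix):
    """Build the S3 object key for an image, keeping its class folder."""
    path = PurePosixPath(local_path)
    class_dir = path.parent.name  # e.g. "001.Black_footed_Albatross"
    return f"{channel_prefix}/{class_dir}/{path.name}"


key = s3_key_for(
    "CUB_200_2011/images/001.Black_footed_Albatross/Black_Footed_Albatross_0067_170.jpg",
    "DEMO-TF-image-classification-birds/data/train",
)
print(key)
```

Keeping the class name in the key means the label of every image can be recovered from its S3 path alone.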

In [8]:
# upload train, test, and validation data to s3 

from functools import partial
from concurrent.futures import ThreadPoolExecutor

upload_check = []
upload_check.append(len(sess.list_s3_files(bucket, train_s3_prefix)) == train_data.shape[0])
upload_check.append(len(sess.list_s3_files(bucket, test_s3_prefix)) == test_data.shape[0])
upload_check.append(len(sess.list_s3_files(bucket, val_s3_prefix)) == val_data.shape[0])

def _upload_data(path, bucket, prefix, sagemaker_session):
    
    class_dir = path.split("/")[-2]
    prefix = f"{prefix}/{class_dir}"
    
    return sagemaker_session.upload_data(path, bucket, prefix)
    

if all(upload_check):
    print("Data has already been uploaded")
else:
    if input("Do you want to delete any existing data before uploading?") == "yes":
        print("Deleting existing data")
        !aws s3 rm s3://{bucket}/{s3_prefix}/data --recursive > /dev/null
    print("uploading data")
    for dataset, prefix in zip([train_data, test_data, val_data], [train_s3_prefix, test_s3_prefix, val_s3_prefix]):

        upload_data = partial(_upload_data, bucket=bucket, prefix=prefix, sagemaker_session=sess)

        with ThreadPoolExecutor(max_workers=16) as executor:
            # wrap in list() so the results are consumed and any upload exception is raised here
            list(executor.map(upload_data, dataset["file_name"].values))
print("data has been uploaded")

Do you want to delete any existing data before uploading? yes


Deleting existing data
uploading data
data has been uploaded
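
Exceptions raised inside thread pool workers only surface when their results are read, so it is worth checking each result explicitly. A minimal sketch using `submit` and inspecting every future, where `fake_upload` is a stand-in for the real upload call:

```python
from concurrent.futures import ThreadPoolExecutor


def fake_upload(path):
    """Stand-in for an upload call; fails for one specific path."""
    if path.endswith("broken.jpg"):
        raise IOError(f"could not upload {path}")
    return f"s3://example-bucket/{path}"


paths = ["a.jpg", "broken.jpg", "c.jpg"]

with ThreadPoolExecutor(max_workers=4) as executor:
    # Map each future back to the path it was created for
    futures = {executor.submit(fake_upload, p): p for p in paths}

uploaded, failed = [], []
for future, path in futures.items():
    try:
        uploaded.append(future.result())  # re-raises any worker exception
    except IOError as err:
        failed.append((path, str(err)))

print(f"{len(uploaded)} uploaded, {len(failed)} failed")
```

Collecting failures per file makes it possible to retry only the uploads that did not succeed.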


# Train the model 
When using SageMaker's TensorFlow container, the custom TensorFlow training code is provided via a Python script in a separate file that gets passed to SageMaker. For our example, that script is shown below for completeness in the notebook. Study that code before proceeding to the actual training. Pay attention to any differences from the code you used in the first lab when training directly in the notebook:

- Command line arguments are defined using the `argparse` library. We can pass these as script parameters when we launch a SageMaker training job.
- `inference.py` and `requirements.txt` were added to the code directory for use with TensorFlow Serving at inference time.
- Different approach to model saving to be compatible with SageMaker's use of TensorFlow Serving.
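
As a rough sketch of the first point, a script-mode entry point typically parses hyperparameters with `argparse` and reads SageMaker's `SM_MODEL_DIR` environment variable for the model output location. The argument names mirror the hyperparameters used in this notebook. The defaults, the `--sm-model-dir` flag, and the function name `parse_args` are assumptions, and the actual `train-mobilenet.py` may differ.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters passed to the script as --name value pairs."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=5)
    parser.add_argument("--dropout", type=float, default=0.4)
    parser.add_argument("--num_fully_connected_layers", type=int, default=1)
    parser.add_argument("--num_unit_per_layer", type=int, default=256)
    # SageMaker sets SM_MODEL_DIR inside the training container;
    # the fallback path is only for running this sketch locally.
    parser.add_argument("--sm-model-dir", type=str,
                        default=os.environ.get("SM_MODEL_DIR", "/tmp/model"))
    # parse_known_args tolerates extra arguments this sketch does not declare
    args, _unknown = parser.parse_known_args(argv)
    return args


args = parse_args(["--epochs", "10", "--dropout", "0.25"])
print(args.epochs, args.dropout, args.num_unit_per_layer)
```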

In [10]:
!pygmentize "code/train-mobilenet.py" | cat -n

     1  # Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
     2  #
     3  # Licensed under the Apache License, Version 2.0 (the "License"). You
     4  # may not use this file except in compliance with the License. A copy of
     5  # the License is located at
     6  #
     7  #     http://aws.amazon.com/apache2.0/
     8  #
     9  # or in the "license" file accompanying this file. This file is
    10  # distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF
    11  # ANY KIND, either express or implied. See the License for the specific
    12  # language governing permissions and limitations under the License.
    13
    14  import tensorflow as tf
    15  from tensorfl

In [11]:
!pygmentize code/inference.py | cat -n

     1  # Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
     2  #
     3  # Licensed under the Apache License, Version 2.0 (the "License"). You
     4  # may not use this file except in compliance with the License. A copy of
     5  # the License is located at
     6  #
     7  #     http://aws.amazon.com/apache2.0/
     8  #
     9  # or in the "license" file accompanying this file. This file is
    10  # distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF
    11  # ANY KIND, either express or implied. See the License for the specific
    12  # language governing permissions and limitations under the License.
    13
    14  from io import BytesIO
    15  import json
    16

The `requirements.txt` file in the code directory lists the packages that will be pip installed in your endpoint before the `input_handler` from your `inference.py` script is called.
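
For orientation, the SageMaker TensorFlow Serving container looks for an `input_handler` and an `output_handler` in `inference.py`. The sketch below shows only the general shape of those two functions. It base64-encodes the image bytes for TensorFlow Serving rather than converting them to a NumPy array, so it is not the script used in this lab, and the `SimpleNamespace` objects merely imitate the container's `context` and `response` arguments for a local dry run.

```python
import base64
import io
import json
from types import SimpleNamespace


def input_handler(data, context):
    """Turn the raw request body into a TensorFlow Serving JSON request."""
    if context.request_content_type == "application/x-image":
        encoded = base64.b64encode(data.read()).decode("utf-8")
        return json.dumps({"instances": [{"b64": encoded}]})
    raise ValueError(f"Unsupported content type: {context.request_content_type}")


def output_handler(response, context):
    """Pass the TensorFlow Serving response back to the caller."""
    if response.status_code != 200:
        raise ValueError(response.content.decode("utf-8"))
    return response.content, context.accept_header


# Local dry run with stand-ins for the objects the container would supply.
fake_context = SimpleNamespace(request_content_type="application/x-image",
                               accept_header="application/json")
request_body = input_handler(io.BytesIO(b"\xff\xd8\xff"), fake_context)
fake_response = SimpleNamespace(status_code=200, content=b'{"predictions": []}')
body, content_type = output_handler(fake_response, fake_context)
print(request_body, content_type)
```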

## Create the SageMaker training job using the TensorFlow container
Here we establish the TensorFlow estimator object. Metric definitions are provided so that you can visualize metrics from the SageMaker console as well as from CloudWatch. These same metrics can be used when optimizing your model with automatic model tuning.
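
To illustrate how a metric definition works: SageMaker applies each `Regex` to the training job's log output and records the captured group as the metric value. A quick local check of the patterns used in this notebook, run against made-up log lines (the real script's output format may differ):

```python
import re

metric_definitions = [
    {"Name": "validation:acc", "Regex": "validation_accuracy: (.*$)"},
    {"Name": "validation:loss", "Regex": "validation_loss: (.*$)"},
]

# Hypothetical log lines for illustration only.
log_lines = ["validation_loss: 1.2345", "validation_accuracy: 0.8125"]

captured = {}
for line in log_lines:
    for definition in metric_definitions:
        match = re.search(definition["Regex"], line)
        if match:
            captured[definition["Name"]] = float(match.group(1))

print(captured)
```

Because `(.*$)` captures everything up to the end of the line, the training script should print nothing after the number on those lines.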

### A note about instance types and account limits
You may find yourself running into an error like this:

````
ResourceLimitExceeded: An error occurred (ResourceLimitExceeded) when calling the CreateTrainingJob operation: The account-level service limit 'ml.p3.2xlarge for training job usage' is 0 Instances, with current utilization of 0 Instances and a request delta of 1 Instances. Please contact AWS support to request an increase for this limit.
````
To protect customers from unexpected bills for more powerful and more expensive instances, accounts are established by default with limited access to certain instance types. These are soft limits that can be raised by contacting AWS support. This lab defaults to a powerful GPU instance type, but you can run it on a lower-powered instance type. In that case you will pay less, but your training jobs will take longer. For training this model, a smaller `ml.g4dn.xlarge` instance should be sufficient.

In [9]:
%pip install sagemaker-experiments -Uqq

  from cryptography.utils import int_from_bytes
Note: you may need to restart the kernel to use updated packages.


In [10]:
from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
from smexperiments.trial_component import TrialComponent

In [11]:
bird_classification_experiment = Experiment.create(
    experiment_name=f"bird-classification-{random.randint(0,1000)}",
    description="Classification of bird species", 
    sagemaker_boto_client=sess.sagemaker_client)
print(bird_classification_experiment)

Experiment(sagemaker_boto_client=<botocore.client.SageMaker object at 0x7f883d076850>,experiment_name='bird-classification-426',description='Classification of bird species',tags=None,experiment_arn='arn:aws:sagemaker:us-west-2:152804913371:experiment/bird-classification-426',response_metadata={'RequestId': '4094af7d-6051-4d5f-9cc5-ad654436f1cc', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '4094af7d-6051-4d5f-9cc5-ad654436f1cc', 'content-type': 'application/x-amz-json-1.1', 'content-length': '95', 'date': 'Fri, 14 Jan 2022 20:29:18 GMT'}, 'RetryAttempts': 1})


In [12]:
trial_name = f"bird-classification-mobilenet-{random.randint(0,1000)}"
trial = Trial.create(
        trial_name=trial_name,
        experiment_name=bird_classification_experiment.experiment_name,
        sagemaker_boto_client=sess.sagemaker_client,
    )

In [13]:
from sagemaker.tensorflow import TensorFlow

TF_FRAMEWORK_VERSION = '2.4'

hyperparameters = {'epochs': 5, 
                   'dropout': 0.4,
                   'num_fully_connected_layers': 1,
                   'num_unit_per_layer': 256,
                   'debug': None}

metric_definitions=[{'Name' : 'validation:acc', 
                     'Regex': 'validation_accuracy: (.*$)'},
                    {'Name' : 'validation:loss', 
                     'Regex': 'validation_loss: (.*$)'}
                   ]

estimator = TensorFlow(entry_point='train-mobilenet.py',
                       source_dir='code',
                       instance_type="ml.g4dn.xlarge",  # SageMaker SDK v2 name (was train_instance_type)
                       instance_count=1,                # SageMaker SDK v2 name (was train_instance_count)
                       hyperparameters=hyperparameters,
                       metric_definitions=metric_definitions,
                       role=sagemaker.get_execution_role(),
                       framework_version=TF_FRAMEWORK_VERSION, 
                       py_version='py37',
                       base_job_name="bird-classification"
                      )

## To experiment with the use of Spot instances for SageMaker training, add this set of parameters to your
## call above when creating the TensorFlow estimator object:
##
######    use_spot_instances=True, max_run=2*60*60, max_wait=3*60*60,

See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
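
If you try Spot training, note that in version 2 of the SageMaker Python SDK the corresponding estimator arguments are `use_spot_instances`, `max_run`, and `max_wait`, and that `max_wait` must be at least as large as `max_run`. A small sketch that builds and validates those keyword arguments, where the helper `spot_training_kwargs` is hypothetical:

```python
def spot_training_kwargs(max_run_hours, max_wait_hours):
    """Build estimator keyword arguments for managed Spot training."""
    max_run = int(max_run_hours * 60 * 60)    # maximum training time in seconds
    max_wait = int(max_wait_hours * 60 * 60)  # training time plus time waiting for capacity
    if max_wait < max_run:
        raise ValueError("max_wait must be greater than or equal to max_run")
    return {"use_spot_instances": True, "max_run": max_run, "max_wait": max_wait}


spot_kwargs = spot_training_kwargs(max_run_hours=2, max_wait_hours=3)
print(spot_kwargs)
```

The resulting dictionary could then be unpacked into the estimator call, for example `TensorFlow(..., **spot_kwargs)`.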


Here we establish pointers to where each data channel is located on S3.

In [14]:
inputs = {'train':f"s3://{bucket}/{train_s3_prefix}/", 'test': f"s3://{bucket}/{test_s3_prefix}/"}
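
Inside the training container, each key of this dictionary becomes a data channel. As a sketch of the convention (the helper `channel_locations` is illustrative only, and the bucket name is a placeholder), a channel named `train` is downloaded to `/opt/ml/input/data/train`, and that path is exposed in the `SM_CHANNEL_TRAIN` environment variable:

```python
def channel_locations(inputs):
    """Map channel names to the env var and local path used in the container."""
    return {
        name: {
            "env_var": f"SM_CHANNEL_{name.upper()}",
            "local_path": f"/opt/ml/input/data/{name}",
            "s3_uri": uri,
        }
        for name, uri in inputs.items()
    }


# Placeholder bucket name for illustration.
example_inputs = {
    "train": "s3://example-bucket/DEMO-TF-image-classification-birds/data/train/",
    "test": "s3://example-bucket/DEMO-TF-image-classification-birds/data/test/",
}
locations = channel_locations(example_inputs)
for name, info in locations.items():
    print(name, info["env_var"], info["local_path"])
```

Only the `train` and `test` channels are passed to the training job here; the `val` prefix is used for inference.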

Here we tell the estimator to fit the model. A separate training instance is launched, and training is performed on that instance.

In [15]:
estimator.fit(inputs, 
              experiment_config={
            "TrialName": trial_name,
            "TrialComponentDisplayName": "Training",
        }
             ) 

INFO:sagemaker.image_uris:Defaulting to the only supported framework/algorithm version: latest.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.
INFO:sagemaker:Creating training-job with name: bird-classification-2022-01-14-20-29-40-383


2022-01-14 20:29:41 Starting - Starting the training job...
2022-01-14 20:30:08 Starting - Launching requested ML instancesProfilerReport-1642192180: InProgress
......
2022-01-14 20:31:08 Starting - Preparing the instances for training......
2022-01-14 20:32:12 Downloading - Downloading input data...............
2022-01-14 20:34:29 Training - Downloading the training image..............
2022-01-14 20:36:50.455118: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.
2022-01-14 20:36:50.458817: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:105] SageMaker Profiler is not enabled. The timeline writer thread will not be started, future recorded events will be dropped.
2022-01-14 20:36:50.522118: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2022-01-14 20:36:50.592837: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:46

In [16]:
print('Completed training job: {}'.format(estimator.latest_training_job.name))

Completed training job: bird-classification-2022-01-14-20-29-40-383


In [17]:
from sagemaker.analytics import ExperimentAnalytics
trial_component_analytics = ExperimentAnalytics(
    sagemaker_session=sess,
    experiment_name=bird_classification_experiment.experiment_name
)
analytic_table = trial_component_analytics.dataframe()
analytic_table

|  | TrialComponentName | DisplayName | SourceArn | SageMaker.ImageUri | SageMaker.InstanceCount | SageMaker.InstanceType | SageMaker.VolumeSizeInGB | debug | dropout | epochs | ... | test - MediaType | test - Value | train - MediaType | train - Value | SageMaker.DebugHookOutput - MediaType | SageMaker.DebugHookOutput - Value | SageMaker.ModelArtifact - MediaType | SageMaker.ModelArtifact - Value | Trials | Experiments |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | bird-classification-2022-01-14-20-29-40-383-aw... | Training | arn:aws:sagemaker:us-west-2:152804913371:train... | 763104351884.dkr.ecr.us-west-2.amazonaws.com/t... | 1.0 | ml.g4dn.xlarge | 30.0 |  | 0.4 | 5.0 | ... |  | s3://sagemaker-us-west-2-152804913371/DEMO-TF-... |  | s3://sagemaker-us-west-2-152804913371/DEMO-TF-... |  | s3://sagemaker-us-west-2-152804913371/ |  | s3://sagemaker-us-west-2-152804913371/bird-cla... | [bird-classification-mobilenet-839] | [bird-classification-426] |


# Host the model

Here we deploy the model to a SageMaker endpoint.

If you are iterating on changes to your `inference.py` script, re-create the SageMaker model object with the latest version of the script and deploy the endpoint from that model. This avoids having to run a new training job (`estimator.fit(inputs)`) just to pick up the latest copy of the script.

Otherwise, simply deploy the model directly from the original estimator. You can use this approach (`estimator.deploy()`) once your `inference.py` code is stable. The deploy method will automatically create a SageMaker Model object on your behalf before creating your endpoint.

In [18]:
from sagemaker.tensorflow import TensorFlowModel  # SageMaker SDK v2 import path

# here we'll build a new model object rather than invoking estimator.deploy
# This will give us more flexibility to iterate on inference.py

tf_model = TensorFlowModel(model_data=estimator.model_data,
                           role=role, 
                           source_dir="code", 
                           entry_point="inference.py",                        
                           framework_version=TF_FRAMEWORK_VERSION
               )
predictor = tf_model.deploy(initial_instance_count=1, 
                           instance_type="ml.m5.xlarge"
                       )

See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
INFO:sagemaker:Creating model with name: tensorflow-inference-2022-01-14-20-41-24-358
INFO:sagemaker:Creating endpoint with name tensorflow-inference-2022-01-14-20-41-24-785


-----!

In [19]:
# identity serializer will allow us to send images as bytes and have inference.py handle conversion to numpy
from sagemaker.serializers import IdentitySerializer
predictor.serializer = IdentitySerializer(content_type="application/x-image")

In [20]:
# Let's test out the endpoint on a single random image from the validation data set
sample_val_image = val_data.sample(1)["file_name"].values[0]
with open(sample_val_image, "rb") as f:
    sample_image_data = f.read()

prediction = predictor.predict(sample_image_data)

print(f"Inference endpoint returned a payload with {list(prediction.keys())} for {sample_val_image}")
print(f"The prediction is {prediction['predicted_class']} with a probability of {np.array(prediction['probabilities']).max():.2%}")

Inference endpoint returned a payload with ['predicted_class', 'class_labels', 'probabilities'] for CUB_200_2011/images/001.Black_footed_Albatross/Black_Footed_Albatross_0067_170.jpg
The prediction is 139.Scarlet_Tanager with a probability of 7.79%
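
A single top prediction hides how uncertain the model is. Assuming `probabilities` is a list aligned with `class_labels` (possibly nested one level, as a batch of one), a sketch for listing the top candidates could look like this. The helper `top_k` and the payload below are invented for illustration.

```python
def top_k(prediction, k=3):
    """Return the k most probable (label, probability) pairs."""
    probabilities = prediction["probabilities"]
    # Unwrap a batch of one, e.g. [[0.1, 0.9]] -> [0.1, 0.9]
    if probabilities and isinstance(probabilities[0], list):
        probabilities = probabilities[0]
    ranked = sorted(zip(prediction["class_labels"], probabilities),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Invented payload with the same keys the endpoint returns.
example = {
    "predicted_class": "139.Scarlet_Tanager",
    "class_labels": ["001.Black_footed_Albatross", "139.Scarlet_Tanager", "017.Cardinal"],
    "probabilities": [[0.05, 0.70, 0.25]],
}
for label, probability in top_k(example, k=2):
    print(f"{label}: {probability:.2%}")
```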


## Batch Transform
As an alternative to a real-time endpoint, a trained model can be used in a Batch Transform job. Here a Batch Transform job runs inference on the validation images. We simply instantiate a Transformer object and pass in the location of the data. In this case, Batch Transform writes one JSON output file per image.

In [21]:
transformer = tf_model.transformer(instance_count=1, 
                                   instance_type="ml.g4dn.xlarge", 
                                   accept="application/json", 
                                   max_payload=1, 
                                   max_concurrent_transforms=10
                              )

INFO:sagemaker:Creating model with name: tensorflow-inference-2022-01-14-20-44-14-708


In [22]:
transformer.transform(f"s3://{bucket}/{val_s3_prefix}/", content_type="application/x-image", wait=True, logs=False) # BT produces a lot of logs which can overwhelm the notebook output

INFO:sagemaker:Creating transform job with name: tensorflow-inference-2022-01-14-20-44-15-144


......................................................................................................!


In [23]:
*_, transformer_output_bucket, transformer_output_key = transformer.output_path.split("/")
# Let's print out the S3 paths for the first 5 output files
batch_output_files = sess.list_s3_files(transformer_output_bucket, transformer_output_key)
print("First 5 batch transform output files:\n", "\n".join(batch_output_files[:5]))

First 5 batch transform output files:
 tensorflow-inference-2022-01-14-20-44-15-144/001.Black_footed_Albatross/Black_Footed_Albatross_0001_796111.jpg.out
tensorflow-inference-2022-01-14-20-44-15-144/001.Black_footed_Albatross/Black_Footed_Albatross_0007_796138.jpg.out
tensorflow-inference-2022-01-14-20-44-15-144/001.Black_footed_Albatross/Black_Footed_Albatross_0017_796098.jpg.out
tensorflow-inference-2022-01-14-20-44-15-144/001.Black_footed_Albatross/Black_Footed_Albatross_0025_796057.jpg.out
tensorflow-inference-2022-01-14-20-44-15-144/001.Black_footed_Albatross/Black_Footed_Albatross_0035_796140.jpg.out


In [24]:
# Check what the output file looks like 
print(sess.read_s3_file(transformer_output_bucket, batch_output_files[0])) # same output as what we've seen with the realtime endpoint

{"predicted_class": "088.Western_Meadowlark", "class_labels": ["120.Fox_Sparrow", "127.Savannah_Sparrow", "169.Magnolia_Warbler", "125.Lincoln_Sparrow", "035.Purple_Finch", "063.Ivory_Gull", "061.Heermann_Gull", "092.Nighthawk", "163.Cape_May_Warbler", "038.Great_Crested_Flycatcher", "126.Nelson_Sharp_tailed_Sparrow", "107.Common_Raven", "066.Western_Gull", "124.Le_Conte_Sparrow", "142.Black_Tern", "134.Cape_Glossy_Starling", "102.Western_Wood_Pewee", "005.Crested_Auklet", "145.Elegant_Tern", "053.Western_Grebe", "146.Forsters_Tern", "012.Yellow_headed_Blackbird", "078.Gray_Kingbird", "026.Bronzed_Cowbird", "164.Cerulean_Warbler", "173.Orange_crowned_Warbler", "001.Black_footed_Albatross", "056.Pine_Grosbeak", "036.Northern_Flicker", "019.Gray_Catbird", "071.Long_tailed_Jaeger", "171.Myrtle_Warbler", "193.Bewick_Wren", "170.Mourning_Warbler", "172.Nashville_Warbler", "040.Olive_sided_Flycatcher", "135.Bank_Swallow", "016.Painted_Bunting", "017.Cardinal", "167.Hooded_Warbler", "003.Soot
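
Because each output key keeps the class folder of its input image, batch results can be scored without a separate label file. A sketch, assuming each `.out` file contains JSON with a `predicted_class` field as shown above and that the folder name is the true label. The helper `batch_accuracy` and the two records below are made up.

```python
import json


def batch_accuracy(outputs):
    """Compute accuracy from (s3_key, json_body) pairs of batch transform output."""
    correct = 0
    for key, body in outputs:
        true_class = key.split("/")[-2]  # class folder in the output key
        predicted = json.loads(body)["predicted_class"]
        correct += int(predicted == true_class)
    return correct / len(outputs) if outputs else 0.0


# Made-up records in the same shape as the batch transform output.
outputs = [
    ("job/001.Black_footed_Albatross/img_1.jpg.out",
     json.dumps({"predicted_class": "001.Black_footed_Albatross"})),
    ("job/001.Black_footed_Albatross/img_2.jpg.out",
     json.dumps({"predicted_class": "088.Western_Meadowlark"})),
]
print(f"accuracy: {batch_accuracy(outputs):.0%}")
```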

# Clean up
Finally, and importantly, to avoid being billed for an idle endpoint, here we delete the SageMaker endpoint.

In [25]:
predictor.delete_endpoint()

INFO:sagemaker:Deleting endpoint configuration with name: tensorflow-inference-2022-01-14-20-41-24-785
INFO:sagemaker:Deleting endpoint with name: tensorflow-inference-2022-01-14-20-41-24-785
