# TensorFlow script mode training with SageMaker, and serving with AWS Lambda

Script mode is a training script format for TensorFlow that lets you execute any TensorFlow training script in SageMaker with minimal modification. The [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk) handles transferring your script to a SageMaker training instance. On the training instance, SageMaker's native TensorFlow support sets up training-related environment variables and executes your training script. In this tutorial, we use the SageMaker Python SDK to launch a training job and deploy the trained model.

Script mode supports training with a Python script, a Python module, or a shell script. In this example, we use a Python script to train a classification model on the [MNIST dataset](http://yann.lecun.com/exdb/mnist/). Script mode works with both TensorFlow 1.x and TensorFlow 2.x scripts; this notebook trains with a TensorFlow 2.3.1 script using the SageMaker Python SDK. In addition, this notebook demonstrates how to perform real-time inference with an AWS Lambda function.


# Set up the environment

Let's start by setting up the environment:

In [None]:
import os
import sagemaker
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()

role = get_execution_role()
region = sagemaker_session.boto_session.region_name

## Training Data

The MNIST dataset has been loaded to the public S3 bucket ``sagemaker-sample-data-<REGION>`` under the prefix ``tensorflow/mnist``. There are four ``.npy`` files under this prefix:
* ``train_data.npy``
* ``eval_data.npy``
* ``train_labels.npy``
* ``eval_labels.npy``

In [None]:
training_data_uri = 's3://sagemaker-sample-data-{}/tensorflow/mnist'.format(region)
print(training_data_uri)

# Construct a script for distributed training

This tutorial's training script was adapted from TensorFlow's official [CNN MNIST example](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/tutorials/layers/cnn_mnist.py). We have modified it to handle the ``model_dir`` parameter passed in by SageMaker. This is an S3 path which can be used for data sharing during distributed training and checkpointing and/or model persistence. We have also added an argument-parsing function to handle processing training-related variables.

At the end of the training job we have added a step to export the trained model to the path stored in the environment variable ``SM_MODEL_DIR``, which always points to ``/opt/ml/model``. This is critical because SageMaker uploads all the model artifacts in this folder to S3 at the end of training.
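
To make the argument handling described above concrete, here is a minimal sketch of an argument-parsing function for a script mode entry point. It is not the tutorial's script: the function name `parse_args` and the `--sm-model-dir` and `--train` options are hypothetical choices for illustration. Only `--model_dir`, `SM_MODEL_DIR`, and `SM_CHANNEL_TRAINING` come from the text.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse arguments for a script mode entry point (illustrative sketch)."""
    parser = argparse.ArgumentParser()
    # Passed by SageMaker as --model_dir; an S3 path in this tutorial.
    parser.add_argument("--model_dir", type=str, default=None)
    # Local folder whose contents SageMaker uploads to S3 after training.
    # The option name is hypothetical; the default comes from SM_MODEL_DIR.
    parser.add_argument(
        "--sm-model-dir",
        type=str,
        default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"),
    )
    # Local folder holding the 'training' channel data (hypothetical option).
    parser.add_argument(
        "--train", type=str, default=os.environ.get("SM_CHANNEL_TRAINING")
    )
    # parse_known_args tolerates extra arguments the script does not declare.
    args, _unknown = parser.parse_known_args(argv)
    return args
```

Reading the environment variables as defaults keeps the script runnable outside SageMaker, where the same values can be supplied on the command line. The tutorial's actual script may structure this differently.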

Here is the entire script:

In [None]:
# TensorFlow 2.3.1 script
!pygmentize 'mnist-2.py'

# Create a training job using the `TensorFlow` estimator

The `sagemaker.tensorflow.TensorFlow` estimator handles locating the script mode container, uploading your script to an S3 location, and creating a SageMaker training job. Let's call out a couple of important parameters here:

* `py_version` selects the Python version of the training container (`'py37'` in the code below). A Python 3 value indicates that we are using script mode, since legacy mode supports only Python 2. Though Python 2 will be deprecated soon, you can use script mode with Python 2 by setting `py_version` to `'py2'` and `script_mode` to `True`.

* `distribution` is used to configure the distributed training setup. It's required only if you are doing distributed training, either across a cluster of instances or across multiple GPUs. Here we are using parameter servers as the distributed training schema. SageMaker training jobs run on homogeneous clusters. To make parameter servers more performant in the SageMaker setup, we run a parameter server on every instance in the cluster, so there is no need to specify the number of parameter servers to launch. Script mode also supports distributed training with [Horovod](https://github.com/horovod/horovod). You can find the full documentation on how to configure `distribution` [here](https://github.com/aws/sagemaker-python-sdk/tree/master/src/sagemaker/tensorflow#distributed-training).



To train with a TensorFlow 2.3 script, instantiate the estimator with the script name and ``framework_version`` set accordingly:

In [None]:
from sagemaker.tensorflow import TensorFlow

mnist_estimator2 = TensorFlow(entry_point='mnist-2.py',
                             role=role,
                             instance_count=2,
                             instance_type='ml.p3.2xlarge',
                             framework_version='2.3.1',
                             py_version='py37',
                             distribution={'parameter_server': {'enabled': True}})

## Calling ``fit``

To start a training job, we call `estimator.fit(training_data_uri)`.

An S3 location is used here as the input. `fit` creates a default channel named `'training'`, which points to this S3 location. In the training script we can then access the training data from the location stored in `SM_CHANNEL_TRAINING`. `fit` accepts a couple other types of input as well. See the API doc [here](https://sagemaker.readthedocs.io/en/stable/estimators.html#sagemaker.estimator.EstimatorBase.fit) for details.
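
As a small illustration of the channel naming described above, the sketch below looks up the local directory for a channel inside the training container. It assumes the `SM_CHANNEL_<NAME>` convention with the channel name upper-cased; `channel_dir` is a hypothetical helper, not part of the SageMaker SDK.

```python
import os


def channel_dir(channel_name, environ=None):
    """Return the local data directory for an input channel, or None if unset.

    Assumes the variable is named SM_CHANNEL_<CHANNEL_NAME in upper case>.
    """
    environ = os.environ if environ is None else environ
    return environ.get("SM_CHANNEL_" + channel_name.upper())
```

Inside a training job started with `fit(training_data_uri)`, `channel_dir('training')` would be expected to return the folder that holds the four `.npy` files. Outside SageMaker it returns `None`, which a script can use to fall back to a local path.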

When training starts, the TensorFlow container executes `mnist-2.py`, passing `hyperparameters` and `model_dir` from the estimator as script arguments. Because we didn't define either in this example, no hyperparameters are passed, and `model_dir` defaults to `s3://<DEFAULT_BUCKET>/<TRAINING_JOB_NAME>`, so the script execution is as follows:
```bash
python mnist-2.py --model_dir s3://<DEFAULT_BUCKET>/<TRAINING_JOB_NAME>
```
When training is complete, the training job will upload the saved model for TensorFlow serving.
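
The way estimator settings turn into script arguments can be sketched roughly as follows. This is an illustration only: `build_script_args` is a hypothetical function, and the real container may order or serialize the arguments differently.

```python
def build_script_args(hyperparameters, model_dir):
    """Approximate the argument list handed to the training script.

    Each hyperparameter becomes a `--name value` pair (an assumption about
    the serialization), followed by the `--model_dir` argument.
    """
    args = []
    for name, value in sorted(hyperparameters.items()):
        args.extend(["--" + name, str(value)])
    args.extend(["--model_dir", model_dir])
    return args


# With no hyperparameters, only --model_dir is passed, as in this tutorial.
print(build_script_args({}, "s3://<DEFAULT_BUCKET>/<TRAINING_JOB_NAME>"))
```

If the estimator were given a `hyperparameters` dictionary such as `{'epochs': 2}`, the script would need a matching `--epochs` argument in its parser to receive the value.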

Call `fit` to train a model with the TensorFlow 2.3 script:

In [None]:
mnist_estimator2.fit(training_data_uri)

# Deploy the trained model to an AWS Lambda

The next step is to deploy the model to AWS Lambda for serverless inference and prepare a test event.

This is the S3 location of the model file created by the training job:

In [None]:
mnist_estimator2.model_data

After a TensorFlow estimator has been fit, it saves the model to the S3 location in [SavedModel format](https://www.tensorflow.org/guide/saved_model).
Download the model created by the training job to build the Docker image used by the Lambda function.

In [None]:
!aws s3 cp $mnist_estimator2.model_data ./container/model/

Extract `model.tar.gz` so you can inspect the details of the model inputs and outputs.

In [None]:
!tar -xzf ./container/model/model.tar.gz

The output of the following `saved_model_cli` command shows the details of the model inputs and outputs.

Note the `serving_default` SignatureDef and the `dense_1` output in the SavedModel. Both will be used later in the Lambda function for the inference code.

In [None]:
!saved_model_cli show --all --dir ./000000001/

## Building and registering the container

The following shell code shows how to build the container image using `docker build` and push the container image to ECR using `docker push`. 

In [None]:
%%sh

# The name of our lambda function
lambda_function_name=tensorflow-mnist-inference-docker-lambda

cd container

account=$(aws sts get-caller-identity --query Account --output text)

# Get the region defined in the current configuration (default to us-east-1 if none defined)
region=$(aws configure get region)
region=${region:-us-east-1}

fullname="${account}.dkr.ecr.${region}.amazonaws.com/${lambda_function_name}:latest"

# If the repository doesn't exist in ECR, create it.

aws ecr describe-repositories --repository-names "${lambda_function_name}" > /dev/null 2>&1

if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name "${lambda_function_name}" > /dev/null
fi

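# Log Docker in to ECR. `aws ecr get-login` was removed in AWS CLI v2;
# `get-login-password` works with AWS CLI v2 and recent v1 releases.
aws ecr get-login-password --region ${region} | docker login --username AWS --password-stdin "${account}.dkr.ecr.${region}.amazonaws.com"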

# Build the docker image locally with the image name and then push it to ECR
# with the full name.

docker build  -t ${lambda_function_name} .
docker tag ${lambda_function_name} ${fullname}

docker push ${fullname}

This is the URI of the Docker image in ECR

In [None]:
import boto3

client = boto3.client('sts')
account_id = client.get_caller_identity()['Account']

my_session = boto3.session.Session()
region = my_session.region_name

lambda_function_name = 'tensorflow-mnist-inference-docker-lambda'

ecr_image = '{}.dkr.ecr.{}.amazonaws.com/{}:latest'.format(account_id, region, lambda_function_name)

print(ecr_image)

## Create AWS Lambda IAM Role

In [None]:
iam = boto3.Session().client(service_name='iam', region_name=region)

In [None]:
iam_lambda_role_name = 'TensorFlow_MNIST_Lambda'

In [None]:
iam_lambda_role_passed = False

In [None]:
assume_role_policy_doc = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "lambda.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}

In [None]:
import time
import json

from botocore.exceptions import ClientError

try:
    iam_role_lambda = iam.create_role(
        RoleName=iam_lambda_role_name,
        AssumeRolePolicyDocument=json.dumps(assume_role_policy_doc),
        Description='TensorFlow MNIST Lambda Role'
    )
    print('Role successfully created.')
    iam_lambda_role_passed = True
except ClientError as e:
    if e.response['Error']['Code'] == 'EntityAlreadyExists':
        iam_role_lambda = iam.get_role(RoleName=iam_lambda_role_name)
        print('Role already exists. This is OK.')
        iam_lambda_role_passed = True
    else:
        print('Unexpected error: %s' % e)
        
# Pause so the new role has time to propagate before it is used.
time.sleep(30)

In [None]:
iam_role_lambda_name = iam_role_lambda['Role']['RoleName']
print('Role Name: {}'.format(iam_role_lambda_name))

In [None]:
iam_role_lambda_arn = iam_role_lambda['Role']['Arn']
print('Role ARN: {}'.format(iam_role_lambda_arn))

## Create AWS Lambda IAM Policy

In [None]:
lambda_policy_doc = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "UseLambdaFunction",
            "Effect": "Allow",
            "Action": [
                "lambda:InvokeFunction",
                "lambda:GetFunctionConfiguration"
            ],
            "Resource": "arn:aws:lambda:{}:{}:function:*".format(region, account_id)
        },
        {
            "Effect": "Allow",
            "Action": "cloudwatch:*",
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": "logs:CreateLogGroup",
            "Resource": "arn:aws:logs:{}:{}:*".format(region, account_id)
        },
        {
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": "arn:aws:logs:{}:{}:log-group:/aws/lambda/*".format(region, account_id)
        },
        {
            "Effect": "Allow",
            "Action": "s3:*",
            "Resource": "*"
        }
    ]
}

In [None]:
print(json.dumps(lambda_policy_doc, indent=4, sort_keys=True, default=str))

In [None]:
import time

response = iam.put_role_policy(
    RoleName=iam_role_lambda_name,
    PolicyName='TensorFlow_MNIST_Lambda_Policy',
    PolicyDocument=json.dumps(lambda_policy_doc)
)

time.sleep(30)

## Create The Lambda Function

In [None]:
import time
client = boto3.client('lambda')

try: 
    response = client.create_function(
        FunctionName=lambda_function_name,
        Role=iam_role_lambda_arn,
        Code={
            'ImageUri': ecr_image
        },
        PackageType='Image',
        Timeout=120,
        MemorySize=1536,
    )
    print('Creating Lambda Function {}. Please wait while it is being created.'.format(lambda_function_name))
    time.sleep(90)
    print('Lambda Function {} successfully created.'.format(lambda_function_name))
except ClientError as e:
    if e.response['Error']['Code'] == 'ResourceConflictException':
        print('Lambda Function {} already exists. This is OK.'.format(lambda_function_name))
    else:
        print('Error: {}'.format(e))
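
Instead of a fixed `time.sleep(90)`, one option is to poll until the function is ready using the Lambda client's `function_active` waiter. The sketch below is an alternative to the cell above, not what this notebook runs; `wait_until_active` is a hypothetical helper and the delay and attempt counts are arbitrary assumptions.

```python
def wait_until_active(lambda_client, function_name, delay=5, max_attempts=60):
    """Block until the Lambda function reports the Active state.

    `lambda_client` is expected to be a boto3 Lambda client, for example
    the `client` created above with boto3.client('lambda').
    """
    waiter = lambda_client.get_waiter("function_active")
    waiter.wait(
        FunctionName=function_name,
        # Poll every `delay` seconds, giving up after `max_attempts` polls.
        WaiterConfig={"Delay": delay, "MaxAttempts": max_attempts},
    )
```

Calling `wait_until_active(client, lambda_function_name)` after `create_function` returns as soon as the image-based function is usable, and raises an error if it does not become active within the configured attempts.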

## Prepare test event for the Lambda function

In [None]:
event = {
      "bucket": 'sagemaker-sample-data-{}'.format(region),
      "prefix": 'tensorflow/mnist/',
      "file": 'train_data.npy'
    }
json.dumps(event)
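
The Lambda handler itself lives in the container image and is not shown in this notebook. As a rough sketch of how a handler could consume this event, the code below builds one from two injected callables. Everything here is an assumption for illustration: `make_handler`, `load_images`, and `predict` are hypothetical names, and the response shape (a JSON-encoded list of class indices under `body`) is inferred from how the response is parsed later in this notebook.

```python
import json


def argmax(scores):
    """Return the index of the largest value in a list of scores."""
    return max(range(len(scores)), key=lambda i: scores[i])


def make_handler(load_images, predict):
    """Build a Lambda-style handler from two callables (both hypothetical).

    load_images(bucket, key) -> list of input images
    predict(images) -> one list of per-class scores per image, for example
                       the `dense_1` output of the `serving_default` signature
    """
    def handler(event, context):
        # The test event supplies the bucket, prefix and file name.
        key = event["prefix"] + event["file"]
        images = load_images(event["bucket"], key)
        scores = predict(images)
        predictions = [argmax(row) for row in scores]
        return {"statusCode": 200, "body": json.dumps(predictions)}

    return handler
```

Injecting the loader and the model call keeps the handler logic testable without S3 or TensorFlow. In a real image, `load_images` would download and decode the `.npy` file and `predict` would call the loaded SavedModel.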

# Invoke the Lambda function

In [None]:
response = client.invoke(
    FunctionName=lambda_function_name,
    InvocationType='RequestResponse',
    Payload=json.dumps(event)
)

In [None]:
print(response)

In [None]:
print('HTTPStatusCode: {}'.format(response['ResponseMetadata']['HTTPStatusCode']))

In [None]:
response = json.loads(response["Payload"].read())

In [None]:
predictions = json.loads(response['body'])

In [None]:
predictions
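
The parsing steps above can be collected into one helper that also checks for a function error before decoding the payload. This is a sketch under stated assumptions: `parse_invoke_response` is a hypothetical name, and it expects a fresh `invoke` response whose `Payload` stream has not been read yet, with the predictions JSON-encoded under `body` as in the cells above.

```python
import io
import json


def parse_invoke_response(response):
    """Return the predictions from a Lambda `invoke` response.

    Raises RuntimeError when the response carries a `FunctionError` entry,
    which Lambda sets when the function itself failed.
    """
    if response.get("FunctionError"):
        detail = response["Payload"].read().decode("utf-8")
        raise RuntimeError("Lambda function failed: " + detail)
    payload = json.loads(response["Payload"].read())
    return json.loads(payload["body"])


# Stand-in response for illustration; a real one comes from client.invoke.
fake_response = {
    "StatusCode": 200,
    "Payload": io.BytesIO(json.dumps({"body": "[7, 2, 1]"}).encode("utf-8")),
}
print(parse_invoke_response(fake_response))
```

Because the payload stream can be read only once, the helper should be applied to a new response rather than to one already consumed by the cells above.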

Let's download the training labels and use them to evaluate the model.

In [None]:
import numpy as np

!aws --region {region} s3 cp s3://sagemaker-sample-data-{region}/tensorflow/mnist/train_labels.npy train_labels.npy

train_labels = np.load('train_labels.npy')

Examine the prediction result from the TensorFlow 2.3 model.

In [None]:
for i in range(0, 50):
    label = train_labels[i]
    print('prediction is {}, label is {}, matched: {}'.format(predictions[i], label, predictions[i] == label))
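
Beyond inspecting individual rows, the same comparison can be summarized as a single accuracy figure. The helper below is a minimal sketch; `accuracy` is a hypothetical name, and it scores only as many items as the shorter of the two sequences, on the assumption that the Lambda function may return fewer predictions than there are labels.

```python
def accuracy(predictions, labels):
    """Return the fraction of predictions that match their labels."""
    # zip stops at the shorter sequence, so extra labels are ignored.
    pairs = list(zip(predictions, labels))
    if not pairs:
        raise ValueError("no predictions to score")
    matches = sum(1 for predicted, label in pairs if int(predicted) == int(label))
    return matches / len(pairs)
```

For example, `accuracy(predictions, train_labels)` would report the share of correct predictions. Since the test event uses training data, this measures fit on seen examples rather than generalization.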

# Delete the Lambda function

Let's delete the Lambda function.

In [None]:
response = client.delete_function(
    FunctionName=lambda_function_name,
)

In [None]:
print(response)