## Option 1: Bring your own Script

### Overview
Amazon SageMaker provides both (1) built-in algorithms and (2) an easy path to train your own custom models. Although the built-in algorithms cover many domains (computer vision, natural language processing, etc.) and are easy to use (just provide your data), <b>sometimes training a custom model is the preferred approach. This notebook focuses on training a custom model using TensorFlow 1.x and TensorFlow 2</b>.



### TensorFlow script mode training and serving
Script mode is a training script format for TensorFlow that lets you execute any TensorFlow training script in SageMaker with minimal modification. The SageMaker Python SDK handles transferring your script to a SageMaker training instance. On the training instance, SageMaker's native TensorFlow support sets up training-related environment variables and executes your training script. In this tutorial, we use the SageMaker Python SDK to launch a training job and deploy the trained model.

Script mode supports training with a Python script, a Python module, or a shell script. In this example, we use Python scripts to train a classification model on the MNIST dataset, and we show how to train a model in SageMaker using TensorFlow 1.x and TensorFlow 2.x scripts with the SageMaker Python SDK. In addition, this notebook demonstrates how to perform real-time inference with the SageMaker TensorFlow Serving container, which is the default inference method for script mode.

# Set up the environment
Let's start by setting up the environment:

In [15]:
# cell 01: Setting up the environment
import os
import sagemaker
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()

role = get_execution_role()
region = sagemaker_session.boto_session.region_name

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


# Training Data
The MNIST dataset has been loaded to the public S3 bucket `sagemaker-sample-data-<REGION>` under the prefix `tensorflow/mnist`. There are four .npy files under this prefix:

- train_data.npy
- eval_data.npy
- train_labels.npy
- eval_labels.npy
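
The full object URIs follow directly from the bucket naming pattern and the prefix above. A minimal sketch, assuming a helper name (`mnist_object_uris`) that is not part of the notebook:

```python
# Hypothetical helper: build the S3 URIs of the four MNIST .npy files
# from the bucket pattern and prefix described above.
MNIST_FILES = ["train_data.npy", "eval_data.npy", "train_labels.npy", "eval_labels.npy"]


def mnist_object_uris(region):
    """Return the S3 URI of each MNIST file for the given AWS region."""
    prefix = "s3://sagemaker-sample-data-{}/tensorflow/mnist".format(region)
    return ["{}/{}".format(prefix, name) for name in MNIST_FILES]


for uri in mnist_object_uris("eu-central-1"):
    print(uri)
```

The training job itself only needs the prefix; the individual URIs are useful when you want to download single files, as done later for inference.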

In [16]:
# cell 02: Build the S3 URI of the training data
training_data_uri = 's3://sagemaker-sample-data-{}/tensorflow/mnist'.format(region)
print(training_data_uri)

s3://sagemaker-sample-data-eu-central-1/tensorflow/mnist


# Construct a script for distributed training
This tutorial's training script was adapted from TensorFlow's official [CNN MNIST example](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/tutorials/layers/cnn_mnist.py). We have modified it to handle the `model_dir` parameter passed in by SageMaker. This is an S3 path which can be used for data sharing during distributed training and checkpointing and/or model persistence. We have also added an argument-parsing function to handle processing training-related variables.

At the end of the training job we have added a step to export the trained model to the path stored in the environment variable `SM_MODEL_DIR`, which always points to `/opt/ml/model`. This is critical because SageMaker uploads all the model artifacts in this folder to S3 at the end of training.
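
To make the argument handling concrete, here is a minimal sketch of how a script-mode training script might parse its inputs. It is not the actual content of `mnist.py`: the function name `parse_args`, the `--sm-model-dir` and `--train` argument names, and the `--epochs` hyperparameter are assumptions for illustration. `SM_CHANNEL_TRAINING` is the environment variable SageMaker sets for the default `training` channel.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse the arguments a script-mode training script typically receives."""
    parser = argparse.ArgumentParser()
    # model_dir is passed in by SageMaker as a script argument (an S3 path).
    parser.add_argument("--model_dir", type=str, default=None)
    # Local directory whose contents SageMaker uploads to S3 after training.
    parser.add_argument("--sm-model-dir", type=str,
                        default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))
    # Local directory that holds the data of the default 'training' channel.
    parser.add_argument("--train", type=str,
                        default=os.environ.get("SM_CHANNEL_TRAINING",
                                               "/opt/ml/input/data/training"))
    # Hypothetical hyperparameter, included only to show the pattern.
    parser.add_argument("--epochs", type=int, default=1)
    # parse_known_args tolerates extra arguments instead of exiting.
    args, _unknown = parser.parse_known_args(argv)
    return args


args = parse_args(["--model_dir", "s3://example-bucket/example-job"])
print(args.model_dir, args.sm_model_dir, args.train, args.epochs)
```

Falling back to the environment variables as defaults lets the same script run both inside the training container and locally with explicit arguments.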

The following cell prints the contents of the two training scripts, `mnist.py` (TensorFlow 1.x) and `mnist-2.py` (TensorFlow 2.1):

In [17]:
# cell 03: Peek at the custom scripts

!pygmentize 'mnist.py'

# TensorFlow 2.1 script
!pygmentize 'mnist-2.py'

# Copyright 2018-2019 Amazon.com, Inc. or its affiliates. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License"). You
# may not use this file except in compliance with the License. A copy of
# the License is located at
#
#     http://aws.amazon.com/apache2.0/
#
# or in the "license" file accompanying this file. This file is
# distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF
# ANY KIND, either express or implied. See the License for the specific
# language governing permissions and limitations under the License.
"""Convolutional Neural Netwo

# Copyright 2019 Amazon.com, Inc. or its affiliates. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License"). You
# may not use this file except in compliance with the License. A copy of
# the License is located at
#
#     http://aws.amazon.com/apache2.0/
#
# or in the "license" file accompanying this file. This file is
# distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF
# ANY KIND, either express or implied. See the License for the specific
# language governing permissions and limitations under the License.

import t

# Create a training job using the TensorFlow estimator
The `sagemaker.tensorflow.TensorFlow` estimator handles locating the script mode container, uploading your script to an S3 location, and creating a SageMaker training job. Let's call out a few important parameters:

- `py_version` is set to `'py3'` to indicate that we are using script mode, since legacy mode supports only Python 2. Python 2 has reached end of life, but older versions of the SDK also let you use script mode with Python 2 by setting `py_version` to `'py2'` and `script_mode` to `True`.

- `distribution` is used to configure the distributed training setup. It's required only if you are doing distributed training either across a cluster of instances or across multiple GPUs. Here we are using parameter servers as the distributed training schema. SageMaker training jobs run on homogeneous clusters. To make parameter server more performant in the SageMaker setup, we run a parameter server on every instance in the cluster, so there is no need to specify the number of parameter servers to launch. 

- `instance_type` specifies the EC2 instance type used for training. You should right-size your training instance based on the size of your data, the algorithm, and the task. Here we choose `ml.c5.xlarge`.

NB: `use_spot_instances` (optional): For further cost optimization, you can use [managed Amazon EC2 Spot instances](https://docs.aws.amazon.com/sagemaker/latest/dg/model-managed-spot-training.html) by setting this parameter to `True`. Managed spot training can reduce the cost of training models by up to 90% compared with on-demand instances. SageMaker manages the Spot interruptions on your behalf. You can specify which training jobs use Spot instances and a stopping condition that specifies how long Amazon SageMaker waits for a job to run using Amazon EC2 Spot instances.
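
As a rough sketch of the spot-training settings, the helper below builds the extra keyword arguments for the estimator. The helper name and the time limits are illustrative assumptions; `max_run` and `max_wait` are the estimator parameters we assume here for the stopping condition, with `max_wait` required to be at least as large as `max_run`.

```python
def spot_training_kwargs(max_run_seconds=3600, max_wait_seconds=7200):
    """Return estimator keyword arguments that enable managed spot training."""
    if max_wait_seconds < max_run_seconds:
        raise ValueError("max_wait must be at least as large as max_run")
    return {
        "use_spot_instances": True,
        "max_run": max_run_seconds,    # upper bound on training time
        "max_wait": max_wait_seconds,  # upper bound on waiting plus training time
    }


spot_kwargs = spot_training_kwargs()
print(spot_kwargs)
# These could then be passed along with the other estimator arguments, e.g.
# TensorFlow(entry_point='mnist.py', role=role, instance_count=2,
#            instance_type='ml.c5.xlarge', framework_version='1.15.2',
#            py_version='py3', **spot_kwargs)
```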

In [22]:
# cell 04: Create estimator for TF v1 (model 1)
from sagemaker.tensorflow import TensorFlow

mnist_estimator = TensorFlow(entry_point='mnist.py',
                             role=role,
                             instance_count=2,
                             instance_type='ml.c5.xlarge',
                             framework_version='1.15.2',
                             py_version='py3',
                             distribution={'parameter_server': {'enabled': True}})

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


You can also instantiate an estimator to train with a TensorFlow 2.1 script. The only things you need to change are the script name and `framework_version`.
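
One way to make the shared configuration explicit is to keep the common settings in a dictionary and vary only the two values that differ. This is a sketch with hypothetical names (`COMMON_ARGS`, `estimator_args`); `role` is left out because it comes from the notebook session.

```python
# Settings shared by both estimators in this notebook.
COMMON_ARGS = {
    "instance_count": 2,
    "instance_type": "ml.c5.xlarge",
    "py_version": "py3",
    "distribution": {"parameter_server": {"enabled": True}},
}


def estimator_args(entry_point, framework_version):
    """Merge the shared settings with the two values that differ per model."""
    args = dict(COMMON_ARGS)
    args.update(entry_point=entry_point, framework_version=framework_version)
    return args


tf1_args = estimator_args("mnist.py", "1.15.2")
tf2_args = estimator_args("mnist-2.py", "2.1.0")
# List the keys whose values differ between the two configurations.
differing = sorted(k for k in tf1_args if tf1_args[k] != tf2_args[k])
print(differing)  # ['entry_point', 'framework_version']
```

Each dictionary could then be unpacked into the estimator, for example `TensorFlow(role=role, **tf2_args)`.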

In [5]:
# cell 05: Create estimator for TF v2 (model 2)
mnist_estimator2 = TensorFlow(entry_point='mnist-2.py',
                              role=role,
                              instance_count=2,
                              instance_type='ml.c5.xlarge',
                              framework_version='2.1.0',
                              py_version='py3',
                              distribution={'parameter_server': {'enabled': True}})

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


# Calling `fit`
To start a training job, we call `estimator.fit(training_data_uri)`. Execute cell 06; it will take several minutes to run.

Notes:<br>
An S3 location is used here as the input. `fit` creates a default channel named `training`, which points to this S3 location. In the training script we can then access the training data from the location stored in `SM_CHANNEL_TRAINING`. `fit` accepts a few other types of input as well. See the API doc [here](https://sagemaker.readthedocs.io/en/stable/estimators.html#sagemaker.estimator.EstimatorBase.fit) for details.

When training starts, the TensorFlow container executes mnist.py, passing hyperparameters and model_dir from the estimator as script arguments. Because we didn't define either in this example, no hyperparameters are passed, and model_dir defaults to `s3://<DEFAULT_BUCKET>/<TRAINING_JOB_NAME>`, so the script execution is as follows:

`python mnist.py --model_dir s3://<DEFAULT_BUCKET>/<TRAINING_JOB_NAME>`

When training is complete, the training job uploads the saved model for TensorFlow Serving.
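
To illustrate how hyperparameters end up as script arguments, the sketch below builds such a command line from a dictionary. It only illustrates the `--name value` convention and is not the container's actual implementation; the function name and the example hyperparameters are hypothetical.

```python
def build_command(script, model_dir, hyperparameters=None):
    """Build an illustrative command line: each hyperparameter becomes '--name value'."""
    command = ["python", script]
    for name, value in sorted((hyperparameters or {}).items()):
        command += ["--{}".format(name), str(value)]
    command += ["--model_dir", model_dir]
    return command


model_dir = "s3://<DEFAULT_BUCKET>/<TRAINING_JOB_NAME>"
# No hyperparameters, as in this notebook.
print(" ".join(build_command("mnist.py", model_dir)))
# With two hypothetical hyperparameters.
print(" ".join(build_command("mnist.py", model_dir, {"epochs": 5, "batch-size": 128})))
```

The training script then reads these values with an argument parser, so anything set in the estimator's hyperparameters must correspond to an argument the script understands.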

In [None]:
# cell 06: Train model 1
mnist_estimator.fit(training_data_uri)

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Amazon SageMaker Debugger does not currently support Parameter Server distribution
INFO:sagemaker:Amazon SageMaker Debugger does not currently support Parameter Server distribution
INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: tensorflow-training-2023-11-16-14-54-46-535


Using provided s3_resource
2023-11-16 14:54:46 Starting - Starting the training job...
2023-11-16 14:55:01 Starting - Preparing the instances for training......
2023-11-16 14:56:27,494 sagemaker-containers INFO     Imported framework sagemaker_tensorflow_container.training
2023-11-16 14:56:27,502 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2023-11-16 14:56:27,585 sagemaker_tensorflow_container.training INFO     Running distributed training job with parameter servers
2023-11-16 14:56:27,585 sagemaker_tensorflow_container.training INFO     Launching parameter server process
2023-11-16 14:56:27,585 sagemaker_tensorflow_container.training INFO     Running distributed training job with parameter servers
2023-11-16 14:56:27,628 sagemaker_tensorflow_container.training INFO     Launching worker process
2023-11-16 14:56:27,778 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)

2023-11-16 14:56:30,324 sagemaker-containers INFO     Imported framework sagemaker_tensorflow_container.training
2023-11-16 14:56:30,332 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2023-11-16 14:56:30,425 sagemaker_tensorflow_container.training INFO     Running distributed training job with parameter servers
2023-11-16 14:56:30,426 sagemaker_tensorflow_container.training INFO     Launching parameter server process
2023-11-16 14:56:30,426 sagemaker_tensorflow_container.training INFO     Running distributed training job with parameter servers
2023-11-16 14:56:30,470 sagemaker_tensorflow_container.training INFO     Launching worker process
2023-11-16 14:56:30,597 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2023-11-16 14:56:30,617 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2023-11-16 14:56:30,637 sagemaker-containe

INFO:tensorflow:loss = 2.290492, step = 142 (10.223 sec)
INFO:tensorflow:global_step/sec: 19.0922
INFO:tensorflow:global_step/sec: 19.829
INFO:tensorflow:loss = 2.269813, step = 265 (11.022 sec)
INFO:tensorflow:loss = 2.2737763, step = 331 (9.749 sec)
INFO:tensorflow:global_step/sec: 19.0714
INFO:tensorflow:global_step/sec: 19.4173
INFO:tensorflow:loss = 2.2643905, step = 469 (10.475 sec)
INFO:tensorflow:loss = 2.2599397, step = 528 (10.676 sec)

INFO:tensorflow:global_step/sec: 20.1114
INFO:tensorflow:loss = 0.3454706, step = 3837 (10.427 sec)
INFO:tensorflow:global_step/sec: 20.4088
INFO:tensorflow:loss = 0.32789242, step = 3957 (9.519 sec)
INFO:tensorflow:global_step/sec: 19.6582
INFO:tensorflow:loss = 0.314485, step = 4048 (10.460 sec)
INFO:tensorflow:global_step/sec: 17.9336
INFO:tensorflow:loss = 0.50722253, step = 4147 (9.959 sec)
INFO:tensorflow:global_step/sec: 20.4146
INF

INFO:tensorflow:loss = 0.19162863, step = 7335 (9.328 sec)
INFO:tensorflow:global_step/sec: 20.0806
INFO:tensorflow:global_step/sec: 20.0303
INFO:tensorflow:loss = 0.22332221, step = 7476 (10.724 sec)
INFO:tensorflow:loss = 0.30459142, step = 7520 (9.256 sec)
INFO:tensorflow:global_step/sec: 20.7604
INFO:tensorflow:global_step/sec: 20.4728
INFO:tensorflow:loss = 0.22084726, step = 7691 (10.468 sec)
INFO:tensorflow:loss = 0.22179325, step = 7706 

INFO:tensorflow:global_step/sec: 20.3659
INFO:tensorflow:loss = 0.207526, step = 10690 (10.523 sec)
INFO:tensorflow:loss = 0.1276577, step = 10706 (9.216 sec)
INFO:tensorflow:global_step/sec: 20.2547
INFO:tensorflow:global_step/sec: 20.2485
INFO:tensorflow:loss = 0.23454823, step = 10895 (9.256 sec)
INFO:tensorflow:loss = 0.20318785, step = 10902 (10.458 sec)
INFO:tensorflow:global_step/sec: 20.6954
INFO:tensorflow:global_step/sec: 20.2353

INFO:tensorflow:loss = 0.22270523, step = 13010 (9.230 sec)
INFO:tensorflow:global_step/sec: 19.9795
INFO:tensorflow:global_step/sec: 20.4888
INFO:tensorflow:loss = 0.109329775, step = 13200 (10.700 sec)
INFO:tensorflow:loss = 0.1821653, step = 13197 (9.318 sec)
INFO:tensorflow:global_step/sec: 20.0431
INFO:tensorflow:global_step/sec: 20.0255
INFO:tensorflow:loss = 0.25459218, step = 13387 (9.424 sec)
INFO:tensorflow:loss = 0.27609864, step =

Execute the next cell to train a model with the TensorFlow 2.1 script. It will take several minutes to execute.

In [7]:
# cell 07: Train model 2
mnist_estimator2.fit(training_data_uri)

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Amazon SageMaker Debugger does not currently support Parameter Server distribution
INFO:sagemaker:Amazon SageMaker Debugger does not currently support Parameter Server distribution
INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: tensorflow-training-2023-11-16-13-57-52-608


Using provided s3_resource
2023-11-16 13:57:53 Starting - Starting the training job...
2023-11-16 13:58:07 Starting - Preparing the instances for training......
2023-11-16 13:59:15 Downloading - Downloading input data...
2023-11-16 13:59:51 Training - Training image download completed. Training in progress..
2023-11-16 13:59:56,925 sagemaker-containers INFO     Imported framework sagemaker_tensorflow_container.training
2023-11-16 13:59:56,933 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2023-11-16 13:59:57,028 sagemaker_tensorflow_container.training INFO     Running distributed training job with parameter servers
2023-11-16 13:59:57,028 sagemaker_tensorflow_container.training INFO     Launching parameter server process
2023-11-16 13:59:57,028 sagemaker_tensorflow_container.training INFO     Running distributed training job with parameter servers
2023-11-16 13:59:57,060 sagemaker_tensorflow_container.traini

2023-11-16 13:59:56,941 sagemaker-containers INFO     Imported framework sagemaker_tensorflow_container.training
2023-11-16 13:59:56,948 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2023-11-16 13:59:57,040 sagemaker_tensorflow_container.training INFO     Running distributed training job with parameter servers
2023-11-16 13:59:57,040 sagemaker_tensorflow_container.training INFO     Launching parameter server process
2023-11-16 13:59:57,040 sagemaker_tensorflow_container.training INFO     Running distributed training job with parameter servers
2023-11-16 13:59:57,076 sagemaker_tensorflow_container.training INFO     Launching worker process
2023-11-16 13:59:57,200 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2023-11-16 13:59:57,219 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2023-11-16 13:59:57,236 sagemaker-container

2023-11-16 14:00:08,970 sagemaker_tensorflow_container.training INFO     master algo-1 is down, stopping parameter server
For details of how to construct your training script see:
https://sagemaker.readthedocs.io/en/stable/using_tf.html#adapting-your-local-tensorflow-script
2023-11-16 14:00:08,971 sagemaker-containers INFO     Reporting training SUCCESS

2023-11-16 14:00:23 Stopping - Stopping the training job
2023-11-16 14:00:23 Uploading - Uploading generated training model
2023-11-16 14:00:23 Stopped - Training job stopped
2023-11-16 14:00:09,228 sagemaker_tensorflow_container.training INFO     master algo-1 is down, stopping parameter server
For details of how to construct your training script see:
https://sagemaker.readthedocs.io/en/stable/using_tf.html#adapting-your-local-tensorflow-script
2023-11-16 14:00:09,229 sagemaker-containers INFO     Reporting training SUCCESS




Training seconds: 408
Billable seconds: 408


# Deploy the trained model to an endpoint

Execute cell 08 to deploy the model trained with TensorFlow 1.15. It will take several minutes to execute.

Notes:<br>
The `deploy()` method creates a SageMaker model, which is then deployed to an endpoint to serve prediction requests in real time. We will use the TensorFlow Serving container for the endpoint, because we trained with script mode. This serving container runs an implementation of a web server that is compatible with SageMaker hosting protocol. The [Using your own inference code](https://render.githubusercontent.com/view/ipynb?color_mode=auto&commit=a5c9a21e6ed70fd51ab5178f3a35461473f7b379&enc_url=68747470733a2f2f7261772e67697468756275736572636f6e74656e742e636f6d2f6177732f616d617a6f6e2d736167656d616b65722d6578616d706c65732f613563396132316536656437306664353161623531373866336133353436313437336637623337392f736167656d616b65722d707974686f6e2d73646b2f74656e736f72666c6f775f7363726970745f6d6f64655f747261696e696e675f616e645f73657276696e672f74656e736f72666c6f775f7363726970745f6d6f64655f747261696e696e675f616e645f73657276696e672e6970796e62&nwo=aws%2Famazon-sagemaker-examples&path=sagemaker-python-sdk%2Ftensorflow_script_mode_training_and_serving%2Ftensorflow_script_mode_training_and_serving.ipynb&repository_id=107937815&repository_type=Repository) document explains how SageMaker runs inference containers.


In [8]:
# cell 08: deploy model 1
predictor = mnist_estimator.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

INFO:sagemaker.tensorflow.model:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating model with name: tensorflow-training-2023-11-16-14-00-36-676
INFO:sagemaker:Creating endpoint-config with name tensorflow-training-2023-11-16-14-00-36-676
INFO:sagemaker:Creating endpoint with name tensorflow-training-2023-11-16-14-00-36-676


----!

Execute the contents of the next cell to deploy the trained model with TensorFlow 2.1. It will take several minutes to execute. 

In [9]:
# cell 09: deploy model 2
predictor2 = mnist_estimator2.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

INFO:sagemaker.tensorflow.model:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating model with name: tensorflow-training-2023-11-16-14-03-08-645
INFO:sagemaker:Creating endpoint-config with name tensorflow-training-2023-11-16-14-03-08-645
INFO:sagemaker:Creating endpoint with name tensorflow-training-2023-11-16-14-03-08-645


----!

# Invoke the endpoint
Let's download the training data and use that as input for inference.

In [10]:
# cell 10: download training data in SageMaker Studio (it will take a few seconds to execute)
import numpy as np

!aws --region {region} s3 cp s3://sagemaker-sample-data-{region}/tensorflow/mnist/train_data.npy train_data.npy
!aws --region {region} s3 cp s3://sagemaker-sample-data-{region}/tensorflow/mnist/train_labels.npy train_labels.npy

train_data = np.load('train_data.npy')
train_labels = np.load('train_labels.npy')

download: s3://sagemaker-sample-data-eu-central-1/tensorflow/mnist/train_data.npy to ./train_data.npy
download: s3://sagemaker-sample-data-eu-central-1/tensorflow/mnist/train_labels.npy to ./train_labels.npy


Notes: <br>
The formats of the input and the output data correspond directly to the request and response formats of the Predict method in the [TensorFlow Serving REST API](https://www.tensorflow.org/serving/api_rest). SageMaker's TensorFlow Serving endpoints can also accept additional input formats that are not part of the TensorFlow REST API, including the simplified JSON format, line-delimited JSON objects ("jsons" or "jsonlines"), and CSV data.

In this example we are using a numpy array as input, which will be serialized into the simplified JSON format. In addition, TensorFlow Serving can also process multiple items at once, as you can see in the following code. You can find the complete documentation on how to make predictions against a TensorFlow Serving SageMaker endpoint [here](https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/deploying_tensorflow_serving.rst#making-predictions-against-a-sagemaker-endpoint).
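
For reference, the row format of the TensorFlow Serving REST Predict method wraps the inputs in an `instances` list. The sketch below serializes a tiny made-up batch in that format; the helper name and the 2x2 inputs are assumptions for illustration, and the predictor in this notebook performs its own serialization for you.

```python
import json


def predict_request_body(instances):
    """Serialize inputs in the row format of the TensorFlow Serving REST Predict API."""
    return json.dumps({"instances": instances})


# Two tiny made-up 2x2 "images"; real MNIST inputs are larger.
batch = [[[0.0, 0.5], [1.0, 0.0]], [[0.2, 0.2], [0.9, 0.1]]]
body = predict_request_body(batch)
print(body)
```

Each element of `instances` is one input item, which is how a single request can carry multiple items.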

Now run the predictions by executing cell 11, which takes about a second:

In [11]:
# cell 11: run the predictions for model 1 (takes about a second to execute)
predictions = predictor.predict(train_data[:50])
for i in range(0, 50):
    prediction = predictions['predictions'][i]['classes']
    label = train_labels[i]
    print('prediction is {}, label is {}, matched: {}'.format(prediction, label, prediction == label))

prediction is 7, label is 7, matched: True
prediction is 3, label is 3, matched: True
prediction is 4, label is 4, matched: True
prediction is 6, label is 6, matched: True
prediction is 1, label is 1, matched: True
prediction is 8, label is 8, matched: True
prediction is 1, label is 1, matched: True
prediction is 0, label is 0, matched: True
prediction is 9, label is 9, matched: True
prediction is 8, label is 8, matched: True
prediction is 0, label is 0, matched: True
prediction is 3, label is 3, matched: True
prediction is 1, label is 1, matched: True
prediction is 3, label is 2, matched: False
prediction is 7, label is 7, matched: True
prediction is 0, label is 0, matched: True
prediction is 2, label is 2, matched: True
prediction is 9, label is 9, matched: True
prediction is 6, label is 6, matched: True
prediction is 0, label is 0, matched: True
prediction is 1, label is 1, matched: True
prediction is 6, label is 6, matched: True
prediction is 7, label is 7, matched: True
prediction

Next, run the same predictions against the TensorFlow 2.1 model by executing cell 12, which takes about a second. This model returns a list of per-class scores for each input, so the cell uses `np.argmax` to pick the predicted class:
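
The two endpoints return differently shaped entries in `predictions`, as the indexing in cells 11 and 12 shows. A minimal sketch of a helper that handles both shapes follows; the function name and the sample responses are made up for illustration and carry fewer fields and classes than real responses.

```python
def predicted_class(prediction):
    """Return the class index from one entry of the 'predictions' list."""
    if isinstance(prediction, dict):
        # Model 1 style: the class index is stored under 'classes'.
        return prediction["classes"]
    # Model 2 style: a list of per-class scores; pick the index of the largest.
    return max(range(len(prediction)), key=lambda i: prediction[i])


# Made-up responses that mimic the two shapes used in cells 11 and 12.
response_1 = {"predictions": [{"classes": 7}]}
response_2 = {"predictions": [[0.01, 0.02, 0.9, 0.07]]}
print(predicted_class(response_1["predictions"][0]))
print(predicted_class(response_2["predictions"][0]))
```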

In [12]:
# cell 12: run the predictions for model 2 (takes about a second to execute)
predictions2 = predictor2.predict(train_data[:50])
for i in range(0, 50):
    prediction = np.argmax(predictions2['predictions'][i])
    label = train_labels[i]
    print('prediction is {}, label is {}, matched: {}'.format(prediction, label, prediction == label))

prediction is 7, label is 7, matched: True
prediction is 3, label is 3, matched: True
prediction is 4, label is 4, matched: True
prediction is 6, label is 6, matched: True
prediction is 1, label is 1, matched: True
prediction is 8, label is 8, matched: True
prediction is 1, label is 1, matched: True
prediction is 0, label is 0, matched: True
prediction is 9, label is 9, matched: True
prediction is 8, label is 8, matched: True
prediction is 0, label is 0, matched: True
prediction is 3, label is 3, matched: True
prediction is 1, label is 1, matched: True
prediction is 2, label is 2, matched: True
prediction is 7, label is 7, matched: True
prediction is 0, label is 0, matched: True
prediction is 2, label is 2, matched: True
prediction is 9, label is 9, matched: True
prediction is 6, label is 6, matched: True
prediction is 0, label is 0, matched: True
prediction is 1, label is 1, matched: True
prediction is 6, label is 6, matched: True
prediction is 7, label is 7, matched: True
prediction 

# Delete the endpoint
After analyzing the results, delete the two endpoints by executing cells 13 and 14 so that they do not incur any extra costs. Optionally, use the AWS console to [verify](https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-cleanup.html) that the endpoints have been deleted.

In [13]:
# cell 13: delete endpoint for model 1
predictor.delete_endpoint()

INFO:sagemaker:Deleting endpoint configuration with name: tensorflow-training-2023-11-16-14-00-36-676
INFO:sagemaker:Deleting endpoint with name: tensorflow-training-2023-11-16-14-00-36-676


In [14]:
# cell 14: delete endpoint for model 2
predictor2.delete_endpoint()

INFO:sagemaker:Deleting endpoint configuration with name: tensorflow-training-2023-11-16-14-03-08-645
INFO:sagemaker:Deleting endpoint with name: tensorflow-training-2023-11-16-14-03-08-645
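
As an alternative to checking in the console, the deletion can also be checked programmatically. The sketch below assumes a boto3 SageMaker client and its `list_endpoints` call with the `NameContains` filter; the helper name is hypothetical, and only the first page of results is inspected.

```python
def endpoint_exists(sm_client, endpoint_name):
    """Return True if an endpoint with exactly this name is still listed."""
    response = sm_client.list_endpoints(NameContains=endpoint_name)
    return any(e["EndpointName"] == endpoint_name
               for e in response.get("Endpoints", []))


# Usage (assumes boto3 is installed and AWS credentials are configured):
# import boto3
# sm_client = boto3.client("sagemaker", region_name=region)
# print(endpoint_exists(sm_client, "tensorflow-training-2023-11-16-14-00-36-676"))
```

An endpoint may still be listed for a short time while its deletion is in progress, so a `True` result right after cell 13 or 14 does not necessarily mean the deletion failed.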


### Conclusion

In this tutorial, we used the SageMaker Python SDK to launch training jobs and deploy the trained models. On the training instances, SageMaker's native TensorFlow support set up the training-related environment variables and executed the training scripts `mnist.py` and `mnist-2.py`.

