# 🚀 SageMaker Basics: Interacting with S3 and Running a Training Job

## 1. Imports

In [None]:
import boto3
import sagemaker
from sagemaker import get_execution_role
import os
import uuid

## 2. Setting up SageMaker Session and Execution Role

In [None]:
# Initialize SageMaker session
sagemaker_session = sagemaker.Session()

# Get the execution role
role = get_execution_role()

# Get the region
region = sagemaker_session.boto_session.region_name

## 3. Basic S3 Interactions

> **Tip:** You can interact with S3 using **boto3** or **SageMaker utilities**.

### 3.1 List all buckets

In [None]:
s3_client = boto3.client('s3')

# List all buckets
buckets = [bucket['Name'] for bucket in s3_client.list_buckets()['Buckets']]
print("S3 Buckets:", buckets)

### 3.2 Create a new bucket

In [None]:
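# Create a unique bucket name (bucket names must be globally unique)
bucket_name = f"sagemaker-example-{uuid.uuid4()}"
# Outside us-east-1, S3 requires an explicit LocationConstraint
if region == 'us-east-1':
    s3_client.create_bucket(Bucket=bucket_name)
else:
    s3_client.create_bucket(Bucket=bucket_name, CreateBucketConfiguration={'LocationConstraint': region})
print(f"Created bucket: {bucket_name}")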
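
Besides being globally unique, bucket names must follow the S3 naming rules (3 to 63 characters; lowercase letters, digits, hyphens and dots; starting and ending with a letter or digit). The sketch below checks a generated name against a simplified version of those rules before any API call is made. The helper `make_bucket_name` and the regular expression are illustrative assumptions, not part of boto3 or the SageMaker SDK.

```python
import re
import uuid

# Simplified subset of the S3 bucket naming rules: 3-63 characters, lowercase
# letters, digits, hyphens and dots, starting and ending with a letter or digit.
_BUCKET_NAME_RE = re.compile(r"^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$")


def make_bucket_name(prefix="sagemaker-example"):
    """Return '<prefix>-<uuid4>' after checking it against the naming rules."""
    name = f"{prefix}-{uuid.uuid4()}"
    if not _BUCKET_NAME_RE.match(name):
        raise ValueError(f"Invalid S3 bucket name: {name!r}")
    return name


# Hypothetical usage: generate a candidate name without creating anything
candidate_name = make_bucket_name()
print(candidate_name, len(candidate_name))
```

A prefix containing uppercase letters or underscores, or one long enough to push the name past 63 characters, raises `ValueError` locally instead of failing in the `create_bucket` call.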

### 3.3 Upload a file to the bucket

In [None]:
# Create a simple file
filename = 'example.txt'
with open(filename, 'w') as f:
    f.write('This is a sample file to upload to S3.')

# Upload the file
s3_client.upload_file(filename, bucket_name, filename)
print(f"Uploaded {filename} to bucket {bucket_name}")
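
To confirm that an upload actually landed, one option is to compare the local file size with the object's `ContentLength` as reported by `head_object`. This is a minimal sketch; the helper `upload_and_verify` is a hypothetical name, and it accepts any object that exposes boto3's `upload_file` and `head_object` methods.

```python
import os


def upload_and_verify(s3_client, filename, bucket, key=None):
    """Upload a local file and check that the stored object has the same size."""
    key = key or os.path.basename(filename)
    s3_client.upload_file(filename, bucket, key)
    # head_object returns metadata only, so the object body is not downloaded
    remote_size = s3_client.head_object(Bucket=bucket, Key=key)["ContentLength"]
    local_size = os.path.getsize(filename)
    if remote_size != local_size:
        raise RuntimeError(
            f"Size mismatch for s3://{bucket}/{key}: {remote_size} != {local_size}"
        )
    return f"s3://{bucket}/{key}"
```

With the variables from the cell above, `upload_and_verify(s3_client, filename, bucket_name)` would return the S3 URI of the uploaded object.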

## 4. Running a Basic Training Job on SageMaker

> **We will train a TensorFlow model in script mode** using the SageMaker TensorFlow estimator and MNIST sample data provided by AWS.

### 4.1 Prepare Training Data

In [None]:
# S3 URI for training data (provided by AWS)
training_data_uri = 's3://sagemaker-sample-data-{}/tensorflow/mnist'.format(region) 

### 4.2 Construct a script for distributed training
This tutorial's training script was adapted from TensorFlow's official [CNN MNIST example](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/tutorials/layers/cnn_mnist.py). We have modified it to handle the `model_dir` parameter passed in by SageMaker. This is an S3 path that can be used for data sharing during distributed training, for checkpointing, and for model persistence. We have also added an argument-parsing function to process training-related variables.

At the end of the training job, we have added a step to export the trained model to the path stored in the environment variable `SM_MODEL_DIR`, which always points to `/opt/ml/model`. This is critical because SageMaker uploads all the model artifacts in this folder to S3 at the end of training.

Here is the entire script:

In [None]:
# TensorFlow 2.1 script
!pygmentize 'mnist-2.py'
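
For orientation, the argument handling described above might look roughly like the following. This is a sketch under assumptions, not the contents of `mnist-2.py`: the function name `parse_args`, the `--sm-model-dir` and `--train` options, and the `--epochs` hyperparameter are illustrative, while `SM_MODEL_DIR` and `SM_CHANNEL_TRAINING` are the environment variables the text refers to.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse the arguments a script-mode training script typically receives."""
    parser = argparse.ArgumentParser()
    # S3 path passed in by SageMaker; usable for checkpoints during training
    parser.add_argument("--model_dir", type=str, default=None)
    # Local folder whose contents SageMaker uploads to S3 when training ends
    parser.add_argument("--sm-model-dir", type=str,
                        default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))
    # Local folder holding the data of the 'training' channel
    parser.add_argument("--train", type=str,
                        default=os.environ.get("SM_CHANNEL_TRAINING"))
    # Hypothetical hyperparameter, included only for illustration
    parser.add_argument("--epochs", type=int, default=1)
    # parse_known_args ignores unrecognized arguments instead of exiting
    args, _unknown = parser.parse_known_args(argv)
    return args
```

Reading the defaults from environment variables keeps the script runnable outside SageMaker as well, because every value can also be supplied on the command line.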

### 4.3 Create a training job using the TensorFlow estimator

The `sagemaker.tensorflow.TensorFlow` estimator handles locating the script mode container, uploading your script to an S3 location, and creating a SageMaker training job. Let's call out a few important parameters here:

`py_version` is set to `'py3'` to indicate that we are using script mode, since legacy mode supports only Python 2. We do not recommend running TensorFlow in legacy mode with Python 2.

`distribution` configures the distributed training setup. It is required only if you are doing distributed training, either across a cluster of instances or across multiple GPUs. Here we use parameter servers as the distributed training scheme. SageMaker training jobs run on homogeneous clusters. To make parameter servers more performant in the SageMaker setup, we run a parameter server on every instance in the cluster, so there is no need to specify the number of parameter servers to launch. Script mode also supports distributed training with Horovod. The SageMaker Python SDK documentation describes how to configure distributions in full.

`instance_type` specifies the EC2 instance type used for training. You should right-size your training instance based on the size of your data, your algorithm, and your tasks. Here we choose `ml.c5.xlarge`. G4dn instances, which feature NVIDIA T4 GPUs and custom Intel Cascade Lake CPUs, are optimized for machine learning inference and small-scale training. See the SageMaker documentation for available instance types and pricing.

`use_spot_instances` (optional): For further cost optimization, you can use managed Amazon EC2 Spot Instances by setting this parameter to `True`. Managed spot training can reduce the cost of training models by up to 90% compared with on-demand instances. SageMaker manages the Spot interruptions on your behalf. You can specify which training jobs use Spot Instances and a stopping condition that specifies how long Amazon SageMaker waits for a job to run using Amazon EC2 Spot Instances. See the managed spot training documentation for details.
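
The spot-related settings can be gathered into a small dictionary and unpacked into the estimator. The sketch below assumes version 2 of the SageMaker Python SDK, where the estimator accepts `use_spot_instances`, `max_run`, `max_wait` and `checkpoint_s3_uri`; the helper `spot_training_kwargs` and the values shown are hypothetical.

```python
def spot_training_kwargs(max_run, max_wait, checkpoint_s3_uri=None):
    """Build estimator keyword arguments for managed spot training.

    max_run:  maximum training time in seconds
    max_wait: maximum total time in seconds, including waiting for Spot capacity
    """
    if max_wait < max_run:
        raise ValueError("max_wait must be greater than or equal to max_run")
    kwargs = {"use_spot_instances": True, "max_run": max_run, "max_wait": max_wait}
    if checkpoint_s3_uri is not None:
        # Checkpoints only help if the training script saves and restores them
        kwargs["checkpoint_s3_uri"] = checkpoint_s3_uri
    return kwargs
```

For example, `TensorFlow(..., **spot_training_kwargs(3600, 7200))` would allow up to one hour of training within a two-hour window.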

To train with a TensorFlow 2.1 script, initialize the estimator with the matching `framework_version`, i.e., `2.1.0`.

In [None]:
from sagemaker.tensorflow import TensorFlow

mnist_estimator = TensorFlow(
    entry_point='mnist-2.py',
    role=role,
    instance_count=2,
    instance_type='ml.c5.xlarge',
    framework_version='2.1.0',
    py_version='py3',
    distribution={'parameter_server': {'enabled': True}}
)
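
The `distribution` argument is a plain dictionary, so switching between the two strategies mentioned above comes down to passing a different value. A minimal sketch, assuming the SageMaker Python SDK convention that Horovod jobs are configured through the `mpi` key; the helper `build_distribution` is a hypothetical name.

```python
def build_distribution(strategy, processes_per_host=1):
    """Return a `distribution` dictionary for the TensorFlow estimator."""
    if strategy == "parameter_server":
        # One parameter server runs on every instance in the cluster
        return {"parameter_server": {"enabled": True}}
    if strategy == "horovod":
        # Horovod is launched through MPI; processes_per_host usually
        # matches the number of GPUs (or worker processes) per instance
        return {"mpi": {"enabled": True, "processes_per_host": processes_per_host}}
    raise ValueError(f"Unknown strategy: {strategy!r}")
```

`build_distribution("parameter_server")` reproduces the value used in the cell above, while `build_distribution("horovod", processes_per_host=2)` would request two Horovod processes per instance.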

### 4.4 Calling fit
To start a training job, we call `estimator.fit(training_data_uri)`.

An S3 location is used here as the input. `fit` creates a default channel named `'training'`, which points to this S3 location. In the training script we can then access the training data from the location stored in `SM_CHANNEL_TRAINING`. `fit` accepts a few other types of input as well. See the `fit` API documentation for details.
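
One of the other input types is a dictionary that maps channel names to S3 locations. Each channel is then exposed to the script through an environment variable named `SM_CHANNEL_<NAME>`, with the channel name in upper case. The sketch below only illustrates that naming; the helper `channel_env_vars` and the S3 URIs are made up for the example.

```python
def channel_env_vars(inputs):
    """Map each input channel to the environment variable the script would read."""
    if isinstance(inputs, str):
        # A bare S3 URI becomes the default 'training' channel
        inputs = {"training": inputs}
    return {name: f"SM_CHANNEL_{name.upper()}" for name in inputs}


# Hypothetical two-channel input, as it could be passed to fit()
example_inputs = {
    "training": "s3://my-bucket/mnist/train",
    "validation": "s3://my-bucket/mnist/validation",
}
print(channel_env_vars(example_inputs))
```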

When training starts, the TensorFlow container executes `mnist-2.py`, passing `hyperparameters` and `model_dir` from the estimator as script arguments. Because we didn't define either in this example, no hyperparameters are passed, and `model_dir` defaults to `s3://<DEFAULT_BUCKET>/<TRAINING_JOB_NAME>`, so the script execution is as follows:

python mnist-2.py --model_dir s3://<DEFAULT_BUCKET>/<TRAINING_JOB_NAME>

When training is complete, the training job uploads the saved model for TensorFlow Serving.

Call `fit` to train the model with the TensorFlow 2.1 script:

In [None]:
mnist_estimator.fit(training_data_uri)

## 5. Deploying an endpoint


### 5.1 Deploy the trained model to an endpoint
The `deploy()` method creates a SageMaker model, which is then deployed to an endpoint to serve prediction requests in real time. We will use the TensorFlow Serving container for the endpoint because we trained with script mode. This serving container runs an implementation of a web server that is compatible with the SageMaker hosting protocol. The "Using your own inference code" documentation explains how SageMaker runs inference containers.

In [None]:
predictor = mnist_estimator.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

### 5.2 Invoke the endpoint
Let's download the training data and use that as input for inference.

In [None]:
import numpy as np

!aws --region {region} s3 cp s3://sagemaker-sample-data-{region}/tensorflow/mnist/train_data.npy train_data.npy
!aws --region {region} s3 cp s3://sagemaker-sample-data-{region}/tensorflow/mnist/train_labels.npy train_labels.npy

train_data = np.load('train_data.npy')
train_labels = np.load('train_labels.npy')

The formats of the input and the output data correspond directly to the request and response formats of the Predict method in the TensorFlow Serving REST API. SageMaker's TensorFlow Serving endpoints can also accept additional input formats that are not part of the TensorFlow Serving REST API, including the simplified JSON format, line-delimited JSON objects ("jsons" or "jsonlines"), and CSV data.
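
These input formats differ mainly in how the same instances are wrapped. The sketch below serializes two made-up feature vectors in three of the formats; the helper names are hypothetical, and the content type that must accompany each payload is not shown.

```python
import json


def to_tfs_rest(instances):
    """TensorFlow Serving REST API request body (row format)."""
    return json.dumps({"instances": instances})


def to_simplified_json(instances):
    """Simplified JSON: the instances without the 'instances' wrapper."""
    return json.dumps(instances)


def to_json_lines(instances):
    """Line-delimited JSON: one instance per line."""
    return "\n".join(json.dumps(instance) for instance in instances)


# Two illustrative instances with two features each
example_instances = [[1.0, 2.0], [3.0, 4.0]]
print(to_tfs_rest(example_instances))
print(to_simplified_json(example_instances))
print(to_json_lines(example_instances))
```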

In this example we use a NumPy array as input, which is serialized into the simplified JSON format. In addition, TensorFlow Serving can process multiple items at once, as you can see in the following code. The SageMaker Python SDK documentation covers how to make predictions against a TensorFlow Serving SageMaker endpoint.

In [None]:
predictions = predictor.predict(train_data[:50])
for i in range(0, 50):
    prediction = np.argmax(predictions['predictions'][i])
    label = train_labels[i]
    print('prediction is {}, label is {}, matched: {}'.format(prediction, label, prediction == label))
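
The per-item printout can be condensed into a single accuracy figure. This is a small sketch in plain Python, assuming the response has the shape used above: a dictionary whose `'predictions'` entry holds one list of class scores per input. The helper `accuracy` is a hypothetical name.

```python
def accuracy(response, labels):
    """Fraction of inputs whose highest-scoring class matches the label."""
    scores = response["predictions"]
    # Index of the largest score in each row (first index wins on ties)
    predicted = [max(range(len(row)), key=row.__getitem__) for row in scores]
    correct = sum(int(p == int(label)) for p, label in zip(predicted, labels))
    return correct / len(predicted)
```

With the variables from the cell above, `accuracy(predictions, train_labels[:50])` would give the share of the 50 samples that were classified correctly.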

### 5.3 Delete the endpoint
Let's delete the endpoint we just created to avoid incurring extra costs, and then [verify](https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-cleanup.html) that it has been removed.

In [None]:
predictor.delete_endpoint()
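
One way to check programmatically that the endpoint is gone is to list endpoints by name with the SageMaker boto3 client. This is a minimal sketch: the helper `endpoint_exists` is a hypothetical name, it inspects only the first page of `list_endpoints` results, and an endpoint may still be listed briefly while its deletion is in progress.

```python
def endpoint_exists(sm_client, endpoint_name):
    """Return True if an endpoint with exactly this name is currently listed."""
    # NameContains matches substrings, so compare the names exactly afterwards
    response = sm_client.list_endpoints(NameContains=endpoint_name)
    return any(e["EndpointName"] == endpoint_name for e in response["Endpoints"])
```

A possible usage is `endpoint_exists(boto3.client('sagemaker'), endpoint_name)`, where `endpoint_name` was read from the predictor (`predictor.endpoint_name` in version 2 of the SageMaker Python SDK).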