# Demystifying AWS SageMaker Training for Sklearn Lovers 
> This post is about using AWS SageMaker to train and deploy models.

- toc: true 
- badges: true
- comments: true
- categories: [aws, ml, sagemaker]
- keyword: [aws, ml, sagemaker]
- image: images/copied_from_nb/images/2022-06-08-sagemaker-training-overview.jpeg

![](images/2022-06-08-sagemaker-training-overview.jpeg)

# Environment

This notebook is prepared with Amazon SageMaker Studio using `Python 3 (Data Science)` Kernel and `ml.t3.medium` instance.

# About

This post is about understanding the end-to-end machine learning workflow for AWS SageMaker. We will apply the SageMaker built-in [Linear Learner](https://docs.aws.amazon.com/sagemaker/latest/dg/linear-learner.html) algorithm to the [Kaggle Boston Housing dataset](https://www.kaggle.com/c/boston-housing). Our goal will be to understand all the steps involved in training a model with SageMaker.

# Introduction

A typical SageMaker machine learning flow has the following steps. If we have a good understanding of them then we can use this approach to train any model with SageMaker.
1. **Put Data on S3 Bucket**
    In most of the use cases, you will keep your training data on S3 bucket. You may also need to preprocess your data and for this, you can use [SageMaker Data Wrangler](https://hassaanbinaslam.github.io/myblog/aws/ml/sagemaker/2022/05/17/aws-sagemaker-wrangler-p1.html). In this post, we will consider that data has already been processed and is ready for training.
2. **Configure the Training Job**
   While configuring a training job you need to take care of the following requirements
   a. select the algorithm you want to use for training
   b. set the hyperparameters (if any)
   c. define the infrastructure requirements like how many CPUs or GPUs you want to throw at your training run
3. **Launch Training Job**
   Tell your training job where the input data is located and where the output artifacts should be stored once training is complete. Once input and output are configured, you can start the training run. SageMaker will then automatically provision the required infrastructure and terminate it once training is complete, so you are only billed for what you have used.
4. **Deploy model and make predictions**
   Deploy the model to make real-time HTTPS predictions. Again, you need to define the infrastructure requirements where you want your model to be deployed.
5. **Clean Up (Optional)**
   If you are experimenting, you may want to terminate the machine on which you have deployed your model for testing purposes to avoid unnecessary charges.

# Put Data on S3 Bucket

## Reading and Checking the Data

In this post we will be using the [Boston Housing Dataset](https://www.kaggle.com/c/boston-housing). This is a small dataset with 506 rows; the copy used here has 13 columns (12 features and the target). `medv` is the target variable, which stands for the median value of owner-occupied homes in $1000s. This dataset is also available with this notebook, so let's read it and see how it looks.

In [2]:
import pandas as pd
import numpy as np

data_location = "./datasets/2022-06-08-sagemaker-training-overview/"

df = pd.read_csv(data_location + 'housing.csv')
df.head()

Unnamed: 0,crim,zn,indus,chas,nox,age,rm,dis,rad,tax,ptratio,lstat,medv
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296.0,15.3,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242.0,17.8,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242.0,17.8,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222.0,18.7,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222.0,18.7,5.33,36.2


Let's quickly check the dimensions of our loaded dataset.

In [3]:
df.shape

(506, 13)

The good thing about this dataset is that it does not require any preprocessing: all the features are already in numerical format (no categorical features), and there are no missing values. We can quickly verify these assumptions as well.

Check the dataframe feature types.

In [4]:
df.dtypes

crim       float64
zn         float64
indus      float64
chas         int64
nox        float64
age        float64
rm         float64
dis        float64
rad          int64
tax        float64
ptratio    float64
lstat      float64
medv       float64
dtype: object

Check if any value is missing in our dataset.

In [5]:
df.isnull().values.any()

False

## Preparing the Data

We have already decided that we will be using the SageMaker Linear Learner algorithm. Our data is almost ready for training; we only need to arrange it in the format the algorithm expects. According to the Linear Learner documentation:

For training, the linear learner algorithm supports both recordIO-wrapped protobuf and CSV formats. For the application/x-recordio-protobuf input type, only Float32 tensors are supported. For the text/csv input type, the first column is assumed to be the label, which is the target variable for prediction. You can use either File mode or Pipe mode to train linear learner models on data that is formatted as recordIO-wrapped-protobuf or as CSV.

Since we will be using CSV, we need to move the target column `medv` to the first position.

In [26]:
df = pd.concat([df['medv'], df.drop('medv', axis='columns')], axis='columns')

df.head()

Unnamed: 0,medv,crim,zn,indus,chas,nox,age,rm,dis,rad,tax,ptratio,lstat
0,24.0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296.0,15.3,4.98
1,21.6,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242.0,17.8,9.14
2,34.7,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242.0,17.8,4.03
3,33.4,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222.0,18.7,2.94
4,36.2,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222.0,18.7,5.33


Alright, our data is now ready for training, so we can proceed with splitting it into train and validation sets. After splitting, we will also export both sets as CSV files (without index or header) so that we can upload them to S3 in the next section.

In [28]:
from sklearn.model_selection import train_test_split

# our data size is very small so we will use a small test set
training_data, validation_data = train_test_split(df, test_size=0.1, random_state=42)

training_data.to_csv(data_location + "training_data.csv", index=False, header=False)
validation_data.to_csv(data_location + "validation_data.csv", index=False, header=False)
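
Before uploading, it can be worth confirming that the exported files match what the algorithm expects: no header row, only numeric values, and the same number of columns in every row. The sketch below is one way to do that with the standard library. `check_training_csv` is a hypothetical helper name (it is not part of SageMaker or pandas), and the two sample rows are copied from the dataframe preview above.

```python
import csv
import io


def check_training_csv(text):
    """Return (n_rows, n_columns) if every cell is numeric and all rows have equal width.

    Hypothetical helper for illustration; raises ValueError otherwise.
    """
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    if not rows:
        raise ValueError("file is empty")
    width = len(rows[0])
    for line_no, row in enumerate(rows, start=1):
        if len(row) != width:
            raise ValueError(f"line {line_no}: expected {width} columns, got {len(row)}")
        for cell in row:
            try:
                float(cell)
            except ValueError:
                raise ValueError(f"line {line_no}: non-numeric value {cell!r}") from None
    return len(rows), width


# Two rows in the layout written by to_csv(index=False, header=False):
# label (medv) first, then the 12 features.
sample = (
    "24.0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296.0,15.3,4.98\n"
    "21.6,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242.0,17.8,9.14\n"
)
n_rows, n_columns = check_training_csv(sample)
print(n_rows, n_columns)
```

To check a real file, read its contents and pass the text to the helper. A file that still has a header row would fail with a "non-numeric value" error.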

The next step is to upload this data to an S3 bucket, and for this we will take the help of the SageMaker Python SDK. There are two Python SDKs (Software Development Kits) available for SageMaker.

1. **SageMaker Python SDK**. It provides a high-level API, and you can do more with fewer lines of code.
2. **AWS SDK for Python (Boto3)**. It provides low-level access to the SageMaker APIs.

We will be using the **SageMaker Python SDK** for this post. You will see that its interface is similar to scikit-learn, which makes it a more natural choice for data scientists. The SageMaker Python SDK documentation is super helpful, and it provides many examples that explain how its interface works. Make sure that you check it out as well: `https://sagemaker.readthedocs.io/en/stable/`. If you don't have much time, I would suggest at least reading about the following functions in the documentation, as we will be using them in the next sections.
* [Initialize a SageMaker Session](https://sagemaker.readthedocs.io/en/stable/api/utility/session.html#sagemaker.session.Session)
* [Upload local file or directory to S3](https://sagemaker.readthedocs.io/en/stable/api/utility/session.html#sagemaker.session.Session.upload_data)
* [default_bucket](https://sagemaker.readthedocs.io/en/stable/api/utility/session.html#sagemaker.session.Session.default_bucket)
* [Create an Amazon SageMaker training job](https://sagemaker.readthedocs.io/en/stable/api/utility/session.html#sagemaker.session.Session.train)
* [A generic Estimator to train using any supplied algorithm](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.Estimator)

Since we are already running this notebook from a SageMaker environment, we don't need to care about credentials and permissions. We can simply start a new session with the SageMaker environment.

In [29]:
import sagemaker

session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = session.default_bucket()
region = session.boto_region_name

print(f"sagemaker.__version__: {sagemaker.__version__}")
print(f"Session: {session}")
print(f"Role: {role}")
print(f"Bucket: {bucket}")
print(f"Region: {region}")

sagemaker.__version__: 2.88.1
Session: <sagemaker.session.Session object at 0x7fe4f3cf6bd0>
Role: arn:aws:iam::801598032724:role/service-role/AmazonSageMaker-ExecutionRole-20220516T161743
Bucket: sagemaker-us-east-1-801598032724
Region: us-east-1


What we have done here is:
* imported the SageMaker Python SDK into our runtime
* created a session to work with the SageMaker API and other AWS services
* retrieved the execution role associated with the user profile. It is the same profile that is available to the user to work from the console UI and has the `AmazonSageMakerFullAccess` policy attached to it.
* created a default bucket (if it did not already exist) and got its name. The default bucket name has the format `sagemaker-{region}-{account_id}`. You may use any other bucket in its place, provided you have read and write permissions on it.
* retrieved the name of the AWS region the session is attached to
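
To make the bucket naming convention concrete, the sketch below rebuilds the default bucket name from the role ARN and region printed above. It only does string handling and makes no AWS calls. `default_bucket_name` is a hypothetical helper for illustration; in practice `session.default_bucket()` remains the right way to get the name.

```python
def default_bucket_name(role_arn, region):
    """Build 'sagemaker-{region}-{account_id}' from an IAM role ARN (illustrative helper)."""
    # ARN layout: arn:partition:service:region:account-id:resource
    account_id = role_arn.split(":")[4]
    return f"sagemaker-{region}-{account_id}"


role_arn = "arn:aws:iam::801598032724:role/service-role/AmazonSageMaker-ExecutionRole-20220516T161743"
bucket_name = default_bucket_name(role_arn, "us-east-1")
print(bucket_name)
```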

Next, we will use this sagemaker session to upload data to our default bucket. 

In [30]:
# key prefix (folder) used to properly organize our data in the bucket.
# You may choose any other prefix for your bucket.
bucket_prefix = '2022-06-08-sagemaker-training-overview'

Let's upload our training data first. In the output we will get the complete path (S3 URI) for our uploaded data.

In [31]:
s3_train_data_path = session.upload_data(
    path=data_location + "training_data.csv",
    bucket=bucket,
    key_prefix=bucket_prefix + '/input/training'
)

print(s3_train_data_path)

s3://sagemaker-us-east-1-801598032724/2022-06-08-sagemaker-training-overview/input/training/training_data.csv


Let's do the same for our test (or validation) data.

In [32]:
s3_validation_data_path = session.upload_data(
    path=data_location + "validation_data.csv",
    bucket=bucket,
    key_prefix=bucket_prefix + '/input/validation_data'
)

print(s3_validation_data_path)

s3://sagemaker-us-east-1-801598032724/2022-06-08-sagemaker-training-overview/input/validation_data/validation_data.csv
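
The returned URIs follow a predictable pattern: bucket, then key prefix, then the local file name. The sketch below reproduces that pattern with plain string handling. `expected_s3_uri` is a hypothetical helper for illustration and is not how the SDK computes the path internally.

```python
import posixpath


def expected_s3_uri(bucket, key_prefix, local_path):
    """Compose s3://{bucket}/{key_prefix}/{file name} for a single uploaded file."""
    file_name = posixpath.basename(local_path)
    return f"s3://{bucket}/{key_prefix}/{file_name}"


uri = expected_s3_uri(
    bucket="sagemaker-us-east-1-801598032724",
    key_prefix="2022-06-08-sagemaker-training-overview/input/training",
    local_path="./datasets/2022-06-08-sagemaker-training-overview/training_data.csv",
)
print(uri)
```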


At this point we have our data placed on S3 bucket. We can now proceed to the next step and configure our training job.

# Configure the Training Job

In this section we will first retrieve the Docker container that is relevant to our training algorithm. Then we will create a `sagemaker.estimator.Estimator` object. This object provides a high-level API to control end-to-end SageMaker training and deployment tasks. For the estimator we will also define our infrastructure and hyperparameter requirements. So let's get started.


## Finding the right docker container

AWS SageMaker built-in algorithms are fully managed containers that can be accessed with one call. Each algorithm has a separate container, and the container also depends on the region in which you want to run your training instance. Getting the container URL is not a problem as long as we know the region and the algorithm framework name. We already have the region name from our SageMaker session. To get the algorithm framework name, visit the [AWS Docker Registry Paths page](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-algo-docker-registry-paths.html). From this page select your region. In my case it is `us-east-1`. On the [regional docker registry page](https://docs.aws.amazon.com/sagemaker/latest/dg/ecr-us-east-1.html) find the algorithm you want to use (Linear Learner in our case). This will give you the example code and algorithm framework name as shown below.

![linear-learner-framework-name](images/2022-06-08-sagemaker-training-overview/linear-learner-framework-name.png)

So let's use the provided sample code to get the container url for our linear learner algorithm.

In [33]:
from sagemaker import image_uris

image_uri = image_uris.retrieve(framework='linear-learner', region=region)

print(image_uri)

382416733822.dkr.ecr.us-east-1.amazonaws.com/linear-learner:1
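
The returned string is an ordinary container registry path, and splitting it apart shows where the region and the framework name end up. The sketch below does this with a regular expression. The pattern and the `parse_image_uri` helper are my own illustration, not part of the SageMaker SDK. Note that the account ID in the URI belongs to the registry hosting the algorithm image, which is different from the account ID in our role and bucket name.

```python
import re

# Pattern for a registry path of the form
# {account}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}
ECR_URI_PATTERN = re.compile(
    r"^(?P<account>\d+)\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com/"
    r"(?P<repository>[^:]+):(?P<tag>.+)$"
)


def parse_image_uri(image_uri):
    """Split a container image URI into its parts (illustrative helper)."""
    match = ECR_URI_PATTERN.match(image_uri)
    if match is None:
        raise ValueError(f"unexpected image URI format: {image_uri!r}")
    return match.groupdict()


parts = parse_image_uri("382416733822.dkr.ecr.us-east-1.amazonaws.com/linear-learner:1")
print(parts)
```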


## Configure the Estimator Object

For configuring our estimator we can use the following information.
* Define the output path where we want to store the trained model artifacts
* Since we have a small dataset and a fairly simple model, a small machine should suffice. `ml.m5.large` will do. It is a general-purpose instance with 2 vCPUs and 8 GiB RAM.
* For the hyperparameters check the [Linear Learner model documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/ll_hyperparameters.html). From the documentation we find that the important parameters for our problem are
   * predictor_type: which should be 'regressor' in our case
   * mini_batch_size: default is 1000 which is too large for our small dataset. Let's use 30 instead
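
A quick back-of-the-envelope check shows why the default batch size does not fit our data. The sketch below works out the split sizes and counts how many complete mini-batches fit into one pass over the training set. It assumes that `train_test_split` rounds the test set size up and gives the remaining rows to training; `full_batches_per_epoch` is a hypothetical helper, and how Linear Learner treats a partial batch is not covered here.

```python
import math

n_samples = 506   # rows in the dataset
test_size = 0.1   # fraction passed to train_test_split

# Assumption: the test set size is rounded up and the rest goes to training.
n_validation = math.ceil(test_size * n_samples)
n_train = n_samples - n_validation


def full_batches_per_epoch(n_rows, mini_batch_size):
    """Number of complete mini-batches that fit into one pass over the data."""
    return n_rows // mini_batch_size


print(n_train, n_validation)
print(full_batches_per_epoch(n_train, 1000))  # default mini_batch_size
print(full_batches_per_epoch(n_train, 30))    # value used in this post
```

With the default of 1000 not even one complete batch can be formed from the training rows, while a batch size of 30 gives several batches per epoch.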

It is also important to note that the Estimator class will automatically provision a separate `ml.m5.large` machine to start the training run. This machine will be different from the one on which we are running this Jupyter notebook. Once training is complete, that machine will be terminated and we will be billed only for the time we have used it. This AWS SageMaker approach is really useful: we can keep small, less powerful machines for running Jupyter notebooks, and for training and other heavy workloads we can provision separate machines for short durations and avoid unnecessary bills.

In [34]:
# define the output path to store trained model artifacts
s3_output_path = f"s3://{bucket}/{bucket_prefix}/output/"

print(s3_output_path)

s3://sagemaker-us-east-1-801598032724/2022-06-08-sagemaker-training-overview/output/


In [35]:
from sagemaker.estimator import Estimator

ll_estimator = Estimator(
    image_uri = image_uri, # algorithm container
    role = role, # execution role with necessary permissions
    instance_count = 1,
    instance_type = 'ml.m5.large',
    sagemaker_session = session, # SageMaker API session
    output_path = s3_output_path, # training artifacts output path
    hyperparameters = {
        'predictor_type': 'regressor',
        'mini_batch_size': 30
    }
)

In the above cell we have defined the hyperparameters within the `Estimator` object constructor. There is a second way to pass the hyperparameters to the Estimator object: the `set_hyperparameters` method. This method can be useful when we have a large number of hyperparameters or want to change them between multiple training runs.

```
ll_estimator.set_hyperparameters(
    predictor_type='regressor', 
    mini_batch_size=30)
```

You might ask why we have not selected `ml.t3.medium` or `ml.c5.large`, since even those machines should suffice for our problem. The answer is that AWS SageMaker at this time supports only a limited set of instance types for training, and neither of these two is among them. If you configure the Estimator object with one of these instance types, you will get an error like the one shown below.

```
An error occurred (ValidationException) when calling the CreateTrainingJob operation: 1 validation error detected: Value 'ml.t3.medium' at 'resourceConfig.instanceType' failed to satisfy constraint: Member must satisfy enum value set: [ml.p2.xlarge, ml.m5.4xlarge, ml.m4.16xlarge, ml.p4d.24xlarge, ml.g5.2xlarge, ml.c5n.xlarge, ml.p3.16xlarge, ml.m5.large, ml.p2.16xlarge, ml.g5.4xlarge, ml.c4.2xlarge, ml.c5.2xlarge, ml.c4.4xlarge, ml.g5.8xlarge, ml.c5.4xlarge, ml.c5n.18xlarge, ml.g4dn.xlarge, ml.g4dn.12xlarge, ml.c4.8xlarge, ml.g4dn.2xlarge, ml.c5.9xlarge, ml.g4dn.4xlarge, ml.c5.xlarge, ml.g4dn.16xlarge, ml.c4.xlarge, ml.g4dn.8xlarge, ml.g5.xlarge, ml.c5n.2xlarge, ml.g5.12xlarge, ml.g5.24xlarge, ml.c5n.4xlarge, ml.c5.18xlarge, ml.p3dn.24xlarge, ml.g5.48xlarge, ml.g5.16xlarge, ml.p3.2xlarge, ml.m5.xlarge, ml.m4.10xlarge, ml.c5n.9xlarge, ml.m5.12xlarge, ml.m4.xlarge, ml.m5.24xlarge, ml.m4.2xlarge, ml.p2.8xlarge, ml.m5.2xlarge, ml.p3.8xlarge, ml.m4.4xlarge]
```

## Start the training run

To start our training run we need to associate our Estimator object with the training data. Our data is available in the S3 bucket, but we also need to tell the Estimator object in which format the data is provided. Is it in CSV format? Is it compressed or not? For these data format questions we can use `Input Channels`. An input channel holds the configuration for S3 and file system data sources for AWS SageMaker ([check the docs](https://sagemaker.readthedocs.io/en/stable/api/utility/inputs.html)). All SageMaker built-in algorithms require at least one training channel, and more can be passed for validation and testing. In our case we have two channels and both are in CSV format. So, let's configure them.

In [36]:
from sagemaker.session import TrainingInput

train_input = TrainingInput(s3_train_data_path, content_type="text/csv")
validation_input = TrainingInput(s3_validation_data_path, content_type="text/csv")


ll_data = {
    'train': train_input,
    'validation': validation_input
}

Make sure that you use the content_type `text/csv`. Providing only `csv` will not work, and you will get the following exception:

`Error for Training job linear-learner-2022-06-15-07-58-01-908: Failed. Reason: ClientError: No iterator has been registered for ContentType ('csv', '1.0'), exit code: 2`
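
To see what a channel boils down to, the sketch below builds one channel entry in the shape that the low-level `CreateTrainingJob` API uses for its input data configuration, and rejects a content type such as `csv` up front. `channel_config` is a hypothetical helper, not something the SDK exposes, and the set of accepted content types is an assumption based on the two training formats quoted from the Linear Learner documentation earlier.

```python
def channel_config(name, s3_uri, content_type):
    """Build one input-channel entry in the shape used by the CreateTrainingJob API.

    Illustrative helper only. The accepted content types are an assumption
    taken from the two training formats listed for Linear Learner.
    """
    accepted = {"text/csv", "application/x-recordio-protobuf"}
    if content_type not in accepted:
        raise ValueError(
            f"unsupported content type {content_type!r}; expected one of {sorted(accepted)}"
        )
    return {
        "ChannelName": name,
        "ContentType": content_type,
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }
        },
    }


train_channel = channel_config(
    "train",
    "s3://sagemaker-us-east-1-801598032724/2022-06-08-sagemaker-training-overview/input/training/training_data.csv",
    "text/csv",
)
print(train_channel["ChannelName"], train_channel["ContentType"])
```

Calling `channel_config("train", some_uri, "csv")` raises a `ValueError` locally, which is cheaper than finding out from a failed training job.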

Alright, we are ready, so let's start our training run. To start the training run, call the estimator's `fit` method and pass the data input channels. You can read more about the `fit` call in the docs: [sagemaker.estimator.Estimator.fit](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.Estimator.fit)

In [62]:
ll_estimator.fit(ll_data)

2022-06-15 10:19:35 Starting - Starting the training job...
2022-06-15 10:19:51 Starting - Preparing the instances for trainingProfilerReport-1655288375: InProgress
......
2022-06-15 10:21:04 Downloading - Downloading input data......
2022-06-15 10:21:50 Training - Downloading the training image...
2022-06-15 10:22:35 Training - Training image download completed. Training in progress..
Docker entrypoint called with argument(s): train
Running default environment configuration script
[06/15/2022 10:22:43 INFO 140271381174080] Reading default configuration from /opt/amazon/lib/python3.7/site-packages/algorithm/resources/default-input.json: {'mini_batch_size': '1000', 'epochs': '15', 'feature_dim': 'auto', 'use_bias': 'true', 'binary_classifier_model_selection_criteria': 'accuracy', 'f_beta': '1.0', 'target_recall': '0.8', 'target_precision': '0.8', 'num_models': 'auto', 'num_calibration_samples': '10000000', 'init_method': 'uniform', 'init_scale': '0.07', 'init_sigma

So the training is done, but it has also generated a lot of logs in the output. These logs are also available in CloudWatch, and if you want you can disable them in the `fit` call using the `logs='None'` parameter. Let us try to analyze what information is presented in these logs.

The first part of these logs is related to infrastructure provisioning, downloading the training container, downloading the data, and starting the training.

The second part of these logs is related to training performance metrics. This part also makes up the bulk of the logs.

The third and final part tells us about the training job status and the billable time in seconds.

In our case the billable seconds are 127. This billable time is for the `ml.m5.large` instance that we have configured for our training run. To find the billable amount we first need to find the price rate for our selected machine. For this go to the SageMaker pricing page https://aws.amazon.com/sagemaker/pricing/ and select `On Demand Pricing`. From the given tabs click on the `Training` tab. This will show you the pricing of the different training instances. Pricing also varies between regions, so we need to use the correct region, `US East (N. Virginia)`. This page can also be used to find the available training instance types. The price for our instance type `ml.m5.large` is $0.115. This rate is per hour, so we convert it to a per-second rate and then multiply it by our billable seconds, i.e. (127/3600) * 0.115 = $0.0041, which is less than a penny.
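
The same arithmetic as a small function, in case you want to plug in other instance prices or run times. `training_cost` is a hypothetical helper, and the rate is the on-demand price quoted above, which may have changed since.

```python
def training_cost(billable_seconds, hourly_rate_usd):
    """Convert billable seconds and an hourly on-demand rate into a dollar amount."""
    return billable_seconds / 3600 * hourly_rate_usd


cost = training_cost(billable_seconds=127, hourly_rate_usd=0.115)
print(f"${cost:.4f}")
```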

In [77]:
ll_estimator.output_path

's3://sagemaker-us-east-1-801598032724/2022-06-08-sagemaker-training-overview/output/'

In [76]:
ll_estimator.latest_training_job.job_name

'linear-learner-2022-06-15-10-19-35-649'

In [78]:
ll_estimator.model_data

's3://sagemaker-us-east-1-801598032724/2022-06-08-sagemaker-training-overview/output/linear-learner-2022-06-15-10-19-35-649/output/model.tar.gz'

In [79]:
ll_estimator.latest_training_job.describe()

{'TrainingJobName': 'linear-learner-2022-06-15-10-19-35-649',
 'TrainingJobArn': 'arn:aws:sagemaker:us-east-1:801598032724:training-job/linear-learner-2022-06-15-10-19-35-649',
 'ModelArtifacts': {'S3ModelArtifacts': 's3://sagemaker-us-east-1-801598032724/2022-06-08-sagemaker-training-overview/output/linear-learner-2022-06-15-10-19-35-649/output/model.tar.gz'},
 'TrainingJobStatus': 'Completed',
 'SecondaryStatus': 'Completed',
 'HyperParameters': {'mini_batch_size': '30', 'predictor_type': 'regressor'},
 'AlgorithmSpecification': {'TrainingImage': '382416733822.dkr.ecr.us-east-1.amazonaws.com/linear-learner:1',
  'TrainingInputMode': 'File',
  'MetricDefinitions': [{'Name': 'train:progress',
    'Regex': '#progress_metric: host=\\S+, completed (\\S+) %'},
   {'Name': 'validation:mae',
    'Regex': '#quality_metric: host=\\S+, validation mae <loss>=(\\S+)'},
   {'Name': 'train:objective_loss',
    'Regex': '#quality_metric: host=\\S+, epoch=\\S+, train \\S+_objective <loss>=(\\S+)'},
 

The logs for this training job are also available on CloudWatch.
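
The `MetricDefinitions` in the job description above pair each metric name with a regular expression. As far as I understand it, SageMaker applies these expressions to the training logs to pull out the metric values. The sketch below applies the `validation:mae` expression to a single log line. The log line is made up for illustration and is not taken from the actual training output.

```python
import re

# Regex copied from the 'validation:mae' entry in MetricDefinitions above.
VALIDATION_MAE = re.compile(r"#quality_metric: host=\S+, validation mae <loss>=(\S+)")

# Made-up log line for illustration only; not taken from the real training logs.
log_line = "#quality_metric: host=algo-1, validation mae <loss>=3.25"

match = VALIDATION_MAE.search(log_line)
validation_mae = float(match.group(1)) if match else None
print(validation_mae)
```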

# Deploy the model

Our model is trained, and now we can deploy it to an AWS SageMaker endpoint. Read more about [sagemaker.estimator.Estimator.deploy](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.Estimator.deploy).

In [83]:
endpoint_name = ll_estimator.latest_training_job.job_name

In [84]:
ll_predictor = ll_estimator.deploy(
    initial_instance_count=1, 
    instance_type='ml.t2.medium',
    endpoint_name=endpoint_name
)

# if no endpoint name is given, the training job name will be used
# check the pricing page for the cost and the available machines (e.g. the ml.t2.medium price)

----------!

In [80]:
test_sample = '0.00632,18.00,2.310,0,0.5380,6.5750,65.20,4.0900,1,296.0,15.30,4.98'

Similar to the data input channels that tell the Estimator object how to read the data from the S3 bucket, we need to tell our `predictor` object how to send (serialize) the input and how to read (deserialize) the response during inference. For this we will use the [sagemaker.serializers.CSVSerializer](https://sagemaker.readthedocs.io/en/stable/api/inference/serializers.html#sagemaker.serializers.CSVSerializer) object, which serializes data of various formats to a CSV-formatted string, together with `CSVDeserializer` for the response. We could also pass the serializers in our `deploy` call.

In [87]:
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import CSVDeserializer

ll_predictor.serializer = CSVSerializer()
ll_predictor.deserializer = CSVDeserializer()

Note that we could have passed the serializer and deserializer directly in the `deploy` call instead:

```
ll_predictor = ll_estimator.deploy(
    initial_instance_count=1, 
    instance_type='ml.t2.medium',
    endpoint_name=endpoint_name,
    serializer = CSVSerializer(),
    deserializer = CSVDeserializer()
)
```
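
To get a feel for what the serializer and deserializer do, here is a rough stand-in written with the standard library. It is a simplified sketch and not the SDK's implementation; `to_csv_payload` and `from_csv_payload` are hypothetical names, and the response body used below is an assumed example of what an endpoint might send back.

```python
import csv
import io


def to_csv_payload(values):
    """Join a flat sequence of values into one CSV line (rough stand-in for serialization)."""
    return ",".join(str(v) for v in values)


def from_csv_payload(body):
    """Parse a CSV response body into a list of rows, each a list of strings."""
    return [row for row in csv.reader(io.StringIO(body)) if row]


payload = to_csv_payload([0.00632, 18.0, 2.31, 0])
rows = from_csv_payload("29.98671531677246\n")  # assumed response body
print(payload)
print(rows)
```

Note that the deserialized values are strings, so they need to be converted with `float` before doing any arithmetic on them.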

In [88]:
ll_predictor.predict(test_sample)

[['29.98671531677246']]

The model predicts a `medv` of about 29.99. Since `medv` is expressed in $1000s, this corresponds to a median house value of roughly $29,987.
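
The prediction comes back as a nested list of strings, so a small conversion step is needed before using it as a number. A minimal sketch, with the prediction copied from the output above:

```python
# Output of ll_predictor.predict(test_sample), copied from the cell above.
prediction = [['29.98671531677246']]

# medv is expressed in thousands of dollars, so scale by 1000.
predicted_medv = float(prediction[0][0])
predicted_price_usd = predicted_medv * 1000
print(f"${predicted_price_usd:,.0f}")
```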

In [91]:
# Cleanup

In [92]:
ll_predictor.delete_endpoint()