# Amazon SageMaker Batch Transform Demo

_**Use SageMaker's XGBoost algorithm to train a binary classification model, then use Batch Transform to predict whether each tumor in a batch file is malignant**_

## Setup

After installing the SageMaker Python SDK, specify:

* The SageMaker role ARN, which must have the AmazonSageMakerFullAccess policy attached.
* The S3 bucket to use for training and storing model objects.

In [1]:
!pip3 install -U sagemaker

Collecting sagemaker
  Downloading sagemaker-2.226.1-py3-none-any.whl.metadata (15 kB)
Collecting boto3<2.0,>=1.34.142 (from sagemaker)
  Downloading boto3-1.34.149-py3-none-any.whl.metadata (6.6 kB)
Collecting botocore<1.35.0,>=1.34.149 (from boto3<2.0,>=1.34.142->sagemaker)
  Downloading botocore-1.34.149-py3-none-any.whl.metadata (5.7 kB)
Downloading sagemaker-2.226.1-py3-none-any.whl (1.5 MB)
Downloading boto3-1.34.149-py3-none-any.whl (139 kB)
Downloading botocore-1.34.149-py3-none-any.whl (12.4 MB)
Installing collected packages: botocore, boto3, sagemaker
  Attempting uninstall: botocore
    F

In [18]:
import os
import boto3
import sagemaker

role = sagemaker.get_execution_role()
sess = sagemaker.Session()
region = sess.boto_region_name

bucket = sess.default_bucket()
prefix = "DEMO-breast-cancer-prediction-xgboost-highlevel"

---
## Data sources

> Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

## Data preparation

Download the dataset from the public SageMaker example bucket, add column names, and save it locally as data.csv.

In [4]:
import pandas as pd
import numpy as np

s3 = boto3.client("s3")

filename = "wdbc.csv"
s3.download_file(
    f"sagemaker-example-files-prod-{region}", "datasets/tabular/breast_cancer/wdbc.csv", filename
)
data = pd.read_csv(filename, header=None)

# specify columns extracted from wdbc.names
data.columns = [
    "id",
    "diagnosis",
    "radius_mean",
    "texture_mean",
    "perimeter_mean",
    "area_mean",
    "smoothness_mean",
    "compactness_mean",
    "concavity_mean",
    "concave points_mean",
    "symmetry_mean",
    "fractal_dimension_mean",
    "radius_se",
    "texture_se",
    "perimeter_se",
    "area_se",
    "smoothness_se",
    "compactness_se",
    "concavity_se",
    "concave points_se",
    "symmetry_se",
    "fractal_dimension_se",
    "radius_worst",
    "texture_worst",
    "perimeter_worst",
    "area_worst",
    "smoothness_worst",
    "compactness_worst",
    "concavity_worst",
    "concave points_worst",
    "symmetry_worst",
    "fractal_dimension_worst",
]

# save the data
data.to_csv("data.csv", sep=",", index=False)

data.sample(8)

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
216,8811523,B,11.89,18.35,77.32,432.2,0.09363,0.1154,0.06636,0.03142,...,13.25,27.1,86.2,531.2,0.1405,0.3046,0.2806,0.1138,0.3397,0.08365
447,9110944,B,14.8,17.66,95.88,674.8,0.09179,0.0889,0.04069,0.0226,...,16.43,22.74,105.9,829.5,0.1226,0.1881,0.206,0.08308,0.36,0.07285
432,908194,M,20.18,19.54,133.8,1250.0,0.1133,0.1489,0.2133,0.1259,...,22.03,25.07,146.0,1479.0,0.1665,0.2942,0.5308,0.2173,0.3032,0.08075
144,869254,B,10.75,14.97,68.26,355.3,0.07793,0.05139,0.02251,0.007875,...,11.95,20.72,77.79,441.2,0.1076,0.1223,0.09755,0.03413,0.23,0.06769
224,8813129,B,13.27,17.02,84.55,546.4,0.08445,0.04994,0.03554,0.02456,...,15.14,23.6,98.84,708.8,0.1276,0.1311,0.1786,0.09678,0.2506,0.07623
411,905520,B,11.04,16.83,70.92,373.2,0.1077,0.07804,0.03046,0.0248,...,12.41,26.44,79.93,471.4,0.1369,0.1482,0.1067,0.07431,0.2998,0.07881
442,90944601,B,13.78,15.79,88.37,585.9,0.08817,0.06718,0.01055,0.009937,...,15.27,17.5,97.9,706.6,0.1072,0.1071,0.03517,0.03312,0.1859,0.0681
207,879830,M,17.01,20.26,109.7,904.3,0.08772,0.07304,0.0695,0.0539,...,19.8,25.05,130.0,1210.0,0.1111,0.1486,0.1932,0.1096,0.3275,0.06469


#### Note:
* The first field is an 'id' attribute that we'll remove before batch inference because it is not useful for inference.
* The second field, 'diagnosis', uses 'M' for malignant and 'B' for benign.
* The remaining 30 numeric features will be used for training and inference.

Let's replace the M/B diagnosis with a 1/0 value (1 for malignant, 0 for benign).

In [5]:
data["diagnosis"] = (data["diagnosis"] == "M").astype(int)
data.sample(8)

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
361,901041,0,13.3,21.57,85.24,546.1,0.08582,0.06373,0.03344,0.02424,...,14.2,29.2,92.94,621.2,0.114,0.1667,0.1212,0.05614,0.2637,0.06658
508,915452,0,16.3,15.7,104.7,819.8,0.09427,0.06712,0.05526,0.04563,...,17.32,17.76,109.8,928.2,0.1354,0.1361,0.1947,0.1357,0.23,0.0723
65,859283,1,14.78,23.94,97.4,668.3,0.1172,0.1479,0.1267,0.09029,...,17.31,33.39,114.6,925.1,0.1648,0.3416,0.3024,0.1614,0.3321,0.08911
504,915186,0,9.268,12.87,61.49,248.7,0.1634,0.2239,0.0973,0.05252,...,10.28,16.38,69.05,300.2,0.1902,0.3441,0.2099,0.1025,0.3038,0.1252
488,913512,0,11.68,16.17,75.49,420.5,0.1128,0.09263,0.04279,0.03132,...,13.32,21.59,86.57,549.8,0.1526,0.1477,0.149,0.09815,0.2804,0.08024
259,88725602,1,15.53,33.56,103.7,744.9,0.1063,0.1639,0.1751,0.08399,...,18.49,49.54,126.3,1035.0,0.1883,0.5564,0.5703,0.2014,0.3512,0.1204
326,89524,0,14.11,12.88,90.03,616.5,0.09309,0.05306,0.01765,0.02733,...,15.53,18.0,98.4,749.9,0.1281,0.1109,0.05307,0.0589,0.21,0.07083
418,906024,0,12.7,12.17,80.88,495.0,0.08785,0.05794,0.0236,0.02402,...,13.65,16.92,88.12,566.9,0.1314,0.1607,0.09385,0.08224,0.2775,0.09464


Split the data as follows (the split is random, so the proportions are approximate):

* 80% for training
* 10% for validation
* 10% for the batch inference job

In addition, remove the 'id' field from the training and validation sets, because 'id' is not a training feature.
Remove the 'diagnosis' attribute from the batch set, because this is what we want to predict.
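
Because the masks are drawn at random, the subsets change on every run. A minimal sketch of the same mask-based split with a seeded generator, assuming a repeatable split is acceptable for the demo; the toy DataFrame and the seed value are illustrative and not part of the original notebook:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the tumor table: an id, a label, and one feature.
toy = pd.DataFrame(
    {
        "id": np.arange(1000),
        "diagnosis": np.arange(1000) % 2,
        "radius_mean": np.linspace(5.0, 25.0, 1000),
    }
)

# A seeded generator makes the split repeatable across runs (the seed is arbitrary).
rng = np.random.default_rng(seed=42)
draws = rng.random(len(toy))

# Same thresholds as the notebook: roughly 80% / 10% / 10%.
train_mask = draws < 0.8
val_mask = (draws >= 0.8) & (draws < 0.9)
batch_mask = draws >= 0.9

toy_train = toy[train_mask].drop(columns=["id"])
toy_val = toy[val_mask].drop(columns=["id"])
toy_batch = toy[batch_mask].drop(columns=["diagnosis"])

# Each row lands in exactly one subset, so the three sizes add up to len(toy).
print(len(toy_train), len(toy_val), len(toy_batch))
```

The thresholds partition the interval [0, 1), so no record can appear in more than one subset.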

In [6]:
# data split in three sets, training, validation and batch inference
rand_split = np.random.rand(len(data))
train_list = rand_split < 0.8
val_list = (rand_split >= 0.8) & (rand_split < 0.9)
batch_list = rand_split >= 0.9

data_train = data[train_list].drop(["id"], axis=1)
data_val = data[val_list].drop(["id"], axis=1)
data_batch = data[batch_list].drop(["diagnosis"], axis=1)
data_batch_noID = data_batch.drop(["id"], axis=1)

Upload the data sets to S3

In [7]:
train_file = "train_data.csv"
data_train.to_csv(train_file, index=False, header=False)
sess.upload_data(train_file, key_prefix="{}/train".format(prefix))

validation_file = "validation_data.csv"
data_val.to_csv(validation_file, index=False, header=False)
sess.upload_data(validation_file, key_prefix="{}/validation".format(prefix))

batch_file = "batch_data.csv"
data_batch.to_csv(batch_file, index=False, header=False)
sess.upload_data(batch_file, key_prefix="{}/batch".format(prefix))

batch_file_noID = "batch_data_noID.csv"
data_batch_noID.to_csv(batch_file_noID, index=False, header=False)
sess.upload_data(batch_file_noID, key_prefix="{}/batch".format(prefix))

's3://sagemaker-us-east-1-548734566896/DEMO-breast-cancer-prediction-xgboost-highlevel/batch/batch_data_noID.csv'

---

## Training job and model creation

Start the training job using both the training set and the validation set.

The model outputs a value between 0 and 1: the predicted probability that a tumor is malignant.

In [8]:
%%time
from time import gmtime, strftime

job_name = "xgb-" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
output_location = "s3://{}/{}/output/{}".format(bucket, prefix, job_name)
image = sagemaker.image_uris.retrieve(
    framework="xgboost", region=boto3.Session().region_name, version="1.7-1"
)

sm_estimator = sagemaker.estimator.Estimator(
    image,
    role,
    instance_count=1,
    instance_type="ml.m5.large",
    volume_size=50,
    input_mode="File",
    output_path=output_location,
    sagemaker_session=sess,
)

sm_estimator.set_hyperparameters(
    objective="binary:logistic",
    max_depth=5,
    eta=0.2,
    gamma=4,
    min_child_weight=6,
    subsample=0.8,
    verbosity=0,
    num_round=100,
)

train_data = sagemaker.inputs.TrainingInput(
    "s3://{}/{}/train".format(bucket, prefix),
    distribution="FullyReplicated",
    content_type="text/csv",
    s3_data_type="S3Prefix",
)
validation_data = sagemaker.inputs.TrainingInput(
    "s3://{}/{}/validation".format(bucket, prefix),
    distribution="FullyReplicated",
    content_type="text/csv",
    s3_data_type="S3Prefix",
)
data_channels = {"train": train_data, "validation": validation_data}

# Start training by calling the fit method in the estimator
sm_estimator.fit(inputs=data_channels, logs=True)

INFO:sagemaker:Creating training-job with name: sagemaker-xgboost-2024-07-27-14-13-30-936


2024-07-27 14:13:31 Starting - Starting the training job...
2024-07-27 14:13:46 Starting - Preparing the instances for training...
2024-07-27 14:14:17 Downloading - Downloading input data......
2024-07-27 14:15:08 Downloading - Downloading the training image......
2024-07-27 14:16:24 Training - Training image download completed. Training in progress.
2024-07-27 14:16:24 Uploading - Uploading generated training model
[2024-07-27 14:16:17.337 ip-10-0-105-4.ec2.internal:7 INFO utils.py:28] RULE_JOB_STOP_SIGNAL_FILENAME: None
[2024-07-27 14:16:17.369 ip-10-0-105-4.ec2.internal:7 INFO profiler_config_parser.py:111] User has disabled profiler.
[2024-07-27:14:16:17:INFO] Imported framework sagemaker_xgboost_container.training
[2024-07-27:14:16:17:INFO] Failed to parse hyperparameter objective value binary:logistic to Json.
Returning the value itself
[2024-07-27:14:16:17:INFO] No GPUs detected (normal if no gpus installed)
[2024-07-27:1

---

## Batch Transform
Instead of deploying an endpoint and running real-time inference, we'll use SageMaker Batch Transform to run inference on an entire data set in one operation. 


#### 1. Create a transform job 


In [9]:
%%time

sm_transformer = sm_estimator.transformer(1, "ml.m5.large")

# start a transform job
input_location = "s3://{}/{}/batch/{}".format(
    bucket, prefix, batch_file_noID
)  # use input data without ID column
sm_transformer.transform(input_location, content_type="text/csv", split_type="Line")
sm_transformer.wait()

INFO:sagemaker:Creating model with name: sagemaker-xgboost-2024-07-27-14-20-52-044
INFO:sagemaker:Creating transform job with name: sagemaker-xgboost-2024-07-27-14-20-52-811


...................................
[2024-07-27:14:26:46:INFO] No GPUs detected (normal if no gpus installed)
[2024-07-27:14:26:46:INFO] No GPUs detected (normal if no gpus installed)
[2024-07-27:14:26:46:INFO] nginx config:
worker_processes auto;
daemon off;
pid /tmp/nginx.pid;
error_log  /dev/stderr;
worker_rlimit_nofile 4096;
events {
  worker_connections 2048;
}
http {
  include /etc/nginx/mime.types;
  default_type application/octet-stream;
  access_log /dev/stdout combined;
  upstream gunicorn {
    server unix:/tmp/gunicorn.sock;
  }
  server {
    listen 8080 deferred;
    client_max_body_size 0;
    keepalive_timeout 3;
    location ~ ^/(ping|invocations|execution-parameters) {
      proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
      proxy_set_header Host $http_host;
      proxy_redirect off;
      proxy_read_timeout 60s;
      proxy_pass http://gunicorn;
    }
    lo

Check the output of the Batch Transform job. It should show the list of probabilities of tumors being malignant.

In [10]:
import re


def get_csv_output_from_s3(s3uri, batch_file):
    file_name = "{}.out".format(batch_file)
    match = re.match("s3://([^/]+)/(.*)", "{}/{}".format(s3uri, file_name))
    output_bucket, output_prefix = match.group(1), match.group(2)
    s3.download_file(output_bucket, output_prefix, file_name)
    return pd.read_csv(file_name, sep=",", header=None)
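
The helper above splits the S3 URI into a bucket and an object key with a regular expression. As an alternative sketch, the same split can be done with the standard library's `urlparse`; the function name `split_s3_uri` is hypothetical, and the example URI is the one printed by the upload cell earlier:

```python
from urllib.parse import urlparse


def split_s3_uri(s3_uri):
    """Return (bucket, key) for a URI of the form s3://bucket/key."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an S3 URI: {}".format(s3_uri))
    # netloc holds the bucket name; the path holds the key with a leading slash.
    return parsed.netloc, parsed.path.lstrip("/")


uri = (
    "s3://sagemaker-us-east-1-548734566896/"
    "DEMO-breast-cancer-prediction-xgboost-highlevel/batch/batch_data_noID.csv"
)
bucket_name, key = split_s3_uri(uri)
print(bucket_name)
print(key)
```

Unlike the regular expression, this version raises a clear error when the string is not an S3 URI instead of failing later on a `None` match.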

In [11]:
output_df = get_csv_output_from_s3(sm_transformer.output_path, batch_file_noID)
output_df.head(8)

Unnamed: 0,0
0,0.788581
1,0.992829
2,0.990109
3,0.98906
4,0.098793
5,0.039862
6,0.985368
7,0.523357
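
The transform output contains probabilities, not class labels. A minimal sketch of turning them into M/B labels, assuming a decision threshold of 0.5; the threshold is an illustrative choice, not a value from this notebook, and a medical use case may call for a different one:

```python
# Probabilities copied from the sample output above.
probabilities = [
    0.788581, 0.992829, 0.990109, 0.98906,
    0.098793, 0.039862, 0.985368, 0.523357,
]

# Assumed decision threshold (not part of the original notebook).
THRESHOLD = 0.5

# Label a tumor malignant ("M") when its probability reaches the threshold.
labels = ["M" if p >= THRESHOLD else "B" for p in probabilities]
print(labels)
```

Lowering the threshold flags more tumors as malignant, trading extra false positives for fewer missed cases.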


#### 2. Join the input and the prediction results 
Now, let's associate the prediction results with their corresponding input records. We can also use __input_filter__ to exclude the ID column, so there's no need for a separate file in S3.

* Set __input_filter__ to "$[1:]": excludes column 0 (the 'ID') before inference and keeps everything from column 1 to the last column (all the features or predictors).

* Set __join_source__ to "Input": joins the input data with the inference results.

* Set __output_filter__ to "$[0,-1]": keeps only column 0 (the 'ID') and the last column (the inference result) in the output.
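
To make the three settings concrete, here is a plain-Python sketch that mimics what they do to a single CSV record. It is an illustration of the data flow only, not how Batch Transform is implemented; `apply_io_filters`, the sample record, and the stub model are hypothetical:

```python
def apply_io_filters(record, predict):
    """Mimic input_filter="$[1:]", join_source="Input", output_filter="$[0,-1]"."""
    fields = record.split(",")
    # input_filter "$[1:]": drop column 0 (the ID) before calling the model.
    model_input = fields[1:]
    # Stand-in for the model invocation.
    prediction = predict(model_input)
    # join_source "Input": the input columns followed by the prediction.
    joined = fields + [str(prediction)]
    # output_filter "$[0,-1]": keep only the ID and the prediction.
    return ",".join([joined[0], joined[-1]])


# Hypothetical record (ID plus three features) and a stub model.
row = "843786,12.45,15.7,82.57"
result = apply_io_filters(row, predict=lambda features: 0.788581)
print(result)  # 843786,0.788581
```

The output filter is applied to the joined record, which is why index 0 still refers to the ID even though the model never saw it.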

In [12]:
# content_type / accept and split_type / assemble_with are required to use IO joining feature
sm_transformer.assemble_with = "Line"
sm_transformer.accept = "text/csv"

# start a transform job
input_location = "s3://{}/{}/batch/{}".format(
    bucket, prefix, batch_file
)  # use input data with the ID column because input_filter will filter it out
sm_transformer.transform(
    input_location,
    split_type="Line",
    content_type="text/csv",
    input_filter="$[1:]",
    join_source="Input",
    output_filter="$[0,-1]",
)
sm_transformer.wait()

INFO:sagemaker:Creating transform job with name: sagemaker-xgboost-2024-07-27-14-28-01-439


...............................
[2024-07-27:14:33:10:INFO] No GPUs detected (normal if no gpus installed)
[2024-07-27:14:33:10:INFO] No GPUs detected (normal if no gpus installed)
[2024-07-27:14:33:10:INFO] nginx config:
worker_processes auto;
daemon off;
pid /tmp/nginx.pid;
error_log  /dev/stderr;
worker_rlimit_nofile 4096;
events {
  worker_connections 2048;
}
http {
  include /etc/nginx/mime.types;
  default_type application/octet-stream;
  access_log /dev/stdout combined;
  upstream gunicorn {
    server unix:/tmp/gunicorn.sock;
  }
  server {
    listen 8080 deferred;
    client_max_body_size 0;
    keepalive_timeout 3;
    location ~ ^/(ping|invocations|execution-parameters) {
      proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
      proxy_set_header Host $http_host;
      proxy_redirect off;
      proxy_read_timeout 60s;
      proxy_pass http://gunicorn;
    }
    locati

Let's inspect the output of the Batch Transform job in S3. It should show the list of tumors, identified by their ID, and their corresponding probabilities of being malignant.

In [13]:
output_df = get_csv_output_from_s3(sm_transformer.output_path, batch_file)
output_df.head(8)

Unnamed: 0,0,1
0,843786,0.788581
1,851509,0.992829
2,852781,0.990109
3,855625,0.98906
4,857156,0.098793
5,858981,0.039862
6,85922302,0.985368
7,861598,0.523357


## Clean up (replace with your model name)
Remember to delete the S3 data and the Jupyter notebook when you are done.
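
One way to remove the demo's S3 data is to delete every object under the demo prefix rather than the whole bucket. A hedged sketch, assuming a boto3 S3 resource is passed in; the helper name `delete_s3_prefix` is hypothetical, and the call is left commented out because it permanently deletes data:

```python
def delete_s3_prefix(s3_resource, bucket_name, key_prefix):
    """Delete every object under key_prefix in bucket_name.

    s3_resource is expected to be a boto3 S3 resource, e.g. boto3.resource("s3").
    """
    objects = s3_resource.Bucket(bucket_name).objects.filter(Prefix=key_prefix)
    return objects.delete()


# Usage inside the notebook, with `bucket` and `prefix` from the setup cell:
# import boto3
# delete_s3_prefix(boto3.resource("s3"), bucket, prefix)
```

Deleting by prefix leaves the rest of the default SageMaker bucket untouched, which matters if other projects share it.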

In [22]:
# Replace with your model name
model = "sagemaker-xgboost-2024-07-27-14-20-52-044"

In [24]:
sagemaker_client = boto3.client("sagemaker")
sagemaker_client.delete_model(ModelName=model)

{'ResponseMetadata': {'RequestId': '5023d5b0-0470-42df-9f55-6be86877bbb1',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '5023d5b0-0470-42df-9f55-6be86877bbb1',
   'content-type': 'application/x-amz-json-1.1',
   'date': 'Mon, 29 Jul 2024 09:36:59 GMT',
   'content-length': '0'},
  'RetryAttempts': 0}}