# Built-in XGBoost with the Iris dataset

Because this workshop focuses on SageMaker operations rather than on data science, we will use a simple public dataset called Iris to try out different SageMaker features.

The dataset contains 50 records for each of 3 species of Iris (150 records in total) and is usually provided in CSV format. In our case we want to predict the species of an Iris flower by looking at four features:

* Sepal length
* Sepal width
* Petal length
* Petal width

<table align='left'>
    <tr>
        <td>Iris setosa<img src="images/Iris_setosa.jpg" width="140"/></td>
        <td>Iris versicolor<img src="images/Iris_versicolor.jpg" width="200"/></td>
        <td>Iris virginica<img src="images/Iris_virginica.jpg" width="200"/></td>
    </tr>
</table>

We will explore different ways in which we can use SageMaker to train a model based on this dataset and XGBoost.

The first thing to do is to set up a session in order to interact with the SageMaker service. Note that the Studio instance where this notebook is running has an IAM role assigned to it, which we will retrieve in the next code block. This notebook could also be run on your own laptop, in which case you would need an AWS profile set up with the correct credentials to access the Amazon SageMaker service in your AWS account.

In [None]:
import sagemaker
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()

# this will create a 'default' sagemaker bucket if it doesn't exist (sagemaker-region-accountid)
bucket = sagemaker_session.default_bucket()
print(bucket)

# Get the ARN of the IAM role used by this Studio instance to pass to training jobs and other Amazon SageMaker tasks.
role = get_execution_role()
print(role)

Now that we have the role and a default SageMaker bucket for storing our training data, let's get the Iris data from scikit-learn, store it in a pandas DataFrame, and take a first look at it. Later on we will upload it to the default SageMaker bucket.

In [None]:
from sklearn import datasets
import pandas as pd
import numpy as np

iris = datasets.load_iris()

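X = iris.data     # the four features
y = iris.target   # the species, encoded as 0, 1 or 2

# Put the label in the first column, followed by the four features
dataset = np.insert(iris.data, 0, iris.target, axis=1)
df = pd.DataFrame(data=dataset, columns=['iris_id'] + iris.feature_names)
# Add a human-readable species name for each label
df['species'] = df['iris_id'].map(lambda x: 'setosa' if x == 0 else 'versicolor' if x == 1 else 'virginica')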

# Let's have a look at the data
df.head()

In [None]:
# Let's describe some statistics about the data
df.describe()

## Splitting the dataset into Train / Validate / Test

We want to split our dataset into a training, a validation and a test set. The training set is typically the largest, so let's use a 70% / 20% / 10% split. We will write the three sets to local CSV files.

After that, we will upload the training and validation files to our S3 bucket.
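
As a rough, library-free sketch of the arithmetic behind that split: shuffle the row indices, then cut them at 70% and 90% of the row count. The function name `split_indices` and its parameters are hypothetical and only serve this illustration. For the 150 Iris rows this gives 105, 30 and 15 rows.

```python
import random


def split_indices(n_rows, train_cut=0.7, val_cut=0.9, seed=1729):
    """Shuffle row indices and cut them into train / validation / test parts.

    train_cut and val_cut are cumulative fractions: rows before train_cut go
    to training, rows between the two cuts go to validation, the rest to test.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)   # reproducible shuffle
    train_end = int(train_cut * n_rows)    # first cut point
    val_end = int(val_cut * n_rows)        # second cut point
    return indices[:train_end], indices[train_end:val_end], indices[val_end:]


train_idx, val_idx, test_idx = split_indices(150)
print(len(train_idx), len(val_idx), len(test_idx))
```

Every row ends up in exactly one of the three sets, so the test set stays unseen during training and validation.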

In [None]:
# Randomly shuffle the rows, then take the first 70%, the next 20% and the last 10%
shuffled = df.drop('species', axis=1).sample(frac=1, random_state=1729)
train_end, val_end = int(0.7 * len(df)), int(0.9 * len(df))

train_data = shuffled.iloc[:train_end]
validation_data = shuffled.iloc[train_end:val_end]
test_data = shuffled.iloc[val_end:]

train_data.to_csv('iris_train.csv', index=False, header=False)
validation_data.to_csv('iris_val.csv', index=False, header=False)
test_data.to_csv('iris_test.csv', index=False, header=False)
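
The files are written without a header or an index because the built-in XGBoost algorithm expects CSV training data with the label in the first column and no header row. The sketch below shows what one such row looks like, using only the standard library. The helper names `to_training_row` and `rows_to_csv` are hypothetical.

```python
import csv
import io


def to_training_row(label, features):
    """Return one CSV row with the label first, followed by the features."""
    return [label] + list(features)


def rows_to_csv(rows):
    """Serialize rows to CSV text with no header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


# Label 2.0 (virginica) followed by sepal length, sepal width, petal length, petal width
sample_row = to_training_row(2.0, [6.7, 3.1, 5.6, 2.4])
csv_text = rows_to_csv([sample_row])
print(csv_text)
```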


In [None]:
# Upload the dataset to our S3 bucket
input_train = sagemaker_session.upload_data(path='iris_train.csv', key_prefix='iris/data')
input_val = sagemaker_session.upload_data(path='iris_val.csv', key_prefix='iris/data')
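
For a single file, `upload_data` returns an S3 URI of the form `s3://{bucket}/{key_prefix}/{file name}`. The sketch below rebuilds that URI by hand, which is a quick way to check that it matches the paths passed to the training job. The helper `expected_s3_uri` and the bucket name are made up for illustration.

```python
import os


def expected_s3_uri(bucket, key_prefix, local_path):
    """Build the S3 URI that a single-file upload is expected to produce."""
    filename = os.path.basename(local_path)
    return "s3://{}/{}/{}".format(bucket, key_prefix.strip("/"), filename)


# Hypothetical bucket name; yours is whatever default_bucket() returned
example_bucket = "sagemaker-eu-west-1-123456789012"
train_uri = expected_s3_uri(example_bucket, "iris/data", "iris_train.csv")
print(train_uri)
```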

## Train a model on this data with the XGBoost algorithm

In [None]:
import sagemaker
import boto3
from sagemaker import image_uris

# get the URI for the XGBoost container
container_image = sagemaker.image_uris.retrieve(region=boto3.Session().region_name, framework='xgboost', version='latest')

# build a SageMaker estimator class
xgb_estimator = sagemaker.estimator.Estimator(
    container_image,
    role,
    instance_count=1,
    instance_type='ml.m5.large',
    output_path='s3://{}/iris/output'.format(bucket),
    sagemaker_session=sagemaker_session
)

# set the hyperparameters
xgb_estimator.set_hyperparameters(
                        num_class=len(np.unique(y)),
                        silent=0,
                        objective='multi:softmax',
                        num_round=10
)

In [None]:
%%time

s3_input_train = sagemaker.inputs.TrainingInput(s3_data='s3://{}/iris/data/iris_train.csv'.format(bucket), content_type='csv')
s3_input_validation = sagemaker.inputs.TrainingInput(s3_data='s3://{}/iris/data/iris_val.csv'.format(bucket), content_type='csv')
# Now run training against the training and validation sets created above
# You can follow the job's progress in the SageMaker training jobs console
xgb_estimator.fit({
    'train': s3_input_train,
    'validation': s3_input_validation
})

Note in the SageMaker training jobs console that, although it took some time to bootstrap the training instance, you are billed only for the time during which training actually ran.

## Bonus! How to use Spot instances

In [None]:
%%time

import sagemaker
import boto3
from sagemaker import image_uris

# get the URI for the XGBoost container
container_image = sagemaker.image_uris.retrieve(region=boto3.Session().region_name, framework='xgboost', version='latest')

# build a SageMaker estimator class
xgb_estimator = sagemaker.estimator.Estimator(
    container_image,
    role,
    instance_count=1,
    instance_type='ml.m5.large',
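    use_spot_instances=True,  # request managed Spot capacity instead of on-demand capacity
    max_run=900,              # maximum training time, in seconds
    max_wait=900,             # maximum total time (waiting for Spot capacity plus training), in seconds; must be >= max_run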
    output_path='s3://{}/iris/output'.format(bucket),
    sagemaker_session=sagemaker_session
)

# set the hyperparameters
xgb_estimator.set_hyperparameters(
                        num_class=len(np.unique(y)),
                        silent=0,
                        objective='multi:softmax',
                        num_round=10
)

xgb_estimator.fit({
    'train': s3_input_train,
    'validation': s3_input_validation
})
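
At the end of a Spot training job, the logs report the training seconds, the billable seconds and the resulting savings. The sketch below reproduces that calculation from the `TrainingTimeInSeconds` and `BillableTimeInSeconds` fields of a `describe_training_job` response. The function name and the sample numbers are hypothetical and not taken from a real job.

```python
def spot_savings_percent(job_description):
    """Return the Spot savings as a percentage of the training time."""
    training = job_description["TrainingTimeInSeconds"]
    billable = job_description["BillableTimeInSeconds"]
    if training <= 0:
        return 0.0  # nothing ran, so there is nothing to compare
    return (training - billable) / training * 100


# Hypothetical values standing in for a real describe_training_job response
sample_description = {"TrainingTimeInSeconds": 60, "BillableTimeInSeconds": 18}
savings = spot_savings_percent(sample_description)
print("Managed Spot Training savings: {:.1f}%".format(savings))
```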

## Create an endpoint

From the trained model, we will create an endpoint that we can run inference against.

In [None]:
%%time
xgb_predictor = xgb_estimator.deploy(
    initial_instance_count=1,
    instance_type='ml.t2.medium'
)

In [None]:
# Look up the name of the XGBoost model behind the endpoint.
# This relies on the endpoint configuration having the same name as the endpoint.
model_name = boto3.client('sagemaker').describe_endpoint_config(
    EndpointConfigName=xgb_predictor.endpoint_name
)['ProductionVariants'][0]['ModelName']

In [None]:
print("Save (copy and paste) this model name for the next session: {}".format(model_name))

## Run inference with example data

Now that we have an endpoint up, we can run inference by providing data to it. This is done via a signed HTTP POST request, with the data in the body. The two simplest ways to generate that request and get the inference result are illustrated below:

1) With the SageMaker SDK

2) With the generic AWS SDK (in this case boto3, as we are using Python)

In [None]:
# The inference result should be: 2
exampledata = "6.7,3.1,5.6,2.4" 

In [None]:
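# With the SageMaker SDK
from sagemaker.predictor import Predictor
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import BytesDeserializer

# A Predictor attaches to an existing endpoint by its endpoint name (not the model name)
xgb_endpoint = Predictor(
    endpoint_name=xgb_predictor.endpoint_name,
    sagemaker_session=sagemaker_session,
    serializer=CSVSerializer(),
    deserializer=BytesDeserializer()
)
classification = xgb_endpoint.predict(exampledata)

# The response arrives as bytes, so decode it before printing
print("Classified as {} - Should be: 2".format(classification.decode('utf-8')))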

In [None]:
# With boto3
sm = boto3.client('sagemaker-runtime')

resp = sm.invoke_endpoint(
    EndpointName=xgb_predictor.endpoint_name,
    ContentType='text/csv',
    Body=exampledata
)
prediction = float(resp['Body'].read().decode('utf-8'))
print("Classified as {} - Should be: 2".format(prediction))
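
The endpoint returns class indices rather than species names. As a minimal sketch, assuming the response body is plain text containing one numeric class index per input row, separated by commas or newlines, the indices can be mapped back to the species names used earlier in this notebook. The helper `decode_predictions` is hypothetical.

```python
SPECIES = ["setosa", "versicolor", "virginica"]  # index 0, 1, 2 as in the DataFrame above


def decode_predictions(body):
    """Turn a response body such as b'2.0' or '0.0,1.0' into species names."""
    text = body.decode("utf-8") if isinstance(body, bytes) else body
    # Accept either commas or newlines between predictions (assumed formats)
    tokens = text.replace("\n", ",").split(",")
    return [SPECIES[int(float(token))] for token in tokens if token.strip()]


single = decode_predictions(b"2.0")
batch = decode_predictions("0.0,1.0\n2.0")
print(single, batch)
```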

## Clean up

In [None]:
# delete the endpoint

xgb_predictor.delete_endpoint()