# Iris Dataset Prediction using Amazon SageMaker XGBoost

> The free tier resources are sufficient for this hands-on notebook.  
> Use `ml.t3.medium` for notebooks, and `ml.m4.xlarge` for training and inference.

First, we import some libraries and load the public dataset from `scikit-learn`.

In [None]:
import pandas as pd
import numpy as np

from sklearn import datasets
from sklearn.model_selection import train_test_split

In [None]:
# Load dataset
iris = datasets.load_iris()

# Print the label species (setosa, versicolor, virginica)
print(iris.target_names)

# Print the names of the four features
print(iris.feature_names)

Identify the features and the class label, then perform a train-test split.  
For simplicity, this demo does not use a validation set.

In [None]:
data = pd.DataFrame({
    'sepal length': iris.data[:,0],
    'sepal width': iris.data[:,1],
    'petal length': iris.data[:,2],
    'petal width': iris.data[:,3],
    'species': iris.target
})

data

In [None]:
# Features and labels
X = data[['sepal length', 'sepal width', 'petal length', 'petal width']]  
y = data['species']  

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
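The split above is random, so the resulting accuracy varies from run to run. As a minimal sketch, a reproducible and class-balanced split could look like the following. The names `X_demo`, `y_demo`, `X_tr`, `X_te`, `y_tr`, `y_te` and the seed `42` are illustrative assumptions, not part of the original notebook.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the features and integer labels directly as arrays
X_demo, y_demo = load_iris(return_X_y=True)

# random_state fixes the shuffle (42 is an arbitrary choice);
# stratify keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Count how many samples of each species ended up in the test split
test_class_counts = Counter(y_te.tolist())
print(len(X_tr), len(X_te))
print(sorted(test_class_counts.items()))
```

With 150 samples and `test_size=0.3`, this yields 105 training rows and 45 test rows, and stratification gives each species the same share of the test set.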

In [None]:
# Build train and test frames with the label in the first column; only the training set is written to CSV
train = pd.concat([pd.Series(y_train, index=X_train.index, name='species', dtype='int'), X_train], axis=1)
test = pd.concat([pd.Series(y_test, index=X_test.index, name='species', dtype='int'), X_test], axis=1)

train.to_csv('train.csv', index=False, header=False)
train.head()
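For CSV training data, the SageMaker built-in XGBoost algorithm expects the label in the first column and no header row, which is why the cell above puts `species` first and passes `header=False`. The sketch below checks that layout on a few hypothetical rows using only the standard library. The helper `check_training_csv` and the sample values are assumptions made for illustration.

```python
import csv
import io

# Hypothetical rows in the same layout as train.csv: label first, then four features
rows = [
    (0, 5.1, 3.5, 1.4, 0.2),
    (1, 7.0, 3.2, 4.7, 1.4),
    (2, 6.3, 3.3, 6.0, 2.5),
]

buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows)
csv_text = buffer.getvalue()


def check_training_csv(text, num_classes=3, num_features=4):
    """Return True if every row is `label, f1..fN` with a valid integer class label."""
    for record in csv.reader(io.StringIO(text)):
        if len(record) != num_features + 1:
            return False
        label = record[0]
        if not label.isdigit() or int(label) >= num_classes:
            return False
        try:
            # All remaining columns must parse as numbers
            [float(value) for value in record[1:]]
        except ValueError:
            return False
    return True


layout_ok = check_training_csv(csv_text)
# A header row makes the first "label" non-numeric, so the check rejects it
header_rejected = not check_training_csv("species,a,b,c,d\n" + csv_text)
print(layout_ok, header_rejected)
```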

Upload the training data to an S3 bucket and confirm that the upload succeeded.  
The training job will later read its data from this bucket.

In [None]:
import sagemaker, boto3

bucket = sagemaker.Session().default_bucket()
prefix = 'IrisDataset'

# Build the S3 key with "/" explicitly; os.path.join would use "\" on Windows
boto3.Session().resource('s3').Bucket(bucket).Object(
    f'{prefix}/data/train.csv').upload_file('train.csv')

In [None]:
!aws s3 ls {bucket}/{prefix}/data --recursive

You can check the details of your SageMaker session below.

In [None]:
import sagemaker

region = sagemaker.Session().boto_region_name
print(f'AWS Region name : {region}')

role = sagemaker.get_execution_role()
print(f'Role ARN (Amazon Resource Name) : {role}')

We instantiate XGBoost, one of SageMaker's built-in algorithms, because it works well on tabular data such as the Iris dataset.  
We then set a few hyperparameters and submit a training job.

> Amazon SageMaker provides a suite of built-in algorithms, pre-trained models, and pre-built solution templates to help data scientists and machine learning practitioners get started on training and deploying machine learning models quickly.

If you would like to find out more about XGBoost, you may read the article <a href="https://machinelearningmastery.com/gentle-introduction-xgboost-applied-machine-learning/">here</a>.

In [None]:
from sagemaker.session import TrainingInput

s3_output_location = 's3://{}/{}/{}'.format(bucket, prefix, 'xgboostModel')
container = sagemaker.image_uris.retrieve('xgboost', region, 'latest')

xgboostModel = sagemaker.estimator.Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type='ml.m4.xlarge',
    output_path=s3_output_location,
    sagemaker_session=sagemaker.Session()
)

xgboostModel.set_hyperparameters(    
    objective='multi:softmax',
    num_class=3,
    num_round=100
)
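With `objective='multi:softmax'`, XGBoost returns one predicted class index per row, whereas `multi:softprob` would return the per-class probabilities instead. Conceptually, the per-class scores are turned into probabilities with a softmax and the most probable class is reported. The following plain-Python sketch illustrates that idea only; the score values are hypothetical and this is not XGBoost's internal code.

```python
import math


def softmax(scores):
    """Convert raw per-class scores into probabilities that sum to 1."""
    highest = max(scores)
    # Subtract the maximum before exponentiating for numerical stability
    exps = [math.exp(score - highest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical raw scores for setosa, versicolor, virginica
raw_scores = [0.2, 2.1, -0.5]

probs = softmax(raw_scores)  # what multi:softprob would report
predicted_class = max(range(len(probs)), key=probs.__getitem__)  # what multi:softmax reports

print([round(p, 3) for p in probs])
print(predicted_class)
```

Because `num_class=3`, each prediction is one of the indices 0, 1, or 2, matching the order of `iris.target_names`.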

In [None]:
## Training takes about 5 minutes

from sagemaker.session import TrainingInput

training_input = TrainingInput('s3://{}/{}/{}'.format(bucket, prefix, 'data/train.csv'), content_type='csv')
xgboostModel.fit({'train': training_input}, wait=True)

Model training is complete! Now we need to deploy the model to an endpoint so that we can run inference against it and test its accuracy.

In [None]:
## Deploying to an endpoint takes about 5 minutes

import sagemaker
from sagemaker.serializers import CSVSerializer

xgb_predictor = xgboostModel.deploy(
    initial_instance_count=1,
    instance_type='ml.m4.xlarge',
    serializer=CSVSerializer()
)

xgb_predictor.endpoint_name

The cells below send the test set to the endpoint and compute the accuracy. The model should do well on this small dataset.

In [None]:
test_data_array = test.drop(['species'], axis=1).values 
predictions = xgb_predictor.predict(test_data_array).decode('utf-8')

predictions

In [None]:
from sklearn.metrics import accuracy_score

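# The response is a comma-separated string; parse all of it
# (slicing off the first character would corrupt the first prediction)
y_pred = np.fromstring(predictions, sep=',')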
accuracy_score(y_test, y_pred)
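To make the parsing and scoring step concrete, here is a small standard-library sketch. It assumes the endpoint returns a comma-separated string of class indices, as the cell above expects; the exact response format may differ between container versions. The response string, the labels, and the helper names are hypothetical.

```python
def parse_predictions(response_text):
    """Turn a comma-separated response such as '0.0,2.0,1.0' into integer class indices."""
    values = response_text.strip().split(",")
    return [int(float(value)) for value in values if value.strip()]


def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted class equals the true class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for truth, guess in zip(y_true, y_pred) if truth == guess)
    return correct / len(y_true)


# Hypothetical endpoint response and ground-truth labels
response_text = "0.0,2.0,1.0,1.0,0.0"
true_labels = [0, 2, 1, 2, 0]

predicted = parse_predictions(response_text)
score = accuracy(true_labels, predicted)
print(predicted, score)
```

Here four of the five hypothetical predictions match, so the accuracy is 0.8.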

We now proceed to create our Lambda function and API Gateway for the rest of this hands-on session.

<br><hr><br>

Do remember to clean up your resources when the workshop ends! (Uncomment the lines below)

In [None]:
# xgb_predictor.delete_endpoint(delete_endpoint_config=True)

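# Caution: this deletes every object in the default bucket, not only this notebook's files.
# To remove only this notebook's files, use .objects.filter(Prefix=prefix).delete() instead.
# bucket_to_delete = boto3.resource('s3').Bucket(bucket)
# bucket_to_delete.objects.all().delete()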