# SageMaker Script Mode Usage (ML - XGBoost)

Script mode is a useful technique that lets you run your existing code in Amazon SageMaker with very few code changes. This time, we will tackle a simple machine learning problem (predicting bank term deposit subscriptions) with XGBoost.

The same approach works for other frameworks (TensorFlow, MXNet, PyTorch, etc.).

The list of built-in algorithms supported by SageMaker, along with their parameters, is available [here](https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html).

In [None]:
import os, sys, time
import numpy as np 
import pandas as pd
import boto3, sagemaker

sess   = sagemaker.Session()
bucket = sess.default_bucket()                     
prefix = 'xgboost-scriptmode'
region = boto3.Session().region_name
role = 'arn:aws:iam::570447867175:role/SageMakerNotebookRole' # replace with your own IAM role ARN

print('SageMaker version :', sagemaker.__version__)
print('SageMaker session :', sess)
print('S3 bucket :', bucket)
print('Prefix :', prefix)
print('Region selected :', region)
print('IAM role :', role)

# 1. Load Data

> We will use the same bank deposit dataset as in the Autopilot example.

Download the direct marketing dataset from UCI's ML Repository.
The data is related to the direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (bank term deposit) would be subscribed ('yes') or not ('no').

[Download Dataset Manually here](https://archive.ics.uci.edu/ml/datasets/bank+marketing)
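
If you downloaded the archive by hand, a small helper can unpack it so that the CSV path used in the next cell exists. This is only a minimal sketch: the archive name `bank-additional.zip` and the helper `extract_dataset` are assumptions for illustration, not part of the original notebook.

```python
import zipfile
from pathlib import Path


def extract_dataset(zip_path, dest="."):
    """Unpack a manually downloaded archive and list the CSV files under dest."""
    dest = Path(dest)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
    # Return CSV paths relative to dest so they are easy to read and compare
    return sorted(p.relative_to(dest).as_posix() for p in dest.rglob("*.csv"))


if __name__ == "__main__":
    # Assumed archive name; adjust it to whatever file you downloaded
    archive_path = Path("bank-additional.zip")
    if archive_path.exists():
        print(extract_dataset(archive_path))
    else:
        print("Archive not found:", archive_path)
```

The returned list should include the file that `pd.read_csv` expects, so a quick look at it tells you whether the relative path needs adjusting.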

In [None]:
pd.set_option('display.max_columns', 500)     # Make sure we can see all of the columns
pd.set_option('display.max_rows', 50)         # Keep the output on one page

# Read data
data = pd.read_csv('./bank-additional/bank-additional-full.csv', sep=';')

# 2. PreProcess the Data

In [None]:
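# Replace dots in strings and column names with underscores
# (regex=True is required in recent pandas versions, where str.replace defaults to literal matching)
data.columns = data.columns.str.replace(r'\.', '_', regex=True)
data.replace(to_replace=r'\.', value='_', inplace=True, regex=True)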

# One-hot encode
data = pd.get_dummies(data)
data = data.drop(['y_no'], axis=1)

# Move labels to first column, which is what SageMaker Model Monitor expects
data = pd.concat([data['y_yes'], data.drop(['y_yes'], axis=1)], axis=1)

# Split into training and validation (95/5)
train_data, val_data, _ = np.split(
    data.sample(frac=1, random_state=123),
    [int(0.95 * len(data)), int(len(data))]
)

# Save to CSV files
train_data.to_csv('training.csv', index=False, header=True, sep=',') # Need to keep column names
val_data.to_csv('validation.csv', index=False, header=True, sep=',')

In [None]:
train_data[:5]

In [None]:
output = "s3://{}/{}/output/".format(bucket,prefix)
print(output)

# 3. Provide your own training script (.py file)

In [None]:
# Look at the structure of the Python file before passing it to the SageMaker training job
!pygmentize xgb.py
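
The contents of `xgb.py` are not reproduced in this notebook, so here is a hedged sketch of what a script-mode entry point could look like. It assumes the hyperparameters arrive as command-line flags and that the channel and model paths come from the `SM_CHANNEL_TRAINING`, `SM_CHANNEL_VALIDATION`, and `SM_MODEL_DIR` environment variables. The `--num-round` flag, the `xgb.model` file name, and the function names `parse_args` and `train` are illustrative assumptions; your actual script may differ.

```python
import argparse
import os


def parse_args(argv=None):
    """Read hyperparameters from the command line and paths from SageMaker env vars."""
    parser = argparse.ArgumentParser()
    # Hyperparameters set on the estimator arrive as flags, e.g. --max-depth 5
    parser.add_argument("--max-depth", type=int, default=5)
    parser.add_argument("--eval-metric", type=str, default="error")
    parser.add_argument("--num-round", type=int, default=100)  # hypothetical extra flag
    # Paths provided by the training container; the defaults allow a local dry run
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR", "model"))
    parser.add_argument("--training-dir", default=os.environ.get("SM_CHANNEL_TRAINING", "."))
    parser.add_argument("--validation-dir", default=os.environ.get("SM_CHANNEL_VALIDATION", "."))
    args, _ = parser.parse_known_args(argv)  # ignore flags this sketch does not know
    return args


def train(args):
    # Imported lazily so the argument parsing can be tried without these packages
    import pandas as pd
    import xgboost as xgb

    train_df = pd.read_csv(os.path.join(args.training_dir, "training.csv"))
    val_df = pd.read_csv(os.path.join(args.validation_dir, "validation.csv"))
    # The label column y_yes was moved to the front during preprocessing
    dtrain = xgb.DMatrix(train_df.drop(columns=["y_yes"]), label=train_df["y_yes"])
    dval = xgb.DMatrix(val_df.drop(columns=["y_yes"]), label=val_df["y_yes"])

    params = {
        "objective": "binary:logistic",
        "max_depth": args.max_depth,
        "eval_metric": args.eval_metric,
    }
    booster = xgb.train(params, dtrain, num_boost_round=args.num_round,
                        evals=[(dval, "validation")])

    # Everything written to the model directory is packaged as the model artifact
    os.makedirs(args.model_dir, exist_ok=True)
    booster.save_model(os.path.join(args.model_dir, "xgb.model"))


def model_fn(model_dir):
    """Load the model for serving; the file name must match the one used in train()."""
    import xgboost as xgb

    booster = xgb.Booster()
    booster.load_model(os.path.join(model_dir, "xgb.model"))
    return booster


if __name__ == "__main__":
    # Train only when the training channel variable is set, i.e. inside the container
    if "SM_CHANNEL_TRAINING" in os.environ:
        train(parse_args())
```

Because argparse converts `--max-depth` into `args.max_depth`, the hyphenated hyperparameter names used on the estimator map directly onto the parsed arguments.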

# 4. Train the model

Now, when training the model, you can bring your own script and hand it to the SageMaker training instance.

The training data is read from the S3 bucket and the computation runs on a SageMaker-managed EC2 instance; only the training script comes from your local machine.

In [None]:
training = sess.upload_data(path="training.csv", key_prefix=prefix + "/training")
validation = sess.upload_data(path="validation.csv", key_prefix=prefix + "/validation")
print(training)
print(validation)

In [None]:
from sagemaker.xgboost import XGBoost

xgb_estimator = XGBoost(entry_point='xgb.py',                 # Your own training script
                          role=role,                          # Your IAM role for SageMaker
                          train_instance_count=1,             # Renamed to instance_count in SageMaker SDK v2
                          train_instance_type='ml.m4.xlarge', # Renamed to instance_type in SageMaker SDK v2
                          framework_version='0.90-2',         # For supported versions, refer to the AWS documentation linked above
                          py_version='py3',
                          output_path=output,
                          hyperparameters={                   # Passed to the script; use your own values once you have tuned them (e.g. by grid search)
                              'max-depth': 5,
                              'eval-metric': 'error'
                          }
                       )

In [None]:
xgb_estimator.fit({'training':training, 'validation':validation}) 
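
The keys of the dictionary passed to `fit` are channel names. In the default File input mode, SageMaker copies each channel into the training container under `/opt/ml/input/data/<channel>` and exposes that path through an environment variable named `SM_CHANNEL_<CHANNEL>`. The helper `channel_paths` below is hypothetical and only illustrates this naming; it is not part of the SageMaker SDK.

```python
def channel_paths(channels):
    """Map each channel name to the env var and container path the script would see."""
    return {
        name: (f"SM_CHANNEL_{name.upper()}", f"/opt/ml/input/data/{name}")
        for name in channels
    }


for name, (env_var, path) in channel_paths(["training", "validation"]).items():
    print(f"{name}: {env_var} -> {path}")
```

This is why a training script can locate `training.csv` and `validation.csv` by reading those variables rather than hard-coding S3 paths.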

# 5. Deploy the model

In [None]:
xgb_endpoint_name = prefix + '-' + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
print('endpoint_name :', xgb_endpoint_name) 

xgb_predictor = xgb_estimator.deploy(
                     initial_instance_count=1, 
                     instance_type='ml.m4.xlarge',
                     endpoint_name=xgb_endpoint_name)

# 6. Predict samples from the validation set

In [None]:
smrt = boto3.client('sagemaker-runtime')

# Predict samples from the validation set
payload = val_data[:100].drop(['y_yes'], axis=1) 
payload = payload.to_csv(header=False, index=False).rstrip()

print(payload)

In [None]:
response = smrt.invoke_endpoint(
    EndpointName=xgb_endpoint_name,
    Body=payload.encode('utf8'),
    ContentType='text/csv')

print(response['Body'].read())
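
The raw response is a byte string, so it usually needs to be parsed before it can be compared with the labels. The sketch below assumes the endpoint returns one score per sample, separated by commas or newlines; the actual format depends on your inference script. The helper `parse_predictions` and the 0.5 threshold are illustrative assumptions.

```python
import re


def parse_predictions(body, threshold=0.5):
    """Turn a response body of delimited scores into floats and 0/1 labels."""
    text = body.decode("utf-8") if isinstance(body, bytes) else body
    # Accept commas, whitespace, or brackets as separators
    scores = [float(tok) for tok in re.split(r"[\s,\[\]]+", text) if tok]
    labels = [int(score >= threshold) for score in scores]
    return scores, labels


# Example with a made-up response body, not real endpoint output
scores, labels = parse_predictions(b"0.12,0.87\n0.45")
print(scores, labels)
```

Note that `response['Body']` is a stream that can be read only once, so store the result of `.read()` in a variable if you want to both print and parse it.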

# 7. Delete the SageMaker endpoint

To avoid being charged after training is finished and the endpoint has been created, we have to **delete** the endpoint.

In [None]:
sess.delete_endpoint(endpoint_name=xgb_endpoint_name)