# (Advanced) Amazon SageMaker AutoPilot (Hyperparameter tuning)

The data loading and setup procedure is identical to the basic SageMaker AutoPilot model notebook.

Please refer to that notebook (Amazon SageMaker AutoPilot (model building to deploy)) for details.

In [1]:
import sagemaker
import smdebug_rulesconfig as rule_configs
import boto3
import os
import sys
import numpy as np 
import pandas as pd

print(sagemaker.__version__)

1.64.0


In [42]:
sess   = sagemaker.Session()
bucket = sess.default_bucket()                     
prefix = 'sagemaker/autopilot'
region = boto3.Session().region_name
role = 'arn:aws:iam::570447867175:role/SageMakerNotebookRole' # pass your IAM role ARN

print('Sagemaker session :', sess)
print('S3 bucket :', bucket)
print('Prefix :', prefix)
print('Region selected :', region)
print('IAM role :', role)

Sagemaker session : <sagemaker.session.Session object at 0x000002289785B908>
S3 bucket : hyunjoon-sagemaker-us
Prefix : sagemaker/autopilot
Region selected : us-west-2
IAM role : arn:aws:iam::570447867175:role/SageMakerNotebookRole


# 1. Load Data

Download the direct marketing dataset from UCI's ML Repository.
The data relates to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed ('yes') or not ('no').

[Download Dataset Manually here](https://archive.ics.uci.edu/ml/datasets/bank+marketing)

or try:

#### Wget Download
!wget -N https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank-additional.zip
!unzip -o bank-additional.zip

In [8]:
pd.set_option('display.max_columns', 500)     # Make sure we can see all of the columns
pd.set_option('display.max_rows', 50)         # Keep the output on one page

# Read data
path = 'C:\\Users\\bokhy\\Desktop\\Python\\github\\amazon-sagemaker\\data'

data = pd.read_csv(os.path.join(path, 'bank-additional-full.csv'), sep=';')
data[:5] # Show the first 5 lines

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,261,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,149,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,226,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,151,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,307,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no


# 2. EDA

- We see a class imbalance in the response variable (y), which can strongly affect prediction accuracy during training

In [9]:
one_class = data[data['y']=='yes']
one_class_count = one_class.shape[0]
print("Positive samples: %d" % one_class_count)

zero_class = data[data['y']=='no']
zero_class_count = zero_class.shape[0]
print("Negative samples: %d" % zero_class_count)

zero_to_one_ratio = zero_class_count/one_class_count
print("Ratio: %.2f" % zero_to_one_ratio)

Positive samples: 4640
Negative samples: 36548
Ratio: 7.88
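
One common way to compensate for an imbalance like this in XGBoost is the `scale_pos_weight` hyperparameter, which is often started near the negative-to-positive ratio. The sketch below only recomputes that ratio from the counts printed above. Applying it is an assumption on our part; the notebook itself does not set this hyperparameter.

```python
# Counts copied from the output above.
positive_samples = 4640
negative_samples = 36548

# Negative-to-positive ratio: a common starting value for XGBoost's
# scale_pos_weight hyperparameter (an assumption, not used in this notebook).
scale_pos_weight = negative_samples / positive_samples

# Share of positive rows, useful as a baseline when reading accuracy numbers.
positive_share = positive_samples / (positive_samples + negative_samples)

print("scale_pos_weight candidate: %.2f" % scale_pos_weight)
print("Share of positive samples: %.1f%%" % (100 * positive_share))
```

A model that always predicts 'no' would already be right for most rows, which is one reason the tuning job later in this notebook optimizes AUC rather than accuracy.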


# 3. PreProcess the Data

1. Many rows have the value "999" in the 'pdays' column, which very likely indicates that no contact was made before. So let's create a new column called "no_previous_contact" and set it to "1" when pdays is 999 and "0" otherwise.

In [10]:
[np.min(data['pdays']), np.max(data['pdays'])] 

[0, 999]

In [11]:
data['no_previous_contact'] = np.where(data['pdays'] == 999, 1, 0)
data = data.drop(['pdays'], axis=1)

2. Generate a new column, "not_working", that flags customers who are not working (student, retired, or unemployed), based on the "job" column.

In [12]:
data['job'].value_counts()

admin.           10422
blue-collar       9254
technician        6743
services          3969
management        2924
retired           1720
entrepreneur      1456
self-employed     1421
housemaid         1060
unemployed        1014
student            875
unknown            330
Name: job, dtype: int64

In [13]:
data['not_working'] = np.where(data['job'].isin(['student', 'retired', 'unemployed']), 1, 0)

In [14]:
data['not_working'].value_counts()

0    37579
1     3609
Name: not_working, dtype: int64

3. Convert categorical columns to numeric columns

In [16]:
model_data = pd.get_dummies(data)  # Convert categorical variables to sets of indicators
model_data[:10]

Unnamed: 0,age,duration,campaign,previous,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,no_previous_contact,not_working,job_admin.,job_blue-collar,job_entrepreneur,job_housemaid,job_management,job_retired,job_self-employed,job_services,job_student,job_technician,job_unemployed,job_unknown,marital_divorced,marital_married,marital_single,marital_unknown,education_basic.4y,education_basic.6y,education_basic.9y,education_high.school,education_illiterate,education_professional.course,education_university.degree,education_unknown,default_no,default_unknown,default_yes,housing_no,housing_unknown,housing_yes,loan_no,loan_unknown,loan_yes,contact_cellular,contact_telephone,month_apr,month_aug,month_dec,month_jul,month_jun,month_mar,month_may,month_nov,month_oct,month_sep,day_of_week_fri,day_of_week_mon,day_of_week_thu,day_of_week_tue,day_of_week_wed,poutcome_failure,poutcome_nonexistent,poutcome_success,y_no,y_yes
0,56,261,1,0,1.1,93.994,-36.4,4.857,5191.0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,1,0
1,57,149,1,0,1.1,93.994,-36.4,4.857,5191.0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,1,0
2,37,226,1,0,1.1,93.994,-36.4,4.857,5191.0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,1,0
3,40,151,1,0,1.1,93.994,-36.4,4.857,5191.0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,1,0
4,56,307,1,0,1.1,93.994,-36.4,4.857,5191.0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,1,0
5,45,198,1,0,1.1,93.994,-36.4,4.857,5191.0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,1,0
6,59,139,1,0,1.1,93.994,-36.4,4.857,5191.0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,1,0
7,41,217,1,0,1.1,93.994,-36.4,4.857,5191.0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,1,0
8,24,380,1,0,1.1,93.994,-36.4,4.857,5191.0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,1,0
9,25,50,1,0,1.1,93.994,-36.4,4.857,5191.0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,1,0


4. Split the data

In [17]:
train_data, validation_data, test_data = np.split(model_data.sample(frac=1, random_state=123), 
                                                  [int(0.7 * len(model_data)), int(0.9*len(model_data))])  
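
The same 70/20/10 split can also be written with plain `iloc` slicing, which may be easier to read and does not rely on `np.split` accepting a DataFrame. This is a minimal sketch on a toy frame; `toy` and its values are hypothetical stand-ins for `model_data`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for model_data (hypothetical values).
toy = pd.DataFrame({"x": np.arange(100), "y_yes": np.arange(100) % 2})

# Shuffle once with a fixed seed, then cut at 70% and 90% of the rows.
shuffled = toy.sample(frac=1, random_state=123)
train_end = int(0.7 * len(shuffled))
validation_end = int(0.9 * len(shuffled))

train_part = shuffled.iloc[:train_end]
validation_part = shuffled.iloc[train_end:validation_end]
test_part = shuffled.iloc[validation_end:]

print(len(train_part), len(validation_part), len(test_part))
```

Because the three slices come from one shuffled frame, every row lands in exactly one of the sets.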

In [19]:
# Drop the two columns for 'yes' and 'no' and add 'yes' back as first column of the dataframe
pd.concat([train_data['y_yes'], train_data.drop(['y_no', 'y_yes'], axis=1)], axis=1).to_csv(os.path.join(path, 'train.csv'), index=False, header=False)
pd.concat([validation_data['y_yes'], validation_data.drop(['y_no', 'y_yes'], axis=1)], axis=1).to_csv(os.path.join(path, 'validation.csv'), index=False, header=False)
pd.concat([test_data['y_yes'], test_data.drop(['y_no', 'y_yes'], axis=1)], axis=1).to_csv(os.path.join(path, 'test.csv'), index=False, header=False)

# Dropping the target value, as we will use this CSV file for batch transform
test_data.drop(['y_no', 'y_yes'], axis=1).to_csv('test.csv', index=False, header=False)
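
The built-in XGBoost algorithm expects CSV training data with the label in the first column and no header row, which is why the cell above reorders the columns. The sketch below shows that layout on a two-row toy frame; the values are hypothetical and the CSV is written to an in-memory buffer rather than to disk.

```python
import io

import pandas as pd

# Toy one-hot-encoded frame (hypothetical values) with both target indicators.
toy = pd.DataFrame({"age": [56, 37], "y_no": [1, 0], "y_yes": [0, 1]})

# Move the label to the front and drop the redundant indicator columns.
label_first = pd.concat(
    [toy["y_yes"], toy.drop(["y_no", "y_yes"], axis=1)], axis=1
)

# Write without index or header, as in the cell above.
buffer = io.StringIO()
label_first.to_csv(buffer, index=False, header=False)
csv_text = buffer.getvalue()
print(csv_text)
```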

# 4. Upload data to S3 bucket

First, upload the data to the specified S3 bucket.

In [44]:
# Build S3 keys with '/' explicitly; os.path.join would insert '\' on Windows
boto3.Session().resource('s3').Bucket(bucket).Object('{}/train/train.csv'.format(prefix)).upload_file(os.path.join(path, 'train.csv'))
boto3.Session().resource('s3').Bucket(bucket).Object('{}/validation/validation.csv'.format(prefix)).upload_file(os.path.join(path, 'validation.csv'))
boto3.Session().resource('s3').Bucket(bucket).Object('{}/test/test.csv'.format(prefix)).upload_file(os.path.join(path, 'test.csv'))
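
S3 object keys use forward slashes, while `os.path.join` uses the local path separator, which is a backslash on Windows. A small helper built on `posixpath.join` keeps the keys consistent on any operating system. The helper name `s3_key` is hypothetical, and the sketch only builds strings, so it runs without AWS credentials.

```python
import posixpath


def s3_key(prefix, *parts):
    """Join key parts with forward slashes, independent of the local OS."""
    return posixpath.join(prefix, *parts)


prefix = "sagemaker/autopilot"
for name in ("train", "validation", "test"):
    print(s3_key(prefix, name, name + ".csv"))
```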

Then, SageMaker needs to know where the training and validation sets are located, so let's define that.

In [45]:
s3_input_train = sess.upload_data(path="./data/train.csv", key_prefix=prefix + "/input")
s3_input_validation = sess.upload_data(path="./data/validation.csv", key_prefix=prefix + "/input")
print(s3_input_train)

s3_data = {'train': s3_input_train, 'validation': s3_input_validation}

s3://sagemaker-us-west-2-570447867175/sagemaker/autopilot/input/train.csv


In [34]:
s3_input_train = sagemaker.s3_input(s3_data='s3://{}/{}/train'.format(bucket, prefix), content_type='csv')
s3_input_validation = sagemaker.s3_input(s3_data='s3://{}/{}/validation/'.format(bucket, prefix), content_type='csv')
s3_data = {'train': s3_input_train, 'validation': s3_input_validation}



# 5. Launching an Automatic Model Tuning job (Select Algorithm / Hyperparameter)

You should be able to monitor the jobs in real time in your AWS console. Please navigate to:
### AWS Console --> Training --> Training Jobs

In [46]:
from sagemaker.amazon.amazon_estimator import get_image_uri

from sagemaker.estimator import Estimator
# https://sagemaker.readthedocs.io/en/stable/estimators.html

# Specify Docker Container
container = get_image_uri(region, 'xgboost', repo_version='1.0-1')

xgb = Estimator(
    
    container,                                               # The container that contains the algorithm (XGBoost)
    role=role,                                               # Your IAM role for Sagemaker
    sagemaker_session=sess,                                  # session
                                    
    input_mode='File',                                       # Copy the dataset and then train
    output_path='s3://{}/{}/output'.format(bucket, prefix),  # Save the model here
                                    
    train_instance_count=1,                                  # Instance requirements
    train_instance_type='ml.m4.xlarge',                      # Instance type
                                    
    train_use_spot_instances=True,                           # Use a spot instance
    train_max_run=300,                                       # Max training time
    train_max_wait=600,                                      # Max training time + spot waiting time
)

xgb.set_hyperparameters(objective='binary:logistic', 
                        num_round=100,
                        early_stopping_rounds=10)



We will use SageMaker tuning to automate the search process. Specifically, we specify a range (or, for categorical hyperparameters, a list of possible values) for each hyperparameter that we plan to tune. SageMaker hyperparameter tuning will automatically launch multiple training jobs with different hyperparameter settings, evaluate the results of those training jobs based on a predefined "objective metric", and select the hyperparameter settings for future attempts based on previous results.

For each hyperparameter tuning job, we give it a budget (the maximum number of training jobs), and it completes once that many training jobs have been executed.

We will tune 4 hyperparameters in this XGBoost problem.

The full set of tunable hyperparameters is listed [here](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost-tuning.html)
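
To make the idea of a job budget concrete, here is a toy tuning loop that runs locally. It is only an illustration under stated assumptions: it uses plain random search, whereas SageMaker's default strategy is Bayesian and uses earlier results to choose later settings, and `toy_objective` is a made-up stand-in for a real validation metric. The ranges mirror the four hyperparameters tuned in this notebook.

```python
import random

# Ranges for the four hyperparameters tuned in this notebook.
ranges = {
    "eta": (0.0, 1.0),
    "min_child_weight": (1.0, 10.0),
    "alpha": (0.0, 2.0),
    "max_depth": (1, 10),
}


def sample_settings(rng):
    """Draw one candidate: floats for continuous ranges, an int for max_depth."""
    return {
        "eta": rng.uniform(*ranges["eta"]),
        "min_child_weight": rng.uniform(*ranges["min_child_weight"]),
        "alpha": rng.uniform(*ranges["alpha"]),
        "max_depth": rng.randint(*ranges["max_depth"]),  # inclusive bounds
    }


def toy_objective(settings):
    """Made-up stand-in for validation AUC; a real job would train a model."""
    return 1.0 - abs(settings["eta"] - 0.3) - 0.01 * abs(settings["max_depth"] - 6)


def random_search(max_jobs, seed=0):
    """Evaluate max_jobs random candidates and return all trials plus the best."""
    rng = random.Random(seed)
    trials = []
    for _ in range(max_jobs):
        settings = sample_settings(rng)
        trials.append((toy_objective(settings), settings))
    best_score, best_settings = max(trials, key=lambda trial: trial[0])
    return trials, best_score, best_settings


trials, best_score, best_settings = random_search(max_jobs=40)
print(len(trials), round(best_score, 3), best_settings)
```

The loop stops after exactly `max_jobs` evaluations, which is the same budget semantics the tuning job uses.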

In [47]:
from sagemaker.tuner import IntegerParameter, ContinuousParameter

# We give a range of hyperparameters
hyperparameter_ranges = {
    'eta': ContinuousParameter(0, 1),
    'min_child_weight': ContinuousParameter(1, 10),
    'alpha': ContinuousParameter(0, 2),
    'max_depth': IntegerParameter(1, 10)
}

Next we'll specify the objective metric that we'd like to tune for and its definition. Several metrics are available.

The full set of available metrics is listed [here](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost-tuning.html)

Now, we'll create a HyperparameterTuner object, to which we pass:
- The XGBoost estimator we created above
- Our hyperparameter ranges
- Objective metric name and definition
- Tuning resource configurations (Number of training jobs to run in total and how many training jobs can be run in parallel)

In [48]:
objective_metric_name = 'validation:auc'
objective_type = 'Maximize'

In [49]:
from sagemaker.tuner import HyperparameterTuner

tuner = HyperparameterTuner(
    xgb,
    objective_metric_name,
    hyperparameter_ranges,
    objective_type=objective_type,
    max_jobs=40,
    max_parallel_jobs=4)

# 6. Start Training

In [50]:
tuner.fit(s3_data)
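
Besides the console, the tuning job's progress can be read from the `describe_hyper_parameter_tuning_job` response, which reports an overall status and per-status job counters. The sketch below summarizes such a response; the helper name `summarize_tuning_status` is hypothetical, and the sample dictionary is hand-written so the code runs without AWS credentials.

```python
def summarize_tuning_status(description):
    """Return a one-line summary of a tuning job description dictionary."""
    status = description["HyperParameterTuningJobStatus"]
    counters = description.get("TrainingJobStatusCounters", {})
    completed = counters.get("Completed", 0)
    failed = counters.get("NonRetryableError", 0) + counters.get("RetryableError", 0)
    running = counters.get("InProgress", 0)
    return "%s: %d completed, %d failed, %d in progress" % (
        status, completed, failed, running)


# Hand-written sample response (hypothetical values).
sample = {
    "HyperParameterTuningJobStatus": "InProgress",
    "TrainingJobStatusCounters": {
        "Completed": 12,
        "InProgress": 4,
        "RetryableError": 0,
        "NonRetryableError": 1,
        "Stopped": 0,
    },
}
summary = summarize_tuning_status(sample)
print(summary)

# With credentials configured, the real response would come from:
# boto3.client("sagemaker").describe_hyper_parameter_tuning_job(
#     HyperParameterTuningJobName=tuner.latest_tuning_job.name)
```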



In [58]:
tuner

<sagemaker.tuner.HyperparameterTuner at 0x22897aa1ec8>

# 7. Inspecting jobs with Amazon SageMaker Experiments

While model tuning is running, we can explore its progress with the `HyperparameterTuningJobAnalytics` class from the SageMaker SDK.

In [66]:
from sagemaker.analytics import HyperparameterTuningJobAnalytics

exp = HyperparameterTuningJobAnalytics(
    sagemaker_session=sess, 
    hyperparameter_tuning_job_name=tuner.latest_tuning_job.name
)

In [67]:
exp

<sagemaker.HyperparameterTuningJobAnalytics for sagemaker-xgboost-200618-1639>

In [68]:
df = exp.dataframe()
df.sort_values('FinalObjectiveValue', ascending=False)[:1]

Unnamed: 0,alpha,eta,max_depth,min_child_weight,TrainingJobName,TrainingJobStatus,FinalObjectiveValue,TrainingStartTime,TrainingEndTime,TrainingElapsedTimeSeconds
0,0.591646,0.888403,7.0,2.673249,sagemaker-xgboost-200618-1639-008-7bcb13bd,Stopped,,2020-06-18 16:46:31-07:00,2020-06-18 16:47:16-07:00,45.0


In [69]:
df.head()

Unnamed: 0,alpha,eta,max_depth,min_child_weight,TrainingJobName,TrainingJobStatus,FinalObjectiveValue,TrainingStartTime,TrainingEndTime,TrainingElapsedTimeSeconds
0,0.591646,0.888403,7.0,2.673249,sagemaker-xgboost-200618-1639-008-7bcb13bd,Stopped,,2020-06-18 16:46:31-07:00,2020-06-18 16:47:16-07:00,45.0
1,1.710656,0.729853,2.0,2.941976,sagemaker-xgboost-200618-1639-007-8048b959,Failed,,2020-06-18 16:45:48-07:00,2020-06-18 16:47:00-07:00,72.0
2,1.097842,0.287681,1.0,2.017434,sagemaker-xgboost-200618-1639-006-cc6efd69,Failed,,2020-06-18 16:45:30-07:00,2020-06-18 16:47:13-07:00,103.0
3,0.55995,0.464547,6.0,8.896817,sagemaker-xgboost-200618-1639-005-0b158347,Failed,,2020-06-18 16:45:54-07:00,2020-06-18 16:47:07-07:00,73.0
4,1.885231,0.356845,8.0,3.577142,sagemaker-xgboost-200618-1639-004-f809c522,Failed,,2020-06-18 16:41:59-07:00,2020-06-18 16:43:22-07:00,83.0
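
In the output above, every listed job is Failed or Stopped and `FinalObjectiveValue` is empty, so sorting alone does not identify a usable model. A more defensive pattern is to filter to completed jobs with a reported metric before picking the best one. The sketch below uses a hypothetical results frame with the same column names as `exp.dataframe()`.

```python
import numpy as np
import pandas as pd

# Hypothetical results in the same column layout as exp.dataframe().
results = pd.DataFrame({
    "TrainingJobName": ["job-001", "job-002", "job-003"],
    "TrainingJobStatus": ["Completed", "Failed", "Completed"],
    "FinalObjectiveValue": [0.91, np.nan, 0.94],
})

# Keep only jobs that finished and reported an objective value.
completed = results[
    (results["TrainingJobStatus"] == "Completed")
    & results["FinalObjectiveValue"].notna()
]

if completed.empty:
    print("No completed jobs yet; check the training job logs.")
else:
    # The objective is maximized, so the largest value wins.
    best = completed.sort_values("FinalObjectiveValue", ascending=False).iloc[0]
    print(best["TrainingJobName"], best["FinalObjectiveValue"])
```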


# 8. Model Deploy 

In [None]:
from time import strftime, gmtime
timestamp = strftime('%d-%H-%M-%S', gmtime())

xgb_endpoint = tuner.deploy(
    endpoint_name = 'xgboost-{}'.format(timestamp),
    initial_instance_count = 1,                    # Infrastructure requirements
    instance_type = 'ml.m4.xlarge'
)

# 9. Predicting with the test set

Predicting with our model
- First we need to determine how to pass data into and receive data from our endpoint. Our test data is stored as rows of the `test.csv` file written earlier (without the target column).
- To send it in an HTTP POST request, we pass each row as a CSV string with the `text/csv` content type and then read the response body.

In [None]:
sm = boto3.Session().client(service_name='runtime.sagemaker') 

with open('test.csv') as f:       # Close the file once the lines are read
    test_samples = [line.rstrip('\n') for line in f]
test_samples = test_samples[:100] # We'll predict the first 100 samples

for sample in test_samples:
    sample = bytes(sample, 'utf-8')
    response = sm.invoke_endpoint(EndpointName=xgb_endpoint.endpoint, 
                                  ContentType='text/csv', 
                                  Body=sample)
    print(response['Body'].read())
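
With the `binary:logistic` objective set earlier, each response body should hold a predicted probability as text. The sketch below converts such a body into a 0/1 label. It assumes one float per request and a 0.5 threshold; the helper name `to_label` and the sample bodies are hypothetical stand-ins for `response['Body'].read()`.

```python
def to_label(body_bytes, threshold=0.5):
    """Parse a probability returned as text and convert it to a 0/1 label."""
    probability = float(body_bytes.decode("utf-8").strip())
    return probability, int(probability >= threshold)


# Hypothetical response bodies, standing in for response['Body'].read().
for body in (b"0.0312", b"0.8741\n"):
    print(to_label(body))
```

Given the class imbalance seen in the EDA step, a threshold other than 0.5 may suit the use case better; that choice is left open here.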

# 10. Delete the SageMaker Endpoint

To make sure we don't get charged after the training/inference is over, we have to delete the endpoint.

In [None]:
sagemaker.Session().delete_endpoint(xgb_endpoint.endpoint) 