# A Quick Demonstration of a SageMaker Reproducibility Problem

## Contents

1. [Introduction](#introduction)
2. [Preparation](#preparation)
3. [Download and prepare the data](#download)
4. [Train + Hyperparameter tuning](#tune)
5. [Deploy](#deploy)
6. [Reproducing Objective Value](#wtf)
7. [Teardown](#teardown)
8. [Training Log](#log)

# <a id='introduction'>Introduction</a>

We encountered a problem where SageMaker hyperparameter tuning produced a much higher `f1-score` than our own hand-tuned models and the other hyperparameter sweep algorithms we have tried (`H2OAutoML`, `Scikit-Optimize Bayes`, `RandomizedSearchCV`). We got excited and decided to pull the model and verify whether the displayed `f1-score` was real. Our manual verification, using the best model and the same validation data, produced an `f1-score` that is much lower than what is displayed on the SageMaker Hyperparameter Tuning dashboard.

This notebook demonstrates the mismatch between the objective metric value displayed on the Hyperparameter Tuning dashboard and the value we compute manually using the same model and the same dataset.

We will use the SageMaker Python SDK, a high-level SDK, to simplify the way we interact with SageMaker Hyperparameter Tuning.

Example adapted from the [HPO XGBoost Random Log Notebook](https://github.com/aws/amazon-sagemaker-examples/blob/master/hyperparameter_tuning/xgboost_random_log/hpo_xgboost_random_log.ipynb), modified to support `SageMaker 2.25.2`.

# <a id='preparation'>Preparation</a>

Let's start by specifying:

- The S3 bucket and prefix that you want to use for training and model data. This should be within the same region as SageMaker training.
- The IAM role used to give training access to your data. See SageMaker documentation for how to create these.

In [1]:
import os
import tarfile
import pickle

import boto3
import sagemaker
from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner
from sagemaker.inputs import TrainingInput
from sagemaker.serializers import CSVSerializer

import numpy as np                                # For matrix operations and numerical processing
import pandas as pd                               # For munging tabular data
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
import xgboost

region = boto3.Session().region_name    
smclient = boto3.Session().client('sagemaker')

role = sagemaker.get_execution_role()

bucket = sagemaker.Session().default_bucket()                     
prefix = 'sagemaker/DEMO-hpo-xgboost-dm'

# <a id='download'>Download and prepare the data</a>

Here we download the [direct marketing dataset](https://archive.ics.uci.edu/ml/datasets/bank+marketing) from UCI's ML Repository.

In [2]:
!wget -N https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank-additional.zip
!unzip -o bank-additional.zip

--2021-03-24 20:37:53--  https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank-additional.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 304 Not Modified
File ‘bank-additional.zip’ not modified on server. Omitting download.

Archive:  bank-additional.zip
  inflating: bank-additional/.DS_Store  
  inflating: __MACOSX/bank-additional/._.DS_Store  
  inflating: bank-additional/.Rhistory  
  inflating: bank-additional/bank-additional-full.csv  
  inflating: bank-additional/bank-additional-names.txt  
  inflating: bank-additional/bank-additional.csv  
  inflating: __MACOSX/._bank-additional  


Now let us load the data, apply some preprocessing, and upload the processed data to S3.

In [3]:
# Load data
data = pd.read_csv('./bank-additional/bank-additional-full.csv', sep=';')
pd.set_option('display.max_columns', 500)     # Make sure we can see all of the columns
pd.set_option('display.max_rows', 50)         # Keep the output on one page

# Apply some feature processing
data['no_previous_contact'] = np.where(data['pdays'] == 999, 1, 0)                                 # Indicator variable to capture when pdays takes a value of 999
data['not_working'] = np.where(np.isin(data['job'], ['student', 'retired', 'unemployed']), 1, 0)   # Indicator for individuals not actively employed (np.isin replaces the older np.in1d)
model_data = pd.get_dummies(data)                                                                  # Convert categorical variables to sets of indicators

# columns that should not be included in the input
model_data = model_data.drop(['duration', 'emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed'], axis=1)

# split data
train_data, validation_data, test_data = np.split(model_data.sample(frac=1, random_state=1729), [int(0.7 * len(model_data)), int(0.9*len(model_data))])  

# save preprocessed file to s3
pd.concat([train_data['y_yes'], train_data.drop(['y_no', 'y_yes'], axis=1)], axis=1).to_csv('train.csv', index=False, header=False)
pd.concat([validation_data['y_yes'], validation_data.drop(['y_no', 'y_yes'], axis=1)], axis=1).to_csv('validation.csv', index=False, header=False)
pd.concat([test_data['y_yes'], test_data.drop(['y_no', 'y_yes'], axis=1)], axis=1).to_csv('test.csv', index=False, header=False)
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train/train.csv')).upload_file('train.csv')
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'validation/validation.csv')).upload_file('validation.csv')
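
The two indices passed to `np.split` above give a 70/20/10 train/validation/test split. As a quick check of the arithmetic, here is a minimal sketch; the row count of 41188 is an assumption (it is not printed anywhere in this notebook), chosen because it yields the train and validation sizes reported in the training log at the end.

```python
# Assumed row count of bank-additional-full.csv (not shown in the notebook itself).
n_rows = 41188

train_end = int(0.7 * n_rows)        # first split index passed to np.split
validation_end = int(0.9 * n_rows)   # second split index passed to np.split

split_sizes = {
    "train": train_end,
    "validation": validation_end - train_end,
    "test": n_rows - validation_end,
}
print(split_sizes)
```

Under that assumption the train and validation sizes come out as 28831 and 8238, the same numbers the container logs for its train and validation matrices, which suggests the tuner saw the same rows we use later.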

# <a id='tune'>Setup hyperparameter tuning</a>

In this example, we are using the SageMaker Python SDK to set up and manage the hyperparameter tuning job. We first configure the training jobs that the hyperparameter tuning job will launch by instantiating an estimator, then define the static hyperparameters and the objective.

The objective metric we use is `validation:f1`, which we presume to be the `f1` score on the `validation` data set. We will only train for one iteration (`num_round=1`) for this demo.

In [4]:
# Deprecated in sagemaker v2.
# s3_input_train = sagemaker.s3_input(s3_data='s3://{}/{}/train'.format(bucket, prefix), content_type='csv')
# s3_input_validation = sagemaker.s3_input(s3_data='s3://{}/{}/validation/'.format(bucket, prefix), content_type='csv')

s3_input_train = TrainingInput(s3_data='s3://{}/{}/train/train.csv'.format(bucket, prefix), content_type='text/csv')
s3_input_validation = TrainingInput(s3_data='s3://{}/{}/validation/validation.csv'.format(bucket, prefix), content_type='text/csv')

In [5]:
sess = sagemaker.Session()
role = sagemaker.get_execution_role()
container = sagemaker.image_uris.retrieve(
    framework='xgboost',
    region=boto3.Session().region_name,
    version='1.2-1',
    py_version='py3')

xgb = sagemaker.estimator.Estimator(
    container,
    role, 
    instance_count=1, 
    instance_type='ml.m5.4xlarge',
    output_path='s3://{}/{}/output'.format(bucket, prefix),
    sagemaker_session=sess
)

xgb.set_hyperparameters(
    eval_metric='f1',
    objective='binary:logistic',
    num_round=1,
#     rate_drop=0.3,
#     tweedie_variance_power=1.4
)

objective_metric_name = 'validation:f1'

In [6]:
hyperparameter_ranges = {
    'alpha': ContinuousParameter(0.01, 10, scaling_type="Logarithmic"),
    'lambda': ContinuousParameter(0.01, 10, scaling_type="Logarithmic")
}
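
For intuition about `scaling_type="Logarithmic"`: it asks the tuner to explore the range on a log scale, so each decade of `[0.01, 10]` gets similar attention. The sketch below illustrates that idea with plain log-uniform sampling from the standard library. It is not SageMaker's sampler (the job here uses the `Bayesian` strategy), and `sample_log_uniform` is a hypothetical helper written for this illustration.

```python
import math
import random

def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(0)  # fixed seed, illustration only
samples = [sample_log_uniform(0.01, 10, rng) for _ in range(10_000)]

# Each decade (0.01-0.1, 0.1-1, 1-10) should receive roughly a third of the draws.
share_below_0_1 = sum(s < 0.1 for s in samples) / len(samples)
share_above_1 = sum(s >= 1 for s in samples) / len(samples)
print(f"below 0.1: {share_below_0_1:.2f}, above 1: {share_above_1:.2f}")
```

With linear scaling, by contrast, about 90% of uniform draws from `[0.01, 10]` would land above 1, leaving small regularization values barely explored.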

In [7]:
tuner_log = HyperparameterTuner(
    xgb,
    objective_metric_name,
    hyperparameter_ranges,
    max_jobs=1,
    max_parallel_jobs=1,
    strategy='Bayesian',
    objective_type='Maximize'
)

tuner_log.fit({'train': s3_input_train, 'validation': s3_input_validation}, include_cls_metadata=False)

.......................................!


Check if jobs have finished and get the logs of the job.


In [8]:
boto3.client('sagemaker').describe_hyper_parameter_tuning_job(
    HyperParameterTuningJobName=tuner_log.latest_tuning_job.job_name)['HyperParameterTuningJobStatus']

'Completed'

In [9]:
status_log = boto3.client('sagemaker').describe_hyper_parameter_tuning_job(
    HyperParameterTuningJobName=tuner_log.latest_tuning_job.job_name)['HyperParameterTuningJobStatus']

assert status_log == 'Completed', "First must be completed, was {}".format(status_log)

df_log = sagemaker.HyperparameterTuningJobAnalytics(tuner_log.latest_tuning_job.job_name).dataframe()
df_log['scaling'] = 'log'
final_objective_value = df_log['FinalObjectiveValue'][0]
df_log.head()

|   | alpha | lambda | TrainingJobName | TrainingJobStatus | FinalObjectiveValue | TrainingStartTime | TrainingEndTime | TrainingElapsedTimeSeconds | scaling |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.787034 | 0.428942 | sagemaker-xgboost-210324-2037-001-0c35a89d | Completed | 0.23828 | 2021-03-24 20:40:12+00:00 | 2021-03-24 20:41:06+00:00 | 54.0 | log |


# <a id='deploy'>Deploy</a>

Once the model is trained, we will deploy it using SageMaker to keep AWS happy and our wallets empty.

[Reference deployment code is here.](https://aws.amazon.com/getting-started/hands-on/build-train-deploy-machine-learning-model-sagemaker/)

In [10]:
xgb_predictor = tuner_log.deploy(
    initial_instance_count=1,
    instance_type='ml.m5.xlarge',
)


2021-03-24 20:41:06 Starting - Preparing the instances for training
2021-03-24 20:41:06 Downloading - Downloading input data
2021-03-24 20:41:06 Training - Training image download completed. Training in progress.
2021-03-24 20:41:06 Uploading - Uploading generated training model
2021-03-24 20:41:06 Completed - Training job completed
-------------!

# <a id='wtf'> Reproducing Objective Value</a>

For this tuning job, we set our `objective_metric_name` to `validation:f1`, and the tuner reported a `FinalObjectiveValue` of `0.23828`.

Awesomesauce. Let's try to reproduce the same `FinalObjectiveValue` by running our validation data through the trained model and [calculating the f1-score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html) using scikit-learn.

In [11]:
validation_data_array = validation_data.drop(['y_no', 'y_yes'], axis=1).values               # load the data into an array
xgb_predictor.serializer = CSVSerializer()                                                   # set the serializer type
predictions = xgb_predictor.predict(validation_data_array).decode('utf-8')                   # predict!
predictions_array = np.fromstring(predictions[1:], sep=',')                                  # and turn the prediction into an array
predictions_array = np.where(predictions_array > 0.5, 1, 0)                                  # sklearn f1-score takes labels, not class probabilities

validation_f1 = metrics.f1_score(validation_data['y_yes'], predictions_array)
print(f'Reproduced validation:f1 is {validation_f1:.4}. Advertised FinalObjectiveValue is {final_objective_value:.4}\n')

Reproduced validation:f1 is 0.2758. Advertised FinalObjectiveValue is 0.2383



Hmm, what's going on? Maybe the `FinalObjectiveValue` is the f1-score calculated on the training set? Let's try that.

In [12]:
train_data_array = train_data.drop(['y_no', 'y_yes'], axis=1).values                          # load the data into an array
xgb_predictor.serializer = CSVSerializer()                                                    # set the serializer type
predictions = xgb_predictor.predict(train_data_array).decode('utf-8')                         # predict!
predictions_array = np.fromstring(predictions[1:], sep=',')                                   # and turn the prediction into an array
predictions_array = np.where(predictions_array > 0.5, 1, 0)                                   # sklearn f1-score takes labels, not class probabilities

train_f1 = metrics.f1_score(train_data['y_yes'], predictions_array)
print(f'Reproduced train:f1 is {train_f1:.4}. Advertised FinalObjectiveValue is {final_objective_value:.4}\n')

Reproduced train:f1 is 0.3463. Advertised FinalObjectiveValue is 0.2383



`FinalObjectiveValue` is expected to be `validation:f1`. However, it equals neither our reproduced `validation:f1` nor our reproduced `train:f1`. So what does that number represent? In the SageMaker hyperparameter tuner, we did set `eval_metric` to `f1` and `objective_metric_name` to `validation:f1`.
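
One thing worth ruling out is the decision threshold. Our reproduction cuts the predicted probabilities at 0.5 before calling `f1_score`, and we do not know how the container turns probabilities into labels when it computes its `f1`. The sketch below uses made-up labels and probabilities (not our data) and a hypothetical pure-Python helper, `f1_from_labels`, to show how much the f1-score of one fixed model can move with the threshold alone.

```python
def f1_from_labels(y_true, y_pred):
    """F1 for the positive class: 2*TP / (2*TP + FP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Made-up labels and predicted probabilities, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.45, 0.3, 0.6, 0.4, 0.2, 0.1, 0.05]

f1_by_threshold = {}
for threshold in (0.25, 0.5, 0.75):
    y_pred = [1 if p > threshold else 0 for p in y_prob]
    f1_by_threshold[threshold] = f1_from_labels(y_true, y_pred)
    print(f"threshold={threshold}: f1={f1_by_threshold[threshold]:.4f}")
```

On this toy data the score is 0.75, 0.40 and 0.50 at thresholds 0.25, 0.5 and 0.75. This does not explain our mismatch by itself; it only shows that two correct f1 computations on the same model can disagree if they binarize differently.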

As a sanity check, we will download the model and run inference with the local copy.

In [13]:
training_job_name = df_log['TrainingJobName'][0]
# model_file = f's3://{bucket}/{prefix}/output/{training_job_name}/output/model.tar.gz'  
all_prefix = f"{prefix}/output/{df_log['TrainingJobName'][0]}/output/"
filename = "model.tar.gz"

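# The boto3 client call below most likely failed because Key pointed at the
# prefix ("folder") rather than the object itself; Key must include the filename.
# s3 = boto3.client('s3')
# s3.download_file(Bucket=bucket, Key=all_prefix + filename, Filename=filename)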

s3 = boto3.resource('s3')
s3.Bucket(bucket).download_file(all_prefix+filename, "model.tar.gz")

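# Extract the model artifact; the context manager closes the archive for us
with tarfile.open('model.tar.gz', "r:gz") as tar:
    tar.extractall()

# The extracted "xgboost-model" file is a pickled Booster
with open("xgboost-model", 'rb') as model_file:
    xgb_model = pickle.load(model_file)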

In [14]:
validation_data_dmatrix = xgboost.DMatrix(validation_data_array)
y_predict_proba = xgb_model.predict(validation_data_dmatrix)
y_pred_ = np.where(y_predict_proba > 0.5, 1, 0)

print(f"f1-score displayed in FinalObjectiveValue is: {df_log['FinalObjectiveValue'][0]}")
print(f"f1-score using loaded model on validation data is: {metrics.f1_score(validation_data['y_yes'], y_pred_)}")

f1-score displayed in FinalObjectiveValue is: 0.23827999830245972
f1-score using loaded model on validation data is: 0.27580511973575556


Good news: the downloaded model produces the same f1-score on the validation data as the deployed model.

Bad news: it is still different from the advertised `FinalObjectiveValue`.

# <a id='teardown'>Teardown</a>

Tear down the deployed endpoint and the model so we can afford food tomorrow.

In [15]:
xgb_predictor.delete_endpoint()
xgb_predictor.delete_model()

# <a id='log'>Training Job CloudWatch Log</a>

```
	2021-03-24T13:40:57.660-07:00	[2021-03-24 20:40:55.364 ip-10-0-184-77.us-east-2.compute.internal:1 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
	2021-03-24T13:40:57.661-07:00	INFO:sagemaker-containers:Imported framework sagemaker_xgboost_container.training
	2021-03-24T13:40:57.661-07:00	INFO:sagemaker-containers:Failed to parse hyperparameter eval_metric value f1 to Json.
	2021-03-24T13:40:57.661-07:00	Returning the value itself
	2021-03-24T13:40:57.661-07:00	INFO:sagemaker-containers:Failed to parse hyperparameter _tuning_objective_metric value validation:f1 to Json.
	2021-03-24T13:40:57.661-07:00	Returning the value itself
	2021-03-24T13:40:57.661-07:00	INFO:sagemaker-containers:Failed to parse hyperparameter objective value binary:logistic to Json.
	2021-03-24T13:40:57.661-07:00	Returning the value itself
	2021-03-24T13:40:57.661-07:00	INFO:sagemaker-containers:No GPUs detected (normal if no gpus installed)
	2021-03-24T13:40:57.661-07:00	INFO:sagemaker_xgboost_container.training:Running XGBoost Sagemaker in algorithm mode
	2021-03-24T13:40:57.661-07:00	INFO:root:Determined delimiter of CSV input is ','
	2021-03-24T13:40:57.661-07:00	INFO:root:Determined delimiter of CSV input is ','
	2021-03-24T13:40:57.661-07:00	INFO:root:Determined delimiter of CSV input is ','
	2021-03-24T13:40:57.661-07:00	INFO:root:Determined delimiter of CSV input is ','
	2021-03-24T13:40:57.661-07:00	INFO:root:Single node training.
	2021-03-24T13:40:57.661-07:00	INFO:root:Setting up HPO optimized metric to be : f1
	2021-03-24T13:40:57.661-07:00	INFO:root:Train matrix has 28831 rows
	2021-03-24T13:40:57.661-07:00	INFO:root:Validation matrix has 8238 rows
	2021-03-24T13:40:57.661-07:00	[20:40:55] WARNING: ../src/learner.cc:516:
	2021-03-24T13:40:57.661-07:00	Parameters: { _tuning_objective_metric } might not be used. This may not be accurate due to some parameters are only used in language bindings but passed down to XGBoost core. Or some parameters are not used but slip through this verification. Please open an issue if you find above cases.
	2021-03-24T13:40:57.661-07:00	[0]#011train-error:0.09705#011validation-error:0.10646#011train-f1:0.23900#011validation-f1:0.23828
```
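
The last metrics line of the log carries the per-round values computed inside the container. Here is a small sketch that parses it; `parse_metrics` is a hypothetical helper, and the only assumption is that `#011` is CloudWatch's rendering of a tab character.

```python
# Final metrics line copied from the log above; "#011" is an escaped tab.
log_line = (
    "[0]#011train-error:0.09705#011validation-error:0.10646"
    "#011train-f1:0.23900#011validation-f1:0.23828"
)

def parse_metrics(line):
    """Return (round index, {metric name: value}) for one XGBoost eval line."""
    fields = line.replace("#011", "\t").split("\t")
    round_index = int(fields[0].strip("[]"))
    metrics = {}
    for field in fields[1:]:
        name, value = field.rsplit(":", 1)
        metrics[name] = float(value)
    return round_index, metrics

round_index, logged = parse_metrics(log_line)
print(round_index, logged)
print(round(logged["validation-f1"], 4))
```

The logged `validation-f1` of 0.23828 is exactly the `FinalObjectiveValue` in the tuner dataframe, so the dashboard is faithfully reporting what the container computed. The open question is why the container's `f1` differs from scikit-learn's `f1_score` on the same model and data.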

In [16]:
xgboost.__version__

'1.2.0'