# Higgs Boson - Collision event classification

## Model creation

### Upload data to S3

The data files have already been prepared locally on the storage of the SageMaker notebook instance. Next, we'll upload them to the default S3 bucket.

In [1]:
# import libraries
import pandas as pd
import numpy as np

from sklearn.metrics import accuracy_score, recall_score, roc_auc_score

import os
import time
import boto3
import sagemaker
from sagemaker import get_execution_role

In [2]:
# SageMaker session and role
LOCAL_MODE = False
if LOCAL_MODE:
    # During model development it's more efficient to use SageMaker local mode.
    # https://aws.amazon.com/blogs/machine-learning/use-the-amazon-sagemaker-local-mode-to-train-on-your-notebook-instance/
    sagemaker_session = sagemaker.LocalSession()

    # If notebook instance does not have GPUs, change to 'local'.
    train_instance_type = 'local_gpu'
    deploy_instance_type = 'local_gpu'
    print('Running in local mode.')
else:
    sagemaker_session = sagemaker.Session()
    train_instance_type = 'ml.m5.xlarge'
    deploy_instance_type = 'ml.m5.large'
    print('Running in remote mode.')

role = sagemaker.get_execution_role()

# Use the default SageMaker S3 bucket
bucket = sagemaker_session.default_bucket()

# Prefix for objects in S3 bucket
prefix = 'capstone/xgb_500000'

# Location of data files on notebook storage.
data_dir = '../data/500000'


Running in remote mode.


In [3]:
# Actual upload of the data files.
train_path = sagemaker_session.upload_data(data_dir+'/train.csv', bucket=bucket, key_prefix=prefix)
val_path = sagemaker_session.upload_data(data_dir+'/val.csv', bucket=bucket, key_prefix=prefix)

s3_input_train = sagemaker.s3_input(s3_data=train_path, content_type='csv')
s3_input_val = sagemaker.s3_input(s3_data=val_path, content_type='csv')

's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.
's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.


In [4]:
# Read the test data.
test_df = pd.read_csv(os.path.join(data_dir, 'test.csv'), header=None)
print("Shape: {}".format(test_df.shape))

# Separate labels and features
test_y = test_df.iloc[:,0]
test_X = test_df.iloc[:, 1:]

# Dump the test features and upload to S3.
test_file = 'test-2.csv'
pd.DataFrame(test_X).to_csv(os.path.join(data_dir, test_file), header=False, index=False)
test_path = sagemaker_session.upload_data(os.path.join(data_dir, test_file), bucket=bucket, key_prefix=prefix)
print(test_path)

Shape: (75000, 18)
s3://sagemaker-eu-west-1-019518462631/capstone/xgb_500000/test-2.csv


In [5]:
# Check if files have been uploaded
for obj in boto3.resource('s3').Bucket(bucket).objects.all():
    if prefix in obj.key and '.csv' in obj.key:
        print(obj.key)

capstone/xgb_500000/test-2.csv
capstone/xgb_500000/train.csv
capstone/xgb_500000/val.csv


### Test model using XGBoost

Initially let's explore how well a SageMaker built-in algorithm performs on the dataset.
First, a model is trained using basic parameters. Next, hyperparameter tuning is applied to see how much it can be improved.

#### Initial XGBoost model

In [6]:
from sagemaker.amazon.amazon_estimator import get_image_uri

# Use version 1.0-1 of the built-in XGBoost algorithm container.
container = get_image_uri(sagemaker_session.boto_region_name,
                          'xgboost',
                          repo_version='1.0-1')


# Create an estimator and set the hyperparameters.
xgb_output_path = 's3://{}/{}/output'.format(bucket, prefix)
xgb = sagemaker.estimator.Estimator(container,
                                    role,
                                    train_instance_count=1,
                                    train_instance_type=train_instance_type,
                                    output_path=xgb_output_path,
                                    sagemaker_session=sagemaker_session)

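# Note: rate_drop only takes effect with booster='dart', and
# tweedie_variance_power only with objective='reg:tweedie'.
xgb.set_hyperparameters(eval_metric='auc',
                        objective='binary:logistic',
                        max_depth=10,
                        num_round=500,
                        eta=0.2,
                        gamma=4,
                        rate_drop=0.3,
                        min_child_weight=6,
                        subsample=0.8,
                        early_stopping_rounds=15,
                        tweedie_variance_power=1.4)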


'get_image_uri' method will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.


In [7]:
%%time

# Do actual training.
xgb.fit({'train': s3_input_train, 'validation': s3_input_val})

2020-07-19 09:47:45 Starting - Starting the training job...
2020-07-19 09:47:47 Starting - Launching requested ML instances......
2020-07-19 09:48:53 Starting - Preparing the instances for training...
2020-07-19 09:49:41 Downloading - Downloading input data
2020-07-19 09:49:41 Training - Downloading the training image...
2020-07-19 09:50:10 Training - Training image download completed. Training in progress......
INFO:sagemaker-containers:Imported framework sagemaker_xgboost_container.training
INFO:sagemaker-containers:Failed to parse hyperparameter eval_metric value auc to Json.
Returning the value itself
INFO:sagemaker-containers:Failed to parse hyperparameter objective value binary:logistic to Json.
Returning the value itself
INFO:sagemaker-containers:No GPUs detected (normal if no gpus installed)
INFO:sagemaker_xgboost_container.training:Running XGBoost Sagemaker in algorithm mode
INFO:root:Determined delimiter of CS

During training, the AUC on the validation set rose from an initial 0.65416 to **0.71918** (the highest value, reached in round 106). Although `num_round` was set to 500, training ended after 120 rounds: with `early_stopping_rounds=15`, training stops once the validation AUC has failed to improve for 15 rounds. In other words, with the current data and hyperparameter settings, training reached a plateau.
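The stopping rule described above can be illustrated with a small, self-contained sketch. It is a simplified model of patience-based early stopping, not the code the SageMaker XGBoost container runs; the function name `early_stopping_round` and the toy scores are hypothetical.

```python
def early_stopping_round(scores, patience):
    """Return (stop_round, best_round), both 0-indexed, for a metric to maximise.

    Training stops at the first round that lies `patience` or more rounds
    after the best round seen so far.
    """
    best_round = 0
    best_score = scores[0]
    for rnd, score in enumerate(scores):
        if score > best_score:
            # Strict improvement: remember this round as the new best.
            best_score, best_round = score, rnd
        elif rnd - best_round >= patience:
            # No improvement for `patience` rounds: stop here.
            return rnd, best_round
    # The metric kept improving often enough; all rounds were used.
    return len(scores) - 1, best_round


# Toy validation AUC values (hypothetical, not taken from the training log).
toy_scores = [0.65, 0.68, 0.70, 0.71, 0.72, 0.719, 0.718, 0.72, 0.717, 0.716]
result = early_stopping_round(toy_scores, patience=3)
print(result)
```

In this toy run the best score is reached in round 4 and training stops in round 7, even though more rounds were available.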

In [8]:
%%time

# After training, I evaluate the model using the test data...
xgb_transformer = xgb.transformer(instance_count = 1, instance_type = 'ml.m5.large')
xgb_transformer.transform(test_path, content_type='text/csv', split_type='Line')
xgb_transformer.wait()

# The output of the transform job is downloaded back onto the notebook...
!aws s3 cp --recursive $xgb_transformer.output_path $data_dir

# And the predictions are processed and metrics calculated.
predictions = pd.read_csv(os.path.join(data_dir, 'test-2.csv.out'), header=None)
predictions = [round(num) for num in predictions.squeeze().values]

print("Accuracy: {:.6f}".format(accuracy_score(test_y, predictions)))
print("Recall:   {:.6f}".format(recall_score(test_y, predictions)))
# Note: AUC is computed here on the rounded 0/1 labels, not on the predicted probabilities.
print("ROC AUC:  {:.6f}".format(roc_auc_score(test_y, predictions)))



...................
[2020-07-19:09:54:52:INFO] No GPUs detected (normal if no gpus installed)
[2020-07-19:09:54:52:INFO] No GPUs detected (normal if no gpus installed)
[2020-07-19:09:54:52:INFO] nginx config:
worker_processes auto;
daemon off;
pid /tmp/nginx.pid;
error_log  /dev/stderr;

worker_rlimit_nofile 4096;

events {
  worker_connections 2048;
}

http {
  include /etc/nginx/mime.types;
  default_type application/octet-stream;
  access_log /dev/stdout combined;

  upstream gunicorn {
    server unix:/tmp/gunicorn.sock;
  }

  server {
    listen 8080 deferred;
    client_max_body_size 0;

    keepalive_timeout 3;

    location ~ ^/(ping|invocations|execution-parameters) {
      proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
      proxy_set_header Host $http_host;
      proxy_redirect off;
      proxy_read_timeout 60s;
      proxy_pass http://gunicorn;
    }

    location /

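With minimal effort and using the built-in XGBoost algorithm, an AUC score of **0.662989** was obtained on the test set. Note that this value is computed from predictions rounded to 0/1 labels, so it is not directly comparable with the validation AUC reported during training, which is based on the predicted probabilities.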
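The evaluation cell rounds each predicted probability to 0 or 1 before passing it to `roc_auc_score`. The sketch below uses toy data to show how that rounding can change the AUC. The helper `pairwise_auc` is a hypothetical pure-Python stand-in for `roc_auc_score`; it counts correctly ordered positive/negative pairs and scores ties as one half. The labels and scores are made up for illustration.

```python
def pairwise_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Toy labels and predicted probabilities (hypothetical values).
toy_labels = [0, 0, 0, 1, 1, 1]
toy_probs = [0.1, 0.3, 0.6, 0.4, 0.7, 0.9]

# AUC on the raw probabilities versus AUC on the rounded 0/1 predictions.
auc_prob = pairwise_auc(toy_labels, toy_probs)
auc_hard = pairwise_auc(toy_labels, [round(p) for p in toy_probs])
print("AUC on probabilities:  {:.4f}".format(auc_prob))
print("AUC on rounded labels: {:.4f}".format(auc_hard))
```

On this toy data the rounded predictions give a lower AUC, because rounding discards the ranking information within each predicted class. Whether the same holds for the test set here, and by how much, would have to be checked by scoring the raw transform output.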

#### Optimized model

Next, a new model is created and hyperparameter tuning is applied to it.
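As a rough mental model of what a tuning job does, the sketch below runs a plain random search over a small search space. This is a deliberately simpler stand-in: SageMaker's `HyperparameterTuner` uses a Bayesian strategy by default, and each trial there is a full training job. The search space, `toy_objective`, and all other names are hypothetical and only illustrate the sample-evaluate-keep-best loop.

```python
import random

# Hypothetical search space: (kind, low, high) per hyperparameter.
SEARCH_SPACE = {
    "eta": ("continuous", 0.0, 1.0),
    "gamma": ("continuous", 0.0, 10.0),
    "max_depth": ("integer", 1, 20),
}


def sample_config(rng):
    """Draw one hyperparameter combination uniformly from SEARCH_SPACE."""
    config = {}
    for name, (kind, low, high) in SEARCH_SPACE.items():
        if kind == "integer":
            config[name] = rng.randint(low, high)  # inclusive bounds
        else:
            config[name] = rng.uniform(low, high)
    return config


def toy_objective(config):
    """Hypothetical stand-in for 'train a model and return its validation AUC'."""
    return 1.0 - abs(config["eta"] - 0.2) - 0.01 * abs(config["max_depth"] - 8)


def random_search(n_trials, seed=0):
    """Evaluate n_trials random configurations and return the best one."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        config = sample_config(rng)
        trials.append((toy_objective(config), config))
    best_score, best_config = max(trials, key=lambda trial: trial[0])
    return best_score, best_config, trials


best_score, best_config, trials = random_search(n_trials=30)
print(best_score, best_config)
```

A Bayesian strategy differs in that it uses the scores of completed trials to choose where to sample next, rather than sampling independently each time.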

In [10]:
from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner

xgb_output_path = 's3://{}/{}/output'.format(bucket, prefix)

xgb_estimator = sagemaker.estimator.Estimator(container,
                                              role,
                                              train_instance_count=1,
                                              train_instance_type=train_instance_type,
                                              output_path=xgb_output_path,
                                              sagemaker_session=sagemaker_session)

xgb_estimator.set_hyperparameters(eval_metric='auc',
                                  objective='binary:logistic',
                                  num_round=500,
                                  rate_drop=0.3,
                                  early_stopping_rounds=15,
                                  tweedie_variance_power=1.4)

hyperparameter_ranges = {
    'eta': ContinuousParameter(0, 1),
    'min_child_weight': ContinuousParameter(1, 10),
    'alpha': ContinuousParameter(0, 2),
    'gamma': ContinuousParameter(0, 10),
    'max_depth': IntegerParameter(1, 20),
    'subsample': ContinuousParameter(0, 1)
}

objective_metric_name = 'validation:auc'

xgb_tuner = HyperparameterTuner(xgb_estimator,
                                objective_metric_name,
                                hyperparameter_ranges,
                                max_jobs=30,
                                max_parallel_jobs=4)




In [11]:
%%time

import time

xgb_tuner.fit({'train': s3_input_train, 'validation': s3_input_val})

xgb_tuning_job_info = sagemaker_session.sagemaker_client.describe_hyper_parameter_tuning_job(HyperParameterTuningJobName=xgb_tuner.latest_tuning_job.job_name)

# Wait while hyperparameter tuning job is running.
while xgb_tuning_job_info['HyperParameterTuningJobStatus'] == 'InProgress':
    print("JobStatus: {}".format(xgb_tuning_job_info['TrainingJobStatusCounters']))
    print("ObjectiveStatus: {}".format(xgb_tuning_job_info['ObjectiveStatusCounters']))
    time.sleep(5*60)
    xgb_tuning_job_info = sagemaker_session.sagemaker_client.describe_hyper_parameter_tuning_job(HyperParameterTuningJobName=xgb_tuner.latest_tuning_job.job_name)


JobStatus: {'Completed': 0, 'InProgress': 0, 'RetryableError': 0, 'NonRetryableError': 0, 'Stopped': 0}
ObjectiveStatus: {'Succeeded': 0, 'Pending': 0, 'Failed': 0}
JobStatus: {'Completed': 3, 'InProgress': 4, 'RetryableError': 0, 'NonRetryableError': 0, 'Stopped': 0}
ObjectiveStatus: {'Succeeded': 3, 'Pending': 4, 'Failed': 0}
JobStatus: {'Completed': 8, 'InProgress': 4, 'RetryableError': 0, 'NonRetryableError': 0, 'Stopped': 0}
ObjectiveStatus: {'Succeeded': 8, 'Pending': 4, 'Failed': 0}
JobStatus: {'Completed': 10, 'InProgress': 4, 'RetryableError': 0, 'NonRetryableError': 0, 'Stopped': 0}
ObjectiveStatus: {'Succeeded': 10, 'Pending': 4, 'Failed': 0}
JobStatus: {'Completed': 13, 'InProgress': 4, 'RetryableError': 0, 'NonRetryableError': 0, 'Stopped': 0}
ObjectiveStatus: {'Succeeded': 13, 'Pending': 4, 'Failed': 0}
JobStatus: {'Completed': 15, 'InProgress': 3, 'RetryableError': 0, 'NonRetryableError': 0, 'Stopped': 0}
ObjectiveStatus: {'Succeeded': 15, 'Pending': 3, 'Failed': 0}
JobS

In [12]:
%%time

xgb_best_estimator = xgb_tuner.best_estimator()

xgb_best_transformer = xgb_best_estimator.transformer(instance_count = 1, instance_type = 'ml.m5.large')
xgb_best_transformer.transform(test_path, content_type='text/csv', split_type='Line')
xgb_best_transformer.wait()

!aws s3 cp --recursive $xgb_best_transformer.output_path $data_dir

# And the predictions are processed and metrics calculated.
predictions = pd.read_csv(os.path.join(data_dir, 'test-2.csv.out'), header=None)
predictions = [round(num) for num in predictions.squeeze().values]

print("Accuracy: {:.6f}".format(accuracy_score(test_y, predictions)))
print("Recall:   {:.6f}".format(recall_score(test_y, predictions)))
# Note: AUC is computed here on the rounded 0/1 labels, not on the predicted probabilities.
print("ROC AUC:  {:.6f}".format(roc_auc_score(test_y, predictions)))




2020-07-19 10:29:09 Starting - Preparing the instances for training
2020-07-19 10:29:09 Downloading - Downloading input data
2020-07-19 10:29:09 Training - Training image download completed. Training in progress.
2020-07-19 10:29:09 Uploading - Uploading generated training model
2020-07-19 10:29:09 Completed - Training job completed



INFO:sagemaker-containers:Imported framework sagemaker_xgboost_container.training
INFO:sagemaker-containers:Failed to parse hyperparameter eval_metric value auc to Json.
Returning the value itself
INFO:sagemaker-containers:Failed to parse hyperparameter _tuning_objective_metric value validation:auc to Json.
Returning the value itself
INFO:sagemaker-containers:Failed to parse hyperparameter objective value binary:logistic to Json.
Returning the value itself
INFO:sagemaker-containers:No GPUs detected (normal if no gpus installed)
INFO:sagemaker_xgboost_container.training:Running XGBoost Sagemaker in algorithm mode
INFO:root:Determined delimiter of CSV input is ','
INFO:root:Determined delimiter of CSV input is ','
INFO:root:Determined delimiter of CSV input is ','
[10:20:52] 350000x17 matrix with 5950000 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&de

After testing 30 different hyperparameter combinations, the best XGBoost model achieved an AUC of **0.73419** on the validation set, an improvement of 0.01501 over the initial XGBoost model (AUC 0.71918).

On the test set, the final XGBoost model achieved an AUC of **0.674253**, an improvement of 0.011264 over the initial model.
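The gains quoted above follow directly from the reported scores. The short sketch below reproduces the arithmetic and also expresses each gain relative to the initial model. The dictionary names are only for illustration.

```python
# AUC values reported in this notebook for the initial and the tuned model.
initial = {"validation": 0.71918, "test": 0.662989}
tuned = {"validation": 0.73419, "test": 0.674253}

gains = {}
for split in ("validation", "test"):
    # Absolute AUC gain of the tuned model over the initial model.
    gains[split] = tuned[split] - initial[split]
    # The same gain as a percentage of the initial score.
    relative = 100 * gains[split] / initial[split]
    print("{}: +{:.6f} AUC ({:.2f}% relative)".format(split, gains[split], relative))
```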