# Higgs Boson - Collision event classification

## Deep Learning Model

### Upload data to S3

The data files have already been prepared locally on the storage of the SageMaker notebook instance. Next, we'll upload them to the default S3 bucket.

In [1]:
# import libraries
import pandas as pd
import numpy as np

from sklearn.metrics import accuracy_score, recall_score, roc_auc_score

import os
import time
import boto3
import sagemaker
from sagemaker import get_execution_role

In [2]:
# SageMaker session and role
LOCAL_MODE = False
if LOCAL_MODE:
    # During model development it's more efficient to use SageMaker local mode.
    # https://aws.amazon.com/blogs/machine-learning/use-the-amazon-sagemaker-local-mode-to-train-on-your-notebook-instance/
    sagemaker_session = sagemaker.LocalSession()

    # If notebook instance does not have GPUs, change to 'local'.
    train_instance_type = 'local'
    deploy_instance_type = 'local'
    print('Running in local mode.')
else:
    sagemaker_session = sagemaker.Session()
    train_instance_type = 'ml.p2.xlarge'
    deploy_instance_type = 'ml.m5.large'
    print('Running in remote mode.')

role = sagemaker.get_execution_role()

# Use the default SageMaker S3 bucket
bucket = sagemaker_session.default_bucket()

# Prefix for objects in S3 bucket
prefix = 'capstone/nn_250000'

# Location of data files on notebook storage.
data_dir = '../data/250000'


Running in remote mode.


In [3]:
# Actual upload of the data files.
train_path = sagemaker_session.upload_data(data_dir+'/train.csv', bucket=bucket, key_prefix=prefix)
val_path = sagemaker_session.upload_data(data_dir+'/val.csv', bucket=bucket, key_prefix=prefix)

s3_input_train = sagemaker.s3_input(s3_data=train_path, content_type='csv')
s3_input_val = sagemaker.s3_input(s3_data=val_path, content_type='csv')

print("S3 locations: train {} / validation {}".format(train_path, val_path))

's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.


S3 locations: train s3://sagemaker-eu-west-1-019518462631/capstone/nn_250000/train.csv / validation s3://sagemaker-eu-west-1-019518462631/capstone/nn_250000/val.csv
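
Judging from the locations printed above, `upload_data` returns a URI of the form `s3://<bucket>/<key_prefix>/<file name>` for a single file. The helper below is a hypothetical illustration of that naming scheme (it is not part of the SageMaker SDK, and the bucket name is made up); it can be handy to predict where a file will land before uploading it.

```python
import posixpath

def expected_s3_uri(local_path, bucket, key_prefix):
    """Hypothetical helper: predict the S3 URI of a single uploaded file."""
    # S3 keys always use forward slashes, regardless of the local OS.
    file_name = local_path.replace("\\", "/").split("/")[-1]
    key = posixpath.join(key_prefix, file_name)
    return "s3://{}/{}".format(bucket, key)

print(expected_s3_uri("../data/250000/train.csv", "my-bucket", "capstone/nn_250000"))
```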


In [4]:
# Read the test data.
test_df = pd.read_csv(os.path.join(data_dir, 'test.csv'), header=None)

# Separate labels and features
test_y = test_df.iloc[:,0]
test_X = test_df.iloc[:, 1:]

# Dump the test features and upload to S3.
test_file = 'test-2.csv'
pd.DataFrame(test_X).to_csv(os.path.join(data_dir, test_file), header=False, index=False)
test_path = sagemaker_session.upload_data(os.path.join(data_dir, test_file), bucket=bucket, key_prefix=prefix)

print("S3 locations: test {}".format(test_path))

S3 locations: test s3://sagemaker-eu-west-1-019518462631/capstone/nn_250000/test-2.csv


In [5]:
# Upload sklearn scalers model.
scalers_path = "s3://{}/{}/scalers.pkl".format(bucket, prefix)
!aws s3 cp $data_dir/scalers.pkl $scalers_path

print("S3 locations: scalers {}".format(scalers_path))

upload: ../data/250000/scalers.pkl to s3://sagemaker-eu-west-1-019518462631/capstone/nn_250000/scalers.pkl
S3 locations: scalers s3://sagemaker-eu-west-1-019518462631/capstone/nn_250000/scalers.pkl


In [6]:
# Check if files have been uploaded
for obj in boto3.resource('s3').Bucket(bucket).objects.all():
    if prefix in obj.key and '.csv' in obj.key:
        print(obj.key)

capstone/nn_250000/test-2.csv
capstone/nn_250000/test-2.csv.out
capstone/nn_250000/train.csv
capstone/nn_250000/val.csv
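
Note that the check above uses substring matching, which is why `test-2.csv.out` from an earlier batch transform run also shows up. A stricter filter could look like the sketch below; `filter_keys` is a hypothetical helper, and in the notebook the keys would come from `obj.key`.

```python
def filter_keys(keys, prefix, suffix=".csv"):
    """Keep keys that start with the prefix and end with the suffix."""
    return [k for k in keys if k.startswith(prefix + "/") and k.endswith(suffix)]

# Keys copied from the listing above.
keys = [
    "capstone/nn_250000/test-2.csv",
    "capstone/nn_250000/test-2.csv.out",
    "capstone/nn_250000/train.csv",
    "capstone/nn_250000/val.csv",
]

for key in filter_keys(keys, "capstone/nn_250000"):
    print(key)
```

With boto3, `Bucket(bucket).objects.filter(Prefix=prefix)` would additionally restrict the listing on the server side instead of iterating over every object in the bucket.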


### Deep learning model

Next I wanted to evaluate the performance of a deep neural network.

After several trials, I settled on the following model.

* Input layer
    * Fully connected layer
    * Tanh activation function
    * Batch normalization
    * Dropout

* 3 hidden layers
    * Fully connected layer
    * Tanh activation function
    * Batch normalization
    * Dropout

* Output layer
    * Fully connected layer (with 1 output node)
    * Sigmoid activation function
    
The cells below show the code of the model and training.
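
To get a feeling for the size of this architecture, the number of trainable parameters can be estimated by hand. The sketch below is not taken from `model.py`; it assumes that every fully connected layer has a bias, that every batch normalization layer has a learnable scale and shift per unit, and that all hidden layers share the same width. Tanh and dropout add no parameters. The input size of 30 features is purely illustrative.

```python
def count_parameters(input_dim, hidden_dim, n_hidden=3):
    """Estimate trainable parameters for the architecture listed above."""
    def linear(n_in, n_out):
        return n_in * n_out + n_out      # weights + biases

    def batch_norm(n):
        return 2 * n                     # learnable scale and shift

    total = linear(input_dim, hidden_dim) + batch_norm(hidden_dim)  # input block
    for _ in range(n_hidden):                                       # hidden blocks
        total += linear(hidden_dim, hidden_dim) + batch_norm(hidden_dim)
    total += linear(hidden_dim, 1)                                  # output layer
    return total

# Illustrative sizes only: 30 input features, 300 units per hidden layer.
print(count_parameters(input_dim=30, hidden_dim=300))
```

Under these assumptions the hidden-to-hidden layers dominate the count, so the hidden layer size is the hyperparameter with the largest effect on model capacity.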

In [7]:
!pygmentize source/model.py

import numpy as np

import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    
    def __init__(self, input_dim, hidden_dim, init_weights='none', dropout_rate=0.5):
        '''Defines layers of a neural network.
           :param input_dim: Number of input features
           :param hidden
           :param hidden_dim: Size of hidden layer(s)
           :para

In [8]:
!pygmentize source/train.py

from __future__ import print_function # future proof
import argparse
import boto3
import sys
import os
import json
import re

import numpy as np
import pandas as pd
import joblib

# pytorch
import torch
import torch.nn as nn
import torch.optim as opt

#### Initial PyTorch estimator

Before diving into hyperparameter tuning, let's start by training an NN model using the default parameters.

In [10]:
# import a PyTorch wrapper
from sagemaker.pytorch import PyTorch

# specify an output path
pyt_output_path = "s3://{}/{}".format(bucket, prefix)
print("S3 locations: output path {}".format(pyt_output_path))

# location of the sklearn scalers uploaded earlier (requires pyt_output_path)
scalers_config_location = '{}/scalers.pkl'.format(pyt_output_path)

S3 locations: output path s3://sagemaker-eu-west-1-019518462631/capstone/nn_250000


In [10]:
%%time

# Instantiate a pytorch estimator using default hyperparameters.
pyt_estimator_initial = PyTorch(entry_point='train.py',
                                source_dir='source',
                                role=role,
                                train_instance_count=1,
                                train_instance_type=train_instance_type,
                                output_path=pyt_output_path,
                                sagemaker_session=sagemaker_session,
                                framework_version='1.4',
                                enable_sagemaker_metrics=True,
                                hyperparameters={
                                    'input-dim': test_X.shape[1],
                                    'scalers-config': scalers_config_location
                                })

pyt_estimator_initial.fit({'train': s3_input_train, 'validation': s3_input_val})


'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.


2020-07-19 13:13:39 Starting - Starting the training job...
2020-07-19 13:13:41 Starting - Launching requested ML instances.........
2020-07-19 13:15:13 Starting - Preparing the instances for training.........
2020-07-19 13:16:47 Downloading - Downloading input data...
2020-07-19 13:17:30 Training - Downloading the training image...........
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2020-07-19 13:19:21,137 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training
2020-07-19 13:19:21,161 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2020-07-19 13:19:21,802 sagemaker_pytorch_container.training INFO     Invoking user training script.
2020-07-19 13:19:22,093 sagemaker-containers INFO     Module default_user_module_name does not provide a setup.py.
Generating setup.py
2020-07-19 13:19:22

In [11]:
%%time

# Create transformer object
transformer = pyt_estimator_initial.transformer(instance_count=1, instance_type=deploy_instance_type, output_path=pyt_output_path)

# Perform batch transform on test set
transformer.transform(test_path, content_type='text/csv', split_type='Line', wait=True)

# Download the output file
if LOCAL_MODE:
    output_file = transformer.output_path + '/' + transformer.latest_transform_job.job_name + '/' + test_file + '.out'
else:
    output_file = transformer.output_path + '/' + test_file + '.out'

# Output file is downloaded to the notebook.
!aws s3 cp $output_file $data_dir
!sed -i -e 's/\]\[/, /g' $data_dir/test-2.csv.out

Parameter image will be renamed to image_uri in SageMaker Python SDK v2.
'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.


..........................................!
download: s3://sagemaker-eu-west-1-019518462631/capstone/nn_250000/test-2.csv.out to ../data/250000/test-2.csv.out
CPU times: user 289 ms, sys: 50.1 ms, total: 339 ms
Wall time: 3min 32s
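
The `sed` command joins the JSON arrays in the transform output (presumably written back to back as `[...][...]`) into a single array, so that the file can be parsed in one call. A rough Python equivalent is sketched below; `merge_json_arrays` is a hypothetical helper and the sample string is made up for illustration.

```python
import json

def merge_json_arrays(text):
    """Join back-to-back JSON arrays ('[..][..]') into one list."""
    # Same substitution as the sed expression 's/\]\[/, /g'.
    merged = text.strip().replace("][", ", ")
    return json.loads(merged)

sample = "[[0.91], [0.12]][[0.55]]"   # hypothetical transform output
print(merge_json_arrays(sample))
```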


In [12]:
# Process the predictions and calculate the metrics.
basePath = os.path.abspath(data_dir)
predictions_file = os.path.join(basePath, test_file+'.out')
predictions = pd.read_json(predictions_file)
predictions = [round(num) for num in predictions.squeeze().values]

print("Accuracy: {:4f}".format(accuracy_score(test_y, predictions)))
print("Recall:   {:4f}".format(recall_score(test_y, predictions)))
print("ROC AUC:  {:4f}".format(roc_auc_score(test_y, predictions)))

Accuracy: 0.668853
Recall:   0.727840
ROC AUC:  0.665635


The initial model obtained an AUC of **0.665635**, which is slightly worse than the tuned XGBoost model.

After training the NN model with the defaults, let's use a set of hyperparameters that performed well on a smaller data set (50,000 items).

* Batch size: 300
* Size of hidden layers: 300
* Epochs: 100 (if validation loss does not reduce for 10 epochs, training is stopped).
* Dropout rate: 0.5
* Learning rate: 0.001
* Optimizer: ADAM
* Weight initialization: Uniform
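
The stopping rule mentioned for the epochs can be sketched as follows. This is not the code from `train.py`; `early_stopping_epoch` and the loss values are hypothetical and only illustrate the patience logic.

```python
def early_stopping_epoch(val_losses, patience=10):
    """Return (stop_epoch, best_epoch), both 1-based.

    Training stops once the validation loss has not improved for
    `patience` consecutive epochs, or when the losses run out.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch   # a checkpoint would be saved here
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_losses), best_epoch

# Made-up validation losses; a patience of 3 keeps the example short.
losses = [0.70, 0.65, 0.63, 0.64, 0.66, 0.65, 0.67]
print(early_stopping_epoch(losses, patience=3))
```

With a rule like this, the epoch count of 100 acts as an upper bound rather than a value that needs careful tuning.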

In [13]:
%%time

scalers_config_location = '{}/scalers.pkl'.format(pyt_output_path)

# Instantiate a PyTorch estimator using the hyperparameters listed above.
pyt_estimator = PyTorch(entry_point='train.py',
                        source_dir='source',
                        role=role,
                        train_instance_count=1,
                        train_instance_type=train_instance_type,
                        output_path=pyt_output_path,
                        sagemaker_session=sagemaker_session,
                        framework_version='1.4',
                        enable_sagemaker_metrics=True,
                        hyperparameters={
                            'input-dim': test_X.shape[1],
                            'scalers-config': scalers_config_location,
                            'batch-size': 300,
                            'hidden-dim': 300,
                            'epochs': 100,
                            'dropout-rate': 0.5,
                            'lr': 0.001,
                            'optimizer': 'adam',
                            'weigth-init': 'uniform'
                        })

pyt_estimator.fit({'train': s3_input_train, 'validation': s3_input_val})


'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.


2020-07-19 13:43:12 Starting - Starting the training job...
2020-07-19 13:43:14 Starting - Launching requested ML instances......
2020-07-19 13:44:20 Starting - Preparing the instances for training.........
2020-07-19 13:45:55 Downloading - Downloading input data...
2020-07-19 13:46:15 Training - Downloading the training image...........
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2020-07-19 13:48:25,149 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training
2020-07-19 13:48:25,173 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.

2020-07-19 13:48:24 Training - Training image download completed. Training in progress.
2020-07-19 13:48:31,430 sagemaker_pytorch_container.training INFO     Invoking user training script.
2020-07-19 13:48:31,718 sagemaker-containers INFO     Module default_user_module_name do

In [8]:
# code to evaluate the endpoint on test data
# returns a variety of model metrics
def evaluate(predictor, test_features, test_labels, verbose=True):
    """
    Evaluate a model on a test set given the prediction endpoint.  
    Return binary classification metrics.
    :param predictor: A prediction endpoint
    :param test_features: Test features
    :param test_labels: Class labels for test data
    :param verbose: If True, prints a table of all performance metrics
    :return: A dictionary of performance metrics.
    """
    
    # rounding and squeezing array
    test_preds = np.squeeze(np.round(predictor.predict(test_features)))
    
    # calculate true positives, false positives, true negatives, false negatives
    tp = np.logical_and(test_labels, test_preds).sum()
    fp = np.logical_and(1-test_labels, test_preds).sum()
    tn = np.logical_and(1-test_labels, 1-test_preds).sum()
    fn = np.logical_and(test_labels, 1-test_preds).sum()
    
    # calculate binary classification metrics
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    auc = roc_auc_score(test_labels, test_preds)
    
    # print metrics
    if verbose:
        print(pd.crosstab(test_labels, test_preds, rownames=['actuals'], colnames=['predictions']))
        print("\n{:<11} {:.4f}".format('Recall:', recall))
        print("{:<11} {:.4f}".format('Precision:', precision))
        print("{:<11} {:.4f}".format('Accuracy:', accuracy))
        print("{:<11} {:.4f}".format('AUC:', auc))
        print()
                
    return {'TP': tp, 'FP': fp, 'FN': fn, 'TN': tn, 
            'Precision': precision, 'Recall': recall, 'Accuracy': accuracy, 'AUC': auc}


In [36]:
%%time

from sagemaker.pytorch import PyTorchModel

# Create a model from the trained estimator data and point to the prediction script
model = PyTorchModel(model_data=pyt_estimator.model_data,
                     entry_point='predict.py',
                     source_dir='source',
                     role=role,
                     framework_version='1.4')

predictor = model.deploy(initial_instance_count=1, instance_type=deploy_instance_type)

Parameter image will be renamed to image_uri in SageMaker Python SDK v2.
'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.


---------------!CPU times: user 446 ms, sys: 35.8 ms, total: 482 ms
Wall time: 7min 32s


In [37]:
metrics = evaluate(predictor, test_X, test_y, True)

predictions    0.0    1.0
actuals                  
0            10915   6865
1             5205  14515

Recall:     0.7361
Precision:  0.6789
Accuracy:   0.6781
AUC:        0.6750



The model achieves an AUC of **0.675**, which is on par with a tuned XGBoost model.
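
Because `evaluate` rounds the predictions to 0 or 1 before calling `roc_auc_score`, the reported AUC is computed from hard labels. In that case the ROC curve has a single operating point and the AUC reduces to the mean of the true positive rate and the true negative rate (the balanced accuracy). The sketch below recomputes the metrics from the confusion matrix printed above; `metrics_from_confusion` is a hypothetical helper, not part of the notebook.

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Binary classification metrics from confusion-matrix counts."""
    recall = tp / (tp + fn)                    # true positive rate
    specificity = tn / (tn + fp)               # true negative rate
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    hard_label_auc = (recall + specificity) / 2
    return {"Recall": recall, "Precision": precision,
            "Accuracy": accuracy, "AUC": hard_label_auc}

# Counts taken from the crosstab above.
for name, value in metrics_from_confusion(tn=10915, fp=6865, fn=5205, tp=14515).items():
    print("{:<11} {:.4f}".format(name + ":", value))
```

A threshold-independent AUC would require passing the raw sigmoid outputs to `roc_auc_score` instead of the rounded ones.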

In [38]:
predictor.delete_endpoint()
model.delete_model()

#### Tuning Hyperparameters of PyTorch model

Let's create a new PyTorch model. The previously trained model already improved upon the initial model with default values.
Earlier training on a smaller data set showed that the ADAM optimizer and a uniform weight initialization gave better results than the alternatives (e.g. SGD), so these hyperparameters are kept fixed.
The number of epochs does not need to be tuned either: the training script saves the model with the lowest validation loss and stops automatically if the validation loss does not decrease for 10 epochs.

For the other hyperparameters, ranges are defined around the values used in the previous training. For hyperparameter tuning 20 different combinations

In [11]:
from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner

# Instantiate a pytorch estimator
pyt_hyper_estimator = PyTorch(entry_point='train.py',
                              source_dir='source',
                              role=role,
                              train_instance_count=1,
                              train_instance_type=train_instance_type,
                              output_path=pyt_output_path,
                              sagemaker_session=sagemaker_session,
                              framework_version='1.4',
                              hyperparameters={
                                  'input-dim': test_X.shape[1],
                                  'scalers-config': scalers_config_location,
                                  'epochs': 100,
                                  'optimizer': 'adam',
                                  'weigth-init': 'uniform',
                                  'batch-size': 300,
                                  'hidden-dim': 300,
                                  'dropout-rate': 0.5,
                                  'lr': 0.001})

# Define hyperparameter ranges
hyperparameter_ranges = {
    'batch-size': CategoricalParameter([300, 512, 768, 1024]),
    'hidden-dim': CategoricalParameter([test_X.shape[1]*30, test_X.shape[1]*35, test_X.shape[1]*40, test_X.shape[1]*45, test_X.shape[1]*50]),
    'dropout-rate': CategoricalParameter([0.3, 0.4, 0.5, 0.6, 0.7]),
    'lr': CategoricalParameter([0.0001, 0.0003, 0.0005, 0.001]),
}

# Define objective metric
objective_metric_name = 'average validation loss'
objective_type = 'Minimize'
metric_definitions = [{'Name': 'average validation loss',
                       'Regex': 'Validation: Average loss: ([0-9\\.]+)'}]

# Create tuner
pyt_tuner = HyperparameterTuner(pyt_hyper_estimator,
                                objective_metric_name,
                                hyperparameter_ranges,
                                metric_definitions,
                                max_jobs=30,
                                max_parallel_jobs=4,
                                objective_type=objective_type)
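
SageMaker extracts the objective metric by applying the `Regex` of the metric definition to the training log. The pattern can be checked locally before starting a tuning job. The log line below is a hypothetical example that matches the pattern, since the actual output of `train.py` is not shown here.

```python
import re

metric_regex = 'Validation: Average loss: ([0-9\\.]+)'

# Hypothetical log line; the real format is defined in train.py.
log_line = 'Epoch 12 Validation: Average loss: 0.5931'

match = re.search(metric_regex, log_line)
if match:
    # The first capture group holds the metric value.
    print(float(match.group(1)))
```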

In [12]:
%%time

import time

pyt_tuner.fit({'train': s3_input_train, 'validation': s3_input_val})

pyt_tuning_job_info = sagemaker_session.sagemaker_client.describe_hyper_parameter_tuning_job(HyperParameterTuningJobName=pyt_tuner.latest_tuning_job.job_name)

while pyt_tuning_job_info['HyperParameterTuningJobStatus'] == 'InProgress':
    print("JobStatus: {}".format(pyt_tuning_job_info['TrainingJobStatusCounters']))
    print("ObjectiveStatus: {}".format(pyt_tuning_job_info['ObjectiveStatusCounters']))
    time.sleep(10 * 60)   
    pyt_tuning_job_info = sagemaker_session.sagemaker_client.describe_hyper_parameter_tuning_job(HyperParameterTuningJobName=pyt_tuner.latest_tuning_job.job_name)


'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.


JobStatus: {'Completed': 0, 'InProgress': 0, 'RetryableError': 0, 'NonRetryableError': 0, 'Stopped': 0}
ObjectiveStatus: {'Succeeded': 0, 'Pending': 0, 'Failed': 0}
JobStatus: {'Completed': 0, 'InProgress': 4, 'RetryableError': 0, 'NonRetryableError': 0, 'Stopped': 0}
ObjectiveStatus: {'Succeeded': 0, 'Pending': 4, 'Failed': 0}
JobStatus: {'Completed': 4, 'InProgress': 4, 'RetryableError': 0, 'NonRetryableError': 0, 'Stopped': 0}
ObjectiveStatus: {'Succeeded': 4, 'Pending': 4, 'Failed': 0}
JobStatus: {'Completed': 8, 'InProgress': 4, 'RetryableError': 0, 'NonRetryableError': 0, 'Stopped': 0}
ObjectiveStatus: {'Succeeded': 8, 'Pending': 4, 'Failed': 0}
JobStatus: {'Completed': 8, 'InProgress': 4, 'RetryableError': 0, 'NonRetryableError': 0, 'Stopped': 0}
ObjectiveStatus: {'Succeeded': 8, 'Pending': 4, 'Failed': 0}
JobStatus: {'Completed': 12, 'InProgress': 4, 'RetryableError': 0, 'NonRetryableError': 0, 'Stopped': 0}
ObjectiveStatus: {'Succeeded': 12, 'Pending': 4, 'Failed': 0}
JobStatu
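
All four tuned hyperparameters are categorical, so the size of the search space follows directly from the number of candidate values in `hyperparameter_ranges`. The sketch below is only a back-of-the-envelope calculation, with the counts copied by hand from the tuner cell; it shows that the 30 training jobs can cover only a small part of the space.

```python
from math import prod

# Number of candidate values per tuned hyperparameter, as defined in the tuner cell.
range_sizes = {'batch-size': 4, 'hidden-dim': 5, 'dropout-rate': 5, 'lr': 4}

total_combinations = prod(range_sizes.values())
max_jobs = 30

print("Combinations: {}".format(total_combinations))
print("Covered by {} jobs: {:.1%}".format(max_jobs, max_jobs / total_combinations))
```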

### Test the model

Before training, part of the data set was set aside for the final evaluation of the model. The best model from the tuning job is deployed to an endpoint and evaluated on that test set.

In [13]:
%%time

from sagemaker.pytorch import PyTorchModel

# Create a model from the trained estimator data and point to the prediction script
model = PyTorchModel(model_data=pyt_tuner.best_estimator().model_data,
                     entry_point='predict.py',
                     source_dir='source',
                     role=role,
                     framework_version='1.4')

predictor = model.deploy(initial_instance_count=1, instance_type=deploy_instance_type)



2020-07-19 16:33:26 Starting - Preparing the instances for training
2020-07-19 16:33:26 Downloading - Downloading input data
2020-07-19 16:33:26 Training - Training image download completed. Training in progress.
2020-07-19 16:33:26 Uploading - Uploading generated training model
2020-07-19 16:33:26 Completed - Training job completed
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2020-07-19 16:24:20,723 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training
2020-07-19 16:24:20,725 sagemaker-containers INFO     Failed to parse hyperparameter _tuning_objective_metric value average validation loss to Json.
Returning the value itself
2020-07-19 16:24:20,748 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2020-07-19 16:24:20,753 sagemaker_pytorch_container.training INFO     Invoking user training scri



---------------!CPU times: user 826 ms, sys: 38.9 ms, total: 865 ms
Wall time: 7min 33s


In [14]:
metrics = evaluate(predictor, test_X, test_y, True)

predictions    0.0    1.0
actuals                  
0            11067   6713
1             5062  14658

Recall:     0.7433
Precision:  0.6859
Accuracy:   0.6860
AUC:        0.6829



---