# Higgs Boson - Collision event classification

## Deep Learning Model - Retrain tuned model

### Upload data to S3

The data files have already been prepared locally on the storage of the SageMaker notebook instance. Next, we'll upload them to the default S3 bucket.

In [1]:
# import libraries
import pandas as pd
import numpy as np

from sklearn.metrics import accuracy_score, recall_score, roc_auc_score

import os
import time
import boto3
import sagemaker
from sagemaker import get_execution_role

In [2]:
# SageMaker session and role
LOCAL_MODE = False
if LOCAL_MODE:
    # During model development it's more efficient to use SageMaker local mode.
    # https://aws.amazon.com/blogs/machine-learning/use-the-amazon-sagemaker-local-mode-to-train-on-your-notebook-instance/
    sagemaker_session = sagemaker.LocalSession()

    # If notebook instance does not have GPUs, change to 'local'.
    train_instance_type = 'local'
    deploy_instance_type = 'local'
    print('Running in local mode.')
else:
    sagemaker_session = sagemaker.Session()
    train_instance_type = 'ml.p2.xlarge'
    deploy_instance_type = 'ml.m5.large'
    print('Running in remote mode.')

role = sagemaker.get_execution_role()

# Use the default SageMaker S3 bucket
bucket = sagemaker_session.default_bucket()

# Prefix for objects in S3 bucket
prefix = 'capstone/nn_1000000'

# Location of data files on notebook storage.
data_dir = '../data/1000000'


Running in remote mode.


In [3]:
# Actual upload of the data files.
train_path = sagemaker_session.upload_data(data_dir+'/train.csv', bucket=bucket, key_prefix=prefix)
val_path = sagemaker_session.upload_data(data_dir+'/val.csv', bucket=bucket, key_prefix=prefix)

s3_input_train = sagemaker.s3_input(s3_data=train_path, content_type='csv')
s3_input_val = sagemaker.s3_input(s3_data=val_path, content_type='csv')

print("S3 locations: train {} / validation {}".format(train_path, val_path))

's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.


S3 locations: train s3://sagemaker-eu-west-1-019518462631/capstone/nn_1000000/train.csv / validation s3://sagemaker-eu-west-1-019518462631/capstone/nn_1000000/val.csv


In [4]:
# Read the test data.
test_df = pd.read_csv(os.path.join(data_dir, 'test.csv'), header=None)

# Separate labels and features
test_y = test_df.iloc[:,0]
test_X = test_df.iloc[:, 1:]

# Dump the test features and upload to S3.
test_file = 'test-2.csv'
pd.DataFrame(test_X).to_csv(os.path.join(data_dir, test_file), header=False, index=False)
test_path = sagemaker_session.upload_data(os.path.join(data_dir, test_file), bucket=bucket, key_prefix=prefix)

print("S3 locations: test {}".format(test_path))

S3 locations: test s3://sagemaker-eu-west-1-019518462631/capstone/nn_1000000/test-2.csv


In [5]:
# Upload sklearn scalers model.
scalers_path = "s3://{}/{}/scalers.pkl".format(bucket, prefix)
!aws s3 cp $data_dir/scalers.pkl $scalers_path

print("S3 locations: scalers {}".format(scalers_path))

upload: ../data/1000000/scalers.pkl to s3://sagemaker-eu-west-1-019518462631/capstone/nn_1000000/scalers.pkl
S3 locations: scalers s3://sagemaker-eu-west-1-019518462631/capstone/nn_1000000/scalers.pkl


In [6]:
# Check if files have been uploaded
for obj in boto3.resource('s3').Bucket(bucket).objects.filter(Prefix=prefix):
    if obj.key.endswith('.csv'):
        print(obj.key)

capstone/nn_1000000/test-2.csv
capstone/nn_1000000/train.csv
capstone/nn_1000000/val.csv


### Deep learning model

Next, I wanted to evaluate the performance of a deep neural network.

After several trials, I settled on the following model:

* Input layer
    * Fully connected layer
    * Tanh activation function
    * Batch normalization
    * Dropout

* 3 hidden layers
    * Fully connected layer
    * Tanh activation function
    * Batch normalization
    * Dropout

* Output layer
    * Fully connected layer (with 1 output node)
    * Sigmoid activation function
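
As a rough illustration of the architecture listed above, a minimal PyTorch sketch could look as follows. It is not the contents of `source/model.py`: the class name `SketchNet`, the `n_hidden` argument, the use of `nn.Sequential`, and the input size of 28 in the demo are assumptions made for this example, and weight initialization is left out.

```python
import torch
import torch.nn as nn


class SketchNet(nn.Module):
    """Illustrative binary classifier: input block, hidden blocks, sigmoid output."""

    def __init__(self, input_dim, hidden_dim, n_hidden=3, dropout_rate=0.5):
        super().__init__()

        def block(in_features, out_features):
            # Fully connected -> Tanh -> batch normalization -> dropout
            return [
                nn.Linear(in_features, out_features),
                nn.Tanh(),
                nn.BatchNorm1d(out_features),
                nn.Dropout(dropout_rate),
            ]

        layers = block(input_dim, hidden_dim)        # input layer
        for _ in range(n_hidden):                    # hidden layers
            layers += block(hidden_dim, hidden_dim)
        layers += [nn.Linear(hidden_dim, 1), nn.Sigmoid()]  # output layer
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)


# Demo with an assumed input size of 28 features and random data.
model = SketchNet(input_dim=28, hidden_dim=680, dropout_rate=0.3)
model.eval()  # disable dropout and use running batch-norm statistics
with torch.no_grad():
    probs = model(torch.randn(4, 28))
print(probs.shape)  # torch.Size([4, 1])
```

Each output is a value between 0 and 1 that can be rounded to obtain a class label.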
    
The cells below show the code of the model and training.

In [7]:
!pygmentize source/model.py

import numpy as np

import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):

    def __init__(self, input_dim, hidden_dim, init_weights='none', dropout_rate=0.5):
        '''Defines layers of a neural network.
           :param input_dim: Number of input features
           :param hidden
           :param hidden_dim: Size of hidden layer(s)
           :para

In [8]:
!pygmentize source/train.py

from __future__ import print_function # future proof
import argparse
import boto3
import sys
import os
import json
import re

import numpy as np
import pandas as pd
import joblib

# pytorch
import torch
import torch.nn as nn
import torch.optim as opt

#### PyTorch estimator

Let's retrain the PyTorch classifier on a data set of 1,000,000 items, using the best hyperparameters found during tuning.

* Batch size: 300
* Size of hidden layers: 680
* Epochs: 100 (training stops early if the validation loss does not decrease for 10 epochs)
* Dropout rate: 0.3
* Learning rate: 0.001
* Optimizer: Adam
* Weight initialization: Uniform
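
The early-stopping rule in the list above (stop when the validation loss has not decreased for 10 epochs) can be sketched independently of the training script. The class below is a hypothetical helper written for illustration, not code taken from `train.py`; the name `EarlyStopping`, its `patience` argument, and the loss values are assumptions.

```python
class EarlyStopping:
    """Signal a stop once the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best_loss = float("inf")
        self.epochs_without_improvement = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience


# Made-up losses: improvement for three epochs, then a plateau.
losses = [0.70, 0.65, 0.62] + [0.63] * 12
stopper = EarlyStopping(patience=10)
stopped_at = None
for epoch, loss in enumerate(losses, start=1):
    if stopper.step(loss):
        stopped_at = epoch
        break
print(stopped_at)  # 13
```

The best loss is reached in epoch 3, so the tenth epoch without improvement is epoch 13, well before the 100-epoch budget is used up.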

In [9]:
# import a PyTorch wrapper
from sagemaker.pytorch import PyTorch

# specify an output path
pyt_output_path = "s3://{}/{}".format(bucket, prefix)
print("S3 locations: output path {}".format(pyt_output_path))

scalers_config_location = '{}/scalers.pkl'.format(pyt_output_path)


S3 locations: output path s3://sagemaker-eu-west-1-019518462631/capstone/nn_1000000


In [10]:
%%time

# Instantiate a PyTorch estimator using the tuned hyperparameters.
pyt_estimator = PyTorch(entry_point='train.py',
                        source_dir='source',
                        role=role,
                        train_instance_count=1,
                        train_instance_type=train_instance_type,
                        output_path=pyt_output_path,
                        sagemaker_session=sagemaker_session,
                        framework_version='1.4',
                        enable_sagemaker_metrics=True,
                        hyperparameters={
                            'input-dim': test_X.shape[1],
                            'scalers-config': scalers_config_location,
                            'batch-size': 300,
                            'hidden-dim': 680,
                            'epochs': 100,
                            'dropout-rate': 0.3,
                            'lr': 0.001,
                            'optimizer': 'adam',
                            'weigth-init': 'uniform'
                        })

pyt_estimator.fit({'train': s3_input_train, 'validation': s3_input_val})


'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.


2020-07-21 08:33:56 Starting - Starting the training job...
2020-07-21 08:33:57 Starting - Launching requested ML instances......
2020-07-21 08:35:13 Starting - Preparing the instances for training.........
2020-07-21 08:36:48 Downloading - Downloading input data......
2020-07-21 08:37:38 Training - Downloading the training image.........
2020-07-21 08:39:13 Training - Training image download completed. Training in progress.
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2020-07-21 08:39:15,093 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training
2020-07-21 08:39:15,118 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2020-07-21 08:39:16,558 sagemaker_pytorch_container.training INFO     Invoking user training script.
2020-07-21 08:39:16,929 sagemaker-containers INFO     Module default_user_module_name d

### Test the model

Before training, part of the data set was set aside for the final evaluation of the model. The evaluation below sends this test set to a deployed endpoint.

In [41]:
# code to evaluate the endpoint on test data
# returns a variety of model metrics
def evaluate(predictor, test_features, test_labels, verbose=True):
    """
    Evaluate a model on a test set given the prediction endpoint.  
    Return binary classification metrics.
    :param predictor: A prediction endpoint
    :param test_features: Test features
    :param test_labels: Class labels for test data
    :param verbose: If True, prints a table of all performance metrics
    :return: A dictionary of performance metrics.
    """
    
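    # Run predictions. Testing showed that the maximum payload for the predictor
    # is 25,000 items, so the test set is split and the results are joined back.
    test_preds = []
    # np.array_split needs an integer number of sections (at least one).
    # Chunks can be somewhat larger than 20,000 rows, so check them against the limit.
    nr_subarrays = max(1, test_features.shape[0] // 20_000)
    test_features_split = np.array_split(test_features, nr_subarrays)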
    for subarray in test_features_split:
        print("prediction...")
        test_preds.append(np.squeeze(np.round(predictor.predict(subarray))))
        
    test_preds = np.concatenate(test_preds)
    
    # calculate true positives, false positives, true negatives, false negatives
    tp = np.logical_and(test_labels, test_preds).sum()
    fp = np.logical_and(1-test_labels, test_preds).sum()
    tn = np.logical_and(1-test_labels, 1-test_preds).sum()
    fn = np.logical_and(test_labels, 1-test_preds).sum()
    
    # calculate binary classification metrics
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    auc = roc_auc_score(test_labels, test_preds)
    
    # print metrics
    if verbose:
        print(pd.crosstab(test_labels, test_preds, rownames=['actuals'], colnames=['predictions']))
        print("\n{:<11} {:.4f}".format('Recall:', recall))
        print("{:<11} {:.4f}".format('Precision:', precision))
        print("{:<11} {:.4f}".format('Accuracy:', accuracy))
        print("{:<11} {:.4f}".format('AUC:', auc))
        print()
                
    return {'TP': tp, 'FP': fp, 'FN': fn, 'TN': tn, 
            'Precision': precision, 'Recall': recall, 'Accuracy': accuracy, 'AUC': auc}
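
The splitting step in `evaluate` can be looked at in isolation. The sketch below uses a hypothetical helper, `chunk_bounds`, that relies on ceiling division so that no chunk exceeds the chosen size. It is one possible way to respect a payload limit, not code from this notebook; the row count of 150,000 matches the total of the confusion matrix shown further down.

```python
import math


def chunk_bounds(n_rows, max_chunk_size):
    """Return (start, stop) row ranges covering n_rows, each at most max_chunk_size long."""
    n_chunks = max(1, math.ceil(n_rows / max_chunk_size))
    base, extra = divmod(n_rows, n_chunks)
    bounds, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional row.
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


bounds = chunk_bounds(150_000, 20_000)
print(len(bounds))                                   # 8
print(max(stop - start for start, stop in bounds))   # 18750
```

With a pandas DataFrame, each pair could then be used as `test_features.iloc[start:stop]` before calling the predictor.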


In [24]:
%%time

from sagemaker.pytorch import PyTorchModel

# Create a model from the trained estimator data and point to the prediction script
model = PyTorchModel(model_data=pyt_estimator.model_data,
                     entry_point='predict.py',
                     source_dir='source',
                     role=role,
                     framework_version='1.4')

predictor = model.deploy(initial_instance_count=1, instance_type=deploy_instance_type)

Parameter image will be renamed to image_uri in SageMaker Python SDK v2.
'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.


---------------!
CPU times: user 868 ms, sys: 41.8 ms, total: 910 ms
Wall time: 7min 33s


In [43]:
metrics = evaluate(predictor, test_X, test_y, True)

prediction...
prediction...
prediction...
prediction...
prediction...
prediction...
prediction...
predictions    0.0    1.0
actuals                  
0            45691  24559
1            20140  59610

Recall:     0.7475
Precision:  0.7082
Accuracy:   0.7020
AUC:        0.6989



The model achieves an AUC score of **0.6989**.
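
The reported figures can be re-derived from the four counts in the confusion matrix above. The check below also illustrates a property of the evaluation: because `evaluate` rounds the predictions to 0/1 before calling `roc_auc_score`, the ROC curve has a single operating point, and the AUC reduces to the mean of the true positive rate and the true negative rate. The variable names are chosen for this example.

```python
# Counts taken from the confusion matrix printed above.
tn, fp = 45691, 24559   # actual 0: predicted 0 / predicted 1
fn, tp = 20140, 59610   # actual 1: predicted 0 / predicted 1

recall = tp / (tp + fn)               # true positive rate
specificity = tn / (tn + fp)          # true negative rate
precision = tp / (tp + fp)
accuracy = (tp + tn) / (tp + fp + tn + fn)

# With a single operating point, the ROC curve is two line segments and the
# area under it is the mean of the true positive and true negative rates.
auc_from_labels = (recall + specificity) / 2

print(f"Recall:    {recall:.4f}")            # 0.7475
print(f"Precision: {precision:.4f}")         # 0.7082
print(f"Accuracy:  {accuracy:.4f}")          # 0.7020
print(f"AUC:       {auc_from_labels:.4f}")   # 0.6989
```

Passing the unrounded endpoint outputs to `roc_auc_score` would give a threshold-independent AUC, which can differ from this value.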

### Deploying the model

In [50]:
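# Clean up: delete the endpoint and the model once they are no longer needed
# (the REST API described below depends on the endpoint).
predictor.delete_endpoint()
model.delete_model()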

---

### REST API deployment

In [46]:
# Code for using predictor from Lambda
print(predictor.endpoint)

pytorch-inference-2020-07-21-09-32-04-914


The cell below shows the code of the Lambda function, which should be created with the following properties:

* Runtime: Python 3.6
* Memory size: 1024 MB
* Timeout: 29 seconds
* Environment variable: SM_ENDPOINT set to the output of the above cell (`predictor.endpoint`)

In [49]:
!pygmentize lambda/lambda.py

import os
import boto3

def lambda_handler(event, context):

    data = event['body']

    # The SageMaker runtime is what allows us to invoke the endpoint that we've created.
    runtime = boto3.Session().client('sagemaker-runtime')

    # Now we use the SageMaker runtime to invoke our endpoint, sending the data we were given
    response = runtime.invoke_endpoint(EndpointName=os.environ['SM_ENDPOINT'],
                                       ContentType='text/csv',
                                       Accept='text/csv',
                                       Body=data)

    # The response is an HTTP r

After the Lambda function is created, a REST API that uses it with proxy integration should be created.
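
The handler listing above is cut off, so its remaining steps are not shown. As a hedged illustration only, the sketch below shows one way a handler could read the endpoint response and return it in the shape that Lambda proxy integration expects (`statusCode`, `headers`, `body`). The function `handle_request`, the injected `runtime_client` argument, the endpoint name, and the `FakeRuntime` stand-in are assumptions for this example, chosen so that it runs without AWS access; in a real function the client would come from `boto3.client('sagemaker-runtime')`.

```python
import io


def handle_request(event, runtime_client, endpoint_name):
    """Forward the CSV request body to a SageMaker endpoint and wrap the reply."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Accept="text/csv",
        Body=event["body"],
    )
    # The response body is a stream of bytes; decode it to text.
    result = response["Body"].read().decode("utf-8")
    # Proxy integration expects a status code, headers, and a string body.
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "text/csv"},
        "body": result,
    }


class FakeRuntime:
    """Stand-in for the SageMaker runtime client, used only for this demo."""

    def invoke_endpoint(self, EndpointName, ContentType, Accept, Body):
        return {"Body": io.BytesIO(b"0.73")}


reply = handle_request({"body": "0.1,0.2,0.3"}, FakeRuntime(), "my-endpoint")
print(reply["statusCode"], reply["body"])  # 200 0.73
```

Injecting the client keeps the request logic testable locally; the Lambda entry point would only need to create the client and read `SM_ENDPOINT` from the environment.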

---