# Amazon SageMaker

Amazon SageMaker is a managed service that provides data scientists with the ability to build, train, and deploy machine learning (ML) models quickly. 

We'll use Amazon SageMaker automatic model tuning, also known as hyperparameter tuning. It searches for the best version of a model by running many training jobs on the HAR dataset, using the algorithm and the hyperparameter ranges that we specify. It then selects the hyperparameter values that yield the best-performing model, as measured by the tuning objective. We evaluate the final model with the macro-averaged F1 score.

## Data Loading
Import the libraries required to build the SageMaker model.

In [1]:
import json
import math
import os
import sys
from time import gmtime, sleep, strftime

import boto3
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sagemaker
import scipy as sp
from sagemaker.amazon.amazon_estimator import get_image_uri
from sklearn.metrics import (accuracy_score, auc, classification_report,
                             confusion_matrix, f1_score, roc_auc_score,
                             roc_curve)
from sklearn.model_selection import GroupKFold, GroupShuffleSplit, LeaveOneGroupOut
from xgboost import XGBClassifier

%matplotlib inline

# Make the project's helper module importable from ../src/
sys.path.insert(1, '../src/')
from utils import load_dataset_data

Set up a connection to Amazon SageMaker and S3.

In [2]:
# Credentials are resolved by boto3's default chain (environment variables,
# shared credentials file, or an attached IAM role); never hardcode access keys.
session = boto3.Session(region_name='us-east-1')
client = session.client('sagemaker')
role = 'arn:aws:iam::636239945976:role/service-role/AmazonSageMaker-ExecutionRole-20201011T120068'

In [3]:
sagemaker_session = sagemaker.Session(boto_session=session)
bucket = sagemaker_session.default_bucket()
prefix = 'hardataset1'

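We load the data and create a single, fixed train/validation split, because SageMaker automatic model tuning does not perform cross-validation. The split is grouped by subject, so no subject appears in both the training and the validation set.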

In [4]:
X_train, y_train, subject_train, X_test, y_test, subject_test = load_dataset_data()

with open('unimportant_features.json', 'r') as json_file:
    unimportant_features = json.load(json_file)
boruta_unimportant_features = unimportant_features['boruta']
mi_unimportant_features = unimportant_features['mi_unimportant_features']

X_train = X_train.drop(boruta_unimportant_features+mi_unimportant_features, axis=1)
X_test = X_test.drop(boruta_unimportant_features+mi_unimportant_features, axis=1)

kfolds = GroupKFold(5)
train, val = next(kfolds.split(X_train, y_train, subject_train))

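# Put the label first: the built-in XGBoost algorithm reads the first CSV column as the target
df_train = pd.concat([y_train.iloc[train], X_train.iloc[train]], axis=1)
df_val = pd.concat([y_train.iloc[val], X_train.iloc[val]], axis=1)

# Shift the labels down by one so that classes start at 0, as multi:softmax expects
df_train.activity_label = df_train.activity_label - 1
df_val.activity_label = df_val.activity_label - 1

# Write without header or index, the CSV layout that SageMaker expects
df_train.to_csv('train.csv', sep=',', index=False, header=False)
df_val.to_csv('validation.csv', sep=',', index=False, header=False)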

FileNotFoundError: [Errno 2] No such file or directory: 'unimportant_features.json'
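The effect of splitting by subject can be checked on a small example. The sketch below uses made-up toy arrays (`X`, `y`, and `subjects` are hypothetical stand-ins for the HAR data) and takes the first `GroupKFold` fold as a fixed split, in the same way as the cell above.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical toy data: 12 samples recorded from 6 subjects (2 samples each)
X = np.arange(24).reshape(12, 2)
y = np.array([0, 1] * 6)
subjects = np.repeat(np.arange(6), 2)

# Take the first of 3 group-aware folds as a fixed train/validation split
train_idx, val_idx = next(GroupKFold(n_splits=3).split(X, y, subjects))

train_subjects = set(subjects[train_idx].tolist())
val_subjects = set(subjects[val_idx].tolist())

# No subject contributes samples to both sets
print(sorted(train_subjects), sorted(val_subjects))
```

Because every subject lands entirely on one side of the split, the validation error reflects performance on unseen people rather than on unseen samples from known people.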

Upload the train and validation data to S3.

In [None]:
# S3 keys always use forward slashes, so build them explicitly instead of with os.path.join
s3 = session.resource('s3')
s3.Bucket(bucket).Object(f'{prefix}/train_har/train.csv').upload_file('train.csv')
s3.Bucket(bucket).Object(f'{prefix}/validation_har/validation.csv').upload_file('validation.csv')

## Model Training

We first define the hyperparameter search space, specific to the XGBoost model, that we want to optimize over. Because we use a more advanced search technique (Bayesian optimization), we can afford a larger search space: SageMaker uses the results of earlier training jobs to choose the next candidates, rather than sampling randomly or evaluating the entire grid.
We optimize for 'validation:merror', the multiclass classification error rate on the validation set, as 'validation:f1' is not yet supported.

In [None]:
tuning_job_name = 'xgboost-tuningjob-' + strftime("%d-%H-%M-%S", gmtime())

tuning_job_config = {
    "ParameterRanges": {
      "CategoricalParameterRanges": [],
      "ContinuousParameterRanges": [
        {
          "MaxValue": "1",
          "MinValue": "0",
          "Name": "eta",
        },
        {
          "MaxValue": "15",
          "MinValue": "5",
          "Name": "min_child_weight",
        },
        {
          "MaxValue": "2",
          "MinValue": "0",
          "Name": "alpha",            
        },
        {
          "MaxValue": "1",
          "MinValue": "0",
          "Name": "subsample",            
        },
        {
          "MaxValue": "1",
          "MinValue": "0",
          "Name": "colsample_bytree",            
        },
        {
          "MaxValue": "1",
          "MinValue": "0",
          "Name": "colsample_bylevel",            
        }
      ],
      "IntegerParameterRanges": [
        {
          "MaxValue": "6",
          "MinValue": "1",
          "Name": "max_depth",
        },
        {
          "MaxValue": "1024",
          "MinValue": "512",
          "Name": "num_round",
        }, 
      ]
    },
    "ResourceLimits": {
      "MaxNumberOfTrainingJobs": 256,
      "MaxParallelTrainingJobs": 4
    },
    "Strategy": "Bayesian",
    "HyperParameterTuningJobObjective": {
      "MetricName": "validation:merror",
      "Type": "Minimize"
    },
}
print(tuning_job_name)

xgboost-tuningjob-13-11-34-36
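The objective configured above, `validation:merror`, is the multiclass error rate: the number of wrongly classified samples divided by the total number of samples. A minimal sketch of that calculation follows; the function name `multiclass_error` and the toy labels are illustrative and not part of the SageMaker API.

```python
from typing import Sequence


def multiclass_error(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of wrongly classified samples: #(wrong cases) / #(all cases)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one sample is required")
    wrong = sum(t != p for t, p in zip(y_true, y_pred))
    return wrong / len(y_true)


# Toy labels (made up for illustration): 2 of 8 predictions are wrong
y_true = [0, 1, 2, 2, 3, 3, 3, 0]
y_pred = [0, 1, 2, 1, 3, 3, 0, 0]

merror = multiclass_error(y_true, y_pred)
accuracy = 1.0 - merror
print(merror, accuracy)
```

Minimizing this error is equivalent to maximizing accuracy. It is not equivalent to maximizing macro F1, so with imbalanced classes the job with the lowest error is not guaranteed to have the highest F1.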


Next, we define the training and validation input channels and the SageMaker resources that each training job will use.

In [None]:
training_image = get_image_uri('us-east-1', 'xgboost', repo_version='latest')

s3_input_train = 's3://{}/{}/train_har/'.format(bucket, prefix)
s3_input_validation = 's3://{}/{}/validation_har/'.format(bucket, prefix)
    
training_job_definition = {
    "AlgorithmSpecification": {
      "TrainingImage": training_image,
      "TrainingInputMode": "File"
    },
    "InputDataConfig": [
      {
        "ChannelName": "train",
        "CompressionType": "None",
        "ContentType": "csv",
        "DataSource": {
          "S3DataSource": {
            "S3DataDistributionType": "FullyReplicated",
            "S3DataType": "S3Prefix",
            "S3Uri": s3_input_train
          }
        }
      },
      {
        "ChannelName": "validation",
        "CompressionType": "None",
        "ContentType": "csv",
        "DataSource": {
          "S3DataSource": {
            "S3DataDistributionType": "FullyReplicated",
            "S3DataType": "S3Prefix",
            "S3Uri": s3_input_validation
          }
        }
      }
    ],
    "OutputDataConfig": {
      "S3OutputPath": "s3://{}/{}/output_har".format(bucket,prefix)
    },
    "ResourceConfig": {
      "InstanceCount": 1,
      "InstanceType": "ml.m4.xlarge",
      "VolumeSizeInGB": 10
    },
    "RoleArn": role,
    "StaticHyperParameters": {
      "objective": "multi:softmax",
      "num_class": "12"
    },
    "StoppingCondition": {
      "MaxRuntimeInSeconds": 43200
        
    }
}
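The two entries in `InputDataConfig` differ only in channel name and S3 URI, so they can be generated by a small helper. This is a sketch under that assumption: `make_csv_channel` is a hypothetical name, and the bucket and prefix in the URIs are placeholders.

```python
def make_csv_channel(name: str, s3_uri: str) -> dict:
    """Build one InputDataConfig entry for uncompressed CSV data under an S3 prefix."""
    return {
        "ChannelName": name,
        "CompressionType": "None",
        "ContentType": "csv",
        "DataSource": {
            "S3DataSource": {
                "S3DataDistributionType": "FullyReplicated",
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,
            }
        },
    }


# Placeholder bucket and prefix, for illustration only
input_data_config = [
    make_csv_channel("train", "s3://example-bucket/example-prefix/train_har/"),
    make_csv_channel("validation", "s3://example-bucket/example-prefix/validation_har/"),
]
print([channel["ChannelName"] for channel in input_data_config])
```

The resulting list has the same structure as the hand-written `InputDataConfig` value in the training job definition.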

Run the hyperparameter tuning job.

In [None]:
client.create_hyper_parameter_tuning_job(HyperParameterTuningJobName = tuning_job_name,
                                            HyperParameterTuningJobConfig = tuning_job_config,
                                            TrainingJobDefinition = training_job_definition)

{'HyperParameterTuningJobArn': 'arn:aws:sagemaker:us-east-1:636239945976:hyper-parameter-tuning-job/xgboost-tuningjob-13-11-34-36',
 'ResponseMetadata': {'RequestId': '0ba591c9-b014-4b77-b0a5-4193661c981d',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '0ba591c9-b014-4b77-b0a5-4193661c981d',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '130',
   'date': 'Tue, 13 Oct 2020 11:34:50 GMT'},
  'RetryAttempts': 0}}
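A tuning job with this many training jobs can run for hours, so it is useful to poll its status. The sketch below assumes a boto3 SageMaker client and uses `describe_hyper_parameter_tuning_job`; the helper name `wait_for_tuning_job`, the polling interval, and the set of statuses treated as final are choices made for this example.

```python
import time

# Statuses after which the job will not make further progress (an assumption of this sketch)
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_tuning_job(client, job_name, poll_seconds=60, max_polls=1000):
    """Poll a hyperparameter tuning job until it reaches a terminal status."""
    for _ in range(max_polls):
        response = client.describe_hyper_parameter_tuning_job(
            HyperParameterTuningJobName=job_name)
        status = response["HyperParameterTuningJobStatus"]
        # Counts of completed, in-progress, and failed training jobs, if present
        counters = response.get("TrainingJobStatusCounters", {})
        print(status, counters)
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"{job_name} did not finish after {max_polls} polls")
```

With the objects defined in this notebook, it would be called as `wait_for_tuning_job(client, tuning_job_name)`.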

We check on the hyperparameter tuning job by listing its training jobs, sorted by objective metric.

In [None]:
tuning_job_name = "xgboost-tuningjob-13-06-58-23"

In [None]:
# Ascending order puts the lowest validation:merror (the best job) first
training_jobs = client.list_training_jobs_for_hyper_parameter_tuning_job(
    HyperParameterTuningJobName=tuning_job_name,
    SortBy="FinalObjectiveMetricValue",
    SortOrder='Ascending')["TrainingJobSummaries"]

# Keep only the jobs that reported a final objective value
job_list = []
for job in training_jobs:
    if 'FinalHyperParameterTuningJobObjectiveMetric' in job:
        value = job["FinalHyperParameterTuningJobObjectiveMetric"]['Value']
        params = job['TunedHyperParameters']
        job_list.append((params, value))
job_list.sort(key=lambda x: x[1])
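The listing call returns a limited number of summaries per request, so the cell above sees only the first page. That is enough to find the best job, since the results are sorted. To inspect every training job, the `NextToken` field of the response can be followed until it is absent. A sketch, where `list_all_tuning_training_jobs` is a hypothetical helper name:

```python
def list_all_tuning_training_jobs(client, job_name):
    """Collect the training job summaries of a tuning job across all result pages."""
    summaries = []
    kwargs = {
        "HyperParameterTuningJobName": job_name,
        "SortBy": "FinalObjectiveMetricValue",
        "SortOrder": "Ascending",
        "MaxResults": 100,
    }
    while True:
        response = client.list_training_jobs_for_hyper_parameter_tuning_job(**kwargs)
        summaries.extend(response["TrainingJobSummaries"])
        token = response.get("NextToken")
        if not token:
            # No further pages
            return summaries
        kwargs["NextToken"] = token
```

Called as `list_all_tuning_training_jobs(client, tuning_job_name)`, it returns one list that can be filtered in the same way as `training_jobs`.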

Select the best job as the model that we want to replicate.

In [None]:
best_job = job_list[0][0]

## Evaluation



We retrain the model locally with the best set of hyperparameters found by the tuning job.

In [None]:
# SageMaker returns the tuned values as strings: cast them to numbers
# and map num_round to the scikit-learn-style n_estimators
params = {}
for k, v in best_job.items():
    if k != 'num_round':
        params[k] = float(v)
params['n_estimators'] = int(best_job['num_round'])
params['max_depth'] = int(best_job['max_depth'])
clf = XGBClassifier()
clf.set_params(**params)
clf.fit(X_train, y_train.values.ravel())

{'alpha': 1.7632819381608957,
 'colsample_bylevel': 0.2719193271114313,
 'colsample_bytree': 0.465505553275144,
 'eta': 0.03384446477878989,
 'max_depth': 6,
 'min_child_weight': 12.281818465934075,
 'subsample': 0.8434874449635632,
 'n_estimators': 638}
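The conversion above can be written as a reusable function that casts each value exactly once. This is only a sketch: `to_xgb_sklearn_params`, `INTEGER_PARAMS`, and `RENAMES` are hypothetical names, and the set of integer-valued parameters covers only the ones tuned in this notebook.

```python
# Tuned parameters that must be integers (only those used in this notebook)
INTEGER_PARAMS = {"max_depth", "num_round"}
# SageMaker name -> scikit-learn wrapper name
RENAMES = {"num_round": "n_estimators"}


def to_xgb_sklearn_params(tuned: dict) -> dict:
    """Convert string-valued tuned hyperparameters to numeric keyword arguments."""
    params = {}
    for name, raw in tuned.items():
        value = int(float(raw)) if name in INTEGER_PARAMS else float(raw)
        params[RENAMES.get(name, name)] = value
    return params


# A subset of the tuned values shown above, as the strings the API returns
tuned = {
    "alpha": "1.7632819381608957",
    "eta": "0.03384446477878989",
    "max_depth": "6",
    "num_round": "638",
}
params = to_xgb_sklearn_params(tuned)
print(params)
```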

We evaluate the model with the macro-averaged F1 score on the training and test sets.

In [None]:
y_pred_test = clf.predict(X_test)
y_pred_train = clf.predict(X_train)
f1_score(y_train.values, y_pred_train, average='macro'), f1_score(y_test, y_pred_test, average='macro')

(0.9922014508130036, 0.8484351929650312)
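The macro-averaged F1 score is the unweighted mean of the per-class F1 scores. The pure-Python sketch below shows the calculation on made-up labels; `macro_f1` is a hypothetical helper written for illustration and is not scikit-learn's implementation.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 = 2*TP / (2*TP + FP + FN)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels (made up): per-class F1 is 0.5, 0.8 and 2/3
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

score = macro_f1(y_true, y_pred)
swapped = macro_f1(y_pred, y_true)
print(score, swapped)
```

Swapping the two arguments exchanges false positives and false negatives within each class, which leaves each per-class F1 unchanged. The macro F1 therefore does not depend on argument order, although passing the true labels first is the convention.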

## Summary

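The tuning trials found a better model. Retrained with the best hyperparameters, it reaches a macro F1 score of 0.992 on the training set and 0.848 on the test set.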