# Model creation and evaluation
The next step is to create a few models, tune their hyperparameters, and compare them using the F1 metric. The first model, XGBoost, serves as the benchmark.

To build the XGBoost model, I'll use SageMaker's XGBoost API.

In [1]:
import os
import sagemaker
from sagemaker.amazon.amazon_estimator import get_image_uri
from sagemaker import get_execution_role

# Our current execution role is required when creating the model, as the training
# and inference code will need to access the model artifacts.
role = get_execution_role()
session = sagemaker.Session() # Store the current SageMaker session
# S3 prefix (which folder will we use)
prefix = 'covid19-classifier'
container = get_image_uri(session.boto_region_name, 'xgboost')


'get_image_uri' method will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.
There is a more up to date SageMaker XGBoost image. To use the newer image, please set 'repo_version'='1.0-1'. For example:
	get_image_uri(region, 'xgboost', '1.0-1').


## Loading and saving the test data
The data was loaded from the files produced by the [data processing notebook](./DataExploration.ipynb) and split into features and labels, so that the models could be tested against them.

In [2]:
import pandas as pd
test_df = pd.read_csv('data/test.csv', encoding='latin2', header=None)
test_x = test_df.iloc[:, 1:]
test_x.to_csv('data/test_x.csv', index=False, header=False)
test_y = test_df.iloc[:,0]

16298


In [13]:
import numpy as np
test_df_2020 = pd.read_csv('data/x_test_2020.csv', encoding='latin2', header=None)
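# x_test_2020.csv is uploaded to S3 unchanged for batch transform below, so it is
# assumed to hold features only (no label column to drop).
test_x_2020 = test_df_2020
# The 2020 test labels are unreliable, so every sample is treated as positive
# to give the classification metrics on this set some meaning.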
test_y_2020 = np.ones(len(test_df_2020))
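
Labelling every 2020 sample as positive has a consequence worth spelling out: with no negative ground truth there can be no false positives or true negatives, so precision is 1 whenever at least one positive is predicted, and accuracy equals recall. The sketch below illustrates this with made-up predictions; the `preds` values and the `binary_counts` helper are hypothetical and are not results or code from this notebook.

```python
def binary_counts(gt, preds):
    """Return (tp, fp, fn, tn) for lists of 0/1 labels."""
    pairs = list(zip(gt, preds))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)
    tn = sum(1 for g, p in pairs if g == 0 and p == 0)
    return tp, fp, fn, tn


gt = [1] * 8                        # all-positive ground truth, as for the 2020 set
preds = [1, 1, 1, 0, 1, 0, 1, 1]    # hypothetical model output

tp, fp, fn, tn = binary_counts(gt, preds)
precision = tp / (tp + fp)          # fp is always 0 here
recall = tp / (tp + fn)
accuracy = (tp + tn) / len(gt)      # tn is always 0 here

print(precision, recall, accuracy)
```

In this setting recall is the only metric that carries information about the model; F1 and accuracy follow directly from it.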

## Uploading the data to S3
We upload the data to S3 so that it's accessible and easily consumable from the model.


In [3]:
data_dir = 'data'

test_location = session.upload_data(os.path.join(data_dir, 'test_x.csv'), key_prefix=prefix)
test_2020_location = session.upload_data(os.path.join(data_dir, 'x_test_2020.csv'), key_prefix=prefix)
val_location = session.upload_data(os.path.join(data_dir, 'val.csv'), key_prefix=prefix)
train_location = session.upload_data(os.path.join(data_dir, 'train.csv'), key_prefix=prefix)

## Creating the estimator
The estimator is an XGBoost estimator from the SageMaker SDK. Its hyperparameters are set by hand to fixed baseline values and have not been tuned yet.

In [4]:
xgb = sagemaker.estimator.Estimator(container, # The location of the container we wish to use
                                    role,                                    # What is our current IAM Role
                                    train_instance_count=1,                  # How many compute instances
                                    train_instance_type='ml.m4.xlarge',      # What kind of compute instances
                                    output_path='s3://{}/{}/output'.format(session.default_bucket(), prefix),
                                    sagemaker_session=session)


xgb.set_hyperparameters(max_depth=5,
                        eta=0.2,
                        gamma=4,
                        min_child_weight=6,
                        subsample=0.8,
                        silent=0,
                        objective='binary:logistic',
                        early_stopping_rounds=10,
                        num_round=500)

Parameter image_name will be renamed to image_uri in SageMaker Python SDK v2.
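
With `early_stopping_rounds=10` and `num_round=500`, training runs for at most 500 boosting rounds and stops earlier if the validation metric has not improved for 10 consecutive rounds. The loop below only illustrates that patience rule on a hypothetical list of validation errors (lower is better). It is not XGBoost's actual implementation, and `stopping_round` is a name made up for this example.

```python
def stopping_round(val_errors, patience=10):
    """Return (last_round_run, best_round) under a simple patience rule."""
    best = float("inf")
    best_round = -1
    for i, err in enumerate(val_errors):
        if err < best:
            # New best validation error: remember where it happened
            best, best_round = err, i
        elif i - best_round >= patience:
            # No improvement for `patience` rounds in a row: stop here
            return i, best_round
    # Ran out of rounds without triggering the rule
    return len(val_errors) - 1, best_round


# Hypothetical validation errors: improve for 3 rounds, then plateau
errors = [0.5, 0.4, 0.3] + [0.35] * 20
last_round, best_round = stopping_round(errors, patience=10)
print(last_round, best_round)
```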


In [5]:
# generating the s3 input objects for training
s3_input_train = sagemaker.s3_input(s3_data=train_location, content_type='csv')
s3_input_validation = sagemaker.s3_input(s3_data=val_location, content_type='csv')

's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.
's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.
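
The warnings above all point at renames in SageMaker Python SDK v2. As a rough sketch, assuming SDK v2 is installed and that image version `'1.0-1'` (the one suggested by the warning) is wanted, the estimator and inputs could be built as follows. `s3_output_path`, `build_estimator_v2` and `build_inputs_v2` are hypothetical helper names introduced here. The `instance_count` and `instance_type` parameter names reflect my understanding of the v2 API and are not taken from this notebook's output.

```python
def s3_output_path(bucket, prefix):
    """Build the S3 location where training artifacts are written."""
    return 's3://{}/{}/output'.format(bucket, prefix)


def build_estimator_v2(role, session, prefix):
    # Imported lazily so the sketch can be read without SDK v2 installed
    import sagemaker
    from sagemaker import image_uris

    # Replaces get_image_uri(region, 'xgboost', '1.0-1')
    container = image_uris.retrieve(framework='xgboost',
                                    region=session.boto_region_name,
                                    version='1.0-1')
    return sagemaker.estimator.Estimator(
        image_uri=container,                  # was: image_name (positional)
        role=role,
        instance_count=1,                     # was: train_instance_count
        instance_type='ml.m4.xlarge',         # was: train_instance_type
        output_path=s3_output_path(session.default_bucket(), prefix),
        sagemaker_session=session)


def build_inputs_v2(train_location, val_location):
    # Replaces sagemaker.s3_input
    from sagemaker.inputs import TrainingInput
    return {'train': TrainingInput(s3_data=train_location, content_type='csv'),
            'validation': TrainingInput(s3_data=val_location, content_type='csv')}
```

A newer image version may treat some hyperparameters differently from the one used in this notebook, so the results here should not be assumed to carry over unchanged.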


In [6]:
# training the model
xgb.fit({'train': s3_input_train, 'validation': s3_input_validation})

2020-08-01 22:22:12 Starting - Starting the training job...
2020-08-01 22:22:15 Starting - Launching requested ML instances.........
2020-08-01 22:23:56 Starting - Preparing the instances for training......
2020-08-01 22:25:06 Downloading - Downloading input data
2020-08-01 22:25:06 Training - Downloading the training image..
Arguments: train
[2020-08-01:22:25:26:INFO] Running standalone xgboost training.
[2020-08-01:22:25:26:INFO] File size need to be processed in the node: 65.27mb. Available memory size in the node: 8461.55mb
[2020-08-01:22:25:26:INFO] Determined delimiter of CSV input is ','
[22:25:26] S3DistributionType set as FullyReplicated
[22:25:26] 301737x32 matrix with 9655584 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
[2020-08-01:22:25:26:INFO] Determined delimiter of CSV input is ','
[22:25:26] S3DistributionType set as FullyReplicated
[22:25:26] 75434x32 matr

## Metric evaluation
The metrics used were the [F1 score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html), accuracy, precision and recall. The script calculated them all using `sklearn`, and pred

In [7]:
def test_and_print_metrics(xgb_object, test_location, output_name, gt):
    # Run a batch transform job over the test data stored in S3
    xgb_transformer = xgb_object.transformer(instance_count=1, instance_type='ml.m4.xlarge')
    xgb_transformer.transform(test_location, content_type='text/csv', split_type='Line')
    xgb_transformer.wait()
    # Copy the job output from S3 into the local data directory
    !aws s3 cp --recursive $xgb_transformer.output_path $data_dir
    # Batch transform names each output file after its input file, with '.out' appended
    predictions = pd.read_csv(os.path.join(data_dir, '{}.out'.format(output_name)), header=None)
    # Round the predicted probabilities to 0/1 class labels
    predictions = [round(num) for num in predictions.squeeze().values]
    print_metrics(predictions, gt)
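
The `round(num)` call is what turns each predicted probability into a class label. Python 3 rounds halves to the nearest even integer, so a score of exactly 0.5 becomes 0. A sketch with an explicit threshold makes the cutoff visible and adjustable; `to_labels` and its 0.5 default are assumptions of this example, not part of the original script.

```python
def to_labels(scores, threshold=0.5):
    """Map probabilities to 0/1 labels; scores at the threshold count as positive."""
    return [1 if s >= threshold else 0 for s in scores]


scores = [0.12, 0.5, 0.51, 0.97]       # hypothetical model scores

rounded = [round(s) for s in scores]   # built-in rounding, as in the function above
thresholded = to_labels(scores)

print(rounded)      # [0, 0, 1, 1]
print(thresholded)  # [0, 1, 1, 1]
```

The two approaches differ only for scores of exactly 0.5, but the explicit threshold also allows trading precision against recall by moving the cutoff.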
    

In [8]:
def print_metrics(preds, gt):
    from sklearn.metrics import f1_score, precision_score, accuracy_score, recall_score
    # sklearn metrics take the ground truth first and the predictions second
    print("F1: ", f1_score(gt, preds))
    print("acc: ", accuracy_score(gt, preds))
    print("prec: ", precision_score(gt, preds))
    print("recall: ", recall_score(gt, preds))
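
To make the four printed metrics concrete, the sketch below computes them directly from confusion-matrix counts. The counts are invented for illustration, `metrics_from_counts` is a hypothetical helper, and returning 0.0 for an empty denominator is a choice made for this example.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    def ratio(num, den):
        # Assumption of this sketch: an undefined ratio is reported as 0.0
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)   # share of predicted positives that are correct
    recall = ratio(tp, tp + fn)      # share of actual positives that are found
    return {
        "acc": ratio(tp + tn, tp + fp + fn + tn),
        "prec": precision,
        "recall": recall,
        # F1 is the harmonic mean of precision and recall
        "F1": ratio(2 * precision * recall, precision + recall),
    }


# Hypothetical counts: 40 true positives, 10 false positives,
# 20 false negatives, 30 true negatives
m = metrics_from_counts(tp=40, fp=10, fn=20, tn=30)
for name, value in m.items():
    print(name, round(value, 4))
```

Because F1 ignores true negatives, it is less affected than accuracy by a large negative class, which is one reason to use it as the comparison metric.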

In [9]:
test_and_print_metrics(xgb, test_location, 'test_x.csv', test_y)

Parameter image will be renamed to image_uri in SageMaker Python SDK v2.


.....................
Arguments: serve
[2020-08-01 22:29:49 +0000] [1] [INFO] Starting gunicorn 19.7.1
[2020-08-01 22:29:49 +0000] [1] [INFO] Listening at: http://0.0.0.0:8080 (1)
[2020-08-01 22:29:49 +0000] [1] [INFO] Using worker: gevent
[2020-08-01 22:29:49 +0000] [38] [INFO] Booting worker with pid: 38
[2020-08-01 22:29:49 +0000] [39] [INFO] Booting worker with pid: 39
[2020-08-01 22:29:49 +0000] [40] [INFO] Booting worker with pid: 40
[2020-08-01 22:29:49 +0000] [41] [INFO] Booting worker with pid: 41
[2020-08-01:22:29:49:INFO] Model loaded successfully for worker : 40
[2020-08-01:22:29:49:INFO] Model loaded successfully for worker : 39
[2020-08-01:22:29:49:INFO] Model loaded successfully for worker : 38
[2020-08-01:22:29:49:INFO] Model loaded successfully for worker : 41

[2020-08-01:22:30:21:INFO] Sniff delimiter as ','
[2020-08-01:22:30:21:INFO] Determined de

In [14]:
test_and_print_metrics(xgb, test_2020_location, 'x_test_2020.csv', test_y_2020)

Parameter image will be renamed to image_uri in SageMaker Python SDK v2.
Using already existing model: xgboost-2020-08-01-22-22-12-703


......................
Arguments: serve
[2020-08-01 23:01:57 +0000] [1] [INFO] Starting gunicorn 19.7.1
[2020-08-01 23:01:57 +0000] [1] [INFO] Listening at: http://0.0.0.0:8080 (1)
[2020-08-01 23:01:57 +0000] [1] [INFO] Using worker: gevent
[2020-08-01 23:01:57 +0000] [39] [INFO] Booting worker with pid: 39
[2020-08-01 23:01:57 +0000] [40] [INFO] Booting worker with pid: 40
[2020-08-01 23:01:57 +0000] [41] [INFO] Booting worker with pid: 41
[2020-08-01:23:01:57:INFO] Model loaded successfully for worker : 39
[2020-08-01:23:01:57:INFO] Model loaded successfully for worker : 40
[2020-08-01 23:01:57 +0000] [42] [INFO] Booting worker with pid: 42
[2020-08-01:23:01:57:INFO] Model loaded successfully for worker : 41
[2020-08-01:23:01:57:INFO] Model loaded successfully for worker : 42
2020-08-01T23:02:17.565:[sagemaker logs]: MaxConcurrentTransforms=4, MaxPayloadInMB=6, BatchStrateg