# Breast Cancer Prediction with XGBoost
_**Using Gradient Boosted Trees to predict breast cancer from features derived from breast mass images**_

---


## Contents

1. [Background](#Background)
1. [Setup](#Setup)
1. [Data](#Data)
1. [Train](#Train)
1. [Host](#Hosting)
  1. [Evaluate](#Evaluate)
1. [Extensions](#Extensions)
  1. [Hyperparameter Optimization](#Hyperparameter-Optimization)

---


## Background

This notebook illustrates the use of SageMaker's built-in XGBoost algorithm for binary classification.
XGBoost builds a predictive model from an ensemble of decision trees.

It also demonstrates hyperparameter optimization (HPO) and how to use the best model from the HPO job to instantiate a new endpoint.

### Why XGBoost and not Logistic Regression?

Whilst logistic regression is often used for classification exercises, it has some drawbacks. For example, additional feature engineering is required to capture non-linear relationships between the features and the target.

XGBoost (an implementation of Gradient Boosted Trees) offers several benefits including naturally accounting for non-linear relationships between features and the target variable, as well as accommodating complex interactions between features.
Decision tree algorithms such as XGBoost also have the added benefit of being able to deal with missing values, both in the training dataset and in unseen samples that are used for inference.

Amazon SageMaker provides an XGBoost container that we can use to train in a managed, distributed setting, and then host as a real-time prediction endpoint.



---

## Setup

_This notebook was created and tested on an ml.t2.medium notebook instance._

Let's start by specifying:

- The S3 bucket and prefix that you want to use for training and model data.  This should be within the same region as the Notebook Instance, training, and hosting.
- The SageMaker role ARN used to give training and hosting access to your data. The snippet below will use the same role used by your SageMaker notebook instance. If you wish to use a different role, specify the full ARN of a role with the AmazonSageMakerFullAccess policy attached.

In [None]:
bucket = 'Your-S3-Bucket'
prefix = 'sagemaker/DEMO-xgboost-breast-cancer'

# Define IAM role
import boto3
import re
from sagemaker import get_execution_role
region = boto3.Session().region_name

role = get_execution_role()

Next, we'll import the Python libraries we'll need for the remainder of the exercise.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import io
import os
import sys
import time
import json
from IPython.display import display
from time import strftime, gmtime
import sagemaker
from sagemaker.predictor import csv_serializer

---
## Data

For this illustration, we use the Breast Cancer Wisconsin (Diagnostic) data set from the UCI Machine Learning Repository, available at https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29. The data set is also available on Kaggle at https://www.kaggle.com/uciml/breast-cancer-wisconsin-data. The purpose here is to use this data set to build a predictive model of whether a breast mass image indicates a benign or malignant tumor.

In [None]:
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data
    
# You can find out all the details of this dataset here: https://www.kaggle.com/uciml/breast-cancer-wisconsin-data

# The wget command above saves the data in the local folder as wdbc.data; we load it and take a look at it below.

The dataset we downloaded does not have column headings; however, this information is available at the source.

More information about this dataset can be found here: https://www.kaggle.com/uciml/breast-cancer-wisconsin-data
Sample images used in this dataset can be seen here: ftp://ftp.cs.wisc.edu/math-prog/cpo-dataset/machine-learn/cancer/cancer_images

- `id`: ID number
- `diagnosis`: The diagnosis of breast tissues (M = malignant, B = benign)
- `radius_mean`: mean of distances from center to points on the perimeter
- `texture_mean`: standard deviation of gray-scale values
- `perimeter_mean`: mean size of the core tumor
- `area_mean`: 
- `smoothness_mean`: mean of local variation in radius lengths
- `compactness_mean`: mean of perimeter^2 / area - 1.0
- `concavity_mean`: mean of severity of concave portions of the contour
- `concave points_mean`: mean for number of concave portions of the contour
- `symmetry_mean`: 
- `fractal_dimension_mean`: mean for "coastline approximation" - 1
- `radius_se`: standard error for the mean of distances from center to points on the perimeter
- `texture_se`: standard error for standard deviation of gray-scale values
- `perimeter_se`: 
- `area_se`: 
- `smoothness_se`: standard error for local variation in radius lengths
- `compactness_se`: standard error for perimeter^2 / area - 1.0
- `concavity_se`: standard error for severity of concave portions of the contour
- `concave points_se`: standard error for number of concave portions of the contour
- `symmetry_se`: 
- `fractal_dimension_se`: standard error for "coastline approximation" - 1
- `radius_worst`: "worst" or largest mean value for mean of distances from center to points on the perimeter
- `texture_worst`: "worst" or largest mean value for standard deviation of gray-scale values
- `perimeter_worst`: 
- `area_worst`: 
- `smoothness_worst`: "worst" or largest mean value for local variation in radius lengths
- `compactness_worst`: "worst" or largest mean value for perimeter^2 / area - 1.0
- `concavity_worst`: "worst" or largest mean value for severity of concave portions of the contour
- `concave points_worst`: "worst" or largest mean value for number of concave portions of the contour
- `symmetry_worst`: 
- `fractal_dimension_worst`: "worst" or largest mean value for "coastline approximation" - 1


If we load this CSV data into a pandas DataFrame, we can easily take a closer look.


In [None]:
col_names = ["id","diagnosis","radius_mean","texture_mean","perimeter_mean","area_mean","smoothness_mean",
                "compactness_mean","concavity_mean","concave points_mean","symmetry_mean","fractal_dimension_mean",
                "radius_se","texture_se","perimeter_se","area_se","smoothness_se","compactness_se","concavity_se",
                "concave points_se","symmetry_se","fractal_dimension_se","radius_worst","texture_worst",
                "perimeter_worst","area_worst","smoothness_worst","compactness_worst","concavity_worst",
                "concave points_worst","symmetry_worst","fractal_dimension_worst"]
breastcancer = pd.read_csv('./wdbc.data', header=None, names=col_names)
breastcancer

The breast cancer dataset is quite small, with only 569 records, where each record uses 32 attributes to describe the profile of a breast mass.
Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.

Let's see which of our columns are of type string.

In [None]:
breastcancer.select_dtypes(include=['object'])

Only one column is of type string, and that is the diagnosis. Let's take a look at the diagnosis distribution in both absolute and normalised forms:

In [None]:
display(pd.crosstab(index=breastcancer['diagnosis'], columns='# observations'))
display(pd.crosstab(index=breastcancer['diagnosis'], columns='% observations', normalize='columns'))

So 63% of our samples are benign and 37% are malignant. The two classes are reasonably balanced.

Next we will take a closer look at all of the numeric features in the dataset.

In [None]:
display(breastcancer.describe())

It will be more useful to look at the histograms of the numerical features.

In [None]:
%matplotlib inline
hist = breastcancer.hist(bins=30, sharey=False, figsize=(20, 20))

From these histograms we can see that:
- Most of the numeric features are nicely distributed, with some even looking roughly bell-shaped (Gaussian-like).
- `id` should not be included as a feature (and, if kept at all, it should be converted to a non-numeric type).


We will drop the `id` column from the dataset:


In [None]:
breastcancer = breastcancer.drop(['id'], axis=1)

#### Note for future reference:
There may be scenarios where a numeric field like `id` carries some non-numeric information.
A good example would be if the first N characters of patient ID indicated the country or state where the patient was located and you wanted to see if this location had any bearing on the diagnosis.
In such a case you would convert the field to a string:
<pre><code>breastcancer['id'] = breastcancer['id'].astype(object)</code></pre>
and extract the pertinent information.
You would then treat that field as a categorical field.
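
As a minimal sketch of that idea, the snippet below uses a small made-up frame and assumes (purely for illustration) that the first two digits of `id` encode a region. The `patients` frame, the `region` column and the ID values are hypothetical and do not come from the breast cancer dataset.

```python
import pandas as pd

# Hypothetical IDs: we assume the first two digits encode a region code.
patients = pd.DataFrame({
    "id": [4401234, 4405678, 6109999, 6100001],
    "diagnosis": ["M", "B", "B", "M"],
})

# Convert the numeric ID to a string, then slice out the assumed prefix.
patients["region"] = patients["id"].astype(str).str[:2]

# Cross-tabulate the new categorical field against the diagnosis.
region_vs_diagnosis = pd.crosstab(
    index=patients["region"],
    columns=patients["diagnosis"],
    normalize="columns",
)
print(region_vs_diagnosis)
```

The resulting `region` column holds strings, so `get_dummies` would later treat it like any other categorical field.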


To take a look at the relationship between any categorical fields and the final diagnosis, you would use the following cross-tabulation report: 
<pre><code>for column in breastcancer.select_dtypes(include=['object']).columns:
    if column != 'diagnosis':
        display(pd.crosstab(index=breastcancer[column], columns=breastcancer['diagnosis'], normalize='columns'))
</code></pre>

Now we will look at the direct relationship between numeric (non-object) values and diagnosis. We do this by plotting a histogram for every numeric value.
We divide our samples into `bins`. The X-axis represents the bins and the Y-axis represents how many samples fall into each bin.
By forcing the benign and malignant graphs to share the same X and Y scale it is easier to visualise which bins are more populated between the two diagnoses.

Feel free to adjust the number of bins being plotted by the histogram and view the effect.

In [None]:
for column in breastcancer.select_dtypes(exclude=['object']).columns:
    print(column)
    hist = breastcancer[[column, 'diagnosis']].hist(by='diagnosis', sharey=True, sharex=True, bins=30)
    plt.show()

What can we infer from these relationships?

We see that malignant diagnoses appear to have higher values for the following features:
- radius_mean
- perimeter_mean
- area_mean
- compactness_mean
- concavity_mean
- concave points_mean
- radius_se
- area_se
- radius_worst
- texture_worst
- perimeter_worst
- area_worst
- compactness_worst
- concavity_worst
- concave points_worst

Features such as `radius_mean`, `perimeter_mean` and `area_mean` show distribution shapes that closely resemble one another, for both malignant and benign diagnoses. This makes sense, as each of these features is related to the size of the tumour.
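
The visual comparison can be backed up with a simple per-class summary. The sketch below uses a tiny toy frame with made-up values (not rows from the dataset) and compares the median of each feature between the two diagnoses; the ratio is just one possible way to rank features and is an assumption of this example.

```python
import pandas as pd

# Toy values for illustration only.
toy = pd.DataFrame({
    "diagnosis": ["M", "M", "B", "B", "B"],
    "radius_mean": [18.0, 20.0, 12.0, 11.0, 13.0],
    "smoothness_mean": [0.10, 0.11, 0.09, 0.10, 0.11],
})

# Median of every numeric feature, per diagnosis.
medians = toy.groupby("diagnosis").median()

# Ratio of malignant to benign medians: values well above 1 suggest
# the feature tends to be larger for malignant samples.
ratio = medians.loc["M"] / medians.loc["B"]
print(ratio.sort_values(ascending=False))
```

On the real data, the same pattern would be applied to `breastcancer` while the `diagnosis` column is still a string.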

Let's dig deeper into the relationships between our features by computing the pairwise correlation of columns.

In [None]:
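# Correlation is only defined for numeric columns, so exclude the string-typed diagnosis column
display(breastcancer.select_dtypes(exclude=['object']).corr())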

We see several features that have a strong (but not 100%) correlation with one another.

One such example is `radius_mean` and `perimeter_mean`, which have a correlation score of 0.997855.

It can be easier to see correlations using a scatter matrix.


In [None]:
pd.plotting.scatter_matrix(breastcancer, figsize=(30, 30))
plt.show()

In the scatter matrix, such strongly correlated features are indicated by a diagonal line running from bottom left to top right. In the correlation matrix, such relationships are indicated by a correlation value close to 1.

In some cases it can be a good idea to remove one element of a highly correlated feature pair. 
For the first run of our training, I am going to leave all data in; however, it would be a valuable exercise to remove `perimeter_mean` and `area_mean` (since `radius_mean` has a high correlation, above 98%, with those measurements) and compare the results of the final model.
A further exercise would be to remove one feature from each pair that has more than 96% correlation and compare the final predictive results of the models. This is the 'scientific experimentation' side of data science.
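
One way to find such pairs programmatically is sketched below. The helper name `highly_correlated_pairs` and the synthetic `toy` data are assumptions made for this example; the 0.96 default simply mirrors the threshold mentioned above.

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(df, threshold=0.96):
    """Return (col_a, col_b, corr) for numeric column pairs with |corr| above threshold."""
    corr = df.select_dtypes(include=[np.number]).corr()
    cols = list(corr.columns)
    pairs = []
    # Walk the upper triangle so each pair is reported once.
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            value = corr.iloc[i, j]
            if abs(value) > threshold:
                pairs.append((cols[i], cols[j], float(value)))
    return sorted(pairs, key=lambda p: -abs(p[2]))

# Synthetic data: perimeter is almost a linear function of radius, texture is unrelated.
rng = np.random.default_rng(0)
radius = rng.uniform(5, 25, size=200)
toy = pd.DataFrame({
    "radius": radius,
    "perimeter": 2 * np.pi * radius + rng.normal(0, 1.0, size=200),
    "texture": rng.uniform(10, 30, size=200),
})

pairs = highly_correlated_pairs(toy)
print(pairs)
```

Calling the same helper on `breastcancer` would list candidate pairs; deciding which member of each pair to drop remains a judgment call.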

For reference, the command to drop columns from the pandas dataframe is:

<pre><code>breastcancer = breastcancer.drop(['ColName1', 'ColName2'], axis=1)</code></pre>

Now that we have a clean dataset (and have potentially removed some unnecessary columns), we can prepare the dataset for XGBoost.

Amazon SageMaker XGBoost can train on data in either a CSV or LibSVM format.  For this example, we'll stick with CSV.  It should:
- Contain only numeric values
- Have the target variable (the label we want to predict) in the first column
- Not have a header row

We will also
- Shuffle the dataset
- Split the dataset into training, validation and testing sets

### Step 1: Convert our categorical features into numeric features using the "get_dummies" function which will automatically convert categorical variable into dummy/indicator variables.

I have shown the first and last row of the dataset in order to illustrate the output of the 'get_dummies' method.

Since our only categorical variable is 'diagnosis', you will see that it now appears as the last column(s) of the new dataset broken up into one column per diagnosis label.

In [None]:
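# Use the fully qualified option name; the short form can be ambiguous in newer pandas versions
pd.set_option('display.max_rows', 3)
display(breastcancer)
# dtype=int keeps the dummy columns as 0/1 (newer pandas versions default to booleans)
model_data = pd.get_dummies(breastcancer, dtype=int)
display(model_data)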

### Step 2: Select a single target variable and bring it forward to the first column

Now we will keep only one of the target columns, `diagnosis_M`, which is our label for the tumour being malignant. We will place this column in front of a version of our dataset that drops both diagnosis columns (as XGBoost requires the target variable as the first column).

In [None]:
model_data = pd.concat([model_data['diagnosis_M'], model_data.drop(['diagnosis_B', 'diagnosis_M'], axis=1)], axis=1)
display(model_data)
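
To see what Steps 1 and 2 produce, here is a minimal sketch on a three-row toy frame with illustrative values. Passing `dtype=int` is an assumption added for this example: recent pandas versions return boolean dummy columns by default, which would be written to CSV as `True`/`False` rather than `1`/`0`.

```python
import pandas as pd

# Toy frame with one categorical column and one numeric feature.
toy = pd.DataFrame({
    "diagnosis": ["M", "B", "M"],
    "radius_mean": [17.99, 13.54, 20.57],
})

# Step 1: one indicator column per diagnosis label, stored as integers.
encoded = pd.get_dummies(toy, dtype=int)

# Step 2: keep diagnosis_M as the label and move it to the first column.
label_first = pd.concat(
    [encoded["diagnosis_M"], encoded.drop(["diagnosis_B", "diagnosis_M"], axis=1)],
    axis=1,
)
print(label_first)
```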

And now let's split the data into training, validation, and test sets.  This will help prevent us from overfitting the model, and allow us to test the model's accuracy on data it hasn't already seen.

### Step 3: Shuffle the input dataset

We will shuffle the order of the rows so that the training, validation and test sets are each representative of the whole dataset, which helps the resulting model generalise.

We do this with the `sample` method.
I am specifying a value for `random_state` only for the purposes of reproducibility.
Setting `frac=1` keeps all the samples, as opposed to returning a fraction of the samples.

In [None]:
shuffled_data=model_data.sample(frac=1, random_state=1)
display(shuffled_data)
pd.reset_option('display.max_rows')
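
The behaviour of `sample` described above can be checked on a small toy frame (the values are illustrative): with `frac=1` every row is kept, and a fixed `random_state` gives the same order on every call.

```python
import pandas as pd

toy = pd.DataFrame({"label": [0, 1, 0, 1, 1], "feature": [1.0, 2.0, 3.0, 4.0, 5.0]})

# Two shuffles with the same seed.
shuffled_a = toy.sample(frac=1, random_state=1)
shuffled_b = toy.sample(frac=1, random_state=1)

# Same seed, same order.
same_order = shuffled_a.index.tolist() == shuffled_b.index.tolist()
# frac=1 returns a permutation of all rows, nothing dropped or duplicated.
same_rows = sorted(shuffled_a.index.tolist()) == toy.index.tolist()
print(same_order, same_rows)
```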

### Step 4: Split the dataset

Split our data into a training dataset, validation dataset and test dataset.

The ratio we will use is:
- Training dataset - 70%
- Validation dataset - 20%
- Test dataset - 10%

We do this using the numpy `split` function specifying splits at the 70% mark and the 90% mark of the shuffled dataset

In [None]:
train_data, validation_data, test_data = np.split(shuffled_data, [int(0.7 * len(model_data)), int(0.9 * len(model_data))])
print("Training data sample size:",len(train_data))
print("Validation data sample size:",len(validation_data))
print("Test data sample size:",len(test_data))
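
As a quick check of the arithmetic behind the split points, the sketch below reproduces the same integer cut positions without touching the data. The helper name `split_sizes` is hypothetical; it only mirrors the `int(0.7 * n)` and `int(0.9 * n)` expressions used above.

```python
def split_sizes(n_rows, cut_points=(0.7, 0.9)):
    """Return (train, validation, test) sizes for cuts at the given fractions."""
    first_cut = int(cut_points[0] * n_rows)
    second_cut = int(cut_points[1] * n_rows)
    return first_cut, second_cut - first_cut, n_rows - second_cut

sizes = split_sizes(569)
print(sizes)  # (398, 114, 57)
```

Because the cut positions are truncated to integers, the three sets are only approximately 70%, 20% and 10% of the 569 records.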


We need to convert the training dataset and validation dataset to CSV and upload them to S3 for consumption by the containers running the XGBoost algorithm.

In [None]:
train_data.to_csv('train.csv', header=False, index=False)
validation_data.to_csv('validation.csv', header=False, index=False)

Now we'll upload these files to S3.

In [None]:
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train/train.csv')).upload_file('train.csv')
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'validation/validation.csv')).upload_file('validation.csv')

---
## Train

Moving on to training, first we'll need to specify the location of the XGBoost algorithm container.

In [None]:
from sagemaker.amazon.amazon_estimator import get_image_uri
container = get_image_uri(boto3.Session().region_name, 'xgboost')

Then, because we're training with the CSV file format, we'll create `s3_input`s that our training function can use as a pointer to the files in S3.

In [None]:
s3_input_train = sagemaker.s3_input(s3_data='s3://{}/{}/train'.format(bucket, prefix), content_type='csv')
s3_input_validation = sagemaker.s3_input(s3_data='s3://{}/{}/validation/'.format(bucket, prefix), content_type='csv')

Now, we can specify a few parameters like what type of training instances we'd like to use and how many, as well as our XGBoost hyperparameters.  A few key hyperparameters are:
- `max_depth` controls how deep each tree within the algorithm can be built.  Deeper trees can lead to better fit, but are more computationally expensive and can lead to overfitting.  There is typically some trade-off in model performance that needs to be explored between a large number of shallow trees and a smaller number of deeper trees.
- `subsample` controls sampling of the training data.  This technique can help reduce overfitting, but setting it too low can also starve the model of data.
- `num_round` controls the number of boosting rounds.  This is essentially the subsequent models that are trained using the residuals of previous iterations.  Again, more rounds should produce a better fit on the training data, but can be computationally expensive or lead to overfitting.
- `eta` controls how aggressive each round of boosting is.  Smaller values lead to more conservative boosting.
- `gamma` controls how aggressively trees are grown.  Larger values lead to more conservative models.
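
To build some intuition for `eta` and `num_round`, here is a deliberately simplified sketch of gradient boosting with squared error and one-split trees (stumps) in NumPy. It is not XGBoost's actual algorithm, which uses a regularised objective and many further refinements; the function names `fit_stump` and `boost` and the sine-wave data are assumptions made for this example.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single split on x that best fits the residuals with two constant leaves."""
    best = None
    for threshold in np.unique(x)[:-1]:
        left = x <= threshold
        left_value, right_value = residual[left].mean(), residual[~left].mean()
        sse = ((residual[left] - left_value) ** 2).sum() + ((residual[~left] - right_value) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left_value, right_value)
    _, threshold, left_value, right_value = best
    return lambda values: np.where(values <= threshold, left_value, right_value)

def boost(x, y, eta, num_round):
    """Return the training MSE after each boosting round."""
    prediction = np.full_like(y, y.mean())
    history = []
    for _ in range(num_round):
        # Each round fits a new stump to the residuals of the current model...
        stump = fit_stump(x, y - prediction)
        # ...and adds only a fraction eta of its correction.
        prediction = prediction + eta * stump(x)
        history.append(float(np.mean((y - prediction) ** 2)))
    return history

x = np.linspace(0, 6, 60)
y = np.sin(x)
slow = boost(x, y, eta=0.1, num_round=20)
fast = boost(x, y, eta=1.0, num_round=20)
print(round(slow[-1], 4), round(fast[-1], 4))
```

With a small `eta` each round removes only part of the remaining error, so more rounds are needed to reach the same training fit; this is the sense in which smaller values are more conservative.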

More detail on XGBoost's hyperparameters can be found on the GitHub [page](https://github.com/dmlc/xgboost/blob/master/doc/parameter.md).

In [None]:
%%time
sess = sagemaker.Session()

xgb = sagemaker.estimator.Estimator(container,
                                    role, 
                                    train_instance_count=1, 
                                    train_instance_type='ml.m4.xlarge',
                                    output_path='s3://{}/{}/output'.format(bucket, prefix),
                                    sagemaker_session=sess)
xgb.set_hyperparameters(max_depth=5,
                        eta=0.1,
                        gamma=4,
                        min_child_weight=6,
                        subsample=0.8,
                        objective='binary:logistic',
                        num_round=100)

xgb.fit({'train': s3_input_train, 'validation': s3_input_validation}) 


## Hosting
Now that we've trained the `xgboost` algorithm on our data, let's deploy a model that's hosted behind a real-time endpoint.

In [None]:
%%time
xgb_predictor = xgb.deploy(initial_instance_count=1,
                           instance_type='ml.t2.medium')

## Evaluate

Now that we have a hosted endpoint running, we can make real-time predictions from our model simply by making an HTTP POST request.  But first, we'll need to set up serializers and deserializers for passing our `test_data` NumPy arrays to the model behind the endpoint.

In [None]:
xgb_predictor.content_type = 'text/csv'
xgb_predictor.serializer = csv_serializer
xgb_predictor.deserializer = None

Now, we'll use a simple function to:
1. Loop over our test dataset
1. Extract the features for each sample 
1. Retrieve the prediction for each sample by invoking the XGBoost endpoint
1. Collect predictions and convert from a python list to a NumPy array

In [None]:
# Convert the dataframe to a numpy array
dtest = test_data.to_numpy()

# As expected, the numpy array has 57 rows (57 samples in the test dataset), and 31 columns (30 features + 1 label)
#print(dtest.shape)

# Create a list to hold all of our predictions
predictions = []

# Loop through the matrix of our test data samples, pulling out the features for each sample and running inference
# Note: dtest[i:i+1, 1:] is a 1 x 30 array of all the features for sample i (without the first entry, which is the label)
for i in range(dtest.shape[0]):
    sample_features=dtest[i:i+1, 1:]
    prediction=xgb_predictor.predict(sample_features).decode('utf-8')
    predictions.append(float(prediction))
       
# Convert our list of predictions to a numpy array
predictions = np.asarray(predictions)
display(predictions)

To evaluate the performance of this machine learning model on the test dataset, we will use a simple confusion matrix to compare actual to predicted values.  In this case, we're predicting whether the tumor was malignant (`1`) or benign (`0`).

- We get the actual values from the first column (column 0) of the dataset: `test_data.iloc[:, 0]`
- We get the predicted values from our array of predictions: `predictions`. We will simply round the predictions to the nearest integer (so a prediction below 0.5 becomes 0 - benign and a prediction above 0.5 becomes 1 - malignant)

In [None]:
pd.crosstab(index=test_data.iloc[:, 0], columns=np.round(predictions), rownames=['actual'], colnames=['predictions'])

Of the 57 samples in the test dataset, 34 were benign tumours and we correctly predicted all 34 of them.
23 of the samples were malignant and we correctly predicted 20 of them.
3 of the malignant samples were incorrectly predicted as benign.
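
The counts quoted above can be turned into the usual summary metrics. The sketch below is plain Python; the helper name `summarize` is hypothetical, and the four counts are the ones reported in the text (34 true negatives, 0 false positives, 3 false negatives, 20 true positives).

```python
def summarize(tn, fp, fn, tp):
    """Compute accuracy, recall and precision from confusion-matrix counts."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,   # share of all samples predicted correctly
        "recall": tp / (tp + fn),        # share of malignant samples that were caught
        "precision": tp / (tp + fp),     # share of malignant predictions that were right
    }

metrics = summarize(tn=34, fp=0, fn=3, tp=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

For this problem recall is the metric to watch, since each false negative is a malignant tumour reported as benign.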

An important point here is that because of the `np.round()` function above we are using a simple threshold (or cutoff) of 0.5.  Our predictions from `xgboost` come out as continuous values between 0 and 1 and we force them into the binary classes that we began with.  

However, because we would rather err on the side of a false positive than a false negative, we will adjust this cutoff. 

To get a rough intuition here, let's look at the continuous values of our predictions.

In [None]:
plt.hist(predictions, bins=20)
plt.show()

The continuous valued predictions coming from our model are generally quite decisive so tend to skew toward 0 or 1; however there are a few values between 0.1 and 0.9 where the model is less confident.

How you adjust the cutoff is completely dependent upon the problem space you are addressing and whether you want to have more likelihood of false positives or false negatives.

In the case of predicting malignant tumours, we will be somewhat conservative and report any prediction greater than 0.3 as malignant.

In [None]:
pd.crosstab(index=test_data.iloc[:, 0], columns=np.where(predictions > 0.3, 1, 0), rownames=['actual'], colnames=['predictions'])

We can see that changing the cutoff from 0.5 to 0.3 reduces the number of false negatives from 3 to 1.

There is still 1 malignant sample that is being incorrectly predicted as benign. This would certainly require further investigation, hyperparameter tuning and likely even an ensemble approach where the predictions from this model are combined with the predictions of other models in order to vote for the final prediction.
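
The trade-off behind the cutoff can be explored by sweeping several values. The sketch below uses made-up labels and scores (not the endpoint's output), and the helper name `confusion_counts` is hypothetical; it applies the same strict `>` comparison as the `np.where` call above.

```python
import numpy as np

def confusion_counts(actual, scores, cutoff):
    """Count confusion-matrix cells when scores above the cutoff are labelled 1."""
    predicted = (scores > cutoff).astype(int)
    return {
        "tn": int(((predicted == 0) & (actual == 0)).sum()),
        "fp": int(((predicted == 1) & (actual == 0)).sum()),
        "fn": int(((predicted == 0) & (actual == 1)).sum()),
        "tp": int(((predicted == 1) & (actual == 1)).sum()),
    }

# Toy labels and scores for illustration only.
actual = np.array([0, 0, 0, 0, 1, 1, 1, 1])
scores = np.array([0.02, 0.10, 0.35, 0.05, 0.95, 0.45, 0.20, 0.80])

for cutoff in (0.5, 0.3, 0.1):
    print(cutoff, confusion_counts(actual, scores, cutoff))
```

In this toy data, lowering the cutoff removes false negatives at the cost of a false positive, which is the trade we are willing to make when screening for malignant tumours.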

---
## Extensions

### Hyperparameter Optimization

Set our static hyperparameters


In [None]:
static_hyperparameters = {
    "objective" : "binary:logistic",
    "num_round" : "100"
}

In [None]:
from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner
sess = sagemaker.Session()

container = get_image_uri(region, 'xgboost', repo_version='latest')

xgb_hpo = sagemaker.estimator.Estimator(container,
                                    role, 
                                    train_instance_count=1, 
                                    train_instance_type='ml.m4.xlarge',
                                    output_path='s3://{}/{}/output'.format(bucket, prefix),
                                    sagemaker_session=sess)

xgb_hpo.set_hyperparameters(**static_hyperparameters)

Set our hyperparameters that we want SageMaker to tune

In [None]:
tuned_hyperparameter_ranges = {'eta': ContinuousParameter(0.1, 0.5),
                        'min_child_weight': ContinuousParameter(1, 10),
                        'alpha': ContinuousParameter(0, 2),
                        #'gamma': ContinuousParameter(0, 20),
                        'subsample': ContinuousParameter(0.5, 1),
                        'max_depth': IntegerParameter(6, 10)}

Define the hyperparameter tuning job

In [None]:
tuner = HyperparameterTuner(xgb_hpo,
                            objective_metric_name='validation:error',
                            objective_type='Minimize',
                            hyperparameter_ranges=tuned_hyperparameter_ranges,
                            max_jobs=20,
                            max_parallel_jobs=3)

Run the hyperparameter tuning job

In [None]:
timestamp = time.strftime('-%Y%m%d%H%M', time.gmtime())

In [None]:
tuner.fit({'train': s3_input_train, 'validation': s3_input_validation}, job_name='HPObreastCancer'+timestamp, include_cls_metadata=False)

### Check job name and status of HPO job

In [None]:
sage_client = boto3.Session().client('sagemaker')

hpo_job_name=tuner.latest_tuning_job.job_name
print(hpo_job_name)

sage_client.describe_hyper_parameter_tuning_job(
    HyperParameterTuningJobName=hpo_job_name)['HyperParameterTuningJobStatus']


### Analyze tuning job results - after tuning job is completed
Please refer to "HPO_Analyze_TuningJob_Results.ipynb" to see example code to analyze the tuning job results.

If the job is complete and you do not need a detailed analysis of the results, the code below returns the name of the best training job and the best combination of hyperparameters found.

In [None]:
tuning_job_result = sage_client.describe_hyper_parameter_tuning_job(HyperParameterTuningJobName=hpo_job_name)
best_job=tuning_job_result.get('BestTrainingJob',None)

best_job_name=best_job.get('TrainingJobName',None)
tuned_hyperparams=best_job.get('TunedHyperParameters',None)
print("Best job had name:",best_job_name)
print("Best hyperparameter combination:",tuned_hyperparams)

### Create a hosted endpoint from the best job results

Locate the S3 path to the model artifact

In [None]:
info = sage_client.describe_training_job(TrainingJobName=best_job_name)
model_data = info['ModelArtifacts']['S3ModelArtifacts']
print(model_data)

Create a SageMaker model from the model artifact

In [None]:
primary_container = {
    'Image': container,
    'ModelDataUrl': model_data
}

model_name=best_job_name + '-model'

create_model_response = sage_client.create_model(
    ModelName = model_name,
    ExecutionRoleArn = role,
    PrimaryContainer = primary_container)

In [None]:
print(create_model_response['ModelArn'])

Create a configuration for a SageMaker hosted endpoint

In [None]:
endpoint_config_name = 'HPO-XGBoostEndpointConfig-' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print(endpoint_config_name)
create_endpoint_config_response = sage_client.create_endpoint_config(
    EndpointConfigName = endpoint_config_name,
    ProductionVariants=[{
        'InstanceType':'ml.t2.medium',
        'InitialVariantWeight':1,
        'InitialInstanceCount':1,
        'ModelName':model_name,
        'VariantName':'AllTraffic'}])

print("Endpoint Config Arn: " + create_endpoint_config_response['EndpointConfigArn'])

Create a SageMaker hosted endpoint using the configuration created above

In [None]:
endpoint_name = 'HPO-XGBoostEndpoint-' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print(endpoint_name)
create_endpoint_response = sage_client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name)
print(create_endpoint_response['EndpointArn'])

resp = sage_client.describe_endpoint(EndpointName=endpoint_name)
status = resp['EndpointStatus']
print("Status: " + status)

while status=='Creating':
    time.sleep(60)
    resp = sage_client.describe_endpoint(EndpointName=endpoint_name)
    status = resp['EndpointStatus']
    print("Status: " + status)

print("Arn: " + resp['EndpointArn'])
print("Status: " + status)
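
The polling loop above waits indefinitely if the endpoint never leaves the `Creating` state. A minimal sketch of a more defensive variant is shown below; the helper `wait_for_status`, its parameters and the fake status sequence are all assumptions made for illustration, not part of the SageMaker SDK.

```python
import time

def wait_for_status(get_status, in_progress=("Creating",), poll_seconds=60,
                    timeout_seconds=3600, sleep=time.sleep):
    """Poll get_status() until it leaves the in-progress states or the timeout expires."""
    waited = 0
    status = get_status()
    while status in in_progress:
        if waited >= timeout_seconds:
            raise TimeoutError(f"Still '{status}' after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds
        status = get_status()
    return status

# Demonstration with a fake status source instead of a real endpoint.
fake_statuses = iter(["Creating", "Creating", "InService"])
final_status = wait_for_status(lambda: next(fake_statuses), poll_seconds=0,
                               sleep=lambda seconds: None)
print(final_status)
```

In this notebook, `get_status` could be `lambda: sage_client.describe_endpoint(EndpointName=endpoint_name)['EndpointStatus']`, which is the same call the loop above makes.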

Create a predictor (this is an object to make it simpler to make requests to an endpoint). 

In [None]:
xgb_predictor_hpo=sagemaker.predictor.RealTimePredictor(endpoint=endpoint_name,
                               serializer=csv_serializer,
                               deserializer=None,
                               content_type='text/csv',
                               accept=None)

The code below is the same as the code we used to run our test data against our initial predictor (before hyperparameter optimization)

In [None]:
# Create a list to hold all of our predictions using the HPO model
predictions_hpo = []

# Loop through the matrix of our test data samples, pulling out the features for each sample and running inference
# Note: dtest[i:i+1, 1:] is a 1 x 30 array of all the features for sample i (without the first entry, which is the label)
for i in range(dtest.shape[0]):
    sample_features=dtest[i:i+1, 1:]
    prediction=xgb_predictor_hpo.predict(sample_features).decode('utf-8')
    predictions_hpo.append(float(prediction))
       
# Convert our list of predictions to a numpy array
predictions_hpo = np.asarray(predictions_hpo)
display(predictions_hpo)

How does our new model perform at making predictions?

In [None]:
pd.crosstab(index=test_data.iloc[:, 0], columns=np.round(predictions_hpo), rownames=['actual'], colnames=['predictions'])

Interestingly it only has 2 false negatives where the original model had 3.
What about if we shift the cutoff point for a positive result?

In [None]:
pd.crosstab(index=test_data.iloc[:, 0], columns=np.where(predictions_hpo > 0.3, 1, 0), rownames=['actual'], colnames=['predictions'])

This is no better than our result before hyperparameter optimization.
With the hyperparameter ranges explored here we have not been able to improve the model any further, so the next step is to look at the data.
Perhaps we need to use domain knowledge to include extra relevant features.

### (Optional) Clean-up

If you're finished with this notebook, please run the cell below.  This will remove the hosted endpoint you created and avoid any charges from a stray instance being left on.

In [None]:
sagemaker.Session().delete_endpoint(xgb_predictor.endpoint)
sagemaker.Session().delete_endpoint(xgb_predictor_hpo.endpoint)