# Certified Machine Learning Specialist (CMLS) Project
_**Supervised Learning with Gradient Boosted Trees: A Binary Prediction Problem With Unbalanced Classes**_

---


## Contents

1. [Background](#Background)
1. [Preparation](#Preparation)
1. [Data](#Data)
    1. [Exploration](#Exploration)
    1. [Data Cleaning](#Data-Cleaning)
1. [Training](#Training)
1. [Hosting](#Hosting)
1. [Evaluation](#Evaluation)


---

## Background
This notebook works through an example problem: detecting credit card fraud with Amazon SageMaker XGBoost.  The steps include:

* Preparing your Amazon SageMaker notebook
* Downloading data from the internet into Amazon SageMaker
* Investigating and transforming the data so that it can be fed to Amazon SageMaker algorithms
* Estimating a model using the Gradient Boosting algorithm
* Evaluating the effectiveness of the model
* Setting the model up to make on-going predictions

---

## Preparation

_This notebook was created and tested on an ml.m4.xlarge notebook instance._

Let's start by specifying:

- The S3 bucket and prefix that you want to use for training and model data.  This should be within the same region as the Notebook Instance, training, and hosting.
- The IAM role ARN used to give training and hosting access to your data. See the documentation for how to create these.  Note, if more than one role is required for notebook instances, training, and/or hosting, please replace the `get_execution_role()` call with the appropriate full IAM role ARN string(s).

In [1]:
bucket = 'cmlsproject'
prefix = 'sagemaker/projects'
 
# Define IAM role
import boto3
import re
from sagemaker import get_execution_role

role = get_execution_role()

Now let's bring in the Python libraries that we'll use throughout the analysis.

In [2]:
import numpy as np                                # For matrix operations and numerical processing
import pandas as pd                               # For munging tabular data
import matplotlib.pyplot as plt                   # For charts and visualizations
from IPython.display import Image                 # For displaying images in the notebook
from IPython.display import display               # For displaying outputs in the notebook
from time import gmtime, strftime                 # For labeling SageMaker models, endpoints, etc.
import sys                                        # For writing outputs to notebook
import math                                       # For ceiling function
import json                                       # For parsing hosting outputs
import os                                         # For manipulating filepath names
import sagemaker                                  # Amazon SageMaker's Python SDK provides many helper functions
from sagemaker.predictor import csv_serializer    # Converts strings for HTTP POST requests on inference

---

## Data
Run the cell below to download the dataset from my S3 storage. It is a copy of the [Credit Card Fraud Dataset](https://www.kaggle.com/mlg-ulb/creditcardfraud/home) from Kaggle.

In [3]:
!wget https://s3-ap-southeast-1.amazonaws.com/poocb-machine-learning/creditcard-dataset/creditcardfraud.zip
!unzip -o creditcardfraud.zip

--2019-04-30 08:58:13--  https://s3-ap-southeast-1.amazonaws.com/poocb-machine-learning/creditcard-dataset/creditcardfraud.zip
Resolving s3-ap-southeast-1.amazonaws.com (s3-ap-southeast-1.amazonaws.com)... 52.219.40.21
Connecting to s3-ap-southeast-1.amazonaws.com (s3-ap-southeast-1.amazonaws.com)|52.219.40.21|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 69155632 (66M) [application/zip]
Saving to: ‘creditcardfraud.zip’


2019-04-30 08:58:14 (69.2 MB/s) - ‘creditcardfraud.zip’ saved [69155632/69155632]

Archive:  creditcardfraud.zip
  inflating: creditcard.csv          


Now let's read this into a Pandas data frame and take a look.

In [4]:
data = pd.read_csv('./creditcard.csv', sep=',')
pd.set_option('display.max_columns', 500)     # Make sure we can see all of the columns
pd.set_option('display.max_rows', 20)         # Keep the output on one page
data

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16,V17,V18,V19,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,0.090794,-0.551600,-0.617801,-0.991390,-0.311169,1.468177,-0.470401,0.207971,0.025791,0.403993,0.251412,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.166480,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,-0.166974,1.612727,1.065235,0.489095,-0.143772,0.635558,0.463917,-0.114805,-0.183361,-0.145783,-0.069083,-0.225775,-0.638672,0.101288,-0.339846,0.167170,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.379780,-0.503198,1.800499,0.791461,0.247676,-1.514654,0.207643,0.624501,0.066084,0.717293,-0.165946,2.345865,-2.890083,1.109969,-0.121359,-2.261857,0.524980,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,-0.054952,-0.226487,0.178228,0.507757,-0.287924,-0.631418,-1.059647,-0.684093,1.965775,-1.232622,-0.208038,-0.108300,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.50,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,0.753074,-0.822843,0.538196,1.345852,-1.119670,0.175121,-0.451449,-0.237033,-0.038195,0.803487,0.408542,-0.009431,0.798278,-0.137458,0.141267,-0.206010,0.502292,0.219422,0.215153,69.99,0
5,2.0,-0.425966,0.960523,1.141109,-0.168252,0.420987,-0.029728,0.476201,0.260314,-0.568671,-0.371407,1.341262,0.359894,-0.358091,-0.137134,0.517617,0.401726,-0.058133,0.068653,-0.033194,0.084968,-0.208254,-0.559825,-0.026398,-0.371427,-0.232794,0.105915,0.253844,0.081080,3.67,0
6,4.0,1.229658,0.141004,0.045371,1.202613,0.191881,0.272708,-0.005159,0.081213,0.464960,-0.099254,-1.416907,-0.153826,-0.751063,0.167372,0.050144,-0.443587,0.002821,-0.611987,-0.045575,-0.219633,-0.167716,-0.270710,-0.154104,-0.780055,0.750137,-0.257237,0.034507,0.005168,4.99,0
7,7.0,-0.644269,1.417964,1.074380,-0.492199,0.948934,0.428118,1.120631,-3.807864,0.615375,1.249376,-0.619468,0.291474,1.757964,-1.323865,0.686133,-0.076127,-1.222127,-0.358222,0.324505,-0.156742,1.943465,-1.015455,0.057504,-0.649709,-0.415267,-0.051634,-1.206921,-1.085339,40.80,0
8,7.0,-0.894286,0.286157,-0.113192,-0.271526,2.669599,3.721818,0.370145,0.851084,-0.392048,-0.410430,-0.705117,-0.110452,-0.286254,0.074355,-0.328783,-0.210077,-0.499768,0.118765,0.570328,0.052736,-0.073425,-0.268092,-0.204233,1.011592,0.373205,-0.384157,0.011747,0.142404,93.20,0
9,9.0,-0.338262,1.119593,1.044367,-0.222187,0.499361,-0.246761,0.651583,0.069539,-0.736727,-0.366846,1.017614,0.836390,1.006844,-0.443523,0.150219,0.739453,-0.540980,0.476677,0.451773,0.203711,-0.246914,-0.633753,-0.120794,-0.385050,-0.069733,0.094199,0.246219,0.083076,3.68,0


Let's talk about the data.  At a high level, we can see:

* We have a little over 284K credit card transaction records, with 31 columns for each transaction: 30 features plus the target variable.

_**Specifics on each of the features:**_

*Anonymized features for confidentiality:*
* `V1` to `V28`: principal components obtained by transforming the original features with PCA

*Non-anonymized features:*
* `Time`: Number of seconds (numeric) elapsed between this transaction and the first transaction in the dataset
* `Amount`: Transaction amount (numeric)

*Target variable:*
* `Class`: Is the transaction a fraud? 1 for fraudulent, 0 otherwise.

### Exploration
Let's start exploring the data.  First, let's understand how the target variable is distributed.

In [5]:
data['Class'].value_counts(normalize=True)

0    0.998273
1    0.001727
Name: Class, dtype: float64

Notice that:

* Only about 0.17% of the values for our target variable `Class` are "1", so very few transactions are fraudulent. The data is highly unbalanced, which is expected: fraud is a deliberate act and makes up a far smaller share of activity than valid transactions.
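
To see why this imbalance matters, the sketch below uses the class proportion printed by `value_counts` above to show that a trivial model which always predicts "not fraud" would already reach very high accuracy while catching no fraud at all. The function name `majority_baseline` is an illustrative helper, not part of the notebook.

```python
def majority_baseline(fraud_rate):
    """Accuracy and recall of a model that always predicts class 0.

    fraud_rate is the share of positive (fraud) rows in the data.
    """
    accuracy = 1.0 - fraud_rate   # every legitimate row counts as correct
    recall = 0.0                  # no fraudulent row is ever flagged
    return accuracy, recall


# Proportion of Class == 1 taken from the value_counts output above
accuracy, recall = majority_baseline(0.001727)
print('accuracy: {:.4%}, recall: {:.0%}'.format(accuracy, recall))
```

Accuracy alone is therefore a poor yardstick for this problem, which is why the evaluation at the end of the notebook looks at a confusion matrix instead.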

Next, since most features are anonymized, let's move on to transformation.

### Data Cleaning

Cleaning up data is part of nearly every machine learning project.  Since the given data has already been transformed with PCA, we only need to check for missing values first.

In [6]:
data.isnull().values.any()

False

As there are no missing values, we proceed to decide whether each feature adds value in our case. From the given data, we will drop the `Time` feature, which should not help to predict fraud.

In [7]:
model_data = data.drop(['Time',], axis=1)

Next, we'll divide the data into 3 groups.  The model will be trained on 70% of the data and validated on 20%, and the remaining 10% is held out as our testing dataset.

In [8]:
shuffled = model_data.sample(frac=1, random_state=1729)   # Randomly sort the data
n = len(shuffled)
# Split out the first 70%, the next 20%, and the last 10%
train_data, validation_data, test_data = np.split(shuffled, [int(0.7 * n), int(0.9 * n)])
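
As a sanity check on the split, the following sketch reproduces the index arithmetic of the two cut points used above. It assumes 284,807 rows, the total implied by the training log and the confusion matrix elsewhere in this notebook. The helper name `split_sizes` is hypothetical.

```python
def split_sizes(n_rows, train_frac=0.7, validation_frac=0.2):
    """Return (train, validation, test) row counts for a shuffled split.

    Mirrors the cut points int(0.7 * n) and int(0.9 * n) used above.
    """
    train_end = int(train_frac * n_rows)
    validation_end = int((train_frac + validation_frac) * n_rows)
    return train_end, validation_end - train_end, n_rows - validation_end


sizes = split_sizes(284807)
print(sizes)   # (199364, 56962, 28481)
```

The first two counts match the matrix sizes reported in the training log, and the third matches the number of test rows summarized in the confusion matrix.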

Amazon SageMaker's XGBoost container expects data in the libSVM or CSV data format.  For this example, we'll stick to CSV.  Note that the first column must be the target variable and the CSV should not include headers.  Also, notice that although repetitive it's easiest to do this after the train|validation|test split rather than before.  This avoids any misalignment issues due to random reordering.

In [9]:
pd.concat([train_data['Class'], train_data.drop(['Class'], axis=1)], axis=1).to_csv('train.csv', index=False, header=False)
pd.concat([validation_data['Class'], validation_data.drop(['Class'], axis=1)], axis=1).to_csv('validation.csv', index=False, header=False)
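
The layout requirement (label in the first column, no header row) can be illustrated on a couple of made-up rows without pandas. This is only a sketch: the helper `to_label_first_csv` and the sample values are hypothetical and are not taken from the dataset.

```python
import csv
import io


def to_label_first_csv(rows, label_key):
    """Serialize dict rows to headerless CSV with the label in column 0."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    for row in rows:
        features = [value for key, value in row.items() if key != label_key]
        writer.writerow([row[label_key]] + features)
    return buffer.getvalue()


# Two illustrative rows; the label 'Class' is last here, as in the raw data
rows = [
    {'V1': -1.36, 'V2': -0.07, 'Amount': 149.62, 'Class': 0},
    {'V1': 1.19, 'V2': 0.27, 'Amount': 2.69, 'Class': 1},
]
csv_text = to_label_first_csv(rows, 'Class')
print(csv_text)
```

Each output line starts with the `Class` value followed by the features in their original order, which is the shape the `pd.concat(...).to_csv(..., header=False)` calls above produce.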

Moving on, let's create the S3 bucket.

In [10]:
s3 = boto3.resource('s3')
my_region = boto3.session.Session().region_name   # Region of the notebook instance
print('region set to:', my_region)
try:
    if my_region == 'us-east-1':
        # us-east-1 is the default region and takes no LocationConstraint
        s3.create_bucket(Bucket=bucket)
    else:
        s3.create_bucket(Bucket=bucket,
                         CreateBucketConfiguration={'LocationConstraint': my_region})
    print('S3 bucket created successfully')
except Exception as e:
    print('S3 error: ', e)

region set to: ap-southeast-1
S3 bucket created successfully
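
The region check in the cell above exists because S3 does not accept `us-east-1` as a `LocationConstraint`; buckets in that region are created without a `CreateBucketConfiguration`. A minimal sketch of that rule as a pure function follows. `create_bucket_kwargs` is a hypothetical helper, and no AWS call is made.

```python
def create_bucket_kwargs(bucket_name, region):
    """Build keyword arguments for boto3's create_bucket call."""
    kwargs = {'Bucket': bucket_name}
    if region != 'us-east-1':
        # Every region other than us-east-1 must be named explicitly
        kwargs['CreateBucketConfiguration'] = {'LocationConstraint': region}
    return kwargs


print(create_bucket_kwargs('cmlsproject', 'ap-southeast-1'))
print(create_bucket_kwargs('cmlsproject', 'us-east-1'))
```

The resulting dictionary could then be passed on as `s3.create_bucket(**kwargs)`.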


Now we'll copy the files to S3 for Amazon SageMaker's managed training to pick up.

In [13]:
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train/train.csv')).upload_file('train.csv')
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'validation/validation.csv')).upload_file('validation.csv')

---

## Training
`xgboost` is an extremely popular, open-source package for gradient boosted trees.  It is computationally powerful, fully featured, and has been successfully used in many machine learning competitions.  Let's start with a simple `xgboost` model, trained using Amazon SageMaker's managed, distributed training framework.

First we'll need to specify the ECR container location for Amazon SageMaker's implementation of XGBoost.

In [14]:
from sagemaker.amazon.amazon_estimator import get_image_uri
container = get_image_uri(boto3.Session().region_name, 'xgboost')

Then, because we're training with the CSV file format, we'll create `s3_input`s that our training function can use as a pointer to the files in S3, which also specify that the content type is CSV.

In [15]:
s3_input_train = sagemaker.s3_input(s3_data='s3://{}/{}/train'.format(bucket, prefix), content_type='csv')
s3_input_validation = sagemaker.s3_input(s3_data='s3://{}/{}/validation/'.format(bucket, prefix), content_type='csv')

Next we'll specify training parameters to the estimator.  This includes:
1. The `xgboost` algorithm container
1. The IAM role to use
1. Training instance type and count
1. S3 location for output data
1. Algorithm hyperparameters

And then a `.fit()` function which specifies:
1. S3 location for input data.  In this case we have both a training and validation set which are passed in.

In [16]:
sess = sagemaker.Session()

xgb = sagemaker.estimator.Estimator(container,
                                    role, 
                                    train_instance_count=1, 
                                    train_instance_type='ml.m4.xlarge',
                                    output_path='s3://{}/{}/output'.format(bucket, prefix),
                                    sagemaker_session=sess)
xgb.set_hyperparameters(max_depth=5,
                        eta=0.2,
                        gamma=4,
                        min_child_weight=6,
                        subsample=0.8,
                        silent=0,
                        objective='binary:logistic',
                        num_round=100)

xgb.fit({'train': s3_input_train, 'validation': s3_input_validation}) 

INFO:sagemaker:Creating training-job with name: xgboost-2019-04-30-09-01-10-465


2019-04-30 09:01:10 Starting - Starting the training job...
2019-04-30 09:01:13 Starting - Launching requested ML instances......
2019-04-30 09:02:22 Starting - Preparing the instances for training......
2019-04-30 09:03:27 Downloading - Downloading input data
2019-04-30 09:03:27 Training - Downloading the training image..
Arguments: train
[2019-04-30:09:03:48:INFO] Running standalone xgboost training.
[2019-04-30:09:03:48:INFO] File size need to be processed in the node: 131.72mb. Available memory size in the node: 8405.54mb
[2019-04-30:09:03:48:INFO] Determined delimiter of CSV input is ','
[09:03:48] S3DistributionType set as FullyReplicated
[09:03:48] 199364x29 matrix with 5781556 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
[2019-04-30:09:03:48:INFO] Determined delimiter of CSV input is ','
[09:03:48] S3DistributionType set as FullyReplicated
[09:03:48] 56962x29 matri

---

## Hosting
Now that we've trained the `xgboost` algorithm on our data, let's deploy a model that's hosted behind a real-time endpoint.

In [17]:
xgb_predictor = xgb.deploy(initial_instance_count=1,
                           instance_type='ml.m4.xlarge')

INFO:sagemaker:Creating model with name: xgboost-2019-04-30-09-06-52-337
INFO:sagemaker:Creating endpoint with name xgboost-2019-04-30-09-01-10-465


------------------------------------------------------------------------------------------!

---

## Evaluation
Now we compare actual to predicted values using the 10% testing dataset created earlier.

The testing data is currently stored as NumPy arrays in the memory of the notebook instance.  To send it in an HTTP POST request, we'll serialize it as a CSV string and then decode the resulting CSV.

*Note: For inference with CSV format, SageMaker XGBoost requires that the data does NOT include the target variable.*

In [18]:
xgb_predictor.content_type = 'text/csv'
xgb_predictor.serializer = csv_serializer

Now, we'll use a simple function to:
1. Loop over our test dataset
1. Split it into mini-batches of rows 
1. Convert those mini-batches to CSV string payloads (notice, we drop the target variable from our dataset first)
1. Retrieve mini-batch predictions by invoking the XGBoost endpoint
1. Collect predictions and convert from the CSV output our model provides into a NumPy array

In [19]:
def predict(data, rows=500):
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = ''
    for array in split_array:
        predictions = ','.join([predictions, xgb_predictor.predict(array).decode('utf-8')])

    return np.fromstring(predictions[1:], sep=',')

predictions = predict(test_data.drop(['Class'], axis=1).values)   # .values replaces as_matrix(), which was removed in pandas 1.0
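
The batching idea in `predict` can be exercised without a live endpoint. The sketch below swaps in a stub for `xgb_predictor.predict`: the `fake_endpoint` function is made up for illustration and simply returns one dummy score per row as CSV bytes. It also uses plain lists and fixed-size chunks rather than `np.array_split`, so it is a variant of the cell above, not a copy of it.

```python
import math


def fake_endpoint(batch):
    """Stand-in for the hosted model: one dummy score per row, as CSV bytes."""
    return ','.join('0.5' for _ in batch).encode('utf-8')


def predict_in_batches(rows, invoke, batch_size=500):
    """Send rows to `invoke` in mini-batches and collect the scores as floats."""
    scores = []
    n_batches = math.ceil(len(rows) / batch_size)
    for i in range(n_batches):
        batch = rows[i * batch_size:(i + 1) * batch_size]
        response = invoke(batch).decode('utf-8')
        scores.extend(float(value) for value in response.split(','))
    return scores


rows = [[0.0] * 29 for _ in range(1200)]   # 1200 fake rows of 29 features
scores = predict_in_batches(rows, fake_endpoint)
print(len(scores))
```

Keeping each request to a few hundred rows keeps the payloads small, and the scores come back in the same order as the input rows, so they can be compared against the actual labels afterwards.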

Then we'll check our confusion matrix to see how well we predicted versus actuals.

In [20]:
pd.crosstab(index=test_data['Class'], columns=np.round(predictions), rownames=['actuals'], colnames=['predictions'])


| actuals \ predictions | 0.0 | 1.0 |
| --- | --- | --- |
| 0 | 28424 | 3 |
| 1 | 12 | 42 |


As shown above, of 28,481 test transactions, we predicted 45 to be fraudulent and 42 of them actually were.  We also had 12 fraudulent transactions that we failed to flag.
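
These counts can be condensed into precision, recall and F1 for the fraud class, which are more informative than accuracy on unbalanced data. The sketch below is one way to compute them from the four cells of the confusion matrix above; `classification_metrics` is an illustrative helper name.

```python
def classification_metrics(tn, fp, fn, tp):
    """Precision, recall and F1 for the positive (fraud) class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts read from the confusion matrix above
precision, recall, f1 = classification_metrics(tn=28424, fp=3, fn=12, tp=42)
print('precision={:.3f} recall={:.3f} f1={:.3f}'.format(precision, recall, f1))
```

With these counts, precision is 42/45 (about 93%) and recall is 42/54 (about 78%): most flagged transactions are truly fraudulent, but roughly one in five frauds goes undetected.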

### (Optional) Clean-up

If you are done with this notebook, please run the cell below.  This will remove the hosted endpoint you created and avoid any charges from a stray instance being left on.

In [None]:
sagemaker.Session().delete_endpoint(xgb_predictor.endpoint)