# Fraud Detection with Amazon SageMaker - XGBoost

*Supervised Learning with Gradient Boosted Trees: Binary Prediction with unbalanced classes.*



## Background 

Globally, tens of billions of dollars are lost to online fraud each year. To prevent fraud, companies need to 
develop machine-learning-powered fraud detection applications, because traditional rule-based solutions cannot cope 
with the changing behaviour of fraudsters. In this notebook, we will see how to build, train, tune, and deploy a fraud detection model with Amazon SageMaker. 

Steps include: 
- Preparing your Amazon SageMaker notebook 
- Downloading the data from the internet into Amazon SageMaker 
- Investigating and transforming the data so that it can be fed into Amazon SageMaker Algorithms. 
- Estimating a model using the Gradient Boosting algorithm.
- Evaluating effectiveness of the model
- Setting the model up to make on-going predictions. 



## Preparation

*This notebook was created and tested on an ml.m4.xlarge notebook instance.* 

Specifications: 
- The S3 bucket and prefix used for the training and model data. 
- The ARN of the IAM role that gives training and hosting access to the data. Refer to the Amazon SageMaker documentation for how to create this role. 

In [1]:
!pip install --upgrade pandas

Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Collecting pandas
  Downloading pandas-1.4.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.7 MB)
Installing collected packages: pandas
  Attempting uninstall: pandas
    Found existing installation: pandas 1.3.4
    Uninstalling pandas-1.3.4:
      Successfully uninstalled pandas-1.3.4
Successfully installed pandas-1.4.3
You should consider upgrading via the '/home/ec2-user/anaconda3/envs/python3/bin/python -m pip install --upgrade pip' command.

In [17]:
import sagemaker 

bucket = sagemaker.Session().default_bucket()
prefix = 'sagemaker/DEMO-xgboost-fraud'

import boto3 
import re 
from sagemaker import get_execution_role
import os

role = get_execution_role()

In [3]:
# All the required python libraries are imported. 

# Matrix operations and numerical processing. 
import numpy as np 

# Munging tabular data. 
import pandas as pd 

# For visualizations and charts. 
import matplotlib.pyplot as plt 

# To display images in the notebook. 
from IPython.display import Image 

# To display outputs in the notebook. 
from IPython.display import display 

# Labelling sagemaker models, endpoints, etc. 
from time import gmtime, strftime 

# For writing outputs to notebook 
import sys 

# For ceiling functions
import math 

# For parsing hosting outputs
import json 

# Manipulating filepath names.
import os 

# SageMaker's Python SDK helper functions. 
import sagemaker

# For extracting the downloaded zip archive. 
import zipfile 


In [4]:
pd.__version__

'1.4.3'

## Data 

The credit card fraud dataset is downloaded and read: 

In [9]:
!wget https://s3-us-west-2.amazonaws.com/sagemaker-e2e-solutions/fraud-detection/creditcardfraud.zip 
    
with zipfile.ZipFile('creditcardfraud.zip', 'r') as zip_ref: 
    zip_ref.extractall('.')

--2022-08-05 03:54:31--  https://s3-us-west-2.amazonaws.com/sagemaker-e2e-solutions/fraud-detection/creditcardfraud.zip
Resolving s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)... 52.218.229.152
Connecting to s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)|52.218.229.152|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 69155632 (66M) [application/zip]
Saving to: ‘creditcardfraud.zip.2’


2022-08-05 03:54:37 (11.2 MB/s) - ‘creditcardfraud.zip.2’ saved [69155632/69155632]



In [10]:
data = pd.read_csv('./creditcard.csv')
print(data.columns)
data[['Time', 'V1', 'V2', 'V27', 'V28', 'Amount', 'Class']].describe()
data.head(10)

Index(['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',
       'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20',
       'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount',
       'Class'],
      dtype='object')


Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16,V17,V18,V19,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,0.090794,-0.5516,-0.617801,-0.99139,-0.311169,1.468177,-0.470401,0.207971,0.025791,0.403993,0.251412,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,-0.166974,1.612727,1.065235,0.489095,-0.143772,0.635558,0.463917,-0.114805,-0.183361,-0.145783,-0.069083,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,0.207643,0.624501,0.066084,0.717293,-0.165946,2.345865,-2.890083,1.109969,-0.121359,-2.261857,0.52498,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,-0.054952,-0.226487,0.178228,0.507757,-0.287924,-0.631418,-1.059647,-0.684093,1.965775,-1.232622,-0.208038,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,0.753074,-0.822843,0.538196,1.345852,-1.11967,0.175121,-0.451449,-0.237033,-0.038195,0.803487,0.408542,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0
5,2.0,-0.425966,0.960523,1.141109,-0.168252,0.420987,-0.029728,0.476201,0.260314,-0.568671,-0.371407,1.341262,0.359894,-0.358091,-0.137134,0.517617,0.401726,-0.058133,0.068653,-0.033194,0.084968,-0.208254,-0.559825,-0.026398,-0.371427,-0.232794,0.105915,0.253844,0.08108,3.67,0
6,4.0,1.229658,0.141004,0.045371,1.202613,0.191881,0.272708,-0.005159,0.081213,0.46496,-0.099254,-1.416907,-0.153826,-0.751063,0.167372,0.050144,-0.443587,0.002821,-0.611987,-0.045575,-0.219633,-0.167716,-0.27071,-0.154104,-0.780055,0.750137,-0.257237,0.034507,0.005168,4.99,0
7,7.0,-0.644269,1.417964,1.07438,-0.492199,0.948934,0.428118,1.120631,-3.807864,0.615375,1.249376,-0.619468,0.291474,1.757964,-1.323865,0.686133,-0.076127,-1.222127,-0.358222,0.324505,-0.156742,1.943465,-1.015455,0.057504,-0.649709,-0.415267,-0.051634,-1.206921,-1.085339,40.8,0
8,7.0,-0.894286,0.286157,-0.113192,-0.271526,2.669599,3.721818,0.370145,0.851084,-0.392048,-0.41043,-0.705117,-0.110452,-0.286254,0.074355,-0.328783,-0.210077,-0.499768,0.118765,0.570328,0.052736,-0.073425,-0.268092,-0.204233,1.011592,0.373205,-0.384157,0.011747,0.142404,93.2,0
9,9.0,-0.338262,1.119593,1.044367,-0.222187,0.499361,-0.246761,0.651583,0.069539,-0.736727,-0.366846,1.017614,0.83639,1.006844,-0.443523,0.150219,0.739453,-0.54098,0.476677,0.451773,0.203711,-0.246914,-0.633753,-0.120794,-0.38505,-0.069733,0.094199,0.246219,0.083076,3.68,0


The `Class` column indicates whether a transaction is fraudulent (1) or not (0). 

We can see that the majority of the data is non-fraudulent. 

In [11]:
nonfrauds, frauds = data.groupby('Class').size()
print('Number of frauds', frauds)
print('Number of non-frauds', nonfrauds)
print('Percentage of fraudulent data = ', 100.*frauds/(frauds + nonfrauds))

Number of frauds 492
Number of non-frauds 284315
Percentage of fraudulent data =  0.1727485630620034
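
To make the imbalance concrete, the short calculation below uses only the two counts printed above. It sketches why plain accuracy is a misleading metric here: a hypothetical baseline that labels every transaction as non-fraudulent scores very high accuracy while catching no fraud at all. The variable names are illustrative and not part of the notebook.

```python
# Counts taken from the output above.
frauds = 492
nonfrauds = 284315
total = frauds + nonfrauds

# Hypothetical baseline: predict "non-fraudulent" for every transaction.
baseline_accuracy = nonfrauds / total  # share of rows it gets right
baseline_recall = 0 / frauds           # share of frauds it catches

print(f"Baseline accuracy: {baseline_accuracy:.4%}")
print(f"Baseline recall on frauds: {baseline_recall:.0%}")
```

This is why the evaluation later in the notebook looks at the confusion matrix rather than at accuracy alone.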


This dataset has 31 columns: 28 anonymized features $V_i$ for $i = 1, 2, \ldots, 28$, along with columns for time, amount, and class. We already know that the columns $V_i$ are the result of PCA (Principal Component Analysis), so they are centered at zero mean. 

In [13]:
feature_columns = data.columns[:-1]
label_column = data.columns[-1]


features = data[feature_columns].values.astype('float32')
labels = data[label_column].values.astype('float32')

In [14]:
# The training job reads the label from the first CSV column, 
# so move 'Class' to the front. 
model_data = data 
model_data = pd.concat([model_data['Class'], model_data.drop(['Class'], axis = 1)], axis = 1)
model_data.head()

Unnamed: 0,Class,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16,V17,V18,V19,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0,0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,0.090794,-0.5516,-0.617801,-0.99139,-0.311169,1.468177,-0.470401,0.207971,0.025791,0.403993,0.251412,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62
1,0,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,-0.166974,1.612727,1.065235,0.489095,-0.143772,0.635558,0.463917,-0.114805,-0.183361,-0.145783,-0.069083,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69
2,0,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,0.207643,0.624501,0.066084,0.717293,-0.165946,2.345865,-2.890083,1.109969,-0.121359,-2.261857,0.52498,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66
3,0,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,-0.054952,-0.226487,0.178228,0.507757,-0.287924,-0.631418,-1.059647,-0.684093,1.965775,-1.232622,-0.208038,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5
4,0,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,0.753074,-0.822843,0.538196,1.345852,-1.11967,0.175121,-0.451449,-0.237033,-0.038195,0.803487,0.408542,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99


In [33]:
# Shuffle the data, then split it into training (70%), validation (20%), and test (10%) sets. 


train_data, validation_data, test_data = np.split(model_data.sample(frac=1, random_state = 123), 
                                                 [int(0.7*len(model_data)), int(0.9*len(model_data))])

train_data.to_csv('train.csv', header = False, index = False)
validation_data.to_csv('validation.csv', header = False, index = False)
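
The shuffle-and-split step above can be tried on a small toy frame. The sketch below assumes a hypothetical 20-row DataFrame and splits shuffled row positions rather than the frame itself, which is a slightly different route to the same 70/20/10 partition. All names here (`toy`, `order`, `cuts`) are illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for model_data (hypothetical values).
toy = pd.DataFrame({"Class": [0] * 18 + [1] * 2,
                    "Amount": np.arange(20, dtype=float)})

# Shuffle the row positions reproducibly, then cut at 70% and 90%.
rng = np.random.default_rng(123)
order = rng.permutation(len(toy))
cuts = [int(0.7 * len(toy)), int(0.9 * len(toy))]
train_idx, validation_idx, test_idx = np.split(order, cuts)

toy_train = toy.iloc[train_idx]
toy_validation = toy.iloc[validation_idx]
toy_test = toy.iloc[test_idx]

print(len(toy_train), len(toy_validation), len(toy_test))
```

Every row lands in exactly one of the three sets. Note that with so few frauds, a purely random split can leave a small set with very few positive examples.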

# Uploading data to S3

In [20]:
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train/train.csv')).upload_file('train.csv')
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'validation/validation.csv')).upload_file('validation.csv')

s3_train_data = 's3://{}/{}/train/train.csv'.format(bucket, prefix)
s3_validation_data = 's3://{}/{}/validation/validation.csv'.format(bucket, prefix)
print('Uploaded Training data location {}\n'.format(s3_train_data))
print('Uploaded Validation data location {}\n'.format(s3_validation_data))

output_location = 's3://{}/{}/output'.format(bucket, prefix)
print('Training artifacts will be uploaded to: {}'.format(output_location))


Uploaded Training data location s3://sagemaker-ap-southeast-2-789833638223/sagemaker/DEMO-xgboost-fraud/train/train.csv

Uploaded Validation data location s3://sagemaker-ap-southeast-2-789833638223/sagemaker/DEMO-xgboost-fraud/validation/validation.csv

Training artifacts will be uploaded to: s3://sagemaker-ap-southeast-2-789833638223/sagemaker/DEMO-xgboost-fraud/output


## Training 

Before training, we need to specify the location of the XGBoost algorithm container. 
We use a utility function to obtain its URI. 

*xgboost* is a popular open-source package for gradient boosted trees. It is computationally powerful, fully featured, and has been successfully used in many machine learning use-cases. 


The ECR container location for SageMaker's implementation of XGBoost has to be specified. 

In [22]:
container = sagemaker.image_uris.retrieve(region = boto3.Session().region_name, framework = 'xgboost', version = 'latest')



Since the training data is in CSV format, we create `TrainingInput` objects that act as pointers to the files in S3 
and specify that the content type is CSV. 

In [23]:
s3_input_train = sagemaker.inputs.TrainingInput(s3_data = 's3://{}/{}/train'.format(bucket, prefix), content_type = 'csv')
s3_input_validation = sagemaker.inputs.TrainingInput(s3_data = 's3://{}/{}/validation/'.format(bucket, prefix), content_type = 'csv')



The estimator needs the following training parameters: 
- xgboost algorithm container 
- IAM Role that has to be used 
- Training instance type and count 
- S3 location for the output data 
- Algorithm hyperparameters. 


The `.fit()` function then starts the training job; the S3 locations of the training and validation sets are passed to it as input channels. 

In [25]:
sess = sagemaker.Session() 
xgb = sagemaker.estimator.Estimator(container, role, instance_count = 1, instance_type = 'ml.m4.xlarge', 
                                   output_path = 's3://{}/{}/output'.format(bucket, prefix), 
                                   sagemaker_session = sess)

xgb.set_hyperparameters(max_depth = 5, 
                       eta = 0.2, 
                       gamma = 4, 
                       min_child_weight = 6, 
                       subsample = 0.8, 
                       silent = 0, 
                       objective = 'binary:logistic', 
                       num_round = 100) 
                        
                        
xgb.fit({'train': s3_input_train, 'validation': s3_input_validation})                        

2022-08-05 05:30:19 Starting - Starting the training job...
2022-08-05 05:30:43 Starting - Preparing the instances for trainingProfilerReport-1659677419: InProgress
.........
2022-08-05 05:32:05 Downloading - Downloading input data......
2022-08-05 05:33:03 Training - Downloading the training image.....
Arguments: train
[2022-08-05:05:33:55:INFO] Running standalone xgboost training.
[2022-08-05:05:33:55:INFO] File size need to be processed in the node: 129.45mb. Available memory size in the node: 8461.95mb
[2022-08-05:05:33:55:INFO] Determined delimiter of CSV input is ','
[05:33:55] S3DistributionType set as FullyReplicated
[05:33:55] 199364x30 matrix with 5980920 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
[2022-08-05:05:33:55:INFO] Determined delimiter of CSV input is ','
[05:33:55] S3DistributionType set as FullyReplicated
[05:33:55] 56962x30 matrix with 1708860 entrie

### Now that the xgboost algorithm has been trained on the data, let's deploy a model that is hosted behind a real-time endpoint. 

In [26]:
xgb_predictor = xgb.deploy(initial_instance_count = 1, instance_type = 'ml.m4.xlarge')

------!

## Evaluation 

There are many ways to compare the performance of a machine learning model, but one of the simplest ways is by comparing the actual and predicted values. In this case, the prediction is made whether a transaction is fraudulent (1) or not (0). This leads to a confusion matrix. 


We need to determine how data is passed into the endpoint and how data is received from it. The data is stored as a NumPy array in the memory of our notebook instance. To send it in an HTTP POST request, we'll serialize it as a CSV string and then decode the CSV that the endpoint returns. 


The data should not include the target variable. 
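
The round trip described above can be sketched without an endpoint. The example below assumes a hypothetical two-row batch and a hypothetical comma-separated response body; it mimics what a CSV payload and the decoded scores look like and does not call any SageMaker API.

```python
import numpy as np

# Hypothetical mini-batch: two rows of three feature values (no label column).
batch = np.array([[0.0, -1.36, 149.62],
                  [1.0, 1.19, 2.69]])

# Serialize: one comma-separated line per row, as a CSV payload would look.
payload = "\n".join(",".join(str(v) for v in row) for row in batch)
print(payload)

# Deserialize: a hypothetical response body holding one score per row.
response_body = b"0.0012,0.9731"
scores = np.array(response_body.decode("utf-8").split(","), dtype=float)
print(scores)
```

In the notebook itself, the serialization half is handled by the `CSVSerializer` attached to the predictor in the next cell.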

In [27]:
xgb_predictor.serializer = sagemaker.serializers.CSVSerializer()

A function is used to: 
- Loop over the test dataset 
- Split it into mini-batches of rows 
- Convert the mini-batches to CSV string payloads (the target variable is dropped from the dataset first)
- Retrieve mini-batch predictions by invoking the XGBoost endpoint 
- Collect the predictions and convert them from the CSV output the model provides into a NumPy array. 

In [31]:
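def predict(data, predictor, rows = 500): 
    # Split the data into mini-batches of at most `rows` rows each. 
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = ''
    for array in split_array: 
        # Each response is a CSV string of scores; append it to the running string. 
        predictions = ','.join([predictions, predictor.predict(array).decode('utf-8')])
        
    # Drop the leading comma and parse all scores into a NumPy array. 
    return np.fromstring(predictions[1:], sep = ',')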

predictions = predict(test_data.drop(['Class'], axis = 1).to_numpy(), xgb_predictor)
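
The batching logic can be exercised offline with a stand-in for the endpoint. In the sketch below, `FakePredictor` and `predict_in_batches` are hypothetical names: the fake predictor returns a constant score per row as CSV bytes, which is an assumption about the response shape based on how the function above parses it.

```python
import numpy as np

class FakePredictor:
    """Hypothetical stand-in for the endpoint: one score per row, as CSV bytes."""

    def __init__(self):
        self.batch_sizes = []

    def predict(self, array):
        self.batch_sizes.append(array.shape[0])
        return ",".join(["0.5"] * array.shape[0]).encode("utf-8")

def predict_in_batches(data, predictor, rows=500):
    # Use enough batches that none exceeds `rows` rows.
    n_batches = int(np.ceil(data.shape[0] / rows))
    chunks = []
    for array in np.array_split(data, n_batches):
        body = predictor.predict(array).decode("utf-8")
        chunks.append(np.array(body.split(","), dtype=float))
    return np.concatenate(chunks)

fake = FakePredictor()
toy_features = np.zeros((1203, 30))  # 1203 rows of 30 placeholder features
toy_predictions = predict_in_batches(toy_features, fake)

print(fake.batch_sizes, toy_predictions.shape)
```

Keeping each request small helps stay under the payload size limit of a real-time endpoint.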



In [32]:
pd.crosstab(index = test_data.iloc[:,0], columns = np.round(predictions), rownames = ['actual'], colnames = ['predictions'])


| actual \ predictions | 0.0 | 1.0 |
|---|---|---|
| 0 | 28440 | 2 |
| 1 | 8 | 31 |


Due to the randomized elements of the algorithm, results may vary. 

Of the 39 fraudulent transactions in the test set, 31 have been correctly predicted (true positives). Two legitimate transactions have been incorrectly flagged as fraudulent (false positives). There are also 8 cases of fraud that the model has not predicted as fraudulent (false negatives), which can be costly in a real-world scenario. 
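
The counts in the confusion matrix above translate directly into precision and recall, which are more informative than accuracy for unbalanced classes. The calculation below is a small worked example using only those four counts.

```python
# Counts read from the confusion matrix above.
tn, fp = 28440, 2
fn, tp = 8, 31

precision = tp / (tp + fp)  # share of flagged transactions that are fraud
recall = tp / (tp + fn)     # share of frauds that were flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"precision = {precision:.3f}, recall = {recall:.3f}, F1 = {f1:.3f}")
```

Precision is about 0.94 and recall about 0.79, so most missed value comes from frauds that slip through rather than from false alarms.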

An important point here is that, because of the `np.round()` function, a simple threshold of 0.5 is used. Predictions from XGBoost are continuous values between 0 and 1, and rounding forces them into binary classes. This cutoff can be adjusted: lowering it catches more fraud (more true positives) at the cost of more false positives, while raising it does the opposite. 
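
The effect of moving the cutoff can be sketched on made-up data. The scores and labels below are hypothetical and chosen only for illustration; `confusion_counts` is an illustrative helper, not part of the notebook.

```python
import numpy as np

# Hypothetical scores and true labels, for illustration only.
scores = np.array([0.02, 0.10, 0.35, 0.45, 0.55, 0.80, 0.95])
labels = np.array([0, 0, 1, 0, 1, 0, 1])

def confusion_counts(scores, labels, threshold):
    """Return (tp, fp, fn, tn) when scores >= threshold are flagged as fraud."""
    predicted = (scores >= threshold).astype(int)
    tp = int(np.sum((predicted == 1) & (labels == 1)))
    fp = int(np.sum((predicted == 1) & (labels == 0)))
    fn = int(np.sum((predicted == 0) & (labels == 1)))
    tn = int(np.sum((predicted == 0) & (labels == 0)))
    return tp, fp, fn, tn

for threshold in (0.3, 0.5, 0.7):
    print(threshold, confusion_counts(scores, labels, threshold))
```

On this toy data, lowering the threshold from 0.5 to 0.3 finds one more fraud but adds one more false alarm, and raising it to 0.7 misses one more fraud. The same function could be applied to the `predictions` array and the test labels to pick a cutoff that matches the cost of each kind of error.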

## Automatic Model Tuning (AMT)

In [36]:
from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner

hyperparameter_ranges = {'eta': ContinuousParameter(0, 1), 
                        'min_child_weight': ContinuousParameter(1, 10),
                        'alpha': ContinuousParameter(0, 2), 
                        'max_depth': IntegerParameter(1, 10)}

In [37]:
objective_metric_name = 'validation:auc'

In [38]:
tuner = HyperparameterTuner(xgb, objective_metric_name, hyperparameter_ranges, max_jobs = 9, max_parallel_jobs=3)

In [39]:
tuner.fit({'train': s3_input_train, 'validation': s3_input_validation})

..........................................................................................................................................................................................!


In [None]:
boto3.client('sagemaker').describe_hyper_parameter_tuning_job(
    HyperParameterTuningJobName = tuner.latest_tuning_job.job_name)['HyperParameterTuningJobStatus']

In [None]:
tuner.best_training_job()

In [None]:
tuner_predictor = tuner.deploy(initial_instance_count = 1, instance_type = 'ml.m4.xlarge')

In [None]:
tuner_predictor.serializer = sagemaker.serializers.CSVSerializer()