# Banking Fraud Detection with XGBoost

Detecting fraud is critical for banks. Only a small fraction of transactions are fraudulent, but when a fraudulent transaction goes undetected, the damage can be large. In this task, you will build a binary classifier for mobile money transaction data. You can also download the data from [kaggle](https://www.kaggle.com/ntnu-testimon/paysim1). Since the dataset used here has already been preprocessed, the number of columns and rows may differ from the original.

## Data description

This simulated dataset supports financial research. The raw data contains 9 columns, and the sample used here has 3,000 records. Each record represents one transaction of type cash in, cash out, debit, payment, or transfer. Each transaction also has an amount, the name of the originator, and so on. The label (target) column is ``isFraud``, which indicates whether the transaction is fraudulent.

- type - CASH-IN, CASH-OUT, DEBIT, PAYMENT and TRANSFER.
- amount - amount of the transaction in local currency.
- nameOrig - customer who started the transaction
- oldbalanceOrg - initial balance before the transaction
- newbalanceOrig - new balance after the transaction
- nameDest - customer who is the recipient of the transaction
- oldbalanceDest - initial balance of the recipient before the transaction. Note that there is no information for customers whose names start with M (merchants).
- newbalanceDest - new balance of the recipient after the transaction. Note that there is no information for customers whose names start with M (merchants).
- isFraud - marks the transactions made by the fraudulent agents inside the simulation. In this specific dataset, the fraudulent agents aim to profit by taking control of customers' accounts and trying to empty them by transferring the funds to another account and then cashing out of the system.

## IMPORTANT
This notebook assumes that you have already preprocessed the data with SageMaker Data Wrangler. For details, see this [github](https://github.com/jjk-dev/amazon-sagemaker-studio-workshop.git) repository.

## Setup

Install the libraries needed to run this notebook.

In [1]:
#!pip install -U sagemaker

### Import libraries

In [2]:
import os                                         # For manipulating filepath names  
import sys                                        # For writing outputs to notebook
import math                                       # For ceiling function
import json                                       # For parsing hosting outputs

import numpy as np                                # For matrix operations and numerical processing
import pandas as pd                               # For munging tabular data
import matplotlib.pyplot as plt                   # For charts and visualizations
from IPython.display import Image                 # For displaying images in the notebook
from IPython.display import display               # For displaying outputs in the notebook
from time import gmtime, strftime                 # For labeling SageMaker models, endpoints, etc.

import sagemaker                                  # Amazon SageMaker's Python SDK provides many helper functions
from sagemaker import get_execution_role          # Define IAM role
from sklearn.preprocessing import MinMaxScaler

import boto3

### Set boto3 and variables

Create the SageMaker session and look up the IAM (Identity and Access Management) role. Then set values such as the ``S3 bucket name`` and ``student number``, and load the data.

In [3]:
sagemaker_session = sagemaker.Session()
s3 = boto3.resource('s3')
role = get_execution_role()

Couldn't call 'get_role' to get Role ARN from role name AmazonSageMaker-ExecutionRole-20210426T215985 to get Role path.
Assuming role was created in SageMaker AWS console, as the name contains `AmazonSageMaker-ExecutionRole`. Defaulting to Role ARN with service-role in path. If this Role ARN is incorrect, please add IAM read permissions to your role or supply the Role Arn directly.


### Change the values below

In [None]:
student_number = 'CHANGE TO YOUR STUDENT NUMBER'       # e.g. '2021000000'
bucket = 'CHANGE TO YOUR S3 BUCKET NAME'               # e.g. sagemaker-000000000000

input_data_bucket = 'CHANGE TO YOUR INPUT DATA LOCATION IN S3 BUCKET'     # e.g. sagemaker-000000000000/.../default
file = 'CHANGE TO YOUR TRANSFORMED DATA'                # e.g. part-00000-edb8e4ca....csv

In [5]:
prefix = 'banking-fraud'                           # Set folder name in S3 bucket        

data_location = 's3://{}/{}'.format(input_data_bucket, file) 

In [20]:
df = pd.read_csv(data_location)

df.shape

(3000, 11)

In [7]:
df.head()

| | amount | oldbalanceOrg | newbalanceOrig | oldbalanceDest | newbalanceDest | isFraud | type_PAYMENT | type_CASH_OUT | type_CASH_IN | type_TRANSFER | type_DEBIT |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 140421.18 | 16004 | 0.0 | 0.0 | 140421.18 | 0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | 216666.53 | 50398 | 0.0 | 10119297.16 | 10335963.7 | 0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 2 | 234636.2 | 74262 | 0.0 | 166046.48 | 400682.68 | 0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 3 | 52816.29 | 117751 | 170567.29 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 4 | 63871.25 | 6012 | 0.0 | 456488.36 | 520359.6 | 0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
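
The `type_*` columns in this preview look like a one-hot encoding of the original `type` column, presumably produced during the Data Wrangler preprocessing. The following is a minimal pandas sketch of the same idea, not the actual preprocessing flow. The small frame and the underscore spelling of the category values are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical raw rows; the real preprocessing was done in Data Wrangler.
raw = pd.DataFrame({
    "type": ["CASH_OUT", "CASH_IN", "TRANSFER"],
    "amount": [140421.18, 52816.29, 1000.0],
})

# One 0/1 column per category, named type_<CATEGORY>.
encoded = pd.get_dummies(raw, columns=["type"], prefix="type", dtype=float)
print(encoded.columns.tolist())
```

Each row has exactly one `type_*` column set to 1.0, so the categorical value is preserved in a purely numeric form that XGBoost can consume.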


## Split dataset

In [8]:
# Target column, which is dropped from the feature set during training
y_column = 'isFraud'
columns_to_drop = [y_column]

In [9]:
# Split the data into train, validation, and test sets in a 7:2:1 ratio
train_data, validation_data, test_data = np.split(df.sample(frac=1, random_state=2021), [int(0.7 * len(df)), int(0.9 * len(df))])
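
The one-line split above shuffles the rows and then cuts the shuffled frame at 70% and 90% of its length. As a rough illustration of that logic, here is an equivalent sketch written with `iloc`. The helper name `split_70_20_10` and the stand-in frame are hypothetical and are not part of the notebook.

```python
import numpy as np
import pandas as pd

def split_70_20_10(df, seed=2021):
    """Shuffle the rows, then cut at 70% and 90% of the length."""
    shuffled = df.sample(frac=1, random_state=seed)
    n = len(shuffled)
    first_cut, second_cut = int(0.7 * n), int(0.9 * n)
    return (shuffled.iloc[:first_cut],
            shuffled.iloc[first_cut:second_cut],
            shuffled.iloc[second_cut:])

# Stand-in frame with the same number of rows as the notebook's data.
demo = pd.DataFrame({"x": np.arange(3000)})
train, validation, test = split_70_20_10(demo)
print(len(train), len(validation), len(test))  # 2100 600 300
```

With 3,000 rows this gives 2,100 training rows, which matches the `2100x10` matrix reported in the training log, and 300 test rows.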

In Amazon SageMaker, the XGBoost container accepts data in libSVM or CSV format. In this lab, you use CSV files. For training, the first column of the CSV file must be the target value, and the file must not include a header. You will save the data after splitting it into train, validation, and test datasets.

In [10]:
# Save each dataset in local environment
pd.concat([train_data[y_column], train_data.drop(columns_to_drop, axis=1)], axis=1).to_csv('train.csv', index=False, header=False)
pd.concat([validation_data[y_column], validation_data.drop(columns_to_drop, axis=1)], axis=1).to_csv('validation.csv', index=False, header=False)
pd.concat([test_data[y_column], test_data.drop(columns_to_drop, axis=1)], axis=1).to_csv('test.csv', index=False, header=False)
pd.concat([test_data.drop(columns_to_drop, axis=1)], axis=1).to_csv('test_features.csv', index=False, header=False)
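
To make the expected file layout concrete, the sketch below writes a tiny frame in the same shape: target first, no header, no index. The helper `to_xgboost_csv` and the two-row frame are hypothetical and only illustrate the format.

```python
import io
import pandas as pd

def to_xgboost_csv(df, target):
    """Return CSV text with the target first, no header, and no index."""
    ordered = [target] + [c for c in df.columns if c != target]
    buffer = io.StringIO()
    df[ordered].to_csv(buffer, index=False, header=False)
    return buffer.getvalue()

demo = pd.DataFrame({"amount": [10.5, 20.0], "isFraud": [0, 1]})
print(to_xgboost_csv(demo, "isFraud"))
```

Each output line starts with the label, followed by the feature values in their original order.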

Upload the files to S3 so that they can be accessed from the managed SageMaker environment.

In [11]:
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train/train.csv')).upload_file('train.csv')
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'validation/validation.csv')).upload_file('validation.csv')
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'test/test.csv')).upload_file('test.csv')
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'test/test_features.csv')).upload_file('test_features.csv')
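
One detail worth noting about the upload cell: S3 object keys always use `/` as the separator, whereas `os.path.join` uses the separator of the local operating system. On a SageMaker notebook instance (Linux) the two coincide. If the same code is run elsewhere, a helper such as the hypothetical `s3_key` below keeps the keys portable.

```python
import posixpath

def s3_key(prefix, *parts):
    """Join key components with '/', regardless of the local operating system."""
    return posixpath.join(prefix, *parts)

print(s3_key("banking-fraud", "train", "train.csv"))  # banking-fraud/train/train.csv
```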

## Train a model using XGBoost

This notebook uses XGBoost, which is simple but effective for binary classification. XGBoost is an open source library that implements gradient boosting. It is computationally efficient, provides the necessary functionality, and has been successful in many machine learning competitions. Let's start with a simple XGBoost model trained with SageMaker's managed, distributed training framework.
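
To give a feel for what gradient boosting does, here is a deliberately simplified NumPy sketch: it repeatedly fits a one-split "stump" to the residuals of the logistic loss and adds a shrunken copy of it to the running score. This is not XGBoost's actual algorithm (XGBoost adds second-order gradients, regularization, and deeper trees), and all function names are made up for illustration. The parameter names `eta` and `num_round` simply mirror the hyperparameters set later in this notebook.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_stump(x, residual):
    """Pick the threshold on a single feature that best fits the residuals."""
    best = None
    for t in np.unique(x)[:-1]:            # skip the max so both sides are non-empty
        left = x <= t
        left_value, right_value = residual[left].mean(), residual[~left].mean()
        error = ((residual - np.where(left, left_value, right_value)) ** 2).sum()
        if best is None or error < best[0]:
            best = (error, t, left_value, right_value)
    return best[1:]

def boost(x, y, num_round=5, eta=0.2):
    """Toy gradient boosting for binary labels on one feature."""
    margin = np.zeros_like(x, dtype=float)  # raw score before the sigmoid
    stumps = []
    for _ in range(num_round):
        residual = y - sigmoid(margin)      # negative gradient of the log loss
        t, left_value, right_value = fit_stump(x, residual)
        margin += eta * np.where(x <= t, left_value, right_value)
        stumps.append((t, left_value, right_value))
    return stumps, sigmoid(margin)

# Made-up data: larger values of x are labeled 1.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
stumps, probabilities = boost(x, y)
print(np.round(probabilities, 3))
```

Each round nudges the predicted probabilities a little further toward the labels; a smaller `eta` makes each nudge smaller, so more rounds are needed.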

You can use the built-in algorithm from its ECR container image.

In [12]:
from sagemaker import image_uris

container = sagemaker.image_uris.retrieve(framework = 'xgboost', 
                                          region = boto3.Session().region_name, 
                                          version = 'latest')

Create `TrainingInput` objects that point to the file locations, and set the content type to `csv` because the data is in CSV format.

In [13]:
s3_input_train = sagemaker.inputs.TrainingInput(s3_data='s3://{}/{}/train'.format(bucket, prefix), content_type='csv')
s3_input_validation = sagemaker.inputs.TrainingInput(s3_data='s3://{}/{}/validation/'.format(bucket, prefix), content_type='csv')

Create the estimator by specifying the parameters below.

- XGBoost algorithm container
- IAM role
- Training instance type and count (by using 'local', you can train the model within this notebook instance.)
- Output location in S3
- Algorithm hyperparameters

Call `.fit()` with the following values.
- Locations of the train / validation data

In [14]:
sess = sagemaker.Session()

job_name=student_number+'-banking-fraud-'+strftime("%Y-%m-%d-%H-%M-%S", gmtime())

xgb = sagemaker.estimator.Estimator(container,
                                    role, 
                                    instance_count=1,               # renamed from train_instance_count in sagemaker>=2
                                    instance_type='ml.m4.xlarge',   # renamed from train_instance_type in sagemaker>=2
                                    output_path='s3://{}/{}/output'.format(bucket, prefix),
                                    sagemaker_session=sess,
                                    base_job_name=job_name)

xgb.set_hyperparameters(max_depth=5,
                        eta=0.2,
                        gamma=4,
                        min_child_weight=6,
                        subsample=0.8,
                        silent=0,
                        objective='binary:logistic',
                        num_round=5)

xgb.fit({'train': s3_input_train, 'validation': s3_input_validation})


2021-04-26 18:53:18 Starting - Starting the training job...
2021-04-26 18:53:26 Starting - Launching requested ML instancesProfilerReport-1619463197: InProgress
.........
2021-04-26 18:55:06 Starting - Preparing the instances for training............
2021-04-26 18:57:19 Downloading - Downloading input data
2021-04-26 18:57:19 Training - Training image download completed. Training in progress..
Arguments: train
[2021-04-26:18:57:20:INFO] Running standalone xgboost training.
[2021-04-26:18:57:20:INFO] File size need to be processed in the node: 0.15mb. Available memory size in the node: 8405.34mb
[2021-04-26:18:57:20:INFO] Determined delimiter of CSV input is ','
[18:57:20] S3DistributionType set as FullyReplicated
[18:57:20] 2100x10 matrix with 21000 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
[2021-04-26:18:57:20:INFO] Determined delimiter of CSV input is ','
[18:57:20] S3Distribut

## Hosting
### Create endpoint

Once the XGBoost model has been trained on the input data, deploy it as an endpoint for real-time inference.

In [15]:
%%time

endpoint_name=student_number+'-banking-fraud-'+strftime("%Y-%m-%d-%H-%M-%S", gmtime())

xgb_predictor = xgb.deploy(initial_instance_count=1,
                           instance_type='ml.m4.xlarge', 
                           endpoint_name=endpoint_name)

-------------!CPU times: user 199 ms, sys: 18 ms, total: 217 ms
Wall time: 6min 31s


## Perform Inference
### Make predictions using the endpoint

Compare the actual and predicted values to evaluate the performance of the machine learning model. Send the inference data to the endpoint and get the results. The data is serialized as CSV so that it can be sent in an HTTP POST request, and the CSV response is then decoded.

CAUTION: For CSV inference, the input sent to SageMaker XGBoost must not contain the target column.

In [16]:
from sagemaker.serializers import CSVSerializer

xgb_predictor.serializer = CSVSerializer()

Create a function that calls the endpoint.

- Loop over the test dataset
- Split it into mini-batches of a given number of rows
- Convert each mini-batch to a CSV string payload (without the target column)
- Call the XGBoost endpoint and collect the predicted values
- Convert the returned CSV result to a NumPy array

In [17]:
def predict(data, rows=500):
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = ''
    for i, array in enumerate(split_array):
        predictions = ','.join([predictions, xgb_predictor.predict(array).decode('utf-8')])
        if i % 10 == 0:
            print(i, 'out of', len(split_array), 'completed')
    return np.fromstring(predictions[1:], sep=',')

predictions = predict(test_data.drop(columns_to_drop, axis=1).to_numpy())

0 out of 1 completed
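
The batching logic can be tried offline, without a deployed endpoint, by swapping the endpoint call for a stub. The sketch below is a hedged variant of the function above: `predict_in_batches` and `fake_endpoint` are hypothetical names, and the stub only imitates the comma-separated bytes that the notebook's code decodes. Unlike the original, it uses a ceiling division, so no extra batch is created when the row count is an exact multiple of `rows`.

```python
import numpy as np

def predict_in_batches(data, predict_fn, rows=500):
    """Send `data` to `predict_fn` in chunks of at most `rows` rows.

    `predict_fn` stands in for the endpoint call: it takes a 2-D array
    and returns the scores as comma-separated bytes.
    """
    n_batches = max(1, int(np.ceil(data.shape[0] / rows)))
    scores = []
    for batch in np.array_split(data, n_batches):
        text = predict_fn(batch).decode("utf-8")
        scores.extend(float(s) for s in text.split(",") if s)
    return np.array(scores)

# Hypothetical offline stub: "scores" each row with its first feature value.
def fake_endpoint(batch):
    return ",".join(str(v) for v in batch[:, 0]).encode("utf-8")

demo = np.arange(1200, dtype=float).reshape(600, 2)
result = predict_in_batches(demo, fake_endpoint, rows=250)
print(result.shape)  # (600,)
```

Keeping the batches small matters for a real endpoint because each request payload has a size limit; the stub has no such limit, so only the control flow is exercised here.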


In [18]:
# F1-score, accuracy, ROC AUC
from sklearn.metrics import classification_report, roc_auc_score, accuracy_score # import classification metrics

print(classification_report(test_data[y_column], np.round(predictions)))
print("Test accuracy:", accuracy_score(test_data[y_column], np.round(predictions)))
# AUC is computed on the rounded 0/1 predictions here; rescaling them would not change this rank-based score
print("ROC_AUC score:", roc_auc_score(test_data[y_column], np.round(predictions)))

              precision    recall  f1-score   support

           0       0.97      0.99      0.98       265
           1       0.90      0.80      0.85        35

    accuracy                           0.97       300
   macro avg       0.94      0.89      0.91       300
weighted avg       0.97      0.97      0.97       300

Test accuracy: 0.9666666666666667
ROC_AUC score: 0.8943396226415095
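
Because `roc_auc_score` is called on rounded 0/1 predictions rather than on the raw scores, the ROC curve has a single operating point, and the area under it reduces to the mean of the recall on each class. The sketch below checks this using the counts from this run's confusion matrix (262 and 3 for the non-fraud class, 7 and 28 for the fraud class). The function name is hypothetical.

```python
def balanced_accuracy(tn, fp, fn, tp):
    """Mean of the recall on the negative class and on the positive class."""
    true_negative_rate = tn / (tn + fp)
    true_positive_rate = tp / (tp + fn)
    return (true_negative_rate + true_positive_rate) / 2

value = balanced_accuracy(tn=262, fp=3, fn=7, tp=28)
print(round(value, 4))  # 0.8943
```

Passing the unrounded `predictions` to `roc_auc_score` instead would measure how well the model ranks fraud above non-fraud across all thresholds, which is usually the more informative number for imbalanced data.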


Create a confusion matrix that compares the predicted results with the actual values. Your results may not be exactly the same as those shown here.

In [19]:
pd.crosstab(index=test_data['isFraud'], columns=np.round(predictions), rownames=['actuals'], colnames=['predictions'])

| actuals\predictions | 0.0 | 1.0 |
| --- | --- | --- |
| 0 | 262 | 3 |
| 1 | 7 | 28 |


TP, TN, FP, FN are defined as below.

- TP = Truly (identified as) Positive
- TN = Truly (identified as) Negative
- FP = Falsely (identified as) Positive
- FN = Falsely (identified as) Negative

The cells of the confusion matrix correspond to these terms as follows:

| actuals\predictions | 0 | 1 |
| --- | --- | --- |
| 0 | TN | FP |
| 1 | FN | TP |


On that basis, you can calculate accuracy, precision, and recall. Using the counts from the confusion matrix above:
- Accuracy = (TP + TN) / (TP + FP + FN + TN) = (28 + 262) / 300
- Precision = TP / (TP + FP) = 28 / (28 + 3)
- Recall = TP / (TP + FN) = 28 / (28 + 7)
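
As a quick check of these formulas, the sketch below plugs in the counts from the confusion matrix produced in this run (TN = 262, FP = 3, FN = 7, TP = 28). The helper `classification_metrics` is a hypothetical name used only for illustration.

```python
def classification_metrics(tn, fp, fn, tp):
    """Accuracy, precision, and recall from the four confusion-matrix counts."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

metrics = classification_metrics(tn=262, fp=3, fn=7, tp=28)
print({name: round(value, 2) for name, value in metrics.items()})
# {'accuracy': 0.97, 'precision': 0.9, 'recall': 0.8}
```

The rounded values agree with the row for class 1 and the accuracy line in the classification report. For fraud detection, recall is often the figure to watch, since each false negative is a fraudulent transaction that slipped through.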

## Stop / Close the Endpoint

After finishing all of the examples, run the cell below. It deletes the SageMaker endpoint created in the hosting step. As long as the endpoint exists, charges continue to accrue.

In [None]:
# sagemaker.Session().delete_endpoint(xgb_predictor.endpoint_name)   # `endpoint` was renamed to `endpoint_name` in sagemaker>=2