# Banking Fraud Detection with RandomForestClassifier

Detecting fraud matters to banks: only a small share of transactions are fraudulent, but each undetected one can cause substantial damage. In this task, you will build a binary classifier for mobile money transaction data. The original data is available from [Kaggle](https://www.kaggle.com/ntnu-testimon/paysim1). Because the dataset used here has already been preprocessed, its row and column counts may differ from the Kaggle version.

## Data description

This simulated dataset supports financial research. It has the 9 columns listed below, and the sample used in this notebook has 3,000 records. Each record is one transaction of type cash in, cash out, debit, payment, or transfer. Each transaction also carries an amount, the name of the originating customer, and the other fields listed below. The label (target) column is ``isFraud``, which indicates whether the transaction is fraudulent.

- type - CASH-IN, CASH-OUT, DEBIT, PAYMENT and TRANSFER.
- amount - amount of the transaction in local currency.
- nameOrig - customer who started the transaction
- oldbalanceOrg - initial balance before the transaction
- newbalanceOrig - new balance after the transaction
- nameDest - customer who is the recipient of the transaction
- oldbalanceDest - initial balance of the recipient before the transaction. Note that there is no information for customers whose names start with M (Merchants).
- newbalanceDest - new balance of the recipient after the transaction. Note that there is no information for customers whose names start with M (Merchants).
- isFraud - marks the transactions made by the fraudulent agents inside the simulation. In this dataset, the fraudulent agents aim to profit by taking control of customers' accounts and emptying the funds by transferring them to another account and then cashing out of the system.

## IMPORTANT
This notebook assumes that you have already preprocessed the data with SageMaker Data Wrangler. For the preprocessing steps, see this [GitHub repository](https://github.com/jjk-dev/amazon-sagemaker-studio-workshop.git).

## Set up
Import the libraries that this notebook needs.

### Import libraries

In [1]:
import os
import math
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split # import train_test_split function
from sklearn.metrics import classification_report # import classification metrics
from sklearn.ensemble import RandomForestClassifier #import RandomForestClassifier

import sagemaker 
import boto3

### Set boto3 and variables

Create a SageMaker session and a boto3 S3 resource. Then set the values used to load the data, such as the ``S3 bucket name`` and the ``file name``.

In [2]:
sagemaker_session = sagemaker.Session()
s3 = boto3.resource('s3')

### Change the values below

In [None]:
# Set S3 bucket to get csv file
bucket = 'CHANGE TO YOUR S3 BUCKET NAME'        # e.g. sagemaker-000000000000/.../default
file = 'CHANGE TO YOUR TRANSFORMED DATA'        # e.g. part-00000-edb8e4ca....csv

data_location = 's3://{}/{}'.format(bucket, file) 
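
The bucket value above may include a prefix, so stray slashes at the join are easy to introduce. The following is a minimal sketch of a helper that joins the two parts safely; `build_s3_uri` and the example names are hypothetical and not part of the notebook.

```python
def build_s3_uri(bucket: str, key: str) -> str:
    """Join a bucket (optionally with a prefix) and an object key into an s3:// URI."""
    # Strip stray slashes so the two parts are joined by exactly one "/"
    return "s3://{}/{}".format(bucket.strip("/"), key.lstrip("/"))


# Hypothetical names, for illustration only
uri = build_s3_uri("my-example-bucket/prefix/", "/data.csv")
print(uri)
```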

## Read dataset

Read data from your S3 bucket that you set above.

In [4]:
df = pd.read_csv(data_location)

df.shape

(3000, 11)

In [5]:
df.head()

| Unnamed: 0 | amount | oldbalanceOrg | newbalanceOrig | oldbalanceDest | newbalanceDest | isFraud | type_PAYMENT | type_CASH_OUT | type_CASH_IN | type_TRANSFER | type_DEBIT |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 140421.18 | 16004 | 0.0 | 0.0 | 140421.18 | 0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | 216666.53 | 50398 | 0.0 | 10119297.16 | 10335963.7 | 0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 2 | 234636.2 | 74262 | 0.0 | 166046.48 | 400682.68 | 0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 3 | 52816.29 | 117751 | 170567.29 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 4 | 63871.25 | 6012 | 0.0 | 456488.36 | 520359.6 | 0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
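
The `type_*` columns in the preview look like a one-hot encoding of the raw `type` column, presumably produced during the Data Wrangler step. The following is a minimal pandas sketch of an equivalent transformation. The sample rows are made up for illustration, and the actual Data Wrangler recipe may differ.

```python
import pandas as pd

# Hypothetical raw rows (illustrative values, not taken from the dataset)
raw = pd.DataFrame({
    "type": ["CASH_OUT", "CASH_IN", "TRANSFER", "PAYMENT"],
    "amount": [100.0, 250.0, 75.5, 12.0],
})

# Create one 0.0/1.0 indicator column per category, named type_<CATEGORY>
encoded = pd.get_dummies(raw, columns=["type"], prefix="type", dtype=float)

print(encoded.columns.tolist())
```

Each row ends up with exactly one `type_*` column set to 1.0, which matches the pattern in the preview.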


## Train a model using RandomForestClassifier

In [6]:
# Set the feature and target columns
features = df.drop('isFraud', axis=1)
target = df.isFraud

# split the data into train and test
x_train, x_test, y_train, y_test = train_test_split(features, target, test_size=0.3)
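
The split above is random on every run and does not control the fraud ratio in each part. With an imbalanced target, one option is to pass `stratify` and `random_state` to `train_test_split`. The sketch below uses synthetic stand-in data, not the notebook's `df`, and the seed value is an arbitrary assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 10% positive labels
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = np.zeros(1000, dtype=int)
y[:100] = 1

# stratify=y keeps the class ratio the same in both splits;
# random_state makes the split reproducible across runs
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

print(y_tr.mean(), y_te.mean())
```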

In [7]:
# Modeling
model = RandomForestClassifier(n_estimators = 10)

# Train
model.fit(x_train, y_train)

# Predict
y_pred = model.predict(x_test)
y_real = y_test
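
`model.predict` returns the class with the highest averaged probability, which for two classes amounts to a 0.5 cutoff. Because an undetected fraud is costly, it can be useful to flag transactions at a lower cutoff with `predict_proba`. The sketch below uses synthetic data, and the threshold of 0.3 is an arbitrary illustrative value, not a recommendation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; the label depends on the first feature
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] > 1.0).astype(int)

clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X, y)

# Column 1 of predict_proba is the estimated probability of class 1
proba_fraud = clf.predict_proba(X)[:, 1]

# Illustrative cutoff below the default; tune it on validation data
threshold = 0.3
y_pred_low = (proba_fraud >= threshold).astype(int)
y_pred_default = clf.predict(X)

print(y_pred_low.sum(), y_pred_default.sum())
```

A lower cutoff flags at least as many rows as the default prediction, trading some precision for recall.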

In [8]:
print(classification_report(y_real, y_pred))

              precision    recall  f1-score   support

           0       0.99      0.99      0.99       801
           1       0.93      0.93      0.93        99

    accuracy                           0.98       900
   macro avg       0.96      0.96      0.96       900
weighted avg       0.98      0.98      0.98       900
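
The summary rows of the report can be reproduced from the per-class rows. The sketch below recomputes the macro and weighted precision from the values printed above, and adds the accuracy of a trivial model that always predicts "not fraud" as a reference point for the 0.98 accuracy.

```python
# Per-class precision and support copied from the report above
precision = {0: 0.99, 1: 0.93}
support = {0: 801, 1: 99}
total = sum(support.values())

# Macro average: unweighted mean over the classes
macro_precision = sum(precision.values()) / len(precision)

# Weighted average: mean weighted by each class's support
weighted_precision = sum(precision[c] * support[c] for c in precision) / total

# Accuracy of a trivial model that always predicts class 0 (not fraud)
baseline_accuracy = support[0] / total

print(round(macro_precision, 2), round(weighted_precision, 2), round(baseline_accuracy, 2))
```

The weighted average is dominated by the majority class, so the class 1 row is the more informative one for judging fraud detection.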

