# Fraud Detection using Isolation Forest
### Introduction
We use the [ULB credit card fraud dataset](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud) to train our fraud detection model. In this notebook, we use the [Isolation Forest](https://mmlspark.blob.core.windows.net/docs/0.9.1/pyspark/synapse.ml.isolationforest.html) algorithm, drawing on the following excellent work:

* **Fraud detection handbook**: https://fraud-detection-handbook.github.io/fraud-detection-handbook/Foreword.html
* **AWS creditcard fraud detector**: https://github.com/awslabs/fraud-detection-using-machine-learning/blob/master/source/notebooks/sagemaker_fraud_detection.ipynb
* **Anomaly Detection using different methods**: https://www.kaggle.com/code/adepvenugopal/anomaly-detection-using-different-methods

In a fraud detection scenario, labeled examples are often scarce, and labeling fraud can take a very long time. Isolation Forest is an unsupervised learning algorithm that scales well and can flag suspicious transactions from features alone when little labeled data is available.
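
To make the intuition concrete: an Isolation Forest isolates points with random splits, and points that are isolated after fewer splits are treated as more anomalous. The sketch below shows the score normalization described in the original Isolation Forest paper (Liu, Ting and Zhou), written in plain Python. The function names are hypothetical, and this is an illustration of the idea rather than the code SynapseML runs internally.

```python
import math

EULER_GAMMA = 0.5772156649015329

def average_path_length(n):
    """Normalizer c(n): average path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of the harmonic number H(n-1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(mean_path_length, n):
    """s(x, n) = 2^(-E[h(x)] / c(n)); higher scores mean the point was isolated sooner."""
    return 2.0 ** (-mean_path_length / average_path_length(n))

n = 256  # same value as the maxSamples setting used later in this notebook
print(anomaly_score(2.0, n))                     # short path: score above 0.5
print(anomaly_score(average_path_length(n), n))  # average path: score of 0.5
print(anomaly_score(20.0, n))                    # long path: score below 0.5
```

A point whose mean path length equals the average `c(n)` scores 0.5, which is why 0.5 is often read as the boundary between ordinary and suspicious points.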

In [None]:
import pyspark
import yaml
import numpy as np
import pandas as pd
import warnings

from pyspark.sql import functions as F
from pyspark.sql.types import FloatType, DoubleType
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.mllib.evaluation import MulticlassMetrics


warnings.filterwarnings('ignore')
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

In [None]:
def init_spark():
    spark = pyspark.sql.SparkSession.builder\
            .appName("Fraud Detection-IsolationForest") \
            .config("spark.executor.memory","8G") \
            .config("spark.executor.instances","4") \
            .config("spark.executor.cores", "4") \
            .config("spark.jars.packages", "com.microsoft.azure:synapseml_2.12:0.9.4") \
            .config("spark.jars.repositories", "https://mmlspark.azureedge.net/maven") \
            .getOrCreate()
    sc = spark.sparkContext
    print(sc.version)
    print(sc.applicationId)
    print(sc.uiWebUrl)
    return spark

def load_config(path):
    # safe_load only builds plain Python objects, which is all a config file needs
    with open(path, 'r') as stream:
        params = yaml.safe_load(stream)
    return params

def read_dataset(spark, data_path):
    dataset = spark.read.format("csv")\
      .option("header",  True)\
      .option("inferSchema",  True)\
      .load(data_path)  
    return dataset

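def get_vectorassembler(dataset, features='features', label='label'):
    # NOTE: relies on the global `feature_cols`, which is defined in a later cell
    # and must exist before this function is called.
    featurizer = VectorAssembler(
        inputCols = feature_cols,
        outputCol = features,
        handleInvalid = 'skip'  # drop rows that contain null/NaN feature values
    )
    dataset = featurizer.transform(dataset).select(label, features)
    return dataset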

In [None]:
spark = init_spark()

### Train detection model using Isolation Forest

Here we use [Isolation Forest](https://mmlspark.blob.core.windows.net/docs/0.9.1/pyspark/synapse.ml.isolationforest.html) to train our fraud detection model. We then evaluate the model with several metrics: AUC, KS, balanced accuracy, Cohen's kappa and the confusion matrix.

Before executing the code cells that contain them, replace the placeholders ``{YOUR_S3_BUCKET}``, ``{TRAIN_S3_PATH}`` and ``{TEST_S3_PATH}`` with actual values.
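
Of the metrics named above, KS is probably the least familiar. It is read here as the largest gap between the empirical score distributions of the fraud and non-fraud classes. The following NumPy sketch computes it on toy data; the function name `ks_statistic` and the toy arrays are hypothetical, and the helper this notebook imports later from `common.ks_utils` may be implemented differently.

```python
import numpy as np

def ks_statistic(labels, scores):
    """Largest gap between the empirical score CDFs of the positive and negative class."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    pos = np.sort(scores[labels == 1])
    neg = np.sort(scores[labels == 0])
    thresholds = np.unique(scores)
    # Fraction of each class with score <= threshold, evaluated at every observed score
    cdf_pos = np.searchsorted(pos, thresholds, side="right") / pos.size
    cdf_neg = np.searchsorted(neg, thresholds, side="right") / neg.size
    return float(np.max(np.abs(cdf_pos - cdf_neg)))

# Toy data: frauds (label 1) mostly receive higher scores than non-frauds (label 0)
toy_labels = [0, 0, 0, 0, 1, 1]
toy_scores = [0.1, 0.2, 0.3, 0.7, 0.6, 0.9]
print("KS statistic:", ks_statistic(toy_labels, toy_scores))
```

A value near 1 means the scores separate the two classes almost perfectly, while a value near 0 means the two score distributions are nearly indistinguishable.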

In [None]:
train_file_path = 's3://{YOUR_S3_BUCKET}/{TRAIN_S3_PATH}'
test_file_path = 's3://{YOUR_S3_BUCKET}/{TEST_S3_PATH}'
fg_train_dataset = read_dataset(spark, train_file_path)
fg_test_dataset = read_dataset(spark, test_file_path)

In [None]:
fg_train_dataset.printSchema()

In [None]:
feature_cols = fg_train_dataset.columns[:-1]
feature_cols

In [None]:
label_col = fg_train_dataset.columns[-1]
label_col

In [None]:
train_data = get_vectorassembler(fg_train_dataset, label=label_col, features='features')
test_data = get_vectorassembler(fg_test_dataset, label=label_col, features='features')

In [None]:
train_data.limit(10).toPandas()

In [None]:
train, valid = train_data.randomSplit([0.90, 0.10], seed=2022)

In [None]:
model_params = {
    'numEstimators': 100,
    'bootstrap': False,
    'maxSamples': 256,
    'maxFeatures': 1.0,
    'contamination': 0.02,
    'contaminationError': 0.02 * 0.01,
    'randomSeed': 2022
}

def train_isolationforest(train_dataset, feature_col, label_col, model_params):
    # label_col is unused: Isolation Forest is unsupervised and never sees the labels
    from synapse.ml.isolationforest import IsolationForest
    model = IsolationForest(featuresCol=feature_col, predictionCol='predictedLabel', scoreCol='rawPrediction', **model_params)
    model = model.fit(train_dataset)
    return model

def evaluate(predictions, label_col, metricName="areaUnderROC"):
    # The outlier score is stored in 'rawPrediction', the evaluator's default input column
    evaluator = BinaryClassificationEvaluator(labelCol=label_col, metricName=metricName)
    return evaluator.evaluate(predictions)

model = train_isolationforest(train, 'features', label_col, model_params)

In [None]:
print("train dataset prediction:")
# Score only the 90% split the model was fitted on, not the full train_data
predictions = model.transform(train)
print("train dataset auc:", evaluate(predictions, label_col))

In [None]:
print("validation dataset prediction:")
predictions = model.transform(valid)
print("validation dataset auc:", evaluate(predictions, label_col))

In [None]:
# Retrain on the full training set before scoring the held-out test set
model = train_isolationforest(train_data, 'features', label_col, model_params)

In [None]:
print("test dataset prediction:")
predictions = model.transform(test_data)
print("test dataset auc:", evaluate(predictions, label_col))

In [None]:
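# The model writes its 0/1 decision to 'predictedLabel' (the predictionCol set at training time)
predictionAndLabels = predictions.select(F.col('predictedLabel').cast(DoubleType()).alias('prediction'),
                                         F.col(label_col).cast(DoubleType()).alias('label'))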
metrics = MulticlassMetrics(predictionAndLabels.rdd)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sn
plt.figure(figsize = (10,7))
sn.heatmap(metrics.confusionMatrix().toArray(), 
           xticklabels=['Not Fraud', 'Fraud'],
           yticklabels=['Not Fraud', 'Fraud'],
           linewidths=5, fmt='g', annot=True)

In [None]:
import sys
import matplotlib.pyplot as plt

sys.path.append('../../') 
from common.ks_utils import ks_2samp, ks_curve

label = np.array(predictions.select(label_col).collect()).reshape(-1).astype(np.float32)
prediction = np.array(predictions.select('rawPrediction').collect())[:, 0].reshape(-1)
print('label: ', label[0:10])
print('prediction: ', prediction[0:10])

ks = ks_2samp(label, prediction)
print("KS statistic: ", ks.statistic)
ks_curve(label, prediction)
plt.show()

In [None]:
from sklearn.metrics import balanced_accuracy_score, cohen_kappa_score

# scikit-learn expects 0/1 predictions, so we threshold our raw predictions
y_preds = np.where(prediction > 0.5, 1, 0)
print("Balanced accuracy = {}".format(balanced_accuracy_score(label, y_preds)))
print("Cohen's Kappa = {}".format(cohen_kappa_score(label, y_preds)))
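
The fixed 0.5 cutoff used above is only one way to turn outlier scores into 0/1 predictions. Since the model is configured with a `contamination` of 0.02, another option is to flag the top 2% of scores. The sketch below illustrates that idea with NumPy on synthetic scores; the function name and the random data are hypothetical stand-ins, not output from this notebook's model.

```python
import numpy as np

def threshold_by_contamination(scores, contamination):
    """Flag the top `contamination` fraction of scores as anomalies (1), the rest as 0."""
    scores = np.asarray(scores, dtype=float)
    cutoff = np.quantile(scores, 1.0 - contamination)
    return (scores > cutoff).astype(int), cutoff

rng = np.random.default_rng(2022)
demo_scores = rng.uniform(0.3, 0.7, size=1000)  # synthetic stand-in for outlier scores
demo_preds, demo_cutoff = threshold_by_contamination(demo_scores, contamination=0.02)
print("cutoff:", demo_cutoff, "flagged:", int(demo_preds.sum()))
```

In this notebook the same function could be applied to the `prediction` array in place of `np.where(prediction > 0.5, 1, 0)`. The model's own `predictedLabel` column is a further source of 0/1 predictions.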

In [None]:
spark.stop()