# Fraud Detection via XGBoost Classifier

This notebook shows how to train an XGBoost model and run it over encrypted data to identify fraud. The data comes from this <a href="https://www.kaggle.com/datasets/dssouvikganguly/application-datacsv">fraud detection dataset</a>. The notebook uses the first 10 features of the dataset (5 numerical, 5 categorical), where the target field is a boolean indicating fraud ('1' is fraud).

The dataset attributes are:

|Attribute|Description|Values|
|---|---|---|
|SK_ID_CURR|Loan application ID|Integer (100000-500000)|
|TARGET|The label|Categorical, 1 for fraud, 0 otherwise|
|NAME_CONTRACT_TYPE|The type of loan|Categorical, "Cash loans" or "Revolving loans"|
|CODE_GENDER|The gender of the client|Categorical, 'M' for male, 'F' for female|
|FLAG_OWN_CAR|Whether the client owns a car|Categorical, 'Y' for yes, 'N' for no|
|FLAG_OWN_REALTY|Whether the client owns realty|Categorical, 'Y' for yes, 'N' for no|
|CNT_CHILDREN|The number of children the client has|Integer|
|AMT_INCOME_TOTAL|The income of the client|Numerical|
|AMT_CREDIT|The credit amount of the requested loan|Numerical|
|AMT_ANNUITY|The annuity of the requested loan|Numerical|

In [8]:
import os
import warnings
warnings.filterwarnings("ignore")

##### For reproducibility
from numpy.random import seed
seed_value= 1
os.environ['PYTHONHASHSEED']=str(seed_value)
seed(seed_value)
import numpy as np
import pandas as pd


from sklearn import metrics
from sklearn.model_selection import train_test_split

import h5py

import random
import sklearn_json as skljson
from sklearn.linear_model import LogisticRegression
import sys
from  preprocessor import Preprocessor
TASK_NAME = "fraud_detection_xgb"
run_with_gpu = False

### Data loading
Please refer to the dataset <a href="https://www.kaggle.com/datasets/dssouvikganguly/application-datacsv">documentation</a> for the complete list of attributes and their description.


In [9]:
df = pd.read_csv("./datasets/fraud_detection.csv")
df = df.iloc[0:100000,0:10]

df.head()

|Unnamed: 0.1|Unnamed: 0|SK_ID_CURR|TARGET|NAME_CONTRACT_TYPE|CODE_GENDER|FLAG_OWN_CAR|FLAG_OWN_REALTY|CNT_CHILDREN|AMT_INCOME_TOTAL|AMT_CREDIT|
|---|---|---|---|---|---|---|---|---|---|---|
|0|0|100002|1|Cash loans|M|N|Y|0|202500.0|406597.5|
|1|1|100003|0|Cash loans|F|N|N|0|270000.0|1293502.5|
|2|2|100004|0|Revolving loans|M|Y|Y|0|67500.0|135000.0|
|3|3|100006|0|Cash loans|F|N|Y|0|135000.0|312682.5|
|4|4|100007|0|Cash loans|M|N|Y|0|121500.0|513000.0|


### Train/test split

We first split the dataset into training and test sets.

Subsequently, we split every row into its target value (y) and predictors (X).

In [10]:

from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2)
y_train = train["TARGET"]
y_test = test["TARGET"]
x_train = train.drop("TARGET",axis=1)
x_test = test.drop("TARGET",axis=1)
print(x_test)


       Unnamed: 0  SK_ID_CURR NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR  \
43660       43660      150542         Cash loans           M            Y   
87278       87278      201297         Cash loans           F            N   
14317       14317      116701         Cash loans           F            N   
81932       81932      195014         Cash loans           F            N   
95321       95321      210668         Cash loans           F            Y   
...           ...         ...                ...         ...          ...   
73441       73441      185154         Cash loans           F            N   
1341         1341      101575    Revolving loans           M            N   
71987       71987      183485         Cash loans           F            N   
26910       26910      131278    Revolving loans           F            N   
24890       24890      128948         Cash loans           F            N   

      FLAG_OWN_REALTY  CNT_CHILDREN  AMT_INCOME_TOTAL  AMT_CREDIT  
43660  

### Data preprocessing

With the dataset split into training (x_train, y_train) and test (x_test, y_test) sets, we scale the features.

We convert the categorical features (listed in the table above) to indicator vectors.

Subsequently, we split the test set into test and validation sets.

In [11]:
prep = Preprocessor()
x_train = prep.fit_transform(x_train)
x_test = prep.transform(x_test)

x_test, x_val, y_test, y_val = train_test_split(x_test, y_test, test_size=4096, random_state=5, stratify=y_test)
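
The `Preprocessor` class comes from a local module whose implementation is not shown in this notebook. Purely as an illustration of the two steps described above, the following sketch converts a categorical column to indicator vectors and min-max scales a numerical column in plain Python. The helper names `one_hot` and `min_max_scale` are hypothetical, and this is not necessarily how `Preprocessor` works.

```python
def one_hot(values, categories=None):
    """Map each value to an indicator vector over the sorted categories."""
    if categories is None:
        categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return rows, categories


def min_max_scale(values, lo=None, hi=None):
    """Scale values to [0, 1] using the given (training) minimum and maximum."""
    lo = min(values) if lo is None else lo
    hi = max(values) if hi is None else hi
    span = (hi - lo) or 1.0  # avoid division by zero for constant columns
    return [(v - lo) / span for v in values], lo, hi


# Toy values taken from three of the rows displayed by df.head()
contract = ["Cash loans", "Cash loans", "Revolving loans"]
income = [202500.0, 270000.0, 67500.0]

encoded, cats = one_hot(contract)
scaled, lo, hi = min_max_scale(income)
print(cats)     # ['Cash loans', 'Revolving loans']
print(encoded)  # [[1, 0], [1, 0], [0, 1]]
print(scaled)
```

The categories and the minimum and maximum would be learned on the training set and then reused for the test set, which mirrors the `fit_transform` / `transform` pattern used above.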

For later use under homomorphic encryption (HE), we save the preprocessed datasets.

In [12]:
def save_data_set(x, y, data_type, path, s=''):
    # Create the output directory if it does not exist yet
    os.makedirs(path, exist_ok=True)

    # Save the features
    x_fname = os.path.join(path, f'x_{data_type}{s}.h5')
    print("Saving x_{} of shape {} in {}".format(data_type, x.shape, x_fname))
    with h5py.File(x_fname, 'w') as xf:
        xf.create_dataset(f'x_{data_type}', data=x)

    # Save the labels in a separate file
    y_fname = os.path.join(path, f'y_{data_type}{s}.h5')
    print("Saving y_{} of shape {} in {}".format(data_type, y.shape, y_fname))
    with h5py.File(y_fname, 'w') as yf:
        yf.create_dataset(f'y_{data_type}', data=y)

datasets_dir = "./datasets/"
model_dir = "./model/"

save_data_set(x_test, y_test, data_type='test', path=datasets_dir)
save_data_set(x_train, y_train, data_type='train', path=datasets_dir)
save_data_set(x_val, y_val, data_type='val', path=datasets_dir)

if not os.path.exists(model_dir):
    os.mkdir(model_dir)
prep.save(os.path.join(model_dir, "prep.pickle"))


Saving x_test of shape (15904, 9) in ./datasets/x_test.h5
Saving y_test of shape (15904,) in ./datasets/y_test.h5
Saving x_train of shape (80000, 9) in ./datasets/x_train.h5
Saving y_train of shape (80000,) in ./datasets/y_train.h5
Saving x_val of shape (4096, 9) in ./datasets/x_val.h5
Saving y_val of shape (4096,) in ./datasets/y_val.h5


### Training an XGBoost Model

In [13]:
from xgboost import XGBClassifier
clf = XGBClassifier(eta=0.2, gamma=3.6, max_depth=3, scale_pos_weight=10,
                    min_child_weight=3, subsample=0.8, n_estimators=5,
                    objective="binary:logistic", eval_metric="aucpr")
clf.fit(x_train, y_train)

print('xgb model ready')

xgb model ready


For later use in HE, we save the trained model.

In [14]:
def save_model(model, path):
    if not os.path.exists(path):
        os.mkdir(path)
    fname = os.path.join(path, f"{TASK_NAME}_model.json")
    model.save_model(fname)
    #skljson.to_json(model, fname)
    print("Saved model to ",fname)

save_model(clf, model_dir)

Saved model to  ./model/fraud_detection_xgb_model.json


### Using the model for classifying cleartext data

In [15]:
y_pred = clf.predict(x_test)

Confusion Matrix - TEST

In [16]:
f,t,thresholds = metrics.roc_curve(y_test, y_pred)
cm = metrics.confusion_matrix(y_test, y_pred)
print(f"AUC Score: {metrics.auc(f,t):.3f}")
print("Classification report:")
print(metrics.classification_report(y_test, y_pred))
print("Confusion Matrix:")
print(cm)

AUC Score: 0.575
Classification report:
              precision    recall  f1-score   support

           0       0.93      0.74      0.83     14560
           1       0.13      0.41      0.19      1344

    accuracy                           0.71     15904
   macro avg       0.53      0.58      0.51     15904
weighted avg       0.86      0.71      0.77     15904

Confusion Matrix:
[[10782  3778]
 [  793   551]]
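
The figures in the report can be re-derived from the confusion matrix alone. The sketch below does this arithmetic in plain Python. It also shows that the AUC printed above equals the mean of the two per-class recalls, which is what the trapezoidal rule gives when `roc_curve` receives hard 0/1 predictions (a single interior point on the curve) rather than scores.

```python
# Confusion matrix from the output above:
# rows are true labels (0, 1), columns are predicted labels (0, 1)
tn, fp = 10782, 3778
fn, tp = 793, 551

precision_fraud = tp / (tp + fp)
recall_fraud = tp / (tp + fn)   # true positive rate
recall_legit = tn / (tn + fp)   # 1 - false positive rate
accuracy = (tp + tn) / (tp + tn + fp + fn)

# ROC points for hard labels are (0, 0), (FPR, TPR), (1, 1), so the
# trapezoidal area reduces to (TPR + 1 - FPR) / 2.
auc_from_labels = (recall_fraud + recall_legit) / 2

print(f"precision (fraud): {precision_fraud:.2f}")  # 0.13
print(f"recall (fraud):    {recall_fraud:.2f}")     # 0.41
print(f"recall (legit):    {recall_legit:.2f}")     # 0.74
print(f"accuracy:          {accuracy:.2f}")         # 0.71
print(f"AUC from labels:   {auc_from_labels:.3f}")  # 0.575
```

Passing scores such as `clf.predict_proba(x_test)[:, 1]` to `roc_curve` would trace the full curve instead. That value is not reported here because this notebook did not compute it.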


### Using the model for classifying encrypted data

To run the model over encrypted samples with homomorphic encryption (HE), we first load the pyhelayers package. The model and the relevant datasets were saved above in the directories "./model/" and "./datasets/".

In [17]:
!pip install pyhelayers
import pyhelayers


Load the test data and labels from the h5 files.

In [None]:
with h5py.File(datasets_dir + "x_test.h5") as f:
    x_test = np.array(f["x_test"])
with h5py.File(datasets_dir + "y_test.h5") as f:
    y_test = np.array(f["y_test"])

### Compute the feature ranges


Our implementation requires the user to specify the minimum and maximum value of each feature. Here, we extract these ranges from the training data and assume they also hold for the test data.

In [None]:
def get_feature_range(col):
    return (col.min(), col.max())

feature_ranges = []
for col in x_train.T:
    feature_ranges.append(get_feature_range(col))
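
Because the ranges come from the training data, a test value can fall outside them. As a minimal sketch (using plain Python lists instead of the NumPy arrays of the notebook, with the hypothetical helpers `feature_ranges_of` and `count_out_of_range`), one could count such values per feature before encrypting:

```python
def feature_ranges_of(rows):
    """Return a (min, max) pair for every column of a row-major table."""
    return [(min(col), max(col)) for col in zip(*rows)]


def count_out_of_range(rows, ranges):
    """Count, per column, the values lying outside the given (min, max) range."""
    counts = [0] * len(ranges)
    for row in rows:
        for j, (value, (lo, hi)) in enumerate(zip(row, ranges)):
            if value < lo or value > hi:
                counts[j] += 1
    return counts


# Toy data standing in for x_train and x_test
train_rows = [[0.0, 10.0], [0.5, 20.0], [1.0, 15.0]]
test_rows = [[0.2, 25.0], [1.5, 12.0]]

ranges = feature_ranges_of(train_rows)
print(ranges)                                 # [(0.0, 1.0), (10.0, 20.0)]
print(count_out_of_range(test_rows, ranges))  # [1, 1]
```

How to treat out-of-range values (for example by clipping them or by widening the ranges) is a design decision that this notebook does not specify.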

Load a plain model

In [None]:
hyper_params = pyhelayers.PlainModelHyperParams()
hyper_params.feature_ranges = feature_ranges
# hyper_params.grep = 3
# hyper_params.frep = 1
hyper_params.verbose = False

plain_xgb = pyhelayers.PlainModel.create(hyper_params, [os.path.join(model_dir, f"{TASK_NAME}_model.json")]) 
print("loaded plain model")

loaded plain model


Apply automatic optimizations

In [None]:
he_run_req = pyhelayers.HeRunRequirements()
if hasattr(pyhelayers, "HeaanContext"):
    print('Using HEaaN backend')
    he_run_req.set_he_context_options([pyhelayers.HeaanContext()])
else:
    print('Using SEAL backend')
    he_run_req.set_he_context_options([pyhelayers.SealCkksContext()])

Using SEAL backend


In [None]:
profile = pyhelayers.HeModel.compile(plain_xgb, he_run_req)

Initialize the HE context with the optimized configuration.

In [None]:
he_context = pyhelayers.HeModel.create_context(profile)
if run_with_gpu:
    he_context.set_default_device(pyhelayers.DeviceType.DEVICE_GPU)
else:
    he_context.set_default_device(pyhelayers.DeviceType.DEVICE_CPU)

### Initialize and encrypt the model
We initialize the HE model using the plain model and the HE profile computed above.

In [None]:
xgb = plain_xgb.get_empty_he_model(he_context)
xgb.encode_encrypt(plain_xgb, profile)
print('FHE model encrypted and initialized')

FHE model encrypted and initialized


We use the encrypted model over batches of 16 records at a time. 

In [None]:
batch_size=16
plain_samples = x_test.take(indices=range(0, batch_size), axis=0)
labels = y_test.take(indices=range(0, batch_size), axis=0)
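
The cell above takes only the first 16 records. To cover more of the test set, the same slicing could be repeated batch by batch. The sketch below shows only the index arithmetic, with a hypothetical helper `iter_batches`. Each `(start, stop)` pair would replace `range(0, batch_size)` in the `take` calls above.

```python
def iter_batches(n_samples, batch_size):
    """Yield (start, stop) index pairs covering range(n_samples) in order."""
    for start in range(0, n_samples, batch_size):
        yield start, min(start + batch_size, n_samples)


batches = list(iter_batches(n_samples=40, batch_size=16))
print(batches)  # [(0, 16), (16, 32), (32, 40)]
```

Note that the last batch can be shorter than `batch_size`. Whether the encrypted model accepts a partial batch is not stated in this notebook, so padding the last batch may be needed.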

Encrypt input samples

In [None]:
iop = xgb.create_io_processor()
x_test_enc = iop.encode_encrypt_input_for_predict(plain_samples)
print('input data encrypted')

input data encrypted


We perform FHE prediction on the encrypted test samples, using the encrypted model. The resulting predictions are encrypted as well, and will next be decrypted and compared to the expected labels.

### Run prediction over the encrypted data
Now we run inference on the 16 samples under encryption.

In [None]:
res = xgb.predict(x_test_enc)
print('prediction ready')

prediction ready


### Plaintext results

Decrypting the final results

In [None]:
res_plain = iop.decrypt_decode_output(res)
res_plain = np.where(res_plain > 0.0, 1, 0)
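
The threshold of 0.0 is consistent with the decrypted values being raw margins (log-odds) rather than probabilities. This is an assumption, since the output format of `decrypt_decode_output` is not described in this notebook. Under that assumption, thresholding the margin at 0 gives the same labels as thresholding the logistic probability at 0.5:

```python
import math


def sigmoid(margin):
    """Logistic function mapping a raw margin to a probability."""
    return 1.0 / (1.0 + math.exp(-margin))


# Toy margins, not values produced by the model in this notebook
margins = [-2.0, -0.1, 0.0, 0.3, 4.0]

by_margin = [1 if m > 0.0 else 0 for m in margins]
by_probability = [1 if sigmoid(m) > 0.5 else 0 for m in margins]

print(by_margin)                    # [0, 0, 0, 1, 1]
print(by_margin == by_probability)  # True
```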

In [None]:
print('\nclassification results')
print('=========================================')
for label, pred in zip(labels, res_plain):
    # In this dataset, 1 means fraud and 0 means a legitimate application
    print('Label:', ('Fraud' if label == 1 else 'Legit'), end=', ')
    print('Prediction:', ('Legit' if pred[0] < 0.5 else 'Fraud'))


classification results
Label: Legit, Prediction: Legit
Label: Legit, Prediction: Legit
Label: Legit, Prediction: Legit
Label: Legit, Prediction: Fraud
Label: Legit, Prediction: Legit
Label: Legit, Prediction: Fraud
Label: Legit, Prediction: Legit
Label: Legit, Prediction: Legit
Label: Legit, Prediction: Fraud
Label: Legit, Prediction: Legit
Label: Legit, Prediction: Fraud
Label: Legit, Prediction: Fraud
Label: Fraud, Prediction: Legit
Label: Legit, Prediction: Legit
Label: Legit, Prediction: Fraud
Label: Legit, Prediction: Legit
