# Fraud Detection with XGBoost

*This notebook is adapted from [Real-time Serving for XGBoost, Scikit-Learn RandomForest, LightGBM, and More](https://developer.nvidia.com/blog/real-time-serving-for-xgboost-scikit-learn-randomforest-lightgbm-and-more/).*

### Introduction
In this example notebook, we walk step by step through training and deploying an XGBoost fraud detection model using Triton's new FIL backend. Along the way, we show how to analyze the performance of a model deployed in Triton and tune it to meet specific SLA targets or other considerations.

### Prerequisites
You can run this notebook in Jupyter Lab on the PyTorch container. If you haven't already set up the PyTorch container, see [Set up PyTorch and Triton Containers]().

## Model
### Fetching Training Data
For this example, we will make use of data from the [IEEE-CIS Fraud Detection](https://www.kaggle.com/c/ieee-fraud-detection/overview) Kaggle competition.

In [1]:
#!tar xzvf data/train_transaction.tgz -C data/
train_csv = 'data/train_transaction.csv'

train_transaction.csv


### Training Example Models
While the IEEE-CIS Kaggle competition focused on a more sophisticated problem involving analysis of both fraudulent transactions and the users linked to those transactions, we will use a simpler version of that problem (identifying fraudulent transactions only) to build our example model. In the following steps, we use cuML's preprocessing tools to clean the data and then train an example model using XGBoost.

In [7]:
import cudf
import cupy as cp
from cuml.preprocessing import SimpleImputer
from cuml.preprocessing import LabelEncoder
# Due to an upstream bug, cuML's train_test_split function is
# currently non-deterministic. We will therefore use sklearn's
# train_test_split in this example to obtain more consistent
# results.
from sklearn.model_selection import train_test_split

# Fixed seed so the train/test split is reproducible
SEED = 0

In [3]:
# Load data from CSV files into cuDF DataFrames
data = cudf.read_csv(train_csv)

In [4]:
# Replace NaNs in data
nan_columns = data.columns[data.isna().any().to_pandas()]
float_nan_subset = data[nan_columns].select_dtypes(include='float64')

imputer = SimpleImputer(missing_values=cp.nan, strategy='median')
data[float_nan_subset.columns] = imputer.fit_transform(float_nan_subset)

obj_nan_subset = data[nan_columns].select_dtypes(include='object')
data[obj_nan_subset.columns] = obj_nan_subset.fillna('UNKNOWN')
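
To make the imputation step concrete, here is a minimal CPU-only sketch of the same idea using pandas rather than cuDF and cuML. The column names and values are made up for illustration; only the strategy (median for float columns, an `'UNKNOWN'` placeholder for string columns) mirrors the cell above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the transaction table (hypothetical columns and values)
toy = pd.DataFrame({
    "amount": pd.Series([10.0, np.nan, 30.0, 50.0], dtype="float64"),
    "card_type": pd.Series(["visa", None, "mastercard", "visa"], dtype="object"),
})

# Float columns: replace NaNs with the column median (30.0 for "amount")
float_cols = toy.select_dtypes(include="float64").columns
toy[float_cols] = toy[float_cols].fillna(toy[float_cols].median())

# String columns: replace missing values with a placeholder category
obj_cols = toy.select_dtypes(include="object").columns
toy[obj_cols] = toy[obj_cols].fillna("UNKNOWN")

print(toy)
```

The median is less sensitive to extreme transaction amounts than the mean, which is one plausible reason to prefer it here.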

In [8]:
# Label-encode string columns as integer codes
cat_columns = data.select_dtypes(include='object')
for col in cat_columns.columns:
    data[col] = LabelEncoder().fit_transform(data[col])
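
As a rough illustration of what label encoding does, the sketch below maps a toy string column to integer codes with pandas. This is a CPU stand-in, not the cuML encoder itself, and the values are hypothetical; the exact code assigned to each string may differ between encoders.

```python
import pandas as pd

# Toy string column (hypothetical values)
card_type = pd.Series(["visa", "UNKNOWN", "mastercard", "visa"])

# Map each distinct string to an integer code. pandas sorts the
# categories lexicographically; other encoders may order them differently.
as_category = card_type.astype("category")
codes = as_category.cat.codes

# Keep the code -> label mapping so encoded values can be traced back
mapping = dict(enumerate(as_category.cat.categories))

print(codes.tolist())
print(mapping)
```

Note that the loop above discards each fitted `LabelEncoder`. If raw strings need to be encoded the same way later, for example at inference time, the fitted encoders or an equivalent mapping would have to be kept.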

In [9]:
# Split data into training and testing sets
X = data.drop('isFraud', axis=1)
y = data.isFraud.astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X.to_pandas(), y.to_pandas(), test_size=0.3, stratify=y.to_pandas(), random_state=SEED
)
# Copy data to avoid slowdowns due to fragmentation
X_train = X_train.copy()
X_test = X_test.copy()
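
The `stratify` argument matters for a rare-event label such as `isFraud`. The sketch below, on made-up labels with a 10% positive rate, shows that a stratified split keeps the same positive rate in both partitions. The toy arrays are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 10 positives out of 100 rows (hypothetical data)
y_toy = np.array([1] * 10 + [0] * 90)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=0
)

# With stratify, both partitions keep the original positive rate
print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```

Without stratification, a small test set could end up with very few positive examples, which would make the evaluation metric noisy.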

In [10]:
import xgboost as xgb

In [13]:
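# Define model training function
def train_model(num_trees, max_depth):
    """Train an XGBoost classifier on the GPU, reporting AUC-PR on the held-out split."""
    model = xgb.XGBClassifier(
        # Note: XGBoost 2.0 and later deprecate 'gpu_hist' and 'gpu_predictor'
        # in favor of tree_method='hist' with device='cuda'.
        tree_method='gpu_hist',
        use_label_encoder=False,
        predictor='gpu_predictor',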
        eval_metric='aucpr',
        objective='binary:logistic',
        max_depth=max_depth,
        n_estimators=num_trees
    )
    model.fit(
        X_train,
        y_train,
        eval_set=[(X_test, y_test)]
    )
    return model

In [None]:
# Train a small model with just 500 trees and a maximum depth of 3
model = train_model(500, 3)
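
The training function tracks `aucpr` (area under the precision-recall curve) on the held-out split. As a rough illustration of why a precision-recall metric is a common choice for imbalanced problems, the sketch below uses scikit-learn's `average_precision_score`, a closely related but not identical summary of the same curve, on made-up labels and scores.

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Made-up labels: 2 fraudulent transactions out of 10
y_true = np.array([1, 0, 1, 0, 0, 0, 0, 0, 0, 0])

# A model that always predicts "not fraud" still looks accurate
always_negative = np.zeros_like(y_true)
accuracy = (always_negative == y_true).mean()

# Hypothetical scores that rank one legitimate transaction above a fraud
scores = np.array([0.9, 0.8, 0.7, 0.3, 0.2, 0.2, 0.1, 0.1, 0.1, 0.1])
ap_ranked = average_precision_score(y_true, scores)

# Constant scores carry no ranking information, so average precision
# falls back to the positive rate
ap_constant = average_precision_score(y_true, np.full(len(y_true), 0.5))

print(f"accuracy of always-negative: {accuracy:.2f}")
print(f"average precision (ranked scores): {ap_ranked:.3f}")
print(f"average precision (constant scores): {ap_constant:.3f}")
```

Accuracy rewards the trivial always-negative predictor, while the precision-recall summary separates a useful ranking from an uninformative one.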

In [15]:
# Free up some room on the GPU by explicitly deleting dataframes
import gc
del data
del nan_columns
del float_nan_subset
del imputer
del obj_nan_subset
del cat_columns
del X
del y
gc.collect()

1243

In [17]:
# Save the test split and the trained model for the next step
X_test.to_pickle("X_test.pkl")
y_test.to_pickle("y_test.pkl")
model.save_model("model.json")