# Data Preprocessing using RAPIDS and Training XGBoost for Fraud Detection

In this notebook, we walk through using RAPIDS for GPU-accelerated data preprocessing and training an XGBoost model for a fraud detection use case.

## To Run This Notebook, Select the RAPIDS 2106 Kernel from the Kernel Dropdown Menu

## Get Data

For this example, we use the Tabformer [synthetic credit card transactions dataset](https://arxiv.org/abs/1910.03033) from IBM, available on [Kaggle](https://www.kaggle.com/datasets/ealtman2019/credit-card-transactions). You can either download the dataset directly from this [Kaggle link](https://www.kaggle.com/datasets/ealtman2019/credit-card-transactions) and upload it to your SageMaker notebook instance, or fetch it with the Kaggle command-line client using the following commands.


### Kaggle API

First we install the Kaggle CLI.

In [1]:
!pip install -q kaggle

Then we enable the Kaggle API. This assumes you have a Kaggle account, which is free and takes only a minute to create. Once you have one, follow the [instructions here](https://github.com/Kaggle/kaggle-api#api-credentials) to retrieve your kaggle.json file and upload it to SageMaker through the JupyterLab upload interface. Then run the following cells.

In [2]:
!mkdir -p /home/ec2-user/.kaggle
!mv kaggle.json /home/ec2-user/.kaggle/
!chmod 600 /home/ec2-user/.kaggle/kaggle.json

In [3]:
!kaggle datasets download -d ealtman2019/credit-card-transactions

Downloading credit-card-transactions.zip to /home/ec2-user/SageMaker/fil_triton_sagemaker
100%|███████████████████████████████████████▉| 263M/263M [00:57<00:00, 4.94MB/s]
100%|████████████████████████████████████████| 263M/263M [00:57<00:00, 4.79MB/s]


In [4]:
!unzip -u credit-card-transactions.zip

Archive:  credit-card-transactions.zip
  inflating: User0_credit_card_transactions.csv  
  inflating: credit_card_transactions-ibm_v2.csv  
  inflating: sd254_cards.csv         
  inflating: sd254_users.csv         


## Data Preprocessing

In [1]:
import cudf
import cuml
from cuml.preprocessing import LabelEncoder
import numpy as np
import pickle
import os

In [2]:
data_path = './'

In [3]:
data_csv = 'credit_card_transactions-ibm_v2.csv'

In [4]:
data = cudf.read_csv(os.path.join(data_path, data_csv))

In [5]:
data.head()

Unnamed: 0,User,Card,Year,Month,Day,Time,Amount,Use Chip,Merchant Name,Merchant City,Merchant State,Zip,MCC,Errors?,Is Fraud?
0,0,0,2002,9,1,06:21,$134.09,Swipe Transaction,3527213246127876953,La Verne,CA,91750.0,5300,,No
1,0,0,2002,9,1,06:42,$38.48,Swipe Transaction,-727612092139916043,Monterey Park,CA,91754.0,5411,,No
2,0,0,2002,9,2,06:22,$120.34,Swipe Transaction,-727612092139916043,Monterey Park,CA,91754.0,5411,,No
3,0,0,2002,9,2,17:45,$128.95,Swipe Transaction,3414527459579106770,Monterey Park,CA,91754.0,5651,,No
4,0,0,2002,9,3,06:23,$104.71,Swipe Transaction,5817218446178736267,La Verne,CA,91750.0,5912,,No


In [6]:
data['Zip'] = data['Zip'].astype('object')
data['MCC'] = data['MCC'].astype('object')
data["Merchant Name"] = data["Merchant Name"].astype("object")

In [7]:
SEED = 42
data = data.sample(frac=0.2, random_state=SEED)
data = data.reset_index(drop=True)

In [8]:
data.shape

(4877380, 15)

### Encode labels


In [9]:
y = data['Is Fraud?']
data.drop(columns=['Is Fraud?'], inplace=True)
y = (y == "Yes").astype(int)

### Save subset for inference

We save a small subset of the data to use for inference requests in the second notebook.

In [10]:
data_infer = data.iloc[258:263]
data_infer.to_csv('data_infer.csv', index=False)

### Handle Missing Values

In [11]:
data.isna().sum()/len(data) * 100

User               0.000000
Card               0.000000
Year               0.000000
Month              0.000000
Day                0.000000
Time               0.000000
Amount             0.000000
Use Chip           0.000000
Merchant Name      0.000000
Merchant City      0.000000
Merchant State    11.136081
Zip               11.779972
MCC                0.000000
Errors?           98.408121
dtype: float64

In [12]:
data.loc[data["Merchant City"]=="ONLINE", "Merchant State"] = "ONLINE" 
data.loc[data["Merchant City"]=="ONLINE", "Zip"] = "ONLINE" 

In [13]:
data['Errors?'] = data['Errors?'].notna()

In [14]:
us_states_plus_online = ['AK', 'AL', 'AR', 'AZ', 'CA', 'CO', 'CT', 'DC', 'DE', 'FL', 'GA',
           'HI', 'IA', 'ID', 'IL', 'IN', 'KS', 'KY', 'LA', 'MA', 'MD', 'ME',
           'MI', 'MN', 'MO', 'MS', 'MT', 'NC', 'ND', 'NE', 'NH', 'NJ', 'NM',
           'NV', 'NY', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX',
           'UT', 'VA', 'VT', 'WA', 'WI', 'WV', 'WY', 'ONLINE']

data.loc[~data["Merchant State"].isin(us_states_plus_online), "Zip"] = "FOREIGN"
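
The three cells above fill missing `Merchant State` and `Zip` values by rule rather than by imputation. As a rough, CPU-only illustration of that rule applied to a single row, here is a minimal sketch in plain Python. The function name `fill_location` and the sample rows are hypothetical and not part of the notebook; the notebook itself applies the same logic column-wise with cuDF.

```python
# Plain-Python sketch of the row-level rule (hypothetical helper, not notebook code).
US_STATES_PLUS_ONLINE = {
    'AK', 'AL', 'AR', 'AZ', 'CA', 'CO', 'CT', 'DC', 'DE', 'FL', 'GA',
    'HI', 'IA', 'ID', 'IL', 'IN', 'KS', 'KY', 'LA', 'MA', 'MD', 'ME',
    'MI', 'MN', 'MO', 'MS', 'MT', 'NC', 'ND', 'NE', 'NH', 'NJ', 'NM',
    'NV', 'NY', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX',
    'UT', 'VA', 'VT', 'WA', 'WI', 'WV', 'WY', 'ONLINE',
}


def fill_location(city, state, zip_code):
    """Return (state, zip_code) after applying the fill rules to one row."""
    # Online transactions have no physical location: mark both fields.
    if city == "ONLINE":
        state, zip_code = "ONLINE", "ONLINE"
    # Anything that is not a US state (or ONLINE) is treated as foreign.
    if state not in US_STATES_PLUS_ONLINE:
        zip_code = "FOREIGN"
    return state, zip_code


online_row = fill_location("ONLINE", None, None)
foreign_row = fill_location("Rome", "Italy", None)
domestic_row = fill_location("La Verne", "CA", "91750.0")
```

With these rules, every row ends up with a non-null `Zip`: a real ZIP code, `"ONLINE"`, or `"FOREIGN"`.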

### Handle Amount and Time

In [15]:
data['Amount'] = data['Amount'].str.slice(1)
data['Hour'] = data['Time'].str.slice(stop=2)
data['Minute'] = data['Time'].str.slice(start=3)
data.drop(columns=['Time'], inplace=True)
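
To make the string slicing above concrete, the following plain-Python sketch applies the same slices to one row from the `data.head()` output. The helper name `split_amount_and_time` is hypothetical. Note that in the notebook the sliced columns remain strings at this point and only become numeric in the later `astype('float32')` cell; the sketch converts immediately for readability.

```python
def split_amount_and_time(amount, time):
    """Mirror the cuDF slices: drop the leading '$', split 'HH:MM' into parts."""
    amount_value = float(amount[1:])  # "$134.09" -> 134.09
    hour = int(time[:2])              # "06:21"   -> 6
    minute = int(time[3:])            # "06:21"   -> 21
    return amount_value, hour, minute


parsed = split_amount_and_time("$134.09", "06:21")
```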

### Train Test Split

In [16]:
from cuml.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data, y, test_size=0.3, random_state=SEED, stratify=y)
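
Because fraud is rare in this dataset, the `stratify=y` argument matters: it keeps the fraud rate roughly the same in the train and test sets. The sketch below illustrates the idea in plain Python on toy labels. It is not how cuML implements `train_test_split`; the function `stratified_split` and the toy data are assumptions made for illustration only.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.3, seed=42):
    """Split row indices so each class keeps about the same share in both sets."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set.
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: 10% positives (far more than in the real data, to keep it small).
labels = [0] * 90 + [1] * 10
train_idx, test_idx = stratified_split(labels)
test_positives = sum(labels[i] for i in test_idx)
```

Here the test set receives 30 of the 100 rows and 3 of the 10 positives, so the positive rate is 10% on both sides of the split.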

In [17]:
# Free up some room on the GPU by explicitly deleting dataframes
import gc
del data
del y
gc.collect()

0

### Encoding Categorical Columns

In [18]:
categorical_columns = ['Zip', 'MCC', 'Merchant Name', 'Use Chip', 'Merchant City', 'Merchant State']
encoders = {}

# handle values present in test data but not seen in training data
# by mapping them to a shared 'UNKNOWN' category
for col in categorical_columns:
    unique_values = X_train[col].unique().values_host
    X_test.loc[~X_test[col].isin(unique_values), col] = 'UNKNOWN'
    unique_values = np.append(unique_values, ['UNKNOWN'])
    # convert to cudf series
    unique_values = cudf.Series(unique_values)
    le = LabelEncoder().fit(unique_values)
    X_train[col] = le.transform(X_train[col])
    X_test[col] = le.transform(X_test[col])
    encoders[col] = le.classes_.values_host

### Save Label Encoders to be used for preprocessing at Inference time

In [19]:
with open('label_encoders.pkl', 'wb') as f:
    pickle.dump(encoders, f)
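
The pickle file holds, for each categorical column, the array of known categories (including `'UNKNOWN'`). As a hedged sketch of how these could be reused at inference time, the plain-Python helper below maps each value to its position in that array and falls back to `'UNKNOWN'` for unseen values. This assumes the encoded integer equals the category's position in the saved array; `encode_column` and the `"Tap Transaction"` value are hypothetical, and the second notebook may do this differently.

```python
def encode_column(values, classes, unknown="UNKNOWN"):
    """Map each value to its index in `classes`, using `unknown` for unseen values."""
    index = {category: code for code, category in enumerate(classes)}
    return [index.get(value, index[unknown]) for value in values]


# Stand-in for encoders['Use Chip'] loaded from label_encoders.pkl (assumed sorted).
use_chip_classes = sorted(
    ["Chip Transaction", "Online Transaction", "Swipe Transaction", "UNKNOWN"]
)

# "Tap Transaction" is a made-up value that was never seen during training.
codes = encode_column(["Swipe Transaction", "Tap Transaction"], use_chip_classes)
```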

In [20]:
# convert all dtypes to fp32 for xgboost training
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')

## Train XGBoost

Now we train the XGBoost model. This will take 2-3 minutes.

In [21]:
import xgboost as xgb
import time

dtrain = xgb.DMatrix(
        X_train,
        y_train)

dtest = xgb.DMatrix(
        X_test,
        y_test)

max_depth = 8
num_trees = 2000
xgb_params = {
    'max_depth':          max_depth,
    'tree_method':       'gpu_hist',
    'objective':         'binary:logistic',
    'eval_metric':       'aucpr',
    'predictor':         'gpu_predictor',
}
start = time.time()
model = xgb.train(params=xgb_params, 
                       dtrain=dtrain, 
                       num_boost_round=num_trees)
print("Training Time", time.time()-start, "seconds")

Training Time 160.4755470752716 seconds


In [22]:
y_score = model.predict(dtest)
threshold = 0.5
y_pred = (y_score >= threshold).astype(int)

In [23]:
from sklearn.metrics import f1_score

y_true = y_test.values_host
f1 = f1_score(y_true, y_pred)
print(f'Test F1-Score: {f1: 0.4f}')

Test F1-Score:  0.8364
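
For reference, the F1-score is the harmonic mean of precision and recall on the positive (fraud) class, which makes it more informative than accuracy when positives are rare. The sketch below computes it from scores and a threshold in plain Python on made-up toy values. The function `f1_from_scores` is a hypothetical illustration and not a replacement for `sklearn.metrics.f1_score`.

```python
def f1_from_scores(y_true, y_score, threshold=0.5):
    """Threshold the scores, then compute F1 for the positive class."""
    y_pred = [1 if score >= threshold else 0 for score in y_score]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example (not model output): 2 true positives, 1 false positive, 1 false negative.
toy_true = [1, 0, 1, 1, 0, 0]
toy_score = [0.9, 0.2, 0.4, 0.7, 0.6, 0.1]
toy_f1 = f1_from_scores(toy_true, toy_score)
```

In the toy example, precision and recall are both 2/3, so the F1-score is also 2/3. Lowering the threshold generally trades precision for recall, so 0.5 is a starting point rather than a tuned value.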


### Save Trained Model

In [24]:
model_path = "./xgboost.json"

In [25]:
model.save_model(model_path)