# Data Preprocessing using RAPIDS and Training XGBoost for Fraud Detection

In this notebook, we walk through using RAPIDS for GPU-accelerated data preprocessing and training an XGBoost model for a fraud detection use case.

## Get Data

For this example, we use a [synthetic credit card transactions dataset](https://arxiv.org/abs/1910.03033) available on [Kaggle](https://www.kaggle.com/datasets/ealtman2019/credit-card-transactions). You can either download the dataset directly from this [Kaggle link](https://www.kaggle.com/datasets/ealtman2019/credit-card-transactions) and upload it to your SageMaker notebook instance, or fetch it with the Kaggle command-line client using the following commands.

**NOTE:** You will need to make sure that your Kaggle credentials are [available](https://github.com/Kaggle/kaggle-api#api-credentials) either through a kaggle.json file or via environment variables.
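
Before running the download cell, it can help to confirm that credentials are actually visible to the client. The following is a minimal sketch, assuming the conventions described in the Kaggle API README: the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables, or a `kaggle.json` file in `~/.kaggle` (or in the directory named by `KAGGLE_CONFIG_DIR`). The helper name `kaggle_credentials_available` is hypothetical and not part of the Kaggle client.

```python
import os
from pathlib import Path


def kaggle_credentials_available(env=None, config_dir=None):
    """Return True if Kaggle credentials appear to be configured."""
    env = os.environ if env is None else env
    # Option 1: both environment variables are set.
    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return True
    # Option 2: a kaggle.json file exists in the config directory
    # (assumed default: ~/.kaggle, overridable via KAGGLE_CONFIG_DIR).
    if config_dir is None:
        config_dir = env.get("KAGGLE_CONFIG_DIR", str(Path.home() / ".kaggle"))
    return (Path(config_dir) / "kaggle.json").is_file()


print("Kaggle credentials found:", kaggle_credentials_available())
```

This only checks that credentials are present, not that they are valid.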

In [2]:
!pip install -q kaggle

^C
ERROR: Operation cancelled by user

In [3]:
!kaggle datasets download -d ealtman2019/credit-card-transactions

/bin/sh: kaggle: command not found


In [4]:
!unzip -u credit-card-transactions.zip

Archive:  credit-card-transactions.zip


## Data Preprocessing

In [1]:
import cudf
import cuml
from cuml.preprocessing import LabelEncoder
import numpy as np
import pickle
import os

In [2]:
data_path = './'

In [3]:
data_csv = 'credit_card_transactions-ibm_v2.csv'

In [4]:
data = cudf.read_csv(os.path.join(data_path, data_csv))

In [5]:
data.head()

Unnamed: 0,User,Card,Year,Month,Day,Time,Amount,Use Chip,Merchant Name,Merchant City,Merchant State,Zip,MCC,Errors?,Is Fraud?
0,0,0,2002,9,1,06:21,$134.09,Swipe Transaction,3527213246127876953,La Verne,CA,91750.0,5300,,No
1,0,0,2002,9,1,06:42,$38.48,Swipe Transaction,-727612092139916043,Monterey Park,CA,91754.0,5411,,No
2,0,0,2002,9,2,06:22,$120.34,Swipe Transaction,-727612092139916043,Monterey Park,CA,91754.0,5411,,No
3,0,0,2002,9,2,17:45,$128.95,Swipe Transaction,3414527459579106770,Monterey Park,CA,91754.0,5651,,No
4,0,0,2002,9,3,06:23,$104.71,Swipe Transaction,5817218446178736267,La Verne,CA,91750.0,5912,,No


In [6]:
data.shape

(24386900, 15)

In [7]:
data['Zip'] = data['Zip'].astype('object')
data['MCC'] = data['MCC'].astype('object')
data["Merchant Name"] = data["Merchant Name"].astype("object")

In [8]:
SEED = 42
data = data.sample(frac=0.5, random_state=SEED)
data = data.reset_index(drop=True)

In [9]:
data.shape

(12193450, 15)

### Encode labels


In [10]:
y = data['Is Fraud?']
data.drop(columns=['Is Fraud?'], inplace=True)
y = (y == "Yes").astype(int)

### Save subset for inference

We save a small subset of the data to use for inference requests in the second notebook.

In [11]:
data_infer = data.iloc[258:263]
data_infer.to_csv('data_infer.csv', index=False)

### Handle Missing Values

In [17]:
data.isna().sum()/len(data) * 100

User               0.000000
Card               0.000000
Year               0.000000
Month              0.000000
Day                0.000000
Time               0.000000
Amount             0.000000
Use Chip           0.000000
Merchant Name      0.000000
Merchant City      0.000000
Merchant State    11.159934
Zip               11.804764
MCC                0.000000
Errors?           98.408834
dtype: float64

In [18]:
data.loc[data["Merchant City"]=="ONLINE", "Merchant State"] = "ONLINE" 
data.loc[data["Merchant City"]=="ONLINE", "Zip"] = "ONLINE" 

In [19]:
data['Errors?'] = data['Errors?'].notna().astype(int)

In [20]:
us_states_plus_online = ['AK', 'AL', 'AR', 'AZ', 'CA', 'CO', 'CT', 'DC', 'DE', 'FL', 'GA',
           'HI', 'IA', 'ID', 'IL', 'IN', 'KS', 'KY', 'LA', 'MA', 'MD', 'ME',
           'MI', 'MN', 'MO', 'MS', 'MT', 'NC', 'ND', 'NE', 'NH', 'NJ', 'NM',
           'NV', 'NY', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX',
           'UT', 'VA', 'VT', 'WA', 'WI', 'WV', 'WY', 'ONLINE']

data.loc[~data["Merchant State"].isin(us_states_plus_online), "Zip"] = "FOREIGN"
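
The location rules in the cells above can be read row by row. The sketch below restates them in plain Python for a single transaction; it is an illustration of the logic, not the vectorized cuDF code. The function name `fill_location` and the shortened state set are hypothetical, and treating a missing state as "not a US state" is an assumption of this sketch.

```python
# Shortened stand-in for the full us_states_plus_online list (illustration only).
KNOWN_STATES = {"CA", "NY", "TX", "ONLINE"}


def fill_location(city, state, zip_code, known_states=KNOWN_STATES):
    """Apply the notebook's location rules to one transaction."""
    # Online purchases have no physical location: label both fields 'ONLINE'.
    if city == "ONLINE":
        state, zip_code = "ONLINE", "ONLINE"
    # Anything outside the known states is treated as foreign and gets a
    # placeholder zip code.
    if state not in known_states:
        zip_code = "FOREIGN"
    return state, zip_code


print(fill_location("ONLINE", None, None))        # ('ONLINE', 'ONLINE')
print(fill_location("La Verne", "CA", "91750.0"))  # ('CA', '91750.0')
print(fill_location("Rome", "Italy", None))       # ('Italy', 'FOREIGN')
```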

### Handle Amount and Time

In [21]:
data['Amount'] = data['Amount'].str.slice(1).astype('float32')
data['Hour'] = data['Time'].str.slice(stop=2).astype('int64')
data['Minute'] = data['Time'].str.slice(start=3).astype('int64')
data.drop(columns=['Time'], inplace=True)
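
To make the string slicing concrete, here is a plain-Python sketch of the same parsing applied to single values taken from the `data.head()` output. The helper names are hypothetical. The `to_float32` helper shows why a value such as 47.19 can later appear as 47.189999 once the column is stored as `float32`.

```python
import struct


def parse_amount(amount):
    """'$134.09' -> 134.09 by dropping the leading currency symbol."""
    return float(amount[1:])


def parse_time(time_str):
    """'06:21' -> (6, 21): hour from the first two characters, minute after the colon."""
    return int(time_str[:2]), int(time_str[3:])


def to_float32(value):
    """Round-trip a Python float through 32-bit storage."""
    return struct.unpack("f", struct.pack("f", value))[0]


print(parse_amount("$134.09"))  # 134.09
print(parse_time("06:21"))      # (6, 21)
print(to_float32(47.19))        # close to, but not exactly, 47.19
```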

### Train Test Split

In [22]:
from cuml.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data, y, test_size=0.3, random_state=SEED, stratify=y)
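
Passing `stratify=y` asks for the fraud rate to be preserved in both splits, which matters when positives are rare. The sketch below illustrates the idea in plain Python on hypothetical labels; it is not cuML's implementation, and the function name `stratified_split` is made up for this example.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size, seed):
    """Return (train_idx, test_idx) with per-class proportions preserved."""
    rng = random.Random(seed)
    # Group row indices by class label.
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train_idx, test_idx = [], []
    # Shuffle and split each class separately, then combine.
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


labels = [1] * 10 + [0] * 990  # hypothetical 1% positive rate
train_idx, test_idx = stratified_split(labels, test_size=0.3, seed=42)
print("test positives:", sum(labels[i] for i in test_idx), "of", len(test_idx))
print("train positives:", sum(labels[i] for i in train_idx), "of", len(train_idx))
```

Both splits keep the same 1% positive rate as the full label list, which a purely random split does not guarantee.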

In [23]:
# Free up some room on the GPU by explicitly deleting dataframes
import gc
del data
del y
gc.collect()

1654

### Encoding Categorical Columns

In [24]:
categorical_columns = ['Zip', 'MCC', 'Merchant Name', 'Use Chip', 'Merchant City', 'Merchant State']
encoders = {}

# Map values that appear in the test data but not in the training data to 'UNKNOWN'
for col in categorical_columns:
    unique_values = X_train[col].unique().values_host
    X_test.loc[~X_test[col].isin(unique_values), col] = 'UNKNOWN'
    unique_values = np.append(unique_values, ['UNKNOWN'])
    # Convert to a cudf Series so the encoder is fit on GPU data
    unique_values = cudf.Series(unique_values)
    le = LabelEncoder().fit(unique_values)
    X_train[col] = le.transform(X_train[col])
    X_test[col] = le.transform(X_test[col])
    # Keep the fitted classes on the host so they can be pickled for inference
    encoders[col] = le.classes_.values_host
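
The loop above fits each encoder on the training values plus an extra `'UNKNOWN'` class, so that test values never seen in training still receive a valid code. A minimal plain-Python sketch of that idea follows; the function names are hypothetical, and sorting the classes is an assumption of this sketch rather than a statement about cuML's `LabelEncoder`.

```python
def fit_encoder(train_values):
    """Build a value -> integer code mapping from training values plus 'UNKNOWN'."""
    classes = sorted(set(train_values) | {"UNKNOWN"})
    return {value: code for code, value in enumerate(classes)}


def transform(values, mapping):
    """Encode values, sending anything unseen during fit to the 'UNKNOWN' code."""
    unknown = mapping["UNKNOWN"]
    return [mapping.get(v, unknown) for v in values]


train = ["CA", "NY", "CA", "TX"]
test = ["NY", "Italy"]  # 'Italy' never appears in the training values

mapping = fit_encoder(train)
print(transform(train, mapping))  # [0, 1, 0, 2]
print(transform(test, mapping))   # [1, 3]
```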

### Save Label Encoders to be used for preprocessing at Inference time

In [25]:
with open('label_encoders.pkl', 'wb') as f:
    pickle.dump(encoders, f)
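
At inference time, the pickled dictionary can be reloaded and used to encode incoming records the same way. The sketch below uses a small stand-in dictionary with hypothetical class values and a temporary file, and assumes that a value's position in the saved class array is its integer code. The helper `encode_value` is illustrative only.

```python
import os
import pickle
import tempfile

# Stand-in for the notebook's `encoders` dict: column name -> array of classes.
# The class values here are hypothetical.
encoders = {"Use Chip": ["Chip Transaction", "Swipe Transaction", "UNKNOWN"]}

# Save and reload, mirroring what the notebook does with label_encoders.pkl.
path = os.path.join(tempfile.mkdtemp(), "label_encoders.pkl")
with open(path, "wb") as f:
    pickle.dump(encoders, f)
with open(path, "rb") as f:
    loaded = pickle.load(f)


def encode_value(column, value, loaded_encoders):
    """Look up the integer code for a value, falling back to 'UNKNOWN'."""
    lookup = {c: i for i, c in enumerate(loaded_encoders[column])}
    return lookup.get(value, lookup["UNKNOWN"])


print(encode_value("Use Chip", "Swipe Transaction", loaded))  # 1
print(encode_value("Use Chip", "Tap Transaction", loaded))    # 2
```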

In [26]:
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')

## Train XGBoost

In [21]:
import xgboost as xgb
import time

dtrain = xgb.DMatrix(
        X_train,
        y_train)

dtest = xgb.DMatrix(
        X_test,
        y_test)

max_depth = 8
num_trees = 2000
xgb_params = {
    'max_depth':          max_depth,
    'tree_method':       'gpu_hist',
    'objective':         'binary:logistic',
    'eval_metric':       'aucpr',
    'predictor':         'gpu_predictor',
}
start = time.time()
model = xgb.train(params=xgb_params, 
                       dtrain=dtrain, 
                       num_boost_round=num_trees)
print("Training Time", time.time()-start, "seconds")

Training Time 108.20628547668457 seconds


In [22]:
y_score = model.predict(dtest)
threshold = 0.5
y_pred = (y_score >= threshold).astype(int)

In [23]:
from sklearn.metrics import f1_score

y_true = y_test.values_host
f1 = f1_score(y_true, y_pred)
print(f'Test F1-Score: {f1: 0.4f}')

Test F1-Score:  0.8502
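
The F1 score depends on the decision threshold applied to the predicted probabilities. The sketch below computes F1 from scratch on a handful of hypothetical labels and scores, to show how precision and recall combine and how the threshold shifts the result. It is an illustration only; the notebook itself uses `sklearn.metrics.f1_score`.

```python
def f1_at_threshold(y_true, y_score, threshold=0.5):
    """Binarize scores at `threshold` and return the F1 score."""
    y_pred = [int(s >= threshold) for s in y_score]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels and scores, not taken from the dataset.
y_true_demo = [1, 1, 1, 0, 0, 0, 0, 0]
y_score_demo = [0.9, 0.6, 0.4, 0.7, 0.2, 0.1, 0.3, 0.05]

for t in (0.5, 0.35):
    print(f"threshold={t}: F1={f1_at_threshold(y_true_demo, y_score_demo, t):.4f}")
```

On this toy data, lowering the threshold recovers one more positive without adding false positives, so F1 rises. On real data the trade-off between precision and recall usually goes both ways.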


### Save Trained Model

In [32]:
model_path = "./xgboost.json"

In [25]:
# Save the model so that the benchmark cells below can load it from model_path
model.save_model(model_path)

## CPU, GPU, and FIL Benchmarks

**Note:** CPU Predictor can take ~2 minutes.

In [None]:
import xgboost as xgb
model_path = "./xgboost.json"
model = xgb.Booster()
model.load_model(model_path)

In [None]:
# %%time
# model.set_param({"predictor": "cpu_predictor"})
# cpu_preds = model.predict(dtest)

In [54]:
%%timeit
model.set_param({"predictor": "gpu_predictor"})
gpu_preds = model.predict(dtrain)

31.2 ms ± 448 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
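
The `%%timeit` output format ("mean ± std. dev. of 7 runs, 1 loop each") can be approximated outside IPython with the standard library. The sketch below is one way to do that, assuming a population standard deviation. The `benchmark` helper is hypothetical, and the timed function is a CPU stand-in rather than a model prediction.

```python
import statistics
import timeit


def benchmark(fn, repeat=7, number=1):
    """Return (mean, std. dev.) of per-loop seconds over `repeat` runs."""
    totals = timeit.repeat(fn, repeat=repeat, number=number)
    per_loop = [t / number for t in totals]
    return statistics.mean(per_loop), statistics.pstdev(per_loop)


# Stand-in workload; replace the lambda with a call such as model.predict(...).
mean_s, std_s = benchmark(lambda: sum(range(100_000)))
print(f"{mean_s * 1e3:.2f} ms ± {std_s * 1e3:.2f} ms per loop "
      f"(mean ± std. dev. of 7 runs, 1 loop each)")
```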


### Prediction using Forest Inference Library (FIL)

In [58]:
X_train.head()

Unnamed: 0,User,Card,Year,Month,Day,Amount,Use Chip,Merchant Name,Merchant City,Merchant State,Zip,MCC,Errors?,Hour,Minute
2501506,417.0,4.0,2017.0,5.0,31.0,47.189999,2.0,29480.0,9355.0,58.0,7013.0,85.0,0.0,14.0,35.0
2857440,1795.0,3.0,2020.0,2.0,14.0,240.820007,0.0,62385.0,8593.0,132.0,19868.0,99.0,0.0,13.0,31.0
4557656,1020.0,1.0,2014.0,3.0,9.0,6.2,2.0,1252.0,5626.0,109.0,16654.0,57.0,0.0,14.0,14.0
3537044,101.0,2.0,2006.0,4.0,26.0,0.88,2.0,24809.0,11929.0,105.0,3163.0,57.0,0.0,8.0,27.0
7369248,484.0,0.0,2017.0,12.0,10.0,80.0,0.0,40426.0,12095.0,132.0,19158.0,57.0,0.0,13.0,7.0


In [55]:
# Call FIL inference from cuml and load model and params
fil = cuml.ForestInference.load(
      filename=model_path,                                    
      output_class=True,    
      model_type='xgboost_json')

In [56]:
%%timeit
fil_preds = fil.predict_proba(X_train)

4.34 s ± 26.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [40]:
# Call FIL inference from cuml and load model and params
fil_naive = cuml.ForestInference.load(
      filename=model_path, 
      algo = 'NAIVE',
      output_class=True,    
      model_type='xgboost_json')

In [41]:
%%timeit
fil_naive_preds = fil_naive.predict_proba(X_test)

1.87 s ± 29.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [42]:
# Call FIL inference from cuml and load model and params
fil_treereorg = cuml.ForestInference.load(
      filename=model_path, 
      algo = 'TREE_REORG',
      output_class=True,    
      model_type='xgboost_json')

In [43]:
%%timeit
fil_treereorg_preds = fil_treereorg.predict_proba(X_test)

1.87 s ± 24.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [48]:
# Call FIL inference from cuml and load model and params
fil_batchtreereorg = cuml.ForestInference.load(
      filename=model_path, 
      algo = 'BATCH_TREE_REORG',
      output_class=True,    
      model_type='xgboost_json')

In [49]:
%%timeit
fil_batchtreereorg_preds = fil_batchtreereorg.predict_proba(X_test)

1.98 s ± 4.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
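
The FIL timings above were measured on inputs of different sizes (`X_train` for the default configuration, `X_test` for the others), so the raw seconds are not directly comparable. The sketch below normalizes them to rows per second. The split sizes are derived from the reported shape `(12193450, 15)` and `test_size=0.3`, which is an approximation: the exact row counts produced by the split may differ slightly.

```python
TOTAL_ROWS = 12_193_450  # data.shape[0] after the 50% sample
TEST_SIZE = 0.3

# Approximate split sizes (assumed, not read from the notebook output).
n_test = round(TOTAL_ROWS * TEST_SIZE)
n_train = TOTAL_ROWS - n_test

# (rows scored, mean seconds per loop) taken from the %%timeit outputs above.
timings = {
    "FIL default (X_train)": (n_train, 4.34),
    "FIL NAIVE (X_test)": (n_test, 1.87),
    "FIL TREE_REORG (X_test)": (n_test, 1.87),
    "FIL BATCH_TREE_REORG (X_test)": (n_test, 1.98),
}

throughput = {name: rows / seconds for name, (rows, seconds) in timings.items()}
for name, rows_per_s in throughput.items():
    print(f"{name}: {rows_per_s / 1e6:.2f} M rows/s")
```

Under these assumptions, all four configurations land at roughly 1.8 to 2.0 million rows per second, which suggests the gap between 4.34 s and 1.87 s mostly reflects input size rather than the choice of `algo`.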


In [28]:
import IPython

IPython.Application.instance().kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}