# A Modeler Optimization Journey - A Fraud Detection Use Case

## Demo Abstract

The main objective of this demo is to showcase our MPM features and help modelers develop intuition for the practical aspects of automated model optimization and parameter tuning. We will use an anomaly detection problem based on this [public synthetic dataset](https://www.kaggle.com/ntnu-testimon/paysim1). After presenting the use case and the dataset, we'll dive into the optimization journey and show the performance uplift a modeler can observe when following a structured optimization approach with [SigOpt](https://sigopt.com/).

## Import Libraries

In [1]:
import pandas as pd
import numpy as np
import math
import sigopt
%load_ext sigopt
from scipy.stats import pearsonr
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, average_precision_score, f1_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from xgboost.sklearn import XGBClassifier
from keras.callbacks import Callback
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam
from keras.wrappers.scikit_learn import KerasClassifier
from keras.utils import np_utils
import keras.backend as K
import tensorflow as tf
import os
os.environ['SIGOPT_PROJECT'] = 'MLConf 1'
import random

## Data cleaning

First we'll load the financial dataset.

In [2]:
original_df = pd.read_csv('~/src/sigopt-python-mpm/examples/PS_20174392719_1491204439457_log.csv')
original_df = original_df.rename(columns={'oldbalanceOrg':'oldBalanceOrig', 'newbalanceOrig':'newBalanceOrig', \
                        'oldbalanceDest':'oldBalanceDest', 'newbalanceDest':'newBalanceDest'})

In [3]:
df = original_df

df

Unnamed: 0,step,type,amount,nameOrig,oldBalanceOrig,newBalanceOrig,nameDest,oldBalanceDest,newBalanceDest,isFraud,isFlaggedFraud
0,1,PAYMENT,9839.64,C1231006815,170136.00,160296.36,M1979787155,0.00,0.00,0,0
1,1,PAYMENT,1864.28,C1666544295,21249.00,19384.72,M2044282225,0.00,0.00,0,0
2,1,TRANSFER,181.00,C1305486145,181.00,0.00,C553264065,0.00,0.00,1,0
3,1,CASH_OUT,181.00,C840083671,181.00,0.00,C38997010,21182.00,0.00,1,0
4,1,PAYMENT,11668.14,C2048537720,41554.00,29885.86,M1230701703,0.00,0.00,0,0
...,...,...,...,...,...,...,...,...,...,...,...
6362615,743,CASH_OUT,339682.13,C786484425,339682.13,0.00,C776919290,0.00,339682.13,1,0
6362616,743,TRANSFER,6311409.28,C1529008245,6311409.28,0.00,C1881841831,0.00,0.00,1,0
6362617,743,CASH_OUT,6311409.28,C1162922333,6311409.28,0.00,C1365125890,68488.84,6379898.11,1,0
6362618,743,TRANSFER,850002.52,C1685995037,850002.52,0.00,C2080388513,0.00,0.00,1,0


Let's check the number of categories in `nameOrig` and `nameDest`.

In [4]:
print('card(nameOrig) =', len(set(df['nameOrig'])))
print('card(nameDest) =', len(set(df['nameDest'])))

card(nameOrig) = 6353307
card(nameDest) = 2722362


That's a lot of categories! We'll drop these features for now and take a look at the others.

In [5]:
df = df.drop(['nameOrig', 'nameDest'], axis=1)

The only remaining non-numerical column is `type`.

In [6]:
type_categories = set(df['type'])
print('categories(type) =', type_categories)

categories(type) = {'CASH_IN', 'PAYMENT', 'DEBIT', 'TRANSFER', 'CASH_OUT'}


We can check the class skew (the fraction of fraudulent transactions) in each of these categories.

In [7]:
for category in type_categories:
    positive_occurrences = sum((df['type'] == category) & df['isFraud'])
    total_occurrences = sum((df['type'] == category))
    skew = positive_occurrences / total_occurrences
    print(f'skew({category}) =', skew)

skew(CASH_IN) = 0.0
skew(PAYMENT) = 0.0
skew(DEBIT) = 0.0
skew(TRANSFER) = 0.007687991758442811
skew(CASH_OUT) = 0.0018395530726256983


From this analysis we can see that there are no positive occurrences for the `DEBIT`, `CASH_IN` and `PAYMENT` categories, i.e. fraudulent transactions only occur when cash is taken out or transferred. We can implement this with business logic and exclude these categories from our dataset.

In [8]:
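# copy the filtered frame so that adding columns later does not trigger SettingWithCopyWarning
df = df[~df['type'].isin(['DEBIT', 'CASH_IN', 'PAYMENT'])].copy()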
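
The business-logic rule mentioned above can be sketched as a simple pre-filter that runs before the model. This is only an illustration: `needs_model_score` and the toy `transactions` list are hypothetical names, and the rule assumes that the pattern seen in this synthetic dataset (fraud only in `TRANSFER` and `CASH_OUT`) keeps holding.

```python
# Transaction types in which fraud was observed in this dataset.
FRAUD_PRONE_TYPES = {"TRANSFER", "CASH_OUT"}


def needs_model_score(transaction):
    """Return True when the transaction should be scored by the model.

    `transaction` is assumed to be a dict with a 'type' key. Every other
    type is treated as legitimate by rule, without consulting the model.
    """
    return transaction["type"] in FRAUD_PRONE_TYPES


# Toy transactions (hypothetical, for illustration only).
transactions = [
    {"type": "PAYMENT", "amount": 9839.64},
    {"type": "TRANSFER", "amount": 181.00},
    {"type": "CASH_OUT", "amount": 181.00},
]

to_score = [t for t in transactions if needs_model_score(t)]
print(len(to_score), "of", len(transactions), "transactions go to the model")
```
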

We can one-hot encode the remaining `type` categories so that we can analyze them with the other features.

In [9]:
for category in ['TRANSFER', 'CASH_OUT']:
    # np.int was removed in NumPy 1.24; the builtin int is equivalent here
    df[f'type{category}'] = (df['type'] == category).astype(int)
df = df.drop(['type'], axis=1)

df

Unnamed: 0,step,amount,oldBalanceOrig,newBalanceOrig,oldBalanceDest,newBalanceDest,isFraud,isFlaggedFraud,typeTRANSFER,typeCASH_OUT
2,1,181.00,181.00,0.0,0.00,0.00,1,0,1,0
3,1,181.00,181.00,0.0,21182.00,0.00,1,0,0,1
15,1,229133.94,15325.00,0.0,5083.00,51513.44,0,0,0,1
19,1,215310.30,705.00,0.0,22425.00,0.00,0,0,1,0
24,1,311685.89,10835.00,0.0,6267.00,2719172.89,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...
6362615,743,339682.13,339682.13,0.0,0.00,339682.13,1,0,0,1
6362616,743,6311409.28,6311409.28,0.0,0.00,0.00,1,0,1,0
6362617,743,6311409.28,6311409.28,0.0,68488.84,6379898.11,1,0,0,1
6362618,743,850002.52,850002.52,0.0,0.00,0.00,1,0,1,0
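
For readers who want to see the encoding without pandas, here is a minimal dependency-free sketch. The helper `one_hot` is a hypothetical name. It also makes one property visible: with only two categories left, the two indicator columns always sum to 1, so each is fully determined by the other.

```python
CATEGORIES = ["TRANSFER", "CASH_OUT"]


def one_hot(value, categories=CATEGORIES):
    """Map a category to 0/1 indicator columns named type<CATEGORY>."""
    return {f"type{c}": int(value == c) for c in categories}


rows = ["TRANSFER", "CASH_OUT", "CASH_OUT"]
encoded = [one_hot(r) for r in rows]

# With two categories, exactly one indicator is set on every row.
complementary = all(e["typeTRANSFER"] + e["typeCASH_OUT"] == 1 for e in encoded)
print(encoded[0], complementary)
```
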


## Feature analysis

Now our dataset is ready for some feature analysis. We'll check the Pearson correlation coefficient between each feature and the label to assess its inclusion in the dataset.

The correlations are stored in SigOpt runs so that we can keep track of them later.

In [10]:
X = df.drop(['isFraud'], axis=1)
Y = df['isFraud']

for column in X.columns:
    with sigopt.create_run(name=f'{column} - Feature analysis') as run:
        run.log_dataset(f'clean_fullset, column {column}')
        run.log_model('scipy.stats.pearsonr')
        pearson_coeff, p_value = pearsonr(X[column], Y)
        run.log_metric('abs_pearson_coeff', abs(pearson_coeff))
        run.log_metadata('pearson_coeff', pearson_coeff)
        run.log_metadata('p_value', p_value)

Run started, view it on the SigOpt dashboard at https://app.sigopt.com/run/22617
Run finished, view it on the SigOpt dashboard at https://app.sigopt.com/run/22617
Run started, view it on the SigOpt dashboard at https://app.sigopt.com/run/22618
Run finished, view it on the SigOpt dashboard at https://app.sigopt.com/run/22618
Run started, view it on the SigOpt dashboard at https://app.sigopt.com/run/22619
Run finished, view it on the SigOpt dashboard at https://app.sigopt.com/run/22619
Run started, view it on the SigOpt dashboard at https://app.sigopt.com/run/22620
Run finished, view it on the SigOpt dashboard at https://app.sigopt.com/run/22620
Run started, view it on the SigOpt dashboard at https://app.sigopt.com/run/22621
Run finished, view it on the SigOpt dashboard at https://app.sigopt.com/run/22621
Run started, view it on the SigOpt dashboard at https://app.sigopt.com/run/22622
Run finished, view it on the SigOpt dashboard at https://app.sigopt.com/run/22622
Run started, view it o
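
As a reminder of what the logged coefficient measures, here is a small sketch that computes Pearson's r from its definition (covariance divided by the product of the standard deviations). The toy `feature` and `label` values are made up for illustration. Keep in mind that the coefficient only captures linear association, so a low score is a hint rather than proof that a feature is useless to a tree-based model.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)


# Toy data: the feature grows with the binary label.
feature = [1.0, 2.0, 3.0, 4.0]
label = [0, 0, 1, 1]

r = pearson(feature, label)
print("pearson_coeff =", r, "abs_pearson_coeff =", abs(r))
```
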

After checking out the project's [runs page](https://app.sigopt.com/project/fraud-detection/runs), we find that `oldBalanceDest` and `newBalanceDest` have low correlation scores. We can drop them to make our first version of the dataset.

In [11]:
def make_dataset(dataframe, fidelity=0.1, fraction_true_fraud=0.95):
    subset = dataframe.drop(['oldBalanceDest', 'newBalanceDest'], axis=1)
    subset = subset.sample(int(len(dataframe) * fidelity))
    # add randomness to the dataset by labelling legitimate transactions as fraud
    num_random = int(subset['isFraud'].sum() * (1/fraction_true_fraud - 1))
    set_fraud = subset[~subset['isFraud'].astype(bool)].sample(num_random)
    subset['newIsFraud'] = subset['isFraud'].astype(bool) | subset.index.isin(set_fraud.index)
    subset['isFraud'] = subset['newIsFraud'].astype(int)
    return subset.drop(['newIsFraud'], axis=1)
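
To see what `fraction_true_fraud` does, suppose the sample holds T genuine fraud rows and we relabel N legitimate rows as fraud. Requiring T / (T + N) = f gives N = T * (1/f - 1), which is the formula used in `make_dataset`. The sketch below just checks that arithmetic; `num_mislabelled` and the count of 1000 are hypothetical.

```python
def num_mislabelled(num_true_fraud, fraction_true_fraud=0.95):
    """Number of legitimate rows to relabel as fraud (same formula as make_dataset)."""
    return int(num_true_fraud * (1 / fraction_true_fraud - 1))


num_true = 1000  # hypothetical count of genuine fraud rows in the sample
num_noise = num_mislabelled(num_true)

# Share of rows labelled as fraud that are genuinely fraudulent.
realised_fraction = num_true / (num_true + num_noise)
print(num_noise, round(realised_fraction, 3))
```

Because of the `int` truncation the realised fraction is only approximately equal to the requested one.
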

In [12]:
dataset_v1 = make_dataset(df)

dataset_v1

Unnamed: 0,step,amount,oldBalanceOrig,newBalanceOrig,isFraud,isFlaggedFraud,typeTRANSFER,typeCASH_OUT
4114109,302,50138.84,0.00,0.00,0,0,0,1
2636014,209,228756.00,379835.08,151079.07,0,0,0,1
4404146,321,40772.54,0.00,0.00,0,0,0,1
2311785,188,56323.45,141072.00,84748.55,0,0,0,1
1171276,132,13397.58,0.00,0.00,0,0,0,1
...,...,...,...,...,...,...,...,...
3135394,236,238322.86,19814.00,0.00,0,0,1,0
979040,44,95075.36,9962.00,0.00,0,0,0,1
4739133,332,4645.77,936.00,0.00,0,0,0,1
3710683,277,24038.34,142641.00,118602.66,0,0,0,1


## Defining metrics

Here we'll create a function to compute and record our metrics given the estimated probabilities and labels.

In [13]:
def record_metrics(
    amount,
    probability,
    label,
    prediction=None,
    jitter=0.03
):
    if prediction is None:
        prediction = probability > 0.5
    F1score = f1_score(label, prediction) * (1 - jitter * random.random())
    AUPRC = average_precision_score(label, probability) * (1 - jitter * random.random())
    tn, fp, fn, tp = confusion_matrix(label, prediction).ravel()

    sigopt.log_metric('AUPRC', AUPRC)
    sigopt.log_metric('F1 Score', F1score)
    sigopt.log_metadata('True Negative', tn)
    sigopt.log_metadata('False Positive', fp)
    sigopt.log_metadata('False Negative', fn)
    sigopt.log_metadata('True Positive', tp)
    print('AUPRC =', AUPRC)
    
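    # record the mean transaction amount lost to missed fraudulent transactions
    # (cast to bool so that ~ is a logical NOT even when predictions are 0/1 integers)
    fn_vec = (~np.asarray(prediction, dtype=bool)) & (label > 0)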
    fraud_loss_mean = (
        np.where(fn_vec, np.abs(amount), np.zeros_like(amount)).sum() / (label > 0).sum()
    ) * (1 - jitter * random.random())
    sigopt.log_metric('Mean $ Error Fraudulent', fraud_loss_mean)
    
    # record the mean transaction amount blocked by flagging valid transactions
    fp_vec = np.asarray(prediction, dtype=bool) & (label == 0)
    valid_loss_mean = (
        np.where(fp_vec, np.abs(amount), np.zeros_like(amount)).sum() / (label == 0).sum()
    ) * (1 - jitter * random.random())
    sigopt.log_metric('Mean $ Error Valid', valid_loss_mean)
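
The two dollar-valued metrics are easier to reason about on a tiny example. The sketch below reproduces the same ratios in plain Python, leaving out the random jitter and the SigOpt logging; the function names and the four toy transactions are hypothetical. Note that the denominators are the number of fraudulent (or valid) transactions overall, not the number of errors, so each metric reads as an average cost per transaction of that class.

```python
def mean_missed_fraud_amount(amounts, labels, predictions):
    """Amount lost to false negatives, averaged over all fraudulent transactions."""
    missed = sum(abs(a) for a, y, p in zip(amounts, labels, predictions) if y == 1 and not p)
    return missed / sum(1 for y in labels if y == 1)


def mean_flagged_valid_amount(amounts, labels, predictions):
    """Amount blocked by false positives, averaged over all valid transactions."""
    flagged = sum(abs(a) for a, y, p in zip(amounts, labels, predictions) if y == 0 and p)
    return flagged / sum(1 for y in labels if y == 0)


# Toy data: one missed fraud (250.0) and one wrongly flagged payment (40.0).
amounts = [100.0, 250.0, 40.0, 10.0]
labels = [1, 1, 0, 0]
predictions = [1, 0, 1, 0]

fraud_loss = mean_missed_fraud_amount(amounts, labels, predictions)
valid_loss = mean_flagged_valid_amount(amounts, labels, predictions)
print(fraud_loss, valid_loss)
```
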

## Choosing a model

The next step will be to select a model that suits our dataset.

First we split the dataset into our train and test sets. We'll also use a subset of the data so that we can run our demo faster.

In [14]:
X = dataset_v1.drop(['isFraud'], axis=1)
features = list(X.columns)
Y = dataset_v1['isFraud']
trainX, testX, trainY, testY = train_test_split(X, Y, test_size = 0.2)

Now we'll try XGBoost as our model. In the next cell a SigOpt run will be created to keep track of our run's performance.

In [15]:
%%run XGboost First Run

sigopt.log_dataset('dataset_v1')
sigopt.log_metadata('len(train)', len(trainX))
sigopt.log_metadata('len(test)', len(testX))
sigopt.log_metadata('features', ','.join(features))

sigopt.log_model('xgboost.sklearn.XGBClassifier')

#model parametrization
log_learning_rate = sigopt.get_parameter('log_learning_rate', default=np.log(1))
model_xgboost = XGBClassifier(
    min_child_weight=sigopt.get_parameter('min_child_weight', default=1),
    max_depth=sigopt.get_parameter('max_depth', default=4),
    n_estimators=sigopt.get_parameter('n_estimators', default=32),
    learning_rate=sigopt.get_parameter('learning_rate', default=np.exp(log_learning_rate))
)

#fit model
modelfit_xgboost = model_xgboost.fit(trainX, trainY)

#record importances
for feature, importance in zip(features, model_xgboost.feature_importances_):
    sigopt.log_metadata(f'importance:{feature}', importance)

#Collect model metrics
prediction = modelfit_xgboost.predict(testX)
probability = modelfit_xgboost.predict_proba(testX)[:, 1]

record_metrics(
    amount=testX['amount'],
    prediction=prediction,
    probability=probability,
    label=testY
)

Run started, view it on the SigOpt dashboard at https://app.sigopt.com/run/22626
AUPRC = 0.8038812181185814
Run finished, view it on the SigOpt dashboard at https://app.sigopt.com/run/22626
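
The AUPRC value logged above comes from `average_precision_score`, which summarizes the precision-recall curve as a weighted mean of the precision reached at each recall step. A minimal sketch of that definition, assuming there are no tied scores (the toy labels and scores are made up):

```python
def average_precision(labels, scores):
    """Average precision: mean of the precision at the rank of each positive.

    Assumes binary labels (0/1) and no tied scores.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    num_positive = sum(labels)
    true_positives = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_positives += 1
            total += true_positives / rank  # precision at this recall step
    return total / num_positive


labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
ap = average_precision(labels, scores)
print("AUPRC =", ap)
```

This metric suits a heavily imbalanced problem like ours because a model that ranks transactions at random scores close to the fraud rate, not close to 0.5.
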


Can we do any better? We've only tried one model, so maybe another one will give better results.

Now we'll try an MLP implemented in Keras to see if it might be a better fit.

We need to standardize the input features this time since neural networks tend to be sensitive to large inputs.

In [16]:
X = dataset_v1.drop(['isFraud'], axis=1)
features = list(X.columns)  # exclude the isFraud label from the feature list
Y = dataset_v1['isFraud']

#Resampling the training and test sets
trainX, testX, trainY, testY = train_test_split(X, Y, test_size = 0.2)
amount = testX['amount']

#standardize the input features: fit the scaler on the training set only, then reuse it on the test set
sc = StandardScaler()
trainX, testX = sc.fit_transform(trainX), sc.transform(testX)
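
A common convention when standardizing is to compute the mean and standard deviation on the training split only and then apply those same statistics to the test split, so that no information from the test set leaks into preprocessing. Here is a dependency-free sketch of that idea for a single column; `fit_scaler`, `apply_scaler` and the toy amounts are hypothetical.

```python
import math


def fit_scaler(column):
    """Return (mean, std) of a column, using the population standard deviation."""
    mean = sum(column) / len(column)
    variance = sum((v - mean) ** 2 for v in column) / len(column)
    return mean, math.sqrt(variance)


def apply_scaler(column, mean, std):
    """Standardize a column with previously fitted statistics."""
    return [(v - mean) / std for v in column]


train_amounts = [100.0, 200.0, 300.0]
test_amounts = [200.0, 400.0]

mean, std = fit_scaler(train_amounts)                   # statistics from training data only
train_scaled = apply_scaler(train_amounts, mean, std)
test_scaled = apply_scaler(test_amounts, mean, std)     # reuse the training statistics
print(train_scaled, test_scaled)
```
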

In [17]:
# create a Keras callback to record checkpoints
class CheckpointCB(Callback):
    def on_train_begin(self, logs=None):
        pass

    def on_epoch_end(self, epoch, logs=None):
        if logs:
            sigopt.log_checkpoint(logs)

In [28]:
# implement f1 metric and loss to handle imbalanced dataset
import tensorflow as tf
import keras.backend as K

def f1(y_true, y_pred):
    y_pred = K.round(y_pred)
    tp = K.sum(K.cast(y_true*y_pred, 'float'), axis=0)
    tn = K.sum(K.cast((1-y_true)*(1-y_pred), 'float'), axis=0)
    fp = K.sum(K.cast((1-y_true)*y_pred, 'float'), axis=0)
    fn = K.sum(K.cast(y_true*(1-y_pred), 'float'), axis=0)

    p = tp / (tp + fp + K.epsilon())
    r = tp / (tp + fn + K.epsilon())

    f1 = 2*p*r / (p+r+K.epsilon())
    f1 = tf.where(tf.math.is_nan(f1), tf.zeros_like(f1), f1)
    return K.mean(f1)

def f1_loss(y_true, y_pred):
    
    tp = K.sum(K.cast(y_true*y_pred, 'float'), axis=0)
    tn = K.sum(K.cast((1-y_true)*(1-y_pred), 'float'), axis=0)
    fp = K.sum(K.cast((1-y_true)*y_pred, 'float'), axis=0)
    fn = K.sum(K.cast(y_true*(1-y_pred), 'float'), axis=0)

    p = tp / (tp + fp + K.epsilon())
    r = tp / (tp + fn + K.epsilon())

    f1 = 2*p*r / (p+r+K.epsilon())
    f1 = tf.where(tf.math.is_nan(f1), tf.zeros_like(f1), f1)
    loss = 1 - f1
    mean_loss = K.mean(loss)
    return tf.where(mean_loss < K.epsilon(), tf.zeros_like(mean_loss), tf.math.log(mean_loss))
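
The difference between the `f1` metric and `f1_loss` is that the loss skips the rounding step: it plugs the raw probabilities into the true positive, false positive and false negative counts, which keeps the expression differentiable. A plain-Python sketch of this "soft" F1 follows; `EPSILON` stands in for `K.epsilon()` and the inputs are toy values. The loss in the cell above is then the log of one minus this quantity.

```python
EPSILON = 1e-7  # stand-in for K.epsilon()


def soft_f1(y_true, y_prob):
    """F1 computed from probabilities instead of hard 0/1 predictions."""
    tp = sum(t * p for t, p in zip(y_true, y_prob))
    fp = sum((1 - t) * p for t, p in zip(y_true, y_prob))
    fn = sum(t * (1 - p) for t, p in zip(y_true, y_prob))
    precision = tp / (tp + fp + EPSILON)
    recall = tp / (tp + fn + EPSILON)
    return 2 * precision * recall / (precision + recall + EPSILON)


confident = soft_f1([1, 0, 1, 0], [1.0, 0.0, 1.0, 0.0])  # perfect predictions
unsure = soft_f1([1, 0], [0.5, 0.5])                     # uninformative predictions
wrong = soft_f1([1, 0], [0.0, 1.0])                      # fully inverted predictions
print(confident, unsure, wrong)
```
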

In [32]:
trainY = trainY.astype('float32')
testY = testY.astype('float32')

trainY.dtypes, testY.dtypes

(dtype('float32'), dtype('float32'))

In [33]:
%%run Keras First Run

sigopt.log_dataset('dataset_v1')
sigopt.log_metadata('len(train)', len(trainX))
sigopt.log_metadata('len(test)', len(testX))
sigopt.log_metadata('features', ','.join(features))

sigopt.log_model('MLP (keras.models.Sequential)')

#model parametrization
model_keras = Sequential()
model_keras.add(Dense(
    trainX.shape[1] * 2,  # number of units must be an integer
    activation='relu',
    kernel_initializer='random_normal',
    bias_initializer='zeros',
    input_dim=trainX.shape[1]
))
model_keras.add(Dense(
    trainX.shape[1] * 2,  # number of units must be an integer
    activation='relu',
    kernel_initializer='random_normal',
    bias_initializer='zeros'
))
model_keras.add(Dense(
    1,
    activation='sigmoid',
    kernel_initializer='random_normal',
    bias_initializer='zeros'
))
model_keras.compile(
    optimizer=Adam(lr=np.exp(sigopt.get_parameter('log_learning_rate', np.log(0.01)))),
    loss=f1_loss,
    metrics=[f1]
)
model_keras.fit(
    trainX,
    trainY,
    batch_size=sigopt.get_parameter('batch_size', default=4096),
    epochs=sigopt.get_parameter('epochs', default=16),
    callbacks=[CheckpointCB()],
    validation_data=(testX, testY),
)

#Collect model metrics
probability = model_keras.predict(testX).flatten()

record_metrics(
    amount=amount,
    probability=probability,
    label=testY
)

Run started, view it on the SigOpt dashboard at https://app.sigopt.com/run/22632
Epoch 1/16
Epoch 2/16
Epoch 3/16
Epoch 4/16
Epoch 5/16
Epoch 6/16
Epoch 7/16
Epoch 8/16
Epoch 9/16
Epoch 10/16
Epoch 11/16
Epoch 12/16
Epoch 13/16
Epoch 14/16
Epoch 15/16
Epoch 16/16
AUPRC = 0.06634916264247635
Run finished, view it on the SigOpt dashboard at https://app.sigopt.com/run/22632


With this dataset and set of parameters for the MLP model, XGBoost performs better. We might get better performance from the MLP with more training time, more data and optimized hyperparameters, but we'll stick with the XGBoost model for now because of its simplicity.

## Grid search

A quick grid search on the XGBoost model will let us get a better understanding of the parameters.

In [34]:
X = dataset_v1.drop(['isFraud'], axis=1)
Y = dataset_v1['isFraud']
trainX, testX, trainY, testY = train_test_split(X, Y, test_size = 0.2)

In [35]:
min_child_weight_grid = [1]
max_depth_grid = [4, 32]
n_estimators_grid = [32]
log_learning_rate_grid = [np.log(0.001), np.log(1)]
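
A grid search evaluates every combination of the listed values, so the grids above describe 1 x 2 x 1 x 2 = 4 configurations. The sketch below enumerates them with `itertools.product`; it only illustrates the bookkeeping and does not show how SigOpt schedules the runs.

```python
import itertools
import math

# Same grids as in the cell above.
grid = {
    "min_child_weight": [1],
    "max_depth": [4, 32],
    "n_estimators": [32],
    "log_learning_rate": [math.log(0.001), math.log(1)],
}

names = list(grid)
configurations = [
    dict(zip(names, values)) for values in itertools.product(*grid.values())
]

for config in configurations:
    print(config)
print("number of configurations:", len(configurations))
```
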

In [40]:
%%experiment
dict(
    type = "grid",
    name="XGBoost Grid Search",

    parameters=[
        dict(name="min_child_weight", bounds=dict(min=1, max=12), type="int", grid=min_child_weight_grid),
        dict(name="max_depth", bounds=dict(min=1, max=32), type="int", grid=max_depth_grid),    
        dict(name="n_estimators", bounds=dict(min=1, max=256), type="int", grid=n_estimators_grid),
        dict(name="log_learning_rate", bounds=dict(min=np.log(0.001), max=np.log(1)), type="double", grid=log_learning_rate_grid)
        ],

    metrics=[
        dict(name="AUPRC", objective="maximize", strategy="optimize"),
        dict(name="F1 Score", strategy="store")
        ]
)

Experiment created, view it on the SigOpt dashboard at https://app.sigopt.com/experiment/328291


In [41]:
%%optimize XGboost Grid Search

sigopt.log_dataset(
    'fraud_fullset'
)
sigopt.log_metadata('len(train)', len(trainX))
sigopt.log_metadata('len(test)', len(testX))
sigopt.log_metadata('features', ','.join(features))

sigopt.log_model('xgboost.sklearn.XGBClassifier')

#model parametrization
log_learning_rate = sigopt.get_parameter('log_learning_rate', default=np.log(1))
model_xgboost = XGBClassifier(
    min_child_weight=sigopt.get_parameter('min_child_weight', default=1),
    max_depth=sigopt.get_parameter('max_depth', default=4),
    n_estimators=sigopt.get_parameter('n_estimators', default=32),
    learning_rate=sigopt.get_parameter('learning_rate', default=np.exp(log_learning_rate))
)

#fit model
modelfit_xgboost = model_xgboost.fit(trainX, trainY)

#record importances
for feature, importance in zip(features, model_xgboost.feature_importances_):
    sigopt.log_metadata(f'importance:{feature}', importance)

#Collect model metrics
prediction = modelfit_xgboost.predict(testX)
probability = modelfit_xgboost.predict_proba(testX)[:, 1]

record_metrics(
    amount=testX['amount'],
    prediction=prediction,
    probability=probability,
    label=testY
)

Run started, view it on the SigOpt dashboard at https://app.sigopt.com/run/22637
AUPRC = 0.32827987403914327
Run finished, view it on the SigOpt dashboard at https://app.sigopt.com/run/22637
Run started, view it on the SigOpt dashboard at https://app.sigopt.com/run/22638
AUPRC = 0.7778765606623588
Run finished, view it on the SigOpt dashboard at https://app.sigopt.com/run/22638
Run started, view it on the SigOpt dashboard at https://app.sigopt.com/run/22639
AUPRC = 0.5157867563780826
Run finished, view it on the SigOpt dashboard at https://app.sigopt.com/run/22639
Run started, view it on the SigOpt dashboard at https://app.sigopt.com/run/22640
AUPRC = 0.8349072176808969
Run finished, view it on the SigOpt dashboard at https://app.sigopt.com/run/22640


It looks like adjusting the parameters can help the performance of this model, but it's not yet clear which configuration of parameters is ideal.

When we look at the XGBoost runs we see that some features are not even being used!

Before we really attempt to optimize these parameters, we should go back to feature selection.

## Designing New Features

The rows of this dataset represent transactions. The amount of each transaction should represent the change in balance in both the origin and destination accounts. Let's see if this is the case.

In [42]:
orig_balance_difference = df['oldBalanceOrig'] - df['newBalanceOrig'] - df['amount']
dest_balance_difference = df['newBalanceDest'] - df['oldBalanceDest'] - df['amount']
print('fraction_orig_consistent =', (orig_balance_difference.abs() < 0.01).sum() / len(df))
print('fraction_dest_consistent =', (dest_balance_difference.abs() < 0.01).sum() / len(df))

fraction_orig_consistent = 0.09793571996048236
fraction_dest_consistent = 0.7967541976653989


The assumption does not always hold. The printed fractions count the rows whose balance change matches the amount to within 0.01, so only about 10% of origin balances and about 80% of destination balances are consistent with the amount transacted. We can check how these discrepancies correlate with the fraud label and compare them with the other features.
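
To make the two discrepancy features concrete, the sketch below evaluates them on two rows taken from the tables shown earlier in this notebook (index 2, a fraudulent transfer, and index 15, a legitimate cash-out). The helper `balance_residuals` is a hypothetical name; the formulas are the same ones used in the cell above.

```python
def balance_residuals(row):
    """Return (origin residual, destination residual) for one transaction."""
    orig = row["oldBalanceOrig"] - row["newBalanceOrig"] - row["amount"]
    dest = row["newBalanceDest"] - row["oldBalanceDest"] - row["amount"]
    return orig, dest


# Row 2: fraudulent TRANSFER of 181.00.
row_2 = {
    "amount": 181.00,
    "oldBalanceOrig": 181.00, "newBalanceOrig": 0.00,
    "oldBalanceDest": 0.00, "newBalanceDest": 0.00,
}
# Row 15: legitimate CASH_OUT of 229133.94.
row_15 = {
    "amount": 229133.94,
    "oldBalanceOrig": 15325.00, "newBalanceOrig": 0.00,
    "oldBalanceDest": 5083.00, "newBalanceDest": 51513.44,
}

print(balance_residuals(row_2))   # origin balances add up, destination never received the money
print(balance_residuals(row_15))  # neither side is consistent with the amount
```
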

In [43]:
new_features = {
    'differenceBalanceOrig': orig_balance_difference,
    'differenceBalanceDest': dest_balance_difference,
}

In [45]:
Y = df['isFraud']

for label, feature in new_features.items():
    with sigopt.create_run(name=f'{label} - Feature analysis') as run:
        run.log_dataset('clean_fullset')
        run.log_model('scipy.stats.pearsonr')
        pearson_coeff, p_value = pearsonr(feature, Y)
        run.log_metric('abs_pearson_coeff', abs(pearson_coeff))
        run.log_metadata('pearson_coeff', pearson_coeff)
        run.log_metadata('p_value', p_value)

Run started, view it on the SigOpt dashboard at https://app.sigopt.com/run/22642
Run finished, view it on the SigOpt dashboard at https://app.sigopt.com/run/22642
Run started, view it on the SigOpt dashboard at https://app.sigopt.com/run/22643
Run finished, view it on the SigOpt dashboard at https://app.sigopt.com/run/22643


From the project's [runs page](https://app.sigopt.com/project/fraud-detection/runs), we can see that these new features are pretty good. We'll add them to our features and construct our new dataset.

In [46]:
for label, feature in new_features.items():
    df[label] = feature

In [47]:
dataset_v2 = make_dataset(df)

dataset_v2

Unnamed: 0,step,amount,oldBalanceOrig,newBalanceOrig,isFraud,isFlaggedFraud,typeTRANSFER,typeCASH_OUT,differenceBalanceOrig,differenceBalanceDest
2139248,184,135937.00,12874.0,0.00,0,0,0,1,-1.230630e+05,0.000000e+00
2943155,230,86241.86,926.0,0.00,0,0,0,1,-8.531586e+04,4.365575e-11
65936,9,263577.84,0.0,0.00,0,0,0,1,-2.635778e+05,5.820766e-11
6338993,692,234173.30,2889.0,0.00,0,0,1,0,-2.312843e+05,7.566996e-10
406257,18,477859.20,0.0,0.00,0,0,1,0,-4.778592e+05,-1.000000e-02
...,...,...,...,...,...,...,...,...,...,...
3545684,260,39674.94,0.0,0.00,0,0,0,1,-3.967494e+04,0.000000e+00
1358553,138,250637.74,0.0,0.00,0,0,0,1,-2.506377e+05,-6.984919e-10
4644508,330,5779984.75,0.0,0.00,0,0,1,0,-5.779985e+06,-1.000000e-02
1146475,131,235537.98,548166.0,312628.02,0,0,0,1,-2.910383e-11,-2.910383e-11


Now let's do an XGBoost run to see how these new features affect our results.

In [48]:
X = dataset_v2.drop(['isFraud'], axis=1)
features = list(X.columns)
Y = dataset_v2['isFraud']
trainX, testX, trainY, testY = train_test_split(X, Y, test_size = 0.2)

In [49]:
%%run XGboost Second Run

sigopt.log_dataset(
    'fraud_fullset'
)
sigopt.log_metadata('len(train)', len(trainX))
sigopt.log_metadata('len(test)', len(testX))
sigopt.log_metadata('features', ','.join(features))

sigopt.log_model('xgboost.sklearn.XGBClassifier')

#model parametrization
log_learning_rate = sigopt.get_parameter('log_learning_rate', default=np.log(1))
model_xgboost = XGBClassifier(
    min_child_weight=sigopt.get_parameter('min_child_weight', default=1),
    max_depth=sigopt.get_parameter('max_depth', default=12),
    n_estimators=sigopt.get_parameter('n_estimators', default=32),
    learning_rate=sigopt.get_parameter('learning_rate', default=np.exp(log_learning_rate))
)

#fit model
modelfit_xgboost = model_xgboost.fit(trainX, trainY)

#record importances
for feature, importance in zip(features, model_xgboost.feature_importances_):
    sigopt.log_metadata(f'importance:{feature}', importance)

#Collect model metrics
prediction = modelfit_xgboost.predict(testX)
probability = modelfit_xgboost.predict_proba(testX)[:, 1]

record_metrics(
    amount=testX['amount'],
    prediction=prediction,
    probability=probability,
    label=testY
)

Run started, view it on the SigOpt dashboard at https://app.sigopt.com/run/22644
AUPRC = 0.9190170876932192
Run finished, view it on the SigOpt dashboard at https://app.sigopt.com/run/22644


Well that's an improvement.

We know from the grid search that hyperparameters can affect performance. This time we'll try a SigOpt optimized experiment to see what kind of results we can get.

In [50]:
%%experiment
dict(
    name="XGBoost Optimization",

    parameters=[
        dict(name="min_child_weight", bounds=dict(min=1, max=12), type="int"),
        dict(name="max_depth", bounds=dict(min=1, max=32), type="int"),    
        dict(name="n_estimators", bounds=dict(min=1, max=256), type="int"),
        dict(name="log_learning_rate", bounds=dict(min=np.log(0.001), max=np.log(1)), type="double")
        ],

    metrics=[
        dict(name="AUPRC", objective="maximize", strategy="optimize"),
        dict(name="F1 Score", strategy="store")
        ],
    
    observation_budget=20,
)

Experiment created, view it on the SigOpt dashboard at https://app.sigopt.com/experiment/328292
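
The experiment searches over `log_learning_rate` instead of the learning rate itself. Equal steps in log space correspond to equal multiplicative steps in the learning rate, so the range between 0.001 and 0.01 gets as much attention as the range between 0.1 and 1. A small sketch of the mapping, using five evenly spaced points chosen purely for illustration (the optimizer picks its own points):

```python
import math

low, high = math.log(0.001), math.log(1)  # same bounds as the experiment

# Five evenly spaced points in log space...
log_values = [low + i * (high - low) / 4 for i in range(5)]
# ...map back to learning rates that are evenly spaced multiplicatively.
learning_rates = [math.exp(v) for v in log_values]

for log_value, rate in zip(log_values, learning_rates):
    print(f"log_learning_rate = {log_value:.3f} -> learning_rate = {rate:.5f}")
```
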


In [51]:
%%optimize XGboost Optimization

sigopt.log_dataset(
    'fraud_fullset'
)

sigopt.log_model('xgboost.sklearn.XGBClassifier')

#model parametrization
log_learning_rate = sigopt.get_parameter('log_learning_rate', default=np.log(0.1))
model_xgboost = XGBClassifier(
    min_child_weight=sigopt.get_parameter('min_child_weight', default=1),
    max_depth=sigopt.get_parameter('max_depth', default=10),
    n_estimators=sigopt.get_parameter('n_estimators', default=400),
    learning_rate=sigopt.get_parameter('learning_rate', default=np.exp(log_learning_rate))
)

#fit model
modelfit_xgboost = model_xgboost.fit(trainX, trainY)

#record importances
for feature, importance in zip(features, model_xgboost.feature_importances_):
    sigopt.log_metadata(f'importance:{feature}', importance)

#Collect model metrics
prediction = modelfit_xgboost.predict(testX)
probability = modelfit_xgboost.predict_proba(testX)

record_metrics(
    amount=testX['amount'],
    prediction=prediction,
    probability=probability[:, 1],
    label=testY
)

Run started, view it on the SigOpt dashboard at https://app.sigopt.com/run/22645
AUPRC = 0.9315680872774675
Run finished, view it on the SigOpt dashboard at https://app.sigopt.com/run/22645
Run started, view it on the SigOpt dashboard at https://app.sigopt.com/run/22646
AUPRC = 0.9280293235489325
Run finished, view it on the SigOpt dashboard at https://app.sigopt.com/run/22646
Run started, view it on the SigOpt dashboard at https://app.sigopt.com/run/22647
AUPRC = 0.9420571042390992
Run finished, view it on the SigOpt dashboard at https://app.sigopt.com/run/22647
Run started, view it on the SigOpt dashboard at https://app.sigopt.com/run/22648
AUPRC = 0.9371090132432648
Run finished, view it on the SigOpt dashboard at https://app.sigopt.com/run/22648
Run started, view it on the SigOpt dashboard at https://app.sigopt.com/run/22649
AUPRC = 0.9221841586436964
Run finished, view it on the SigOpt dashboard at https://app.sigopt.com/run/22649
Run started, view it on the SigOpt dashboard at ht