# Credit Card Fraud Detection Model

## Model training

In [1]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier as RFC
from xgboost import XGBClassifier as XGBC
import multiprocessing as mp
from sklearn.ensemble import VotingClassifier
from utils import clean_data, KFold, predict, feature_importance

Load and process the training data

Data preprocessing (feature engineering and selection) and data cleaning are handled by `clean_data` from `utils.py`.

In [2]:
# Load train data from Kaggle dataset [Credit Card Transactions Fraud Detection Dataset]
# Dataset created by KARTIK SHENOY and available under CC0
df_train = pd.read_csv('data/fraudTrain.csv')

# Clean train data
X, y = clean_data(df_train)
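
`clean_data` lives in `utils.py` and is not shown in this notebook. The feature names that appear later (`day_of_week_sin`, `day_of_week_cos`) suggest a cyclical encoding of the weekday. A minimal sketch of that idea, assuming weekdays numbered 0 to 6 and a hypothetical helper name `encode_day_of_week`, might look like this:

```python
import math

def encode_day_of_week(day: int) -> tuple[float, float]:
    """Map a weekday (0-6, assumed numbering) onto the unit circle.

    Hypothetical helper for illustration only; the real logic is in
    utils.clean_data and may differ.
    """
    angle = 2 * math.pi * day / 7
    return math.sin(angle), math.cos(angle)

# Day 6 and day 0 end up as close together as any other pair of adjacent days.
day0, day1, day6 = (encode_day_of_week(d) for d in (0, 1, 6))
gap_0_1 = math.dist(day0, day1)
gap_6_0 = math.dist(day6, day0)
print(gap_0_1, gap_6_0)
```

The point of the two-column encoding is that the end of the week wraps around to the start, which a plain integer column would not capture.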

Train Random Forest models with different class weights and evaluate each with K-Fold cross-validation

In [3]:
## Random Forest
# RFC list for different class weights
rf_classifiers = []
rfc_class_weights = [None, {0: 1, 1: 50}, {0: 1, 1: 75}, {0: 1, 1: 100}]

# Define model with parameters
for cw in rfc_class_weights:
    rf_params = {
        'n_estimators': 50,
        'max_depth': 20,
        'class_weight': cw,
        'random_state': 42,
        'n_jobs': mp.cpu_count()
        }
    
    rf_classifier = RFC(**rf_params)
    rf_classifiers.append(rf_classifier)
    KFold(rf_classifier, cw, X, y)

Class weights: None
              precision    recall  f1-score   support

           0       1.00      1.00      1.00   1289169
           1       0.90      0.68      0.77      7506

    accuracy                           1.00   1296675
   macro avg       0.95      0.84      0.89   1296675
weighted avg       1.00      1.00      1.00   1296675

Class weights: {0: 1, 1: 50}
              precision    recall  f1-score   support

           0       1.00      1.00      1.00   1289169
           1       0.78      0.75      0.77      7506

    accuracy                           1.00   1296675
   macro avg       0.89      0.88      0.88   1296675
weighted avg       1.00      1.00      1.00   1296675

Class weights: {0: 1, 1: 75}
              precision    recall  f1-score   support

           0       1.00      1.00      1.00   1289169
           1       0.75      0.77      0.76      7506

    accuracy                           1.00   1296675
   macro avg       0.87      0.88      0.88   1296
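
The `KFold` helper comes from `utils.py` and its implementation is not shown. The support column in the reports covers the whole training set (1,296,675 rows), which is consistent with pooling out-of-fold predictions. A rough sketch of such a helper, under the assumption of five stratified folds and with the hypothetical name `kfold_report`, could be:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import StratifiedKFold, cross_val_predict

def kfold_report(model, label, X, y, n_splits=5):
    """Print a classification report built from out-of-fold predictions.

    Hypothetical stand-in for utils.KFold; n_splits=5 is an assumption.
    """
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    # Each sample is predicted by a model that never saw it during fitting.
    y_pred = cross_val_predict(model, X, y, cv=cv)
    report = classification_report(y, y_pred)
    print(f"Class weights: {label}")
    print(report)
    return y_pred, report

# Small synthetic imbalanced data set, used only to exercise the helper.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.9, 0.1], random_state=0
)
demo_model = RandomForestClassifier(n_estimators=10, random_state=42)
demo_pred, demo_report = kfold_report(demo_model, None, X_demo, y_demo)
```

Stratified folds keep the fraud rate roughly constant across splits, which matters when the positive class is this rare.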

Repeat the same procedure for XGBoost, varying `scale_pos_weight` instead of `class_weight`

In [4]:
## XGBoost
# XGBC list for different class weights
xgb_classifiers = []
xgbc_class_weights = [None, 10, 20, 30]

# Define model with parameters
for cw in xgbc_class_weights:
    xgb_params = {
        'max_depth': 20,
        'n_estimators': 50,
        'learning_rate': 0.1,
        'subsample': 0.8,
        'colsample_bytree': 0.8,
        'scale_pos_weight': cw,
        'random_state': 42,
        'n_jobs': -1
        }
    
    xgb_classifier = XGBC(**xgb_params)
    xgb_classifiers.append(xgb_classifier)
    KFold(xgb_classifier, cw, X, y)

Class weights: None
              precision    recall  f1-score   support

           0       1.00      1.00      1.00   1289169
           1       0.89      0.70      0.78      7506

    accuracy                           1.00   1296675
   macro avg       0.94      0.85      0.89   1296675
weighted avg       1.00      1.00      1.00   1296675

Class weights: 10
              precision    recall  f1-score   support

           0       1.00      1.00      1.00   1289169
           1       0.82      0.76      0.79      7506

    accuracy                           1.00   1296675
   macro avg       0.91      0.88      0.89   1296675
weighted avg       1.00      1.00      1.00   1296675

Class weights: 20
              precision    recall  f1-score   support

           0       1.00      1.00      1.00   1289169
           1       0.79      0.78      0.79      7506

    accuracy                           1.00   1296675
   macro avg       0.90      0.89      0.89   1296675
weighted avg      

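All models perform comparably in cross-validation (fraud-class F1 of roughly 0.76 to 0.79 in the reports above), so all of them are kept in the lists for deployment.

Savings and the number of false detections are evaluated in the deployment stage to give further insight into the effect of the different class weights.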
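
For context on the weights being compared, the class imbalance can be read off the support column of the reports above. The sketch below computes the negative-to-positive ratio, which the XGBoost documentation mentions as a typical value to consider for `scale_pos_weight`, alongside the weights that scikit-learn's `class_weight='balanced'` would produce. The variable names are illustrative only.

```python
# Class counts taken from the support column of the reports above.
n_legit = 1_289_169
n_fraud = 7_506

# Ratio of negative to positive samples, a typical starting point
# for scale_pos_weight according to the XGBoost documentation.
imbalance_ratio = n_legit / n_fraud

# Weights scikit-learn assigns with class_weight='balanced':
# n_samples / (n_classes * class_count).
n_total = n_legit + n_fraud
balanced = {0: n_total / (2 * n_legit), 1: n_total / (2 * n_fraud)}

print(f"negative/positive ratio: {imbalance_ratio:.2f}")
print(f"balanced class weights:  {balanced}")
```

The weights tried in this notebook (50 to 100 for Random Forest, 10 to 30 for XGBoost) all sit below this ratio.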

## Deployment on test data

Load and process test data

In [5]:
# Load test data from Kaggle dataset [Credit Card Transactions Fraud Detection Dataset]
# Dataset created by KARTIK SHENOY and available under CC0
df_test = pd.read_csv('data/fraudTest.csv')

# Clean test data
X_test, y_test = clean_data(df_test)

Make predictions with Random Forest models and XGBoost models respectively.

Savings are calculated as *total detected fraud amount* minus *total cost of false detections* (reported as `Total Cost of Detection` in the output below).
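
As a quick check of this formula, the figures reported below for the unweighted Random Forest can be plugged in directly. How `predict` derives the cost term is not shown in this notebook, so the sketch only applies the subtraction; `total_savings` is a hypothetical helper name.

```python
def total_savings(detected_fraud_amount: float, detection_cost: float) -> float:
    """Savings = value of fraud caught minus the cost incurred to catch it."""
    return detected_fraud_amount - detection_cost

# Figures reported for the unweighted Random Forest on the test set.
total_fraud = 1133324.68
detected = 880840.37
cost = 63253.6500875

savings = total_savings(detected, cost)
share_recovered = savings / total_fraud

print(f"Total savings: {savings:.2f}")
print(f"Share of total fraud amount recovered: {share_recovered:.1%}")
```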

Random Forest models:

In [6]:
# Predict with Random Forest
rfc_best = None
for idx, rfc in enumerate(rf_classifiers):
    
    # Fit the model
    rfc.fit(X, y)
    
    # Predict
    y_pred = rfc.predict(X_test)
    
    # Evaluate and keep the model with the highest total savings
    total_savings = predict(y_pred, rfc_class_weights, rfc, X_test, y_test, idx)
    if rfc_best is None or total_savings > savings_best:
        rfc_best = rfc
        savings_best = total_savings

Class weight: None
Recall: 0.6615384615384615
Precision: 0.8636640292148509
F1 score: 0.749208025343189
AUPRC: 0.7783081993842402
Total Fraud Amount: 1133324.6800000002
Total Detected Fraud Amount: 880840.3699999999
Total Cost of Detection: 63253.6500875
Total Savings: 817586.7199124999
Total False Detections: 224

Class weight: {0: 1, 1: 50}
Recall: 0.7417249417249417
Precision: 0.7002640845070423
F1 score: 0.7203984604935475
AUPRC: 0.7775362428406877
Total Fraud Amount: 1133324.6800000002
Total Detected Fraud Amount: 998703.8999999999
Total Cost of Detection: 89207.0352875
Total Savings: 909496.8647124999
Total False Detections: 681

Class weight: {0: 1, 1: 75}
Recall: 0.7659673659673659
Precision: 0.6319230769230769
F1 score: 0.6925184404636459
AUPRC: 0.7670501989479446
Total Fraud Amount: 1133324.6800000002
Total Detected Fraud Amount: 1026367.81
Total Cost of Detection: 102464.52495
Total Savings: 923903.2850500001
Total False Detections: 957

Class weight: {0: 1, 1: 100}
Recall: 

Feature importance of best Random Forest model:

In [7]:
feature_importance(rfc_best, X) # Feature importance

                        Feature  Importance
0                           amt    0.629406
4   time_since_prev_transaction    0.101995
3              is_working_hours    0.057340
1                      city_pop    0.024766
10       category_gas_transport    0.022753
7                      distance    0.021952
19        category_shopping_net    0.019917
12         category_grocery_pos    0.017760
2                         is_AM    0.016639
14                category_home    0.009529
16            category_misc_net    0.009429
20        category_shopping_pos    0.009119
9          category_food_dining    0.009084
21              category_travel    0.007200
17            category_misc_pos    0.006561
5               day_of_week_sin    0.006415
8        category_entertainment    0.005760
6               day_of_week_cos    0.005639
15           category_kids_pets    0.005473
11         category_grocery_net    0.005131
13      category_health_fitness    0.004458
18       category_personal_care 

XGBoost models:

In [8]:
# Predict with XGBoost
xgbc_best = None
for idx, xgbc in enumerate(xgb_classifiers):
    
    # Fit the model
    xgbc.fit(X, y)
    
    # Predict
    y_pred = xgbc.predict(X_test)
    
    # Evaluate and keep the XGBoost model with the highest total savings
    total_savings = predict(y_pred, xgbc_class_weights, xgbc, X_test, y_test, idx)
    if xgbc_best is None or total_savings > savings_best:
        xgbc_best = xgbc
        savings_best = total_savings

Class weight: None
Recall: 0.654079254079254
Precision: 0.8735990037359901
F1 score: 0.7480671820847774
AUPRC: 0.8076745423514735
Total Fraud Amount: 1133324.6800000002
Total Detected Fraud Amount: 857378.67
Total Cost of Detection: 61674.666925
Total Savings: 795704.003075
Total False Detections: 203

Class weight: 10
Recall: 0.7314685314685314
Precision: 0.7642474427666829
F1 score: 0.7474988089566459
AUPRC: 0.8035576984161339
Total Fraud Amount: 1133324.6800000002
Total Detected Fraud Amount: 978850.18
Total Cost of Detection: 79980.0518875
Total Savings: 898870.1281125001
Total False Detections: 484

Class weight: 20
Recall: 0.7482517482517482
Precision: 0.7322080291970803
F1 score: 0.7401429559603412
AUPRC: 0.7998477858040268
Total Fraud Amount: 1133324.6800000002
Total Detected Fraud Amount: 995242.23
Total Cost of Detection: 85500.8609375
Total Savings: 909741.3690625
Total False Detections: 587

Class weight: 30
Recall: 0.7659673659673659
Precision: 0.6961864406779661
F1 score:

Feature importance of best XGBoost model:

In [9]:
feature_importance(xgbc_best, X) # Feature importance

                        Feature  Importance
10       category_gas_transport    0.356184
12         category_grocery_pos    0.089642
11         category_grocery_net    0.080958
21              category_travel    0.072154
3              is_working_hours    0.064494
0                           amt    0.056835
14                category_home    0.054262
8        category_entertainment    0.039145
9          category_food_dining    0.025886
17            category_misc_pos    0.025070
19        category_shopping_net    0.022345
13      category_health_fitness    0.021984
2                         is_AM    0.018143
15           category_kids_pets    0.016955
18       category_personal_care    0.016239
20        category_shopping_pos    0.011562
16            category_misc_net    0.010004
4   time_since_prev_transaction    0.005685
1                      city_pop    0.003611
5               day_of_week_sin    0.003191
6               day_of_week_cos    0.003005
7                      distance 

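The feature importances of the two models differ markedly: the Random Forest relies mostly on `amt`, whereas XGBoost gives the most weight to the transaction category features.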
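
One simple way to quantify how different the two rankings are is to compare their top five features. The lists below are copied from the two tables above; the overlap measure (Jaccard index) is a choice made for this illustration, not something the notebook itself computes.

```python
# Top five features per model, copied from the two tables above.
rf_top5 = [
    "amt",
    "time_since_prev_transaction",
    "is_working_hours",
    "city_pop",
    "category_gas_transport",
]
xgb_top5 = [
    "category_gas_transport",
    "category_grocery_pos",
    "category_grocery_net",
    "category_travel",
    "is_working_hours",
]

# Features that both models rank in their top five.
shared = sorted(set(rf_top5) & set(xgb_top5))
# Jaccard index: size of the intersection over size of the union.
jaccard = len(shared) / len(set(rf_top5) | set(xgb_top5))

print(f"Shared top-5 features: {shared}")
print(f"Jaccard overlap: {jaccard:.2f}")
```

Only two of the top five features are shared, and `amt`, which dominates the Random Forest ranking, appears sixth for XGBoost.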

Therefore, both models are combined into an ensemble learning model to make predictions.

In [10]:
# Create a soft-voting ensemble from the best Random Forest and XGBoost models
voting_classifier = VotingClassifier(
    estimators=[('random_forest', rfc_best), ('xgboost', xgbc_best)],
    voting='soft',
    weights=[1, 1]
)

# Train the VotingClassifier on the entire training dataset
voting_classifier.fit(X, y)

# Ensemble model
ensemble_preds = voting_classifier.predict(X_test)
predict(ensemble_preds, None, voting_classifier, X_test, y_test, None)

Recall: 0.7701631701631702
Precision: 0.6789971228935471
F1 score: 0.7217125382262998
AUPRC: 0.7935791579496215
Total Fraud Amount: 1133324.6800000002
Total Detected Fraud Amount: 1031386.13
Total Cost of Detection: 95310.8618625
Total Savings: 936075.2681375
Total False Detections: 781
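
With `voting='soft'`, scikit-learn's `VotingClassifier` averages the class probabilities predicted by its estimators (weighted by `weights`) and picks the class with the highest average. The pure-Python sketch below mimics that rule for a single sample; the probabilities are made up for illustration and `soft_vote` is a hypothetical helper, not part of the notebook.

```python
def soft_vote(probas, weights):
    """Weighted average of per-model class probabilities, then argmax.

    probas: one [p_class0, p_class1] pair per model for a single sample.
    """
    total = sum(weights)
    averaged = [
        sum(w * p[c] for w, p in zip(weights, probas)) / total
        for c in range(len(probas[0]))
    ]
    # Index of the class with the highest averaged probability.
    label = max(range(len(averaged)), key=averaged.__getitem__)
    return label, averaged

# Made-up probabilities for one transaction (not taken from the notebook):
# the forest leans "legitimate", the boosted model leans "fraud".
rf_proba = [0.60, 0.40]
xgb_proba = [0.25, 0.75]

label, averaged = soft_vote([rf_proba, xgb_proba], weights=[1, 1])
print(label, averaged)
```

With equal weights, a confident vote from one model can outweigh a hesitant vote from the other, which hard (majority) voting between two models cannot do.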



The ensemble's total savings (936,075.27) are higher than those of the best Random Forest and XGBoost models.

Although its number of false detections is slightly higher than that of the best XGBoost model, it is significantly lower than that of the best Random Forest model.

Hence, the ensemble is selected as the final model.
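
To make the trade-off between caught fraud and false detections concrete, the underlying confusion counts can be recovered from the reported precision, recall, and number of false detections. The sketch below does this for the ensemble figures; `confusion_counts` is a hypothetical helper and assumes that "false detections" means false positives.

```python
def confusion_counts(precision: float, recall: float, false_positives: int):
    """Recover TP and FN from precision, recall and the false-positive count.

    precision = TP / (TP + FP)  =>  TP = precision * FP / (1 - precision)
    recall    = TP / (TP + FN)  =>  FN = TP * (1 - recall) / recall
    """
    tp = precision * false_positives / (1 - precision)
    fn = tp * (1 - recall) / recall
    return round(tp), round(fn)

# Ensemble figures as reported in the output above.
tp, fn = confusion_counts(
    precision=0.6789971228935471,
    recall=0.7701631701631702,
    false_positives=781,
)
print(f"caught: {tp}, missed: {fn}, false detections: 781")
```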