# Credit Card Fraud Detection - Training/Testing/Evaluating Models 

## 0) Run the code cell below to start

We begin with the imports required for modelling: the train/test split, the performance metrics, and the models.

In [1]:
import numpy as np 
import pandas as pd 

from sklearn.model_selection import train_test_split 
from sklearn.metrics import (
    classification_report, 
    confusion_matrix, 
    roc_auc_score, 
    average_precision_score,
)
from sklearn.dummy import DummyClassifier 
from sklearn.linear_model import LogisticRegression 
from sklearn.tree import DecisionTreeClassifier

import matplotlib.pyplot as plt

## 1) Load Data 

The target is `Class` (0 = Real, 1 = Fraud). We load the data and split it into `X` (features) and `y` (labels).

In [2]:
DATA_PATH = "../data/raw/creditcard.csv"
df = pd.read_csv(DATA_PATH) 

X = df.drop(columns=["Class"]) 
y = df["Class"] 

print("X shape:", X.shape) 
print("y mean (fraud rate):", y.mean()) 

X shape: (284807, 30)
y mean (fraud rate): 0.001727485630620034


## 2) Stratified Train/Test Split 

The `stratify` parameter of sklearn's `train_test_split` ensures that the training and testing sets both have (approximately) the same proportion of each label as the full dataset. 

We stratify by `y` to preserve the fraud rate in both the train and test sets. 

This is very important for imbalanced data: with only about 0.17% fraud, a purely random split can leave one set with noticeably more or fewer fraud cases than the other, which makes the test metrics noisy and hard to compare with training performance. 

Stratifying gives the test set a representative share of fraud to evaluate on, while the training set keeps enough fraud cases for the model to learn from.

In [3]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2, 
    random_state=42, 
    stratify=y
)

print("Train fraud rate:", y_train.mean()) 
print("Test fraud rate:", y_test.mean()) 

Train fraud rate: 0.001729245759178389
Test fraud rate: 0.0017204452090867595
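
To illustrate what stratification changes, here is a minimal sketch on synthetic labels. The names `X_demo`, `y_demo` and `positives_in_test`, and the 0.2% positive rate, are made up for the demo and are not taken from the dataset. It counts how many positives land in the test set across several random seeds, with and without `stratify`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10,000 rows with 20 positives (0.2%)
y_demo = np.zeros(10_000, dtype=int)
y_demo[:20] = 1
X_demo = np.arange(10_000).reshape(-1, 1)


def positives_in_test(stratify, seed):
    """Return the number of positive labels that end up in the 20% test split."""
    _, _, _, y_te = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed, stratify=stratify
    )
    return int(y_te.sum())


plain_counts = [positives_in_test(None, seed) for seed in range(20)]
strat_counts = [positives_in_test(y_demo, seed) for seed in range(20)]

print("Without stratify:", plain_counts)  # varies from seed to seed
print("With stratify:   ", strat_counts)  # the same count for every seed
```

With so few positives, the unstratified counts can swing noticeably between seeds, which is the instability the stratified split avoids.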


## 3) Evaluation Helper Function 

We'll evaluate our models using: 
- Confusion Matrix 
- Classification Report (precision/recall/F1) 
- ROC-AUC 
- PR-AUC (often more informative for imbalanced sets)

In [8]:
def evaluate_model(name, model, X_tr, y_tr, X_te, y_te): 
    model.fit(X_tr, y_tr) 

    y_pred = model.predict(X_te) 

    # Since only some models support predict_proba, we have a handler 
    if hasattr(model, "predict_proba"):
        y_score = model.predict_proba(X_te)[:, 1]
    else: 
        y_score = None 
    
    print(f"\n=== {name} ===") 
    print("Confusion Matrix:\n", confusion_matrix(y_te, y_pred)) 
    print("\nClassification Report:\n", classification_report(y_te, y_pred, digits=4))

    if y_score is not None: 
        roc = roc_auc_score(y_te, y_score) 
        pr = average_precision_score(y_te, y_score) 
        print(f"ROC-AUC: {roc:.4f}")
        print(f"PR-AUC (Average Precision): {pr:.4f}") 
    else: 
        print("No probability scores available for ROC/PR-AUC")
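
As a rough illustration of why PR-AUC is often more informative on imbalanced sets, the sketch below scores synthetic labels with purely random scores. The data and names (`y_demo`, `random_scores`) are hypothetical. An uninformative model sits near 0.5 on ROC-AUC regardless of imbalance, whereas its average precision sits near the positive rate, so PR-AUC should be read against the fraud rate rather than against 0.5.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)

# Hypothetical labels: 100,000 rows with 200 positives (0.2% prevalence)
y_demo = np.zeros(100_000, dtype=int)
y_demo[:200] = 1

# Scores that carry no information about the label
random_scores = rng.random(y_demo.size)

roc_demo = roc_auc_score(y_demo, random_scores)
pr_demo = average_precision_score(y_demo, random_scores)

print("Prevalence:", y_demo.mean())
print("ROC-AUC:   ", roc_demo)  # close to 0.5
print("PR-AUC:    ", pr_demo)   # close to the prevalence, far below 0.5
```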


## 4) Baseline Model (DummyClassifier) 

This model shows why accuracy is very misleading on imbalanced data. 

Our DummyClassifier predicts "Real" every time, giving about 99.83% accuracy (since only about 0.17% of transactions are fraud, per our previous analysis). However, its recall on fraud is 0, since it never flags any fraud. 

In [9]:
dummy = DummyClassifier(strategy="most_frequent", random_state=42) 
evaluate_model("Dummy Classifier", dummy, X_train, y_train, X_test, y_test) 


=== Dummy Classifier ===
Confusion Matrix:
 [[56864     0]
 [   98     0]]

Classification Report:
               precision    recall  f1-score   support

           0     0.9983    1.0000    0.9991     56864
           1     0.0000    0.0000    0.0000        98

    accuracy                         0.9983     56962
   macro avg     0.4991    0.5000    0.4996     56962
weighted avg     0.9966    0.9983    0.9974     56962

ROC-AUC: 0.5000
PR-AUC (Average Precision): 0.0017
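
The accuracy and recall figures can be checked by hand from the confusion matrix. The sketch below simply copies the four counts printed above; `cm`, `accuracy` and `fraud_recall` are names introduced for this illustration.

```python
import numpy as np

# Confusion matrix copied from the Dummy Classifier output above
# rows = true class (0, 1), columns = predicted class (0, 1)
cm = np.array([[56864, 0],
               [98, 0]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()   # share of all predictions that are correct
fraud_recall = tp / (tp + fn)     # share of actual fraud that was flagged

print(f"Accuracy:     {accuracy:.4f}")      # 0.9983
print(f"Fraud recall: {fraud_recall:.4f}")  # 0.0000
```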




## 5) Logistic Regression 

A simple and strong baseline for binary classification. 

We will use `class_weight="balanced"`, which weights each class inversely to its frequency, so that misclassifying a fraud case is penalized more heavily. 
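
To get a feel for the size of these weights, the sketch below applies sklearn's "balanced" formula, `n_samples / (n_classes * np.bincount(y))`, to a synthetic label array. The class counts (227,451 real, 394 fraud) are inferred from the shapes and fraud rates printed in this notebook rather than read from the data, so treat them as an assumption.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Assumed training label counts (inferred from printed outputs, not recomputed here)
y_demo = np.concatenate([np.zeros(227_451, dtype=int), np.ones(394, dtype=int)])

# Weights as sklearn computes them for class_weight="balanced"
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y_demo)

# The same formula written out by hand
manual = y_demo.size / (2 * np.bincount(y_demo))

print(f"Weight for class 0 (real):  {weights[0]:.2f}")  # 0.50
print(f"Weight for class 1 (fraud): {weights[1]:.2f}")  # 289.14
```

Under these counts, an error on a fraud case costs several hundred times more than an error on a real transaction, which helps explain the high recall and low precision that follow.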

In [10]:
log_reg = LogisticRegression(
    max_iter=1000, 
    class_weight="balanced", 
    n_jobs=-1
)
evaluate_model("Logistic Regression", log_reg, X_train, y_train, X_test, y_test)




=== Logistic Regression ===
Confusion Matrix:
 [[55091  1773]
 [    8    90]]

Classification Report:
               precision    recall  f1-score   support

           0     0.9999    0.9688    0.9841     56864
           1     0.0483    0.9184    0.0918        98

    accuracy                         0.9687     56962
   macro avg     0.5241    0.9436    0.5379     56962
weighted avg     0.9982    0.9687    0.9826     56962

ROC-AUC: 0.9719
PR-AUC (Average Precision): 0.7309


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=1000).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
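
The convergence warning above suggests scaling the data. One common way to do that is to wrap the model in a pipeline with `StandardScaler`, sketched here on synthetic stand-in data (`X_demo`, `y_demo` and `scaled_log_reg` are hypothetical names). Whether this removes the warning or changes the metrics on the real dataset is not shown here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data: imbalanced labels, one feature on a much larger scale
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=10, weights=[0.98], random_state=0
)
X_demo[:, 0] *= 1e5

# The scaler is fit inside the pipeline, so it only ever sees the data passed to fit()
scaled_log_reg = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, class_weight="balanced"),
)
scaled_log_reg.fit(X_demo, y_demo)

n_iter = int(scaled_log_reg.named_steps["logisticregression"].n_iter_[0])
print("Iterations used:", n_iter)
```

Because the pipeline exposes `fit`, `predict` and `predict_proba`, it could be passed to `evaluate_model` in place of `log_reg`.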


## 6) Decision Tree

Trees are a powerful and flexible tool as well, but they can overfit, especially when grown to full depth as here (`max_depth=None`). We'll start with a single decision tree, which we will compare to XGBoost in a future notebook. 

In [11]:
tree = DecisionTreeClassifier(
    random_state=42, 
    class_weight="balanced", 
    max_depth=None
) 
evaluate_model("Decision Tree", tree, X_train, y_train, X_test, y_test) 


=== Decision Tree ===
Confusion Matrix:
 [[56830    34]
 [   27    71]]

Classification Report:
               precision    recall  f1-score   support

           0     0.9995    0.9994    0.9995     56864
           1     0.6762    0.7245    0.6995        98

    accuracy                         0.9989     56962
   macro avg     0.8379    0.8619    0.8495     56962
weighted avg     0.9990    0.9989    0.9989     56962

ROC-AUC: 0.8619
PR-AUC (Average Precision): 0.4904
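
Note that the tree's ROC-AUC equals its macro-average recall (0.8619). That is what you get when the scores are effectively hard 0/1 values, which is plausible for an unpruned tree whose leaves are pure, although the notebook does not check this directly. The sketch below rebuilds labels and hard predictions from the confusion matrix above and shows that ROC-AUC on 0/1 scores reduces to balanced accuracy.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Counts copied from the Decision Tree confusion matrix above
tn, fp, fn, tp = 56830, 34, 27, 71

# Rebuild the true labels and hard 0/1 predictions in matching order
y_true = np.concatenate([np.zeros(tn + fp, dtype=int), np.ones(fn + tp, dtype=int)])
y_hard = np.concatenate([
    np.zeros(tn, dtype=int), np.ones(fp, dtype=int),  # rows whose true class is 0
    np.zeros(fn, dtype=int), np.ones(tp, dtype=int),  # rows whose true class is 1
])

balanced_accuracy = 0.5 * (tp / (tp + fn) + tn / (tn + fp))
auc_from_hard_labels = roc_auc_score(y_true, y_hard)

print(f"Balanced accuracy:       {balanced_accuracy:.4f}")     # 0.8619
print(f"ROC-AUC from 0/1 scores: {auc_from_hard_labels:.4f}")  # 0.8619
```

In other words, the tree's ROC-AUC and PR-AUC here say little about ranking quality, so they are not directly comparable to the logistic regression's values.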


## 7) Conclusion & Next Notebook 

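So far we have: 
- Built baseline models: logistic regression catches most fraud (recall 0.92) but with very low precision (0.05), while the decision tree is more balanced (precision 0.68, recall 0.72) 
- Used balanced class weights to address the imbalanced data 

Next Notebook: 
- Tune the decision threshold to trade off recall against precision 
- Look into SMOTE for imbalanced classification 
- Train XGBoost with imbalance-handling options and compare it against these baselines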
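
As a preview of the threshold-tuning idea, here is one possible sketch using `precision_recall_curve` on synthetic data. It is not necessarily the approach the next notebook will take; the data, the 0.80 recall target, and names such as `X_val` and `chosen_threshold` are all assumptions for illustration. The threshold is picked on a held-out validation split, not on the test set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data (about 5% positives)
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=10, weights=[0.95], random_state=0
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

clf = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)
scores = clf.predict_proba(X_val)[:, 1]

# precision/recall have one more entry than thresholds; drop the final (1, 0) point
precision, recall, thresholds = precision_recall_curve(y_val, scores)
target_recall = 0.80  # assumed business requirement

# Among thresholds that still reach the target recall, keep the most precise one
meets_target = recall[:-1] >= target_recall
best = int(np.argmax(np.where(meets_target, precision[:-1], -1.0)))
chosen_threshold = thresholds[best]

y_pred = (scores >= chosen_threshold).astype(int)
print(f"Chosen threshold: {chosen_threshold:.3f}")
print(f"Precision: {precision[best]:.3f}, recall: {recall[best]:.3f}")
```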