# 5.0 - Model Training, Evaluation, and Business Simulation

_by Michael Joshua Vargas_

This notebook implements the full machine learning workflow. It covers:
1.  **Data Preparation**: Loading the final feature set and splitting it into training, validation, and holdout sets.
2.  **Preprocessing**: Creating a robust pipeline to scale numerical features and one-hot encode categorical features.
3.  **Model Tuning**: Training and tuning two separate XGBoost models, each optimized for a different metric tied to a business goal (Precision and AUC-PR).
4.  **Business Evaluation**: Using the tuned models on the holdout set to simulate a real-world, cost-sensitive fraud detection system.

## Setup and Data Preparation

In [None]:
# --- Core Libraries ---
import pandas as pd
import numpy as np
import os
import sys
from pathlib import Path
import warnings
from collections import Counter

# --- Preprocessing & Modeling ---
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier

# --- Evaluation ---
from sklearn.metrics import (
    confusion_matrix,
    precision_score,
    recall_score,
    average_precision_score,
    roc_auc_score,
    precision_recall_curve
)

# --- Model Persistence & Visualization ---
import joblib
import matplotlib.pyplot as plt
import seaborn as sns

# Suppress all warnings for cleaner output
warnings.filterwarnings('ignore')

In [5]:
# --- Path Setup ---
# Get the current working directory of the notebook
notebook_dir = Path(os.getcwd())

# Navigate up one level to reach the project root directory
project_root = notebook_dir.parent

if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

# Import from config.py
from bank_fraud.config import PROCESSED_DATA_DIR, REFERENCES_DIR

### Load Final Dataset

In [6]:
# Load the final, curated dataset from the feature selection phase
FINAL_DATA_PATH = PROCESSED_DATA_DIR / '3.0_selected_features.parquet'
df = pd.read_parquet(FINAL_DATA_PATH)

print(f"Dataset loaded successfully from: {FINAL_DATA_PATH.relative_to(project_root)}")
print(f"Dataset shape: {df.shape}")

Dataset loaded successfully from: data\processed\3.0_selected_features.parquet
Dataset shape: (493189, 65)


### Identify Feature Types and Define Target

In [9]:
# --- Dynamically Drop Identifier Columns ---

# Load the identifier data dictionary to get the authoritative list of identifiers
IDENTIFIER_DICT_PATH = REFERENCES_DIR / 'identifier_data_dictionary.csv'
identifier_df = pd.read_csv(IDENTIFIER_DICT_PATH)
all_identifiers = identifier_df['feature_name'].tolist()

# Find which of these identifiers are actually present in our current DataFrame
# This ensures the script doesn't fail if a column was already dropped in a previous step.
identifiers_to_drop = [col for col in all_identifiers if col in df.columns]

# Define the target variable
TARGET_COL = 'fraud_status'

# Define the feature matrix X by dropping the target and all identified identifiers
X = df.drop(columns=[TARGET_COL] + identifiers_to_drop, errors='ignore')
y = df[TARGET_COL]

print(f"Dropped {len(identifiers_to_drop)} identifier columns: {identifiers_to_drop}")


# Identify numerical and categorical features from the final feature matrix X
numerical_features = X.select_dtypes(include=np.number).columns.tolist()
categorical_features = X.select_dtypes(include=['object', 'category']).columns.tolist()

print(f"Identified {len(numerical_features)} numerical features.")
print(f"Identified {len(categorical_features)} categorical features.")

Dropped 2 identifier columns: ['profile_id', 'account_no']
Identified 59 numerical features.
Identified 3 categorical features.


### Split Data into Training, Validation, and Holdout Sets

We will perform a stratified split to ensure the proportion of fraud cases is consistent across all datasets.
- **Training Set (70%)**: For training the model.
- **Validation Set (15%)**: For tuning hyperparameters.
- **Holdout Set (15%)**: For final, unbiased evaluation.
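
To illustrate what stratification does, here is a minimal standard-library sketch that samples the same fraction from each class so the positive rate carries over to both sides of the split. The `stratified_split` helper and the toy labels are hypothetical and for illustration only; the notebook itself relies on scikit-learn's `train_test_split` with `stratify=y`, whose internals may differ.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction, seed=42):
    """Return (train_idx, test_idx), taking test_fraction of every class."""
    # Group row indices by class label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take the same fraction from each class
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


def positive_rate(indices, labels):
    """Share of positive labels among the given row indices."""
    return sum(labels[i] for i in indices) / len(indices)


# Toy labels: 10,000 rows with a 2% positive rate (not the real dataset)
labels = [1] * 200 + [0] * 9800
train_idx, temp_idx = stratified_split(labels, test_fraction=0.30)

print(f"train: {len(train_idx)} rows, positive rate {positive_rate(train_idx, labels):.4f}")
print(f"temp:  {len(temp_idx)} rows, positive rate {positive_rate(temp_idx, labels):.4f}")
```

With a rare positive class, a purely random split can leave one subset with noticeably fewer fraud cases than another; sampling per class avoids that by construction.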

In [10]:
# First split: Create the training set (70%) and a temporary set (30%)
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, 
    test_size=0.30, 
    random_state=42, 
    stratify=y
)

# Second split: Split the temporary set into validation (15%) and holdout (15%)
# This is equivalent to splitting the 30% temp set in half (0.5)
X_val, X_holdout, y_val, y_holdout = train_test_split(
    X_temp, y_temp, 
    test_size=0.50, 
    random_state=42, 
    stratify=y_temp
)

print("Data splitting complete.")
print(f"Training set shape:   {X_train.shape}")
print(f"Validation set shape: {X_val.shape}")
print(f"Holdout set shape:    {X_holdout.shape}")
print("\nProportion of fraud in each set:")
print(f"Training:   {y_train.mean():.4f}")
print(f"Validation: {y_val.mean():.4f}")
print(f"Holdout:    {y_holdout.mean():.4f}")

Data splitting complete.
Training set shape:   (345232, 62)
Validation set shape: (73978, 62)
Holdout set shape:    (73979, 62)

Proportion of fraud in each set:
Training:   0.0166
Validation: 0.0166
Holdout:    0.0166
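
As a sanity check on the shapes printed above, the row counts can be reproduced with plain arithmetic. This sketch assumes the test portion of each split is rounded up to a whole row, which is consistent with the reported shapes. The `split_sizes` helper is hypothetical, and 493,189 is the row count reported when the dataset was loaded.

```python
import math


def split_sizes(n_samples, test_fraction):
    """Return (n_train, n_test), assuming the test side is rounded up."""
    n_test = math.ceil(n_samples * test_fraction)
    return n_samples - n_test, n_test


n_total = 493189  # rows in the loaded dataset

# First split: 70% train, 30% temporary
n_train, n_temp = split_sizes(n_total, 0.30)

# Second split: half of the temporary set each for validation and holdout
n_val, n_holdout = split_sizes(n_temp, 0.50)

for name, n in [("train", n_train), ("validation", n_val), ("holdout", n_holdout)]:
    print(f"{name:<10} {n:>7}  ({n / n_total:.2%} of all rows)")
```

Taking half of the 30% temporary set yields 15% of the full dataset for each of the validation and holdout sets, which is why the second call uses `test_size=0.50` rather than `0.15`.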


### Establish Baseline with Proportional Chance Criterion (PCC)

Before building complex models, we need a baseline that any useful model must exceed. For imbalanced classification tasks, simple accuracy can be misleading, and the **Proportional Chance Criterion (PCC)** makes this concrete.

The PCC is the accuracy expected from a naive model that guesses each class at random in proportion to its frequency, computed as the sum of the squared class proportions. (Always guessing the majority class is a different baseline, the maximum chance criterion, which equals the majority-class proportion.) A common rule of thumb is that a useful model's accuracy should be at least 25% greater than the PCC.

This calculation will demonstrate why we focus on metrics like Precision, Recall, and AUC-PR instead of accuracy alone.

In [12]:
# Calculate PCC on the training data
class_counts = Counter(y_train)
total_samples = len(y_train)

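# PCC = sum of squared class proportions (expected accuracy of proportional random guessing)
pcc = ((class_counts[0] / total_samples)**2) + ((class_counts[1] / total_samples)**2)

# Rule-of-thumb target; note it can exceed 100% when one class dominates
pcc_threshold = 1.25 * pcc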

print(f"Proportion Chance Criterion (PCC): {pcc:.2%}")
print(f"1.25 * PCC Threshold: {pcc_threshold:.2%}")

Proportion Chance Criterion (PCC): 96.74%
1.25 * PCC Threshold: 120.92%


**Interpretation:**

The PCC of approximately 0.97 is the accuracy expected from guessing labels at random in proportion to the class frequencies. A model that does nothing but predict 'NON_FRAUD' for every case would score even higher, about 98.3% (one minus the 1.66% fraud rate), without catching a single fraud case. Note also that the 1.25 × PCC threshold of 120.92% exceeds 100%, so no model can satisfy the rule of thumb on this dataset. Both observations underscore the inadequacy of accuracy as a primary metric for this problem, which is why our evaluation will focus on the model's ability to correctly identify the rare fraud cases (Precision and Recall).
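
The gap between the two naive baselines, and the reason accuracy hides the failure of both, can be sketched with a few lines of arithmetic. The sketch uses the rounded 1.66% training fraud rate printed earlier; the 10,000-case population and all variable names are illustrative assumptions.

```python
fraud_rate = 0.0166  # training-set fraud proportion, rounded as printed earlier

# Proportional chance: guess each class at random, in proportion to its frequency
pcc = (1 - fraud_rate) ** 2 + fraud_rate ** 2

# Maximum chance: always predict the majority class (non-fraud)
majority_accuracy = 1 - fraud_rate

# Rule-of-thumb target of 25% above the PCC
pcc_threshold = 1.25 * pcc

# Confusion counts for the always-non-fraud model on 10,000 hypothetical cases
n_cases = 10_000
true_positives = 0  # the model never flags fraud
false_negatives = round(n_cases * fraud_rate)  # every fraud case is missed
recall = true_positives / (true_positives + false_negatives)

print(f"PCC (proportional guessing):  {pcc:.2%}")
print(f"Always predict non-fraud:     {majority_accuracy:.2%}")
print(f"1.25 * PCC attainable?        {pcc_threshold <= 1.0}")
print(f"Recall of always-non-fraud:   {recall:.2%}")
```

The always-non-fraud model reaches very high accuracy with zero recall, and its precision is undefined because it never predicts the positive class. Any metric used for model selection here has to be sensitive to the fraud class itself.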