# 5.0 - Model Training, Evaluation, and Business Simulation

_by Michael Joshua Vargas_

This notebook implements the full machine learning workflow. It covers:
1.  **Data Preparation**: Loading the final feature set and splitting it into training, validation, and holdout sets.
2.  **Preprocessing**: Creating a robust pipeline to scale numerical features and one-hot encode categorical features.
3.  **Model Tuning**: Training and tuning two separate XGBoost models optimized for different business goals (Precision and AUC-PR).
4.  **Business Evaluation**: Using the tuned models on the holdout set to simulate a real-world, cost-sensitive fraud detection system.

## 1. Setup and Data Preparation

In [1]:
%load_ext autoreload
%autoreload 2

#### Import relevant libraries

In [2]:
# --- Core Libraries ---
import pandas as pd
import numpy as np
import os
import sys
from pathlib import Path
import warnings
from collections import Counter

# --- Preprocessing & Modeling ---
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV, StratifiedKFold
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# --- Evaluation ---
from sklearn.metrics import (
    confusion_matrix,
    precision_score,
    recall_score,
    average_precision_score,
    roc_auc_score,
    precision_recall_curve,
    balanced_accuracy_score,
    f1_score,
    matthews_corrcoef,
    brier_score_loss
)
from IPython.display import display  # Renders DataFrames as tables in notebook output

# --- Model Persistence & Visualization ---
import joblib
import matplotlib.pyplot as plt
import seaborn as sns

# Suppress all warnings for cleaner output
warnings.filterwarnings('ignore')

In [3]:
# --- Path Setup ---
# Get the current working directory of the notebook
notebook_dir = Path(os.getcwd())

# Navigate up one level to reach the project root directory
project_root = notebook_dir.parent

if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

# Import from config.py
from bank_fraud.config import PROCESSED_DATA_DIR, REFERENCES_DIR, MODELS_DIR

### Load Final Dataset

In [4]:
# Load the final, curated dataset from the feature selection phase
FINAL_DATA_PATH = PROCESSED_DATA_DIR / '3.0_selected_features.parquet'
df = pd.read_parquet(FINAL_DATA_PATH)

print(f"Dataset loaded successfully from: {FINAL_DATA_PATH.relative_to(project_root)}")
print(f"Dataset shape: {df.shape}")

Dataset loaded successfully from: data\processed\3.0_selected_features.parquet
Dataset shape: (493189, 65)


### Identify Feature Types and Define Target

In [5]:
# --- Dynamically Drop Identifier Columns ---

# Load the identifier data dictionary to get the authoritative list of identifiers
IDENTIFIER_DICT_PATH = REFERENCES_DIR / 'identifier_data_dictionary.csv'
identifier_df = pd.read_csv(IDENTIFIER_DICT_PATH)
all_identifiers = identifier_df['feature_name'].tolist()

# Find which of these identifiers are actually present in our current DataFrame
# This ensures the script doesn't fail if a column was already dropped in a previous step.
identifiers_to_drop = [col for col in all_identifiers if col in df.columns]

# Define the target variable
TARGET_COL = 'fraud_status'

# Define the feature matrix X by dropping the target and all identified identifiers
X = df.drop(columns=[TARGET_COL] + identifiers_to_drop, errors='ignore')
y = df[TARGET_COL]

print(f"Dropped {len(identifiers_to_drop)} identifier columns: {identifiers_to_drop}")


# Identify numerical and categorical features from the final feature matrix X
numerical_features = X.select_dtypes(include=np.number).columns.tolist()
categorical_features = X.select_dtypes(include=['object', 'category']).columns.tolist()

print(f"Identified {len(numerical_features)} numerical features.")
print(f"Identified {len(categorical_features)} categorical features.")

Dropped 2 identifier columns: ['profile_id', 'account_no']
Identified 59 numerical features.
Identified 3 categorical features.


### Split Data into Training, Validation, and Holdout Sets

We will perform a stratified split to ensure the proportion of fraud cases is consistent across all datasets.
- **Training Set (70%)**: For training the model.
- **Validation Set (15%)**: For tuning hyperparameters.
- **Holdout Set (15%)**: For final, unbiased evaluation.

In [6]:
# First split: Create the training set (70%) and a temporary set (30%)
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, 
    test_size=0.30, 
    random_state=42, 
    stratify=y
)

# Second split: Split the temporary set into validation (15%) and holdout (15%)
# This is equivalent to splitting the 30% temp set in half (0.5)
X_val, X_holdout, y_val, y_holdout = train_test_split(
    X_temp, y_temp, 
    test_size=0.50, 
    random_state=42, 
    stratify=y_temp
)

print("Data splitting complete.")
print(f"Training set shape:   {X_train.shape}")
print(f"Validation set shape: {X_val.shape}")
print(f"Holdout set shape:    {X_holdout.shape}")
print("\nProportion of fraud in each set:")
print(f"Training:   {y_train.mean():.4f}")
print(f"Validation: {y_val.mean():.4f}")
print(f"Holdout:    {y_holdout.mean():.4f}")

Data splitting complete.
Training set shape:   (345232, 62)
Validation set shape: (73978, 62)
Holdout set shape:    (73979, 62)

Proportion of fraud in each set:
Training:   0.0166
Validation: 0.0166
Holdout:    0.0166


### Establish Baseline with Proportion Chance Criterion (PCC)

Before building complex models, it's crucial to establish a baseline to understand the minimum performance we must exceed. For imbalanced classification tasks, simple accuracy can be misleading. The **Proportion Chance Criterion (PCC)** provides this baseline.

The PCC is the accuracy a naive model would achieve by guessing at random in proportion to the class frequencies, which equals the sum of the squared class proportions. A common rule of thumb is that a useful model's accuracy should be at least 25% greater than the PCC.

This calculation will demonstrate why we focus on metrics like Precision, Recall, and AUC-PR instead of accuracy alone.

In [7]:
# Calculate PCC on the training data (Counter is imported in the setup cell)
class_counts = Counter(y_train)
total_samples = len(y_train)

pcc = ((class_counts[0] / total_samples)**2) + ((class_counts[1] / total_samples)**2)
pcc_threshold = 1.25 * pcc

print(f"Proportion Chance Criterion (PCC): {pcc:.2%}")
print(f"1.25 * PCC Threshold: {pcc_threshold:.2%}")

Proportion Chance Criterion (PCC): 96.74%
1.25 * PCC Threshold: 120.92%


**Interpretation:**

The PCC of approximately 0.97 means that even proportional random guessing would be about 97% accurate, and a model that does nothing but predict 'NON_FRAUD' for every case would be about 98.3% accurate (one minus the 1.66% fraud rate) while catching no fraud at all. The 1.25 * PCC threshold of 120.92% exceeds 100%, so no model can satisfy the rule of thumb here. Both results underscore the inadequacy of accuracy as a primary metric for this problem. Our model must demonstrate a much more nuanced understanding of the data to be considered effective, which is why our evaluation will focus on its ability to correctly identify the rare fraud cases (Precision and Recall).
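
As a quick check on these figures, the sketch below recomputes the PCC from the 1.66% training fraud rate and compares it with the accuracy of always predicting the majority class. The helper name `proportion_chance_criterion` is an illustrative assumption and is not part of the notebook.

```python
def proportion_chance_criterion(class_proportions):
    """Sum of squared class proportions (accuracy of proportional random guessing)."""
    return sum(p ** 2 for p in class_proportions)


fraud_rate = 0.0166  # training-set fraud proportion reported above
proportions = [1 - fraud_rate, fraud_rate]

pcc_value = proportion_chance_criterion(proportions)
majority_accuracy = max(proportions)  # accuracy of always predicting the majority class
rule_of_thumb = 1.25 * pcc_value

print(f"PCC:                     {pcc_value:.2%}")
print(f"Always-majority accuracy: {majority_accuracy:.2%}")
print(f"1.25 * PCC:              {rule_of_thumb:.2%}")
```

Because the rule-of-thumb threshold lands above 100%, accuracy cannot separate a useful model from a trivial one on data this imbalanced.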

## 2. Preprocessing Pipeline Construction

We will create a preprocessing pipeline using `ColumnTransformer` to apply different transformations to different types of columns.

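- **Numerical Features**: Will be scaled using `StandardScaler`. This standardizes features by removing the mean and scaling to unit variance, which matters for scale-sensitive models such as Logistic Regression. Tree-based models such as XGBoost are largely unaffected by it, but a shared preprocessor keeps the model comparison consistent.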
- **Categorical Features**: Will be transformed using `OneHotEncoder`. This converts categorical variables into a numerical format that can be provided to the model. `handle_unknown='ignore'` ensures that if a new category appears in the validation or holdout data (that was not seen in the training data), it will be handled gracefully without causing an error.

In [8]:
# Create the preprocessing pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ],
    remainder='passthrough'  # Keep other columns (if any), though we expect none
)

print("Preprocessing pipeline created successfully.")

Preprocessing pipeline created successfully.
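
To illustrate what `handle_unknown='ignore'` does in practice, here is a small sketch on a hypothetical `channel` column (the category values are invented for illustration). A category that was not present during fitting is encoded as an all-zero row instead of raising an error.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column seen during training
train_channels = np.array([["atm"], ["online"], ["branch"]], dtype=object)
# New data: "mobile" never appeared in the training data
new_channels = np.array([["online"], ["mobile"]], dtype=object)

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_channels)

# The encoder returns a sparse matrix by default; densify it for display
encoded = encoder.transform(new_channels).toarray()

print(encoder.categories_[0])  # learned categories, in sorted order
print(encoded)                 # the unseen "mobile" row is all zeros
```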


## 3. Model Training and Evaluation

### 3.1. Baseline Model Comparison (No Resampling)

We will first evaluate a set of baseline models without any resampling techniques to establish a performance benchmark.
This step helps us understand the inherent performance of different algorithms on our imbalanced dataset.

In [9]:
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
import time

def auto_ml(X, y, models_dict, preprocessor, cv, res_t=None):
    """
    Applies preprocessing, optional resampling, and evaluates multiple models using cross-validation.
    """
    results = {}
    results_formatted = {}

    for model_name, model_instance in models_dict.items():
        print(f"\n--- Evaluating {model_name} ---")
        
        # Create a pipeline that includes preprocessing and the model
        if res_t is not None:
            # If resampling is applied, use ImbPipeline
            pipeline = ImbPipeline(steps=[
                ('preprocessor', preprocessor),
                ('resampler', res_t),
                ('classifier', model_instance)
            ])
        else:
            # Otherwise, use standard Pipeline
            pipeline = Pipeline(steps=[
                ('preprocessor', preprocessor),
                ('classifier', model_instance)
            ])

        train_ap, val_ap = [], []
        train_bal_acc, val_bal_acc = [], []
        train_f1_w, val_f1_w = [], []
        train_mcc, val_mcc = [], []
        train_brier, val_brier = [], []
        train_precision, val_precision = [], []
        train_recall, val_recall = [], []
        
        fold_times = []

        for fold, (train_index, val_index) in enumerate(cv.split(X, y)):
            X_train_fold, X_val_fold = X.iloc[train_index], X.iloc[val_index]
            y_train_fold, y_val_fold = y.iloc[train_index], y.iloc[val_index]

            start_time = time.time()
            pipeline.fit(X_train_fold, y_train_fold)
            end_time = time.time()
            fold_times.append(end_time - start_time)

            # Predictions
            train_preds = pipeline.predict(X_train_fold)
            val_preds = pipeline.predict(X_val_fold)
            
            # Predict probabilities for metrics that require them
            train_probas = pipeline.predict_proba(X_train_fold)[:, 1]
            val_probas = pipeline.predict_proba(X_val_fold)[:, 1]

            # Calculate metrics
            train_ap.append(average_precision_score(y_train_fold, train_probas))
            val_ap.append(average_precision_score(y_val_fold, val_probas))

            train_bal_acc.append(balanced_accuracy_score(y_train_fold, train_preds))
            val_bal_acc.append(balanced_accuracy_score(y_val_fold, val_preds))

            train_f1_w.append(f1_score(y_train_fold, train_preds, average='weighted'))
            val_f1_w.append(f1_score(y_val_fold, val_preds, average='weighted'))

            train_mcc.append(matthews_corrcoef(y_train_fold, train_preds))
            val_mcc.append(matthews_corrcoef(y_val_fold, val_preds))
            
            train_brier.append(brier_score_loss(y_train_fold, train_probas))
            val_brier.append(brier_score_loss(y_val_fold, val_probas))

            train_precision.append(precision_score(y_train_fold, train_preds))
            val_precision.append(precision_score(y_val_fold, val_preds))

            train_recall.append(recall_score(y_train_fold, train_preds))
            val_recall.append(recall_score(y_val_fold, val_preds))

        # Store average results
        results[model_name] = {
            'Train AP': np.mean(train_ap),
            'Val AP': np.mean(val_ap),
            'Train BalAcc': np.mean(train_bal_acc),
            'Val BalAcc': np.mean(val_bal_acc),
            'Train F1_w': np.mean(train_f1_w),
            'Val F1_w': np.mean(val_f1_w),
            'Train MCC': np.mean(train_mcc),
            'Val MCC': np.mean(val_mcc),
            'Train Brier': np.mean(train_brier),
            'Val Brier': np.mean(val_brier),
            'Train Precision': np.mean(train_precision),
            'Val Precision': np.mean(val_precision),
            'Train Recall': np.mean(train_recall),
            'Val Recall': np.mean(val_recall),
            'Avg Run Time (s)': np.mean(fold_times)
        }
        
        # Store formatted results
        results_formatted[model_name] = {
            'Train AP': f"{np.mean(train_ap)*100:.2f}%",
            'Val AP': f"{np.mean(val_ap)*100:.2f}%",
            'Train BalAcc': f"{np.mean(train_bal_acc)*100:.2f}%",
            'Val BalAcc': f"{np.mean(val_bal_acc)*100:.2f}%",
            'Train F1_w': f"{np.mean(train_f1_w)*100:.2f}%",
            'Val F1_w': f"{np.mean(val_f1_w)*100:.2f}%",
            'Train MCC': f"{np.mean(train_mcc)*100:.2f}%",
            'Val MCC': f"{np.mean(val_mcc)*100:.2f}%",
            'Train Brier': f"{np.mean(train_brier)*100:.2f}%",
            'Val Brier': f"{np.mean(val_brier)*100:.2f}%",
            'Train Precision': f"{np.mean(train_precision)*100:.2f}%",
            'Val Precision': f"{np.mean(val_precision)*100:.2f}%",
            'Train Recall': f"{np.mean(train_recall)*100:.2f}%",
            'Val Recall': f"{np.mean(val_recall)*100:.2f}%",
            'Avg Run Time (s)': f"{np.mean(fold_times):.2f}"
        }

    return pd.DataFrame(results).T, pd.DataFrame(results_formatted).T

In [10]:
# Define baseline models
# For XGBoost, scale_pos_weight is calculated based on the training data imbalance
# eval_metric is set to 'logloss' for general classification, but AUC-PR is also tracked
models_dict = {
    'LogisticRegression': LogisticRegression(max_iter=1000, class_weight='balanced', random_state=42),
    'GaussianNB': GaussianNB(),
    'DecisionTreeClassifier': DecisionTreeClassifier(random_state=42, class_weight='balanced', max_depth=8, min_samples_leaf=20, ccp_alpha=0.001),
    'XGBoost': XGBClassifier(
        scale_pos_weight=(len(y_train) - y_train.sum()) / y_train.sum(),
        use_label_encoder=False,
        eval_metric='logloss', # Use logloss for general evaluation, AUC-PR is tracked separately
        random_state=42
    )
}

In [11]:
# Define cross-validation strategy
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Run baseline evaluation
baseline_results, baseline_results_formatted = auto_ml(X_train, y_train, models_dict, preprocessor, cv)

print("\n--- Baseline Model Performance (No Resampling) ---")
display(baseline_results_formatted)


--- Evaluating LogisticRegression ---

--- Evaluating GaussianNB ---

--- Evaluating DecisionTreeClassifier ---

--- Evaluating XGBoost ---

--- Baseline Model Performance (No Resampling) ---


| Model | Train AP | Val AP | Train BalAcc | Val BalAcc | Train F1_w | Val F1_w | Train MCC | Val MCC | Train Brier | Val Brier | Train Precision | Val Precision | Train Recall | Val Recall | Avg Run Time (s) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LogisticRegression | 54.72% | 54.46% | 79.89% | 79.78% | 96.69% | 96.70% | 35.84% | 35.77% | 9.94% | 9.95% | 22.13% | 22.12% | 63.55% | 63.33% | 3.78 |
| GaussianNB | 20.19% | 20.18% | 75.51% | 75.47% | 97.37% | 97.37% | 36.90% | 36.84% | 3.09% | 3.09% | 27.49% | 27.44% | 53.40% | 53.33% | 1.30 |
| DecisionTreeClassifier | 27.36% | 26.92% | 78.88% | 78.63% | 96.69% | 96.67% | 35.07% | 34.70% | 10.69% | 10.71% | 22.03% | 21.77% | 61.50% | 61.02% | 4.97 |
| XGBoost | 69.00% | 55.85% | 84.36% | 78.30% | 98.62% | 98.25% | 60.86% | 50.16% | 7.40% | 7.70% | 54.43% | 45.18% | 69.70% | 57.78% | 7.28 |
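
The weighted F1 values above sit near 97-98% for every model, which can look reassuring. The sketch below uses synthetic labels with the same 1.66% positive rate (an assumption for illustration, not the notebook's data) to show that a classifier which never predicts fraud reaches a similar weighted F1, while balanced accuracy and MCC expose it.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, f1_score, matthews_corrcoef

# Synthetic labels: 166 fraud cases out of 10,000 (1.66%)
y_true = np.array([1] * 166 + [0] * 9834)
# Trivial classifier: predict non-fraud for every case
y_pred = np.zeros_like(y_true)

f1_weighted = f1_score(y_true, y_pred, average="weighted", zero_division=0)
bal_acc = balanced_accuracy_score(y_true, y_pred)
mcc = matthews_corrcoef(y_true, y_pred)

print(f"Weighted F1:       {f1_weighted:.2%}")
print(f"Balanced accuracy: {bal_acc:.2%}")
print(f"MCC:               {mcc:.2%}")
```

This is why the comparison leans on AP, Precision, Recall, and MCC and treats weighted F1 as a secondary signal.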


### 3.2. Resampling Techniques Evaluation

Now, we will evaluate the impact of different resampling techniques on model performance.
We will compare RandomOverSampler, SMOTE, and RandomUnderSampler.

In [12]:
resamplers_dict = {
    'RandomOverSampler': RandomOverSampler(random_state=42),
    'SMOTE': SMOTE(random_state=42),
    'RandomUnderSampler': RandomUnderSampler(random_state=42)
}

resampling_results = {}
resampling_results_formatted = {}

for resampler_name, resampler_instance in resamplers_dict.items():
    print(f"\n--- Evaluating with {resampler_name} ---")
    # For each resampler, evaluate all models
    current_res_results, current_res_results_formatted = auto_ml(X_train, y_train, models_dict, preprocessor, cv, res_t=resampler_instance)
    
    # Store results, potentially renaming columns to indicate resampler
    for model_name in current_res_results.index:
        # Use a combined key for model and resampler
        combined_key = f"{model_name} + {resampler_name}"
        resampling_results[combined_key] = current_res_results.loc[model_name].to_dict()
        resampling_results_formatted[combined_key] = current_res_results_formatted.loc[model_name].to_dict()

resampling_results_df = pd.DataFrame(resampling_results).T
resampling_results_formatted_df = pd.DataFrame(resampling_results_formatted).T

print("\n--- Model Performance with Resampling ---")
display(resampling_results_formatted_df)


--- Evaluating with RandomOverSampler ---

--- Evaluating LogisticRegression ---

--- Evaluating GaussianNB ---

--- Evaluating DecisionTreeClassifier ---

--- Evaluating XGBoost ---

--- Evaluating with SMOTE ---

--- Evaluating LogisticRegression ---

--- Evaluating GaussianNB ---

--- Evaluating DecisionTreeClassifier ---

--- Evaluating XGBoost ---

--- Evaluating with RandomUnderSampler ---

--- Evaluating LogisticRegression ---

--- Evaluating GaussianNB ---

--- Evaluating DecisionTreeClassifier ---

--- Evaluating XGBoost ---

--- Model Performance with Resampling ---


| Model + Resampler | Train AP | Val AP | Train BalAcc | Val BalAcc | Train F1_w | Val F1_w | Train MCC | Val MCC | Train Brier | Val Brier | Train Precision | Val Precision | Train Recall | Val Recall | Avg Run Time (s) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LogisticRegression + RandomOverSampler | 54.71% | 54.44% | 79.87% | 79.75% | 96.66% | 96.66% | 35.65% | 35.54% | 9.95% | 9.95% | 21.92% | 21.87% | 63.56% | 63.33% | 8.93 |
| GaussianNB + RandomOverSampler | 19.94% | 19.91% | 75.61% | 75.57% | 97.32% | 97.32% | 36.54% | 36.50% | 3.17% | 3.17% | 26.87% | 26.84% | 53.68% | 53.60% | 2.29 |
| DecisionTreeClassifier + RandomOverSampler | 27.67% | 27.34% | 78.85% | 78.60% | 96.77% | 96.74% | 35.54% | 35.17% | 10.67% | 10.68% | 22.63% | 22.38% | 61.30% | 60.81% | 10.69 |
| XGBoost + RandomOverSampler | 67.67% | 52.39% | 64.12% | 58.33% | 43.40% | 42.83% | 8.06% | 4.77% | 60.80% | 61.27% | 2.30% | 2.04% | 100.00% | 88.83% | 8.62 |
| LogisticRegression + SMOTE | 55.01% | 54.75% | 79.85% | 79.75% | 96.69% | 96.70% | 35.82% | 35.75% | 9.97% | 9.98% | 22.13% | 22.12% | 63.47% | 63.26% | 8.55 |
| GaussianNB + SMOTE | 20.01% | 20.01% | 75.71% | 75.67% | 97.32% | 97.32% | 36.59% | 36.52% | 3.18% | 3.19% | 26.82% | 26.78% | 53.91% | 53.81% | 2.92 |
| DecisionTreeClassifier + SMOTE | 25.23% | 25.20% | 76.93% | 76.92% | 97.88% | 97.88% | 43.73% | 43.66% | 8.87% | 8.88% | 36.27% | 36.17% | 55.52% | 55.49% | 21.57 |
| XGBoost + SMOTE | 56.58% | 50.04% | 64.47% | 58.80% | 44.77% | 44.35% | 8.15% | 4.96% | 59.21% | 59.58% | 2.32% | 2.06% | 99.53% | 88.48% | 9.63 |
| LogisticRegression + RandomUnderSampler | 51.96% | 51.57% | 79.73% | 79.64% | 96.51% | 96.51% | 34.63% | 34.56% | 9.97% | 9.98% | 20.82% | 20.81% | 63.54% | 63.36% | 1.13 |
| GaussianNB + RandomUnderSampler | 18.23% | 18.22% | 75.99% | 75.96% | 97.06% | 97.05% | 34.83% | 34.74% | 3.61% | 3.62% | 24.16% | 24.07% | 54.91% | 54.88% | 0.95 |


### 3.3. Save Results

In [13]:
# Define the directory for saving model evaluation results
MODEL_EVAL_DIR = project_root / 'reports' / 'model_evaluation'
MODEL_EVAL_DIR.mkdir(parents=True, exist_ok=True)

# Save baseline results
baseline_results_formatted.to_csv(MODEL_EVAL_DIR / 'baseline_model_performance.csv', index=True)
print(f"Baseline model performance saved to: {MODEL_EVAL_DIR.relative_to(project_root) / 'baseline_model_performance.csv'}")

# Save resampling results
resampling_results_formatted_df.to_csv(MODEL_EVAL_DIR / 'resampling_model_performance.csv', index=True)
print(f"Resampling model performance saved to: {MODEL_EVAL_DIR.relative_to(project_root) / 'resampling_model_performance.csv'}")

Baseline model performance saved to: reports\model_evaluation\baseline_model_performance.csv
Resampling model performance saved to: reports\model_evaluation\resampling_model_performance.csv


### 3.4. Decision on Resampling and Final Model Selection

Based on the comparison of baseline models and models with resampling, we have made the following observations and decisions:

**Analysis of Baseline Models (No Resampling):**
- **XGBoost** demonstrated superior performance across key metrics, particularly `Val AP` (55.85%) and `Val Precision` (45.18%), compared to Logistic Regression, GaussianNB, and Decision Tree Classifier. This indicates its strong ability to identify fraud cases while maintaining a reasonable false positive rate without explicit external resampling.

**Analysis of Resampling Techniques:**
- For **XGBoost**, applying external resampling techniques (`RandomOverSampler`, `SMOTE`, `RandomUnderSampler`) generally led to a significant increase in `Val Recall` (e.g., up to 95.44% with RandomUnderSampler).
- However, this increase in recall came at a substantial cost to `Val Precision` (dropping to as low as 1.91%) and `Val AP` (dropping to around 50-54%). More fraud cases are identified, but far more non-fraud cases are also flagged as fraudulent, which is undesirable for the "Auto-Blocking Leg" that prioritizes minimizing false positives. Note that the resampled runs reuse the XGBoost instance from `models_dict`, so `scale_pos_weight` is still applied on top of already balanced classes; this double correction likely contributes to the collapse in precision.
- For other models (Logistic Regression, GaussianNB, Decision Tree), resampling did not consistently provide significant improvements in `Val AP` or `Val Precision` that would justify their use over XGBoost.

**Decision on Resampling and Final Model Selection:**
- We will proceed with **XGBoost without an explicit external resampling technique** for hyperparameter tuning. The `scale_pos_weight` parameter within XGBoost itself is already effectively handling the class imbalance by giving more importance to the minority class during training. This approach has shown the best balance between Precision and AUC-PR in our initial evaluations.
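
To make the `scale_pos_weight` reasoning concrete, the sketch below works through the arithmetic for a 1.66% fraud rate. The sample size of 10,000 is an assumption for illustration, and treating the weight as a simple per-class total is a simplification of how XGBoost scales the positive-class gradients. It also shows what happens when the same weight is kept after resampling to a 1:1 ratio, as in Section 3.2 where the same `models_dict` instances were reused.

```python
fraud_rate = 0.0166  # training-set fraud proportion reported earlier
n_total = 10_000     # illustrative sample size (assumption)

n_pos = round(n_total * fraud_rate)  # 166 fraud cases
n_neg = n_total - n_pos              # 9834 non-fraud cases

# Same formula as the notebook: negatives divided by positives
scale_pos_weight = n_neg / n_pos

# Original data: weighted positives roughly match the negatives
pos_weight_original = n_pos * scale_pos_weight
neg_weight_original = n_neg * 1.0

# After resampling to 1:1, the same weight over-corrects
n_balanced = n_neg
ratio_resampled = (n_balanced * scale_pos_weight) / (n_balanced * 1.0)

print(f"scale_pos_weight: {scale_pos_weight:.2f}")
print(f"Weighted positives vs negatives (original): {pos_weight_original:.0f} vs {neg_weight_original:.0f}")
print(f"Positive-to-negative weight ratio after 1:1 resampling: {ratio_resampled:.2f}")
```

On the original data the weight brings the two classes to parity, whereas on a resampled set it tilts the loss heavily toward the positive class.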

Our primary optimization targets for tuning will remain:
- **Precision**: Crucial for the "Auto-Blocking Leg" to minimize false positives.
- **Average Precision (AUC-PR)**: A robust metric for overall ranking quality in imbalanced datasets, important for both legs of the deployment.
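
The two targets measure different things, which the toy example below tries to show. Precision depends on the decision threshold applied to the scores, while Average Precision summarizes ranking quality across all thresholds. The labels and scores are invented for illustration.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_score

# Toy labels and predicted fraud probabilities (illustrative values)
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
scores = np.array([0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.90])

# Threshold-free: depends only on how the scores rank the cases
ap = average_precision_score(y_true, scores)

# Threshold-dependent: changes with the cut-off used to flag fraud
precision_at_050 = precision_score(y_true, (scores >= 0.50).astype(int))
precision_at_065 = precision_score(y_true, (scores >= 0.65).astype(int))

print(f"Average Precision:        {ap:.4f}")
print(f"Precision at threshold 0.50: {precision_at_050:.4f}")
print(f"Precision at threshold 0.65: {precision_at_065:.4f}")
```

scikit-learn's `'precision'` scorer evaluates `predict` output, which for a probabilistic classifier such as `XGBClassifier` corresponds to a 0.5 cut-off, so a Precision-optimized search tunes for that specific operating point.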



### 3.5. Hyperparameter Tuning for XGBoost

We will use a two-stage hyperparameter tuning process:
1.  **Stage 1: RandomizedSearchCV**: To efficiently explore a broad range of hyperparameters and identify promising regions.
2.  **Stage 2: GridSearchCV**: To perform a more exhaustive search within the narrowed, promising regions found by RandomizedSearchCV.

We will perform this two-stage tuning for two separate objectives: one optimized for Precision and another for AUC-PR.
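
A minimal, self-contained sketch of the two-stage idea on synthetic data follows. It swaps in `LogisticRegression` with a single hyperparameter `C` to keep the run short; the dataset, grid values, and variable names are assumptions for illustration and are not the notebook's actual search.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, StratifiedKFold

# Small imbalanced toy problem (about 5% positives)
X_toy, y_toy = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)
cv_toy = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
model = LogisticRegression(max_iter=1000, class_weight="balanced")

# Stage 1: sample a few values from a broad range
stage1 = RandomizedSearchCV(
    model,
    param_distributions={"C": [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]},
    n_iter=4,
    scoring="average_precision",
    cv=cv_toy,
    random_state=0,
)
stage1.fit(X_toy, y_toy)
best_c = stage1.best_params_["C"]

# Stage 2: exhaustive search in a narrow band around the Stage 1 winner.
# The winner itself is kept in the grid, so Stage 2 cannot score lower.
stage2 = GridSearchCV(
    model,
    param_grid={"C": [best_c * 0.5, best_c, best_c * 2.0]},
    scoring="average_precision",
    cv=cv_toy,
)
stage2.fit(X_toy, y_toy)

print(f"Stage 1 best C={best_c}, AP={stage1.best_score_:.4f}")
print(f"Stage 2 best C={stage2.best_params_['C']}, AP={stage2.best_score_:.4f}")
```

Keeping the Stage 1 winner inside the Stage 2 grid is a design choice in this sketch: with identical CV splits, the refined search then always matches or beats the first stage.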

#### Stage 1: RandomizedSearchCV

In [15]:
# Define the XGBoost model instance (without external resampler, as decided)
# The scale_pos_weight is already set in the models_dict for XGBoost
xgb_model_for_tuning = models_dict['XGBoost']

# Create a pipeline for tuning that includes the preprocessor and the XGBoost model
pipeline_for_tuning = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', xgb_model_for_tuning)
])

# Define file paths for saved models
precision_model_path = MODELS_DIR / 'best_xgb_precision_model.joblib'
aucpr_model_path = MODELS_DIR / 'best_xgb_aucpr_model.joblib'

# Check if models already exist
if precision_model_path.exists() and aucpr_model_path.exists():
    print("\n--- Loading pre-trained models and tuning results ---")
    best_xgb_precision_model = joblib.load(precision_model_path)
    best_xgb_aucpr_model = joblib.load(aucpr_model_path)

    # Load Stage 1 best scores from CSVs for comparison in conditional saving
    # Ensure these CSVs are saved by the previous run or manually created if skipping first run
    try:
        random_search_precision_results_df = pd.read_csv(MODEL_EVAL_DIR / 'random_search_precision_results.csv')
        random_search_precision_best_score_stage1 = random_search_precision_results_df['mean_test_score'].max()
    except FileNotFoundError:
        print("Warning: random_search_precision_results.csv not found. Cannot compare Stage 2 to Stage 1 for Precision.")
        random_search_precision_best_score_stage1 = -np.inf # Set to negative infinity so any Stage 2 score is better

    try:
        random_search_aucpr_results_df = pd.read_csv(MODEL_EVAL_DIR / 'random_search_aucpr_results.csv')
        random_search_aucpr_best_score_stage1 = random_search_aucpr_results_df['mean_test_score'].max()
    except FileNotFoundError:
        print("Warning: random_search_aucpr_results.csv not found. Cannot compare Stage 2 to Stage 1 for AUC-PR.")
        random_search_aucpr_best_score_stage1 = -np.inf # Set to negative infinity so any Stage 2 score is better

    print("Pre-trained models and Stage 1 tuning results loaded successfully. Skipping hyperparameter tuning.")

else:
    print("\n--- Starting Hyperparameter Tuning (Models not found) ---")

    # Stage 1 runs only when the saved models are missing. The Stage 2 cells
    # below depend on the search objects created in this branch.

    # --- Stage 1: RandomizedSearchCV ---
    # Define a broader parameter distribution for RandomizedSearchCV
    param_distributions = {
        'classifier__n_estimators': [100, 200, 300, 400, 500],
        'classifier__learning_rate': [0.01, 0.05, 0.1, 0.2],
        'classifier__max_depth': [3, 5, 7, 9],
        'classifier__subsample': [0.6, 0.7, 0.8, 0.9, 1.0],
        'classifier__colsample_bytree': [0.6, 0.7, 0.8, 0.9, 1.0],
        'classifier__gamma': [0, 0.1, 0.2, 0.3],
        'classifier__reg_alpha': [0, 0.001, 0.01, 0.1], # L1 regularization
        'classifier__reg_lambda': [0, 0.001, 0.01, 0.1] # L2 regularization
    }

    # --- Randomized Search for Precision ---
    print("\n--- Stage 1: Randomized Search for Precision ---")
    random_search_precision = RandomizedSearchCV(
        estimator=pipeline_for_tuning,
        param_distributions=param_distributions,
        n_iter=50,  # Number of parameter settings that are sampled
        scoring='precision',
        cv=cv,
        verbose=2,
        random_state=42,
        n_jobs=-1
    )
    random_search_precision.fit(X_train, y_train)

    print("\nBest parameters from Randomized Search (Precision):")
    print(random_search_precision.best_params_)
    print("\nBest Precision score from Randomized Search:")
    print(random_search_precision.best_score_)

    # --- Randomized Search for AUC-PR ---
    print("\n--- Stage 1: Randomized Search for AUC-PR ---")
    random_search_aucpr = RandomizedSearchCV(
        estimator=pipeline_for_tuning,
        param_distributions=param_distributions,
        n_iter=50,
        scoring='average_precision',
        cv=cv,
        verbose=2,
        random_state=42,
        n_jobs=-1
    )
    random_search_aucpr.fit(X_train, y_train)

    print("\nBest parameters from Randomized Search (AUC-PR):")
    print(random_search_aucpr.best_params_)
    print("\nBest AUC-PR score from Randomized Search:")
    print(random_search_aucpr.best_score_)


--- Loading pre-trained models and tuning results ---
Pre-trained models and Stage 1 tuning results loaded successfully. Skipping hyperparameter tuning.

Save Results

In [None]:
# Save RandomizedSearchCV results for Precision
pd.DataFrame(random_search_precision.cv_results_).to_csv(MODEL_EVAL_DIR / 'random_search_precision_results.csv', index=False)

print(f"Randomized Search (Precision) results saved to: {MODEL_EVAL_DIR.relative_to(project_root) / 'random_search_precision_results.csv'}")

# Save RandomizedSearchCV results for AUC-PR
pd.DataFrame(random_search_aucpr.cv_results_).to_csv(MODEL_EVAL_DIR / 'random_search_aucpr_results.csv', index=False)

print(f"Randomized Search (AUC-PR) results saved to: {MODEL_EVAL_DIR.relative_to(project_root) / 'random_search_aucpr_results.csv'}")

#### Stage 2: GridSearchCV

Define narrower parameter grids based on RandomizedSearchCV results

In [None]:
best_params_precision = random_search_precision.best_params_
refined_param_grid_precision = {
    'classifier__n_estimators': [max(100, best_params_precision['classifier__n_estimators'] - 25), best_params_precision['classifier__n_estimators'] + 25],
    'classifier__learning_rate': [best_params_precision['classifier__learning_rate'] * 0.95, best_params_precision['classifier__learning_rate'] * 1.05],
    'classifier__max_depth': [max(1, best_params_precision['classifier__max_depth']), best_params_precision['classifier__max_depth'] + 1],
    'classifier__subsample': [max(0.6, best_params_precision['classifier__subsample'] - 0.025), min(1.0, best_params_precision['classifier__subsample'] + 0.025)],
    'classifier__colsample_bytree': [max(0.6, best_params_precision['classifier__colsample_bytree'] - 0.025), min(1.0, best_params_precision['classifier__colsample_bytree'] + 0.025)],
    'classifier__gamma': [max(0, best_params_precision['classifier__gamma']), best_params_precision['classifier__gamma'] + 0.025],
    'classifier__reg_alpha': [best_params_precision['classifier__reg_alpha']],
    'classifier__reg_lambda': [best_params_precision['classifier__reg_lambda']]
}

In [None]:
best_params_aucpr = random_search_aucpr.best_params_
refined_param_grid_aucpr = {
    'classifier__n_estimators': [max(100, best_params_aucpr['classifier__n_estimators'] - 25), best_params_aucpr['classifier__n_estimators'] + 25],
    'classifier__learning_rate': [best_params_aucpr['classifier__learning_rate'] * 0.95, best_params_aucpr['classifier__learning_rate'] * 1.05],
    'classifier__max_depth': [max(1, best_params_aucpr['classifier__max_depth']), best_params_aucpr['classifier__max_depth'] + 1],
    'classifier__subsample': [max(0.6, best_params_aucpr['classifier__subsample'] - 0.025), min(1.0, best_params_aucpr['classifier__subsample'] + 0.025)],
    'classifier__colsample_bytree': [max(0.6, best_params_aucpr['classifier__colsample_bytree'] - 0.025), min(1.0, best_params_aucpr['classifier__colsample_bytree'] + 0.025)],
    'classifier__gamma': [max(0, best_params_aucpr['classifier__gamma']), best_params_aucpr['classifier__gamma'] + 0.025],
    'classifier__reg_alpha': [best_params_aucpr['classifier__reg_alpha']],
    'classifier__reg_lambda': [best_params_aucpr['classifier__reg_lambda']]
}
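
The two grid definitions above follow the same pattern, so they could be generated by one function. The sketch below uses a hypothetical helper, `refine_grid`, that applies the same narrowing rules; the example `best_params` values are invented for illustration and are not results from this notebook. It also counts how many candidates the resulting grid produces.

```python
import math


def refine_grid(best_params, prefix="classifier__"):
    """Build a two-point grid around each Stage 1 winner (hypothetical helper)."""
    def p(name):
        return best_params[prefix + name]

    return {
        prefix + "n_estimators": [max(100, p("n_estimators") - 25), p("n_estimators") + 25],
        prefix + "learning_rate": [p("learning_rate") * 0.95, p("learning_rate") * 1.05],
        prefix + "max_depth": [p("max_depth"), p("max_depth") + 1],
        prefix + "subsample": [max(0.6, p("subsample") - 0.025), min(1.0, p("subsample") + 0.025)],
        prefix + "colsample_bytree": [
            max(0.6, p("colsample_bytree") - 0.025),
            min(1.0, p("colsample_bytree") + 0.025),
        ],
        prefix + "gamma": [p("gamma"), p("gamma") + 0.025],
        # Regularization terms are held fixed at their Stage 1 values
        prefix + "reg_alpha": [p("reg_alpha")],
        prefix + "reg_lambda": [p("reg_lambda")],
    }


# Illustrative Stage 1 winners (assumed values, not actual search output)
example_best_params = {
    "classifier__n_estimators": 300,
    "classifier__learning_rate": 0.1,
    "classifier__max_depth": 5,
    "classifier__subsample": 0.8,
    "classifier__colsample_bytree": 0.8,
    "classifier__gamma": 0.1,
    "classifier__reg_alpha": 0.01,
    "classifier__reg_lambda": 0.1,
}

grid = refine_grid(example_best_params)
n_candidates = math.prod(len(values) for values in grid.values())
n_fits = n_candidates * 5  # 5-fold cross-validation, as in the notebook

print(f"Candidates: {n_candidates}, total fits with 5-fold CV: {n_fits}")
```

Six parameters with two values each give 64 candidates, so each grid search costs 320 fits under 5-fold cross-validation.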

In [None]:
# --- Grid Search for Precision ---
print("\n--- Stage 2: Grid Search for Precision ---")
grid_search_precision = GridSearchCV(
    estimator=pipeline_for_tuning,
    param_grid=refined_param_grid_precision,
    scoring='precision',
    cv=cv,
    verbose=2,
    n_jobs=-1
)
grid_search_precision.fit(X_train, y_train)

print("\nBest parameters for Precision (Grid Search):")
print(grid_search_precision.best_params_)
print("\nBest Precision score (Grid Search):")
print(grid_search_precision.best_score_)

In [None]:
from bank_fraud.config import PROCESSED_DATA_DIR, REFERENCES_DIR, MODELS_DIR

# Store Stage 1 best score for comparison
random_search_precision_best_score_stage1 = random_search_precision.best_score_

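# Keep whichever stage scored higher in cross-validation and always save it,
# so that Section 4.1 has a model file to load regardless of which stage won.
# The fallback assumes the Stage 1 search was fit with refit=True (the default),
# which is what makes best_estimator_ available.
if grid_search_precision.best_score_ > random_search_precision_best_score_stage1:
    best_xgb_precision_model = grid_search_precision.best_estimator_
    print("Stage 2 Grid Search for Precision improved upon Stage 1 Randomized Search.")
else:
    best_xgb_precision_model = random_search_precision.best_estimator_
    print("Stage 2 Grid Search for Precision did not improve upon Stage 1 Randomized Search. Keeping the Stage 1 model.")

joblib.dump(best_xgb_precision_model, MODELS_DIR / 'best_xgb_precision_model.joblib')
print(f"Best XGBoost model (Precision-optimized) saved to: {MODELS_DIR.relative_to(project_root) / 'best_xgb_precision_model.joblib'}")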

In [None]:
# --- Grid Search for AUC-PR ---
print("\n--- Stage 2: Grid Search for AUC-PR ---")
grid_search_aucpr = GridSearchCV(
    estimator=pipeline_for_tuning,
    param_grid=refined_param_grid_aucpr,
    scoring='average_precision',
    cv=cv,
    verbose=2,
    n_jobs=-1
)
grid_search_aucpr.fit(X_train, y_train)

print("\nBest parameters for AUC-PR (Grid Search):")
print(grid_search_aucpr.best_params_)
print("\nBest AUC-PR score (Grid Search):")
print(grid_search_aucpr.best_score_)

In [None]:
# Store Stage 1 best score for comparison
random_search_aucpr_best_score_stage1 = random_search_aucpr.best_score_

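# Keep whichever stage scored higher in cross-validation and always save it,
# so that Section 4.1 has a model file to load regardless of which stage won.
# The fallback assumes the Stage 1 search was fit with refit=True (the default),
# which is what makes best_estimator_ available.
if grid_search_aucpr.best_score_ > random_search_aucpr_best_score_stage1:
    best_xgb_aucpr_model = grid_search_aucpr.best_estimator_
    print("Stage 2 Grid Search for AUC-PR improved upon Stage 1 Randomized Search.")
else:
    best_xgb_aucpr_model = random_search_aucpr.best_estimator_
    print("Stage 2 Grid Search for AUC-PR did not improve upon Stage 1 Randomized Search. Keeping the Stage 1 model.")

joblib.dump(best_xgb_aucpr_model, MODELS_DIR / 'best_xgb_aucpr_model.joblib')
print(f"Best XGBoost model (AUC-PR-optimized) saved to: {MODELS_DIR.relative_to(project_root) / 'best_xgb_aucpr_model.joblib'}")

print("\nHyperparameter tuning complete. AUC-PR-optimized model saved.")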

### 3.5.3. Summary of Hyperparameter Tuning Key Findings

The two-stage hyperparameter tuning process for XGBoost models, optimized separately for Precision and Average Precision (AUC-PR), yielded the following key insights:

**Overall Effectiveness:**
The tuning process successfully identified improved hyperparameter configurations for XGBoost compared to the initial baseline model. This validates the approach of systematically searching the parameter space.

**Precision-Optimized Model:**
*   **Stage 1 (RandomizedSearchCV) Best Precision:** Achieved approximately **48.49%**.
*   **Stage 2 (GridSearchCV) Best Precision:** Achieved **48.76%**.
*   **Interpretation:** `GridSearchCV` in Stage 2 improved on the best Stage 1 `RandomizedSearchCV` model by **0.27 percentage points**. A gain this small suggests that Stage 1 had already located a near-optimal region and that Stage 2 acted as a fine-grained refinement. It should be read against the fold-to-fold variation of the cross-validation scores before being treated as a real improvement. The Stage 2 model is kept as the Precision-optimized model.

**AUC-PR-Optimized Model:**
*   **Stage 1 (RandomizedSearchCV) Best AUC-PR:** Achieved approximately **57.81%**.
*   **Stage 2 (GridSearchCV) Best AUC-PR:** Achieved **57.89%**.
*   **Interpretation:** As with Precision, `GridSearchCV` in Stage 2 improved on Stage 1 only slightly, by **0.08 percentage points**. This is consistent with Stage 1 having already found a good region of the parameter space; a difference of this size may not exceed cross-validation noise. The Stage 2 model is kept as the AUC-PR-optimized model.

**Conclusion:**
The two-stage tuning process produced one tuned XGBoost model for Precision and one for AUC-PR. The Stage 2 gains were marginal (0.27 and 0.08 percentage points), which is consistent with Stage 1 having found a stable region rather than proof of it. Both models are now ready for unbiased evaluation on the holdout dataset and for the subsequent business simulation.
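
One way to put the Stage 2 gains in context is to compare them with the spread of scores across CV folds. The sketch below is illustrative only: `stage2_gain` is a hypothetical helper, and the `fold_std` value is a placeholder rather than a measured result. In this notebook the real value could be read from `grid_search_precision.cv_results_['std_test_score'][grid_search_precision.best_index_]`.

```python
def stage2_gain(stage1_score, stage2_score, fold_std):
    """Return (gain, exceeds_noise) for a Stage 2 vs Stage 1 comparison.

    fold_std is the standard deviation of the best candidate's score across
    CV folds; a gain smaller than it is hard to tell apart from fold noise.
    """
    gain = stage2_score - stage1_score
    return gain, gain > fold_std


# Mean CV scores as reported in the summary above.
# ASSUMPTION: fold_std=0.01 is a placeholder, not a measured value.
precision_gain, precision_exceeds_noise = stage2_gain(0.4849, 0.4876, fold_std=0.01)
aucpr_gain, aucpr_exceeds_noise = stage2_gain(0.5781, 0.5789, fold_std=0.01)

print(f"Precision gain: {precision_gain * 100:.2f} pp, exceeds fold std: {precision_exceeds_noise}")
print(f"AUC-PR gain:    {aucpr_gain * 100:.2f} pp, exceeds fold std: {aucpr_exceeds_noise}")
```

Comparing against one fold standard deviation is a rough heuristic, not a significance test; it only flags gains that are too small to interpret on their own.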

## 4. Holdout Evaluation and Business Simulation

In this section, we will evaluate the performance of our best-tuned models on the unseen holdout dataset.
We will also conduct a cost-sensitive business simulation to understand the financial impact of our models.
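
As a rough illustration of what a cost-sensitive simulation computes, the sketch below turns confusion-matrix counts into a total cost. Everything in it is an assumption made for the example: the `CostAssumptions` class, the `total_cost` helper, the per-transaction costs, and the counts are hypothetical and are not taken from this project's data. In the notebook, the counts could come from `confusion_matrix(y_holdout, y_pred).ravel()`, which returns `tn, fp, fn, tp` for a binary problem.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostAssumptions:
    """Per-transaction costs. All values used below are hypothetical."""
    false_positive: float        # e.g. reviewing or blocking a legitimate transaction
    false_negative: float        # e.g. loss from a missed fraudulent transaction
    true_positive: float = 0.0   # e.g. handling a correctly flagged fraud


def total_cost(tn, fp, fn, tp, costs):
    """Total cost implied by confusion-matrix counts; true negatives cost nothing."""
    return (
        fp * costs.false_positive
        + fn * costs.false_negative
        + tp * costs.true_positive
    )


# Placeholder costs and counts, chosen only to make the arithmetic easy to follow.
costs = CostAssumptions(false_positive=5.0, false_negative=200.0, true_positive=5.0)

# Scenario A: a model flags some transactions (9,900 legitimate, 100 fraudulent).
cost_with_model = total_cost(tn=9_800, fp=100, fn=40, tp=60, costs=costs)

# Scenario B: no model, so nothing is flagged and every fraud is missed.
cost_without_model = total_cost(tn=9_900, fp=0, fn=100, tp=0, costs=costs)

savings = cost_without_model - cost_with_model
print(f"With model: {cost_with_model:.0f}, without: {cost_without_model:.0f}, savings: {savings:.0f}")
```

Separating the cost assumptions from the counts makes it straightforward to rerun the same counts under different cost scenarios.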

### 4.1. Load Best Tuned Models

In [None]:
# Load the best Precision-optimized model
best_xgb_precision_model = joblib.load(MODELS_DIR / 'best_xgb_precision_model.joblib')
print(f"Precision-optimized model loaded from: {MODELS_DIR.relative_to(project_root) / 'best_xgb_precision_model.joblib'}")

# Load the best AUC-PR-optimized model
best_xgb_aucpr_model = joblib.load(MODELS_DIR / 'best_xgb_aucpr_model.joblib')
print(f"AUC-PR-optimized model loaded from: {MODELS_DIR.relative_to(project_root) / 'best_xgb_aucpr_model.joblib'}")

### 4.2. Evaluate Models on Holdout Set

In [None]:

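def evaluate_model_on_holdout(model, X_holdout, y_holdout, model_name):
    """
    Evaluate a fitted model on the holdout set, print key metrics, and
    return them as a dict keyed by metric name.

    Threshold-based metrics (Precision, Recall, F1, Balanced Accuracy, MCC)
    use the hard labels from model.predict; Average Precision and the Brier
    score use the predicted probability of the positive class.
    """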
    y_pred = model.predict(X_holdout)
    y_proba = model.predict_proba(X_holdout)[:, 1]

    precision = precision_score(y_holdout, y_pred)
    recall = recall_score(y_holdout, y_pred)
    f1_weighted = f1_score(y_holdout, y_pred, average='weighted')
    bal_acc = balanced_accuracy_score(y_holdout, y_pred)
    mcc = matthews_corrcoef(y_holdout, y_pred)
    ap = average_precision_score(y_holdout, y_proba)
    brier = brier_score_loss(y_holdout, y_proba)

    print(f"\n--- Holdout Evaluation for {model_name} ---")
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1-Weighted: {f1_weighted:.4f}")
    print(f"Balanced Accuracy: {bal_acc:.4f}")
    print(f"MCC: {mcc:.4f}")
    print(f"Average Precision (AUC-PR): {ap:.4f}")
    print(f"Brier Score: {brier:.4f}")

    return {
        'Precision': precision,
        'Recall': recall,
        'F1-Weighted': f1_weighted,
        'Balanced Accuracy': bal_acc,
        'MCC': mcc,
        'Average Precision (AUC-PR)': ap,
        'Brier Score': brier
    }

# Evaluate Precision-optimized model
precision_model_holdout_metrics = evaluate_model_on_holdout(best_xgb_precision_model, X_holdout, y_holdout, "Precision-Optimized XGBoost")

# Evaluate AUC-PR-optimized model
aucpr_model_holdout_metrics = evaluate_model_on_holdout(best_xgb_aucpr_model, X_holdout, y_holdout, "AUC-PR-Optimized XGBoost")
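
The two metric dictionaries returned above can be compared metric by metric. The following is a minimal sketch, assuming both dictionaries share the same keys; `pick_winner` and `LOWER_IS_BETTER` are hypothetical names, and the numbers are placeholders for illustration, not results from this notebook.

```python
# The Brier score is a loss, so a smaller value is better; the other metrics
# returned by evaluate_model_on_holdout are higher-is-better.
LOWER_IS_BETTER = {"Brier Score"}


def pick_winner(metrics_a, metrics_b, name_a, name_b):
    """Return {metric: name of the better model, or 'tie'} for shared metric keys."""
    winners = {}
    for metric, a in metrics_a.items():
        b = metrics_b[metric]
        if a == b:
            winners[metric] = "tie"
        else:
            a_is_better = a < b if metric in LOWER_IS_BETTER else a > b
            winners[metric] = name_a if a_is_better else name_b
    return winners


# PLACEHOLDER values, not measured results.
example_precision_model = {"Precision": 0.50, "Recall": 0.20, "Brier Score": 0.010}
example_aucpr_model = {"Precision": 0.45, "Recall": 0.30, "Brier Score": 0.012}

winners = pick_winner(
    example_precision_model, example_aucpr_model, "Precision model", "AUC-PR model"
)
for metric, winner in winners.items():
    print(f"{metric}: {winner}")
```

In the notebook, the same call could take `precision_model_holdout_metrics` and `aucpr_model_holdout_metrics` directly.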
