## 03_modeling — Baseline: Logistic Regression

Purpose:
- Train baseline classification model
- Generate raw prediction probabilities
- Persist model and modeling outputs

Notes:
- This notebook only trains models and saves their raw predictions; it computes no metrics.
- Evaluation is handled in 04_evaluation.

### Why Logistic Regression?

- Suitable for binary classification problems.
- Works well when the classes are approximately linearly separable in feature space.
- Supports `class_weight`, which is important for handling class imbalance.
- Provides probabilistic outputs required for threshold tuning.
- Uses the `lbfgs` solver (scikit-learn's default), which is stable and efficient for baseline training.

**As a linear baseline model, Logistic Regression is not expected to achieve high
precision on the fraud class due to severe class imbalance. Its primary role is to
establish a strong, interpretable reference point for comparison with more complex models.**
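
To make the `class_weight` point concrete: scikit-learn documents the `"balanced"` mode as weighting each class by `n_samples / (n_classes * np.bincount(y))`. The sketch below reproduces that formula in plain Python on made-up labels. The helper name `balanced_class_weights` and the 98/2 split are illustrative assumptions, not values from this project's data.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: weight} using n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy labels (illustrative only): 98 legitimate (0) and 2 fraudulent (1) transactions
toy_labels = [0] * 98 + [1] * 2
weights = balanced_class_weights(toy_labels)
print(weights)
```

On these toy labels the rare class receives a much larger weight than the common one, so a misclassified fraud case contributes far more to the training loss.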


### Loading Preprocessed Data
The model is trained using preprocessed datasets saved during the preprocessing stage.
This ensures a clean separation between preprocessing and modeling and guarantees
reproducibility of the training pipeline.

In [None]:
import joblib
import os
import numpy as np
...
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier

# Load train/test splits generated during preprocessing
X_train = joblib.load("../artifacts/X_train.pkl")
X_test  = joblib.load("../artifacts/X_test.pkl")
y_train = joblib.load("../artifacts/y_train.pkl")
y_test  = joblib.load("../artifacts/y_test.pkl")

In [None]:
# Baseline model (to be compared with future models)

# Seed NumPy's global RNG as a precaution; the estimators are seeded via random_state
np.random.seed(42)

# Initialize Logistic Regression with class balancing
model = LogisticRegression(
    class_weight="balanced",  # Reweight classes inversely to their frequency
    max_iter=1000,            # Give lbfgs enough iterations to converge
    solver="lbfgs",
    random_state=42           # Unused by lbfgs; kept for consistency across models
)

# Train the model
model.fit(X_train, y_train)

# Predict fraud probabilities for the test set
y_pred_proba = model.predict_proba(X_test)[:, 1]

# Collect raw outputs for the evaluation stage
model_outputs = {
    "model_name": "logistic_regression_baseline",
    "y_pred_proba": y_pred_proba,
    "y_test": y_test
}

# Persist the trained model and its outputs
os.makedirs("../models", exist_ok=True)
os.makedirs("../artifacts", exist_ok=True)

joblib.dump(model, "../models/logistic_regression_baseline.pkl")
joblib.dump(model_outputs, "../artifacts/model_outputs_baseline.pkl")

print("Model saved successfully.")

**Notes**
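- `random_state` is fixed at 42 for every model so that results are reproducible and
comparisons are consistent. Logistic Regression with the `lbfgs` solver is already
deterministic, so the setting matters mainly for the models with randomized training
steps (bootstrap sampling in Random Forest, row and column subsampling in XGBoost).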

- Predicted probabilities represent the model’s confidence that a transaction
is fraudulent. Using probabilities instead of hard predictions allows
flexible threshold selection and better control over the precision–recall
trade-off, which is critical in fraud detection.
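
The effect of the threshold on the precision–recall trade-off can be sketched on a handful of made-up scores. The helper `precision_recall_at` and the toy arrays are assumptions for illustration only; they are not this project's predictions, and the evaluation notebook may compute these metrics differently.

```python
def precision_recall_at(y_true, y_proba, threshold):
    """Precision and recall for the positive class at a probability threshold."""
    y_pred = [1 if p >= threshold else 0 for p in y_proba]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy labels and fraud probabilities (illustrative only)
toy_y_true = [0, 0, 0, 0, 1, 0, 1, 1]
toy_y_proba = [0.05, 0.10, 0.35, 0.60, 0.40, 0.20, 0.70, 0.90]

for threshold in (0.3, 0.5, 0.8):
    precision, recall = precision_recall_at(toy_y_true, toy_y_proba, threshold)
    print(f"threshold={threshold:.1f}  precision={precision:.2f}  recall={recall:.2f}")
```

On this toy data, raising the threshold from 0.3 to 0.8 moves precision from 0.60 to 1.00 while recall falls from 1.00 to 0.33. Hard predictions would fix that choice at training time.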

### Model Outputs
This dictionary collects all raw outputs produced by the modeling stage.
It acts as a standardized interface between the modeling and evaluation steps.

- **model_name**: Identifier for the trained model, useful when comparing multiple models.
- **y_pred_proba**: Predicted probabilities for the positive class (fraud = 1).
  These probabilities allow flexible threshold selection during evaluation.
- **y_test**: Ground truth labels for the test set, required for computing evaluation metrics.
  
The trained model and its predicted probabilities are persisted for
use in the evaluation and comparison stages.
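
Because every model writes the same three keys, a consumer can check the dictionary before using it. The following is a minimal sketch of such a check; `validate_model_outputs` and `REQUIRED_KEYS` are hypothetical names that do not appear elsewhere in this project, and the toy dictionary stands in for a loaded outputs file.

```python
REQUIRED_KEYS = {"model_name", "y_pred_proba", "y_test"}


def validate_model_outputs(outputs):
    """Raise if the outputs dict does not follow the modeling/evaluation interface."""
    missing = REQUIRED_KEYS - outputs.keys()
    if missing:
        raise KeyError(f"missing keys: {sorted(missing)}")
    if len(outputs["y_pred_proba"]) != len(outputs["y_test"]):
        raise ValueError("y_pred_proba and y_test must have the same length")
    if any(not 0.0 <= p <= 1.0 for p in outputs["y_pred_proba"]):
        raise ValueError("y_pred_proba must contain probabilities in [0, 1]")
    return True


# Toy stand-in for a loaded outputs dictionary (illustrative values only)
toy_outputs = {
    "model_name": "logistic_regression_baseline",
    "y_pred_proba": [0.02, 0.91, 0.15],
    "y_test": [0, 1, 0],
}
print(validate_model_outputs(toy_outputs))
```

The check relies only on `len` and iteration, so it should also accept NumPy arrays in place of the lists used here.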

## Random Forest — Model Training

Purpose:
- Train a non-linear tree-based model
- Capture feature interactions missed by linear models
- Generate probability outputs for downstream evaluation

While Logistic Regression provides a strong and interpretable baseline,
its linear nature limits its ability to capture complex feature interactions.

To assess whether non-linear models can improve fraud detection performance,
tree-based and boosting models are trained next under the same preprocessing setup.
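
A toy XOR pattern illustrates what "feature interaction" means here: the label depends on the combination of two features, not on either one alone. The sketch uses four made-up points, a coarse grid of linear rules, and a hand-written depth-2 tree rule. None of it comes from the fraud dataset, and the grid search is an illustration, not a proof.

```python
from itertools import product

# XOR-style toy data: the label depends on the interaction of the two features
points = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 1, 1, 0]


def accuracy(predict):
    """Fraction of toy points that a rule predict(x1, x2) classifies correctly."""
    hits = sum(predict(x1, x2) == y for (x1, x2), y in zip(points, labels))
    return hits / len(points)


# Linear rule: predict 1 when w1*x1 + w2*x2 + b > 0; try a coarse grid of weights
grid = [i / 2 for i in range(-4, 5)]  # -2.0, -1.5, ..., 2.0
best_linear = max(
    accuracy(lambda x1, x2, w1=w1, w2=w2, b=b: int(w1 * x1 + w2 * x2 + b > 0))
    for w1, w2, b in product(grid, repeat=3)
)


# Depth-2 tree rule: split on x1 first, then on x2 within each branch
def tree_rule(x1, x2):
    if x1 == 0:
        return 1 if x2 == 1 else 0
    return 0 if x2 == 1 else 1


print("best linear accuracy:", best_linear)
print("tree rule accuracy:", accuracy(tree_rule))
```

No linear rule on the grid gets more than three of the four points right, while two nested splits classify all of them. Tree ensembles such as Random Forest build rules of this nested kind.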

In [None]:
# Initialize Random Forest model
rf_model = RandomForestClassifier(
    n_estimators=100,
    max_depth=None,
    class_weight="balanced",
    random_state=42,
    n_jobs=-1
)

# Train the model
rf_model.fit(X_train, y_train)

# Predict fraud probabilities on the test set
rf_proba = rf_model.predict_proba(X_test)[:, 1]

# Save outputs
rf_outputs = {
    "model_name": "random_forest",
    "y_pred_proba": rf_proba,
    "y_test": y_test
}
os.makedirs("../models", exist_ok=True)
os.makedirs("../artifacts", exist_ok=True)

joblib.dump(rf_model, "../models/random_forest.pkl")
joblib.dump(rf_outputs, "../artifacts/model_outputs_random_forest.pkl")

print("Model saved successfully.")

### Model Outputs

The trained Random Forest model and its predicted probabilities are persisted for
use in the evaluation and comparison stages.

## Gradient Boosting Model — XGBoost

XGBoost, a gradient-boosted tree model, is trained to test whether boosting can
strike a better balance between fraud recall and false positives than
Logistic Regression and Random Forest.

It uses the same preprocessed data as the other models and saves its outputs in
the same format, so the comparison in 04_evaluation is fair.
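
The XGBoost cell in this section sets `scale_pos_weight` to the ratio of negative to positive training labels, which is how it compensates for class imbalance. The sketch below computes that ratio in plain Python. The helper name `negative_to_positive_ratio` and the 995/5 split are illustrative assumptions, not this project's class counts.

```python
def negative_to_positive_ratio(labels):
    """Count of negative labels (0) divided by count of positive labels (1)."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = sum(1 for y in labels if y == 0)
    if n_pos == 0:
        raise ValueError("labels contain no positive samples")
    return n_neg / n_pos


# Toy labels (illustrative only): 995 legitimate and 5 fraudulent transactions
toy_labels = [0] * 995 + [1] * 5
ratio = negative_to_positive_ratio(toy_labels)
print(ratio)
```

With a ratio like this, each positive example is weighted roughly as much as all the negatives it is outnumbered by, playing a role similar to `class_weight="balanced"` in the scikit-learn models.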

In [None]:
# Initialize XGBoost model
xgb_model = XGBClassifier(
    n_estimators=200,
    max_depth=4,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    # Ratio of negative to positive training samples, to offset class imbalance
    scale_pos_weight=(y_train == 0).sum() / (y_train == 1).sum(),
    eval_metric="logloss",
    random_state=42,
    n_jobs=-1
)

# Train the model
xgb_model.fit(X_train, y_train)

# Predict fraud probabilities on the test set
xgb_proba = xgb_model.predict_proba(X_test)[:, 1]

# Save outputs
xgb_outputs = {
    "model_name": "xgboost",
    "y_pred_proba": xgb_proba,
    "y_test": y_test
}
os.makedirs("../models", exist_ok=True)
os.makedirs("../artifacts", exist_ok=True)

joblib.dump(xgb_model, "../models/xgboost.pkl")
joblib.dump(xgb_outputs, "../artifacts/model_outputs_xgboost.pkl")

print("Model saved successfully.")

### Model Outputs

The trained XGBoost model and its predicted probabilities are persisted for
use in the evaluation and comparison stages.