## 03_modeling — Baseline: Logistic Regression

Purpose:
- Train baseline classification model
- Generate raw prediction probabilities
- Persist model and modeling outputs

Notes:
- This notebook trains the model and generates raw test-set probabilities; it computes no metrics.
- Evaluation is handled in 04_evaluation.

### Why Logistic Regression?

- Suitable for binary classification problems.
- Works well when the classes are approximately linearly separable, since the decision boundary is linear in the input features.
- Supports `class_weight`, which is important for handling class imbalance.
- Provides probabilistic outputs required for threshold tuning.
- Uses the lbfgs solver for stable and efficient optimization in baseline training.

**As a linear baseline model, Logistic Regression is not expected to achieve high
precision on the fraud class due to severe class imbalance. Its primary role is to
establish a strong, interpretable reference point for comparison with more complex models.**
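
To make the `class_weight` point concrete, the sketch below reproduces the heuristic that scikit-learn documents for `class_weight="balanced"`, namely `n_samples / (n_classes * np.bincount(y))`. The label array `y` is a hypothetical toy example, not the project's data.

```python
import numpy as np

# Hypothetical toy labels: 8 legitimate transactions (0) and 2 fraudulent ones (1)
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

n_samples = y.shape[0]
class_counts = np.bincount(y)        # number of samples in each class
n_classes = class_counts.shape[0]

# Heuristic documented by scikit-learn for class_weight="balanced"
balanced_weights = n_samples / (n_classes * class_counts)

for label, weight in enumerate(balanced_weights):
    print(f"class {label}: weight = {weight:.3f}")
```

The rarer class receives the larger weight, so each misclassified fraud case contributes more to the loss than a misclassified legitimate transaction.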


### Loading Preprocessed Data
The model is trained using preprocessed datasets saved during the preprocessing stage.
This ensures a clean separation between preprocessing and modeling and guarantees
reproducibility of the training pipeline.

In [None]:
import joblib
import os
import numpy as np
from sklearn.linear_model import LogisticRegression

# Load train/test splits generated during preprocessing
X_train = joblib.load("../artifacts/X_train.pkl")
X_test  = joblib.load("../artifacts/X_test.pkl")
y_train = joblib.load("../artifacts/y_train.pkl")
y_test  = joblib.load("../artifacts/y_test.pkl")

In [None]:
# Baseline model (to be compared with future models)

# Seed NumPy's global RNG as a precaution; the lbfgs solver itself is deterministic
np.random.seed(42)

# Initialize Logistic Regression with class balancing
model = LogisticRegression(
    class_weight="balanced",  # Reweight classes inversely to their frequency
    max_iter=1000,            # Raise the iteration cap to help lbfgs converge
    solver="lbfgs",
    random_state=42           # Only used by sag, saga and liblinear; kept for consistency
)

# Train the model
model.fit(X_train, y_train)

In [None]:
# Predict fraud probabilities for the test set
y_pred_proba = model.predict_proba(X_test)[:, 1]

**Notes**
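- The `lbfgs` solver is deterministic, so training results are reproducible for
fixed data and hyperparameters. `random_state` only affects the `sag`, `saga`
and `liblinear` solvers; it is set here to keep the configuration consistent
with future models.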

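- Predicted probabilities represent the model’s estimated likelihood that a
transaction is fraudulent. Because `class_weight="balanced"` upweights the
fraud class, these values are shifted upward relative to the true fraud rate
and are best read as ranking scores rather than calibrated probabilities.
Using probabilities instead of hard predictions allows flexible threshold
selection and better control over the precision–recall trade-off, which is
critical in fraud detection.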
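
The threshold trade-off described in the note above can be illustrated with a small sketch. The helper `precision_recall_at` and the arrays `y_true_demo` and `y_score_demo` are hypothetical and exist only for this illustration; the actual metrics are computed in 04_evaluation.

```python
import numpy as np

def precision_recall_at(y_true, y_score, threshold):
    """Return (precision, recall) for the positive class at a given threshold."""
    y_true = np.asarray(y_true)
    y_hat = (np.asarray(y_score) >= threshold).astype(int)
    tp = int(np.sum((y_hat == 1) & (y_true == 1)))
    fp = int(np.sum((y_hat == 1) & (y_true == 0)))
    fn = int(np.sum((y_hat == 0) & (y_true == 1)))
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

# Hypothetical labels and scores, for illustration only
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
y_score_demo = np.array([0.05, 0.10, 0.20, 0.35, 0.55, 0.60, 0.45, 0.70, 0.80, 0.95])

for t in (0.3, 0.65, 0.9):
    p, r = precision_recall_at(y_true_demo, y_score_demo, t)
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data, raising the threshold increases precision and lowers recall, which is the trade-off that threshold tuning navigates.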

### Model Outputs
The `model_outputs` dictionary collects all raw outputs produced by the modeling stage.
It acts as a standardized interface between the modeling and evaluation steps.

- **model_name**: Identifier for the trained model, useful when comparing multiple models.
- **y_pred_proba**: Predicted probabilities for the positive class (fraud = 1).
  These probabilities allow flexible threshold selection during evaluation.
- **y_test**: Ground truth labels for the test set, required for computing evaluation metrics.

Storing outputs in a single structure improves code organization and makes the
evaluation stage independent from the training logic.
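
Since the dictionary serves as an interface, a consumer may want to check its structure before using it. The following is a minimal sketch of such a check; `validate_model_outputs`, `REQUIRED_KEYS` and `demo_outputs` are hypothetical names introduced here and are not part of the pipeline.

```python
import numpy as np

# Keys described in the list above
REQUIRED_KEYS = {"model_name", "y_pred_proba", "y_test"}

def validate_model_outputs(outputs):
    """Check that a model_outputs dictionary has the expected keys and shapes."""
    missing = REQUIRED_KEYS - set(outputs)
    if missing:
        raise KeyError(f"missing keys: {sorted(missing)}")

    proba = np.asarray(outputs["y_pred_proba"], dtype=float)
    labels = np.asarray(outputs["y_test"])

    if proba.ndim != 1 or proba.shape[0] != labels.shape[0]:
        raise ValueError("y_pred_proba must be 1-D and aligned with y_test")
    if np.any((proba < 0.0) | (proba > 1.0)):
        raise ValueError("y_pred_proba must lie in [0, 1]")
    if not set(np.unique(labels).tolist()) <= {0, 1}:
        raise ValueError("y_test must contain only the labels 0 and 1")
    return True

# Hypothetical example values
demo_outputs = {
    "model_name": "logistic_regression_baseline",
    "y_pred_proba": np.array([0.1, 0.9, 0.4]),
    "y_test": np.array([0, 1, 0]),
}
print(validate_model_outputs(demo_outputs))
```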


In [None]:
model_outputs = {
    "model_name": "logistic_regression_baseline",
    "y_pred_proba": y_pred_proba,
    "y_test": y_test
}

In [None]:
# Ensure models directory exists
os.makedirs("../models", exist_ok=True)

# Save the trained model to disk for reproducibility and future inference
joblib.dump(model, "../models/logistic_regression_baseline.pkl")

# Save raw model outputs (predicted probabilities and ground truth)
# to be consumed by the evaluation stage independently from training
joblib.dump(model_outputs, "../artifacts/model_outputs_baseline.pkl")


print("Model and model outputs saved successfully.")
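
As a rough sketch of how a later stage could read the saved outputs back, the block below performs a `joblib` round trip in a temporary directory so that it does not touch `../artifacts`. The helper `load_model_outputs` and the demo values are assumptions made for illustration; the real loading code lives in 04_evaluation.

```python
import os
import tempfile

import joblib
import numpy as np

def load_model_outputs(path):
    """Load a model_outputs dictionary previously saved with joblib.dump."""
    if not os.path.exists(path):
        raise FileNotFoundError(f"no model outputs found at {path}")
    return joblib.load(path)

# Round trip in a temporary directory (hypothetical demo values)
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "model_outputs_baseline.pkl")
    joblib.dump(
        {
            "model_name": "logistic_regression_baseline",
            "y_pred_proba": np.array([0.2, 0.8]),
            "y_test": np.array([0, 1]),
        },
        demo_path,
    )
    restored = load_model_outputs(demo_path)

print(restored["model_name"], restored["y_pred_proba"].shape)
```

Files written by `joblib.dump` are pickle-based, so they should only be loaded from trusted sources.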