# Modeling: Predicting Employee Attrition

## 1. Notebook overview

This notebook trains and evaluates classification models to predict employee attrition.

Objectives:
- Load preprocessed training and test sets
- Train baseline models
- Evaluate model performance using classification metrics
- Tune hyperparameters for best-performing models
- Save final model for interpretation in the next notebook

## 2. Load preprocessed datasets

We load the training and test sets exported by the preprocessing notebook:
- `X_train.csv` and `y_train.csv` (original training split, before resampling)
- `X_train_resampled.csv` and `y_train_resampled.csv` (SMOTE-balanced training set)
- `X_test.csv` and `y_test.csv` (untouched test set)
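
A quick way to confirm that the resampled training labels are balanced while the test labels keep their original distribution is to compare class fractions. The sketch below uses only the standard library and toy label lists; `class_balance` and both toy variables are hypothetical names, not part of the preprocessing notebook.

```python
from collections import Counter


def class_balance(labels):
    """Return {label: fraction of samples} for an iterable of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Toy labels (hypothetical): 1 = attrition, 0 = stayed
y_resampled_toy = [0, 1] * 50        # balanced, as a SMOTE-resampled set should be
y_test_toy = [0] * 84 + [1] * 16     # imbalanced, like an untouched test split

print(class_balance(y_resampled_toy))
print(class_balance(y_test_toy))
```

For a pandas Series of labels, `y.value_counts(normalize=True)` reports the same fractions.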

In [None]:
import pandas as pd

# Define base directory
base_path = "../data/processed"

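# Load CSVs
# Targets are squeezed to 1-D Series (assumes single-column CSVs), which
# scikit-learn expects for y and which avoids a DataConversionWarning on fit
X_train = pd.read_csv(f"{base_path}/X_train.csv")
y_train = pd.read_csv(f"{base_path}/y_train.csv").squeeze("columns")

X_test = pd.read_csv(f"{base_path}/X_test.csv")
y_test = pd.read_csv(f"{base_path}/y_test.csv").squeeze("columns")

X_train_resampled = pd.read_csv(f"{base_path}/X_train_resampled.csv")
y_train_resampled = pd.read_csv(f"{base_path}/y_train_resampled.csv").squeeze("columns")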

# --- Shape confirmation ---
print("Data loaded successfully:")
print(f"  X_train: {X_train.shape}, y_train: {y_train.shape}")
print(f"  X_test:  {X_test.shape}, y_test:  {y_test.shape}")
print(f"  X_train_resampled: {X_train_resampled.shape}, y_train_resampled: {y_train_resampled.shape}")

# --- Null checks ---
datasets = {
    "X_train": X_train,
    "y_train": y_train,
    "X_test": X_test,
    "y_test": y_test,
    "X_train_resampled": X_train_resampled,
    "y_train_resampled": y_train_resampled
}

print("\n🔍 Null value check:")
for name, df in datasets.items():
    nulls = df.isnull().sum().sum()
    if nulls > 0:
        print(f"{name} contains {nulls} null values.")
    else:
        print(f"{name} has no nulls.")

# --- Column consistency check ---
if list(X_train.columns) != list(X_test.columns) or list(X_train.columns) != list(X_train_resampled.columns):
    print("\nWarning: Column mismatch between training/test/resampled sets.")
else:
    print("\nAll feature sets have consistent columns.")


## 3. Train baseline model (untuned)

We train a baseline model to establish an initial performance benchmark.

Candidate models:
- Logistic Regression
- Random Forest
- XGBoost or CatBoost (if available in the environment)

The cell below trains the first of these, an untuned Logistic Regression, and evaluates it using accuracy, recall, precision, F1-score, and ROC AUC.
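
To make the threshold-based metrics concrete, the sketch below derives accuracy, precision, recall, and F1 from the four confusion-matrix counts. It is a hand-rolled illustration on toy labels, assuming 0/1 labels with 1 as the positive (attrition) class; `binary_metrics` is a hypothetical helper, and the notebook itself relies on `sklearn.metrics` for these values.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for 0/1 labels (1 = positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # attrition correctly flagged
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # stayers correctly passed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed attrition cases

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy example: 4 actual leavers, 6 stayers
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
metrics_toy = binary_metrics(y_true_toy, y_pred_toy)
print(metrics_toy)
```

Here accuracy is 0.7 even though half of the leavers are missed (recall 0.5), which is why recall and F1 are reported alongside accuracy.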

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    ConfusionMatrixDisplay,
    roc_auc_score,
    RocCurveDisplay
)

import matplotlib.pyplot as plt

# Initialize and train model
logreg = LogisticRegression(max_iter=1000, random_state=42)
logreg.fit(X_train_resampled, y_train_resampled)

# Predict on test set
y_pred = logreg.predict(X_test)
y_proba = logreg.predict_proba(X_test)[:, 1]

# --- Evaluation metrics ---
print("📊 Classification Report:")
print(classification_report(y_test, y_pred, digits=3))

print(f"\n🔍 ROC AUC: {roc_auc_score(y_test, y_proba):.3f}")

# --- Confusion Matrix ---
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=logreg.classes_)
disp.plot(cmap='Blues')
plt.title("Confusion Matrix - Logistic Regression")
plt.show()

# --- ROC Curve ---
RocCurveDisplay.from_predictions(y_test, y_proba)
plt.title("ROC Curve - Logistic Regression")
plt.show()

## 4. Hyperparameter tuning (Logistic Regression)

We optimize the Logistic Regression model using `GridSearchCV` with stratified 5-fold cross-validation.

### Tuned parameters:
- `penalty`: Regularization type (`l1`, `l2`)
- `C`: Inverse regularization strength (smaller = stronger penalty)
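- `class_weight`: Reweights classes inversely to their frequency when set to `balanced`; on the already SMOTE-balanced training set this should have little or no effect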
- `solver`: Optimization algorithm
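
To get a feel for the cost of the search, the sketch below enumerates the combinations of the four parameters above and multiplies by the five cross-validation folds. It uses `itertools.product` as a stand-in for the enumeration `GridSearchCV` performs internally; the values of `C` are assumed to be the five powers of ten from 0.001 to 10.

```python
from itertools import product

# Same search space as described above (C values are an assumption: 10^-3 ... 10^1)
search_space = {
    "penalty": ["l1", "l2"],
    "C": [10.0 ** exponent for exponent in range(-3, 2)],
    "solver": ["liblinear"],
    "class_weight": [None, "balanced"],
}

# Every combination of one value per parameter is one candidate model
names = list(search_space)
candidates = [dict(zip(names, combo)) for combo in product(*search_space.values())]

n_folds = 5
n_cv_fits = len(candidates) * n_folds

print(f"{len(candidates)} candidates x {n_folds} folds = {n_cv_fits} fits")
```

Adding another solver or more values of `C` multiplies the number of fits, so the grid is best kept small for a baseline model.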

### Scoring:
- Primary metric: **ROC AUC**
- Cross-validation: **Stratified 5-fold**

After tuning, `GridSearchCV` refits the best configuration on the full training set (`refit=True` by default), and we evaluate it again on the untouched test set.
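
Stratification means each fold keeps roughly the same class proportions as the full training set. The sketch below illustrates the idea with a simplified round-robin assignment per class; it is not scikit-learn's exact fold assignment, and `stratified_folds` is a hypothetical helper.

```python
from collections import defaultdict


def stratified_folds(labels, n_splits=5):
    """Assign sample indices to folds so class proportions are roughly preserved."""
    indices_by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        indices_by_class[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    for indices in indices_by_class.values():
        # Deal each class's samples out to the folds in turn
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds


# Toy labels: 40 negatives, 10 positives (20% positive rate)
labels_toy = [0] * 40 + [1] * 10
folds_toy = stratified_folds(labels_toy, n_splits=5)

for k, fold in enumerate(folds_toy):
    positives = sum(labels_toy[i] for i in fold)
    print(f"fold {k}: size={len(fold)}, positives={positives}")
```

Passing `cv=5` to `GridSearchCV` with a classifier uses stratified folds automatically, so no explicit splitter is needed in the notebook.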

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
import numpy as np

# Create logistic regression pipeline with scaler
pipe = Pipeline([
    ('scaler', StandardScaler()),  # Redundant but ensures safe scaling inside CV folds
    ('logreg', LogisticRegression(max_iter=1000, random_state=42))
])

# Define parameter grid
param_grid = {
    'logreg__penalty': ['l1', 'l2'],
    'logreg__C': np.logspace(-3, 1, 5),  # [0.001, 0.01, 0.1, 1, 10]
    'logreg__solver': ['liblinear'],  # supports both l1 and l2
    'logreg__class_weight': [None, 'balanced']
}

# Initialize grid search
grid = GridSearchCV(
    pipe,
    param_grid=param_grid,
    scoring='roc_auc',
    cv=5,
    n_jobs=-1,
    verbose=1
)

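# Fit to resampled training data
# Note: SMOTE was applied before cross-validation, so validation folds contain synthetic
# samples and the CV score may be optimistic compared with the untouched test set
grid.fit(X_train_resampled, y_train_resampled)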

# Best model
best_model = grid.best_estimator_

# Results
print("✅ Grid Search complete.")
print(f"Best ROC AUC: {grid.best_score_:.3f}")
print("Best parameters:")
for k, v in grid.best_params_.items():
    print(f"  {k}: {v}")

## Comparison: Untuned vs Tuned Logistic Regression

We compare performance between the original (default) logistic regression model and the tuned model from GridSearchCV using the same evaluation metrics:

- Classification report
- ROC AUC
- Confusion matrix
- ROC curve

This helps determine whether hyperparameter tuning produced a meaningful gain in predictive performance and class separation.
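
ROC AUC can be read as a measure of class separation: it is the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counted as half. The sketch below computes it by brute force over all positive/negative pairs on toy data; `pairwise_auc` is a hypothetical helper meant only to illustrate what `roc_auc_score` reports.

```python
def pairwise_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos_scores = [s for t, s in zip(y_true, scores) if t == 1]
    neg_scores = [s for t, s in zip(y_true, scores) if t == 0]

    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos_scores) * len(neg_scores))


# Toy example: one of the four positive/negative pairs is ranked the wrong way round
y_true_toy = [0, 0, 1, 1]
scores_toy = [0.1, 0.4, 0.35, 0.8]
auc_toy = pairwise_auc(y_true_toy, scores_toy)
print(auc_toy)
```

Because AUC depends only on the ranking of scores, it does not change when the decision threshold is moved.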

In [None]:
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, roc_auc_score,
    classification_report, confusion_matrix, ConfusionMatrixDisplay, RocCurveDisplay
)

import matplotlib.pyplot as plt

# --- Untuned model prediction (logreg from earlier cell) ---
y_pred_untuned = logreg.predict(X_test)
y_proba_untuned = logreg.predict_proba(X_test)[:, 1]

# --- Tuned model prediction ---
y_pred_tuned = best_model.predict(X_test)
y_proba_tuned = best_model.predict_proba(X_test)[:, 1]

# --- Metric comparison ---
def summarize_metrics(name, y_true, y_pred, y_proba):
    print(f"\n{name}")
    print("-" * len(name))
    print(f"Accuracy:  {accuracy_score(y_true, y_pred):.3f}")
    print(f"Precision: {precision_score(y_true, y_pred):.3f}")
    print(f"Recall:    {recall_score(y_true, y_pred):.3f}")
    print(f"F1 Score:  {f1_score(y_true, y_pred):.3f}")
    print(f"AUC:       {roc_auc_score(y_true, y_proba):.3f}")

summarize_metrics("Untuned Logistic Regression", y_test, y_pred_untuned, y_proba_untuned)
summarize_metrics("Tuned Logistic Regression", y_test, y_pred_tuned, y_proba_tuned)

# --- Side-by-side confusion matrix plots ---
fig, ax = plt.subplots(1, 2, figsize=(12, 4))

ConfusionMatrixDisplay.from_predictions(y_test, y_pred_untuned, ax=ax[0], cmap="Blues")
ax[0].set_title("Untuned Logistic Regression")

ConfusionMatrixDisplay.from_predictions(y_test, y_pred_tuned, ax=ax[1], cmap="Greens")
ax[1].set_title("Tuned Logistic Regression")

plt.tight_layout()
plt.show()

# --- Overlaid ROC curves (both models on one axis) ---
fig, ax = plt.subplots(1, 1, figsize=(6, 5))
RocCurveDisplay.from_predictions(y_test, y_proba_untuned, ax=ax, name="Untuned")
RocCurveDisplay.from_predictions(y_test, y_proba_tuned, ax=ax, name="Tuned")
ax.set_title("ROC Curve Comparison")
plt.show()

### Precision-Recall Curve: Tuned vs Untuned

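The Precision-Recall (PR) curve is useful for evaluating model performance when the dataset is imbalanced: neither precision nor recall depends on the number of true negatives, so a large majority class cannot mask poor performance on the minority class.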

- **Precision**: What proportion of predicted positives are truly positive?
- **Recall**: What proportion of actual positives are correctly predicted?

We compare PR curves for both the untuned and tuned logistic regression models to assess how well they retrieve the positive class (`Attrition = 1`) across different thresholds.
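
When two PR curves are hard to compare by eye, a single-number summary can help. The sketch below computes average precision, the mean of the precision values at each rank where a positive is retrieved. This metric is an addition not used in the notebook; the helper `average_precision` and the two toy score lists are hypothetical, and the implementation assumes all scores are distinct.

```python
def average_precision(y_true, scores):
    """Average precision for 0/1 labels, assuming all scores are distinct."""
    ranked = sorted(zip(scores, y_true), reverse=True)  # highest score first
    n_positives = sum(y_true)

    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at the rank where recall increases
    return total / n_positives


# Toy scores from two hypothetical models on the same six samples
y_true_toy = [0, 0, 1, 0, 1, 1]
proba_a = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]  # one negative outranks a positive
proba_b = [0.1, 0.3, 0.8, 0.2, 0.7, 0.9]  # all positives ranked first

ap_a = average_precision(y_true_toy, proba_a)
ap_b = average_precision(y_true_toy, proba_b)
print(f"Model A AP: {ap_a:.3f}, Model B AP: {ap_b:.3f}")
```

In scikit-learn, `average_precision_score(y_test, y_proba)` provides this summary directly.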

In [None]:
from sklearn.metrics import precision_recall_curve, PrecisionRecallDisplay

# Calculate precision-recall points
prec_untuned, rec_untuned, _ = precision_recall_curve(y_test, y_proba_untuned)
prec_tuned, rec_tuned, _ = precision_recall_curve(y_test, y_proba_tuned)

# Plot both
plt.figure(figsize=(6, 5))
PrecisionRecallDisplay(precision=prec_untuned, recall=rec_untuned).plot(ax=plt.gca(), label="Untuned")
PrecisionRecallDisplay(precision=prec_tuned, recall=rec_tuned).plot(ax=plt.gca(), label="Tuned")

plt.title("Precision-Recall Curve Comparison")
plt.legend(loc="lower left")
plt.grid(True)
plt.show()

## Threshold tuning (Decision boundary adjustment)

By default, `predict` labels a sample as positive when its predicted probability exceeds 0.5.

However, in imbalanced classification (like attrition), moving this threshold can:
- Improve **recall** by lowering it (catch more attrition cases)
- Improve **precision** by raising it (fewer false positives, in most cases)
- Maximize **F1 score** by balancing the two

We analyze how precision, recall, and F1 score change across thresholds and select a custom decision boundary.
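
A threshold chosen on the same data used for the final evaluation tends to make the reported metrics look better than they would on new data. One option, sketched below under the assumption that a separate validation split is available, is to pick the threshold there and only then apply it to the test set. All names and toy values here are hypothetical.

```python
def f1_at(y_true, scores, threshold):
    """F1 score when samples with score >= threshold are labeled positive."""
    tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
    fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
    fn = sum(1 for t, s in zip(y_true, scores) if t == 1 and s < threshold)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def best_f1_threshold(y_true, scores):
    """Try each observed score as a threshold; keep the one with the highest F1."""
    return max(sorted(set(scores)), key=lambda thr: f1_at(y_true, scores, thr))


# Hypothetical validation split: used only to choose the threshold
y_val = [0, 0, 0, 1, 0, 1, 1, 1]
p_val = [0.05, 0.20, 0.30, 0.35, 0.45, 0.55, 0.70, 0.90]
chosen_threshold = best_f1_threshold(y_val, p_val)

# Hypothetical test split: evaluated once, at the already-fixed threshold
y_test_toy = [0, 1, 0, 1]
p_test_toy = [0.20, 0.40, 0.50, 0.80]
test_f1 = f1_at(y_test_toy, p_test_toy, chosen_threshold)

print(f"threshold={chosen_threshold}, validation F1="
      f"{f1_at(y_val, p_val, chosen_threshold):.3f}, test F1={test_f1:.3f}")
```

The gap between validation F1 and test F1 in this toy example shows the kind of optimism that disappears from view when selection and evaluation share one dataset.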

In [None]:
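import numpy as np
from sklearn.metrics import precision_recall_curve

# Generate precision, recall, thresholds
# Note: the threshold is selected on the test set here, so metrics reported at this
# threshold are optimistic; a separate validation split would avoid this
prec, rec, thresholds = precision_recall_curve(y_test, y_proba_tuned)
f1s = 2 * (prec * rec) / (prec + rec + 1e-8)  # avoid divide-by-zero

# Find best threshold by max F1
# prec and rec have one more entry than thresholds (the final point has no
# threshold), so the last F1 value is excluded to keep the index valid
best_idx = np.argmax(f1s[:-1])
best_threshold = thresholds[best_idx]
best_f1 = f1s[best_idx]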

print(f"📌 Best threshold (by F1): {best_threshold:.3f}")
print(f"Best F1 score: {best_f1:.3f}")

In [None]:
plt.figure(figsize=(10, 5))

plt.plot(thresholds, prec[:-1], label='Precision', linestyle='--')
plt.plot(thresholds, rec[:-1], label='Recall', linestyle='--')
plt.plot(thresholds, f1s[:-1], label='F1 Score', linewidth=2)

plt.axvline(best_threshold, color='red', linestyle=':')
plt.title("Precision, Recall, and F1 vs Decision Threshold")
plt.xlabel("Threshold")
plt.ylabel("Score")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

In [None]:
# Apply new threshold to probabilities
y_pred_custom = (y_proba_tuned >= best_threshold).astype(int)

# Metrics at custom threshold
print("📊 Classification Report (Custom Threshold):")
print(classification_report(y_test, y_pred_custom, digits=3))

# Confusion Matrix
ConfusionMatrixDisplay.from_predictions(y_test, y_pred_custom, cmap='Purples')
plt.title(f"Confusion Matrix (Threshold = {best_threshold:.2f})")
plt.show()

## 6. Final model selection, evaluation, and export

We finalize the best-performing model (tuned Logistic Regression), evaluate it on the test set using our custom threshold, and export it for use in the explainability notebook.

Steps:
- Retrain tuned model on full SMOTE-resampled training set
- Apply custom decision threshold (from F1 optimization)
- Evaluate using classification report, confusion matrix, ROC, and PR curve
- Export the trained model and threshold using `joblib`
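
Note that the exported model's own `predict` method still uses the default 0.5 cutoff, so any consumer has to apply the saved threshold to `predict_proba` output. The sketch below shows one way to package that step; `ThresholdedClassifier` and `StubModel` are hypothetical names, and the stub merely stands in for the fitted pipeline.

```python
class ThresholdedClassifier:
    """Wrap a fitted probabilistic classifier and apply a fixed decision threshold."""

    def __init__(self, model, threshold):
        self.model = model
        self.threshold = threshold

    def predict(self, X):
        # Column 1 of predict_proba is assumed to hold the positive-class probability
        return [int(row[1] >= self.threshold) for row in self.model.predict_proba(X)]


class StubModel:
    """Stand-in for the fitted pipeline: treats each input as its own probability."""

    def predict_proba(self, X):
        return [[1.0 - p, p] for p in X]


# With the default 0.5 cutoff only the last sample would be flagged;
# a lower custom threshold also flags the second one
clf = ThresholdedClassifier(StubModel(), threshold=0.35)
labels = clf.predict([0.10, 0.40, 0.60])
print(labels)
```

Keeping the threshold next to the model, as the export step does, lets the explainability notebook reproduce exactly the same predictions.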


In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    roc_auc_score,
    ConfusionMatrixDisplay,
    RocCurveDisplay,
    precision_recall_curve,
    PrecisionRecallDisplay
)
import joblib
import matplotlib.pyplot as plt
import os

# Rebuild final model using best parameters
final_model = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression(
        penalty=grid.best_params_['logreg__penalty'],
        C=grid.best_params_['logreg__C'],
        solver=grid.best_params_['logreg__solver'],
        class_weight=grid.best_params_['logreg__class_weight'],
        max_iter=1000,
        random_state=42
    ))
])

# Retrain on full resampled training set
final_model.fit(X_train_resampled, y_train_resampled)

# Predict on test set
y_proba_final = final_model.predict_proba(X_test)[:, 1]
y_pred_final = (y_proba_final >= best_threshold).astype(int)

# --- Evaluation ---
print("📊 Final Model (Custom Threshold) Evaluation:")
print(classification_report(y_test, y_pred_final, digits=3))
print(f"ROC AUC: {roc_auc_score(y_test, y_proba_final):.3f}")

# --- Confusion Matrix ---
ConfusionMatrixDisplay.from_predictions(y_test, y_pred_final, cmap="PuBu")
plt.title(f"Confusion Matrix (Final Model, Threshold = {best_threshold:.2f})")
plt.show()

# --- ROC Curve ---
RocCurveDisplay.from_predictions(y_test, y_proba_final)
plt.title("ROC Curve – Final Logistic Regression")
plt.grid(True)
plt.show()

# --- Precision-Recall Curve ---
prec, rec, _ = precision_recall_curve(y_test, y_proba_final)
PrecisionRecallDisplay(precision=prec, recall=rec).plot()
plt.title("Precision-Recall Curve – Final Logistic Regression")
plt.grid(True)
plt.show()

# --- Export model and threshold ---
os.makedirs("../models", exist_ok=True)
joblib.dump(final_model, "../models/logreg_final_model.joblib")
joblib.dump(best_threshold, "../models/logreg_threshold.joblib")

print("Final model and threshold exported to '../models/'")
