# Modeling and Evaluation

## 1. Notebook Overview

This notebook builds, tunes, and evaluates machine learning models for employee attrition prediction using the cleaned dataset and the saved preprocessing pipeline from the previous notebook.

The focus here is on interpretability. Logistic Regression is used as the primary model because its coefficients map directly to each feature's effect on the odds of attrition, which makes it well suited to identifying the factors most strongly associated with attrition.

Specifically, this notebook:

- Loads the cleaned dataset (`data_01.csv`)
- Loads the saved preprocessing pipeline (`preprocessing_pipeline.pkl`)
- Applies the transformations defined during preprocessing via the pipeline (feature engineering, encoding, scaling)
- Trains and evaluates a Logistic Regression classifier
- Assesses performance using both cross-validation and test set metrics
- Interprets results using explainability techniques to highlight drivers of attrition

In [None]:
import pandas as pd
import joblib
from IPython.display import display  # explicit import so `display` also works outside Jupyter

## 2. Load Dataset and Preprocessing Pipeline

- Load `data_01.csv` (unaltered clean dataset)
- Load `preprocessing_pipeline.pkl` using `joblib`
- Confirm compatibility and inspect sample rows

In [None]:
# Load dataset and preprocessing pipeline

# Load the unaltered clean dataset
df = pd.read_csv('../data/processed/data_01.csv')

# Load the saved preprocessing pipeline
preprocessing_pipeline = joblib.load('../models/preprocessing_pipeline.pkl')

# Confirm structure of raw data
print("Shape of data_01.csv:", df.shape)
display(df.head())

# Check pipeline input compatibility by transforming a few rows,
# so that missing columns or dtype mismatches surface early.
# This assumes the loaded pipeline is already fitted.
try:
    _ = preprocessing_pipeline.transform(df.head(3))
    print("Dataset is compatible with the preprocessing pipeline.")
except Exception as e:
    print("Compatibility check failed.")
    print("Error:", e)

## 3. Train-Test Split

- Separate features (`X`) and target (`y`)
- Perform stratified train-test split (preserving class distribution)
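
A minimal sketch of the split described above. The column names (`Age`, `MonthlyIncome`, `Attrition`) and the synthetic data are assumptions standing in for `data_01.csv`, whose actual schema is not shown here; the 80/20 ratio is likewise illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for data_01.csv; real column names may differ.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "Age": rng.integers(18, 60, size=200),
    "MonthlyIncome": rng.integers(1000, 20000, size=200),
    "Attrition": np.array([1] * 40 + [0] * 160),  # imbalanced target, 20% positive
})

# Separate features and target
X = df.drop(columns=["Attrition"])
y = df["Attrition"]

# stratify=y keeps the attrition rate the same in both partitions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print("Train attrition rate:", y_train.mean())
print("Test attrition rate:", y_test.mean())
```

Stratification matters most when the positive class is rare: without it, a small test set can end up with too few attrition cases to give stable recall or ROC-AUC estimates.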

## 4. Build Full Modeling Pipeline

- Append model (e.g. `LogisticRegression`) to preprocessing pipeline
- Optionally add oversampling (e.g. `SMOTE`, inside an `imblearn` pipeline so resampling is applied only when fitting) to handle class imbalance
- Define complete pipeline for training and evaluation
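
One way the full pipeline could be assembled is sketched below. The `ColumnTransformer` is a hypothetical stand-in for the saved `preprocessing_pipeline.pkl`, and the column names are assumed. To avoid an extra dependency, the sketch handles imbalance with `class_weight="balanced"` rather than `SMOTE`; that is a substitution, not the oversampling approach itself.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical stand-in for the saved preprocessing pipeline
preprocessing = ColumnTransformer([
    ("num", StandardScaler(), ["Age", "MonthlyIncome"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["OverTime"]),
])

# Chain preprocessing and the classifier into a single estimator.
# class_weight="balanced" reweights classes instead of oversampling.
model_pipeline = Pipeline([
    ("preprocessing", preprocessing),
    ("model", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "Age": rng.integers(18, 60, size=n),
    "MonthlyIncome": rng.integers(1000, 20000, size=n),
    "OverTime": rng.choice(["Yes", "No"], size=n),
})
y = (X["OverTime"] == "Yes").astype(int)

model_pipeline.fit(X, y)
predictions = model_pipeline.predict(X)
print(predictions[:10])
```

Keeping preprocessing and model in one `Pipeline` means cross-validation refits the scaler and encoder on each training fold, which avoids leaking test-fold statistics into training.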

## 5. Model Training and Evaluation

- Fit model on training data
- Predict on test set
- Evaluate with metrics:
  - Accuracy
  - Precision
  - Recall
  - F1-score
  - ROC-AUC
- Display confusion matrix and ROC curve
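
The evaluation steps above might look roughly like the following. The data comes from `make_classification` purely for illustration, so the printed numbers say nothing about the attrition dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the attrition dataset
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)               # hard labels at the 0.5 threshold
y_proba = clf.predict_proba(X_test)[:, 1]  # positive-class probabilities for ROC-AUC

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred, zero_division=0),
    "recall": recall_score(y_test, y_pred, zero_division=0),
    "f1": f1_score(y_test, y_pred, zero_division=0),
    "roc_auc": roc_auc_score(y_test, y_proba),
}
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)
```

For the plots, `ConfusionMatrixDisplay.from_predictions(y_test, y_pred)` and `RocCurveDisplay.from_predictions(y_test, y_proba)` from `sklearn.metrics` draw the confusion matrix and ROC curve when matplotlib is installed. On imbalanced data, accuracy alone can look high even when recall on the attrition class is poor, which is why the other metrics are reported alongside it.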

## 6. Hyperparameter Tuning

- Use `GridSearchCV` or `Optuna` to optimize hyperparameters
- Cross-validate model performance
- Compare tuned vs. baseline model results
- Save best-performing model
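
A sketch of the tuning loop using `GridSearchCV`, assuming a simple scaler-plus-model pipeline and a small grid over the regularization strength `C`. The grid values, the `roc_auc` scoring choice, and the `best_model.pkl` filename are illustrative assumptions; the model is written to a temporary directory rather than the project's `models/` folder.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data, for illustration only
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=0
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Baseline: default hyperparameters, scored with the same folds
baseline_auc = cross_val_score(pipeline, X, y, scoring="roc_auc", cv=cv).mean()

# Tuned: search over inverse regularization strength C
param_grid = {"model__C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=cv)
search.fit(X, y)

print("Baseline CV ROC-AUC:", round(baseline_auc, 3))
print("Tuned CV ROC-AUC:", round(search.best_score_, 3))
print("Best params:", search.best_params_)

# Persist the refitted best estimator (hypothetical filename)
model_path = os.path.join(tempfile.mkdtemp(), "best_model.pkl")
joblib.dump(search.best_estimator_, model_path)
```

Because the default `C=1.0` is part of the grid and both runs share the same folds, the tuned score can never be lower than the baseline here; a large gap would suggest the default regularization is a poor fit for the data.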

## 7. Model Explainability

- Apply SHAP and/or LIME for feature attribution
- Visualize global feature importance
- Analyze local predictions and edge cases
- Identify drivers of attrition risk
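
SHAP and LIME are separate packages, so the sketch below uses a simpler stand-in for global importance: the coefficients of a Logistic Regression fitted on standardized features. This is not SHAP or LIME output. The feature names and the data-generating process are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data where attrition depends on two of three hypothetical features
rng = np.random.default_rng(0)
n = 1000
X = pd.DataFrame({
    "OverTimeHours": rng.normal(size=n),
    "YearsAtCompany": rng.normal(size=n),
    "Noise": rng.normal(size=n),
})
logits = 2.0 * X["OverTimeHours"] - 1.0 * X["YearsAtCompany"]
y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)

# Standardize so coefficient magnitudes are comparable across features
X_scaled = StandardScaler().fit_transform(X)
clf = LogisticRegression(max_iter=1000).fit(X_scaled, y)

# exp(coefficient) is the odds ratio per one-standard-deviation increase
importance = pd.DataFrame({
    "feature": X.columns,
    "coefficient": clf.coef_[0],
    "odds_ratio": np.exp(clf.coef_[0]),
}).sort_values("coefficient", key=lambda s: s.abs(), ascending=False)

print(importance.to_string(index=False))
```

A positive coefficient raises the predicted odds of attrition and a negative one lowers them. These are associations within the model, not causal effects, and correlated features can split or mask each other's coefficients.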

## 8. Business Insights and Final Summary

- Summarize model performance and key findings
- Highlight impactful features driving attrition
- Provide actionable recommendations for stakeholders
- Outline potential next steps or deployment considerations