# Final Report – Predicting Employee Attrition with Explainable ML

## 1. Project Overview

This project analyzes the IBM HR Analytics Employee Attrition dataset to predict whether an employee is likely to leave the company.

The goals are to:
- Build a reliable classification model to identify attrition risk
- Interpret the model's decisions using explainability tools
- Provide actionable insights for HR decision-makers

Tools used: Python, pandas, scikit-learn, SHAP, LIME, matplotlib

## 2. Dataset Summary

- **Source**: IBM HR Analytics Employee Attrition dataset (Kaggle)
- **Rows**: 1470 employees
- **Features**: 35 columns including age, department, income, job satisfaction, etc.
- **Target**: `Attrition` (Yes = left, No = stayed)

Notable characteristics:
- No missing values
- Imbalanced target (≈16% attrition, about one employee in six)
- Mix of numerical and categorical variables
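
The characteristics listed above can be checked with a few lines of pandas. The sketch below is illustrative only: `summarize_attrition` is a hypothetical helper, and the small `demo` frame is made-up data standing in for the real CSV, which would normally be loaded with `pd.read_csv`.

```python
import pandas as pd


def summarize_attrition(df: pd.DataFrame, target: str = "Attrition") -> dict:
    """Report size, missing cells, constant columns, and the positive-class rate."""
    # A column with a single distinct value carries no signal for the model.
    constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
    return {
        "n_rows": len(df),
        "n_columns": df.shape[1],
        "n_missing": int(df.isna().sum().sum()),
        "constant_columns": constant_cols,
        "attrition_rate": float((df[target] == "Yes").mean()),
    }


# Made-up rows for illustration; not taken from the real dataset.
demo = pd.DataFrame({
    "Age": [41, 49, 37, 33, 27, 32],
    "OverTime": ["Yes", "No", "Yes", "Yes", "No", "No"],
    "StandardHours": [80, 80, 80, 80, 80, 80],
    "Attrition": ["Yes", "No", "No", "No", "No", "No"],
})
summary = summarize_attrition(demo)
print(summary)
```

The same helper also surfaces candidates for the "constant columns" dropped during preprocessing.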

## 3. Methodology

The project followed a standard machine learning workflow:

1. **EDA**: Feature distributions, correlation analysis, and outlier detection
2. **Preprocessing**:
   - Dropped constant and non-informative columns
   - Engineered features (e.g., tenure ratios, satisfaction averages)
   - Encoded categorical variables
   - Scaled numerical features
3. **Class imbalance**: Handled using SMOTE oversampling
4. **Modeling**:
   - Trained Logistic Regression with GridSearchCV
   - Tuned hyperparameters and threshold
   - Evaluated using accuracy, precision, recall, F1, ROC AUC, and the confusion matrix
5. **Explainability**:
   - Global and local explanations using SHAP
   - Per-instance breakdowns using LIME
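
Steps 2 to 4 can be sketched as follows. This is not the project's code: it runs on synthetic data from `make_classification`, and `smote_oversample` is a hand-written, simplified illustration of the SMOTE idea (interpolating between a minority row and one of its nearest minority neighbours) rather than a library implementation. The parameter grid is also an assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler


def smote_oversample(X, y, minority=1, k=5, seed=0):
    """Minimal SMOTE: add synthetic minority rows until the classes are balanced."""
    rng = np.random.default_rng(seed)
    X_min = X[y == minority]
    n_new = int((y != minority).sum() - len(X_min))
    if n_new <= 0:
        return X, y
    # Ask for k + 1 neighbours and drop the first, which is the row itself.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    neighbours = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))
    # Each synthetic row lies on the segment between a minority row and a neighbour.
    X_new = X_min[base] + gap * (X_min[picked] - X_min[base])
    return np.vstack([X, X_new]), np.concatenate([y, np.full(n_new, minority)])


# Synthetic stand-in with roughly 16% positives, mirroring the attrition rate.
X, y = make_classification(n_samples=1000, n_features=10, weights=[0.84], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=0
)

# Fit the scaler on training data only, then oversample the training split only.
scaler = StandardScaler().fit(X_train)
X_res, y_res = smote_oversample(scaler.transform(X_train), y_train)

grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": [0.01, 0.1, 1, 10]},  # assumed grid, for illustration
    scoring="f1",
    cv=5,
)
grid.fit(X_res, y_res)
test_f1 = grid.score(scaler.transform(X_test), y_test)
print(grid.best_params_, round(test_f1, 3))
```

The test split is never resampled, so its scores reflect the original class balance. One caveat of this simplified version: oversampling before cross-validation lets synthetic rows leak between folds. Resampling inside each fold, for example with the pipeline class from imbalanced-learn, is the safer choice.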

## 4. Final Model Performance (Logistic Regression)

| Metric       | Value |
|--------------|-------|
| Accuracy     | X.XXX |
| Precision    | X.XXX |
| Recall       | X.XXX |
| F1 Score     | X.XXX |
| ROC AUC      | X.XXX |
| Threshold    | X.XXX (optimized via F1) |

Predictions use a decision threshold chosen to maximize F1 instead of the default 0.5.

The confusion matrix and ROC curve show that the model identifies attrition cases well while keeping false positives low.
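
One common way to pick an F1-optimal threshold is to scan the candidate thresholds returned by scikit-learn's `precision_recall_curve`. The sketch below shows the idea on made-up labels and probabilities; `best_f1_threshold` is a hypothetical helper, and the numbers are not the project's results.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_recall_curve


def best_f1_threshold(y_true, y_prob):
    """Return the probability threshold with the highest F1, and that F1."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
    # The final precision/recall pair has no matching threshold, so drop it.
    p, r = precision[:-1], recall[:-1]
    f1 = np.divide(2 * p * r, p + r, out=np.zeros_like(p), where=(p + r) > 0)
    best = int(np.argmax(f1))
    return float(thresholds[best]), float(f1[best])


# Made-up validation labels and predicted attrition probabilities.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
y_prob = np.array([0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.45, 0.60, 0.80])

threshold, f1 = best_f1_threshold(y_true, y_prob)
default_f1 = f1_score(y_true, (y_prob >= 0.5).astype(int))
tuned_f1 = f1_score(y_true, (y_prob >= threshold).astype(int))
print(threshold, round(tuned_f1, 3), round(default_f1, 3))
```

In this toy case the tuned threshold falls below 0.5 and recovers a leaver the default cutoff misses. The threshold should be selected on validation data and only then applied to the test set, otherwise the reported metrics are optimistic.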

## 5. Explainability Highlights

- **Top global predictors** (SHAP):
  - `OverTime`, `JobLevel`, `Age`, `MonthlyIncome`, `DistanceFromHome`
- **False negatives** were often low-income, high-commute cases with no overtime
- **SHAP waterfall plots** clarified how feature values push predictions toward/away from attrition
- **LIME explanations** of individual predictions were consistent with the SHAP attributions

Together, SHAP and LIME made the model’s logic transparent and easier to trust.
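
For a linear model such as logistic regression, SHAP values have a simple closed form when features are treated as independent: in log-odds units, each feature contributes its coefficient times the distance of its value from the background mean. The sketch below computes this directly with NumPy on synthetic data, as a way to see what the global ranking and the waterfall plots are built from. It does not call the `shap` library that the project used, and the independence assumption is a simplification.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the preprocessed feature matrix.
X, y = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
model = LogisticRegression(max_iter=1000).fit(X, y)

background_mean = X.mean(axis=0)
# Contribution of feature j for row i, in log-odds units: w_j * (x_ij - mean_j).
contributions = (X - background_mean) * model.coef_[0]
# Base value: the model's log-odds at the average feature vector.
base_value = model.intercept_[0] + background_mean @ model.coef_[0]

# Local view: base value plus a row's contributions reconstructs its log-odds,
# which is the additive breakdown a waterfall plot displays.
log_odds = model.decision_function(X)
assert np.allclose(base_value + contributions.sum(axis=1), log_odds)

# Global view: rank features by mean absolute contribution.
global_importance = np.abs(contributions).mean(axis=0)
ranking = np.argsort(global_importance)[::-1]
print(ranking, np.round(global_importance[ranking], 3))
```

Because the contributions are in log-odds, a positive value pushes the prediction toward attrition and a negative value pushes it away.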

## 6. Business Insights

- Employees with **frequent overtime** and **low job levels** are at elevated attrition risk.
- High-income employees are generally less likely to leave, unless other stress signals are present (e.g., low satisfaction, long commute).
- Some at-risk cases may go unnoticed without explainable models; the choice of threshold and the ranking of predicted probabilities both matter.

This model can help HR teams proactively identify and support at-risk employees.
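
To illustrate why probability ranking matters alongside the threshold, the sketch below compares a hard cutoff with a ranked shortlist. The employee IDs, probabilities, and the 0.50 cutoff are all made up for illustration, and `top_risk` is a hypothetical helper.

```python
import numpy as np


def top_risk(employee_ids, probabilities, k=3):
    """Return the k employees with the highest predicted attrition probability."""
    order = np.argsort(probabilities)[::-1][:k]
    return [(employee_ids[i], float(probabilities[i])) for i in order]


ids = ["E01", "E02", "E03", "E04", "E05"]
probs = np.array([0.12, 0.47, 0.81, 0.38, 0.05])
threshold = 0.50  # hypothetical cutoff

# A hard cutoff flags only one employee here.
flagged = [e for e, p in zip(ids, probs) if p >= threshold]
# Ranking also surfaces the near-threshold cases.
shortlist = top_risk(ids, probs, k=3)
print(flagged, shortlist)
```

A ranked shortlist lets HR review borderline cases that a binary label would hide.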

## 7. Limitations & Future Work

**Limitations:**
- Small dataset (1470 rows)
- Based on simulated HR data, not real company logs
- Interpretability limited by feature granularity

**Future improvements:**
- Test on real organizational data
- Compare logistic regression with tree-based models (e.g., XGBoost with SHAP)
- Integrate explanations into HR dashboards for ongoing monitoring

## 8. Conclusion

This project delivered a transparent employee attrition prediction pipeline. Through careful preprocessing, tuning, and integrated explainability, the final model balances predictive power with interpretability, making it a reasonable basis for HR decision support once it has been validated on real organizational data.

Next steps: deployment, monitoring, and business alignment.