✅ PROJECT: HR People Analytics & Attrition Insights Using Employee Attrition Dataset

This project will include:

- Data Cleaning & Validation (Python)
- Exploratory HR Analytics
- Attrition Insights Dashboard (Power BI)
- Predictive Attrition Model (Python)
- Executive Insights Report (PDF)
- GitHub-Ready Project Structure

## Step 1 — Install dependencies

In [None]:
!pip install matplotlib seaborn scikit-learn



## Step 2 — Load Data

In [None]:
import pandas as pd


train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

train.head(), test.head()

(   Employee ID  Age  Gender  Years at Company    Job Role  Monthly Income  \
 0         8410   31    Male                19   Education            5390   
 1        64756   59  Female                 4       Media            5534   
 2        30257   24  Female                10  Healthcare            8159   
 3        65791   36  Female                 7   Education            3989   
 4        65026   56    Male                41   Education            4821   
 
   Work-Life Balance Job Satisfaction Performance Rating  Number of Promotions  \
 0         Excellent           Medium            Average                     2   
 1              Poor             High                Low                     3   
 2              Good             High                Low                     0   
 3              Good             High               High                     1   
 4              Fair        Very High            Average                     0   
 
    ... Number of Dependents  Job Le
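The project outline lists data validation, but the load step only previews the first rows. A minimal sketch of a few basic checks follows. The helper name `validation_report` and the toy frame are hypothetical and stand in for `train` and `test`; only the `Employee ID` column name comes from the preview above.

```python
import pandas as pd

def validation_report(df: pd.DataFrame, id_col: str = "Employee ID") -> dict:
    """Summarize basic data-quality checks for one DataFrame."""
    return {
        "rows": len(df),
        "missing_cells": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        # Repeated IDs are counted only if the ID column exists
        "duplicate_ids": int(df[id_col].duplicated().sum()) if id_col in df.columns else 0,
    }

# Toy frame standing in for train.csv (values are illustrative only)
toy = pd.DataFrame({
    "Employee ID": [1, 2, 2, 4],
    "Age": [31, 59, 59, None],
    "Gender": ["Male", "Female", "Female", "Male"],
})
report = validation_report(toy)
print(report)
```

In the notebook, the same helper could be called as `validation_report(train)` and `validation_report(test)` right after loading.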

## Step 3 — Split train.csv into features & target

In [None]:
target_col = "Attrition"  # Name of the target column
X = train.drop(columns=[target_col])
# Encode the target: 1 = employee left, 0 = employee stayed
y = train[target_col].map({'Left': 1, 'Stayed': 0})
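`Series.map` with a dictionary turns any label missing from the dictionary into `NaN` without raising an error, so it may be worth confirming that every label was mapped. The sketch below uses a toy label series with a deliberate typo; the variable names are illustrative only.

```python
import pandas as pd

# Toy labels; the last entry is a deliberate typo that the mapping does not cover
labels = pd.Series(["Left", "Stayed", "Stayed", "left"])
mapped = labels.map({"Left": 1, "Stayed": 0})

# Labels that were not covered by the mapping end up as NaN
unmapped = labels[mapped.isna()].unique().tolist()

# Share of each class among the successfully mapped rows
class_share = mapped.value_counts(normalize=True, dropna=True)

print("Unmapped labels:", unmapped)
print(class_share)
```

Applied to the notebook, `y.isna().sum()` should be 0 before the model is trained.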

## Step 4 — Combine train+test for shared preprocessing

In [None]:
combined = pd.concat([X, test], axis=0)

combined.shape



(74498, 24)
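Concatenating before encoding gives both splits the same dummy columns. One thing to check first is whether `test.csv` also carries the target column (an assumption here, easy to verify with `test.columns`); if it does, it would be one-hot encoded as a feature. A hedged sketch on toy frames, with all data values invented for illustration:

```python
import pandas as pd

target_col = "Attrition"
toy_train = pd.DataFrame({"Job Role": ["Media", "Education"],
                          "Attrition": ["Left", "Stayed"]})
toy_test = pd.DataFrame({"Job Role": ["Healthcare"],
                         "Attrition": ["Stayed"]})

X_toy = toy_train.drop(columns=[target_col])
# errors="ignore" makes the drop a no-op when the column is absent
test_toy = toy_test.drop(columns=[target_col], errors="ignore")

combined_toy = pd.concat([X_toy, test_toy], axis=0)
encoded = pd.get_dummies(combined_toy, drop_first=True)

# Both splits now share the same encoded columns, and the target is not among them
print(encoded.columns.tolist())
```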

## Step 5 — Clean + encode

In [None]:
combined = pd.get_dummies(combined, drop_first=True)

X_processed = combined.iloc[:len(train)]
test_processed = combined.iloc[len(train):]

## Step 6 — Train/Test split

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(
    X_processed, y, test_size=0.2, random_state=42
)
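The split above is purely random. As an optional variation, passing `stratify` keeps the class proportions the same in both parts. The sketch below uses synthetic arrays, not the project data, to show the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 30% positive class
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 30 + [0] * 70)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratification the 30% positive share is preserved in the validation part
print(len(y_va), y_va.sum())
```

In the notebook this would correspond to adding `stratify=y` to the existing call. Doing so changes which rows land in the validation set, so the reported metrics would shift slightly.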

## Step 7 — Train a model

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score

model = RandomForestClassifier(
    n_estimators=300,
    max_depth=12,
    random_state=42
)
model.fit(X_train, y_train)

preds = model.predict(X_valid)
probs = model.predict_proba(X_valid)[:,1]

print(classification_report(y_valid, preds))
print("ROC-AUC:", roc_auc_score(y_valid, probs))

              precision    recall  f1-score   support

           0       0.76      0.77      0.76      6253
           1       0.74      0.73      0.73      5667

    accuracy                           0.75     11920
   macro avg       0.75      0.75      0.75     11920
weighted avg       0.75      0.75      0.75     11920

ROC-AUC: 0.8353125915124531
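`model.predict` applies a fixed 0.5 cut-off, while ROC-AUC summarizes all cut-offs at once. Since the risk bands used later rely on 0.40 and 0.70, it can help to see how precision and recall move with the threshold. A small sketch on made-up labels and probabilities (the helper `metrics_at` is hypothetical):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up labels and predicted probabilities, for illustration only
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.45, 0.42, 0.8, 0.65, 0.3, 0.9, 0.72])

def metrics_at(threshold: float) -> tuple:
    """Return (precision, recall) when predicting 1 for prob >= threshold."""
    y_pred = (y_prob >= threshold).astype(int)
    return precision_score(y_true, y_pred), recall_score(y_true, y_pred)

for t in (0.40, 0.50, 0.70):
    p, r = metrics_at(t)
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```

With the real model, `y_valid` and `probs` would take the place of the toy arrays.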


## Step 8 — Predict on test.csv

In [None]:
test_probs = model.predict_proba(test_processed)[:,1]
submission = pd.DataFrame({
    "EmployeeID": test["Employee ID"],
    "Attrition_Probability": test_probs
})
submission.head()

   EmployeeID  Attrition_Probability
0       52685               0.44909
1       30585               0.642482
2       54656               0.124587
3       33442               0.786209
4       15667               0.057866


In [None]:
def risk_level(p):
    if p >= 0.70:
        return "High Risk"
    elif p >= 0.40:
        return "Medium Risk"
    else:
        return "Low Risk"

submission["Risk_Level"] = submission["Attrition_Probability"].apply(risk_level)
submission.head()

   EmployeeID  Attrition_Probability   Risk_Level
0       52685               0.44909   Medium Risk
1       30585               0.642482  Medium Risk
2       54656               0.124587     Low Risk
3       33442               0.786209    High Risk
4       15667               0.057866     Low Risk
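The same banding can be written without a Python-level function by using `pd.cut`. This is an alternative sketch, not the notebook's method; `right=False` makes each bin include its lower edge, which matches the `>=` comparisons in `risk_level`.

```python
import pandas as pd

# The five probabilities shown above plus the two boundary values
probs = pd.Series([0.44909, 0.642482, 0.124587, 0.786209, 0.057866, 0.70, 0.40])

risk = pd.cut(
    probs,
    bins=[-float("inf"), 0.40, 0.70, float("inf")],
    labels=["Low Risk", "Medium Risk", "High Risk"],
    right=False,  # bins are [low, high), so 0.40 is Medium and 0.70 is High
)
print(risk.tolist())
```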


## Step 9 — Save output

In [None]:
submission.to_csv("submission.csv", index=False)

In [None]:
from google.colab import files  # Colab-only helper; `files` is undefined without this import

files.download("submission.csv")


## Feature Importance Insights (Why Are They Leaving?)

In [None]:
# Impurity-based feature importances from the fitted Random Forest
import pandas as pd

importance = pd.DataFrame({
    'feature': X_train.columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

importance.head(20)

                   feature  importance
30        Job Level_Senior    0.160339
28   Marital Status_Single    0.125238
33         Remote Work_Yes    0.102559
27  Marital Status_Married    0.062111
5       Distance from Home    0.049993
4     Number of Promotions    0.036983
2         Years at Company    0.035137
6     Number of Dependents    0.033876
0              Employee ID    0.033836
3           Monthly Income    0.033122
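`Employee ID` appears in this ranking even though it is an identifier. Impurity-based importances can give credit to columns with many unique values, so permutation importance on held-out data is a useful cross-check. The sketch below runs on synthetic data with a hypothetical `row_id` column standing in for an identifier; it is not a rerun of the project model.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    "signal": rng.normal(size=400),  # the only column related to the label
    "row_id": np.arange(400),        # identifier-like column with no real signal
})
toy_y = (toy_X["signal"] > 0).astype(int)

# Fit on the first 300 rows, evaluate importance on the remaining 100
toy_model = RandomForestClassifier(n_estimators=50, random_state=0)
toy_model.fit(toy_X.iloc[:300], toy_y.iloc[:300])

result = permutation_importance(
    toy_model, toy_X.iloc[300:], toy_y.iloc[300:], n_repeats=5, random_state=0
)
perm = pd.Series(result.importances_mean, index=toy_X.columns)
perm = perm.sort_values(ascending=False)
print(perm)
```

In the notebook, the equivalent call would be `permutation_importance(model, X_valid, y_valid, ...)`. With 300 trees and about 12,000 validation rows it takes noticeably longer than reading `feature_importances_`.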


## SHAP Explainability
- Global factors driving attrition

- Individual employee explanations

- Summary plots

In [None]:
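# Requires the shap package (pip install shap) and model/X_valid from Steps 6-7
import shap

# Sample from the validation data so SHAP runs faster
sample = X_valid.sample(300, random_state=42)  # 300 rows keeps the runtime short

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(sample)

# Depending on the SHAP version, shap_values is a list with one array per class
# or a single (samples, features, classes) array. Select class 1 ("Left") either way.
class1_shap = shap_values[1] if isinstance(shap_values, list) else shap_values[..., 1]
shap.summary_plot(class1_shap, sample)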

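In [None]:
!pip install shap

import shap
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_valid)  # full validation set; slower than a 300-row sample

# Handle both return shapes: list per class (older SHAP) or 3-D array (newer SHAP)
class1_shap = shap_values[1] if isinstance(shap_values, list) else shap_values[..., 1]
shap.summary_plot(class1_shap, X_valid)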

