# Using machine learning to identify clusters of 'at risk' employees.

In [47]:
## Step 1 - Import necessary libraries
import pandas as pd
import sklearn
import joblib  # for saving an ml model

# Robust data-file finder and loader (uses a relative path for reading)
from pathlib import Path
import os


Identifying causes of attrition and enabling the business to identify different groups of 'at risk' employees is key. 

Steps 1-5 - This notebook builds a model on the 1,470 rows of data currently available (split into 'train' and 'test' sets) to identify whether there are logical groupings of exited staff, so that future departures can be anticipated and, if desired, attempts made to retain those employees.

Step 6 - we will then look to cluster the data to identify any features that would help the business devise a retention strategy for distinct groups of leavers.

### Step 1 - import the cleaned data

In [48]:
# Load cleaned data using an explicit relative path
df = pd.read_csv('../Data files/HR_Attrition_Cleaned.csv')

print(df.info())
print(df.head())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 39 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Age                           1470 non-null   int64  
 1   Attrition                     1470 non-null   object 
 2   BusinessTravel                1470 non-null   object 
 3   DailyRate                     1470 non-null   int64  
 4   Department                    1470 non-null   object 
 5   DistanceFromHome              1470 non-null   int64  
 6   Education                     1470 non-null   object 
 7   EducationField                1470 non-null   object 
 8   EnvironmentSatisfaction       1470 non-null   object 
 9   Gender                        1470 non-null   object 
 10  HourlyRate                    1470 non-null   int64  
 11  JobInvolvement                1470 non-null   object 
 12  JobLevel                      1470 non-null   int64  
 13  Job

### Step 2 - Data preparation for ML

Next, as we're working with a mix of numeric and string columns, we'll identify the data types in readiness for encoding. Our target column (Attrition) must be excluded from both lists so that it is not treated as a feature.

In [49]:
target_col = "Attrition"

numeric_cols = df.select_dtypes(include=["int64", "float64"]).columns.tolist()
categorical_cols = df.select_dtypes(include=["object", "category"]).columns.tolist()

# Remove target from both lists if present
numeric_cols = [col for col in numeric_cols if col != target_col]
categorical_cols = [col for col in categorical_cols if col != target_col]

Now we need to split our data into a training set and a testing set.

In [50]:
from sklearn.model_selection import train_test_split

X = df.drop(target_col, axis=1)
y = df[target_col]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101
)

print("Train shape:", X_train.shape, y_train.shape)
print("Test shape:", X_test.shape, y_test.shape)

Train shape: (1029, 38) (1029,)
Test shape: (441, 38) (441,)
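
The split above samples rows at random, so the share of leavers in the train and test sets can drift from the overall rate. Below is a minimal sketch of how the `stratify` argument of `train_test_split` keeps that share aligned. The labels are synthetic (an assumption for illustration: 1,470 rows with roughly 16% 'Yes'), not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels (assumption): 1,470 rows, roughly 16% "Yes"
y_demo = np.array(["No"] * 1233 + ["Yes"] * 237)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

# stratify=y_demo asks for the same class mix in both parts of the split
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=101
)

overall_rate = np.mean(y_demo == "Yes")
train_rate = np.mean(y_tr == "Yes")
test_rate = np.mean(y_te == "Yes")
print(f"overall={overall_rate:.3f} train={train_rate:.3f} test={test_rate:.3f}")
```

With a minority class this small, stratifying is a cheap way to make sure the test set contains a representative number of leavers.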


Now we need to pre-process our data so that the pipeline can handle the different data types.

In [51]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=False), categorical_cols)
])

### Step 3 - build pipeline, create model

Now the pipeline model can be built. We are using Random Forest because it handles mixed data sets, can capture complex interactions (ydata-profiling already identified that there's no single strong correlation for Attrition) and can rank feature importance. It can also be adapted to 'imbalanced' data sets like attrition, where most rows sit on the 'still employed' side of the data, for example through class weighting (used in Step 4).

In [52]:
from sklearn.ensemble import RandomForestClassifier

pipeline = Pipeline([
    ("preprocessing", preprocessor),
    ("model", RandomForestClassifier(random_state=42))
])

Now we train the model using the 'train' data set.

In [53]:
pipeline.fit(X_train, y_train)
print("Model score:", pipeline.score(X_test, y_test))

Model score: 0.8526077097505669


While a score of 0.85 indicates a strong accuracy rate, this can be misleading when the classes are not evenly split. So we need to check how well the model predicts 'Yes' for attrition specifically, rather than relying on a single score averaged across both groups.

In [None]:
# Show the split of data between Yes and No for attrition
df["Attrition"].value_counts(normalize=True)

Attrition
No     0.838776
Yes    0.161224
Name: proportion, dtype: float64
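
To put the 0.85 accuracy in context, here is a small sketch of a baseline that always predicts the majority class. The label counts are an assumption derived from the proportions above applied to 1,470 rows (1,233 'No' and 237 'Yes'), and the feature column is a placeholder.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Label counts implied by the proportions above (assumption)
y_demo = np.array(["No"] * 1233 + ["Yes"] * 237)
X_demo = np.zeros((len(y_demo), 1))  # features are ignored by this baseline

# A "model" that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
y_base = baseline.predict(X_demo)

baseline_accuracy = accuracy_score(y_demo, y_base)
baseline_recall_yes = recall_score(y_demo, y_base, pos_label="Yes")
print(f"accuracy={baseline_accuracy:.3f} recall for Yes={baseline_recall_yes:.3f}")
```

A classifier that never predicts a leaver still scores about 0.84 on accuracy, so accuracy alone says little about how well leavers are identified.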

In [None]:
# Check precision and recall for each of the two groups
from sklearn.metrics import classification_report
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

          No       0.86      0.98      0.92       371
         Yes       0.65      0.16      0.25        70

    accuracy                           0.85       441
   macro avg       0.75      0.57      0.59       441
weighted avg       0.83      0.85      0.81       441



The model is right 65% of the time when it predicts attrition to be Yes (precision) but it's only picking up 16% of the actual leavers (recall).

So we need to tune the model to improve how many leavers it catches (recall), not just its overall accuracy.
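
The arithmetic behind those two figures can be checked by hand. The counts below are reconstructed from the rounded figures in the report (an inference, not an output of the notebook): 11 of the 70 leavers caught, with 6 stayers wrongly flagged, reproduces the precision, recall and accuracy shown.

```python
# Confusion counts inferred from the rounded report above (assumption)
tp = 11   # leavers correctly predicted "Yes"
fp = 6    # stayers wrongly predicted "Yes"
fn = 59   # leavers missed (predicted "No")
tn = 365  # stayers correctly predicted "No"

precision_yes = tp / (tp + fp)              # of the "Yes" predictions, how many were right
recall_yes = tp / (tp + fn)                 # of the actual leavers, how many were found
accuracy = (tp + tn) / (tp + fp + fn + tn)  # share of all predictions that were right

print(f"precision={precision_yes:.2f} recall={recall_yes:.2f} accuracy={accuracy:.4f}")
```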

### Step 4 - improve outcomes

We have several options to refine the model. We'll look at two of those options:

1) weight the mistakes more strongly, to take attrition more seriously, and
2) change the balance of Yes and No cases (create some fake-but-similar Yes rows or remove some of the No rows).

We'll do both of these, with 2) first to improve the training data, followed by 1).
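
For option 1), the sketch below shows how scikit-learn's `"balanced"` class weighting turns class counts into weights. The label counts are an assumption based on the full dataset proportions. In the notebook the weights are computed on the training split, so the exact values will differ slightly.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Label counts implied by the 83.9% / 16.1% split over 1,470 rows (assumption)
y_demo = np.array(["No"] * 1233 + ["Yes"] * 237)
classes = np.array(["No", "Yes"])

# "balanced" weights are n_samples / (n_classes * count_of_class)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)
weight_by_class = dict(zip(classes.tolist(), weights.tolist()))
print(weight_by_class)
```

Each mistake on a 'Yes' row counts roughly five times as much as a mistake on a 'No' row, which is how weighting makes the model take attrition more seriously.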

In [69]:
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report

# Step 0: Encode categorical columns in X
X_encoded = X.copy()
label_encoders = {}

for col in X_encoded.select_dtypes(include="object").columns:
    le = LabelEncoder()
    X_encoded[col] = le.fit_transform(X_encoded[col])
    label_encoders[col] = le


In [70]:
# Step 1: Split encoded data
X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, y, stratify=y, test_size=0.2, random_state=42
)


In [71]:
# Step 2: Build pipeline with SMOTE and class-weighted model
pipeline = Pipeline([
    ("scaler", StandardScaler()),  # Optional
    ("smote", SMOTE(random_state=42)),
    ("model", RandomForestClassifier(class_weight="balanced", random_state=42))
])


In [72]:
# Step 3: Fit pipeline
pipeline.fit(X_train, y_train)


In [73]:
# Step 4: Evaluate
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

          No       0.86      0.96      0.91       247
         Yes       0.48      0.21      0.29        47

    accuracy                           0.84       294
   macro avg       0.67      0.58      0.60       294
weighted avg       0.80      0.84      0.81       294



Results are still not strong, and while recall has gone up, precision has decreased.

We will lower the classification threshold (the predicted probability required before the model calls 'Yes') to see if this improves our outcomes.

In [76]:
import numpy as np

# Step 5: Adjust classification threshold
# Get predicted probabilities for the "Yes" class (looked up by label rather than assuming column 1)
yes_idx = list(pipeline.classes_).index("Yes")
y_proba = pipeline.predict_proba(X_test)[:, yes_idx]

# Lower threshold from 0.5 to 0.3 (or test multiple)
y_pred_thresh = (y_proba >= 0.3).astype(int)

# Map numeric predictions back to the original string labels so they match y_test
# (comparing ints against "Yes"/"No" strings was raising a TypeError)
y_pred_labels = np.where(y_pred_thresh == 1, "Yes", "No")

# Evaluate
print(classification_report(y_test, y_pred_labels))

              precision    recall  f1-score   support

          No       0.93      0.81      0.87       247
         Yes       0.41      0.68      0.51        47

    accuracy                           0.79       294
   macro avg       0.67      0.75      0.69       294
weighted avg       0.85      0.79      0.81       294



It's now picking up 68% of the attrition Yes cases, but when it predicts Yes, it's only right 41% of the time.
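
The trade-off can be seen more generally by sweeping several thresholds. The sketch below does this on synthetic data (an assumption: about 16% positives generated with `make_classification`, labels 0/1 rather than No/Yes), so the figures it prints will not match the notebook's.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (assumption): about 16% positives
X_demo, y_demo = make_classification(
    n_samples=1500, n_features=10, weights=[0.84, 0.16], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.2, random_state=42
)
demo_model = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
proba_pos = demo_model.predict_proba(X_te)[:, 1]  # column 1 is class 1 here

thresholds = [0.2, 0.3, 0.4, 0.5, 0.6]
recalls = []
for threshold in thresholds:
    pred = (proba_pos >= threshold).astype(int)
    p = precision_score(y_te, pred, zero_division=0)
    r = recall_score(y_te, pred)
    recalls.append(r)
    print(f"threshold={threshold:.1f} precision={p:.2f} recall={r:.2f}")
```

Recall can only stay the same or fall as the threshold rises, because fewer rows are flagged. Precision usually moves the other way, so the choice of threshold is a business decision about which error is more costly.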

While erring on the side of caution is probably acceptable from an employee retention perspective, we will also run a logistic regression to see if that gives stronger results.

### Step 5 - run and compare alternative model

Logistic regression is another model that might be used to identify leavers, so we will run this to see if the results are more accurate.

In [77]:
from sklearn.linear_model import LogisticRegression
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Step 0: Encode categorical features (if not already done)
# Use X_encoded from earlier

# Step 1: Split encoded data
X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, y, stratify=y, test_size=0.2, random_state=42
)

# Step 2: Build pipeline with SMOTE and weighted Logistic Regression
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("smote", SMOTE(random_state=42)),
    ("model", LogisticRegression(class_weight="balanced", max_iter=1000, random_state=42))
])

# Step 3: Fit and evaluate
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

          No       0.94      0.75      0.84       247
         Yes       0.37      0.77      0.50        47

    accuracy                           0.75       294
   macro avg       0.66      0.76      0.67       294
weighted avg       0.85      0.75      0.78       294



Conclusion: the precision of the logistic regression model for leavers is low (0.37), but when deploying a retention strategy it is better to include some staff who are not actually at risk than to miss genuine leavers. On that basis this model gives slightly better results: it captures 77% of actual leavers, compared with 68% for the Random Forest at the 0.3 threshold.
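
One way to make the 'recall matters more than precision' preference explicit is an F-beta score with beta = 2, which weights recall more heavily. Using F2 is an assumption introduced here for illustration, not a metric the notebook computes. The inputs are the rounded 'Yes' figures from the two reports above.

```python
def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """F-beta score: beta > 1 weights recall more heavily than precision."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# "Yes"-class figures copied from the two classification reports above
rf_thresholded = {"precision": 0.41, "recall": 0.68}  # Random Forest, threshold 0.3
log_reg = {"precision": 0.37, "recall": 0.77}         # Logistic regression

f2_rf = f_beta(**rf_thresholded)
f2_lr = f_beta(**log_reg)
print(f"F2 Random Forest={f2_rf:.3f}  F2 logistic regression={f2_lr:.3f}")
```

On these figures the logistic regression scores higher on F2, which is consistent with preferring it when missed leavers are the costlier error.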

Now we can save this model so it can be deployed against future tranches of the same data.

In [81]:
# Save the pipeline to a file
joblib.dump(pipeline, "logistic_attrition_model.pkl")

['logistic_attrition_model.pkl']
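
Because the saved pipeline was trained on label-encoded features, future data needs to be encoded in the same way before scoring. The sketch below shows one possible approach: saving the model and its fitted encoders in a single file. It uses toy data, a plain scikit-learn pipeline without SMOTE, and a hypothetical file name `attrition_bundle.pkl`, all of which are assumptions.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Toy data (assumption) standing in for the HR features
toy_X = pd.DataFrame({
    "Age": [25, 40, 33, 51, 29, 45],
    "Department": ["Sales", "R&D", "HR", "R&D", "Sales", "HR"],
})
toy_y = ["Yes", "No", "No", "No", "Yes", "No"]

# Encode the string column and keep the fitted encoder for later reuse
toy_encoders = {"Department": LabelEncoder().fit(toy_X["Department"])}
toy_X_encoded = toy_X.copy()
toy_X_encoded["Department"] = toy_encoders["Department"].transform(toy_X["Department"])

toy_model = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
]).fit(toy_X_encoded, toy_y)

# Save model and encoders together (hypothetical file name), then reload
bundle_path = os.path.join(tempfile.mkdtemp(), "attrition_bundle.pkl")
joblib.dump({"pipeline": toy_model, "label_encoders": toy_encoders}, bundle_path)
bundle = joblib.load(bundle_path)

# Score a new row: encode with the saved encoder, then predict
new_row = pd.DataFrame({"Age": [31], "Department": ["Sales"]})
new_row["Department"] = bundle["label_encoders"]["Department"].transform(new_row["Department"])
reloaded_pred = bundle["pipeline"].predict(new_row)
print(reloaded_pred)
```

Keeping the encoders next to the model avoids a silent mismatch where the same category is mapped to a different integer on a later tranche of data.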

### Clustering - finding structure in the attrition data

We'll now look at whether there are any patterns in our attrition data, to see if this provides insight on how retention strategies might be developed.

In [84]:
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Step 1: Scale features (excluding target)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_encoded)

# Step 2: Run KMeans (try 5 clusters)
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)  # n_init set explicitly to avoid the default-change warning
cluster_labels = kmeans.fit_predict(X_scaled)

# Step 3: Add cluster and attrition info to a new DataFrame
cluster_df = X_encoded.copy()  # copy so the added columns never leak back into X_encoded
cluster_df["Cluster"] = cluster_labels
cluster_df["Attrition"] = y.values  # original string labels

# Step 4: Profile clusters
cluster_summary = cluster_df.groupby("Cluster")["Attrition"].value_counts(normalize=True).unstack().fillna(0)
print(cluster_summary)

Attrition        No       Yes
Cluster                      
0          0.910000  0.090000
1          0.896296  0.103704
2          0.825581  0.174419
3          0.694969  0.305031
4          0.879070  0.120930




| K-Means clusters | Outcomes | Summary |
|------------------|----------|---------|
| 3 clusters | <img src="KMeans 3 cluster.png" width="60%"> | One group above the average for the total data set. Group 1 is 'very safe'. |
| 4 clusters | <img src="KMeans 4 cluster.png" width="50%"> | Two groups above the average for the total data set. |
| 5 clusters | <img src="KMeans 5 cluster.png" width="60%"> | Two groups above the average for the total data set. |

Having run 3-, 4- and 5-cluster tests, 5 clusters gives us two groups that would be worth assessing for a possible retention strategy (the dataset average for attrition is 16.1%). While 4 clusters also gives us two groups, the 5-cluster grouping is slightly more tailored by virtue of the additional split.

Group 3 contains around 30% leavers and Group 2 around 17%. So Group 3 is our highest priority, but there is no harm in considering a strategy for Group 2, which is marginally above average.
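
The selection of above-average groups can also be done programmatically. This is a minimal sketch using the 'Yes' shares copied from the 5-cluster summary and the dataset-wide rate from earlier. The variable names are assumptions.

```python
import pandas as pd

# "Yes" share per cluster, copied from the 5-cluster summary printed above
yes_rate = pd.Series(
    {0: 0.090000, 1: 0.103704, 2: 0.174419, 3: 0.305031, 4: 0.120930},
    name="Yes",
)
dataset_yes_rate = 0.161224  # overall attrition rate from value_counts

# Keep clusters whose leaver share exceeds the dataset-wide rate
above_average = yes_rate[yes_rate > dataset_yes_rate]
flagged_clusters = above_average.index.tolist()
lift = (above_average / dataset_yes_rate).round(2)  # multiple of the average rate

print(flagged_clusters)
print(lift)
```

Expressing each group as a multiple of the average makes the difference in priority clear: Group 3 sits at nearly twice the dataset rate, while Group 2 is only slightly above it.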

Now we need to look at which common features cause these groups to be clustered together.

In [90]:
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Step 1: Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_encoded)

# Step 2: Run KMeans with 5 clusters
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)  # n_init set explicitly to avoid the default-change warning
cluster_labels = kmeans.fit_predict(X_scaled)

# Step 3: Build full cluster DataFrame
cluster_df = pd.DataFrame(X_encoded.copy())
cluster_df["Cluster"] = cluster_labels
cluster_df["Attrition"] = y.values  # original string labels

# Step 4: Filter high-risk clusters
high_risk_clusters = cluster_df[cluster_df["Cluster"].isin([2, 3])]

# Step 5: Restrict to numeric columns only
numeric_cols = high_risk_clusters.select_dtypes(include=["int64", "float64"]).columns

profile_summary = high_risk_clusters.groupby("Cluster")[numeric_cols].mean().T
print("Feature Averages for Clusters 2 and 3:\n", profile_summary)

# Step 6: Attrition breakdown
attrition_counts = high_risk_clusters.groupby("Cluster")["Attrition"].value_counts(normalize=True).unstack().fillna(0)
print("\nAttrition Rates:\n", attrition_counts)


Feature Averages for Clusters 2 and 3:
 Cluster                                  2             3
Age                              35.563953     29.754717
DailyRate                       795.290698    811.503145
DistanceFromHome                  9.959302      8.503145
HourlyRate                       65.186047     66.584906
JobLevel                          1.709302      1.279874
MonthlyIncome                  4889.052326   3455.443396
MonthlyRate                   13820.232558  14676.128931
NumCompaniesWorked                2.482558      1.716981
PercentSalaryHike                21.715116     14.632075
StockOptionLevel                  0.843023      0.641509
TotalWorkingYears                 9.250000      4.100629
TrainingTimesLastYear             2.767442      2.867925
YearsAtCompany                    6.226744      2.515723
YearsInCurrentRole                4.273256      1.248428
YearsSinceLastPromotion           1.831395      0.575472
YearsWithCurrManager              4.191860      



In [91]:
# Step 1: Compute dataset-wide averages
dataset_avg = pd.DataFrame(X_encoded.select_dtypes(include=["int64", "float64"]).mean(), columns=["Dataset_Avg"])

# Step 2: Get cluster averages for clusters 2 and 3
cluster_avg = high_risk_clusters.groupby("Cluster")[dataset_avg.index].mean().T

# Step 3: Concatenate for comparison
comparison_table = pd.concat([cluster_avg, dataset_avg], axis=1)
print("Cluster vs Dataset Averages:\n", comparison_table)

Cluster vs Dataset Averages:
                                          2             3   Dataset_Avg
Age                              35.563953     29.754717     36.923810
DailyRate                       795.290698    811.503145    802.485714
DistanceFromHome                  9.959302      8.503145      9.192517
HourlyRate                       65.186047     66.584906     65.891156
JobLevel                          1.709302      1.279874      2.063946
MonthlyIncome                  4889.052326   3455.443396   6502.931293
MonthlyRate                   13820.232558  14676.128931  14313.103401
NumCompaniesWorked                2.482558      1.716981      2.693197
PercentSalaryHike                21.715116     14.632075     15.209524
StockOptionLevel                  0.843023      0.641509      0.793878
TotalWorkingYears                 9.250000      4.100629     11.279592
TrainingTimesLastYear             2.767442      2.867925      2.799320
YearsAtCompany                    6.226744     
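
Raw averages are easier to compare once they are expressed as a percentage gap from the dataset average. The sketch below does this for a few rows copied from the table above. The column labels `cluster_2` and `cluster_3` are assumptions for readability.

```python
import pandas as pd

# A few rows copied from the comparison table printed above
comparison = pd.DataFrame(
    {
        "cluster_2": [35.563953, 1.709302, 4889.052326, 9.250000],
        "cluster_3": [29.754717, 1.279874, 3455.443396, 4.100629],
        "Dataset_Avg": [36.923810, 2.063946, 6502.931293, 11.279592],
    },
    index=["Age", "JobLevel", "MonthlyIncome", "TotalWorkingYears"],
)

# Percentage gap of each cluster mean relative to the dataset mean
pct_gap = (
    comparison[["cluster_2", "cluster_3"]]
    .sub(comparison["Dataset_Avg"], axis=0)
    .div(comparison["Dataset_Avg"], axis=0)
    .mul(100)
    .round(1)
)
print(pct_gap)
```

On these rows both groups sit below the dataset average, with the gap noticeably larger for cluster 3, for example on monthly income and total working years.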

So this is telling us a couple of things for our strategy development:

- Group 3 (highest risk) is relatively young (avg age 29.8), in low-level roles (avg job level 1.28) and has only been at the company for an average of 2.5 years.