# Employee Attrition Prediction

## Third Notebook: Behavioral Archetypes Integration and Global EDA

This notebook builds upon the previous clustering analysis, where employee behavioral archetypes were identified based on attendance patterns, workload, and schedule volatility. The purpose here is to integrate these archetypes into the main dataset and perform a comprehensive exploratory data analysis across all features, combining traditional demographic, job-related, and behavioral dimensions.

By incorporating the previously derived archetypes, we aim to better understand the patterns associated with employee attrition and prepare the dataset for subsequent predictive modeling. The focus of this notebook is on feature exploration, distribution analysis, correlations, and potential relationships between behavioral archetypes and attrition outcomes.

**Author**: J-F Jutras  
**Date**: January 2026  
**Dataset**: HR Analytics Case Study — Kaggle


## 3.1-Data Loading and Integration

In [None]:
import kagglehub
import pandas as pd
import os
import pickle

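#Download latest version (returns the local path of the downloaded files)
path = kagglehub.dataset_download("vjchoudhary7/hr-analytics-case-study")

#Define dataset path (hardcoded Kaggle input directory; `path` above is not used below)
dataset_dir = "/kaggle/input/hr-analytics-case-study"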

#Load datasets
employee_csv = os.path.join(dataset_dir, "employee_survey_data.csv")
df_employee = pd.read_csv(employee_csv)

general_csv = os.path.join(dataset_dir, "general_data.csv")
df_general = pd.read_csv(general_csv)

manager_csv = os.path.join(dataset_dir, "manager_survey_data.csv")
df_manager = pd.read_csv(manager_csv)

#Load archetypes from Kaggle input
archetype_path = "/kaggle/input/employee-archetypes/employee_archetypes.pkl"
with open(archetype_path, "rb") as f:
    df_archetypes = pickle.load(f)

#Clone the public GitHub repository "jfj-utils" into the current Kaggle working directory.
#This downloads all files and folders from the repo so they can be used in the notebook.
!rm -rf /kaggle/working/jfj-utils
!git clone https://github.com/jfjutras07/jfj-utils.git

#Add the cloned repository to the Python path so Python can import modules from it
import sys
sys.path.append("/kaggle/working/jfj-utils")

In [None]:
#Check shapes
print(df_general.shape)
print(df_employee.shape)
print(df_manager.shape)
print(df_archetypes.shape)

In [None]:
#Create a list of the 2888 valid EmployeeIDs
valid_ids = df_archetypes['EmployeeID'].unique()

#Filter the other DataFrames to match
df_general = df_general[df_general['EmployeeID'].isin(valid_ids)].copy()
df_employee = df_employee[df_employee['EmployeeID'].isin(valid_ids)].copy()
df_manager = df_manager[df_manager['EmployeeID'].isin(valid_ids)].copy()

#Final shape check
print(f"General: {df_general.shape}")
print(f"Employee: {df_employee.shape}")
print(f"Manager: {df_manager.shape}")
print(f"Archetypes: {df_archetypes.shape}")

In [None]:
#Merge datasets using EmployeeID
df = df_general.copy()
df = df.merge(df_employee, on="EmployeeID", how="left")
df = df.merge(df_manager, on="EmployeeID", how="left")
df = df.merge(df_archetypes, on="EmployeeID", how="left")

print(df.shape) #Expecting 30 columns, not counting EmployeeID
print(df.head())
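
Because the tables are joined on `EmployeeID` with left joins, a duplicated or missing key would silently change the row count or introduce missing values. A minimal sketch of one way to guard against this, using pandas' `validate` and `indicator` merge options on toy frames (the values below are made up for illustration and are not the notebook's data):

```python
import pandas as pd

# Toy stand-ins for df_general and df_archetypes (illustrative values only)
left = pd.DataFrame({"EmployeeID": [1, 2, 3], "Age": [30, 41, 25]})
right = pd.DataFrame({"EmployeeID": [1, 2, 3], "Archetype": ["A", "B", "A"]})

# validate raises a MergeError if EmployeeID is duplicated on either side;
# indicator adds a _merge column recording where each row was found
merged = left.merge(right, on="EmployeeID", how="left",
                    validate="one_to_one", indicator=True)

# Rows present in the left table only would show up here
unmatched = int((merged["_merge"] != "both").sum())
print(merged.shape, "unmatched rows:", unmatched)
```

The `_merge` column would normally be dropped once the check passes.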

### Column Description

| Variable | Description | Values / Encoding |
|---------|-------------|-------------------|
| Age | Age of the employee | Numeric |
| Attrition | Whether the employee left the company in the previous year | Yes / No |
| Archetype | Employee engagement / behavioral archetype | Quiet Disengaged · Chaotic Contributor · Steady Regular · High Performer |
| BusinessTravel | Frequency of business travel during the last year | Non-Travel / Travel_Rarely / Travel_Frequently |
| Department | Department within the company | Text |
| DistanceFromHome | Distance from home to workplace (in km) | Numeric |
| Education | Education level | 1 Below College · 2 College · 3 Bachelor · 4 Master · 5 Doctor |
| EducationField | Field of education | Text |
| EmployeeCount | Employee count | Constant (1) |
| EnvironmentSatisfaction | Satisfaction level with the work environment | 1 Low · 2 Medium · 3 High · 4 Very High |
| Gender | Gender of the employee | Male / Female |
| JobInvolvement | Level of job involvement | 1 Low · 2 Medium · 3 High · 4 Very High |
| JobLevel | Job level within the company | Scale from 1 to 5 |
| JobRole | Job role title | Text |
| JobSatisfaction | Job satisfaction level | 1 Low · 2 Medium · 3 High · 4 Very High |
| MaritalStatus | Marital status of the employee | Single / Married / Divorced |
| MonthlyIncome | Monthly income (in rupees) | Numeric |
| NumCompaniesWorked | Total number of companies worked for | Numeric |
| Over18 | Whether the employee is over 18 years old | Yes |
| PercentSalaryHike | Percentage salary increase in the last year | Numeric |
| PerformanceRating | Performance rating in the last year | 1 Low · 2 Good · 3 Excellent · 4 Outstanding |
| RelationshipSatisfaction | Relationship satisfaction level | 1 Low · 2 Medium · 3 High · 4 Very High |
| StandardHours | Standard working hours | Numeric |
| StockOptionLevel | Employee stock option level | 0 to 3 |
| TotalWorkingYears | Total years of work experience | Numeric |
| TrainingTimesLastYear | Number of training sessions attended last year | Numeric |
| WorkLifeBalance | Work-life balance level | 1 Bad · 2 Good · 3 Better · 4 Best |
| YearsAtCompany | Total years spent at the company | Numeric |
| YearsSinceLastPromotion | Years since last promotion | Numeric |
| YearsWithCurrManager | Years under the current manager | Numeric |


## 3.2-Data Overview

In [None]:
from ingestion.readers import check_data
check_data(df)

In [None]:
from eda.describe_structure import describe_structure
describe_structure(df, id_cols = ['EmployeeID'])

This dataset contains 2,888 employee records across 30 variables covering demographics, job characteristics, compensation, career history, satisfaction, and performance. It combines numeric, categorical, and ordinal features, with no duplicate records and only minor missing values in experience- and satisfaction-related variables.

The target variable, Attrition, indicates whether an employee left the company, complemented by the derived Archetype category describing employee profiles. Two constant columns (EmployeeCount and Over18) carry no variance and can be excluded from modeling; StandardHours is dropped alongside them in the later modeling cells.

Employees are mostly mid-career (average age ~37), working primarily in Research & Development, with moderate income dispersion and generally high performance ratings.
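
The helpers `check_data` and `describe_structure` come from the author's `jfj-utils` repository and their internals are not shown here. As a rough pandas-only sketch of the same kind of check (constant columns and share of missing values), on a toy frame with made-up values:

```python
import pandas as pd

# Toy frame: two constant columns, one varying column, one with a missing value
toy = pd.DataFrame({
    "EmployeeCount": [1, 1, 1, 1],
    "Over18": ["Y", "Y", "Y", "Y"],
    "Age": [29, 35, 41, 35],
    "NumCompaniesWorked": [1.0, None, 3.0, 1.0],
})

# Columns with a single distinct non-null value carry no variance
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=True) <= 1]

# Share of missing values per column, in percent
missing_pct = toy.isna().mean().mul(100).round(2)

print(constant_cols)
print(missing_pct)
```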

## 3.3-EDA (Global)

**Exploratory Data Analysis – Roadmap**

| Analysis Level | Focus Area | Key Metrics / Variables (Summary) | Strategic Objective |
|---------------|-----------|------------------------|---------------------|
| **Univariate** | Workforce Baseline Profiling | Age, MonthlyIncome, JobLevel, TotalWorkingYears, YearsAtCompany, Gender | Establish the normative employee profile and understand central tendencies vs. dispersion in the workforce. |
| **Univariate** | Satisfaction & Engagement Distribution | JobSatisfaction, EnvironmentSatisfaction, WorkLifeBalance, JobInvolvement | Identify dominant satisfaction levels and detect early signs of disengagement through skewed or polarized distributions. |
| **Bivariate** | Attrition Differentiators | Attrition × (Age, MonthlyIncome, YearsAtCompany, JobLevel, Education, PercentSalaryHike) | Identify which demographic and career-related factors most strongly differentiate leavers from stayers. |
| **Bivariate** | Job Structure & Mobility Risk | Attrition × (JobRole, Department, BusinessTravel, MaritalStatus, EducationField) | Detect structural roles or contexts with elevated attrition risk and organizational exposure points. |
| **Multivariate** | Correlation & Feature Importance | Three thematic subsets: `Career Trajectory Dynamics`, `Compensation & Recognition Patterns`, `Satisfaction-Engagement Alignment` | Explore monotonic relationships (Spearman correlation) and non-linear variable importance using a Random Forest to identify the strongest predictors of attrition. |
| **Multivariate** | Career Trajectory Dynamics | Attrition × (TotalWorkingYears, YearsAtCompany, YearsSinceLastPromotion, YearsWithCurrManager) | Understand how stagnation, tenure, and managerial stability interact to influence exit decisions. |
| **Multivariate** | Compensation & Recognition Patterns | Attrition × (MonthlyIncome, PercentSalaryHike) | Evaluate whether perceived reward progression aligns with retention outcomes. |
| **Multivariate** | Satisfaction-Engagement Alignment | Attrition × (JobInvolvement, JobSatisfaction, EnvironmentSatisfaction, Archetype) | Identify satisfaction and engagement patterns that may signal attrition risk. |



### Univariate Analysis - Workforce Baseline Profiling

In [None]:
#Define columns to plot
workforce_base_cols = ['Age', 'MonthlyIncome', 'TotalWorkingYears', 'YearsAtCompany']
from visualization.explore_continuous import plot_numeric_distribution
plot_numeric_distribution(df, workforce_base_cols)

In [None]:
from visualization.explore_discrete import plot_discrete_distribution_grid
plot_discrete_distribution_grid(df, ['JobLevel', 'Gender'], figsize = (13, 5))

**Age, MonthlyIncome, JobLevel, TotalWorkingYears, YearsAtCompany, Gender**

| Finding | Description |
|--------|-------------|
| Workforce structure | The workforce is predominantly mid-career, with an average age of 36.9 and half of employees between 30 and 43 years old. |
| Gender distribution | The population is male-dominated (~60% men, 40% women), showing a moderate but persistent gender imbalance. |
| Career stage concentration | Employees are mainly at lower to mid job levels (median JobLevel = 2), with typical career paths of around 11 years of total working experience. |
| Tenure and compensation dispersion | Median tenure is about 5 years, while monthly income shows substantial dispersion, driven by a small subset of highly senior, highly compensated employees. |


### Univariate Analysis - Satisfaction and Engagement Distribution

**JobSatisfaction, EnvironmentSatisfaction, WorkLifeBalance, JobInvolvement**

| Finding | Description |
|--------|-------------|
| Overall satisfaction level | Satisfaction and engagement scores are moderately positive, with mean values around 2.7–2.8 on a 4-point scale. |
| Central tendency | Medians at level 3 across all dimensions indicate general satisfaction and involvement rather than high engagement. |
| Score dispersion | Dispersion is limited—especially for WorkLifeBalance and JobInvolvement—suggesting relatively homogeneous experiences. |
| Attrition signal | A non-negligible share of very low satisfaction scores (level 1) points to early disengagement pockets relevant for attrition risk. |
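
No code cell accompanies this subsection, so the following is only a sketch of how the level shares, means, and medians summarized above could be computed with plain pandas. The toy values are invented for illustration and do not reproduce the notebook's figures.

```python
import pandas as pd

# Toy ordinal scores on the 1-4 scale (illustrative values only)
toy = pd.DataFrame({
    "JobSatisfaction": [1, 3, 3, 4, 2, 3, 1, 4],
    "WorkLifeBalance": [3, 3, 2, 3, 3, 4, 3, 1],
})

# Share of employees at each level (rows = level 1..4, columns = variable)
shares = (
    pd.DataFrame({c: toy[c].value_counts(normalize=True) for c in toy.columns})
      .reindex([1, 2, 3, 4])
      .fillna(0.0)
)

# Central tendency plus the share of very low scores (level 1)
summary = toy.agg(["mean", "median"]).T
summary["share_level_1"] = shares.loc[1]

print(shares)
print(summary)
```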


### Bivariate Analysis - Attrition Differentiators

In [None]:
#Define columns to plot
attrition_diff_cols = ['Age', 'MonthlyIncome', 'YearsAtCompany', 'PercentSalaryHike']
from visualization.explore_continuous import plot_box_grid
plot_box_grid(df, attrition_diff_cols, ['Attrition'])

In [None]:
from visualization.explore_discrete_multivariate import plot_discrete_bivariate_grid
plot_discrete_bivariate_grid(df, ['JobLevel', 'Education'], 'Attrition', figsize = (13,4))

**Attrition × (Age, MonthlyIncome, YearsAtCompany, JobLevel, Education, PercentSalaryHike)**

| Finding | Description |
|--------|-------------|
| Age and tenure gap | Leavers are younger (33.6 vs. 37.6 years) and have shorter tenure (5.1 vs. 7.4 years), indicating early-career attrition risk. |
| Career-stage attrition pattern | The combined age–tenure effect shows attrition concentrates in early-to-mid career phases rather than late-stage exits. |
| Compensation structure | Leavers earn slightly less on average, but similar median incomes suggest differences are driven by dispersion rather than central tendency. |
| Limited role of hierarchy | Job level, education, and recent salary hikes show minimal differences, implying attrition is driven more by career stage than formal hierarchy or pay adjustments. |
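
The comparison of means and medians between leavers and stayers can be sketched with a grouped aggregation. The toy data below is made up so that both groups have similar mean income while one group is skewed; it illustrates how a difference can come from dispersion rather than from the center of the distribution.

```python
import pandas as pd

# Toy data (illustrative values only)
toy = pd.DataFrame({
    "Attrition": ["Yes", "Yes", "Yes", "No", "No", "No"],
    "Age": [28, 31, 30, 40, 36, 45],
    "MonthlyIncome": [30000, 32000, 90000, 52000, 48000, 50000],
})

# Mean and median per group, side by side
stats = toy.groupby("Attrition")[["Age", "MonthlyIncome"]].agg(["mean", "median"])

# A large gap between mean and median hints at skew inside a group
skew_gap = stats["MonthlyIncome"]["mean"] - stats["MonthlyIncome"]["median"]

print(stats)
print(skew_gap)
```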


### Bivariate - Job Structure and Mobility Risk

In [None]:
from visualization.explore_discrete_multivariate import plot_discrete_lollipop_bivariate
plot_discrete_lollipop_bivariate(df, 'JobRole', 'Attrition', figsize = (13, 5))

In [None]:
#Define columns to plot
job_risk_cols = ['Department', 'BusinessTravel', 'MaritalStatus', 'EducationField']
from visualization.explore_discrete_multivariate import plot_discrete_bivariate_grid
plot_discrete_bivariate_grid(df, job_risk_cols, 'Attrition')

**Attrition × (JobRole, Department, BusinessTravel, MaritalStatus, EducationField)**

| Finding | Description |
|--------|-------------|
| Role-level attrition hotspots | Attrition is highest in expert and revenue-facing roles (Research Directors 24.4%, Research Scientists 16.0%, Sales Executives 16.2%), indicating elevated mobility. |
| Departmental risk concentration | Human Resources exhibits a markedly high attrition rate (27.8%), pointing to a localized structural risk. |
| Mobility and personal context | Frequent travelers (23.5%) and single employees (24.1%) show notably higher attrition, highlighting the role of mobility demands and personal context. |
| Education field heterogeneity | Attrition varies strongly by education field, with Human Resources backgrounds showing exceptionally high exit rates (35.3%). |
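
The per-category attrition rates quoted above are group means of a yes/no flag. A minimal pandas sketch on toy data (values invented for illustration), which also reports the group size so that rates from small groups can be read with caution:

```python
import pandas as pd

# Toy data (illustrative values only)
toy = pd.DataFrame({
    "BusinessTravel": ["Travel_Frequently"] * 4 + ["Travel_Rarely"] * 6,
    "Attrition": ["Yes", "No", "No", "No", "Yes", "No", "No", "No", "No", "No"],
})

# Boolean flag: True where the employee left
left_flag = toy["Attrition"].eq("Yes")

# Mean of the flag per category is the attrition rate; size is the group count
rates = (
    left_flag.groupby(toy["BusinessTravel"])
             .agg(attrition_rate="mean", n="size")
             .sort_values("attrition_rate", ascending=False)
)

print(rates)
```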


### Multivariate Analysis - Correlation and Feature Importance

In [None]:
#Define columns to drop
cols_to_drop = ['EmployeeID', 'EmployeeCount', 'StandardHours']
from visualization.explore_continuous import plot_correlation_heatmap
plot_correlation_heatmap(df.drop(columns = cols_to_drop))

**Correlation Analysis**

| Finding | Description |
|--------|-------------|
| Structural redundancy | Strong Spearman correlations reveal redundancies among career and tenure-related variables, reflecting underlying organizational structures. |
| Career duration block | Age, total experience, tenure, and time since last promotion form a coherent temporal block, suggesting cumulative career dynamics and potential stagnation. |
| Performance–reward linkage | Performance ratings are strongly correlated with salary increases, confirming expected reward mechanisms. |
| Independence of pay and satisfaction | Compensation level and satisfaction metrics show weak links with career progression, suggesting attrition is driven more by contextual and structural alignment. |
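
`plot_correlation_heatmap` is a helper from `jfj-utils` whose implementation is not shown. As a sketch of the underlying computation, pandas can produce a Spearman matrix directly, and strongly correlated pairs can be listed with a threshold. The toy values and the 0.8 cut-off are assumptions chosen for illustration.

```python
import pandas as pd

# Toy data: two tenure-like columns that rise together, one unrelated column
toy = pd.DataFrame({
    "TotalWorkingYears": [1, 3, 6, 10, 15, 22],
    "YearsAtCompany":    [1, 2, 5, 7, 9, 20],
    "PercentSalaryHike": [15, 11, 22, 13, 18, 12],
})

# Rank-based (Spearman) correlation matrix
corr = toy.corr(method="spearman")

# List each pair once when its absolute correlation exceeds the threshold
threshold = 0.8  # illustrative cut-off, not taken from the notebook
pairs = [
    (a, b, round(float(corr.loc[a, b]), 2))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]

print(pairs)
```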


**Random Forest**

For this stage of the multivariate exploratory analysis, we apply a single Random Forest classification model to the full dataset. The goal remains purely exploratory: we want a relative ranking of features, not a tuned predictive model or causal inference.

Using one global model allows us to evaluate the relative importance of all features simultaneously, including demographic, career, satisfaction, and behavioral archetype variables. This gives a comprehensive overview of which factors carry the strongest signal for attrition. Random Forests also tolerate multicollinearity in terms of predictive accuracy and handle a mix of numerical, ordinal, and categorical features without strict distributional assumptions, which makes them suitable for this exploratory stage. Note, however, that importance can be split among strongly correlated features, so the ranking of redundant career and tenure variables should be read with caution.
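
The notebook relies on `random_forest_classification` from `jfj-utils`, whose internals are not shown here. As a rough stand-in, the sketch below uses scikit-learn's `RandomForestClassifier` with `class_weight="balanced"` on synthetic data where only one feature drives the target. The feature names, sample size, and hyperparameters are assumptions for illustration, not the author's settings.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic features: one informative column and two pure-noise columns
rng = np.random.default_rng(42)
n = 600
X = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise_a": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
})
# Synthetic target depending only on "signal", roughly 15% positives
y = (X["signal"] > 1.0).astype(int)

# Stratified split keeps the class ratio similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# class_weight="balanced" reweights classes inversely to their frequency
model = RandomForestClassifier(
    n_estimators=200, class_weight="balanced", random_state=42
)
model.fit(X_train, y_train)

# Impurity-based importances, normalized by scikit-learn to sum to 1
importances = (
    pd.Series(model.feature_importances_, index=X.columns)
      .sort_values(ascending=False)
)
print(importances)
```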

In [None]:
#Clone the original dataset
df_rf = df.copy()

#Calculate proportion of rows with missing values
total_rows = df_rf.shape[0]
rows_with_na = df_rf.isna().any(axis = 1).sum()
prop_missing = round((rows_with_na / total_rows) * 100, 2)
print("Proportion of rows with missing values: ", prop_missing, "%")

In [None]:
from visualization.explore_binary import plot_binary_distribution
plot_binary_distribution(df_rf, ['Attrition'])

In this dataset, 3.67% of the rows contain at least one missing value, which is relatively low given the total sample size of 2,888 employees. The target variable, Attrition, is highly imbalanced, with 84.9% “No” and 15.1% “Yes”.

Since the analysis is purely exploratory, the objective is not to build a final predictive model but to identify which variables carry the most signal for attrition. The low proportion of missing data means that removing these rows is unlikely to bias the results meaningfully, and the class imbalance is acceptable at this stage because the focus is on relative feature importance and general trends rather than precise predictive performance.

In [None]:
from sklearn.model_selection import train_test_split
from data_preprocessing.encoding import ordinal_encode_columns, one_hot_encode_columns
from modeling.classification_trees import random_forest_classification

#Drop irrelevant columns
cols_to_drop = ['EmployeeID', 'EmployeeCount', 'StandardHours', 'Over18']
df_rf.drop(columns = cols_to_drop, inplace = True)

#Drop rows with missing values
df_rf.dropna(inplace = True)

#Ordinal encode relevant features
ordinal_mappings = {
    'Education': [1,2,3,4,5],
    'EnvironmentSatisfaction': [1,2,3,4],
    'JobSatisfaction': [1,2,3,4],
    'JobInvolvement': [1,2,3,4],
    'WorkLifeBalance': [1,2,3,4],
    'PerformanceRating': [1,2,3,4]
}

df_rf = ordinal_encode_columns(df_rf, ordinal_mappings)

#One-hot encode relevant features
categorical_cols = ['Archetype', 'BusinessTravel', 'Department', 'EducationField', 
                    'Gender', 'JobRole', 'MaritalStatus']

df_rf = one_hot_encode_columns(df_rf, categorical_cols, drop_first=False)

#Prepare X and y
X = df_rf.drop(columns=['Attrition'])
y = df_rf['Attrition'].map({'No':0, 'Yes':1})

#Train/test split with stratification
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size = 0.3, random_state = 42, stratify = y
)
train_df = X_train.copy()
train_df['Attrition'] = y_train
test_df = X_test.copy()
test_df['Attrition'] = y_test

#Run Random Forest (balanced for class imbalance)
predictors = [col for col in train_df.columns if col != 'Attrition']
rf_model = random_forest_classification(train_df, test_df, outcome='Attrition', predictors=predictors)

**The unusually high model performance is noted. Since this analysis is purely exploratory and focused on feature ranking, no further investigation is conducted at this stage.**
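
One check that could be run if the performance is revisited later, offered here as a possibility rather than a diagnosis, is whether identical feature rows occur on both sides of the split, since a forest can memorize such rows. A minimal sketch on toy frames (values made up; the names `X_train` and `X_test` mirror the cell above but hold toy data here):

```python
import pandas as pd

# Toy feature tables standing in for the real train/test splits
X_train = pd.DataFrame({"Age": [30, 41, 25], "JobLevel": [1, 3, 2]})
X_test = pd.DataFrame({"Age": [41, 52], "JobLevel": [3, 4]})

# An inner merge on all shared columns keeps test rows that also occur in train
test_unique = X_test.drop_duplicates()
overlap = test_unique.merge(X_train.drop_duplicates(), how="inner")

# Share of distinct test rows that have an exact twin in the training set
overlap_share = len(overlap) / len(test_unique)
print(f"{overlap_share:.0%} of distinct test rows also appear in train")
```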

In [None]:
from eda.explainability import feature_importance
feature_importance(rf_model, X_train, predictors)

Overall, attrition appears to be driven more by the timing and dynamics of employees’ career paths than by isolated levels of satisfaction, performance, or compensation.

Engagement factors and behavioral archetypes enrich the understanding of attrition risk, but on their own they do not emerge as the primary drivers of employee exits.

To uncover the full complexity of these patterns and detect hidden signals, **we will need to drill deeper into the data at the level of job roles and departments**, which will allow us to identify nuanced behaviors and triggers that are otherwise masked in aggregate analyses.

### Multivariate Analysis - Career Trajectory Dynamics

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

#Scatter plots of Age vs. TotalWorkingYears, one panel per JobRole
facet_grid = sns.relplot(
    data=df,
    x="Age",
    y="TotalWorkingYears",
    hue="Attrition",
    col="JobRole",
    col_wrap=3,
    height=4,
    aspect=1.2,
    kind="scatter",
    alpha=0.7
)

#Use the job role name as each panel title
facet_grid.set_titles("{col_name}")
plt.show()

**Attrition × (Age × TotalWorkingYears × JobRole)**

| Finding | Description |
|--------|-------------|
| Junior exit pattern | Most job roles exhibit a “junior exit” trend, with leavers being younger and having 3–5 fewer years of experience than stayers. |
| Early integration risk | Attrition concentrates before full career integration, indicating difficulty retaining talent beyond early tenure stages. |
| HR seniority drain | Human Resources is a key outlier, where leavers are older and more senior than stayers, signaling a loss of experienced leaders. |
| Mid-level attrition risk | In roles such as Manager and Manufacturing Director, leavers have nearly 50% less experience, highlighting elevated turnover risk for mid-level talent. |


In [None]:
from visualization.explore_continuous import plot_violin_grid
plot_violin_grid(df, ['YearsAtCompany'], 'JobRole', 'Attrition', figsize_single = (13,7))

In [None]:
#Display information by Department
departments = df['Department'].unique()

for dept in departments:
    print(f"\nDepartment: {dept}", "\n")
    subset = df.loc[df['Department'] == dept].copy()
    plot_box_grid(subset, ['YearsWithCurrManager', 'TotalWorkingYears', 'YearsAtCompany', 'YearsSinceLastPromotion'], 'Attrition')

**Attrition × (JobRole × YearsAtCompany)**

**Attrition × (Department × TotalWorkingYears)**

**Attrition × (Department × YearsAtCompany)**

**Attrition × (Department × YearsSinceLastPromotion)**

**Attrition × (Department × YearsWithCurrManager)**

| Finding | Description |
|--------|-------------|
| Early-tenure attrition | Attrition is concentrated in early organizational tenure, marked by short time with the current manager and lower total experience. |
| Managerial exposure | Leavers consistently show lower medians for YearsAtCompany and YearsWithCurrManager across departments. |
| Limited role of promotion timing | YearsSinceLastPromotion remains low for both leavers and stayers, indicating promotion stagnation is not the primary trigger. |
| Department-specific windows | While the early-career pattern is consistent, attrition timing and intensity vary by department, revealing role-specific attrition windows. |


### Multivariate Analysis - Compensation and Recognition Patterns

In [None]:
from visualization.explore_continuous import plot_numeric_bivariate

#Display information by Job Role
job_roles = df['JobRole'].unique()

for role in job_roles:
    print(f"\nJob Role: {role}", "\n")
    subset = df.loc[df['JobRole'] == role].copy()
    plot_numeric_bivariate(subset, ['MonthlyIncome'], 'Attrition')

**Attrition × (JobRole × MonthlyIncome)**

| Finding | Description |
|--------|-------------|
| Pay-driven attrition | In most roles (e.g., Healthcare Representatives, Sales Executives, Managers), leavers earn lower median salaries, indicating pay as a key retention lever. |
| High-earner exits | For Sales Representatives and Manufacturing Directors, leavers earn higher salaries than stayers, suggesting non-compensation drivers such as stress or external poaching. |
| Pay-neutral roles | Research Scientists show nearly identical salaries between leavers and stayers, implying attrition is driven by career progression or work environment rather than pay. |
| Role-specific pay dynamics | Overall, compensation effects on attrition are highly role-dependent, reinforcing the need for differentiated retention strategies. |
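
The role-level comparison of median pay between leavers and stayers can also be put in a single table rather than read from one plot per role. A sketch using `pivot_table` on toy data (values invented for illustration):

```python
import pandas as pd

# Toy data (illustrative values only)
toy = pd.DataFrame({
    "JobRole": ["Manager"] * 4 + ["Sales Representative"] * 4,
    "Attrition": ["Yes", "No", "No", "No", "Yes", "Yes", "No", "No"],
    "MonthlyIncome": [40000, 60000, 65000, 70000, 52000, 58000, 30000, 34000],
})

# Median income per role, one column per attrition outcome
medians = toy.pivot_table(
    index="JobRole", columns="Attrition",
    values="MonthlyIncome", aggfunc="median",
)

# Positive gap: leavers out-earn stayers in that role; negative: the reverse
medians["leaver_gap"] = medians["Yes"] - medians["No"]
print(medians)
```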


In [None]:
import math

#Compute attrition rate by JobRole and PercentSalaryHike
hike_counts = (
    df.groupby(['JobRole', 'PercentSalaryHike', 'Attrition'])
      .size()
      .unstack(fill_value=0) 
      .reset_index()
)

#Calculate attrition rate
hike_counts['AttritionRate'] = hike_counts['Yes'] / (hike_counts['Yes'] + hike_counts['No'])

#Prepare subplot grid
job_roles = hike_counts['JobRole'].unique()
n_roles = len(job_roles)
n_cols = 3
n_rows = math.ceil(n_roles / n_cols)
plt.figure(figsize=(n_cols*5, n_rows*4))

for i, role in enumerate(job_roles):
    plt.subplot(n_rows, n_cols, i+1)
    subset = hike_counts[hike_counts['JobRole'] == role]
    sns.barplot(data=subset, x='PercentSalaryHike', y='AttritionRate', color='steelblue')
    plt.title(role, fontsize=10)
    plt.xlabel("Salary Hike (%)", fontsize=8)
    plt.ylabel("Attrition Rate", fontsize=8)
    plt.ylim(0, 1)

plt.tight_layout()
plt.show()


**Attrition Rate × (JobRole × PercentSalaryHike)**

| Finding | Description |
|--------|-------------|
| Heterogeneous raise impact | Attrition response to percent salary hike is highly role-dependent, contradicting the assumption that higher raises uniformly improve retention. |
| Average-raise breaking points | Human Resources and Sales Executives show high proportional attrition around 14–15% hikes, suggesting average raises may not offset workload or engagement gaps (noting small cohort effects). |
| Leadership paradox | For Managers and Research Directors, very high raises (>22%) do not prevent attrition, indicating retention depends more on career stability and long-term incentives than salary alone. |
| Role-specific compensation sensitivity | Research Scientists show rising attrition beyond 13% hikes, possibly reflecting external demand, while Laboratory Technicians display stable, linear sensitivity aligned with standard pay cycles. |
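
Since each role and hike level can contain very few employees, it may help to carry the cell count next to the rate and flag small cohorts before interpreting a spike. A sketch on toy data, where the values and the minimum cohort size of 5 are assumptions for illustration:

```python
import pandas as pd

# Toy data: one role, three hike levels with very different cohort sizes
toy = pd.DataFrame({
    "JobRole": ["Human Resources"] * 12,
    "PercentSalaryHike": [11] * 8 + [14] * 2 + [15] * 2,
    "Attrition": ["No"] * 7 + ["Yes"] + ["Yes", "No"] + ["Yes", "Yes"],
})

# Attrition rate and cohort size per (role, hike) cell
left_flag = toy["Attrition"].eq("Yes")
cells = (
    left_flag.groupby([toy["JobRole"], toy["PercentSalaryHike"]])
             .agg(attrition_rate="mean", n="size")
             .reset_index()
)

# Flag cells too small for the rate to be reliable
min_n = 5  # illustrative threshold, not from the notebook
cells["small_cohort"] = cells["n"] < min_n
print(cells)
```

In this toy example the 100% rate at a 15% hike rests on two employees only, which is the kind of case the small-cohort caveat refers to.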


### Multivariate Analysis - Satisfaction-Engagement Alignment

In [None]:
for role in job_roles:
    print(f"\nJob Role: {role}", "\n")
    subset = df.loc[df['JobRole'] == role].copy()
    plot_discrete_bivariate_grid(subset, ['JobSatisfaction'], 'Attrition', figsize = (12,3))

**Attrition × (JobRole × JobSatisfaction)**

| Finding | Description |
|--------|-------------|
| Satisfaction-sensitive roles | Research Directors, Manufacturing Directors, HR, and Healthcare Representatives show sharp attrition drops once job satisfaction exceeds very low levels, making low satisfaction a strong exit trigger. |
| Sales Executives paradox | Sales Executives display higher attrition among satisfied employees than dissatisfied ones, indicating satisfaction is not the primary driver of exits in sales roles. |
| Constant-risk profiles | Research Scientists and Laboratory Technicians maintain high attrition at low-to-mid satisfaction levels, with risk decreasing only at very high satisfaction. |
| Departmental environment effects | EnvironmentSatisfaction predicts attrition in Sales and R&D but is largely irrelevant for HR, where attrition remains high regardless of satisfaction level. |


In [None]:
from visualization.explore_discrete import plot_discrete_distribution
plot_discrete_distribution(df, ['Archetype'], figsize = (10, 5))

In [None]:
plot_discrete_lollipop_bivariate(df, 'Archetype', 'Attrition', figsize = (12, 5))

**Attrition × Archetype**

**Attrition × (JobRole × DistanceFromHome)**

**Attrition × (JobRole × TrainingTimesLastYear)**

| Finding | Description |
|--------|-------------|
| Workhorse Elite archetype | The Workhorse Elite exhibits a striking 30.4% attrition, more than double other archetypes (Quiet Disengaged 12.3%, Balanced Contributor 12.2%), making it a critical risk segment despite mid-tier feature importance. |
| Predictive power | The Workhorse Elite archetype alone multiplies attrition probability by ~2.5 vs a Balanced Contributor, signaling a high-risk, high-volume group central to retention strategy. |
| Distance & Training effects | For HR and Sales Reps, leavers tend to live farther from the office and/or have fewer training sessions, while Sales Executives and Lab Technicians leave regardless of proximity or training, highlighting role-specific attrition drivers. |
| R&D stability | In R&D roles, distance from home and training frequency are weak predictors, with leavers and stayers showing similar values across these features. |
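
The "~2.5" multiplier is the ratio of the two attrition rates reported in the table above (30.4% against 12.2%). The sketch below reproduces that arithmetic and shows how the same ratio could be derived from a crosstab. The toy archetype labels "A" and "B" and their counts are invented for illustration.

```python
import pandas as pd

# Rates reported in the table above
workhorse_rate = 0.304
balanced_rate = 0.122
reported_ratio = workhorse_rate / balanced_rate  # roughly 2.5

# Same computation from raw rows, on toy data
toy = pd.DataFrame({
    "Archetype": ["A"] * 10 + ["B"] * 10,
    "Attrition": ["Yes"] * 3 + ["No"] * 7 + ["Yes"] * 1 + ["No"] * 9,
})

# Row-normalized crosstab: each row gives the No/Yes shares for one archetype
rates = pd.crosstab(toy["Archetype"], toy["Attrition"], normalize="index")["Yes"]
toy_ratio = rates["A"] / rates["B"]

print(round(reported_ratio, 2), round(toy_ratio, 2))
```

A ratio of rates describes relative risk between two groups; it does not by itself say how much of total attrition the group accounts for, which also depends on group size.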




## 3.4-Summary - Notebook 3

| EDA Part | Finding | Description |
|----------|--------|-------------|
| Workforce & Career | Workforce structure | The workforce is predominantly mid-career, with an average age of 36.9 and half of employees between 30 and 43 years old. |
| Workforce & Career | Gender distribution | The population is male-dominated (~60% men, 40% women), indicating a moderate but persistent gender imbalance. |
| Workforce & Career | Career stage concentration | Employees are mainly positioned at lower to mid job levels (median JobLevel = 2), with typical career paths of around 11 years of total working experience. |
| Attrition Patterns | Early-career attrition | Leavers are younger (33.6 vs 37.6 years) and have shorter tenure (5.1 vs 7.4 years), showing that attrition is concentrated in early stages of employment. |
| Attrition Patterns | Departmental risk concentration | Human Resources shows markedly high attrition (27.8%), highlighting a localized structural risk. |
| Engagement & Satisfaction | Satisfaction-sensitive roles | Research Directors, Manufacturing Directors, HR, and Healthcare Reps show sharp attrition drops once satisfaction exceeds low levels, making low satisfaction a strong exit trigger. |
| Engagement & Satisfaction | Sales Executives paradox | Sales Executives display higher attrition among satisfied employees, indicating satisfaction alone is not a primary driver. |
| Compensation | Pay-driven attrition | In most roles (Healthcare Reps, Sales Execs, Managers), leavers earn lower median salaries, suggesting pay as a retention lever. |
| Compensation | High-earner exits | For Sales Reps and Manufacturing Directors, leavers earn higher salaries than stayers, indicating non-compensation drivers. |
| Compensation | Role-specific raise sensitivity | Attrition patterns by % salary hike are heterogeneous: some roles react to mid-level hikes, others (leadership) show no correlation. |
| Distance & Training | Distance & Training effects | For HR and Sales Reps, leavers live farther and/or have fewer training sessions; for Sales Execs and Lab Technicians, exits occur despite proximity or training. |
| Distance & Training | R&D stability | In R&D roles, distance from home and training frequency are weak predictors, with similar values for leavers and stayers. |
| Archetypes | Workhorse Elite archetype | The Workhorse Elite exhibits a striking 30.4% attrition, more than double other archetypes (Quiet Disengaged 12.3%, Balanced Contributor 12.2%), making it a critical risk segment. |
| Archetypes | Predictive power | The Workhorse Elite archetype alone multiplies attrition probability by ~2.5 vs a Balanced Contributor, signaling a high-risk, high-volume group central to retention strategy. |


## 3.5-Data Export

In [None]:
import pickle

#Export integrated dataframe
with open("final_integrated_hr_dataset.pkl", "wb") as f:
    pickle.dump(df, f)

print("Export complete.")
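
Before the next notebook consumes the export, a quick round trip can confirm that the pickle reloads to an identical DataFrame. A minimal sketch on a toy frame written to a temporary directory (the file name `toy_dataset.pkl` is hypothetical and the values are made up):

```python
import os
import pickle
import tempfile

import pandas as pd

# Toy frame standing in for the integrated dataset
toy = pd.DataFrame({"EmployeeID": [1, 2], "Archetype": ["A", "B"]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "toy_dataset.pkl")  # hypothetical file name
    # Write, then read back from the same path
    with open(out_path, "wb") as f:
        pickle.dump(toy, f)
    with open(out_path, "rb") as f:
        restored = pickle.load(f)

# equals() compares shape, dtypes, and values
round_trip_ok = restored.equals(toy)
print("Round trip OK:", round_trip_ok)
```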