Attrition is one of the most common problems at any organization. The time, money and effort needed to train new employees can add up to a significant loss for the company. The cost of turnover includes advertising, hiring, on-boarding, training and lost productivity. Additionally, attrition can cause doubt or distrust of management among the remaining employees. According to Gallup, U.S. businesses lose a trillion dollars every year due to employee turnover. Gallup also argues that the problem is fixable with the right strategy and retention plan.
This project aims to provide an in-depth analysis of the factors that lead to employee turnover, build a predictive model and suggest a retention strategy using the Kaggle IBM HR dataset. Please follow this link, Understanding and Predicting IBM Employee Attrition, for details.
Tools Used : Google Colab
Packages : pandas, numpy, matplotlib, seaborn, sklearn
Data : 35 features with 1470 observations.
EDA
Feature Distributions
Attrition is high among employees in their late 20s and early 30s.
Attrition Counts
Data is highly imbalanced.
Correlation
The correlation matrix shows instances of multicollinearity.
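One way to surface multicollinearity from a correlation matrix is to list the feature pairs whose absolute Pearson correlation exceeds a cutoff. The sketch below is illustrative only: the helper `high_correlation_pairs`, the 0.7 threshold and the synthetic data are assumptions, not taken from the project notebook. The column names mirror ones found in the IBM HR dataset, but the values are randomly generated.

```python
import numpy as np
import pandas as pd


def high_correlation_pairs(df: pd.DataFrame, threshold: float = 0.7):
    """Return (feature_a, feature_b, r) for numeric column pairs with |r| >= threshold."""
    corr = df.select_dtypes(include="number").corr()
    cols = corr.columns
    pairs = []
    # Walk the upper triangle only so each pair is reported once.
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], float(r)))
    # Strongest correlations first.
    return sorted(pairs, key=lambda p: -abs(p[2]))


# Synthetic stand-in data (hypothetical values, not the real dataset).
rng = np.random.default_rng(0)
job_level = rng.integers(1, 6, size=500)
demo = pd.DataFrame({
    "JobLevel": job_level,
    "MonthlyIncome": job_level * 4000 + rng.normal(0, 800, size=500),
    "DistanceFromHome": rng.integers(1, 30, size=500),
})

pairs = high_correlation_pairs(demo)
for a, b, r in pairs:
    print(f"{a} vs {b}: r = {r:.2f}")
```

Pairs flagged this way are candidates for dropping one member or combining both into a single feature before fitting linear models.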
Models : Logistic Regression, Random Forest, KNN, SVM
Feature Engineering : One-hot Encoding
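As a minimal sketch of one-hot encoding with pandas, the example below uses a tiny hand-made frame. The rows are invented for illustration, and `drop_first=True` is an assumption made here to avoid redundant dummy columns; the project may have encoded the categories differently.

```python
import pandas as pd

# Tiny hypothetical sample; column names follow the IBM HR dataset.
demo = pd.DataFrame({
    "Age": [29, 41, 35],
    "OverTime": ["Yes", "No", "Yes"],
    "MaritalStatus": ["Single", "Married", "Divorced"],
})

categorical_cols = ["OverTime", "MaritalStatus"]

# Each categorical column becomes one 0/1 column per category,
# minus the first category (dropped as the implicit baseline).
encoded = pd.get_dummies(demo, columns=categorical_cols, drop_first=True, dtype=int)
print(encoded)
```

Numeric columns such as `Age` pass through unchanged, so the encoded frame can be fed directly to the scikit-learn models.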
Model Performance Techniques : Feature Selection, Feature Scaling, Imbalanced Data Treatment (SMOTE, Upsampling & Downsampling), Hyperparameter Tuning
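The sketch below strings three of these techniques together: upsampling the minority class, feature scaling and hyperparameter tuning. It is a hedged illustration rather than the project's actual pipeline. The data comes from `make_classification` as a stand-in for the HR dataset, the `C` grid is arbitrary, and upsampling is done with `sklearn.utils.resample` instead of SMOTE (which needs the separate `imbalanced-learn` package). Feature selection and downsampling are not shown.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample

# Synthetic imbalanced data (assumed stand-in, roughly 16% positives).
X, y = make_classification(
    n_samples=1470, n_features=10, weights=[0.84, 0.16], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Upsample the minority class in the training split only,
# so the test split keeps the original class balance.
majority = X_train[y_train == 0]
minority = X_train[y_train == 1]
minority_up = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)
X_bal = np.vstack([majority, minority_up])
y_bal = np.concatenate([
    np.zeros(len(majority), dtype=int),
    np.ones(len(minority_up), dtype=int),
])

# Scaling lives inside the pipeline so it is refit on each CV fold.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
search = GridSearchCV(
    pipe, {"logisticregression__C": [0.01, 0.1, 1, 10]}, scoring="f1", cv=5
)
search.fit(X_bal, y_bal)

test_f1 = f1_score(y_test, search.predict(X_test))
print(search.best_params_, f"test F1 = {test_f1:.3f}")
```

Because the upsampled rows are duplicates, the cross-validation scores inside the grid search are optimistic; the held-out `test_f1` is the more trustworthy number. Resampling inside each fold would avoid that leakage at the cost of a slightly more involved pipeline.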
Metrics : F1-Score, ROC Graph
First, baseline models were developed without any of the model improvement techniques.
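A baseline run of this kind might look like the following sketch, which fits the four model families with scikit-learn defaults and reports F1 and ROC AUC on a held-out split. The synthetic data, split size and random seeds are assumptions for illustration; the scores it prints say nothing about the project's actual results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic imbalanced data standing in for the encoded HR features.
X, y = make_classification(
    n_samples=1470, n_features=10, weights=[0.84, 0.16], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Default settings: no scaling, no resampling, no tuning.
baselines = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(probability=True, random_state=42),  # probability=True enables predict_proba
}

scores = {}
for name, model in baselines.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]  # probability of the positive class
    scores[name] = {
        "f1": f1_score(y_test, model.predict(X_test)),
        "auc": roc_auc_score(y_test, proba),
    }
    print(f"{name}: F1 = {scores[name]['f1']:.3f}, AUC = {scores[name]['auc']:.3f}")
```

Keeping the baseline scores in a dictionary makes it straightforward to compare them against the improved models later.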
Baseline AUC & F1-score
Baseline ROC Graph
Improved AUC & F1-score
Improved ROC Graph
Random Forest Feature Importance
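A feature importance chart of this kind is typically built from the `feature_importances_` attribute of a fitted random forest. The sketch below assumes impurity-based importances and uses synthetic data in which attrition is driven by overtime alone, so the ranking it produces is by construction and does not reproduce the project's chart.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic features (hypothetical values; names follow the IBM HR dataset).
rng = np.random.default_rng(0)
n = 1000
demo = pd.DataFrame({
    "OverTime_Yes": rng.integers(0, 2, size=n),
    "JobLevel": rng.integers(1, 6, size=n),
    "Education": rng.integers(1, 6, size=n),
})

# Synthetic target: leaving probability depends on overtime only.
p_leave = 0.05 + 0.5 * demo["OverTime_Yes"].to_numpy()
y = (rng.random(n) < p_leave).astype(int)

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(demo, y)

# Impurity-based importances sum to 1 across features.
importances = pd.Series(
    forest.feature_importances_, index=demo.columns
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances can favor features with many distinct values, so checking the ranking with `sklearn.inspection.permutation_importance` is a reasonable follow-up.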
Retention Strategy
Based on the feature importance chart, overtime, job level, stock option level, time with current manager, marital status and income play a vital role in employee attrition. In contrast, department, job role and education tend not to contribute to turnover. The company can focus on the factors that contribute most to attrition. However, other factors that the model does not necessarily capture, such as selection bias and type of employment (interns, contractors, part-time or full-time), may also need to be considered. Additionally, it is recommended that the models be retuned at a regular frequency to include recent data and drop features of lower importance. A Chi-Square test may be used to determine the dependence between attrition and other features.
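A Chi-Square test of independence between attrition and a categorical feature can be sketched as follows. This assumes `scipy` is available (it is not in the package list above, although scikit-learn depends on it), and the data is synthetic: the overtime rate and leaving probabilities are invented so that a dependence exists to detect.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Synthetic data in which overtime raises the chance of leaving.
rng = np.random.default_rng(0)
n = 1000
overtime = rng.choice(["Yes", "No"], size=n, p=[0.3, 0.7])
p_leave = np.where(overtime == "Yes", 0.35, 0.10)
attrition = np.where(rng.random(n) < p_leave, "Yes", "No")
demo = pd.DataFrame({"OverTime": overtime, "Attrition": attrition})

# Contingency table of observed counts: rows = OverTime, columns = Attrition.
table = pd.crosstab(demo["OverTime"], demo["Attrition"])

# Null hypothesis: OverTime and Attrition are independent.
chi2, p_value, dof, expected = chi2_contingency(table)
print(table)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p_value:.3g}")
```

A small p-value suggests rejecting independence for that feature. Running the same test over each categorical column gives a model-free check on the importance ranking.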