# Using Decision Trees To Predict Employee Attrition
> Author: Hannan Khan  
> Last Updated: 2022-04-01 04:49:53

**Note: This notebook has a webapp which can be accessed [here](https://share.streamlit.io/hannankhan888/employee_attrition_webapp/main).**

The original dataset can be found [here](https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset).

**Task:** This notebook tries to find the best decision tree model to predict employee attrition.

The data was prepared using the data preparation notebook [here](https://github.com/hannankhan888/Data_Science_Portfolio/tree/main/Decision_Trees_Employee_Attrition/Data_Preparation.ipynb). That notebook covers:
* Loading Data
* EDA
* Data Cleaning
* Feature Engineering

**Employee Attrition**, similar to churn, refers to an employee's departure from the organization.  

## Loading Libraries & Dataset
Let's begin by importing useful libraries.

In [1]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import PolynomialFeatures
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

Now we can import the cleaned and prepared dataset.

In [2]:
df = pd.read_csv(r"../data/Employee_Attrition_Prepared.csv").drop("Unnamed: 0", axis=1)
df.head()

Unnamed: 0,Age,Attrition,DailyRate,Dep_HR,Dep_R&D,Dep_Sales,DistanceFromHome,Divorced,Edu_HR,Edu_Life_Sci,...,TotalWorkingYears,TrainingTimesLastYear,Travel_Frequently,Travel_Rarely,WorkLifeBalance,YearsAtCompany,YearsAtOtherCompanies,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,1,1102,0,0,1,1,0,0,1,...,8,0,0,1,1,6,2,4,0,5
1,49,0,279,0,1,0,8,0,0,1,...,10,3,1,0,3,10,0,7,1,7
2,37,1,1373,0,1,0,2,0,0,0,...,7,3,0,1,3,0,7,0,0,0
3,33,0,1392,0,1,0,3,0,0,1,...,8,3,1,0,3,8,0,7,3,0
4,27,0,591,0,1,0,2,0,0,0,...,6,3,0,1,3,2,4,2,2,2


In [17]:
df.columns

Index(['Age', 'Attrition', 'DailyRate', 'Dep_HR', 'Dep_R&D', 'Dep_Sales',
       'DistanceFromHome', 'Divorced', 'Edu_HR', 'Edu_Life_Sci',
       'Edu_Marketing', 'Edu_Medical', 'Edu_Other', 'Edu_Technical_Deg',
       'Education', 'EmployeeCount', 'EmployeeNumber',
       'EnvironmentSatisfaction', 'Gender', 'HourlyRate', 'JobInvolvement',
       'JobLevel', 'JobSatisfaction', 'Job_HR', 'Job_Healthcare_Rep',
       'Job_Lab_Tech', 'Job_Manager', 'Job_Manuf_Dir', 'Job_Research_Dir',
       'Job_Research_Sci', 'Job_Sales_Exec', 'Job_Sales_Rep',
       'LogDistanceFromHome', 'LogJobLevel', 'LogMonthlyIncome',
       'LogNumCompaniesWorked', 'LogPercentSalaryHike', 'LogTotalWorkingYears',
       'LogYearsAtCompany', 'Married', 'MonthlyIncome', 'MonthlyRate',
       'Non-Travel', 'NumCompaniesWorked', 'NumYearsAtEachCompany', 'Over18',
       'OverTime', 'OverallSatisfaction', 'PercentSalaryHike',
       'PerformanceRating', 'RelationshipSatisfaction', 'Single',
       'StandardHours', 'St

## Splitting Data
Now let's split our data into train and test sets.

In [3]:
target = "Attrition"

X_data = df.drop(target, axis=1)
Y_data = df[target]

X_train, X_test, Y_train, Y_test = train_test_split(X_data, Y_data, test_size=0.2, random_state=42)
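
The split above is purely random. When one class is much rarer than the other (an assumption about attrition data in general, not something checked in this notebook), a random split can leave the two sets with different class ratios. A minimal sketch of a stratified split on hypothetical toy data, using the `stratify` argument of `train_test_split`:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 20% positive class (not the real dataset).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"feature": rng.normal(size=100)})
y_demo = pd.Series([1] * 20 + [0] * 80, name="Attrition")

# stratify=y_demo asks for the same class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

train_rate = y_tr.mean()  # share of positives in the training split
test_rate = y_te.mean()   # share of positives in the test split
print(train_rate, test_rate)
```

Both rates stay at the original 20% here. Whether stratifying changes the results reported in this notebook is untested.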

## Creating & Fitting Models
Now we create multiple tree-based models and fit them to our training data.  
We will create the models with default parameters, and then perform hyperparameter tuning on the best one.

In [4]:
DT = DecisionTreeClassifier()
RF = RandomForestClassifier()
ET = ExtraTreesClassifier()
XGB = XGBClassifier()
AdaB = AdaBoostClassifier()
GBC = GradientBoostingClassifier()

DT = DT.fit(X_train, Y_train)
RF = RF.fit(X_train, Y_train)
ET = ET.fit(X_train, Y_train)
XGB = XGB.fit(X_train, Y_train)
AdaB = AdaB.fit(X_train, Y_train)
GBC = GBC.fit(X_train, Y_train)





Now we compute predictions for the train and test sets and store the accuracy results in `scores_df`.

In [5]:
scores_df = pd.DataFrame(columns=['DecisionTree'],
                        index = ['train_acc', 'test_acc'])
scores_df.style.set_caption("Accuracy of Tree Models On Prepared Data")

scores_df['DecisionTree'] = [accuracy_score(Y_train, DT.predict(X_train)),
                            accuracy_score(Y_test, DT.predict(X_test))]
scores_df['RandomForest'] = [accuracy_score(Y_train, RF.predict(X_train)),
                            accuracy_score(Y_test, RF.predict(X_test))]
scores_df['ExtraTree'] = [accuracy_score(Y_train, ET.predict(X_train)),
                            accuracy_score(Y_test, ET.predict(X_test))]
scores_df['XGB'] = [accuracy_score(Y_train, XGB.predict(X_train)),
                            accuracy_score(Y_test, XGB.predict(X_test))]
scores_df['AdaB'] = [accuracy_score(Y_train, AdaB.predict(X_train)),
                            accuracy_score(Y_test, AdaB.predict(X_test))]
scores_df['GBC'] = [accuracy_score(Y_train, GBC.predict(X_train)),
                            accuracy_score(Y_test, GBC.predict(X_test))]
scores_df

Unnamed: 0,DecisionTree,RandomForest,ExtraTree,XGB,AdaB,GBC
train_acc,1.0,1.0,1.0,1.0,0.903061,0.960884
test_acc,0.802721,0.877551,0.877551,0.870748,0.87415,0.884354
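
The scoring cell repeats the same two `accuracy_score` calls for every model. As a sketch of one possible alternative, the models can be kept in a dictionary and scored in a loop. The data below is synthetic (`make_classification`), and `models`, `scores`, and `scores_demo` are illustrative names, not part of the original notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared attrition data.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# One entry per model; the key becomes the column name in the results table.
models = {
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "RandomForest": RandomForestClassifier(random_state=0),
    "ExtraTree": ExtraTreesClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = [
        accuracy_score(y_tr, model.predict(X_tr)),  # train accuracy
        accuracy_score(y_te, model.predict(X_te)),  # test accuracy
    ]

scores_demo = pd.DataFrame(scores, index=["train_acc", "test_acc"])
print(scores_demo)
```

Fixing `random_state` on each model also makes the table reproducible between runs, which the default-parameter models above are not.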


Currently, our best model by test accuracy is the **GradientBoostingClassifier** (0.884354), with the RandomForest and ExtraTrees models tied just behind at 0.877551. We could go on to finalize our model by tuning hyperparameters, but I first want to see whether we can increase the accuracy by modifying the features further.
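
Accuracy is easier to interpret next to a reference point. When one class dominates, a model that always predicts the majority class already scores well. The sketch below shows such a baseline with scikit-learn's `DummyClassifier`. The 16% positive rate is a hypothetical figure chosen for illustration, not one measured from the prepared dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels: 16 positives out of 100 (illustrative only).
y_demo = np.array([1] * 16 + [0] * 84)
X_demo = np.zeros((100, 1))  # the dummy model ignores the features

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
baseline_acc = accuracy_score(y_demo, baseline.predict(X_demo))
print(baseline_acc)  # 0.84
```

On the real data, the same idea would be `DummyClassifier(strategy="most_frequent").fit(X_train, Y_train)` scored on `X_test`, which gives the number the tree models need to beat.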

## Considering Creating/Using Polynomial Features
What if we were to introduce polynomial features? These extra features could account for interaction effects between features.

In [6]:
PF = PolynomialFeatures(degree=2, include_bias=False)
X_data_pf = PF.fit_transform(X_data)

In [7]:
X_train_pf, X_test_pf, Y_train_pf, Y_test_pf = train_test_split(X_data_pf, Y_data, test_size=0.2,
                                                               random_state=42)

In [8]:
DT_pf = DecisionTreeClassifier()
RF_pf = RandomForestClassifier()
ET_pf = ExtraTreesClassifier()
XGB_pf = XGBClassifier()
AdaB_pf = AdaBoostClassifier()
GBC_pf = GradientBoostingClassifier()

DT_pf = DT_pf.fit(X_train_pf, Y_train_pf)
RF_pf = RF_pf.fit(X_train_pf, Y_train_pf)
ET_pf = ET_pf.fit(X_train_pf, Y_train_pf)
XGB_pf = XGB_pf.fit(X_train_pf, Y_train_pf)
AdaB_pf = AdaB_pf.fit(X_train_pf, Y_train_pf)
GBC_pf = GBC_pf.fit(X_train_pf, Y_train_pf)





In [9]:
scores_df['DecisionTree_PF'] = [accuracy_score(Y_train_pf, DT_pf.predict(X_train_pf)),
                            accuracy_score(Y_test_pf, DT_pf.predict(X_test_pf))]
scores_df['RandomForest_PF'] = [accuracy_score(Y_train_pf, RF_pf.predict(X_train_pf)),
                            accuracy_score(Y_test_pf, RF_pf.predict(X_test_pf))]
scores_df['ExtraTree_PF'] = [accuracy_score(Y_train_pf, ET_pf.predict(X_train_pf)),
                            accuracy_score(Y_test_pf, ET_pf.predict(X_test_pf))]
scores_df['XGB_PF'] = [accuracy_score(Y_train_pf, XGB_pf.predict(X_train_pf)),
                       accuracy_score(Y_test_pf, XGB_pf.predict(X_test_pf))]
scores_df['AdaB_PF'] = [accuracy_score(Y_train_pf, AdaB_pf.predict(X_train_pf)),
                       accuracy_score(Y_test_pf, AdaB_pf.predict(X_test_pf))]
scores_df['GBC_PF'] = [accuracy_score(Y_train_pf, GBC_pf.predict(X_train_pf)),
                       accuracy_score(Y_test_pf, GBC_pf.predict(X_test_pf))]
scores_df = scores_df.sort_index(axis=1, ascending=True)
scores_df

Unnamed: 0,AdaB,AdaB_PF,DecisionTree,DecisionTree_PF,ExtraTree,ExtraTree_PF,GBC,GBC_PF,RandomForest,RandomForest_PF,XGB,XGB_PF
train_acc,0.903061,0.92602,1.0,1.0,1.0,1.0,0.960884,0.978741,1.0,1.0,1.0,1.0
test_acc,0.87415,0.880952,0.802721,0.778912,0.877551,0.867347,0.884354,0.880952,0.877551,0.867347,0.870748,0.891156


It seems that our **XGBoost model with polynomial features data** is the best model so far.

## Hyperparameter Tuning The XGBoost Model
We can improve the model itself by tuning its hyperparameters with `GridSearchCV`.

In [10]:
params = {'max_depth': [3,6,10],
          'learning_rate': [0.01, 0.05, 0.1],
          'n_estimators': [100, 500, 1000],
          'colsample_bytree': [0.3, 0.7]}

# creating a clean XGB model to do our GridSearch on.
# I have also specified a few parameters to avoid warning messages.
XGB_search = XGBClassifier(use_label_encoder=False, eval_metric='logloss')
clf = GridSearchCV(estimator=XGB_search, 
                   param_grid=params,
                   scoring='accuracy',
                   verbose=2)
clf.fit(X_data_pf, Y_data)

Fitting 5 folds for each of 54 candidates, totalling 270 fits
[CV] END colsample_bytree=0.3, learning_rate=0.01, max_depth=3, n_estimators=100; total time=   0.4s
[CV] END colsample_bytree=0.3, learning_rate=0.01, max_depth=3, n_estimators=100; total time=   0.4s
[CV] END colsample_bytree=0.3, learning_rate=0.01, max_depth=3, n_estimators=100; total time=   0.4s
[CV] END colsample_bytree=0.3, learning_rate=0.01, max_depth=3, n_estimators=100; total time=   0.4s
[CV] END colsample_bytree=0.3, learning_rate=0.01, max_depth=3, n_estimators=100; total time=   0.4s
[CV] END colsample_bytree=0.3, learning_rate=0.01, max_depth=3, n_estimators=500; total time=   2.2s
[CV] END colsample_bytree=0.3, learning_rate=0.01, max_depth=3, n_estimators=500; total time=   2.1s
[CV] END colsample_bytree=0.3, learning_rate=0.01, max_depth=3, n_estimators=500; total time=   2.1s
[CV] END colsample_bytree=0.3, learning_rate=0.01, max_depth=3, n_estimators=500; total time=   2.1s
[CV] END colsample_bytree=0.3

[CV] END colsample_bytree=0.3, learning_rate=0.05, max_depth=10, n_estimators=500; total time=   3.3s
[CV] END colsample_bytree=0.3, learning_rate=0.05, max_depth=10, n_estimators=500; total time=   3.3s
[CV] END colsample_bytree=0.3, learning_rate=0.05, max_depth=10, n_estimators=500; total time=   3.3s
[CV] END colsample_bytree=0.3, learning_rate=0.05, max_depth=10, n_estimators=500; total time=   3.4s
[CV] END colsample_bytree=0.3, learning_rate=0.05, max_depth=10, n_estimators=1000; total time=   5.3s
[CV] END colsample_bytree=0.3, learning_rate=0.05, max_depth=10, n_estimators=1000; total time=   5.3s
[CV] END colsample_bytree=0.3, learning_rate=0.05, max_depth=10, n_estimators=1000; total time=   5.3s
[CV] END colsample_bytree=0.3, learning_rate=0.05, max_depth=10, n_estimators=1000; total time=   5.3s
[CV] END colsample_bytree=0.3, learning_rate=0.05, max_depth=10, n_estimators=1000; total time=   5.3s
[CV] END colsample_bytree=0.3, learning_rate=0.1, max_depth=3, n_estimators=1

[CV] END colsample_bytree=0.7, learning_rate=0.01, max_depth=6, n_estimators=1000; total time=  11.7s
[CV] END colsample_bytree=0.7, learning_rate=0.01, max_depth=6, n_estimators=1000; total time=  11.7s
[CV] END colsample_bytree=0.7, learning_rate=0.01, max_depth=10, n_estimators=100; total time=   1.9s
[CV] END colsample_bytree=0.7, learning_rate=0.01, max_depth=10, n_estimators=100; total time=   1.8s
[CV] END colsample_bytree=0.7, learning_rate=0.01, max_depth=10, n_estimators=100; total time=   1.9s
[CV] END colsample_bytree=0.7, learning_rate=0.01, max_depth=10, n_estimators=100; total time=   1.9s
[CV] END colsample_bytree=0.7, learning_rate=0.01, max_depth=10, n_estimators=100; total time=   2.1s
[CV] END colsample_bytree=0.7, learning_rate=0.01, max_depth=10, n_estimators=500; total time=   9.5s
[CV] END colsample_bytree=0.7, learning_rate=0.01, max_depth=10, n_estimators=500; total time=   9.5s
[CV] END colsample_bytree=0.7, learning_rate=0.01, max_depth=10, n_estimators=500;

[CV] END colsample_bytree=0.7, learning_rate=0.1, max_depth=6, n_estimators=100; total time=   1.1s
[CV] END colsample_bytree=0.7, learning_rate=0.1, max_depth=6, n_estimators=500; total time=   4.3s
[CV] END colsample_bytree=0.7, learning_rate=0.1, max_depth=6, n_estimators=500; total time=   4.2s
[CV] END colsample_bytree=0.7, learning_rate=0.1, max_depth=6, n_estimators=500; total time=   4.2s
[CV] END colsample_bytree=0.7, learning_rate=0.1, max_depth=6, n_estimators=500; total time=   4.2s
[CV] END colsample_bytree=0.7, learning_rate=0.1, max_depth=6, n_estimators=500; total time=   4.3s
[CV] END colsample_bytree=0.7, learning_rate=0.1, max_depth=6, n_estimators=1000; total time=   7.0s
[CV] END colsample_bytree=0.7, learning_rate=0.1, max_depth=6, n_estimators=1000; total time=   6.9s
[CV] END colsample_bytree=0.7, learning_rate=0.1, max_depth=6, n_estimators=1000; total time=   7.0s
[CV] END colsample_bytree=0.7, learning_rate=0.1, max_depth=6, n_estimators=1000; total time=   6

GridSearchCV(estimator=XGBClassifier(base_score=None, booster=None,
                                     colsample_bylevel=None,
                                     colsample_bynode=None,
                                     colsample_bytree=None,
                                     enable_categorical=False,
                                     eval_metric='logloss', gamma=None,
                                     gpu_id=None, importance_type=None,
                                     interaction_constraints=None,
                                     learning_rate=None, max_delta_step=None,
                                     max_depth=None, min_child_weight=None,
                                     missing=nan, monotone_...
                                     num_parallel_tree=None, predictor=None,
                                     random_state=None, reg_alpha=None,
                                     reg_lambda=None, scale_pos_weight=None,
                                  

In [11]:
print("Best params for XGB", clf.best_params_)

Best params for XGB {'colsample_bytree': 0.7, 'learning_rate': 0.05, 'max_depth': 10, 'n_estimators': 1000}
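
The grid above has 3 × 3 × 3 × 2 = 54 candidates, and with 5-fold cross-validation that gives the 270 fits reported in the log. The search is fit on `X_data_pf`, which includes the rows later used for testing. A variant that searches on the training split only keeps the test set untouched. The sketch below illustrates that pattern on synthetic data, with scikit-learn's `GradientBoostingClassifier` swapped in for XGBoost and a deliberately tiny grid so it runs quickly. None of these values come from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Tiny illustrative grid: 2 x 2 = 4 candidates.
small_grid = {"max_depth": [2, 3], "n_estimators": [20, 50]}

search = GridSearchCV(
    estimator=GradientBoostingClassifier(random_state=0),
    param_grid=small_grid,
    scoring="accuracy",
    cv=5,
)
search.fit(X_tr, y_tr)  # the test rows are never seen during the search

# With the default refit=True, best_estimator_ is retrained on all of X_tr.
tuned = search.best_estimator_
test_acc = accuracy_score(y_te, tuned.predict(X_te))
print(search.best_params_, round(search.best_score_, 3), round(test_acc, 3))
```

`search.best_score_` is the mean cross-validated accuracy of the winning candidate, which can be compared with the held-out `test_acc`.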


## Testing Accuracy Of The Best Tuned XGBoost Model
We can now make an XGBoost model with the best GridSearch parameters.

In [12]:
best_XGB = XGBClassifier(colsample_bytree=0.7,
                        learning_rate=0.05,
                        max_depth=10,
                        n_estimators=1000)
best_XGB = best_XGB.fit(X_train_pf, Y_train_pf)
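
Because `best_XGB` is fit on the expanded matrix `X_train_pf`, any new raw observation has to pass through the same degree-2 expansion before prediction. One way to keep the two steps together is a scikit-learn `Pipeline`. The sketch below uses synthetic data and a `GradientBoostingClassifier` as a stand-in for the classifier. The column names `f0` to `f3` are made up.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data with four hypothetical feature columns.
X_arr, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=["f0", "f1", "f2", "f3"])

# The pipeline expands the features, then fits the classifier on the expansion.
pipe = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    GradientBoostingClassifier(random_state=0),
)
pipe.fit(X_demo, y_demo)

# Double brackets keep the single row two-dimensional.
one_row = X_demo.iloc[[1]]
pred = pipe.predict(one_row)

# pipe[:-1] is the preprocessing part only; 4 inputs expand to 14 columns.
n_expanded = pipe[:-1].transform(one_row).shape[1]
print(pred, n_expanded)
```

With this layout the raw row is passed straight to `pipe.predict`, and the expansion cannot be forgotten or applied twice.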





In [13]:
scores_df['BEST_XGB_PF'] = [accuracy_score(Y_train_pf, best_XGB.predict(X_train_pf)),
                           accuracy_score(Y_test_pf, best_XGB.predict(X_test_pf))]
scores_df = scores_df.sort_index(axis=1, ascending=True)
scores_df

Unnamed: 0,AdaB,AdaB_PF,BEST_XGB_PF,DecisionTree,DecisionTree_PF,ExtraTree,ExtraTree_PF,GBC,GBC_PF,RandomForest,RandomForest_PF,XGB,XGB_PF
train_acc,0.903061,0.92602,1.0,1.0,1.0,1.0,1.0,0.960884,0.978741,1.0,1.0,1.0,1.0
test_acc,0.87415,0.880952,0.897959,0.802721,0.778912,0.877551,0.867347,0.884354,0.880952,0.877551,0.867347,0.870748,0.891156


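Success! *We have increased the test accuracy of our XGBoost model from 0.891156 to 0.897959!* On the 294-row test set this amounts to two more correctly classified employees, so the gain is modest.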

## Saving The Model For Future Use

In [15]:
best_XGB.save_model("models/best_XGB.json")

## Next Steps

**This project has a webapp associated with it, which can be accessed [here](https://share.streamlit.io/hannankhan888/employee_attrition_webapp/main).**

Below is an observation of data that will produce a "not churn" output with the model.

In [65]:
print(df.loc[1].to_string())

Age                            49.000000
Attrition                       0.000000
DailyRate                     279.000000
Dep_HR                          0.000000
Dep_R&D                         1.000000
Dep_Sales                       0.000000
DistanceFromHome                8.000000
Divorced                        0.000000
Edu_HR                          0.000000
Edu_Life_Sci                    1.000000
Edu_Marketing                   0.000000
Edu_Medical                     0.000000
Edu_Other                       0.000000
Edu_Technical_Deg               0.000000
Education                       1.000000
EmployeeCount                   1.000000
EmployeeNumber                  2.000000
EnvironmentSatisfaction         3.000000
Gender                          0.000000
HourlyRate                     61.000000
JobInvolvement                  2.000000
JobLevel                        2.000000
JobSatisfaction                 2.000000
Job_HR                          0.000000
Job_Healthcare_R