
Killian McKee 

## Overview ##

This code steps through several machine learning algorithms and a deep neural net to demonstrate how artificial intelligence can be used to improve employee retention. Using past data on employees, we can treat retention as a classificiation problem i.e. is this employee a flight risk or not? Given this information key decision leaders can act to preserve their talent accordingly. For this simple walkthrough we will be using IBM's attrition [dataset](https://www.ibm.com/communities/analytics/watson-analytics-blog/hr-employee-attrition/). We will be walking through 5 powerful, widely used classification methods: extreme gradient boosting, logistic regression, a random forest, support vector machines, and a basic deep neural net. To keep this section relatively compact, we use a a pre-encoded version of the data for our first 4 algorithms, but the full data preparation pipeline is included in the neural net section. Ultimately, we achieve a peak accuracy of about 89.1% using logistic regression, which is acceptable considering the small dataset size and the serious class imbalance (83% of the data falls into the 'did not commit attrition' class). In a large scale implementation, more data would likely be avaialble to improve results. Additionally, there is room for improvement in hyperparameter tuning and training time. The purpose of this guide is not to maximize accuracy, but to demonstrate a potential workflow and the relative simplicity of using machine learning to derive improved attrition insights from an imperfect dataset. 



**All Model Accuracies** 

    1. Logistic Regression:        89.1% 
    2. Extreme Gradient Boosting:  88.5% 
    3. Support Vector Machine:     88.0% 
    4. Deep Neural Net:            87.4%
    5. Random Forest:              87.2% 

## Extreme Gradient Boosting ##

In [1]:
# import necessary packages 
import pandas as pd
import numpy as np
import sklearn
import xgboost as xgb
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import confusion_matrix, classification_report, precision_score
from sklearn import datasets
from sklearn.model_selection import cross_val_score
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')


In [2]:
# load the dataset into a dataframe
attrition=pd.read_csv('ibm_employee_attrition_encoded.csv')


In [3]:
# examine the dataset 
attrition.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 52 columns):
Unnamed: 0                           1470 non-null int64
Age                                  1470 non-null int64
Attrition                            1470 non-null int64
BusinessTravel                       1470 non-null int64
DailyRate                            1470 non-null int64
DistanceFromHome                     1470 non-null int64
Education                            1470 non-null int64
EmployeeCount                        1470 non-null int64
EmployeeNumber                       1470 non-null int64
EnvironmentSatisfaction              1470 non-null int64
HourlyRate                           1470 non-null int64
JobInvolvement                       1470 non-null int64
JobLevel                             1470 non-null int64
JobSatisfaction                      1470 non-null int64
MonthlyIncome                        1470 non-null int64
MonthlyRate                          1

In [4]:
# check out the first few rows of the dataset 
# attrition.head() # commented out for brevity

In [5]:
# create a train test split in the data 
X = attrition.drop(columns=['Attrition'])
y = attrition['Attrition']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, 
                                                    random_state=18)

In [6]:
# create train test splits for our xgboost classifier to use 
xgd_train=xgb.DMatrix(data=X_train,label=y_train)
xgd_test=xgb.DMatrix(data=X_test,label=y_test)

In [7]:
# initialize the classifier 
# tinker with params in this section to achieve potentially better results 
params = {"objective":"binary:logistic", "max_depth":4,
          "nthread":5,"learning_rate":0.1,"subsample":0.2,
          "colsample_bytree":0.3,"n_estimators":20,"seed":52}

# Initialize the XGBClassifier: xg_cl
xg_cl = xgb.XGBClassifier(params=params)


In [8]:
# Fit the classifier to the training set
xg_cl.fit(X_train, y_train)

# Predict the labels of the test set: preds
preds = xg_cl.predict(X_test)

# Compute the accuracy: accuracy
accuracy = float(np.sum(preds==y_test))/y_test.shape[0]
print("xgboost model accuracy: %f" % (accuracy))

xgboost model accuracy: 0.888587


In [9]:
# printing more detailed model metrics 
model=xg_cl
pred=model.predict(X_test)
cm_df = pd.DataFrame(confusion_matrix(y_test, pred).T, index=model.classes_,
                     columns=model.classes_)

cm_df.index.name = 'Predicted'
cm_df.columns.name = 'True'
print(cm_df)
print('\n',classification_report(y_test, pred))
print('\n',model.score(X_test,y_test))

True         0   1
Predicted         
0          308  36
1            5  19

              precision    recall  f1-score   support

          0       0.90      0.98      0.94       313
          1       0.79      0.35      0.48        55

avg / total       0.88      0.89      0.87       368


 0.8885869565217391


## Logistic Regression ## 

In [10]:
# load packages 

import pandas as pd
import numpy as np
import sklearn
from sklearn import svm
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import roc_curve,auc
from sklearn.metrics import confusion_matrix, classification_report, precision_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

In [11]:
# load our data into a dataframe 
attrition=pd.read_csv('ibm_employee_attrition_encoded.csv')


In [12]:
# examine the data 
# attrition.info() #remove the comment here to see all columns (same as above)

In [13]:
# check out the first few rows of the dataset 
# attrition.head()

In [14]:
# create a train test split in the data 
X = attrition.drop(columns=['Attrition'])
y = attrition['Attrition']

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                test_size=0.25, random_state=18)

In [15]:
# find a good c value to use in our regressor 

c_values = [0.001,0.005,0.01,0.05,0.1,0.5,1.0,5.0,10.0,50.0] #experiment with different values here 
prediction_err = []
train_err = []

for c_val in c_values:
    clf = LogisticRegression(C=c_val,max_iter=10000,random_state=18)
    clf.fit(X_train, y_train)
    score_train = clf.score(X_train, y_train)
    score_test = clf.score(X_test,y_test)
    train_err.append(1-score_train)
    prediction_err.append(1-score_test)
    #val_error.append(1-score_val) 
    print("Acc Train: %f, Acc test:%f"%(score_train,score_test))

Acc Train: 0.835753, Acc test:0.850543
Acc Train: 0.840290, Acc test:0.855978
Acc Train: 0.842105, Acc test:0.853261
Acc Train: 0.877495, Acc test:0.883152
Acc Train: 0.881125, Acc test:0.894022
Acc Train: 0.886570, Acc test:0.896739
Acc Train: 0.884755, Acc test:0.894022
Acc Train: 0.887477, Acc test:0.896739
Acc Train: 0.887477, Acc test:0.899457
Acc Train: 0.889292, Acc test:0.894022


In [16]:
# create our logistic regressor 
# experiment with different C values, or run in a loop like above
logreg = LogisticRegression(C=50,max_iter=1000,random_state=18) 
logreg.fit(X_train, y_train)

LogisticRegression(C=50, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=1000, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=18, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [50]:
# get the accuracy of our model 
model=logreg
pred=model.predict(X_test)
cm_df = pd.DataFrame(confusion_matrix(y_test, pred).T,
                     index=model.classes_,columns=model.classes_)
cm_df.index.name = 'Predicted'
cm_df.columns.name = 'True'
print(cm_df)
print('\n',classification_report(y_test, pred))
print('\nLogistic Regression Model Accuracy:',model.score(X_test,y_test))

True         0   1
Predicted         
0          305  31
1            8  24

              precision    recall  f1-score   support

          0       0.91      0.97      0.94       313
          1       0.75      0.44      0.55        55

avg / total       0.88      0.89      0.88       368


Logistic Regression Model Accuracy: 0.8940217391304348


## Random Forest ## 

In [18]:
# import packages 
import pandas as pd
import numpy as np
import sklearn 
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import roc_curve,auc
from sklearn.metrics import confusion_matrix, classification_report, precision_score
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
%matplotlib inline


In [19]:
# load our data into a dataframe 
attrition=pd.read_csv('ibm_employee_attrition_encoded.csv')


In [20]:
# examine the data 
# attrition.info() #remove this comment to see all columns, same as above

In [21]:
# check out the first few rows of the dataset 
# attrition.head()

In [22]:
# create a train/test split in the data 
X = attrition.drop(columns=['Attrition'])
y = attrition['Attrition']

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                test_size=0.25, random_state=18)

In [23]:
# building our random forest classifier 
model= RandomForestClassifier(max_depth=10,n_estimators=100,oob_score=True,
                              min_samples_split=5,random_state=18,
                              min_samples_leaf=2,criterion='gini',max_features=20)
model.fit(X_train,y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=10, max_features=20, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=2, min_samples_split=5,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=True, random_state=18, verbose=0, warm_start=False)

In [24]:
# Model performance

pred=model.predict(X_test)
cm_df = pd.DataFrame(confusion_matrix(y_test, pred).T, index=model.classes_,columns=model.classes_)
cm_df.index.name = 'Predicted'
cm_df.columns.name = 'True'
print(cm_df)
print('\n',classification_report(y_test, pred))
print('\nmodel accuracy:',model.score(X_test,y_test))


True         0   1
Predicted         
0          309  41
1            4  14

              precision    recall  f1-score   support

          0       0.88      0.99      0.93       313
          1       0.78      0.25      0.38        55

avg / total       0.87      0.88      0.85       368


model accuracy: 0.8777173913043478


## Support Vector Machines ## 

In [25]:
# load packages 
import pandas as pd
import numpy as np
import warnings
import sklearn
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn import svm
from sklearn.metrics import confusion_matrix, classification_report, precision_score
%matplotlib inline


In [26]:
# load our data into a dataframe 

attrition=pd.read_csv('ibm_employee_attrition_encoded.csv')


In [27]:
# inspect the columns 
# attrition.info()

In [28]:
# check out the first few rows of the dataframe 
# attrition.head()

In [29]:
# create a train/test split in the data
X = attrition.drop(columns=['Attrition'])
y = attrition['Attrition']

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                            test_size=0.25, random_state=18)

In [30]:
# find a good value for c and testing in a few svms
c_values = [0.01,0.1,1.0,10.0]
prediction_err = []
train_err = []

for c_val in c_values:
    clf = svm.SVC(kernel='linear', C=c_val,random_state=18)
    clf.fit(X_train, y_train)
    score_train = clf.score(X_train, y_train)
    score_test = clf.score(X_test,y_test)
    train_err.append(1-score_train)
    prediction_err.append(1-score_test) 
    print("Acc Train: %f, Acc Test:%f"%(score_train,score_test))

Acc Train: 0.843920, Acc Test:0.877717
Acc Train: 0.842105, Acc Test:0.866848
Acc Train: 0.844828, Acc Test:0.880435
Acc Train: 0.841198, Acc Test:0.880435


In [31]:
# implementing an svm model and getting some statistics 
# SVC stands for support vector classification
    
clf = svm.SVC(kernel='linear', C=1.0,random_state=18)
clf.fit(X_train, y_train)
model=clf

pred=model.predict(X_test)
cm_df = pd.DataFrame(confusion_matrix(y_test, pred).T, index=model.classes_,
                     columns=model.classes_)
cm_df.index.name = 'Predicted'
cm_df.columns.name = 'True'
print(cm_df)
print(classification_report(y_test, pred))
print(model.score(X_test,y_test))

True         0   1
Predicted         
0          309  40
1            4  15
             precision    recall  f1-score   support

          0       0.89      0.99      0.93       313
          1       0.79      0.27      0.41        55

avg / total       0.87      0.88      0.85       368

0.8804347826086957


## Deep Neural Net ##

In [32]:
# import necessary packages 

import numpy 
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# neural net packages 
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import Activation
from keras.optimizers import SGD
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score

Using TensorFlow backend.


In [33]:
# load the dataset into a dataframe

attrition=pd.read_excel('ibm_employee_attrition.xlsx')


In [34]:
# examine the top of the dataset 
#pd.set_option('display.expand_frame_repr', False)
#attrition.head()

In [35]:
# map attrition to 1's and 0's 

attrition_map = {'Yes': 1, 'No': 0}
attrition['Attrition'] = attrition['Attrition'].map(attrition_map)

In [36]:
# view all our column names 

attrition.columns

Index(['Age', 'Attrition', 'BusinessTravel', 'DailyRate', 'Department',
       'DistanceFromHome', 'Education', 'EducationField', 'EmployeeCount',
       'EmployeeNumber', 'EnvironmentSatisfaction', 'Gender', 'HourlyRate',
       'JobInvolvement', 'JobLevel', 'JobRole', 'JobSatisfaction',
       'MaritalStatus', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked',
       'Over18', 'OverTime', 'PercentSalaryHike', 'PerformanceRating',
       'RelationshipSatisfaction', 'StandardHours', 'StockOptionLevel',
       'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance',
       'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion',
       'YearsWithCurrManager'],
      dtype='object')

In [37]:
# get some basic info on our data
# attrition.describe()

In [38]:
# view the types of each column so we can one hot encode categorical variables 
# we can see we have 9 columns that will need to be encoded 

attrition.dtypes

Age                          int64
Attrition                    int64
BusinessTravel              object
DailyRate                    int64
Department                  object
DistanceFromHome             int64
Education                    int64
EducationField              object
EmployeeCount                int64
EmployeeNumber               int64
EnvironmentSatisfaction      int64
Gender                      object
HourlyRate                   int64
JobInvolvement               int64
JobLevel                     int64
JobRole                     object
JobSatisfaction              int64
MaritalStatus               object
MonthlyIncome                int64
MonthlyRate                  int64
NumCompaniesWorked           int64
Over18                      object
OverTime                    object
PercentSalaryHike            int64
PerformanceRating            int64
RelationshipSatisfaction     int64
StandardHours                int64
StockOptionLevel             int64
TotalWorkingYears   

In [39]:
attrition.dtypes[attrition.dtypes!='int64']

BusinessTravel    object
Department        object
EducationField    object
Gender            object
JobRole           object
MaritalStatus     object
Over18            object
OverTime          object
dtype: object

In [40]:
# one hot encode the categorical columns of our df 
attrition_cat = attrition.select_dtypes(include=['object'])

for column in attrition_cat:
    dummy=pd.get_dummies(attrition_cat[column])
    attrition = pd.concat([attrition,dummy],axis=1)

In [41]:
# drop all the redundant columns 
attrition=attrition.drop(columns=['BusinessTravel','Department','EducationField',
                                  'Gender','JobRole','MaritalStatus','Over18',
                                   'OverTime'])


In [42]:
#view updated columns
attrition.columns

Index(['Age', 'Attrition', 'DailyRate', 'DistanceFromHome', 'Education',
       'EmployeeCount', 'EmployeeNumber', 'EnvironmentSatisfaction',
       'HourlyRate', 'JobInvolvement', 'JobLevel', 'JobSatisfaction',
       'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked',
       'PercentSalaryHike', 'PerformanceRating', 'RelationshipSatisfaction',
       'StandardHours', 'StockOptionLevel', 'TotalWorkingYears',
       'TrainingTimesLastYear', 'WorkLifeBalance', 'YearsAtCompany',
       'YearsInCurrentRole', 'YearsSinceLastPromotion', 'YearsWithCurrManager',
       'Non-Travel', 'Travel_Frequently', 'Travel_Rarely', 'Human Resources',
       'Research & Development', 'Sales', 'Human Resources', 'Life Sciences',
       'Marketing', 'Medical', 'Other', 'Technical Degree', 'Female', 'Male',
       'Healthcare Representative', 'Human Resources', 'Laboratory Technician',
       'Manager', 'Manufacturing Director', 'Research Director',
       'Research Scientist', 'Sales Executive', 'Sales Re

In [43]:
# select the columns to use for prediction in the neural network 
prediction_var= ['Age', 'DailyRate', 'DistanceFromHome', 'Education',
       'EmployeeCount', 'EmployeeNumber', 'EnvironmentSatisfaction',
       'HourlyRate', 'JobInvolvement', 'JobLevel', 'JobSatisfaction',
       'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked',
       'PercentSalaryHike', 'PerformanceRating', 'RelationshipSatisfaction',
       'StandardHours', 'StockOptionLevel', 'TotalWorkingYears',
       'TrainingTimesLastYear', 'WorkLifeBalance', 'YearsAtCompany',
       'YearsInCurrentRole', 'YearsSinceLastPromotion', 'YearsWithCurrManager',
       'Non-Travel', 'Travel_Frequently', 'Travel_Rarely', 'Human Resources',
       'Research & Development', 'Sales', 'Human Resources', 'Life Sciences',
       'Marketing', 'Medical', 'Other', 'Technical Degree', 'Female', 'Male',
       'Healthcare Representative', 'Human Resources', 'Laboratory Technician',
       'Manager', 'Manufacturing Director', 'Research Director',
       'Research Scientist', 'Sales Executive', 'Sales Representative',
       'Divorced', 'Married', 'Single', 'Y', 'No', 'Yes']

X = attrition[prediction_var].values
Y = attrition['Attrition'].values

In [44]:
# confirm the shape of our data is correct 
(X.shape, Y.shape)

((1470, 61), (1470,))

In [45]:
# encode the columns of our dataset
encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)

In [46]:
# define a simple neural net using a function
def create_baseline():
    # create model
    model = Sequential()
    model.add(Dense(units=20, input_dim=61, 
                    kernel_initializer='normal', activation='relu'))
    model.add(Dropout(0.02))
    model.add(Dense(units=10, input_dim=61))
    model.add(Dropout(0.02))
    model.add(Dense(units=5,input_dim=10))
    model.add(Dense(1, kernel_initializer='normal', activation='sigmoid'))
    # for those interesting in creating a sqd optimizer instead of 'adam'
    #sgd = SGD(lr=0.005, decay=1e-6, momentum=0.9, nesterov=True)
    
    # Compile model with the logarithmic loss function and Adam gradient optimizer.
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy']) 
    return model


In [47]:
# Evaluate model using standardized dataset and 10 fold cross validation
estimators = []
#estimators.append(('standardize', StandardScaler()))
estimators.append(('mlp', KerasClassifier(build_fn=create_baseline, 
                                  epochs=2000, batch_size=800, verbose=0)))
pipeline = Pipeline(estimators)
kfold = StratifiedKFold(n_splits=10, shuffle=True)
results = cross_val_score(pipeline, X, Y, cv=kfold)


In [48]:
# model accuracy 
print("Accuracy: %.2f%%" % (results.mean()*100))

Accuracy: 87.41%
