# Employee Attrition using H2O.ai and Lime

## Load Libraries

In [2]:
# Load the following packages
library(tidyquant)  # Loads tidyverse and several other pkgs 
library(readxl)     # Super simple excel reader
library(h2o)        # Professional grade ML pkg
library(lime)       # Explain complex black-box ML models

## Data Processing

In [3]:
# Read excel data
hr_data_raw <- read_excel(path = "WA_Fn-UseC_-HR-Employee-Attrition.xlsx")

In [4]:
# View first 10 rows
hr_data_raw[1:10,]

Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,1,80,0,8,0,1,6,4,0,5
49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,4,80,1,10,3,3,10,7,1,7
37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,2,80,0,7,3,3,0,0,0,0
33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,...,3,80,0,8,3,3,8,7,3,0
27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,4,80,1,6,3,3,2,2,2,2
32,No,Travel_Frequently,1005,Research & Development,2,2,Life Sciences,1,8,...,3,80,0,8,2,2,7,7,3,6
59,No,Travel_Rarely,1324,Research & Development,3,3,Medical,1,10,...,1,80,3,12,3,2,1,0,0,0
30,No,Travel_Rarely,1358,Research & Development,24,1,Life Sciences,1,11,...,2,80,1,1,2,3,1,0,0,0
38,No,Travel_Frequently,216,Research & Development,23,3,Life Sciences,1,12,...,2,80,0,10,2,3,9,7,1,8
36,No,Travel_Rarely,1299,Research & Development,27,3,Medical,1,13,...,2,80,2,17,3,2,7,7,7,7


In [5]:
# Convert character columns to factors and move the target column to the front
hr_data <- hr_data_raw %>%
    mutate_if(is.character, as.factor) %>%
    select(Attrition, everything())
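
To make the conversion step concrete, here is a minimal base-R sketch on a made-up data frame. The name `toy` and its values are hypothetical and not taken from the attrition data. It is meant to mirror what `mutate_if(is.character, as.factor)` followed by `select(Attrition, everything())` does, without relying on the tidyverse.

```r
# Hypothetical three-row data frame, for illustration only
toy <- data.frame(
    Age       = c(41, 49, 37),
    Attrition = c("Yes", "No", "Yes"),
    Gender    = c("Female", "Male", "Male"),
    stringsAsFactors = FALSE
)

# Convert every character column to a factor, leave numeric columns alone
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], as.factor)

# Move the target column to the front, keep the remaining column order
toy <- toy[, c("Attrition", setdiff(names(toy), "Attrition"))]

str(toy)
```

The target needs to be a factor so that the later modeling step is treated as classification rather than regression.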

In [6]:
glimpse(hr_data)

Observations: 1,470
Variables: 35
$ Attrition                <fct> Yes, No, Yes, No, No, No, No, No, No, No, ...
$ Age                      <dbl> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 35...
$ BusinessTravel           <fct> Travel_Rarely, Travel_Frequently, Travel_R...
$ DailyRate                <dbl> 1102, 279, 1373, 1392, 591, 1005, 1324, 13...
$ Department               <fct> Sales, Research & Development, Research & ...
$ DistanceFromHome         <dbl> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, 15, 2...
$ Education                <dbl> 2, 1, 2, 4, 1, 2, 3, 1, 3, 3, 3, 2, 1, 2, ...
$ EducationField           <fct> Life Sciences, Life Sciences, Other, Life ...
$ EmployeeCount            <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
$ EmployeeNumber           <dbl> 1, 2, 4, 5, 7, 8, 10, 11, 12, 13, 14, 15, ...
$ EnvironmentSatisfaction  <dbl> 2, 3, 4, 4, 1, 4, 3, 4, 4, 3, 1, 4, 1, 2, ...
$ Gender                   <fct> Female, Male, Male, Female, Male, Male, Fe...
$ HourlyRate      

## Modeling Employee Attrition
Here we model employee attrition with the h2o.automl() function from the H2O platform.
First, initialize the Java Virtual Machine (JVM) that H2O runs on locally.

In [7]:
# Initialize H2O JVM

h2o.init()


H2O is not running yet, starting it now...

Note:  In case of errors look at the following log files:
    C:\Users\NILKAN~1\AppData\Local\Temp\Rtmpq8qERK/h2o_Nilkanth_Jadhav_started_from_r.out
    C:\Users\NILKAN~1\AppData\Local\Temp\Rtmpq8qERK/h2o_Nilkanth_Jadhav_started_from_r.err


Starting H2O JVM and connecting: ... Connection successful!

R is connected to the H2O cluster: 
    H2O cluster uptime:         19 seconds 310 milliseconds 
    H2O cluster timezone:       Asia/Kolkata 
    H2O data parsing timezone:  UTC 
    H2O cluster version:        3.20.0.9 
    H2O cluster version age:    10 days  
    H2O cluster name:           H2O_started_from_R_Nilkanth_Jadhav_qfb796 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   1.74 GB 
    H2O cluster total cores:    4 
    H2O cluster allowed cores:  4 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        54321 
    H2O Connection proxy:       NA 
    H2O Int

In [8]:
#h2o.no_progress()  # Turn off output of progress bars

# Next, change the data to an h2o object that the package can interpret.  Also
# split the data into training, validation, and test sets - 70%, 15%, 15%,
# respectively.

# Split data into Train/Validation/Test Sets

hr_data_h2o <- as.h2o(hr_data)

split_h2o <- h2o.splitFrame(hr_data_h2o, c(0.7, 0.15), seed = 1234)

train_h2o <- h2o.assign(split_h2o[[1]], "train")  # 70%
valid_h2o <- h2o.assign(split_h2o[[2]], "valid")  # 15%
test_h2o <- h2o.assign(split_h2o[[3]], "test")  # 15%
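
The split sizes are approximate rather than exact: the confusion matrix later in this notebook sums to 211 test rows, about 14.4% of the 1,470 observations instead of exactly 15%. As far as we understand, this is because h2o.splitFrame() assigns rows to splits at random instead of cutting at fixed row counts. The base-R sketch below imitates that idea under this assumption; it is not H2O's actual implementation, and the seed and the name `split_label` are ours.

```r
# Toy illustration (base R only): assign each row to a split at random,
# so the resulting sizes are close to, but not exactly, 70/15/15.
set.seed(1234)                      # arbitrary seed, for reproducibility
n <- 1470                           # number of rows in the attrition data
u <- runif(n)                       # one uniform draw per row

split_label <- cut(
    u,
    breaks = c(0, 0.70, 0.85, 1),   # cumulative ratios: 0.70, 0.70 + 0.15, 1
    labels = c("train", "valid", "test"),
    include.lowest = TRUE
)

split_sizes <- table(split_label)
split_sizes                         # row counts per split
round(prop.table(split_sizes), 3)   # observed proportions per split
```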



In [9]:
# To build the model, set the target and feature names.
# The target is what we aim to predict (in this case "Attrition").
# The features (every other column) are what we use to model the prediction.

# Set names for h2o
y <- "Attrition"
x <- setdiff(names(train_h2o), y)

To run h2o.automl(), we set the following arguments:

- x = x: The names of our feature columns.
- y = y: The name of our target column.
- training_frame = train_h2o: Our training set consisting of 70% of the data.
- leaderboard_frame = valid_h2o: Our validation set consisting of 15% of the data. H2O uses this to score and rank the candidate models on the leaderboard, which helps avoid picking a model that merely overfits the training data.
- max_runtime_secs = 30: We supply this to speed up H2O's modeling. AutoML trains a large number of complex models, so we cap the runtime and keep things moving at the expense of some accuracy.


In [10]:
# Run the automated machine learning 
automl_models_h2o <- h2o.automl(
    x = x, 
    y = y,
    training_frame    = train_h2o,
    leaderboard_frame = valid_h2o,
    max_runtime_secs  = 30
    )



All of the models are stored in the automl_models_h2o object. However, we are only concerned with the leader, which is the model ranked first on the leaderboard scored against the validation set. We'll extract it from the models object.

In [11]:
# Extract leader model
automl_leader <- automl_models_h2o@leader

## Predict using best model

In [12]:
# Predict on hold-out set, test_h2o
pred_h2o <- h2o.predict(object = automl_leader, newdata = test_h2o)



## Performance of model

In [13]:
# Prep for performance assessment
test_performance <- test_h2o %>%
    tibble::as_tibble() %>%
    select(Attrition) %>%
    add_column(pred = as.vector(pred_h2o$predict)) %>%
    mutate_if(is.character, as.factor)
test_performance

Attrition,pred
No,No
No,No
Yes,Yes
No,No
No,No
No,No
Yes,Yes
No,No
No,No
Yes,Yes


In [14]:
# Confusion table counts
confusion_matrix <- test_performance %>%
    table() 
confusion_matrix

         pred
Attrition  No Yes
      No  163  19
      Yes  12  17

In [25]:
# Performance analysis
tn <- confusion_matrix[1]
tp <- confusion_matrix[4]
fp <- confusion_matrix[3]
fn <- confusion_matrix[2]

accuracy <- (tp + tn)/(tp + tn + fp + fn)
misclassification_rate <- 1 - accuracy
recall <- tp/(tp + fn)
precision <- tp/(tp + fp)
null_error_rate <- (tp + fn)/(tp + tn + fp + fn)  # error rate if we always predict the majority class ("No")

tibble(accuracy, misclassification_rate, recall, precision, null_error_rate) %>% 
    transpose()

cat("Recall for best model is ", recall, "In an HR context, this is ", (recall * 
    100), "% more employees that could potentially be targeted prior to quitting.
From that standpoint, an organization that loses 100 people per year could possibly target", 
    ceiling(recall * 100), "employees by implementing measures to retain.")

Recall for best model is  0.5862069 In an HR context, this is  58.62069 % more employees that could potentially be targeted prior to quitting.
From that standpoint, an organization that loses 100 people per year could possibly target 59 employees by implementing measures to retain.
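
The positional indexing used in the performance cell (`confusion_matrix[1]` through `confusion_matrix[4]`) depends on R storing tables column by column, which is easy to get wrong. As a cross-check, the sketch below rebuilds the confusion matrix printed above by hand and indexes it by name instead. The object name `cm` is ours, and the counts are copied from the output above.

```r
# Rebuild the confusion matrix shown above as a named matrix (base R only).
cm <- matrix(
    c(163, 12, 19, 17),             # filled column by column
    nrow = 2,
    dimnames = list(Attrition = c("No", "Yes"), pred = c("No", "Yes"))
)

# Index by [actual, predicted] instead of by position.
tn <- cm["No",  "No"]               # stayed, predicted to stay
fp <- cm["No",  "Yes"]              # stayed, predicted to leave
fn <- cm["Yes", "No"]               # left, predicted to stay
tp <- cm["Yes", "Yes"]              # left, predicted to leave

total     <- sum(cm)
accuracy  <- (tp + tn) / total
recall    <- tp / (tp + fn)
precision <- tp / (tp + fp)

round(c(accuracy = accuracy, recall = recall, precision = precision), 4)
# accuracy 0.8531, recall 0.5862, precision 0.4722
```

The recall agrees with the 0.5862069 reported above, which confirms that the positional indices pick out the intended cells.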