# Employee Attrition Using Random Forest Algorithm

### Install/Load Dependencies

In [18]:
# Install/Load Dependencies
# using Pkg
# Pkg.add(["XLSX", "DataFrames", "DecisionTree"])

The dataset for this post is in *XLSX* format, so we’ll use the *XLSX* package to parse it.

As usual, I’ll use *DataFrames* to view the data in a readable format. This step isn’t required; I just like to inspect my data this way before going further.

*DecisionTree* is the awesome package that we’ll use to build our random forest model.

In [20]:
using DataFrames, DecisionTree, XLSX

### Get Data

The dataset consists of employee data along with an Attrition column (whether or not the employee left the company).

We have two datasets: 1) train data and 2) test data.

First, we read the train data and store it as a DataFrame.

In [6]:
train = DataFrame(XLSX.readtable("train.xlsx","Sheet1")...)
select!(train,r"Attrition",:)

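| Row | Attrition | Age | DailyRate | DistanceFromHome | Education | EmployeeCount | EmployeeNumber |
|-----|-----------|-----|-----------|------------------|-----------|---------------|----------------|
|     | Any | Any | Any | Any | Any | Any | Any |
| 1 | 1 | 41 | 1102 | 1 | 2 | 1 | 1 |
| 2 | 0 | 49 | 279 | 8 | 1 | 1 | 2 |
| 3 | 1 | 37 | 1373 | 2 | 2 | 1 | 4 |
| 4 | 0 | 33 | 1392 | 3 | 4 | 1 | 5 |
| 5 | 0 | 27 | 591 | 2 | 1 | 1 | 7 |
| 6 | 0 | 32 | 1005 | 2 | 2 | 1 | 8 |
| 7 | 0 | 59 | 1324 | 3 | 3 | 1 | 10 |
| 8 | 0 | 30 | 1358 | 24 | 1 | 1 | 11 |
| 9 | 0 | 38 | 216 | 23 | 3 | 1 | 12 |
| 10 | 0 | 36 | 1299 | 27 | 3 | 1 | 13 |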


Before we can actually build the model, we need to separate our data into one Array containing the features and another containing the labels. This is pretty straightforward in Julia: we convert the relevant DataFrame columns to an Array. I chose to drop rows that have missing values, but you can leave them in if you’d like.

### Feature Data

In [21]:
# Columns 2-29 hold the features; index rows and columns explicitly.
# Matrix(...) replaces the deprecated convert(Array, ...) call.
features = Matrix(dropmissing(train)[:, 2:29])


847×28 Array{Any,2}:
 41  1102   1  2  1     1  2  1   94  3  …  1  80  0   8  0  1   6  4   0   5
 49   279   8  1  1     2  3  0   61  2     4  80  1  10  3  3  10  7   1   7
 37  1373   2  2  1     4  4  0   92  2     2  80  0   7  3  3   0  0   0   0
 33  1392   3  4  1     5  4  1   56  3     3  80  0   8  3  3   8  7   3   0
 27   591   2  1  1     7  1  0   40  3     4  80  1   6  3  3   2  2   2   2
 32  1005   2  2  1     8  4  0   79  3  …  3  80  0   8  2  2   7  7   3   6
 59  1324   3  3  1    10  3  1   81  4     1  80  3  12  3  2   1  0   0   0
 30  1358  24  1  1    11  4  0   67  3     2  80  1   1  2  3   1  0   0   0
 38   216  23  3  1    12  4  0   44  2     2  80  0  10  2  3   9  7   1   8
 36  1299  27  3  1    13  3  0   94  3     2  80  2  17  3  2   7  7   7   7
 35   809  16  3  1    14  1  0   84  4  …  3  80  1   6  5  3   5  4   0   3
 29   153  15  2  1    15  4  1   49  2     4  80  0  10  3  3   9  5   0   8
 31   670  26  1  1    16  1  0   31  3    

### Label Data

In [8]:
# Column 1 (Attrition) holds the labels; df[:, 1] already returns a Vector,
# so no convert(Array, ...) call is needed.
labels = dropmissing(train)[:, 1]


847-element Array{Any,1}:
 1
 0
 1
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 ⋮
 0
 1
 0
 1
 0
 0
 0
 1
 0
 0
 0
 0

As you can see from these code blocks, we’re converting columns 2 through 29 of train (the 28 feature columns) to an Array called features, and the first column (Attrition) to an Array called labels. Next, we call the build_forest function from the DecisionTree package, passing in the labels and features.
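As a side note on the feature/label split above: both outputs are displayed as `Array{Any}`, because every column in the DataFrame has element type `Any`. The sketch below uses a small made-up matrix (not the real dataset) to show the same column split in plain Julia, plus one possible way to narrow the element types with broadcasting. The names `raw`, `labels_typed` and `features_typed` are hypothetical, and whether this step is needed for this dataset is an assumption, not something the original run required.

```julia
# Toy stand-in for the training table: the first column is the label,
# the remaining columns are features. Values are made up for illustration.
raw = Any[1 41 1102;
          0 49  279;
          1 37 1373]

labels_any   = raw[:, 1]        # first column  -> Vector{Any}
features_any = raw[:, 2:end]    # other columns -> Matrix{Any}

# Narrow the element types by broadcasting a constructor over each array.
labels_typed   = Int.(labels_any)          # Vector{Int}
features_typed = Float64.(features_any)    # Matrix{Float64}

println(eltype(labels_typed), " ", size(features_typed))
```

Broadcasting a constructor like this throws an error if a value cannot be converted (for example a string or `missing`), which can double as a quick sanity check on the data.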


### Build Model

In [10]:
model = build_forest(labels, features)

Ensemble of Decision Trees
Trees:      10
Avg Leaves: 79.8
Avg Depth:  12.7

The output from build_forest summarizes the main characteristics of our random forest: 10 trees, with an average of 79.8 leaves and an average depth of 12.7 per tree. We will put it to the test in the next step!

### Test Model

One of the nice things about this dataset is that the test data is already separated out into its own dataset.
Let’s pull in our test dataset using the same functions that we used to pull in the training data.

In [11]:
test = DataFrame(XLSX.readtable("test.xlsx","Sheet1")...)
select!(test,r"Attrition",:)

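| Row | Attrition | Age | DailyRate | DistanceFromHome | Education | EmployeeCount | EmployeeNumber |
|-----|-----------|-----|-----------|------------------|-----------|---------------|----------------|
|     | Any | Any | Any | Any | Any | Any | Any |
| 1 | 0 | 35 | 636 | 4 | 4 | 1 | 1185 |
| 2 | 1 | 43 | 1372 | 9 | 3 | 1 | 1188 |
| 3 | 0 | 32 | 862 | 2 | 1 | 1 | 1190 |
| 4 | 0 | 56 | 718 | 4 | 4 | 1 | 1191 |
| 5 | 0 | 29 | 1401 | 6 | 1 | 1 | 1192 |
| 6 | 0 | 19 | 645 | 9 | 2 | 1 | 1193 |
| 7 | 0 | 45 | 1457 | 7 | 3 | 1 | 1195 |
| 8 | 0 | 37 | 977 | 1 | 3 | 1 | 1196 |
| 9 | 0 | 20 | 805 | 3 | 3 | 1 | 1198 |
| 10 | 1 | 44 | 1097 | 10 | 4 | 1 | 1200 |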


Now that we have our test dataset, we need to again create two Arrays, one for our features and another for our labels, and then make use of the apply_forest function to obtain our predictions. Since I used dropmissing for our training data, I did the same for our test data.

In [12]:
# Same feature columns (2-29) as for the training data.
# Matrix(...) replaces the deprecated convert(Array, ...) call.
features_test = Matrix(dropmissing(test)[:, 2:29])


622×28 Array{Any,2}:
 35   636   4  4  1  1185  4  0  47  …  3  2  80  1   2  2  4   2  2  2   2
 43  1372   9  3  1  1188  1  1  85     3  2  80  0   7  2  2   4  3  1   3
 32   862   2  1  1  1190  3  1  76     3  3  80  3   1  3  3   1  0  0   0
 56   718   4  4  1  1191  4  1  92     3  4  80  1  28  2  3   5  2  4   2
 29  1401   6  1  1  1192  2  1  54     3  1  80  1  10  5  3  10  8  0   8
 19   645   9  2  1  1193  3  0  54  …  4  3  80  0   1  4  3   1  1  0   0
 45  1457   7  3  1  1195  1  1  83     3  3  80  1   7  2  2   3  2  0   2
 37   977   1  3  1  1196  4  1  56     3  2  80  1  14  2  2  14  8  3  11
 20   805   3  3  1  1198  1  0  87     3  1  80  0   2  2  2   2  2  1   2
 44  1097  10  4  1  1200  3  0  96     3  3  80  0   6  4  3   6  4  0   2
 53  1223   7  2  1  1201  4  1  50  …  3  2  80  1  26  6  3   7  7  4   7
 29   942  15  1  1  1202  2  1  69     3  1  80  1   6  2  2   5  4  1   3
 22  1256   3  4  1  1203  3  0  48     3  2  80  1   1  5  3   0  

In [13]:
# Column 1 (Attrition) holds the true labels for the test set.
labels_test = dropmissing(test)[:, 1]


622-element Array{Any,1}:
 0
 1
 0
 0
 0
 0
 0
 0
 0
 1
 0
 0
 1
 ⋮
 0
 0
 0
 1
 0
 0
 0
 0
 0
 0
 0
 0

After that, we can construct an Array of predictions with the apply_forest function. We do this by making use of Julia’s array comprehension feature. We define an Array predictions and construct it by looping through every row of our features_test Array (a two-dimensional Array), grabbing all of the columns for the given row, and passing them (along with our model from the previous step) to apply_forest.
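To make the row-by-row comprehension described above concrete, here is a minimal sketch on made-up data. `toy_predict` is a hypothetical stand-in for `apply_forest(model, row)` and is not part of DecisionTree; only the looping pattern carries over.

```julia
# Hypothetical stand-in for apply_forest(model, row): predicts 1 when the
# first feature is below 30, otherwise 0. Not part of DecisionTree.
toy_predict(row) = row[1] < 30 ? 1 : 0

# Made-up feature matrix: one row per employee, one column per feature.
toy_features = [41 1102;
                27  591;
                19  645]

# Loop over the row indices, slice out each row, and collect the results.
toy_predictions = [toy_predict(toy_features[i, :]) for i in 1:size(toy_features, 1)]

println(toy_predictions)   # [0, 1, 1]
```

The comprehension yields one prediction per row, so its length always matches the number of rows in the feature matrix.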

In [15]:
predictions = [apply_forest(model, features_test[i, :]) for i = 1:size(features_test, 1)]

622-element Array{Int64,1}:
 0
 0
 0
 0
 0
 1
 0
 0
 0
 0
 0
 0
 1
 ⋮
 0
 0
 1
 0
 0
 0
 1
 0
 0
 0
 0
 0

You can see from the output above that we now have an Array of predicted labels for our test features. To check how well our model did, we’ll simply compute the fraction of correct predictions as follows:

In [16]:
corrects = predictions .== labels_test

622-element BitArray{1}:
 1
 0
 1
 1
 1
 0
 1
 1
 1
 0
 1
 1
 1
 ⋮
 1
 1
 0
 0
 1
 1
 0
 1
 1
 1
 1
 1

### OUTPUT

In [17]:
percent_correct = count(i -> i == true, corrects) / length(corrects)

0.8472668810289389

In the code above, we check whether each item in predictions is the same as its corresponding item in labels_test by broadcasting the == equality operator via dot syntax. This results in the creation of a BitArray (this is not a new cryptocurrency), which we then use to compute the fraction of correct predictions. The latter part is achieved by calling the count function on the corrects BitArray with an anonymous function that checks whether i == true. It returns the number of elements that pass the test, which we divide by the length of the BitArray. On this run you can see that we achieved about 85% accuracy with just the default settings!
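The same accuracy calculation can be traced by hand on a small example. The sketch below uses made-up predictions and labels (not the real results), and also shows two shorter forms that should give the same number, since `count` on a Boolean collection counts the `true` entries and `sum` treats `true` as 1.

```julia
# Made-up predictions and true labels, for illustration only.
toy_predictions = [0, 1, 0, 0, 1]
toy_labels      = [0, 1, 1, 0, 0]

# Element-wise comparison via the broadcast form of ==.
toy_corrects = toy_predictions .== toy_labels   # BitVector: 1 1 0 1 0

# count with a predicate, as in the cell above ...
accuracy_a = count(i -> i == true, toy_corrects) / length(toy_corrects)

# ... and two shorter equivalents.
accuracy_b = count(toy_corrects) / length(toy_corrects)
accuracy_c = sum(toy_corrects) / length(toy_corrects)

println(accuracy_a)   # 0.6
```

Three of the five toy predictions match their labels, so all three expressions evaluate to 0.6.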