# Introduction 

Keeping employees happy and satisfied is a perennial challenge. If an employee in whom you have invested a lot of time and money leaves for "greener pastures", you have to spend even more time and money to hire somebody else. In the spirit of Kaggle, let us therefore turn to predictive modelling and see if we can predict employee attrition on this synthetically generated IBM dataset.

This notebook is structured as follows:

 1. **Exploratory Data Analysis** : We explore the dataset by looking at the feature distributions and how correlated the features are with one another, and we create some Seaborn and Plotly visualisations
 2. **Feature Engineering and Categorical Encoding** : We conduct some feature engineering and encode all our categorical features into dummy variables
 3. **Implementing Machine Learning models** : We implement a Random Forest, a Support Vector Machine and a Gradient Boosted Model after which we look at feature importances from these respective models

Let's Go.

In [2]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Import statements required for Plotly 
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls


from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss
from imblearn.over_sampling import SMOTE
import xgboost

# Import and suppress warnings
import warnings
warnings.filterwarnings('ignore')

# 1. Explore the Data

Let's load the dataset and take a look at it.

In [3]:
attrition = pd.read_csv('WA_Fn-UseC_-HR-Employee-Attrition.csv')
attrition.head()

| | Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeCount | EmployeeNumber | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41 | Yes | Travel_Rarely | 1102 | Sales | 1 | 2 | Life Sciences | 1 | 1 | ... | 1 | 80 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
| 1 | 49 | No | Travel_Frequently | 279 | Research & Development | 8 | 1 | Life Sciences | 1 | 2 | ... | 4 | 80 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
| 2 | 37 | Yes | Travel_Rarely | 1373 | Research & Development | 2 | 2 | Other | 1 | 4 | ... | 2 | 80 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
| 3 | 33 | No | Travel_Frequently | 1392 | Research & Development | 3 | 4 | Life Sciences | 1 | 5 | ... | 3 | 80 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
| 4 | 27 | No | Travel_Rarely | 591 | Research & Development | 2 | 1 | Medical | 1 | 7 | ... | 4 | 80 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |


The Attrition column is the target we train the model to predict. The dataset has a mix of categorical and numerical columns.

**Data quality checks**

Next we check for null values in the dataset

In [4]:
# Looking for null
attrition.isnull().any()

Age                         False
Attrition                   False
BusinessTravel              False
DailyRate                   False
Department                  False
DistanceFromHome            False
Education                   False
EducationField              False
EmployeeCount               False
EmployeeNumber              False
EnvironmentSatisfaction     False
Gender                      False
HourlyRate                  False
JobInvolvement              False
JobLevel                    False
JobRole                     False
JobSatisfaction             False
MaritalStatus               False
MonthlyIncome               False
MonthlyRate                 False
NumCompaniesWorked          False
Over18                      False
OverTime                    False
PercentSalaryHike           False
PerformanceRating           False
RelationshipSatisfaction    False
StandardHours               False
StockOptionLevel            False
TotalWorkingYears           False
TrainingTimesL

In [5]:
# Now we convert the 'Yes' and 'No' in "Attrition" column into '1' and '0'. First, we create a dictionary.
target_map = {'Yes':1, 'No':0}
# Apply the dictionary to convert the 'Attrition' column .
attrition["Attrition_numerical"] = attrition["Attrition"].apply(lambda x: target_map[x])
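As a small aside, the same encoding can be written with `Series.map`, which takes the dictionary directly. The sketch below runs on a toy dataframe with made-up values (not the real dataset), so it only illustrates the mechanics.

```python
import pandas as pd

# Toy stand-in for the 'Attrition' column (hypothetical values, not the real data)
toy = pd.DataFrame({"Attrition": ["Yes", "No", "No", "Yes"]})
target_map = {"Yes": 1, "No": 0}

# Series.map looks each value up in the dictionary, element by element
toy["Attrition_numerical"] = toy["Attrition"].map(target_map)
print(toy["Attrition_numerical"].tolist())
```

One difference worth knowing: `map` produces NaN for a value that is missing from the dictionary, whereas the lambda lookup raises a `KeyError`.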

### Correlation of Features

Let's see how features are correlated to each other by plotting the correlation matrix. We're going to plot the matrix for numerical features.


In [6]:
# creating a list of only numerical values
numerical = [u'Age', u'DailyRate', u'DistanceFromHome', u'Education', u'EmployeeNumber', u'EnvironmentSatisfaction',
       u'HourlyRate', u'JobInvolvement', u'JobLevel', u'JobSatisfaction',
       u'MonthlyIncome', u'MonthlyRate', u'NumCompaniesWorked',
       u'PercentSalaryHike', u'PerformanceRating', u'RelationshipSatisfaction',
       u'StockOptionLevel', u'TotalWorkingYears',
       u'TrainingTimesLastYear', u'WorkLifeBalance', u'YearsAtCompany',
       u'YearsInCurrentRole', u'YearsSinceLastPromotion',
       u'YearsWithCurrManager']
data = [
    go.Heatmap(
        z= attrition[numerical].astype(float).corr().values, # Generating the Pearson correlation
        x=attrition[numerical].columns.values,
        y=attrition[numerical].columns.values,
        colorscale='Viridis',
        reversescale = False,
        text = True ,
        opacity = 1.0
        
    )
]


layout = go.Layout(
    title='Pearson Correlation of numerical features',
    xaxis = dict(ticks='', nticks=36),
    yaxis = dict(ticks='' ),
    width = 900, height = 700,
    
)


fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='labelled-heatmap')

**From the plots**

Many of the columns are only weakly correlated with each other. In general, we prefer features that are not strongly correlated with one another, because highly correlated features carry redundant information.
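To go from eyeballing the heatmap to a concrete list, one option is to extract the strongly correlated pairs from the correlation matrix. The sketch below uses synthetic columns that merely borrow names from the dataset, and the 0.8 cut-off is an arbitrary illustrative choice.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
years = rng.integers(0, 20, size=n).astype(float)

# Synthetic data: the second column is built to track the first, the third is independent
toy = pd.DataFrame({
    "YearsAtCompany": years,
    "YearsInCurrentRole": years * 0.7 + rng.normal(0, 0.5, size=n),
    "DistanceFromHome": rng.integers(1, 30, size=n).astype(float),
})

corr = toy.corr().abs()

# Keep only the upper triangle so that each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().sort_values(ascending=False)

# 0.8 is an arbitrary threshold chosen for illustration
strong_pairs = pairs[pairs > 0.8]
print(strong_pairs)
```

On the real data, the same steps could be applied to `attrition[numerical].corr()`.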

# 2. Feature Engineering & Categorical Encoding

We will encode the categorical columns and drop unnecessary numerical columns.

In [7]:
# Drop the helper column 'Attrition_numerical'; the target is rebuilt from 'Attrition' later
attrition = attrition.drop(['Attrition_numerical'], axis=1)

# Now we separate the categorical columns into a list.
categorical = []
for col, value in attrition.items():  # DataFrame.iteritems() was removed in pandas 2.0
    if value.dtype == 'object':
        categorical.append(col)

# The numerical features are the columns that are not in the categorical list
numerical = attrition.columns.difference(categorical)

Next, let's encode the categorical features.

In [8]:
# Store the categorical data in a dataframe called attrition_cat
attrition_cat = attrition[categorical]
attrition_cat = attrition_cat.drop(['Attrition'], axis=1) # Dropping the target column

Use **get_dummies** to encode our categorical features.

In [9]:
attrition_cat = pd.get_dummies(attrition_cat)
attrition_cat.head(3)

| | BusinessTravel_Non-Travel | BusinessTravel_Travel_Frequently | BusinessTravel_Travel_Rarely | Department_Human Resources | Department_Research & Development | Department_Sales | EducationField_Human Resources | EducationField_Life Sciences | EducationField_Marketing | EducationField_Medical | ... | JobRole_Research Director | JobRole_Research Scientist | JobRole_Sales Executive | JobRole_Sales Representative | MaritalStatus_Divorced | MaritalStatus_Married | MaritalStatus_Single | Over18_Y | OverTime_No | OverTime_Yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 |
| 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 |
| 2 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 |
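To make the encoding step concrete, here is a minimal sketch on a two-column toy dataframe (hypothetical values). It also shows the optional `drop_first=True` argument, which drops one dummy per feature; that option is not used in this notebook and is included only for comparison.

```python
import pandas as pd

# Toy categorical data (hypothetical values, column names borrowed from the dataset)
toy = pd.DataFrame({
    "OverTime": ["Yes", "No", "Yes"],
    "MaritalStatus": ["Single", "Married", "Single"],
})

# One dummy column per category
full = pd.get_dummies(toy)

# Drop the first category of each feature to avoid a redundant column
reduced = pd.get_dummies(toy, drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

Recent pandas versions return boolean dummy columns by default; passing `dtype=int` gives 0/1 values like the output shown above.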


**Creating new features from Numerical data**

In [10]:
# Store the numerical features to a dataframe attrition_num
attrition_num = attrition[numerical]

To keep the web application simple, we will build the model on the numerical features only. We will also drop some numerical features that we do not need.

In [11]:
# Keep only the numerical features and drop the ones we do not need
# (drop returns a new dataframe, so attrition_num itself is left unchanged)
attrition_final = attrition_num.drop(['EmployeeCount','EmployeeNumber','HourlyRate','DailyRate','MonthlyRate',
                     'StandardHours','StockOptionLevel','PercentSalaryHike','Education','JobLevel'], axis = 1)

In [12]:
# Define a dictionary for the target mapping
target_map = {'Yes':1, 'No':0}
# Use the pandas apply method to numerically encode our attrition target variable
target = attrition["Attrition"].apply(lambda x: target_map[x])
target.head(3)

0    1
1    0
2    1
Name: Attrition, dtype: int64

Take a quick look at the target feature, which is 'Attrition'.

In [13]:
data = [go.Bar(
            x=attrition["Attrition"].value_counts().index.values,
            y= attrition["Attrition"].value_counts().values
    )]

py.iplot(data, filename='basic-bar')

There is a big difference between the number of 'Yes' and 'No' cases, so the target is imbalanced. We will therefore use an oversampling technique to treat this imbalance.
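The imbalance can also be expressed as class shares rather than read off the bar chart. The labels below are invented for illustration; on the real data the same call would be `attrition["Attrition"].value_counts(normalize=True)`.

```python
import pandas as pd

# Hypothetical labels: 8 'No' and 2 'Yes' (not the real class counts)
labels = pd.Series(["No"] * 8 + ["Yes"] * 2, name="Attrition")

# normalize=True returns proportions instead of raw counts
shares = labels.value_counts(normalize=True)
print(shares)
```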

# 3. Implementing Machine Learning Models

We will use different learning models and choose the most accurate one. 

**Splitting Data into Train and Test sets**

Let's split the dataset into Train and Test sets using scikit-learn's **train_test_split** function.

In [14]:
# Import the train_test_split method
# (sklearn.cross_validation was removed in scikit-learn 0.20; use sklearn.model_selection)
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit

# Split the data into a training set (75%) and a held-out test set (25%)
train, test, target_train, target_val = train_test_split(attrition_final, target, train_size=0.75, random_state=0)
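Because the target is imbalanced, it can help to keep the class proportions identical in both splits. The sketch below shows the `stratify` argument of `train_test_split` on toy arrays; the notebook itself does not stratify, so this is a possible variation rather than what was run.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples, 4 of them (20%) in the positive class
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 16 + [1] * 4)

# stratify=y preserves the class proportions in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, random_state=0, stratify=y
)

print(y_train.mean(), y_test.mean())
```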


**SMOTE to oversample due to the imbalance in target**


In [15]:
oversampler=SMOTE(random_state=0)
# Newer imbalanced-learn releases name this method fit_resample (formerly fit_sample)
smote_train, smote_target = oversampler.fit_resample(train, target_train)
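For intuition about what SMOTE does, the core idea is to create new minority samples by interpolating between a minority point and one of its nearest minority neighbours. The function below is a simplified NumPy illustration of that idea, with a hypothetical helper name and toy points; it is not imbalanced-learn's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy minority-class points (hypothetical values)
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.5], [3.0, 3.0]])

def synthesize(points, n_new, k=2, rng=rng):
    """Simplified SMOTE-style interpolation (illustrative helper, not a library API)."""
    new_rows = []
    for _ in range(n_new):
        i = rng.integers(len(points))                   # pick a random minority point
        dists = np.linalg.norm(points - points[i], axis=1)
        neighbours = np.argsort(dists)[1:k + 1]         # skip position 0, the point itself
        j = rng.choice(neighbours)                      # pick one of its k nearest neighbours
        lam = rng.random()                              # interpolation weight in [0, 1)
        new_rows.append(points[i] + lam * (points[j] - points[i]))
    return np.array(new_rows)

synthetic = synthesize(minority, n_new=3)
print(synthetic)
```

Each synthetic row lies on the line segment between two existing minority points, which is why oversampling is applied to the training set only and the test set is left untouched.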

## A. Random Forest Classifier 

Let's try the first learning model, the Random Forest Classifier. Details about this classifier can be found [here](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).

**Set up the parameters for our classifier**

In [16]:
seed = 0   # We set our random seed to zero for reproducibility
# Random Forest parameters
rf_params = {
    'n_jobs': -1,
    'n_estimators': 800,
    'warm_start': True, 
    'max_depth': 9,
    'min_samples_leaf': 2,
    'max_features' : 'sqrt',  # the earlier duplicate key (0.3) was silently overridden by this value
    'random_state' : seed,
    'verbose': 0
}

In [17]:
rf = RandomForestClassifier(**rf_params) 
attrition_final.head()

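| | Age | DistanceFromHome | EnvironmentSatisfaction | JobInvolvement | JobSatisfaction | MonthlyIncome | NumCompaniesWorked | PerformanceRating | RelationshipSatisfaction | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41 | 1 | 2 | 3 | 4 | 5993 | 8 | 3 | 1 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
| 1 | 49 | 8 | 3 | 2 | 2 | 5130 | 1 | 4 | 4 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
| 2 | 37 | 2 | 4 | 2 | 3 | 2090 | 6 | 3 | 2 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
| 3 | 33 | 3 | 4 | 3 | 3 | 2909 | 1 | 3 | 3 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
| 4 | 27 | 2 | 1 | 3 | 2 | 3468 | 9 | 3 | 4 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |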


Now we train the model with the **fit** method.

In [18]:
rf.fit(smote_train, smote_target)
print("Fitting of Random Forest has finished")

attrition_final.describe()

import requests, json
import joblib  # sklearn.externals.joblib was removed from recent scikit-learn releases

BASE_URL = "http://localhost:5000"


Fitting of Random Forest has finished


After fitting on the Train dataset, it's time to test our model on the Test dataset, which the model has never seen before. To predict on our test data with the Random Forest, we can use sklearn's **.predict** method as follows:

In [19]:
rf_predictions = rf.predict(test)
print("Predictions finished")
joblib.dump(rf, "random_forest_model.pkl") #Serialize the model for web application
exp = np.array([52,20,2,4,2,6050,6,3,4,10,1,2,1,1,0,1]) #Test a random employee
exp = exp.reshape(1,-1)
print(rf.predict(exp))  #[0] means 'No' , [1] means 'Yes'

Predictions finished
[0]
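Since the model is serialized for a web application, it is worth checking that a saved model can be loaded back and gives the same predictions. The sketch below does this round trip with a small model trained on random toy data and a temporary file; the file name is hypothetical and the web application side is not shown.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy training data (random values, for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = (X[:, 0] > 0).astype(int)

model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_model.pkl")  # hypothetical file name
    joblib.dump(model, path)                   # serialize to disk
    restored = joblib.load(path)               # load it back, as a web app would

# The restored model should reproduce the original predictions
same = np.array_equal(model.predict(X), restored.predict(X))
print(same)
```

A loaded model expects its input columns in the same order as during training, which is why the hand-built example above lists its 16 values in the column order of `attrition_final`.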


What about the accuracy of the model? Here it is:

In [20]:
accuracy_score(target_val, rf_predictions)

0.8288043478260869
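With an imbalanced target, accuracy alone can hide poor performance on the minority class, so it may be worth looking at the confusion matrix and recall as well. The labels and predictions below are invented for illustration; on the notebook's data the same functions could be called with `target_val` and `rf_predictions`.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical ground truth and predictions: 8 negatives, 2 positives
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]  # one of the two positives is missed

acc = accuracy_score(y_true, y_pred)   # share of all predictions that are correct
rec = recall_score(y_true, y_pred)     # share of actual positives that are found
cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class

print(acc, rec)
print(cm)
```

In this toy case the accuracy is high even though half of the positive cases are missed, which is the situation recall is meant to expose.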

### Feature Ranking via the Random Forest 

Among all the features we just used, which are the most important? Let's find out with the **feature_importances_** attribute of the fitted scikit-learn model. We then plot the importances for easier comparison.

In [21]:
# Scatter plot 
trace = go.Scatter(
    y = rf.feature_importances_,
    x = attrition_final.columns.values,
    mode='markers',
    marker=dict(
        sizemode = 'diameter',
        sizeref = 1,
        size = 13,
        #size= rf.feature_importances_,
        #color = np.random.randn(500), #set color equal to a variable
        color = rf.feature_importances_,
        colorscale='Portland',
        showscale=True
    ),
    text = attrition_final.columns.values
)
data = [trace]

layout= go.Layout(
    autosize= True,
    title= 'Random Forest Feature Importance',
    hovermode= 'closest',
     xaxis= dict(
         ticklen= 5,
         showgrid=False,
        zeroline=False,
        showline=False
     ),
    yaxis=dict(
        title= 'Feature Importance',
        showgrid=False,
        zeroline=False,
        ticklen= 5,
        gridwidth= 2
    ),
    showlegend= False
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig,filename='scatter2010')

#### **Most important RF features** : Work-Life Balance, Monthly Income and Job Satisfaction

As the plot shows, the three most important features for the Random Forest are Work-Life Balance, Monthly Income and Job Satisfaction, which is not surprising.
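Besides the scatter plot, the importances can be read as a sorted table. The sketch below fits a small forest on synthetic data with placeholder feature names; on the notebook's model the equivalent would pair `rf.feature_importances_` with `attrition_final.columns`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with placeholder feature names (not the attrition dataset)
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
cols = [f"feature_{i}" for i in range(5)]

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Pair each importance with its feature name and sort from most to least important
ranking = pd.Series(model.feature_importances_, index=cols).sort_values(ascending=False)
print(ranking)
```

The impurity-based importances sum to 1, so each value can be read as a relative share.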



### Visualising Tree Diagram with Graphviz

Let us now visualise how a single decision tree traverses the features in our data. sklearn's tree module provides a convenient **export_graphviz** function that exports the tree diagram to a .dot file, which we then convert to a .png image that can be viewed in the output of this kernel.

In [22]:
from sklearn import tree
from IPython.display import Image as PImage
from subprocess import check_call
from PIL import Image, ImageDraw, ImageFont
import re

decision_tree = tree.DecisionTreeClassifier(max_depth = 4)
decision_tree.fit(train, target_train)

# Predicting results for test dataset
y_pred = decision_tree.predict(test)

# Export our trained model as a .dot file
with open("tree1.dot", 'w') as f:
    tree.export_graphviz(decision_tree,
                         out_file=f,
                         max_depth = 4,
                         impurity = False,
                         feature_names = attrition_final.columns.values,
                         class_names = ['No', 'Yes'],
                         rounded = True,
                         filled= True )

# Convert .dot to .png to allow display in web notebook
# (requires the Graphviz 'dot' executable on the PATH; otherwise check_call raises FileNotFoundError)
check_call(['dot','-Tpng','tree1.dot','-o','tree1.png'])

# Annotating chart with PIL
img = Image.open("tree1.png")
draw = ImageDraw.Draw(img)
img.save('sample-out.png')
PImage("sample-out.png")

FileNotFoundError: [WinError 2] The system cannot find the file specified
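When the Graphviz `dot` executable is not available, as the error above suggests, a text rendering of the tree is one fallback that needs no external tools. The sketch below uses `sklearn.tree.export_text` on a small tree fitted to synthetic data with placeholder feature names.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data with placeholder feature names (not the attrition dataset)
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
names = ["f0", "f1", "f2", "f3"]

toy_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# export_text returns the decision rules as an indented plain-text string
rules = export_text(toy_tree, feature_names=names)
print(rules)
```

For the notebook's tree, the call would presumably be `export_text(decision_tree, feature_names=list(attrition_final.columns))`.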

## B. Support Vector Machine

Next, we train our model with another algorithm, the Support Vector Machine. You can read about the pros and cons of this algorithm [here](http://scikit-learn.org/stable/modules/svm.html).


In [23]:
from sklearn.svm import SVC
# Set up the parameters (importing SVC directly avoids shadowing the sklearn.svm module)
svm = SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

svm.fit(smote_train, smote_target)
svm_predictions = svm.predict(test)
print('Predictions have finished')

Predictions have finished


Let us see how accurate our model is :

In [25]:
accuracy_score(target_val, svm_predictions)

0.842391304347826

The accuracy of this model is slightly higher than that of our Random Forest model. However, with the **'rbf'** kernel (**rbf** stands for radial basis function, which suits our dataset better than the 'linear' kernel), the model does not expose feature importances, so we cannot rank the features directly. That leads us to explore another model in the section below.
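One model-agnostic way to rank features for a kernel SVM is permutation importance: shuffle one feature at a time on held-out data and measure how much the score drops. The sketch below applies `sklearn.inspection.permutation_importance` to an RBF SVC on synthetic data; it is an alternative not used in this notebook, and the settings are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data (not the attrition dataset)
X, y = make_classification(
    n_samples=400, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, stratify=y)

model = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train)

# Shuffle each feature 10 times on the held-out data and record the drop in accuracy
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)

# Print features from largest to smallest mean drop
for i in result.importances_mean.argsort()[::-1]:
    print(i, round(result.importances_mean[i], 3))
```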

## C. Gradient Boosting Model

Gradient Boosting is also an ensemble technique, much like the Random Forest, in which a combination of weak tree learners is brought together to form a relatively stronger learner. It builds the model in a stage-wise fashion, as other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function.
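The stage-wise behaviour can be observed directly: scikit-learn's `staged_predict` yields the ensemble's predictions after each boosting stage. The sketch below tracks test accuracy per stage on synthetic data, with parameter values chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data (not the attrition dataset)
X, y = make_classification(n_samples=400, n_features=6, n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbm = GradientBoostingClassifier(
    n_estimators=50, learning_rate=0.2, max_depth=2, random_state=0
).fit(X_train, y_train)

# One accuracy value per boosting stage, from the first tree to the full ensemble
stage_acc = [accuracy_score(y_test, pred) for pred in gbm.staged_predict(X_test)]
print(stage_acc[0], stage_acc[-1])
```

Plotting `stage_acc` against the stage number is a simple way to judge whether additional estimators still help.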


**Initialising Gradient Boosting Parameters**

In general, there are a handful of key parameters to set when configuring tree-based or gradient boosted models. The most common ones are the number of estimators, the maximum depth to which each tree is grown, and the minimum number of samples per leaf.

In [26]:
# Gradient Boosting Parameters
gb_params ={
    'n_estimators': 500,
    'learning_rate' : 0.2,
    'max_depth': 11,
    'min_samples_leaf': 2,
    'subsample': 1,
    'max_features' : 'sqrt',  # the earlier duplicate key (0.9) was silently overridden by this value
    'random_state' : seed,
    'verbose': 0
}

Having defined our parameters, we can now apply the usual fit and predict methods on our train and test sets respectively

In [27]:
gb = GradientBoostingClassifier(**gb_params)
# Fit the model to our SMOTEd train and target
gb.fit(smote_train, smote_target)
# Get our predictions
gb_predictions = gb.predict(test)
print("Predictions have finished")

Predictions have finished


In [28]:
accuracy_score(target_val, gb_predictions)

0.8396739130434783

### Feature Ranking via the Gradient Boosting Model

Much like the Random Forest, we can invoke the feature_importances_ attribute of the gradient boosting model and dump it in an interactive Plotly chart

In [29]:
# Scatter plot 
trace = go.Scatter(
    y = gb.feature_importances_,
    x = attrition_final.columns.values,
    mode='markers',
    marker=dict(
        sizemode = 'diameter',
        sizeref = 1,
        size = 13,
        #size= rf.feature_importances_,
        #color = np.random.randn(500), #set color equal to a variable
        color = gb.feature_importances_,
        colorscale='Portland',
        showscale=True
    ),
    text = attrition_final.columns.values
)
data = [trace]

layout= go.Layout(
    autosize= True,
    title= 'Gradient Boosting Model Feature Importance',
    hovermode= 'closest',
     xaxis= dict(
         ticklen= 5,
         showgrid=False,
        zeroline=False,
        showline=False
     ),
    yaxis=dict(
        title= 'Feature Importance',
        showgrid=False,
        zeroline=False,
        ticklen= 5,
        gridwidth= 2
    ),
    showlegend= False
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig,filename='scatter')

**Takeaway from the Plot**

**GBM most important features**  : Monthly Income, Age and Distance From Home

It is somewhat surprising that the top three features differ from those of our Random Forest model. Interestingly, Monthly Income is still among the three most important features.