# Lab 6 - Wide and Deep

In [260]:
import pandas as pd
import numpy as np

data_path = '../../data/attrition.csv'


## Overview

This data set was created by IBM data scientists.  It describes 35 features for 1470 (fictional) employees, including whether or not the employee left the firm (labeled "Attrition" in the dataset).  Employees leave companies for a variety of reasons: dissatisfaction with their role, their manager, or their pay.  Perhaps they aren't dissatisfied with their current job but feel like something better is out there.  Or maybe they just feel like they've been there long enough and want something different.  Most likely it's a combination of all of these things, plus a few others.

Employers would like to have a sense of why and when an employee might leave.  If an employer believes that an employee that they really value might leave, they could respond and try to prevent them from leaving.  This is what we will attempt to predict using a wide and deep neural network.

In [261]:
df = pd.read_csv(data_path)
df.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,4,80,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,2,80,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,...,3,80,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,4,80,1,6,3,3,2,2,2,2


There are 35 features in total in the dataset above.
Let's focus on a few of them:
- Age
- Attrition
- Department
- DistanceFromHome
- Education
- EducationField
- EnvironmentSatisfaction
- Gender
- JobSatisfaction
- MaritalStatus
- MonthlyIncome
- OverTime
- PerformanceRating
- RelationshipSatisfaction
- TotalWorkingYears
- YearsAtCompany
- YearsSinceLastPromotion

In this lab, we will use Attrition as our label and try to predict attrition status from the other attributes.


In [262]:
to_keep = {'Age', 'Attrition', 'Department','DistanceFromHome', 'Education', 'EducationField', 'EnvironmentSatisfaction', 'Gender', 'JobSatisfaction', 'MaritalStatus',
           'MonthlyIncome', 'OverTime', 'PerformanceRating', 'RelationshipSatisfaction','TotalWorkingYears','YearsAtCompany', 'YearsSinceLastPromotion'}
to_drop = set(df.columns)-to_keep
df.drop(to_drop, axis=1, inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 17 columns):
Age                         1470 non-null int64
Attrition                   1470 non-null object
Department                  1470 non-null object
DistanceFromHome            1470 non-null int64
Education                   1470 non-null int64
EducationField              1470 non-null object
EnvironmentSatisfaction     1470 non-null int64
Gender                      1470 non-null object
JobSatisfaction             1470 non-null int64
MaritalStatus               1470 non-null object
MonthlyIncome               1470 non-null int64
OverTime                    1470 non-null object
PerformanceRating           1470 non-null int64
RelationshipSatisfaction    1470 non-null int64
TotalWorkingYears           1470 non-null int64
YearsAtCompany              1470 non-null int64
YearsSinceLastPromotion     1470 non-null int64
dtypes: int64(11), object(6)
memory usage: 195.3+ KB


It's good that we don't have any null values.

Let's convert the categorical features (Attrition, Department, EducationField, Gender, MaritalStatus and OverTime) to ints because we'll want Keras to one-hot-encode them later.

In [263]:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler

categorical_features = {'Attrition', 'Department', 'EducationField', 'Gender','MaritalStatus','OverTime'}
encoders = dict()
for atr in categorical_features:
    encoders[atr] = LabelEncoder()
    df[atr] = encoders[atr].fit_transform(df[atr] )
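As a quick illustration of what the encoder does: `LabelEncoder` sorts the distinct values and assigns each one an integer index. The sketch below is a toy example, using only the Attrition values of the first five rows shown in `df.head()` above rather than the real dataframe.

```python
from sklearn.preprocessing import LabelEncoder

# Attrition values of the first five rows shown in df.head() above
attrition = ['Yes', 'No', 'Yes', 'No', 'No']

enc = LabelEncoder()
codes = enc.fit_transform(attrition)

# classes are sorted alphabetically, so 'No' maps to 0 and 'Yes' maps to 1
print(list(enc.classes_))
print(codes.tolist())

# the fitted encoder can map the integers back to the original strings
decoded = enc.inverse_transform(codes)
print(decoded.tolist())
```

Keeping the fitted encoders in a dictionary, as the cell above does, is what makes this reverse mapping available later.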

Then, let's scale the numeric features. 

In [264]:
numeric_features = set(df.columns) - categorical_features
for atr in numeric_features:
    df[atr] = df[atr].astype(float)  # np.float was removed in NumPy 1.24; the builtin float is equivalent
    ss = StandardScaler()
    df[atr] = ss.fit_transform(df[atr].values.reshape(-1, 1))
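For reference, the transformation `StandardScaler` applies is a z-score: subtract the column mean and divide by the population standard deviation. A minimal sketch on a made-up column (the values are not from the dataset):

```python
import numpy as np

# toy column, not taken from the dataset
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# z-score: subtract the mean, divide by the population standard deviation (ddof=0)
z = (x - x.mean()) / x.std()

print(z)
print(z.mean(), z.std())
```

After scaling, the column has mean 0 and standard deviation 1 up to floating-point error. One thing to keep in mind is that the cell above fits each scaler on the whole dataframe before any train/test split, so the test rows contribute to the scaling statistics.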

In [265]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 17 columns):
Age                         1470 non-null float64
Attrition                   1470 non-null int64
Department                  1470 non-null int64
DistanceFromHome            1470 non-null float64
Education                   1470 non-null float64
EducationField              1470 non-null int64
EnvironmentSatisfaction     1470 non-null float64
Gender                      1470 non-null int64
JobSatisfaction             1470 non-null float64
MaritalStatus               1470 non-null int64
MonthlyIncome               1470 non-null float64
OverTime                    1470 non-null int64
PerformanceRating           1470 non-null float64
RelationshipSatisfaction    1470 non-null float64
TotalWorkingYears           1470 non-null float64
YearsAtCompany              1470 non-null float64
YearsSinceLastPromotion     1470 non-null float64
dtypes: float64(11), int64(6)
memory usage: 195.3 KB


In [266]:
df.head()

Unnamed: 0,Age,Attrition,Department,DistanceFromHome,Education,EducationField,EnvironmentSatisfaction,Gender,JobSatisfaction,MaritalStatus,MonthlyIncome,OverTime,PerformanceRating,RelationshipSatisfaction,TotalWorkingYears,YearsAtCompany,YearsSinceLastPromotion
0,0.44635,1,2,-1.010909,-0.891688,1,-0.660531,0,1.153254,2,-0.10835,1,-0.42623,-1.584178,-0.421642,-0.164613,-0.679146
1,1.322365,0,1,-0.14715,-1.868426,1,0.254625,1,-0.660853,1,-0.291719,0,2.346151,1.191438,-0.164511,0.488508,-0.368715
2,0.008343,1,1,-0.887515,-0.891688,4,1.169781,1,0.2462,2,-0.937654,1,-0.42623,-0.658973,-0.550208,-1.144294,-0.679146
3,-0.429664,0,1,-0.764121,1.061787,1,1.169781,0,0.2462,1,-0.763634,1,-0.42623,0.266233,-0.421642,0.161947,0.252146
4,-1.086676,0,1,-0.887515,-1.868426,3,-1.575686,1,-0.660853,1,-0.644858,0,-0.42623,1.191438,-0.678774,-0.817734,-0.058285


In [267]:
print('Breakdown of attrition.  (0) Stayed at company, (1) Left company')

df.Attrition.value_counts()

Breakdown of attrition.  (0) Stayed at company, (1) Left company


0    1233
1     237
Name: Attrition, dtype: int64

### Evaluation Metric

Let's take a moment to break down what the four main metrics mean in our task, i.e. predicting whether or not an employee will leave a company, and what they mean for a business using the model:

* accuracy: Rewards getting both true positives and true negatives right.  This probably isn't very useful for our dataset because far fewer people left the company than stayed, so a model can score well by predicting that everyone stays.

* precision: Values a low false positive rate.  This probably isn't the best either, because while we wouldn't **want** to think that an employee is leaving when they aren't, it probably won't hurt the business, unless the employer grossly overreacts and scares them away.

* recall: Values a low false negative rate.  This is the best metric for our case on its own.  If our job is to see when employees leave, and if the fact is that they usually **don't** leave, and if it's potentially pretty damaging to the firm when the employee **does** leave, we want to make sure that we miss as few cases as possible.

* f1: A combination of precision and recall.  We'll use this because it's usually the best tradeoff between false positives and false negatives.
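To make the point about accuracy concrete, here is a small sketch that scores a hypothetical baseline which predicts "stayed" for every employee. The labels are rebuilt from the class counts printed above (1233 stayed, 237 left); the baseline itself is an illustration, not a model used in this lab.

```python
import numpy as np

# labels rebuilt from the value_counts() output above
y_true = np.array([0] * 1233 + [1] * 237)
# hypothetical baseline that predicts "stayed" (0) for everyone
y_pred = np.zeros_like(y_true)

tp = int(np.sum((y_true == 1) & (y_pred == 1)))
tn = int(np.sum((y_true == 0) & (y_pred == 0)))
fp = int(np.sum((y_true == 0) & (y_pred == 1)))
fn = int(np.sum((y_true == 1) & (y_pred == 0)))

accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)
# precision is undefined when nothing is predicted positive; report 0.0 in that case
precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
# F1 written in terms of counts: 2TP / (2TP + FP + FN)
f1 = 2 * tp / (2 * tp + fp + fn)

print(round(accuracy, 3), precision, recall, f1)
```

The baseline reaches roughly 84% accuracy while catching none of the employees who left, which is why recall and F1 are more informative here.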

### Validation Method

Because of the imbalance in our prediction label, we'll use stratified splits so that each fold preserves the class distribution of the full dataset.  To get a realistic estimate of how well our model generalizes, we'll use a nested cross-validation scheme.  We'll use k-fold as opposed to a different CV scheme like shuffle-split because our dataset is not that large and we would like to train on as much data as possible: k-fold guarantees that every sample is used for testing exactly once and for training in all the other folds.  The inner loop will tune the hyper-parameters of our model, which will be discussed later.
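The effect of stratification can be checked on the labels alone. The sketch below assumes a label vector with the same class counts as our dataset and placeholder features, and prints the share of leavers in each test fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# labels with the same class counts as the dataset (1233 stayed, 237 left)
y = np.array([0] * 1233 + [1] * 237)
# placeholder features; only the labels drive a stratified split
X = np.zeros((len(y), 1))

skf = StratifiedKFold(n_splits=5)
fold_rates = []
for train_idx, test_idx in skf.split(X, y):
    rate = y[test_idx].mean()  # fraction of leavers in this test fold
    fold_rates.append(rate)
    print(len(train_idx), len(test_idx), round(float(rate), 3))
```

Each fold should show a leaver rate close to the overall 237/1470, or about 16%.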

## Model Building

Let's just see how well we can do with a single network on all of our data.  We'll just split all of our data up into one train set and one test set.

In [269]:
from sklearn.model_selection import train_test_split

# stratified 90/10 train/test split
df_train, df_test = train_test_split(df, test_size=0.1, stratify=df.Attrition)

y_train = df_train.Attrition
y_test = df_test.Attrition

X_train = df_train.drop('Attrition', axis=1).values
X_test = df_test.drop('Attrition', axis=1).values

print('train', X_train.shape, 'test', X_test.shape)

train (1323, 16) test (147, 16)


In [270]:
#import some keras stuff
from keras.models import Sequential
from keras.layers import Dense, Activation, Input
from keras.layers import Embedding, Flatten, concatenate  # Merge is unused here and is not available in newer Keras releases
from keras.models import Model

In [271]:
# This returns a tensor
inputs = Input(shape=(X_train.shape[1],))

# a layer instance is callable on a tensor, and returns a tensor
x = Dense(units=10, activation='relu')(inputs)
predictions = Dense(1,activation='sigmoid')(x)

# This creates a model that includes
# the Input layer and two Dense layers
model = Model(inputs=inputs, outputs=predictions)

In [272]:
model.compile(optimizer='sgd',
              loss='mean_squared_error',
              metrics=['accuracy'])

model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_5 (InputLayer)         (None, 16)                0         
_________________________________________________________________
dense_934 (Dense)            (None, 10)                170       
_________________________________________________________________
dense_935 (Dense)            (None, 1)                 11        
Total params: 181
Trainable params: 181
Non-trainable params: 0
_________________________________________________________________


In [278]:
%%time

model.fit(X_train, y_train, epochs=100, batch_size=50, verbose=0)

from sklearn import metrics as mt
yhat = np.round(model.predict(X_test))
print(mt.confusion_matrix(y_test,yhat),mt.recall_score(y_test,yhat))

[[118   5]
 [ 18   6]] 0.25
Wall time: 13.8 s


Well, it didn't do very well.  It missed the majority of the actual positives (18 of the 24 employees who left).  But there are a few issues here that we can address.

1. We don't have that much data to begin with, and our target class is a small percentage of it, so we're going to have a hard time.

2. We were using mean squared error as our loss function on a binary classification task.  While this isn't terrible, it would likely work better if we used cross-entropy instead.

Luckily it's easy to address both of these in Keras.  The loss function is simple to change, and the fit function includes a `class_weight` parameter, which accepts a dictionary mapping each class value to a weight used when computing the loss.  This way we can tell the model which class is "more important".  Let's try it.
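Before running it, here is a rough NumPy sketch of what class weighting does to the loss. It assumes the weights act as per-sample multipliers on binary cross-entropy, which is my understanding of how Keras applies `class_weight`; the function `weighted_bce` is a hypothetical helper written for this illustration, not a Keras API.

```python
import numpy as np

def weighted_bce(y_true, p, class_weight):
    """Mean binary cross-entropy with a per-sample weight looked up from the class."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), 1e-7, 1 - 1e-7)  # avoid log(0)
    w = np.where(y_true == 1, class_weight[1], class_weight[0])
    per_sample = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    return float(np.mean(w * per_sample))

weights = {0: 0.20, 1: 0.80}

# the same size of mistake on each class: 0.1 probability assigned to the true label
missed_leaver = weighted_bce([1], [0.1], weights)  # true class 1, predicted 0.1
missed_stayer = weighted_bce([0], [0.9], weights)  # true class 0, predicted 0.9

ratio = missed_leaver / missed_stayer
print(ratio)
```

With weights of 0.80 and 0.20, an equally bad prediction costs about four times as much on an employee who left as on one who stayed, which pushes the network toward catching more leavers.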

In [279]:
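# NOTE: this wraps the same layer objects as before, so training resumes from
# the weights learned above rather than from a fresh initialization
model = Model(inputs=inputs, outputs=predictions)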
model.compile(optimizer='sgd',
              loss='binary_crossentropy',
              metrics=['accuracy'])


model.fit(X_train, y_train, epochs=100, batch_size=50, verbose=0, class_weight={0 : 0.20, 1 : 0.80})

yhat = np.round(model.predict(X_test))
print(mt.confusion_matrix(y_test,yhat),mt.recall_score(y_test,yhat))

[[94 29]
 [11 13]] 0.541666666667


It ended up getting a much better recall score, though at the expense of overall accuracy.
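The tradeoff can be read directly off the two confusion matrices printed above. The sketch below recomputes the four metrics from those counts; `summarize` is a helper defined here for illustration and assumes scikit-learn's `[[TN, FP], [FN, TP]]` layout.

```python
import numpy as np

def summarize(cm):
    """Return accuracy, precision, recall and F1 from a 2x2 confusion matrix
    laid out as [[TN, FP], [FN, TP]] (the scikit-learn convention)."""
    (tn, fp), (fn, tp) = np.asarray(cm)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {'accuracy': accuracy, 'precision': precision, 'recall': recall, 'f1': f1}

unweighted = summarize([[118, 5], [18, 6]])   # first run: MSE loss, no class weights
weighted = summarize([[94, 29], [11, 13]])    # second run: cross-entropy with class weights

for name, scores in [('unweighted', unweighted), ('weighted', weighted)]:
    print(name, {k: round(float(v), 3) for k, v in scores.items()})
```

On this one split, recall rises from 0.25 to about 0.54 and F1 from about 0.34 to about 0.39, while accuracy drops from about 0.84 to about 0.73 and precision from about 0.55 to about 0.31.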

### Parameter tuning in Deep Network

We identify the following as parameters that we can tune:

- Number of layers
- Number of neurons per layer
- The class weights in the loss function
- Number of epochs

Keras has a wrapper class that lets a model be used as an sklearn estimator, which we can then pass to GridSearchCV.  The only caveat is that we must use Keras' Sequential model to do so, which means we can only have one input branch.  This is okay, though, because if we tune only the deep side we can combine the result with our wide, cross-category branch later on.

In [161]:
X_train.shape

(1323, 16)

In [186]:
# taken from https://machinelearningmastery.com/grid-search-hyperparameters-deep-learning-models-python-keras/
# this uses the Keras Wrapper to make the model usable by sk-learn

# Use scikit-learn to grid search the layer sizes, epochs and class weights
from sklearn.model_selection import GridSearchCV
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier

# Function to create model, required for KerasClassifier
def create_model(num_neurons=(12,), input_dim=8):
    # create model
    model = Sequential()
    
    # num_neurons is a sequence with the number of neurons at each hidden layer
    for layer, num in enumerate(num_neurons):
        if layer == 0:
            model.add(Dense(num, input_dim=input_dim, activation='relu'))
        else:
            model.add(Dense(units=num, activation='relu'))
    
    model.add(Dense(1, activation='sigmoid'))
    
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

Now that we can create models that can be used by sklearn, let's grid search over the parameters we identified.  Note that the number of neurons and the number of layers are combined into one parameter called `num_neurons`, which is a list of the number of output neurons at each layer.

In [244]:
# this will move inside nested CV
from sklearn.model_selection import GridSearchCV
num_neurons = [[5, 10], [5, 10, 20], [10, 20, 15]]
epochs = [5]
class_weight = [{0:x, 1:1-x} for x in np.linspace(0.1, 0.5, 5)]
param_grid = dict(num_neurons=num_neurons,
                  epochs=epochs,
                  class_weight=class_weight)

model = KerasClassifier(build_fn=create_model, input_dim=X_train.shape[-1], epochs=10, verbose=0)
#g = GridSearchCV(estimator=model, param_grid=param_grid, verbose=3, scoring='f1')
#r = g.fit(X_train, y_train)
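To get a feel for the cost of this search, the grid can be expanded by hand. The sketch below rebuilds the same three lists and counts the combinations; the fold count `cv_folds = 3` is an assumption for the example, not something set in the cell above.

```python
from itertools import product
import numpy as np

num_neurons = [[5, 10], [5, 10, 20], [10, 20, 15]]
epochs = [5]
class_weight = [{0: x, 1: 1 - x} for x in np.linspace(0.1, 0.5, 5)]

# every combination of one value per parameter, as a grid search would enumerate them
combos = list(product(num_neurons, epochs, class_weight))
n_candidates = len(combos)        # 3 architectures * 1 epoch setting * 5 weightings

cv_folds = 3                      # assumed number of cross-validation folds
n_fits = n_candidates * cv_folds  # one model trained per candidate per fold

print(n_candidates, n_fits)
```

Each extra value added to any one of the lists multiplies the number of networks that have to be trained, which is why the grid is kept small here.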

## Nested Cross Validation set up

In order to estimate generalization performance, we'll use a nested cross-validation scheme with an outer loop of 5 folds and an inner loop that tunes the hyper-parameters of the deep network using 3 folds.  We should probably use more, but we don't have all day, people.


In [254]:
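from sklearn.model_selection import StratifiedKFold

# full feature matrix and labels (inferred: the fold sizes printed below add up to all 1470 rows)
X = df.drop('Attrition', axis=1).values
y = df.Attrition.values

outer_loop = StratifiedKFold(n_splits=5)

for train_idx, test_idx in outer_loop.split(X, y):
    # TODO: set up wide branch
        # embed cross
        # concat
    
    # TODO: do deep branch
        # embed categorical
        # concat cat_embed + num
        # gridsearch deep cv
        
    # TODO: concat 

    # for now, just report the size of each outer fold
    print(train_idx.shape, test_idx.shape)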
(1175,) (295,)
(1175,) (295,)
(1176,) (294,)
(1177,) (293,)
(1177,) (293,)
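The loop above is still a skeleton. As a rough sketch of how the two levels of the nested scheme fit together, the example below uses scikit-learn only, with a logistic regression standing in for the deep network and synthetic imbalanced data standing in for the attrition set. Both stand-ins, and the names `X_demo` and `y_demo`, are assumptions made to keep the example small and fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# synthetic, imbalanced stand-in data (not the attrition dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     weights=[0.84, 0.16], random_state=0)

inner = StratifiedKFold(n_splits=3)   # tunes hyper-parameters
outer = StratifiedKFold(n_splits=5)   # estimates generalization performance

# inner loop: grid search over one hyper-parameter, scored by F1
search = GridSearchCV(
    LogisticRegression(class_weight='balanced', max_iter=1000),
    param_grid={'C': [0.1, 1.0, 10.0]},
    scoring='f1',
    cv=inner,
)

# outer loop: each fold reruns the whole search on its training part,
# then scores the best model on the held-out part
outer_scores = cross_val_score(search, X_demo, y_demo, scoring='f1', cv=outer)
print(outer_scores)
print(outer_scores.mean())
```

The important property is that the test part of each outer fold is never seen by the inner search, so the averaged outer score is not biased by the hyper-parameter tuning.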
