## Overview

This data set was created by IBM data scientists.  It describes 35 features for 1470 (fictional) employees, including whether or not the employee has left the firm (labeled "Attrition" in the dataset).  Employees leave companies for a variety of reasons: dissatisfaction with their role, their manager, or their pay.  Perhaps they aren't dissatisfied with their current job but feel that something better is out there.  Or maybe they feel they have been there long enough and want something different. Most likely it's a combination of all of these things, plus a few others.  

Employers would like to have a sense of why and when an employee might leave.  If an employer believes that an employee whom they really value might leave, they could respond and try to prevent it.  We will attempt to predict attrition using a wide and deep neural network.

In [31]:
import pandas as pd
import numpy as np
import os

In [32]:
data_path = '../data/'
df = pd.read_csv(os.path.join(data_path, 'WA_Fn-UseC_-HR-Employee-Attrition.csv'))
df.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,4,80,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,2,80,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,...,3,80,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,4,80,1,6,3,3,2,2,2,2


# Expansion

This is a relatively small dataset with only 1470 rows, and we would like more data for training and testing. We try two ways of expanding it. First, we copy the first 530 rows and append them to the original dataset, which brings it to 2000 rows.
Second, we take the numerical columns from the same slice, add randomly generated noise to them, insert the categorical columns back alongside the noisy numerical data, and append that result to the original dataset instead.

In [33]:
df_slice = df[:530]

In [34]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
df_new = pd.concat([df, df_slice], ignore_index=True)

In [35]:
df_slice2 = df_slice[['Age','DistanceFromHome','Education','EnvironmentSatisfaction',
                      'JobSatisfaction','MonthlyIncome','PerformanceRating','RelationshipSatisfaction',
                      'TotalWorkingYears','YearsAtCompany']]

df_slice2 holds the numerical columns from the raw dataframe slice. Next we multiply each value by a random factor drawn uniformly within ±1%.

In [36]:
df_slice2 = df_slice2 * (1 + np.random.uniform(-0.01,0.01,(df_slice2.shape)))

In [37]:
df_slice2.insert(1, 'Attrition', df_slice['Attrition'])
df_slice2.insert(2, 'Department', df_slice['Department'])
df_slice2.insert(6, 'Gender', df_slice['Gender'])
df_slice2.insert(8, 'MaritalStatus', df_slice['MaritalStatus'])
df_slice2.insert(10, 'OverTime', df_slice['OverTime'])

In [38]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
df_new2 = pd.concat([df, df_slice2], ignore_index=True)
df = df_new2

So now we have an expanded dataframe with 2000 rows. Once the unused features are dropped below, it has 15 columns.
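
The noise-based expansion above can be sketched on a toy frame. This is only an illustration under assumptions: the helper name `augment_with_noise`, its parameters, and the toy values are made up for the example and are not part of the notebook.

```python
import numpy as np
import pandas as pd

def augment_with_noise(frame, numeric_cols, rel_noise=0.01, seed=0):
    """Return a copy of `frame` whose numeric columns are scaled by 1 + U(-rel_noise, rel_noise)."""
    rng = np.random.default_rng(seed)
    noisy = frame.copy()
    # one independent multiplicative factor per cell
    factors = 1 + rng.uniform(-rel_noise, rel_noise, size=(len(frame), len(numeric_cols)))
    for i, col in enumerate(numeric_cols):
        noisy[col] = frame[col].to_numpy(dtype=float) * factors[:, i]
    return noisy

# toy rows (values invented for illustration)
toy = pd.DataFrame({
    "Age": [41, 49, 37],
    "MonthlyIncome": [5000, 6000, 2000],
    "Department": ["Sales", "Research & Development", "Research & Development"],
})

noisy_rows = augment_with_noise(toy, ["Age", "MonthlyIncome"])
expanded = pd.concat([toy, noisy_rows], ignore_index=True)
print(expanded)
```

Categorical columns pass through unchanged, so each noisy row keeps the label and categories of the row it was copied from.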

There are 35 features in total in the dataset, but we don't want to use all of them.  
Let's focus on a few of them:
- Age
- Attrition 
- Department
- DistanceFromHome 
- Education 
- EducationField
- EnvironmentSatisfaction
- Gender
- JobSatisfaction
- MaritalStatus
- MonthlyIncome
- OverTime
- PerformanceRating
- RelationshipSatisfaction
- TotalWorkingYears
- YearsAtCompany
- YearsSinceLastPromotion

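We believe these features are important for predicting attrition status. 
We will use Attrition as our label.

So let's first drop the other features. Note that `to_keep` below retains 15 of the columns listed above; EducationField and YearsSinceLastPromotion are dropped as well.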

In [39]:
to_keep = {'Age', 'Attrition', 'Department','DistanceFromHome', 'Education', 'EnvironmentSatisfaction', 'Gender', 'JobSatisfaction', 'MaritalStatus',
           'MonthlyIncome', 'OverTime', 'PerformanceRating', 'RelationshipSatisfaction','TotalWorkingYears','YearsAtCompany'}
to_drop = set(df.columns)-to_keep
df.drop(to_drop, axis=1, inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 15 columns):
Age                         2000 non-null float64
Attrition                   2000 non-null object
Department                  2000 non-null object
DistanceFromHome            2000 non-null float64
Education                   2000 non-null float64
EnvironmentSatisfaction     2000 non-null float64
Gender                      2000 non-null object
JobSatisfaction             2000 non-null float64
MaritalStatus               2000 non-null object
MonthlyIncome               2000 non-null float64
OverTime                    2000 non-null object
PerformanceRating           2000 non-null float64
RelationshipSatisfaction    2000 non-null float64
TotalWorkingYears           2000 non-null float64
YearsAtCompany              2000 non-null float64
dtypes: float64(10), object(5)
memory usage: 234.5+ KB


# Preprocessing

It's good that we don't have any null values. Let's encode the categorical data as integers. Some categorical features were already encoded as integers in the original data. We want to use some of them in the cross features, so we first convert them to strings. 

In [40]:
to_convert = ['Education','EnvironmentSatisfaction','JobSatisfaction',
            'PerformanceRating','RelationshipSatisfaction']
for col in to_convert:
    df[col] = df[col].astype(str)  # np.str was removed in NumPy 1.24; the builtin str is equivalent
    

Encode the categorical features:

In [41]:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler

to_encode = {'Attrition', 'Department','Gender','MaritalStatus','OverTime','Education','EnvironmentSatisfaction','JobSatisfaction',
            'PerformanceRating','RelationshipSatisfaction'}
encoders = dict()

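for col in to_encode:
    # keep each fitted encoder so the integer codes can be mapped back later;
    # the codes go into a new '<col>_int' column (including 'Attrition_int',
    # which is used as the label further down)
    encoders[col] = LabelEncoder()
    df[col+'_int'] = encoders[col].fit_transform(df[col])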
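
As a small illustration of what the encoding step does, here is a sketch on a toy column. The toy frame is an assumption made for the example; only `LabelEncoder` itself comes from the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({"OverTime": ["Yes", "No", "No", "Yes"]})

enc = LabelEncoder()
toy["OverTime_int"] = enc.fit_transform(toy["OverTime"])

# classes_ is sorted, so "No" maps to 0 and "Yes" maps to 1
print(list(enc.classes_))
print(toy["OverTime_int"].tolist())

# the fitted encoder can map integer codes back to the original strings
print(list(enc.inverse_transform([0, 1])))
```

Keeping the fitted encoders in a dictionary, as the cell above does, is what makes the reverse mapping possible later.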
    

Then, let's scale the numeric features. 

In [42]:
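# Every column in to_encode has an '<name>_int' companion at this point.
# Caution: that includes 'Attrition_int', the encoded label, so as written the
# label is part of categorical_features and is fed to the model as an input.
# This likely explains the perfect test score reported at the end.
categorical_features = [x+'_int' for x in to_encode]
# numeric features: every column that is neither a raw categorical
# nor one of the integer-encoded companions
numerics = list(set(df.columns) - to_encode - set(categorical_features))

for atr in numerics:
    df[atr] = df[atr].astype(float)  # np.float was removed in NumPy 1.24
    ss = StandardScaler()
    df[atr] = ss.fit_transform(df[atr].values.reshape(-1, 1)).ravel()

feature_columns = categorical_features + numerics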

In [44]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 25 columns):
Age                             2000 non-null float64
Attrition                       2000 non-null object
Department                      2000 non-null object
DistanceFromHome                2000 non-null float64
Education                       2000 non-null object
EnvironmentSatisfaction         2000 non-null object
Gender                          2000 non-null object
JobSatisfaction                 2000 non-null object
MaritalStatus                   2000 non-null object
MonthlyIncome                   2000 non-null float64
OverTime                        2000 non-null object
PerformanceRating               2000 non-null object
RelationshipSatisfaction        2000 non-null object
TotalWorkingYears               2000 non-null float64
YearsAtCompany                  2000 non-null float64
PerformanceRating_int           2000 non-null int64
Gender_int                      2000 non-

Now our data is ready to split. 

In [45]:
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import Dense, Activation, Input
from keras.layers import Embedding, Flatten, concatenate  # the unused legacy Merge layer is no longer available in recent Keras
from keras.models import Model
from sklearn.preprocessing import OneHotEncoder
from sklearn import metrics as mt

# stratified 90/10 train/test split
df_train, df_test = train_test_split(df, test_size=0.1, stratify=df.Attrition)

# fit the scaler on the training split only, then reuse it for the test split
split_scaler = StandardScaler()
X_train = split_scaler.fit_transform(df_train[feature_columns].values).astype(np.float32)
X_test = split_scaler.transform(df_test[feature_columns].values).astype(np.float32)

y_train = df_train['Attrition_int'].values.astype(int)  # np.int was removed in NumPy 1.24
y_test = df_test['Attrition_int'].values.astype(int)

print('train', X_train.shape, 'test', X_test.shape)

train (1800, 15) test (200, 15)
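
The split-then-scale pattern can be sketched on synthetic data. The arrays below are invented for the example; the point is only that the scaler's statistics come from the training split and are reused, unchanged, on the test split.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))  # synthetic numeric features
y = np.array([0] * 160 + [1] * 40)                    # imbalanced synthetic labels

# stratify=y keeps the class ratio roughly equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # mean and std estimated on train only
X_te_scaled = scaler.transform(X_te)      # test reuses the training statistics

print(X_tr_scaled.shape, X_te_scaled.shape)
print(y_tr.mean(), y_te.mean())
```

Calling `fit_transform` on the test split instead would standardize it with its own mean and variance, so train and test values would no longer be on the same scale.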


In [47]:
ohe = OneHotEncoder()
X_train_ohe = ohe.fit_transform(df_train[categorical_features].values)
X_test_ohe = ohe.transform(df_test[categorical_features].values)


X_train_num =  df_train[numerics].values
X_test_num = df_test[numerics].values

In [46]:
cross_columns = [['Gender','MaritalStatus'],
                    ['Education', 'JobSatisfaction'],['Department','PerformanceRating'],
                    ['Education', 'JobSatisfaction','RelationshipSatisfaction'],['Department','OverTime'],
                ]
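
To make the idea of a crossed feature concrete, here is a minimal sketch on a toy frame (the rows are invented for illustration). Each crossed value is the row-wise join of the source columns, which is then integer-encoded so it can index an embedding.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "MaritalStatus": ["Single", "Married", "Single", "Single"],
})

# join the string values of the crossed columns row by row
crossed = toy[["Gender", "MaritalStatus"]].apply(lambda row: "_".join(row), axis=1)

enc = LabelEncoder()
crossed_int = enc.fit_transform(crossed)
n_categories = len(enc.classes_)  # number of distinct combinations actually observed

print(crossed.tolist())
print(crossed_int.tolist(), n_categories)
```

Only combinations that occur in the data get an id, so the number of categories can be smaller than the product of the individual column cardinalities.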

In [48]:

# we need to create separate sequential models for each embedding
embed_branches = []
X_ints_train = []
X_ints_test = []
all_inputs = []
all_branch_outputs = []

for cols in cross_columns:
    # encode crossed columns as ints for the embedding
    enc = LabelEncoder()
    
    # create crossed labels: join the string values of the crossed columns
    # row by row, e.g. 'Male' and 'Single' become 'Male_Single'
    X_crossed_train = df_train[cols].apply(lambda x: '_'.join(x), axis=1)
    X_crossed_test = df_test[cols].apply(lambda x: '_'.join(x), axis=1)
    
    # fit on train and test together so every crossed value gets an integer id
    enc.fit(np.hstack((X_crossed_train.values,  X_crossed_test.values)))
    X_crossed_train = enc.transform(X_crossed_train)
    X_crossed_test = enc.transform(X_crossed_test)
    X_ints_train.append( X_crossed_train )
    X_ints_test.append( X_crossed_test )
    
    # number of categories known to the encoder (train and test combined),
    # so every id is a valid index into the embedding
    N = len(enc.classes_)
    
    # create embedding branch from the number of categories
    inputs = Input(shape=(1,),dtype='int32', name = '_'.join(cols))
    all_inputs.append(inputs)
    x = Embedding(input_dim=N, output_dim=int(np.sqrt(N)), input_length=1)(inputs)
    x = Flatten()(x)
    all_branch_outputs.append(x)
    
# merge the branches together
wide_branch = concatenate(all_branch_outputs)

# reset this input branch
all_branch_outputs = []
# add in the embeddings
for col in categorical_features:
    # encode as ints for the embedding
    X_ints_train.append( df_train[col].values )
    X_ints_test.append( df_test[col].values )
    
    # get the number of categories; the LabelEncoder was fit on the full
    # dataframe, so its ids run from 0 to N-1 across train and test
    N = int(df[col].max()) + 1
    
    # create embedding branch from the number of categories
    inputs = Input(shape=(1,),dtype='int32', name=col)
    all_inputs.append(inputs)
    x = Embedding(input_dim=N, output_dim=int(np.sqrt(N)), input_length=1)(inputs)
    x = Flatten()(x)
    all_branch_outputs.append(x)
    
# also get a dense branch of the numeric features
all_inputs.append(Input(shape=(X_train_num.shape[1],),sparse=False,name='numeric_data'))
x = Dense(units=20, activation='relu')(all_inputs[-1])
all_branch_outputs.append( x )

# merge the branches together
deep_branch = concatenate(all_branch_outputs)
deep_branch = Dense(units=50,activation='relu')(deep_branch)
deep_branch = Dense(units=10,activation='relu')(deep_branch)
    
final_branch = concatenate([wide_branch, deep_branch])
final_branch = Dense(units=1,activation='sigmoid')(final_branch)

model = Model(inputs=all_inputs, outputs=final_branch)

model.compile(optimizer='adagrad',
              loss='mean_squared_error',
              metrics=['accuracy'])

model.fit(X_ints_train+ [X_train_num],
        y_train, epochs=10, batch_size=32, verbose=1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x262a9545be0>
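
For intuition about what the network computes, here is a NumPy-only sketch of a single forward pass with one embedding branch and one dense branch. It is a simplified stand-in, not the Keras model above: the weights are random, and the sizes (`n_categories`, five numeric features, four hidden units) are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Wide part: an embedding is a lookup into an (N, d) table of weights.
n_categories = 9                          # hypothetical number of crossed values
embed_dim = int(np.sqrt(n_categories))    # same sizing rule as the cell above
embedding_table = rng.normal(size=(n_categories, embed_dim))
category_ids = np.array([0, 4, 8])        # integer codes for three example rows
wide_part = embedding_table[category_ids]         # shape (3, embed_dim)

# Deep part: a dense layer with ReLU over scaled numeric features.
numeric = rng.normal(size=(3, 5))
dense_w = rng.normal(size=(5, 4))
deep_part = np.maximum(numeric @ dense_w, 0.0)    # shape (3, 4)

# Concatenate both parts and apply a sigmoid output unit.
combined = np.concatenate([wide_part, deep_part], axis=1)
out_w = rng.normal(size=(combined.shape[1], 1))
prob = 1.0 / (1.0 + np.exp(-(combined @ out_w)))  # one probability per row

print(wide_part.shape, deep_part.shape, combined.shape, prob.shape)
```

Because an id indexes the table directly, the embedding's `input_dim` has to be at least the largest id plus one.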

In [49]:
yhat = np.round(model.predict(X_ints_test + [X_test_num]))
print(mt.confusion_matrix(y_test,yhat),mt.accuracy_score(y_test,yhat))

[[168   0]
 [  0  32]] 1.0
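
The confusion matrix printed above has true classes as rows and predicted classes as columns. The sketch below, on invented labels, shows how to read the four cells and derive precision and recall, which can be more informative than accuracy when the classes are imbalanced.

```python
import numpy as np
from sklearn import metrics as mt

# invented labels for illustration: 1 = attrition, 0 = no attrition
y_true = np.array([0, 0, 0, 0, 1, 1, 0, 1])
y_pred = np.array([0, 0, 1, 0, 1, 0, 0, 1])

cm = mt.confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()   # row-major order: [[tn, fp], [fn, tp]]

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)    # of the predicted leavers, how many actually left
recall = tp / (tp + fn)       # of the actual leavers, how many were caught

print(cm)
print(accuracy, precision, recall)
```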
