# **Predicting Employee Attrition Case Study**

## **Define problem statement**

**Problem Statement:** Build a model that predicts whether a current employee will leave the organization within the upcoming two quarters (01 Jan 2018 - 01 July 2018).

**Output Target:** 0 if the employee does not leave the organization,
1 if the employee leaves the organization

**Input predictors:** Reporting Date (column `MMM-YY`), Emp_ID, Age, Gender, City, Education_Level, Salary, Dateofjoining, LastWorkingDate, Joining Designation, Designation, Total Business Value, Quarterly Rating

**Solution:** Train a supervised ML classification model, because the target variable is categorical (binary).

## **Load Data**

**Import libraries**

In [None]:
#import pandas for loading the CSV file
import pandas as pd

#import numpy for maths
import numpy as np

# import seaborn for visualization
import seaborn as sns

from sklearn import preprocessing
from collections import Counter

#import matplotlib for graphs
import matplotlib.pyplot as plt

#To visualise in the notebook
%matplotlib inline

#filter the warning messages
import warnings
warnings.filterwarnings('ignore')

#library for Standardization of a dataset (e.g. Gaussian with 0 mean and unit variance)
from sklearn.preprocessing import StandardScaler

#library for splitting dataset into test and train
from sklearn.model_selection import train_test_split

#metric used for the models of this case study
from sklearn.model_selection import cross_validate, cross_val_score
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, precision_score, f1_score
from sklearn.metrics import balanced_accuracy_score

sns.set_style('darkgrid')

**Read CSV dataset into dataframe**

In [None]:
train_path = '/content/drive/MyDrive/Data/Attrition/train_MpHjUjU.csv'
train_df = pd.read_csv(train_path)

test_path = '/content/drive/MyDrive/Data/Attrition/test_hXY9mYw.csv'
test_df = pd.read_csv(test_path)

In [None]:
empList = train_df.Emp_ID.unique()
print(empList)
print(len(empList))

[   1    2    4 ... 2786 2787 2788]
2381


In [None]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19104 entries, 0 to 19103
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   MMM-YY                19104 non-null  object
 1   Emp_ID                19104 non-null  int64 
 2   Age                   19104 non-null  int64 
 3   Gender                19104 non-null  object
 4   City                  19104 non-null  object
 5   Education_Level       19104 non-null  object
 6   Salary                19104 non-null  int64 
 7   Dateofjoining         19104 non-null  object
 8   LastWorkingDate       1616 non-null   object
 9   Joining Designation   19104 non-null  int64 
 10  Designation           19104 non-null  int64 
 11  Total Business Value  19104 non-null  int64 
 12  Quarterly Rating      19104 non-null  int64 
dtypes: int64(7), object(6)
memory usage: 1.9+ MB


In [None]:
train_df.shape

(19104, 13)

In [None]:
train_df.head(2)

| | MMM-YY | Emp_ID | Age | Gender | City | Education_Level | Salary | Dateofjoining | LastWorkingDate | Joining Designation | Designation | Total Business Value | Quarterly Rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-01-01 | 1 | 28 | Male | C23 | Master | 57387 | 2015-12-24 | NaN | 1 | 1 | 2381060 | 2 |
| 1 | 2016-02-01 | 1 | 28 | Male | C23 | Master | 57387 | 2015-12-24 | NaN | 1 | 1 | -665480 | 2 |


In [None]:
#converting Date as DateTime type
train_df['MMM-YY'] = pd.to_datetime(train_df['MMM-YY'], format='%Y-%m-%d')
train_df['Dateofjoining'] = pd.to_datetime(train_df['Dateofjoining'], format='%Y-%m-%d')
train_df['LastWorkingDate'] = pd.to_datetime(train_df['LastWorkingDate'], format='%Y-%m-%d')
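
The conversion above also has to cope with the mostly empty `LastWorkingDate` column. A minimal sketch on a toy series (the values are made up for illustration) shows that missing entries come out as `NaT` instead of raising an error:

```python
import pandas as pd

# Toy column in the same 'YYYY-MM-DD' layout as the dataset, with one missing value
raw_dates = pd.Series(['2015-12-24', None, '2017-03-11'])

# Missing entries become NaT (the datetime counterpart of NaN)
parsed = pd.to_datetime(raw_dates, format='%Y-%m-%d')

print(parsed)
print("missing values:", parsed.isna().sum())
```

Because `NaT` behaves like a null value, `pd.isnull` and `pd.notnull` can be used on the converted column.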

**Define target**

In [None]:
target_var = 'Target'

# A row is labelled 1 when it carries a last working date, otherwise 0
# (vectorized, which avoids chained assignment such as train_df[col][i] = 1)
train_df[target_var] = train_df['LastWorkingDate'].notnull().astype(int)

y = pd.DataFrame(train_df[target_var])
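
The target above is set per monthly record, so only the row that carries a last working date is labelled 1. Since the problem statement asks for one answer per employee, it can help to also look at an employee-level label. The sketch below uses a toy frame with made-up values and assumes that "any record with a last working date" is the right aggregation rule:

```python
import pandas as pd

# Toy monthly records (hypothetical values): employee 2 has a last working date
toy = pd.DataFrame({
    'Emp_ID': [1, 1, 2, 2, 2],
    'LastWorkingDate': pd.to_datetime([None, None, None, None, '2017-11-03']),
})

# Row-level flag: 1 only on the row that carries the date
toy['Target'] = toy['LastWorkingDate'].notnull().astype(int)

# Employee-level label: 1 if any of the employee's rows is flagged
emp_target = toy.groupby('Emp_ID')['Target'].max()

print(emp_target)
```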

In [None]:
from dateutil import relativedelta

# Reference date for employees who are still employed: the last reporting month
last_report_date = train_df['MMM-YY'].max()

# End of the observed tenure: the last working date if present, otherwise the reference date
end_date = train_df['LastWorkingDate'].fillna(last_report_date)

train_df['days_worked'] = (end_date - train_df['Dateofjoining']).dt.days

# NOTE: relativedelta(...).months is only the month component (0-11) of the difference;
# full years are not included, so this is not the total number of months worked
train_df['months_worked'] = [
    relativedelta.relativedelta(end, start).months
    for end, start in zip(end_date, train_df['Dateofjoining'])
]
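
The `.months` attribute of a `relativedelta` holds only the month part of the difference, not the total number of months. A small sketch with the joining date from the first record and the last reporting month illustrates the difference; the variable names are chosen here for illustration:

```python
from datetime import date
from dateutil import relativedelta

joined = date(2015, 12, 24)
reference = date(2017, 12, 1)

delta = relativedelta.relativedelta(reference, joined)

# Month component only: the full year in the difference is ignored
months_component = delta.months

# Total full months elapsed: years converted to months plus the month component
total_months = delta.years * 12 + delta.months

print(months_component, total_months)
```

If total tenure in months is the intended feature, the second expression is the one to use; the notebook as written keeps the month component.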

In [None]:
cat_cols = ['Target', 'Emp_ID', 'Gender', 'Education_Level',
            'Joining Designation', 'Designation', 'Quarterly Rating']
train_df[cat_cols] = train_df[cat_cols].astype('category')
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19104 entries, 0 to 19103
Data columns (total 16 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   MMM-YY                19104 non-null  datetime64[ns]
 1   Emp_ID                19104 non-null  category      
 2   Age                   19104 non-null  int64         
 3   Gender                19104 non-null  category      
 4   City                  19104 non-null  object        
 5   Education_Level       19104 non-null  category      
 6   Salary                19104 non-null  int64         
 7   Dateofjoining         19104 non-null  datetime64[ns]
 8   LastWorkingDate       1616 non-null   datetime64[ns]
 9   Joining Designation   19104 non-null  category      
 10  Designation           19104 non-null  category      
 11  Total Business Value  19104 non-null  int64         
 12  Quarterly Rating      19104 non-null  category      
 13  Target          

In [None]:
print(f"The data is in range from {train_df['MMM-YY'].min()} till {train_df['MMM-YY'].max()}")

The data is in range from 2016-01-01 00:00:00 till 2017-12-01 00:00:00


In [None]:
train_df['Gender'] = train_df['Gender'].map({'Male':1, 'Female':0})
train_df['Education_Level'] = train_df['Education_Level'].map({'Master':2, 'Graduate':1, 'College':0})
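
`Series.map` with a dictionary returns `NaN` for any label that is missing from the dictionary, so a quick check after encoding is worthwhile. The sketch below uses a toy series in which `'Bachelor'` is a hypothetical label that the mapping does not cover:

```python
import pandas as pd

# 'Bachelor' is a made-up label that is absent from the mapping
levels = pd.Series(['Master', 'Graduate', 'College', 'Bachelor'])
mapping = {'Master': 2, 'Graduate': 1, 'College': 0}

encoded = levels.map(mapping)

# Labels that were not covered by the mapping end up as NaN
unmapped = levels[encoded.isna()].unique()

print(encoded.tolist())
print("unmapped labels:", list(unmapped))
```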

In [None]:
from sklearn.preprocessing import MinMaxScaler

#normalization using scaler (rescale values between 0 and 1)
scaler = MinMaxScaler(feature_range=(0, 1))

num_feat_list = ['Age', 'Salary', 'Total Business Value', 'months_worked', 'days_worked']
train_df[num_feat_list] = pd.DataFrame(scaler.fit_transform(train_df[num_feat_list]), 
                                       columns=num_feat_list, index=train_df.index)
print(num_feat_list)

['Age', 'Salary', 'Total Business Value', 'months_worked', 'days_worked']
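
With `feature_range=(0, 1)`, min-max scaling maps each column to `(x - min) / (max - min)`. The sketch below reproduces that formula with NumPy on made-up salary values and shows that a value outside the fitted range falls outside [0, 1]:

```python
import numpy as np

# Hypothetical salary values used to fit the scaling
salary = np.array([30000.0, 45000.0, 60000.0])

col_min, col_max = salary.min(), salary.max()
scaled = (salary - col_min) / (col_max - col_min)

# A new value above the fitted maximum is mapped above 1
new_scaled = (75000.0 - col_min) / (col_max - col_min)

print(scaled, new_scaled)
```

This matters when the same scaler is later applied to unseen records, whose values may lie outside the range seen during fitting.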


In [None]:
lstm_df = train_df[['Target', 'Age', 'Gender', 'Education_Level', 
                    'Salary', 'Joining Designation', 'Designation',
                    'Total Business Value', 'Quarterly Rating', 'Emp_ID', 'months_worked', 'days_worked']]

In [None]:
# Build one sub-frame per employee, keyed by Emp_ID, and keep the frames in a list as well
df_ = {}
emp_seq_list = []
for emp_id in empList:
    df_[emp_id] = lstm_df[lstm_df.Emp_ID == emp_id]
    emp_seq_list.append(df_[emp_id])

print(len(emp_seq_list))

2381
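
The per-employee split can also be written with `groupby`, which scans the frame once instead of filtering it once per employee. A minimal sketch on a toy frame with made-up values:

```python
import pandas as pd

# Toy records for three employees (hypothetical values)
toy = pd.DataFrame({
    'Emp_ID': [1, 1, 2, 4, 4, 4],
    'Salary': [0.2, 0.2, 0.5, 0.9, 0.9, 1.0],
})

# sort=False keeps the employees in order of first appearance
sequences = {
    emp_id: group.reset_index(drop=True)
    for emp_id, group in toy.groupby('Emp_ID', sort=False)
}

print(list(sequences))
print(len(sequences[4]))
```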


In [None]:
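from keras.preprocessing.sequence import pad_sequences

# NOTE: the padded array is not assigned to a variable, so this call does not
# change emp_seq_list and has no effect on the cells below
pad_sequences(emp_seq_list, value=0, padding='post')
print("pad sequence done")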

pad sequence done
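
Post-padding brings sequences of different lengths to a common length by appending filler rows at the end. The sketch below shows the idea with plain NumPy; `pad_post` is a hypothetical helper written for this illustration, not part of Keras:

```python
import numpy as np

def pad_post(sequences, value=0.0):
    """Hypothetical helper: pad 2-D sequences at the end to the longest length."""
    max_len = max(len(seq) for seq in sequences)
    n_features = sequences[0].shape[1]
    # Start from an array filled with the padding value
    padded = np.full((len(sequences), max_len, n_features), value, dtype=float)
    for row, seq in enumerate(sequences):
        # Copy the real records; the remaining rows keep the padding value
        padded[row, :len(seq), :] = seq
    return padded

# Two toy employees with 2 and 4 monthly records of 3 features each
seqs = [np.ones((2, 3)), np.ones((4, 3))]
padded = pad_post(seqs)

print(padded.shape)
```

The result has shape (employees, longest sequence, features), which is the 3-D layout an LSTM layer expects.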


In [None]:
#result frame for the submission; the first row is a placeholder that is dropped before saving
result = pd.DataFrame({'Emp_ID':[''], 'Target':[0]})
cc = 0

for emp_ct in range(0, len(test_df['Emp_ID'])):
  print("#################################")
  print(f"round {emp_ct} for employee {test_df['Emp_ID'][emp_ct]}")
  cc = cc + 1
  #if cc == 51:
  #  break
  #creating a separate dataset for LSTM model
  lstm_df = df_[test_df['Emp_ID'][emp_ct]][['Target', 'Age', 'Gender', 'Education_Level', 
                      'Salary', 'Joining Designation', 'Designation',
                      'Total Business Value', 'Quarterly Rating', 'Emp_ID', 'months_worked', 'days_worked']]

  samples_list = lstm_df.to_numpy()

  #splitting the multivariate sequence (one row per reporting month) into samples:
  #each sample holds no_of_steps consecutive rows, and its label comes from the row that follows
  #NOTE: the slicing below uses every column except the last as input and the last column
  #(days_worked in lstm_df) as the label; 'Target' is only the first input column
  def split_into_samples(samples, no_of_steps, no_of_future_steps, total_features):
    X_samples, y_samples = list(), list()
    for i in range(0, len(samples)-no_of_steps, 1):
      x_sample, y_sample = samples[i:i+no_of_steps,:-1], samples[i+no_of_steps,-1:]
      X_samples.append(x_sample)
      y_samples.append(y_sample)

    X_samples_arr = np.array(X_samples)
    X_samples_arr = X_samples_arr.reshape(X_samples_arr.shape[0], X_samples_arr.shape[1], total_features)
    y_samples_arr = np.array(y_samples)
    y_samples_arr = y_samples_arr.reshape(y_samples_arr.shape[0], 1)

    # X_samples is returned as 3-D (number of samples, time steps, features) and y_samples as 2-D (number of samples, 1)
    return X_samples_arr, y_samples_arr

  #create samples: the 2 preceding monthly records are used to predict the next one
  time_steps = 2 # number of past records in each input window
  future_time_steps = 3 # passed to split_into_samples, which does not use it
  no_of_features = 11 # input columns per record (all columns except the last)

  X, y = split_into_samples(samples_list, no_of_steps=time_steps, no_of_future_steps=future_time_steps, total_features=no_of_features)
  print(f"Shape of input dataset is {X.shape}, and output dataset is {y.shape}")

  #choosing the test dataset size
  test_size = int(round(0.2*X.shape[0],0))
  train_size = int(round(0.6*X.shape[0],0))

  print(test_size, train_size)

  #splitting the data into train and test dataset
  X_train = X[:-test_size]
  (X_train, X_val) = X_train[:train_size], X_train[train_size:] 
  X_test = X[-test_size:]
  y_train = y[:-test_size]
  (y_train, y_val) = y_train[:train_size], y_train[train_size:] 
  y_test = y[-test_size:]
  
  # Print the shapes of the training, validation, and test outputs
  print(y_train.shape, 'output train set')
  print(y_val.shape, 'output validation set')
  print(y_test.shape, 'output test set')

  # Visualizing the input and output being sent to the LSTM model
  for inp, out in zip(X_train[0], y_train[0]):
      print(inp,'--', out)

  # Defining Input shapes for LSTM
  TimeSteps=X_train.shape[1]
  TotalFeatures=X_train.shape[2]
  print("Number of TimeSteps:", TimeSteps)
  print("Number of Features:", TotalFeatures)

  

  #extending the samples for the forecast horizon: the most recent window (and its label)
  #is repeated once per future step, because no records exist beyond the last reporting month
  def split_into_samples_future(X_sam, y_sam, no_of_steps, no_of_future_steps, total_features):
    X_samples, y_samples = X_sam.tolist(), y_sam.tolist()

    # the last observed window and its label
    x_last = X_sam[-1]
    y_last = y_sam[-1]

    # append one copy per future step
    for _ in range(no_of_future_steps):
      X_samples.append(x_last)
      y_samples.append(y_last)

    X_samples_arr = np.array(X_samples)
    y_samples_arr = np.array(y_samples)

    # X_samples is returned as 3-D (number of samples, time steps, features) and y_samples as 2-D (number of samples, 1)
    return X_samples_arr, y_samples_arr

  #extend the samples with 60 repeated copies of the latest window
  time_steps = 2 # number of past records in each input window
  future_time_steps = 60 # number of future steps
  no_of_features = 11 # input columns per record

  X_fut = X.copy()
  y_fut = y.copy()

  X_pred_fut, y_pred_fut = split_into_samples_future(X_fut, y_fut, no_of_steps=time_steps, no_of_future_steps=future_time_steps, total_features=no_of_features)
  print(f"Shape of input dataset is {X_pred_fut.shape}, and output dataset is {y_pred_fut.shape}")

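  # The last 60 records are the repeated copies appended above; they form the test set
  TestingRecords = 60

  # Splitting the data into train and test
  # NOTE: the test windows duplicate the last training window, so the scores below are not an independent estimate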
  X_train=X_pred_fut[:-TestingRecords]
  X_test=X_pred_fut[-TestingRecords:]
  y_train=y_pred_fut[:-TestingRecords]
  y_test=y_pred_fut[-TestingRecords:]

  for i in range(len(y_test)):
    y_test[i] = (y_test[i] > 0.5).astype("int32")

  #############################################
  # Printing the shape of training and testing
  print('\n#### Training Data shape ####')
  print(X_train.shape)
  print(y_train.shape)

  print('\n#### Testing Data shape ####')
  print(X_test.shape)
  print(y_test.shape)

  # Visualizing the input and output being sent to the LSTM model
  # (first time step of the first training window and its label)
  for inp, out in zip(X_train[0], y_train[0]):
      print(inp)
      print('====>')
      print(out)
      print('#'*20)

  # Defining Input shapes for LSTM
  TimeSteps = X_train.shape[1]
  TotalFeatures = X_train.shape[2]
  print("Number of TimeSteps:", TimeSteps)
  print("Number of Features:", TotalFeatures)

  # Importing the Keras libraries and packages
  from keras.models import Sequential
  from keras.layers import Dense
  from keras.layers import LSTM

  # Initialising the RNN
  model = Sequential()

  # Adding the first LSTM layer, which also defines the input shape
  # return_sequences=True passes the output of every time step to the next layer
  model.add(LSTM(units = 10, activation = 'relu', input_shape = (TimeSteps, TotalFeatures), return_sequences=True))

  # Adding the second LSTM layer (its input shape is inferred from the previous layer)
  model.add(LSTM(units = 5, activation = 'relu', return_sequences=True))

  # Adding the third LSTM layer; return_sequences=False returns only the last time step
  model.add(LSTM(units = 5, activation = 'relu', return_sequences=False ))


  # Adding the output layer
  model.add(Dense(units = 1, activation='sigmoid'))

  # Compiling the model
  model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

  # Fitting the model to the Training set
  model.fit(X_train, y_train, batch_size = 5, epochs = 1, verbose=0)

  # predict test set
  y_pred = (model.predict(X_test) > 0.5).astype("int32")
  # evaluate predictions
  print("accuracy_score: ", accuracy_score(y_test, y_pred)*100)
  print("balanced_accuracy", balanced_accuracy_score(y_test, y_pred)*100)
  print(classification_report(y_test, y_pred)) 

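  # DataFrame.append was removed in pandas 2.0, so the new row is added with pd.concat
  # the employee is flagged as leaving (1) if any of the test predictions is 1
  new_row = pd.DataFrame([{'Emp_ID':test_df['Emp_ID'][emp_ct], 'Target':(1 if y_pred.sum()!=0 else 0)}])
  result = pd.concat([result, new_row], ignore_index = True)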
  


Shape of input dataset is (22, 2, 11), and output dataset is (22, 1)
4 13
(13, 1) output train set
(5, 1) output train set
(4, 1) output train set
[0.00000000e+00 2.97297297e-01 0.00000000e+00 2.00000000e+00
 4.89528398e-01 2.00000000e+00 4.00000000e+00 1.55716101e-01
 1.00000000e+00 3.94000000e+02 4.54545455e-01] -- 0.8475954738330976
Number of TimeSteps: 2
Number of Features: 11
Shape of input dataset is (82, 2, 11), and output dataset is (82, 1)

#### Training Data shape ####
(22, 2, 11)
(22, 1)

#### Testing Data shape ####
(60, 2, 11)
(60, 1)
[0.00000000e+00 2.97297297e-01 0.00000000e+00 2.00000000e+00
 4.89528398e-01 2.00000000e+00 4.00000000e+00 1.55716101e-01
 1.00000000e+00 3.94000000e+02 4.54545455e-01]
====>
0.8475954738330976
####################
Number of TimeSteps: 2
Number of Features: 11
accuracy_score:  100.0
balanced_accuracy 100.0
              precision    recall  f1-score   support

         1.0       1.00      1.00      1.00        60

    accuracy                
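
The windowing inside the loop above takes the label from the last column of the array. The sketch below shows the same sliding-window idea with the label column selected explicitly, which makes it harder to pick the wrong column by accident. `make_windows` and the toy array are hypothetical and only illustrate the technique:

```python
import numpy as np

def make_windows(records, n_steps, target_col):
    """Hypothetical helper: n_steps consecutive rows -> label of the following row."""
    # Every column except the label column is used as an input feature
    feature_cols = [c for c in range(records.shape[1]) if c != target_col]
    X, y = [], []
    for start in range(len(records) - n_steps):
        X.append(records[start:start + n_steps, feature_cols])
        y.append(records[start + n_steps, target_col])
    # X: (samples, time steps, features), y: (samples, 1)
    return np.array(X), np.array(y).reshape(-1, 1)

# Toy sequence: 5 records with 3 columns, column 0 playing the role of the label
records = np.arange(15, dtype=float).reshape(5, 3)
X_demo, y_demo = make_windows(records, n_steps=2, target_col=0)

print(X_demo.shape, y_demo.shape)
```

With 5 records and a window of 2, three samples are produced, in the same way that 24 monthly records yield the 22 samples reported in the output above.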

In [None]:
#drop the placeholder first row before saving the submission
result.drop(index=0, inplace=True)
result.to_csv("sample_submission.csv", index=False)
result
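
An alternative to growing the result frame row by row is to collect plain dictionaries in a list and build the frame once at the end, which needs no placeholder row. A minimal sketch with made-up employee IDs and predictions:

```python
import pandas as pd

# Hypothetical per-employee predictions (one 0/1 flag per predicted step)
predictions = [(101, [1, 1, 0]), (102, [0, 0, 0])]

rows = []
for emp_id, predicted_flags in predictions:
    # The employee is flagged as leaving if any predicted step is 1
    rows.append({'Emp_ID': emp_id, 'Target': int(any(predicted_flags))})

# Build the submission frame in one step
submission = pd.DataFrame(rows, columns=['Emp_ID', 'Target'])

print(submission)
```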