# Predicting Employee Attrition in the Dawn of Recession (Kaggle Competition)

## Summer Analytics 2020 Capstone Project

This was an in-class competition held by the Consulting and Analytics group, IIT Guwahati, as the final assignment
of their six-week Summer Analytics course. I secured rank 301 in this competition. It was my first Kaggle competition.

## Overview of Problem

As COVID-19 keeps unleashing havoc and the world is pushed further into a severe economic recession,
more and more companies are cutting underperforming employees. Companies firing hundreds or thousands of employees
is a typical headline today. Cutting staff or reducing an employee's salary is a tough decision. It must be
made with the utmost care, because imprecisely identifying the employees whose performance is declining may damage
both the employees' careers and the company's reputation in the market.

In [15]:
# Importing necessary libraries
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import xgboost as xgb
from xgboost import XGBClassifier

from sklearn import metrics, preprocessing
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import (GridSearchCV, KFold, RandomizedSearchCV,
                                     cross_val_score, train_test_split)
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')
%matplotlib inline


Let's import the data into the notebook.

In [16]:
df = pd.read_csv(r"C:\Users\jay\Desktop\train.csv", index_col=0)
# Now see what the data looks like
df

Id,Age,Attrition,BusinessTravel,Department,DistanceFromHome,Education,EducationField,EmployeeNumber,EnvironmentSatisfaction,Gender,...,PerformanceRating,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,CommunicationSkill,Behaviour
1,30,0,Non-Travel,Research & Development,2,3,Medical,571,3,Female,...,3,0,12,2,11,7,6,7,4,1
2,36,0,Travel_Rarely,Research & Development,12,4,Life Sciences,1614,3,Female,...,3,2,7,2,3,2,1,1,2,1
3,55,1,Travel_Rarely,Sales,2,1,Medical,842,3,Male,...,3,0,12,3,9,7,7,3,5,1
4,39,0,Travel_Rarely,Research & Development,24,1,Life Sciences,2014,1,Male,...,3,0,18,2,7,7,1,7,4,1
5,37,0,Travel_Rarely,Research & Development,3,3,Other,689,3,Male,...,3,1,10,2,10,7,7,8,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1624,42,1,Travel_Frequently,Research & Development,19,3,Medical,752,3,Male,...,3,0,7,2,2,2,2,2,3,1
1625,55,1,Travel_Rarely,Sales,2,1,Medical,842,3,Male,...,3,0,12,3,9,7,7,3,5,1
1626,25,1,Travel_Rarely,Sales,9,2,Life Sciences,1439,1,Male,...,3,0,6,2,3,2,2,2,5,1
1627,29,1,Travel_Rarely,Human Resources,13,3,Human Resources,1844,1,Male,...,3,3,4,3,2,2,2,0,5,1


Let's convert all the categorical columns into numerical codes.

In [17]:
# Map each categorical column to integer codes.
# Column-wise map() avoids chained assignment (df.col[mask] = value),
# which may silently fail to modify df in recent pandas versions.
category_codes = {
    'BusinessTravel': {'Non-Travel': 0, 'Travel_Rarely': 1, 'Travel_Frequently': 2},
    'Department': {'Research & Development': 0, 'Sales': 1, 'Human Resources': 2},
    'EducationField': {'Medical': 0, 'Life Sciences': 1, 'Other': 2,
                       'Marketing': 3, 'Technical Degree': 4, 'Human Resources': 5},
    'MaritalStatus': {'Single': 0, 'Married': 1, 'Divorced': 2},
    'Gender': {'Male': 0, 'Female': 1},
    'OverTime': {'No': 0, 'Yes': 1},
}
for column, codes in category_codes.items():
    # Any value missing from the mapping becomes NaN
    df[column] = df[column].map(codes)
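Integer codes give nominal columns such as Department an order (0 < 1 < 2) that the categories do not really have. Tree-based models usually cope with this, but linear models and SVMs may read the order as meaningful. One-hot encoding is a common alternative. The sketch below uses a small made-up frame (`toy` is illustrative, with category names taken from the table above), not the competition data.

```python
import pandas as pd

# Toy frame for illustration only; category names mirror those in train.csv.
toy = pd.DataFrame({
    "Age": [30, 36, 55],
    "Department": ["Research & Development", "Research & Development", "Sales"],
    "OverTime": ["No", "Yes", "No"],
})

# Binary column: a 0/1 code carries no spurious ordering.
toy["OverTime"] = toy["OverTime"].map({"No": 0, "Yes": 1})

# Nominal column: one indicator column per category instead of 0/1/2 codes.
encoded = pd.get_dummies(toy, columns=["Department"], dtype=int)
print(encoded.columns.tolist())
```

If this route is taken, the same encoding has to be applied to the test file so that both frames end up with identical columns.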



In [18]:
# Drop the JobRole column, then check the number of rows and columns
df = df.drop(['JobRole'], axis=1)
df.shape

(1628, 27)

Before modelling, let's check whether the data contains duplicate entries. This is crucial: duplicates can make the model look very accurate on the training dataset while it performs worse on real data.

In [19]:
# Remove rows that repeat an EmployeeNumber, keeping the first occurrence
df = df.drop_duplicates(subset='EmployeeNumber')
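Before dropping anything, it can help to count how many rows are repeats. In the table above, Ids 3 and 1625 share EmployeeNumber 842. The following is a minimal sketch on a made-up four-row frame (`toy` is illustrative), showing `duplicated` followed by `drop_duplicates`.

```python
import pandas as pd

# Toy frame for illustration; EmployeeNumber 842 appears twice.
toy = pd.DataFrame({
    "EmployeeNumber": [571, 1614, 842, 842],
    "Age": [30, 36, 55, 55],
    "Attrition": [0, 0, 1, 1],
})

# duplicated() flags every repeat after the first occurrence of a key.
n_repeats = toy.duplicated(subset="EmployeeNumber").sum()
print(f"{n_repeats} repeated row(s)")

# keep="first" is the default, so the earliest row per employee survives.
deduped = toy.drop_duplicates(subset="EmployeeNumber")
print(deduped.shape)
```

Removing duplicates before the train/test split matters: otherwise the same employee can land in both parts and inflate the hold-out accuracy.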


In [20]:
# The deduplicated data has the following number of rows and columns
df.shape

(1000, 27)

In [21]:
df

Id,Age,Attrition,BusinessTravel,Department,DistanceFromHome,Education,EducationField,EmployeeNumber,EnvironmentSatisfaction,Gender,...,PerformanceRating,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,CommunicationSkill,Behaviour
1,30,0,0,0,2,3,0,571,3,1,...,3,0,12,2,11,7,6,7,4,1
2,36,0,1,0,12,4,1,1614,3,1,...,3,2,7,2,3,2,1,1,2,1
3,55,1,1,1,2,1,0,842,3,0,...,3,0,12,3,9,7,7,3,5,1
4,39,0,1,0,24,1,1,2014,1,0,...,3,0,18,2,7,7,1,7,4,1
5,37,0,1,0,3,3,2,689,3,0,...,3,1,10,2,10,7,7,8,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
996,36,0,0,1,10,4,0,592,2,0,...,3,0,10,3,10,3,9,7,4,1
997,40,0,1,0,16,3,1,1641,3,1,...,3,0,18,2,4,2,3,3,2,1
998,46,1,1,1,9,2,0,118,3,0,...,3,0,9,3,9,8,4,7,4,1
999,30,0,1,0,2,3,0,833,3,1,...,4,0,12,4,0,0,0,0,5,1


In [22]:
y = df['Attrition']
X = df.drop('Attrition', axis=1)

In [23]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.1,random_state=5000)
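With `test_size=0.1` on 1000 rows, the hold-out part has only 100 rows, so its class mix depends on the random split. If attrition cases are a minority (an assumption; the class balance is not shown here), passing `stratify=y` keeps the same proportion in both parts. A sketch on synthetic stand-in data, where `X_demo`, `y_demo` and the 20% positive rate are made up for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 20% positive labels (an assumed ratio).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 5))
y_demo = np.array([1] * 200 + [0] * 800)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=5000, stratify=y_demo
)

# Both parts keep the positive rate of the full data.
print(y_tr.mean(), y_te.mean())
```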

# Rough Models
We will try different models and compare their accuracy to select the one that gives the best result.

## RandomForestClassifier

In [24]:
xg = RandomForestClassifier()
xg.fit(X_train, y_train)
y_pred_class = xg.predict(X_test)

print(metrics.accuracy_score(y_test, y_pred_class))

0.87


## DecisionTreeClassifier

In [25]:
xg = DecisionTreeClassifier()
xg.fit(X_train, y_train)
y_pred_class = xg.predict(X_test)

print(metrics.accuracy_score(y_test, y_pred_class))

0.79


## AdaBoostClassifier

In [26]:
xg = AdaBoostClassifier()
xg.fit(X_train, y_train)
y_pred_class = xg.predict(X_test)

print(metrics.accuracy_score(y_test, y_pred_class))

0.88


## LogisticRegression

In [27]:
xg = LogisticRegression()
xg.fit(X_train, y_train)
y_pred_class = xg.predict(X_test)

print(metrics.accuracy_score(y_test, y_pred_class))

0.84


## MLPClassifier

In [28]:
xg = MLPClassifier()
xg.fit(X_train, y_train)
y_pred_class = xg.predict(X_test)

print(metrics.accuracy_score(y_test, y_pred_class))

0.86


## SVC

In [29]:
xg = SVC()
xg.fit(X_train, y_train)
y_pred_class = xg.predict(X_test)

print(metrics.accuracy_score(y_test, y_pred_class))

0.86
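Each cell above fits one model and scores it on a single 100-row split, where one prediction shifts accuracy by 0.01. Differences such as 0.87 versus 0.88 may therefore be noise. A hedged alternative is to compare the candidates with k-fold cross-validation in one loop. The sketch below runs on synthetic data from `make_classification` rather than the competition data, and `candidates` and `scores` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X and y.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

candidates = {
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=0),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}

scores = {}
for name, model in candidates.items():
    # Five folds: every row is used for validation exactly once.
    fold_scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    scores[name] = fold_scores.mean()
    print(f"{name}: {scores[name]:.3f} (+/- {fold_scores.std():.3f})")
```

The per-fold standard deviation gives a rough sense of whether two models really differ.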


Among all the models, AdaBoostClassifier gives the best accuracy on this split. Let's adjust its hyperparameters to get a more accurate result.

In [30]:

# base_estimator=None was the default and is omitted; newer scikit-learn releases no longer accept that argument
xg = AdaBoostClassifier(n_estimators=50, learning_rate=.5, algorithm='SAMME', random_state=10)
xg.fit(X_train, y_train)
y_pred_class = xg.predict(X_test)

print(metrics.accuracy_score(y_test, y_pred_class))

0.9
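The hyperparameters above were chosen by hand and judged on the same hold-out rows used to pick the model, so the 0.9 figure may be optimistic. One way to search more systematically is `GridSearchCV`, which scores each combination by cross-validation on the training part only. This is a sketch on synthetic data; the grid values are arbitrary assumptions, not the settings used for the submission.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the training data.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.1, random_state=0)

# Illustrative grid; the ranges are assumptions.
param_grid = {
    "n_estimators": [25, 50, 100],
    "learning_rate": [0.1, 0.5, 1.0],
}

search = GridSearchCV(
    AdaBoostClassifier(random_state=10),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)  # the hold-out rows are never seen during the search

print(search.best_params_)
print(search.best_score_)         # mean cross-validated accuracy of the best setting
holdout_accuracy = search.score(X_te, y_te)
print(holdout_accuracy)           # accuracy of the refitted best model on the hold-out rows
```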


Let's load the test data, apply the same preprocessing, and predict the attrition probability.

In [31]:
dft = pd.read_csv(r"C:\Users\jay\Desktop\test.csv", index_col=0)

# Apply the same integer codes that were used for the training data.
# Column-wise map() avoids chained assignment, which may silently fail in recent pandas versions.
category_codes = {
    'BusinessTravel': {'Non-Travel': 0, 'Travel_Rarely': 1, 'Travel_Frequently': 2},
    'Department': {'Research & Development': 0, 'Sales': 1, 'Human Resources': 2},
    'EducationField': {'Medical': 0, 'Life Sciences': 1, 'Other': 2,
                       'Marketing': 3, 'Technical Degree': 4, 'Human Resources': 5},
    'MaritalStatus': {'Single': 0, 'Married': 1, 'Divorced': 2},
    'Gender': {'Male': 0, 'Female': 1},
    'OverTime': {'No': 0, 'Yes': 1},
}
for column, codes in category_codes.items():
    # Any value missing from the mapping becomes NaN
    dft[column] = dft[column].map(codes)

dft = dft.drop(['JobRole'], axis=1)
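Because the training and test files go through identical preprocessing, one option is to wrap the steps in a single function and call it on both frames, then reorder the test columns to match the training features. The sketch below uses made-up frames. `CODES`, `encode`, `train_toy` and `test_toy` are hypothetical names, and only two of the categorical columns are included for brevity.

```python
import pandas as pd

# Hypothetical mapping table covering two of the categorical columns.
CODES = {
    "Gender": {"Male": 0, "Female": 1},
    "OverTime": {"No": 0, "Yes": 1},
}

def encode(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with the categorical columns replaced by integer codes."""
    out = frame.copy()
    for column, codes in CODES.items():
        out[column] = out[column].map(codes)
    return out

# Made-up frames; note that the test columns arrive in a different order.
train_toy = pd.DataFrame({
    "Age": [30, 55],
    "Gender": ["Female", "Male"],
    "OverTime": ["No", "Yes"],
    "Attrition": [0, 1],
})
test_toy = pd.DataFrame({"Gender": ["Male"], "OverTime": ["No"], "Age": [41]})

X_toy = encode(train_toy).drop(columns="Attrition")
# Select the test columns in the training order before predicting.
test_encoded = encode(test_toy)[X_toy.columns]
print(test_encoded)
```

Selecting by `X_toy.columns` also raises a `KeyError` if the test frame lacks a training feature, which surfaces mismatches early.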

In [32]:
# Column 1 of predict_proba is the probability of class 1 (attrition)
dft['Attrition'] = xg.predict_proba(dft)[:, 1]


In [33]:
dft['Attrition']

Id
1      0.395601
2      0.390713
3      0.424201
4      0.416335
5      0.372016
         ...   
466    0.455792
467    0.540774
468    0.434743
469    0.385691
470    0.362513
Name: Attrition, Length: 470, dtype: float64
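`predict_proba` returns one column per class, in the order given by the fitted model's `classes_` attribute, which is why column 1 holds the probability of label 1. AdaBoost probabilities often cluster near 0.5, as in the output above, so they tend to be more useful for ranking employees than as calibrated probabilities. For a probability submission, a ranking metric such as ROC AUC (imported at the top of the notebook) may be a more informative check than accuracy. Whether the competition was scored this way is not stated here. A sketch on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with labels 0 and 1.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

clf = AdaBoostClassifier(n_estimators=50, learning_rate=0.5, random_state=10)
clf.fit(X_tr, y_tr)

proba = clf.predict_proba(X_te)
print(clf.classes_)  # column order of proba

# Look up the column of the positive class instead of hard-coding it.
positive_col = list(clf.classes_).index(1)
p_positive = proba[:, positive_col]

# AUC depends only on how the probabilities rank the rows.
auc = roc_auc_score(y_te, p_positive)
print(auc)
```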

In [34]:
dft['Attrition'].to_csv(r'C:\Users\jay\Desktop\hackathon.csv', index=True, header=True)