## IBM HR Analytics Employee Attrition Modeling 

### Content

<ol>
<li>Description</li>
<li>Data Gathering</li>
<li>Data Assessment and Cleaning</li>
<li>Data Visualization</li>
<li>Features transformation</li>
<li>Data correlation</li>
<li>Machine Learning</li>
<li>Conclusion</li>
</ol>

### 1. Description

IBM is an American multinational corporation (MNC) operating in around 170 countries, with computing, software, and hardware as its major business verticals.
Attrition is a major risk for service-providing organizations, where trained and experienced people are the company's key assets. The organization would like to identify the factors that influence employee attrition.

Data Dictionary

* Age: Age of employee <br>
* Attrition: Employee attrition status<br>
* Department: Department of work<br>
* DistanceFromHome<br>
* Education: 1-Below College; 2- College; 3-Bachelor; 4-Master; 5-Doctor;<br>
* EducationField<br>
* EnvironmentSatisfaction: 1-Low; 2-Medium; 3-High; 4-Very High;<br>
* JobSatisfaction: 1-Low; 2-Medium; 3-High; 4-Very High;<br>
* MaritalStatus<br>
* MonthlyIncome<br>
* NumCompaniesWorked: Number of companies worked prior to IBM<br>
* WorkLifeBalance: 1-Bad; 2-Good; 3-Better; 4-Best;<br>
* YearsAtCompany: Current years of service in IBM<br>
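
The ordinal codes in the data dictionary can be turned back into readable labels when inspecting rows or labelling plots. The snippet below is a minimal sketch: the label tables are copied from the dictionary above, while the `sample` frame and its values are hypothetical and not taken from the real dataset.

```python
import pandas as pd

# Label tables copied from the data dictionary
EDUCATION_LABELS = {1: "Below College", 2: "College", 3: "Bachelor", 4: "Master", 5: "Doctor"}
SATISFACTION_LABELS = {1: "Low", 2: "Medium", 3: "High", 4: "Very High"}

# Hypothetical sample rows (illustration only, not real records)
sample = pd.DataFrame({"Education": [3, 1, 5], "JobSatisfaction": [4, 2, 1]})

# Replace each numeric code with its label, leaving the original frame untouched
decoded = sample.assign(
    Education=sample["Education"].map(EDUCATION_LABELS),
    JobSatisfaction=sample["JobSatisfaction"].map(SATISFACTION_LABELS),
)
print(decoded)
```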

Analysis Task:<br>
<ol>
<li>Import attrition dataset and import libraries such as pandas, matplotlib.pyplot, numpy, and seaborn.</li>
<li>Exploratory data analysis
<ul>
<li>Find the age distribution of employees in IBM</li>
<li>Explore attrition by age</li>
<li>Explore data for employees who left</li>
<li>Find out the distribution of employees by the education field</li>
<li>Give a bar chart for the number of married and unmarried employees</li>
</ul>
</li>
<li>Build up a logistic regression model to predict which employees are likely to attrite.</li>
</ol>

### 2. Data Gathering

In [None]:
#import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
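# Note: the plot_* helpers below were removed in scikit-learn 1.2; on newer versions use
# ConfusionMatrixDisplay, PrecisionRecallDisplay and RocCurveDisplay (.from_estimator) instead.
from sklearn.metrics import confusion_matrix, plot_confusion_matrix, classification_report
from sklearn.metrics import plot_precision_recall_curve, plot_roc_curve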
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import RFE

import warnings
%matplotlib inline

In [None]:
data = pd.read_csv("IBM Attrition Data.csv")

### 3. Data Assessment and Cleaning

In [None]:
data.head()

In [None]:
data.tail()

In [None]:
data.dtypes

In [None]:
data.isna()

In [None]:
# Count missing values per column
data.isnull().sum()

In [None]:
data.count()

In [None]:
type(data)

In [None]:
data.describe()

### 4. Data Visualization

In [None]:
# histogram for age
plt.figure(figsize=(4,2), dpi=130)
data['Age'].hist(bins=70, color="green", alpha=0.8)
plt.xlabel('Age')
plt.ylabel('Employees')
plt.title('Age distribution of employees in IBM')
plt.xlim(15, 65)
plt.ylim(0, 80)
plt.grid(False)
plt.show()

In [None]:
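# Explore attrition by age: compare the mean age of employees who stayed and who left
# (drawing one bar per row would only show the maximum age in each group)
mean_age = data.groupby('Attrition')['Age'].mean()
plt.figure(figsize=(4,4))
plt.bar(mean_age.index, mean_age.values, width=0.4, align='center', color="green", alpha=0.8)
plt.xlabel('Attrition')
plt.ylabel('Mean age')
plt.title('Mean age by attrition')
plt.ylim(0, 70)
plt.grid(True, which='major', axis='y')
plt.show()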

In [None]:
# Breakdown of employees who left vs. stayed
plt.figure(figsize=(4,2))
data.Attrition.value_counts().plot(kind="barh", color="green", alpha=0.8)
plt.xlabel('Employees')
plt.ylabel('Attrition')
plt.title('Left employees breakdown')
plt.xlim(0, 1400)

In [None]:
#Find out the distribution of employees by the education field
plt.figure(figsize=(4,4))
data.EducationField.value_counts().plot(kind="barh", color="green", alpha=0.8)
plt.ylabel('Education field')
plt.title('Employees by the education field')
plt.xlim(0, 700)

In [None]:
#Give a bar chart for the number of married and unmarried employees
plt.figure(figsize=(3,3))
data.MaritalStatus.value_counts().plot(kind="bar", color="green", alpha=0.8)
plt.show()

In [None]:
data.hist()
plt.show()

### 5. Features transformation

In [None]:
# Encode the target as 1 (left) / 0 (stayed); assignment avoids chained in-place replacement
data['Attrition'] = data['Attrition'].map({'Yes': 1, 'No': 0})

In [None]:
print(data['EducationField'].value_counts())
# Encode each education field as an integer code
data['EducationField'] = data['EducationField'].map({
    "Life Sciences": 1, "Medical": 2, "Marketing": 3,
    "Technical Degree": 4, "Other": 5, "Human Resources": 6,
})

In [None]:
print(data['Department'].value_counts())
# Encode each department as an integer code
data['Department'] = data['Department'].map({
    "Research & Development": 1, "Sales": 2, "Human Resources": 3,
})

In [None]:
print(data['MaritalStatus'].value_counts())
# Encode each marital status as an integer code
data['MaritalStatus'] = data['MaritalStatus'].map({
    "Married": 1, "Single": 2, "Divorced": 3,
})
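
Integer codes give nominal categories such as marital status an artificial order (1 < 2 < 3), which a linear model will treat as meaningful. One common alternative, shown here only as a hedged sketch and not as the approach used in this notebook, is one-hot encoding with `pd.get_dummies`. The `sample` frame below is hypothetical.

```python
import pandas as pd

# Hypothetical rows (illustration only)
sample = pd.DataFrame({
    "Age": [41, 28, 35, 30],
    "MaritalStatus": ["Married", "Single", "Divorced", "Single"],
})

# One indicator column per category; drop_first removes the first
# (alphabetical) category, "Divorced", to avoid a redundant column.
encoded = pd.get_dummies(sample, columns=["MaritalStatus"], drop_first=True, dtype=int)
print(encoded)
```

With this encoding, each remaining indicator's coefficient is read relative to the dropped category.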

### 6. Data correlation

In [None]:
data.dtypes

In [None]:
data.corr()

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(data.corr(), vmin=-1, vmax=1, interpolation = 'none')
fig.colorbar(cax)
plt.show()

In [None]:
plt.figure(figsize=(12,8), dpi=200)
sns.heatmap(data.corr(), annot=True, cmap='viridis');

### 7. Machine Learning
#### Train | Test Split and Scaling | Logistic regression model

#### MODEL 1 - Considering all features

In [None]:
x1 = data.drop('Attrition', axis=1)
y1 = data['Attrition']

In [None]:
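# Min-max normalization of all features to the [0, 1] range
# (note: fitting on the full dataset before the split lets test rows influence the scaling)
scaler = MinMaxScaler(feature_range = (0, 1))
rescaledx1 = scaler.fit_transform(x1)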

In [None]:
x_train, x_test, y_train, y_test = train_test_split(rescaledx1, y1, test_size=0.3, random_state=0)

In [None]:
len(x_train), len(x_test)

In [None]:
# Create a StandardScaler, fit it on the training features only,
# and apply the same transformation to the test features.
scaler = StandardScaler()
scaled_x_train = scaler.fit_transform(x_train)
scaled_x_test = scaler.transform(x_test)
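
Scaling and model fitting can also be bundled so that the scaler is always fitted on the training rows only and applied automatically at prediction time. The following is a minimal sketch using scikit-learn's `make_pipeline`; the synthetic data from `make_classification` stands in for the attrition features and is an assumption, not the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the attrition features (assumption, not the real data)
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# The pipeline fits the scaler on the training rows only, then reuses
# those statistics whenever it transforms new data.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=400))
pipe.fit(X_train, y_train)

# score() scales X_test internally before predicting
test_accuracy = pipe.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")
```

Because the pipeline carries its own scaler, there is no risk of scoring the model on unscaled inputs by mistake.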

In [None]:
# Build a logistic regression model using all features to predict which employees are likely to leave
model = LogisticRegression(solver='lbfgs', max_iter=400)
model = model.fit(scaled_x_train, y_train)
model.score(scaled_x_train, y_train)

In [None]:
# Score on the standardized test set, matching the scaling used for training
result = model.score(scaled_x_test, y_test)
print("Accuracy: %.3f%%" % (result * 100.0))

In [None]:
model.coef_

In [None]:
y_pred = model.predict(scaled_x_test)

In [None]:
probs = model.predict_proba(scaled_x_test)
print(probs)

In [None]:
# Model Performance Evaluation
confusion_matrix(y_test, y_pred)

In [None]:
plot_confusion_matrix(model, scaled_x_test, y_test);

The confusion matrix shows 71 misclassified test samples.
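
The number of mistakes is the sum of the off-diagonal cells of the confusion matrix (false positives plus false negatives). A small sketch of that arithmetic follows; `count_mistakes` is a hypothetical helper and the example matrix is made up, not the notebook's output.

```python
import numpy as np

def count_mistakes(cm):
    """Return the number of misclassified samples in a confusion matrix."""
    cm = np.asarray(cm)
    # Correct predictions sit on the diagonal; everything else is a mistake
    return int(cm.sum() - np.trace(cm))

# Hypothetical 2x2 matrix in scikit-learn's layout:
# rows = true class, columns = predicted class
example_cm = [[50, 4],
              [6, 10]]
mistakes = count_mistakes(example_cm)
print(mistakes)  # 4 false positives + 6 false negatives = 10
```

In the notebook this could be called as `count_mistakes(confusion_matrix(y_test, y_pred))`.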

In [None]:
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

In [None]:
#Coefficients
coef = pd.Series(data=model.coef_[0], index=x1.columns)
coef

In [None]:
coef=coef.sort_values()
coef

In [None]:
plt.figure(figsize=(8,4))
sns.barplot(x=coef.index, y=coef.values)
plt.xticks(rotation=45)

In [None]:
#Performance Curves
#Create both the precision recall curve and the ROC Curve.

In [None]:
plot_precision_recall_curve(model, scaled_x_test, y_test);

In [None]:
plot_roc_curve(model, scaled_x_test, y_test);
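
Both curves can also be summarized as single numbers, which makes the models easier to compare: the area under the ROC curve and the average precision (a summary of the precision-recall curve). The sketch below uses `roc_auc_score` and `average_precision_score` on synthetic, imbalanced data; the data and variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 15% positives), standing in for attrition
X, y = make_classification(n_samples=400, n_features=6,
                           weights=[0.85, 0.15], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

clf = LogisticRegression(max_iter=400).fit(X_train, y_train)

# Both metrics need the predicted probability of the positive class
scores = clf.predict_proba(X_test)[:, 1]
roc_auc = roc_auc_score(y_test, scores)
avg_precision = average_precision_score(y_test, scores)
print(f"ROC AUC: {roc_auc:.3f}, average precision: {avg_precision:.3f}")
```

On imbalanced targets, average precision is usually the more informative of the two, because ROC AUC can look high even when the minority class is poorly identified.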

#### MODEL 2 - Considering features with positive correlation

In [None]:
x2 = data[['Age', 'Education', 'EnvironmentSatisfaction', 'JobSatisfaction',
         'MonthlyIncome', 'WorkLifeBalance','YearsAtCompany']]
y2 = data['Attrition']

In [None]:
#Normalization
scaler = MinMaxScaler(feature_range = (0, 1))
rescaledx2 = scaler.fit_transform(x2)

In [None]:
x2_train, x2_test, y2_train, y2_test = train_test_split(rescaledx2, y2, test_size=0.3, random_state=0)

scaler = StandardScaler()

scaled_x2_train = scaler.fit_transform(x2_train)
scaled_x2_test = scaler.transform(x2_test)

In [None]:
model2 = LogisticRegression(solver='lbfgs', max_iter=400)
model2 = model2.fit(scaled_x2_train, y2_train)
model2.score(scaled_x2_train, y2_train)

In [None]:
# Score on the standardized test set, matching the scaling used for training
result = model2.score(scaled_x2_test, y2_test)
print("Accuracy: %.3f%%" % (result * 100.0))

In [None]:
model2.coef_

In [None]:
# Model Performance Evaluation
y2_pred = model2.predict(scaled_x2_test)
print(confusion_matrix(y2_test, y2_pred))  # print so the matrix is shown mid-cell
plot_confusion_matrix(model2, scaled_x2_test, y2_test)
print(classification_report(y2_test, y2_pred, zero_division='warn'))

The confusion matrix shows 71 misclassified test samples.

In [None]:
#Coefficients
coef = pd.Series(data=model2.coef_[0], index=x2.columns)
coef

In [None]:
coef=coef.sort_values()
coef

In [None]:
plt.figure(figsize=(8,4))
sns.barplot(x=coef.index, y=coef.values)
plt.xticks(rotation=45)

In [None]:
#Performance Curves
#Create both the precision recall curve and the ROC Curve.

In [None]:
plot_precision_recall_curve(model2, scaled_x2_test, y2_test);

#### MODEL 3 - Considering RFE for feature selection

In [None]:
data2 = data[['Attrition', 'Age','Department', 'DistanceFromHome', 'Education',
       'EducationField', 'EnvironmentSatisfaction', 'JobSatisfaction',
       'MaritalStatus', 'MonthlyIncome', 'NumCompaniesWorked',
       'WorkLifeBalance', 'YearsAtCompany']]

In [None]:
data2.head()

In [None]:
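# Note: the slice 1:12 selects the 11 columns from Age to WorkLifeBalance;
# YearsAtCompany (index 12) is excluded. Use 1: to include it.
x3 = data2.iloc[:, 1:12]
y3 = data2.iloc[:, 0]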

In [None]:
#Normalization
scaler = MinMaxScaler(feature_range = (0, 1))
rescaledx3 = scaler.fit_transform(x3)
print(rescaledx3[0:5,:])

In [None]:
modelo3 = LogisticRegression(solver='lbfgs', max_iter=400)

In [None]:
# RFE
rfe = RFE(estimator = modelo3, n_features_to_select=7)
fit = rfe.fit(rescaledx3, y3)

# Print the results
print("Number of features: %d" % fit.n_features_)
print(x3.columns)
print("Features selected: %s" % fit.support_)
print("Features Ranking: %s" % fit.ranking_)
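
The boolean `support_` mask and the `ranking_` array are easier to read when paired with column names. The sketch below shows one way to do that on synthetic data; the `feature_i` column names and the data itself are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data with named columns (illustration only)
X_arr, y = make_classification(n_samples=300, n_features=6,
                               n_informative=3, random_state=0)
X = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(6)])

rfe = RFE(estimator=LogisticRegression(max_iter=400), n_features_to_select=3)
rfe.fit(X, y)

# support_ is a boolean mask over the columns; ranking_ is 1 for kept features
selected = X.columns[rfe.support_].tolist()
ranking = pd.Series(rfe.ranking_, index=X.columns).sort_values()
print("Selected:", selected)
print(ranking)
```

Selecting the final columns with `X.columns[rfe.support_]` avoids retyping the feature list by hand after inspecting the printout.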

In [None]:
x4 = data2[['Age', 'Department','EnvironmentSatisfaction', 'JobSatisfaction',
       'MonthlyIncome', 'NumCompaniesWorked', 'WorkLifeBalance']]
y4 = data2['Attrition']

In [None]:
#Normalization
scaler = MinMaxScaler(feature_range = (0, 1))
rescaledx4 = scaler.fit_transform(x4)

In [None]:
x4_train, x4_test, y4_train, y4_test = train_test_split(rescaledx4, y4, test_size=0.3, random_state=763)

scaler = StandardScaler()

scaled_x4_train = scaler.fit_transform(x4_train)
scaled_x4_test = scaler.transform(x4_test)

In [None]:
model4 = LogisticRegression(solver='lbfgs', max_iter=400)
model4 = model4.fit(scaled_x4_train, y4_train)
model4.score(scaled_x4_train, y4_train)

In [None]:
# Score on the standardized test set, matching the scaling used for training
result = model4.score(scaled_x4_test, y4_test)
print("Accuracy: %.3f%%" % (result * 100.0))

In [None]:
# Model Performance Evaluation
y4_pred = model4.predict(scaled_x4_test)
print(confusion_matrix(y4_test, y4_pred))  # print so the matrix is shown mid-cell
plot_confusion_matrix(model4, scaled_x4_test, y4_test)
print(classification_report(y4_test, y4_pred))

The confusion matrix shows 80 misclassified test samples.

In [None]:
#Performance Curves
#Create both the precision recall curve and the ROC Curve.

In [None]:
plot_precision_recall_curve(model4, scaled_x4_test, y4_test);

### 8. Conclusion

Of the three models tested, Model 1 achieved the best precision, recall, and f1-score.

Final Task: An employee with the following features:<br>
Age: 28<br>
Department: 4<br>
DistanceFromHome: 5<br>
Education: 3<br>
EducationField: 4<br>
EnvironmentSatisfaction: 1<br>
JobSatisfaction: 2<br>
MaritalStatus: 2<br>
MonthlyIncome: 3<br>
NumCompaniesWorked: 6<br>
WorkLifeBalance: 3<br>
YearsAtCompany: 5<br>

In [None]:
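# Hypothetical employee with the values listed above, in the same column order as x1
kk = pd.DataFrame([[28, 4, 5, 3, 4, 1, 2, 2, 3, 6, 3, 5]], columns=x1.columns)
# Re-create Model 1's two scaling steps (`scaler` was reassigned for the later models)
kk_scaled = StandardScaler().fit(x_train).transform(MinMaxScaler().fit(x1).transform(kk))
print(model.predict_proba(kk_scaled))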

The model predicts that the employee is far more likely to belong to target class 0 (no attrition) than to class 1 (attrition).
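
For a binary logistic regression, the two probabilities returned by `predict_proba` come from applying the sigmoid function to a weighted sum of the (scaled) features. The sketch below reproduces that calculation by hand on synthetic data to show where the numbers come from; the data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem (illustration only)
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
clf = LogisticRegression(max_iter=400).fit(X, y)

row = X[:1]  # a single sample, kept 2-D

# Linear score: features times coefficients, plus the intercept
z = row @ clf.coef_.T + clf.intercept_
p_class1 = 1.0 / (1.0 + np.exp(-z))          # sigmoid gives P(class 1)
manual = np.hstack([1.0 - p_class1, p_class1])  # [P(class 0), P(class 1)]

print(manual)
print(clf.predict_proba(row))
```

The first column is the probability of class 0 and the second the probability of class 1, so each row sums to 1.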