## TITLE :- Employee Attrition Analysis



### Problem Statement :- Based on department and employee data covering administrative details, workload and mutual evaluation scores, predict whether an employee will stay or leave.

* The target variable in our project is "ATTRITION"

* Attrition refers to the gradual but deliberate reduction in staff that occurs as employees leave a company and aren't replaced. Employees may leave voluntarily or involuntarily.



# Importing necessary libraries:- 

* warnings - Importing the warnings module to handle any warning messages
* warnings.filterwarnings - Ignoring any warning messages that might occur during the execution of the code 
* numpy - To perform statistical functions with the data
* pandas - To perform Exploratory Data Analysis for the dataset
* matplotlib and seaborn - To visualise the critical attributes of the dataset graphically

In [1]:
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# Description of features
#### 1 Age - Employee's age
#### 2 Gender - Employee's gender
#### 3 BusinessTravel - Frequency of the employee's business trips
#### 4 DailyRate - Daily salary rate for employees
#### 5 Department - Department the employee works in
#### 6 DistanceFromHome - Distance from home to work, in miles
#### 7 Education - Level of education achieved by the employee
#### 8 EducationField - Employee's field of study
#### 9 EmployeeCount - Total number of employees in the organization
#### 10 EmployeeNumber - A unique identifier for each employee record
#### 11 EnvironmentSatisfaction - Employee's satisfaction with their working environment
#### 12 HourlyRate - Hourly rate for employees
#### 13 JobInvolvement - Level of involvement required for the employee's job
#### 14 JobLevel - Employee's job level
#### 15 JobRole - The role of the employee in the organization
#### 16 JobSatisfaction - Employee's satisfaction with their work
#### 17 MaritalStatus - Employee's marital status
#### 18 MonthlyIncome - Employee's monthly income
#### 19 MonthlyRate - Monthly salary rate for employees
#### 20 NumCompaniesWorked - Number of companies the employee has worked for
#### 21 Over18 - Whether the employee is over 18 years old
#### 22 OverTime - Whether the employee works overtime
#### 23 PercentSalaryHike - Salary increase rate for employees
#### 24 PerformanceRating - The performance rating of the employee
#### 25 RelationshipSatisfaction - Employee's satisfaction with their relationships
#### 26 StandardHours - Standard working hours for employees
#### 27 StockOptionLevel - Employee's stock option level
#### 28 TotalWorkingYears - Total number of years the employee has worked
#### 29 TrainingTimesLastYear - Number of times the employee was sent for training in the last year
#### 30 WorkLifeBalance - Employee's perception of their work-life balance
#### 31 YearsAtCompany - Number of years the employee has been with the company
#### 32 YearsInCurrentRole - Number of years the employee has been in their current role
#### 33 YearsSinceLastPromotion - Number of years since the employee's last promotion
#### 34 YearsWithCurrManager - Number of years the employee has been with their current manager
#### 35 Attrition - Whether the employee leaves the organization (target variable)

# Generic Process of Exploratory Data Analysis 


1. Import file -- excel file, csv file (data set)


2. Check the dataframe
    * Number of rows and columns (features)
    * Check the top five rows
    * Check the bottom five rows
    * Check duplicates -- if there are any duplicates, drop them
    

3. Check the shape of the dataframe -- total number of rows and columns in the dataset


4. Check the info of the dataset --> the data type of each column and whether it contains empty (null) values


5. Check the null values and duplicates within the dataset.


6. If null values are present in the dataset, treat them

     * continuous variable --  mean, median, backward fill (bfill), forward fill (ffill)
     * categorical variable --  mode
    
    
7. The statistical information of the dataset


8. Data Visualization --> finding the insights from the data graphically

      PLOTS
      
    a) one continuous variable -- Box plot, histogram
    
    b) one categorical variable -- value_counts, countplots
    
    c) one continuous variable and one categorical variable -- Box plot, Bar plot
    
    d) Two continuous variable -- scatter plot
    
    e) Two categorical variable and one continuous variable -- Barplot
    
    f) pairplot - pairwise plots of the variables, covering both categorical and continuous ones
    
    g) Heatmaps - To represent the correlation (collinearity) between all the attributes  
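
The checklist above can be sketched on a small made-up table. The frame `toy` and its values are hypothetical stand-ins, not the IBM data, and the median/mode choices in step 6 are only one of the options listed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the IBM dataset): one duplicated row, two missing values
toy = pd.DataFrame({
    "Age": [25, 32, np.nan, 41, 32],
    "Department": ["Sales", "HR", "Sales", None, "HR"],
})

print(toy.shape)           # step 3: (rows, columns)
print(toy.head())          # step 2: first rows
toy.info()                 # step 4: dtypes and non-null counts
print(toy.isna().sum())    # step 5: missing values per column

# Steps 2 and 5: drop exact duplicate rows
toy = toy.drop_duplicates().reset_index(drop=True)

# Step 6: treat missing values (median for continuous, mode for categorical)
toy["Age"] = toy["Age"].fillna(toy["Age"].median())
toy["Department"] = toy["Department"].fillna(toy["Department"].mode()[0])

print(toy.describe())      # step 7: summary statistics
```

Steps 2 to 7 are applied to the real dataset in the cells that follow.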


## Importing dataset in variable 'df' and printing it

In [2]:
# Path on the author's machine; point it at your own copy of IBM_DATASET.csv
df=pd.read_csv("C:/Users/shrut/Desktop/Edubridge/IBM_DATASET.csv")
df


## Checking first five rows of dataset

In [None]:
df.head()

## Checking last five rows of the dataset

In [None]:
df.tail()

## Checking the shape of dataset -- the total number of rows and columns of the dataset

In [None]:
df.shape

## To check the datatype and count of non-null values in the data

 * For Continuous - int64, float64
 * For Categorical - object

In [None]:
df.info()

## Checking if there are duplicates in the dataset

In [None]:
df.duplicated().sum()

## To check the statistical information for continuous variables of the data

In [None]:
df.describe()

## Checking if there are null (missing) values in the data

In [None]:
df.isna().sum()

## Dropping redundant columns that carry no information about the target variable "Attrition":-

Dropping columns : "EmployeeCount", "EmployeeNumber", "Over18", "StandardHours"

* EmployeeCount - '1' for every employee
* EmployeeNumber - a serial identifier, different for each of the 1470 rows
* Over18 - 'Y' for every employee
* StandardHours - 80 for every employee

In [None]:
# inplace=True drops the columns from df permanently; axis=1 means columns
df.drop(["EmployeeCount","EmployeeNumber","Over18","StandardHours"], inplace=True, axis=1)
df
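
Columns like these can also be found programmatically instead of by inspection. The following is a minimal sketch on a hypothetical frame `toy`; the names `constant_cols` and `id_like_cols` are illustrative. A column with a different value in every row is not always an identifier (a continuous measurement can look the same), so the second list still needs a manual check.

```python
import pandas as pd

# Hypothetical toy frame mimicking the pattern described above
toy = pd.DataFrame({
    "EmployeeCount": [1, 1, 1, 1],
    "Over18": ["Y", "Y", "Y", "Y"],
    "EmployeeNumber": [1, 2, 3, 4],
    "Age": [41, 49, 37, 37],
})

n_unique = toy.nunique()
constant_cols = n_unique[n_unique == 1].index.tolist()         # same value in every row
id_like_cols = n_unique[n_unique == len(toy)].index.tolist()   # different value in every row

print(constant_cols)
print(id_like_cols)

# drop() without inplace returns a new frame and leaves `toy` untouched
reduced = toy.drop(columns=constant_cols + id_like_cols)
print(reduced.columns.tolist())
```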

## Checking all column names in the data after dropping the redundant columns

In [None]:
df.columns

## To check the new shape of dataset

In [None]:
df.shape

## To check the unique values in each categorical attribute

In [None]:

for column in df.columns:
    if df[column].dtype==object:
        print(str(column)+':'+str(df[column].unique()))
        print(df[column].value_counts())
        print('_______________________________')

## Performing pandas-profiling (also known as one-line EDA) :- It produces, on its own, a report of the EDA that we performed step by step above

In [None]:
# Recent releases of this package are published under the name ydata-profiling
!pip install pandas-profiling

In [None]:
pip freeze

In [None]:
from pandas_profiling import ProfileReport

In [None]:
pf = ProfileReport(df)
pf

In [None]:
pf.to_file("output.html")
pf

# Visualisation of Data

## To see the number of continuous variables in data

In [None]:
continuous = df.select_dtypes('int').columns
continuous

##### A few columns in this data are categorical by nature but were already label-encoded, so they are stored as integers and are listed here as continuous variables

## Plotting boxplot for all continuous variables

In [None]:
for i in continuous:
    sns.boxplot(x  = df[i],data=df,orient="h") 
    plt.show()

#### The plots above show at least 10 variables that appear to contain outliers. We will not treat these values as outliers, because the information in these variables is sensitive and subjectively valid for each particular employee. A further reason is that no valid range has been specified for these variables, so there is no reference against which a value could be called out of range.
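
The points a box plot draws beyond its whiskers follow the 1.5 × IQR rule by default, so the number of flagged values per column can be counted directly. This is a sketch only: the helper `iqr_outlier_count` and the sample values are hypothetical and not part of the original notebook.

```python
import pandas as pd

def iqr_outlier_count(s: pd.Series, k: float = 1.5) -> int:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    outside = (s < q1 - k * iqr) | (s > q3 + k * iqr)
    return int(outside.sum())

# Hypothetical incomes: one value far above the rest
income = pd.Series([2000, 2500, 3000, 3500, 4000, 20000])
print(iqr_outlier_count(income))
```

A count like this only says which values are unusual relative to the rest; whether they are errors is a separate, domain-level decision, which is the point made above.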


## To see the number of categorical variables in data

In [None]:
categorical = df.select_dtypes('object').columns
categorical

## Plotting countplot for all categorical variables

In [None]:
for i in categorical:
    sns.countplot(x  = df[i],data=df)
    plt.show()

## DISTRIBUTION OF EMPLOYEE ATTRITION IN THE COMPANY

In [None]:
labels = 'Attrition NO','Attrition YES'
df['Attrition'].astype(str).value_counts().plot(kind='pie',
                            figsize=(15, 6),
                            autopct='%1.1f%%', 
                            startangle=90,    
                            shadow=True,       
                            labels=None,                                
                            )

plt.title('Distribution of Employee Attrition in the Company ', y=1.12) 
plt.axis('equal') 
# add legend
plt.legend(labels=labels, loc='upper left') 
 # show plot
plt.show()

### From the pie chart, we can infer that out of 1470 employees, 16.1% left the company, whereas the remaining 83.9% stayed.
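
The percentages read off a pie chart can be cross-checked numerically with `value_counts(normalize=True)`. The series below is a hypothetical stand-in for the `Attrition` column, not the real data.

```python
import pandas as pd

# Hypothetical target column: 2 leavers out of 10 employees
attrition = pd.Series(["No"] * 8 + ["Yes"] * 2, name="Attrition")

counts = attrition.value_counts()                 # absolute numbers per class
shares = attrition.value_counts(normalize=True)   # fractions that sum to 1

print(counts)
print((shares * 100).round(1))
```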

# Analysis of the Rating Features

#### JobSatisfaction
#### EnvironmentSatisfaction
#### RelationshipSatisfaction
#### JobInvolvement
#### WorkLifeBalance
#### PerformanceRating

In [None]:
df['JobSatisfaction'].value_counts()

In [None]:
fig = plt.figure(figsize=(15, 6))

# Ratings are coded 1-4 (assumed here to mean Low, Medium, High, Very High in that order).
# sort_index() draws the slices in rating order so that they match the legend;
# value_counts() on its own orders them by frequency.
labels = 'Low','Medium','High','Very High'

rating_titles = {
    'JobSatisfaction': 'Rating of Job Satisfaction by Employees',
    'EnvironmentSatisfaction': 'Rating of Environmental Satisfaction by Employees',
    'RelationshipSatisfaction': 'Rating of Relationship Satisfaction by Employees',
    'JobInvolvement': 'Rating of Job Involvement by Employees',
}

for position, (column, title) in enumerate(rating_titles.items(), start=1):
    ax = fig.add_subplot(2, 2, position)
    df[column].value_counts().sort_index().plot(kind='pie',
                            autopct='%1.1f%%',
                            startangle=90,
                            shadow=True,
                            labels=None, ax=ax)
    ax.set_title(title)

fig.legend(labels=labels, loc='center')

plt.show()

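### From the subplot, if the slices labelled Low and Medium are read at face value, more than 60% of the employees are :

* Not satisfied in their job
* Not satisfied with their work environment
* Not satisfied in their relationships
* Not getting involved in their job

This reading is valid only when the slices are drawn in rating order (1 to 4), because the legend assigns Low, Medium, High and Very High by position. `value_counts()` on its own orders slices by frequency, so the conclusion should be confirmed against the rating counts.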

In [None]:
fig2 = plt.figure(figsize=(15, 6))

ax5 = fig2.add_subplot(121) 
ax6 = fig2.add_subplot(122)  

# Rating names keyed by numeric code (codes assumed to run 1-4 in this order)
labels_list1 = {1: 'Bad', 2: 'Good', 3: 'Better', 4: 'Best'}
labels_list2 = {1: 'Low', 2: 'Good', 3: 'Excellent', 4: 'Outstanding'}

# Sort by rating code and build each legend from the codes actually present,
# so every slice gets the name of its own rating
wlb_counts = df['WorkLifeBalance'].value_counts().sort_index()
wlb_counts.plot(kind='pie',
                            autopct='%1.1f%%', 
                            startangle=90,    
                            shadow=True,       
                            labels=None,ax=ax5)
ax5.set_title ('Rating of Work-Life Balance by Employees')
ax5.legend(labels=[labels_list1[code] for code in wlb_counts.index],loc='upper right')

perf_counts = df['PerformanceRating'].value_counts().sort_index()
perf_counts.plot(kind='pie',
                            autopct='%1.1f%%', 
                            startangle=90,    
                            shadow=True,       
                            labels=None,ax=ax6) 
ax6.set_title('Performance Rating of the Employees')
ax6.legend(labels=[labels_list2[code] for code in perf_counts.index],loc='upper right')

plt.show()

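### From the above pie charts, reading the legends at face value:

* Almost 60% of the employees have rated their Work-Life Balance as Bad
* Almost 85% of the employees have a low performance rating

A legend that assigns names by slice position is correct only when the slices are in rating order and every rating from 1 to 4 occurs in the column, so these readings should be confirmed with `value_counts()` on each column.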

In [None]:
props = df.groupby("BusinessTravel")['Attrition'].value_counts(normalize=False).unstack()

# stacked takes a boolean; the string 'False' is truthy and would stack the bars
props.plot(kind='bar', alpha=1, stacked=False)

plt.title('Business Travel VS Attrition')
plt.ylabel('Number of Employees')
plt.show()

#### The plot shows that employees who travel rarely account for the largest number of leavers, followed by employees who travel frequently. Because the bars show counts rather than rates, part of this difference reflects how many employees are in each travel group

#### One way to reduce this attrition is to conduct a monthly survey and assign travel according to each employee's stated business travel preference
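
Counts and rates can point in different directions when the groups differ in size. The sketch below uses hypothetical data (the category names and numbers are made up for illustration) and `pd.crosstab` with `normalize="index"` to turn each row into within-group shares.

```python
import pandas as pd

# Hypothetical toy data: a large group with few leavers, a small group with many
toy = pd.DataFrame({
    "BusinessTravel": ["Travel_Rarely"] * 10 + ["Travel_Frequently"] * 4,
    "Attrition": ["Yes"] * 3 + ["No"] * 7 + ["Yes"] * 2 + ["No"] * 2,
})

counts = pd.crosstab(toy["BusinessTravel"], toy["Attrition"])
rates = pd.crosstab(toy["BusinessTravel"], toy["Attrition"], normalize="index")

print(counts)   # Travel_Rarely has more leavers in absolute terms (3 vs 2)
print(rates)    # Travel_Frequently has the higher attrition rate (0.5 vs 0.3)
```

The same two tables computed on `df` would show whether the group with the most leavers is also the group with the highest attrition rate.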

# Analysis of Work Experience
### Years At Company
### Years In CurrentRole
### Years Since LastPromotion
### Years With CurrManager
### Total Working Years

In [None]:
yac = df.groupby("YearsAtCompany")['Attrition'].value_counts(normalize=False).unstack()

# stacked takes a boolean; the string 'False' is truthy and would stack the bars
yac.plot(kind='bar', stacked=False, figsize=(10,6))

plt.title('Years At Company of Employee')
plt.ylabel('Number of Employees')
plt.show()

#### Newly joined employees quit their jobs most often, so more attention should be given to freshers and the reasons they leave the company should be investigated

In [None]:
ycr = df.groupby("YearsInCurrentRole")['Attrition'].value_counts(normalize=False).unstack()
ysp = df.groupby("YearsSinceLastPromotion")['Attrition'].value_counts(normalize=False).unstack()


fig = plt.figure() # create figure

ax0 = fig.add_subplot(121) # add subplot 1 (1 row, 2 columns, first plot)
ax1 = fig.add_subplot(122) # add subplot 2 (1 row, 2 columns, second plot)

# Subplot 1: bar plot of years in current role
# (stacked takes a boolean; the string 'False' is truthy and would stack the bars)
ycr.plot(kind='bar', stacked=False, figsize=(20,6), ax=ax0) # add to subplot 1
ax0.set_title('Same Role')
ax0.set_xlabel('Years In Current Role')
ax0.set_ylabel('Number of Employees')

# Subplot 2: bar plot of years since last promotion
ysp.plot(kind='bar', stacked=False, figsize=(20,6), ax=ax1) # add to subplot 2
ax1.set_title('Last Promotion')
ax1.set_ylabel('Number of Employees')
ax1.set_xlabel('Years Since Last Promotion')

plt.show()

#### From the above two plots, employees who stay in the same post or are not promoted tend to leave the company most. This is a major concern, since experienced employees quitting their jobs would affect the company most

In [None]:
ycm = df.groupby("YearsWithCurrManager")['Attrition'].value_counts(normalize=False).unstack()

# stacked takes a boolean; the string 'False' is truthy and would stack the bars
ycm.plot(kind='bar', stacked=False, figsize=(10,6))

plt.title('Years with Current Manager')
plt.ylabel('Number of Employees')
plt.show()

#### Attrition is highest at the start of the manager-employee relationship, which suggests that employees are less settled at that stage. It is important that managers communicate with new employees from the very beginning and try to understand them early, to keep attrition from rising

In [None]:
twy = df.groupby("TotalWorkingYears")['Attrition'].value_counts(normalize=False).unstack()

# stacked takes a boolean; the string 'False' is truthy and would stack the bars
twy.plot(kind='bar', stacked=False, figsize=(8,5))

plt.title('Total Working Years of Experience')
plt.ylabel('Number of Employees')
plt.show()

#### Freshers are the most likely to leave, so it is important that the company creates a policy aimed at retaining them from the start.

## Analysis of Monthly Income

In [None]:
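# Monthly income of leavers ("Yes") and stayers ("No"), each re-indexed from 0
mi = df[df['Attrition']=='Yes']['MonthlyIncome'].reset_index(drop=True)
mn = df[df['Attrition']=='No']['MonthlyIncome'].reset_index(drop=True)

# Place the two groups side by side. pd.concat keeps every row of both groups
# (the shorter column is padded with NaN); assigning mn as a new column of mi
# would cut it down to the length of mi.
mi = pd.concat([mi.rename('Yes'), mn.rename('No')], axis=1)
mi.head()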

In [None]:
mi.plot(kind='box', figsize=(10, 7))

plt.title('Box plot of Monthly Income vs Attrition')
plt.ylabel('Monthly Income')

plt.show()

#### Employees who left their jobs tend to have a lower typical (median) monthly income than those who stayed with the company.
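
A box plot marks the median, which can differ noticeably from the mean when a few incomes are very high. A small sketch with hypothetical numbers (not taken from the dataset) shows how both can be tabulated per group.

```python
import pandas as pd

# Hypothetical incomes; the "Yes" group contains one unusually high value
toy = pd.DataFrame({
    "Attrition": ["Yes", "Yes", "Yes", "No", "No", "No", "No"],
    "MonthlyIncome": [2000, 2500, 9000, 3000, 5000, 6000, 7000],
})

summary = toy.groupby("Attrition")["MonthlyIncome"].agg(["count", "mean", "median"])
print(summary)
```

Here the single high value pulls the mean of the "Yes" group up to 4500 while its median stays at 2500, so the two statistics are worth reporting side by side.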

## Over Time Employee Analysis

In [None]:
dot = df[['OverTime', 'MonthlyIncome', 'Attrition']]
oyay = dot[(df['OverTime']=='Yes') & (df['Attrition']=='Yes')]
oyay = oyay.sort_values(by = 'MonthlyIncome', ascending=False, axis=0) #sorting to get the top values
count, bin_edges = np.histogram(oyay['MonthlyIncome'])

oyay.plot(kind='hist', xticks=bin_edges)

In [None]:
oyan = dot[(df['OverTime']=='Yes') & (df['Attrition']=='No')]
count, bin_edges = np.histogram(oyan['MonthlyIncome'])

oyan.plot(kind='hist', xticks=bin_edges)

In [None]:
onay = dot[(df['OverTime']=='No') & (df['Attrition']=='Yes')]
count, bin_edges = np.histogram(onay['MonthlyIncome'])

onay.plot(kind='hist', xticks=bin_edges)

In [None]:
onan = dot[(df['OverTime']=='No') & (df['Attrition']=='No')]
count, bin_edges = np.histogram(onan['MonthlyIncome'])

onan.plot(kind='hist',alpha =0.4, xticks=bin_edges)

## Analysis on Department

In [None]:
dpt = df[['Department','Attrition']]
dpt.head()

In [None]:
dpt['Department'].value_counts()

In [None]:
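dept_counts = dpt['Department'].value_counts()
dept_counts.plot(kind='pie',
                            figsize=(15, 6),
                            autopct='%1.1f%%', 
                            startangle=90,    
                            shadow=True,       
                            labels=None)   
plt.axis('equal') 
# Take the legend labels from the counts so that they follow the slice order;
# unique() returns the departments in order of first appearance instead
plt.legend(labels=dept_counts.index, loc='upper left') 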

In [None]:
dpm = df.groupby("Department")['Attrition'].value_counts(normalize=False).unstack()
dpm = dpm.transpose()
dpm

In [None]:
labels = ['Human Resources', 'Research & Development', 'Sales',]
sizes = [63, 961, 446]
labels_attrition = ['Yes','No','Yes','No','Yes','No']
sizes_attrition = [12,51,133,828,92,354]
colors = ['#ff6666', '#ffcc99', '#99ff99']

colors_attrition = ['#0a0e77','#9e0723', '#0a0e77','#9e0723', '#0a0e77','#9e0723', '#0a0e77','#9e0723']
 
# Plot
plt.pie(sizes, autopct='%1.1f%%', pctdistance=.87, labels=labels, colors=colors, startangle=90,frame=True)
plt.pie(sizes_attrition,colors=colors_attrition,radius=0.75,startangle=90)
centre_circle = plt.Circle((0,0),0.5,color='black', fc='white',linewidth=0.5)
fig6 = plt.gcf()
fig6.gca().add_artist(centre_circle)

#legend
import matplotlib.patches as mpatches
pur = mpatches.Patch(color='#0a0e77', label='Yes')
pin = mpatches.Patch(color='#9e0723', label='No')
plt.legend(handles=[pur, pin], loc='upper left')

plt.axis('equal')
plt.tight_layout()
plt.show()
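
The hard-coded `sizes` and `sizes_attrition` lists above can instead be derived from a cross-tabulation, which keeps the outer and inner rings in the same order. This is a sketch on hypothetical data; `toy` stands in for `df[['Department', 'Attrition']]`.

```python
import pandas as pd

# Hypothetical toy data standing in for df[['Department', 'Attrition']]
toy = pd.DataFrame({
    "Department": ["Sales"] * 5 + ["HR"] * 3,
    "Attrition": ["Yes", "No", "No", "No", "Yes", "No", "Yes", "No"],
})

# Rows: departments (sorted alphabetically); columns fixed to Yes, No
table = pd.crosstab(toy["Department"], toy["Attrition"])[["Yes", "No"]]

labels = table.index.tolist()                         # outer ring labels
sizes = table.sum(axis=1).tolist()                    # outer ring: employees per department
sizes_attrition = table.to_numpy().ravel().tolist()   # inner ring: Yes, No for each department

print(labels, sizes, sizes_attrition)
```

The three lists can be passed to `plt.pie` exactly as in the cell above, and they stay correct if the data changes.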


## Gender Analysis

In [None]:
gda = df[['Gender', 'DistanceFromHome', 'Attrition']]
gda.head()

In [None]:
gda['Gender'].value_counts()

In [None]:
gda['Gender'].value_counts().plot(kind='pie',
                            figsize=(15, 6),
                            autopct='%1.1f%%', 
                            startangle=90,    
                            shadow=True,       
                            labels=None)   
plt.axis('equal') 
plt.legend(labels=['Male', 'Female'], loc='upper left')

In [None]:
fma = gda.groupby("Gender")['Attrition'].value_counts(normalize=False).unstack()
fma = fma.transpose()
fma

In [None]:
labels = ['Male', 'Female']
sizes = [882,588]
labels_attrition = ['Yes','No','Yes','No']
sizes_attrition = [150,732,87,501]
colors = ['#ff6666', '#ffcc99']

colors_attrition = ['#c2c2f0','#ffb3e6', '#c2c2f0','#ffb3e6']
 
# Plot
plt.pie(sizes, labels=labels, colors=colors, startangle=90,frame=True)
plt.pie(sizes_attrition,colors=colors_attrition,radius=0.75,startangle=90)
centre_circle = plt.Circle((0,0),0.5,color='black', fc='white',linewidth=0.5)
fig6 = plt.gcf()
fig6.gca().add_artist(centre_circle)

#legend
import matplotlib.patches as mpatches
pur = mpatches.Patch(color='#c2c2f0', label='Yes')
pin = mpatches.Patch(color='#ffb3e6', label='No')
plt.legend(handles=[pur, pin], loc='center')

plt.axis('equal')
plt.tight_layout()
plt.show()

## Analysis of Marital Status

In [None]:
ms = df[['MaritalStatus', 'Attrition']]
ms.head()

In [None]:
ms['MaritalStatus'].value_counts()

In [None]:
ms['MaritalStatus'].value_counts().plot(kind='pie',
                            figsize=(15, 6),
                            autopct='%1.1f%%', 
                            startangle=90,    
                            shadow=True,       
                            labels=None)   
plt.axis('equal') 
plt.legend(labels=['Married', 'Single', 'Divorced'], loc='upper left') 

In [None]:
msa = ms.groupby("MaritalStatus")['Attrition'].value_counts(normalize=False).unstack()
msa = msa.transpose()
msa

In [None]:
labels = ['Married', 'Single', 'Divorced']
sizes = [673, 470, 327]
labels_attrition = ['Yes','No','Yes','No','Yes','No']
sizes_attrition = [84,589,120,350,33,294]
colors = ['#ff6666', '#ffcc99', '#99ff99']

colors_attrition = ['#c2c2f0','#ffb3e6', '#c2c2f0','#ffb3e6', '#c2c2f0','#ffb3e6']
 
# Plot
plt.pie(sizes, labels=labels, colors=colors, startangle=90,frame=True)
plt.pie(sizes_attrition,colors=colors_attrition,radius=0.75,startangle=90)
centre_circle = plt.Circle((0,0),0.5,color='black', fc='white',linewidth=0.5)
fig6 = plt.gcf()
fig6.gca().add_artist(centre_circle)

#legend
import matplotlib.patches as mpatches
pur = mpatches.Patch(color='#c2c2f0', label='Yes')
pin = mpatches.Patch(color='#ffb3e6', label='No')
plt.legend(handles=[pur, pin], loc='center')

plt.axis('equal')
plt.tight_layout()
plt.show()

In [None]:
# pairplot creates its own figure, so a separate plt.figure() call has no effect on it
sns.pairplot(df)

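## Correlation - To check the collinearity of all independent continuous variables.

In [None]:
# numeric_only=True skips the text columns, which are label-encoded further below;
# recent pandas versions raise an error if they are included
df.corr(numeric_only=True)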

## Label Encoding- To convert categorical columns into numerical ones so that they can be fitted by machine learning models which only take numerical data.

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
# Select the categorical columns for label encoding
categorical_cols = ["Attrition","BusinessTravel", "Department", "EducationField", "Gender", "JobRole", "MaritalStatus", "OverTime"]

# Perform label encoding on the categorical columns
label_encoder = LabelEncoder()
for col in categorical_cols:
    df[col] = label_encoder.fit_transform(df[col])
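
`LabelEncoder` assigns integer codes in sorted order of the labels, and refitting one encoder on each column overwrites the mapping learned for the previous column. A possible variation, sketched here on hypothetical data, keeps one fitted encoder per column so each mapping can be inspected or reversed later. The `encoders` dictionary is an illustrative name, not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy columns (values are illustrative only)
toy = pd.DataFrame({
    "Attrition": ["Yes", "No", "No", "Yes"],
    "Gender": ["Male", "Female", "Male", "Male"],
})

encoders = {}   # one fitted encoder per column
for col in toy.columns:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

# classes_ lists the original labels in code order: position 0 is code 0, and so on
for col, enc in encoders.items():
    print(col, dict(zip(enc.classes_, range(len(enc.classes_)))))

# The stored encoder can turn the codes back into the original labels
original_gender = encoders["Gender"].inverse_transform(toy["Gender"])
print(original_gender)
```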

In [None]:
# Calculate the correlation matrix
correlation_matrix = df.corr()

## Heatmap-To visualise how well features correlate with each other.

In [None]:
plt.figure(figsize=(40, 20))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.show()

# To check the datatype and count of non-null values in the data after Label Encoding
* For Continuous - int64, int32, float64
* For Categorical - object

In [None]:
df.info()

## Checking top five rows of the dataset

In [None]:
df.head()

# Splitting the data into training set and testing set to build and predict the model using Machine Learning algorithms.

In [None]:
from sklearn.model_selection import train_test_split

### Taking independent variables in X

In [None]:
X = df.drop(['Attrition'], axis=1)

X.head()

### Checking all column names in X

In [None]:
X.columns

### Taking target variable in y

In [None]:
y = df['Attrition']

y.head()

### Splitting the data into train and test

In [None]:
# Splitting the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, test_size=0.3, random_state=100)

### Checking the shape of training and testing sets.

In [None]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape
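
With an imbalanced target, a plain random split can leave the train and test sets with slightly different class proportions. `train_test_split` accepts a `stratify` argument that keeps the proportions equal. The sketch below uses hypothetical data and is an optional variation, not what the cell above does.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 20% positives
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({"feature": rng.normal(size=100)})
y_toy = pd.Series([1] * 20 + [0] * 80)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=100, stratify=y_toy
)

# Share of positives in each part; both stay close to the overall 0.20
print(y_tr.mean(), y_te.mean())
```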

# Scaling-To convert data into common range of values.

In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
scaler = MinMaxScaler()

# Columns to rescale to the [0, 1] range
scale_cols = ['Age', 'BusinessTravel', 'DailyRate', 'Department', 'DistanceFromHome',
       'Education', 'EducationField', 'EnvironmentSatisfaction',
       'HourlyRate', 'JobInvolvement', 'JobLevel', 'JobRole',
       'JobSatisfaction', 'MaritalStatus', 'MonthlyIncome', 'MonthlyRate',
       'NumCompaniesWorked', 'PercentSalaryHike',
       'PerformanceRating', 'RelationshipSatisfaction',
       'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance',
       'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion',
       'YearsWithCurrManager']

X_train[scale_cols] = scaler.fit_transform(X_train[scale_cols])

### Checking top row of X-train after scaling

In [None]:
X_train.head(1)

In [None]:
# Reuse the scaler fitted on X_train and call transform, not fit_transform,
# so that the test set is scaled with the training minima and maxima
scale_cols = ['Age', 'BusinessTravel', 'DailyRate', 'Department', 'DistanceFromHome',
       'Education', 'EducationField', 'EnvironmentSatisfaction',
       'HourlyRate', 'JobInvolvement', 'JobLevel', 'JobRole',
       'JobSatisfaction', 'MaritalStatus', 'MonthlyIncome', 'MonthlyRate',
       'NumCompaniesWorked', 'PercentSalaryHike',
       'PerformanceRating', 'RelationshipSatisfaction',
       'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance',
       'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion',
       'YearsWithCurrManager']

X_test[scale_cols] = scaler.transform(X_test[scale_cols])

### Checking top row of X_test after scaling.

In [None]:
X_test.head(1)
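
Min-max scaling maps a value x to (x - min) / (max - min). The minimum and maximum are normally learned from the training data only and then reused for the test data, so both sets are measured on the same scale. A minimal sketch with hypothetical numbers:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy split
train = pd.DataFrame({"MonthlyIncome": [1000.0, 3000.0, 5000.0]})
test = pd.DataFrame({"MonthlyIncome": [2000.0, 6000.0]})

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)   # learns min=1000 and max=5000 from the training data
test_scaled = scaler.transform(test)         # reuses them: (x - 1000) / (5000 - 1000)

print(train_scaled.ravel())
print(test_scaled.ravel())
```

A test value above the training maximum (6000 here) maps to a number greater than 1. That is expected and shows that the test set did not influence the scaling.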

# LOGISTIC REGRESSION

In [None]:
from sklearn.linear_model import LogisticRegression

### Instantiate Logistic Regression model.

In [None]:
lr = LogisticRegression()

### Training data is used for model building.

In [None]:
lr.fit(X_train, y_train)

### Testing data is used for prediction

In [None]:
y_pred_logreg = lr.predict(X_test)

In [None]:
from sklearn.metrics import accuracy_score

### Calculate accuracy of the model.

In [None]:
LR = accuracy_score(y_test, y_pred_logreg)
print("Accuracy for logistic regression model is :", LR," and in percentage is :", LR*100,'%')
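
Accuracy needs a reference point when the classes are imbalanced. Since roughly 84% of employees stayed, a model that always predicts "stay" would already score about that much. The sketch below shows this baseline on hypothetical labels using scikit-learn's `DummyClassifier`; the arrays are made up and only mimic the imbalance.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels: 84 stayers (0) and 16 leavers (1)
y_toy = np.array([0] * 84 + [1] * 16)
X_toy = np.zeros((100, 1))   # features are irrelevant for this baseline

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
baseline_acc = accuracy_score(y_toy, baseline.predict(X_toy))

print(baseline_acc)   # 0.84, without learning anything about who leaves
```

A model is only informative about attrition to the extent that it beats this figure, which is why the confusion matrix and the sensitivity computed further below matter.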

In [None]:
# Libraries for Validation of models
from sklearn.metrics import confusion_matrix

### Create confusion matrix.

In [None]:
logistic_confusion_matrix = confusion_matrix(y_test, y_pred_logreg)
logistic_confusion_matrix

### Create labels for confusion matrix heatmap.

In [None]:
print(logistic_confusion_matrix)

group_names = ['True Neg','False Pos','False Neg','True Pos']
group_counts = ["{0:0.0f}".format(value) for value in
                logistic_confusion_matrix.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in
                     logistic_confusion_matrix.flatten()/np.sum(logistic_confusion_matrix)]
labels = [f"{v1}\n{v2}\n{v3}" for v1, v2, v3 in
          zip(group_names,group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)

# Create heatmap of confusion matrix
sns.heatmap(logistic_confusion_matrix, annot=labels, fmt='', cmap='Greens')
plt.xlabel('Predicted')
plt.ylabel('Actual')

### Function to plot ROC curve.

In [None]:
# ROC plotting helper and metric summary, reused for every model below

from sklearn.metrics import roc_curve, roc_auc_score

def plot_roc_curve(fpr, tpr):
    plt.plot(fpr, tpr, color='orange', lw=2, linestyle='--', label='Model')
    plt.plot([0, 1], [0, 1], color='darkblue', linestyle=':', label='Chance')
    plt.xlabel('False Positive Rate (1 - specificity)')
    plt.ylabel('True Positive Rate (sensitivity)')
    plt.title('Receiver Operating Characteristic (ROC) Curve')
    plt.legend()
    plt.show()

# Function to generate a summary from the true and predicted labels.

def get_summary(y_test, y_pred):
    # Confusion matrix: for 0/1 labels scikit-learn returns [[TN, FP], [FN, TP]],
    # with rows = actual class and columns = predicted class
    conf_mat = confusion_matrix(y_test, y_pred)
    TN, FP, FN, TP = conf_mat.ravel()

    # Calculate evaluation metrics (positive class = 1, i.e. Attrition "Yes")
    accuracy = (TP + TN) / (TP + TN + FP + FN)
    sensitivity = TP / (TP + FN)
    specificity = TN / (TN + FP)
    precision = TP / (TP + FP)
    recall = sensitivity  # recall is another name for sensitivity
    fScore = (2 * recall * precision) / (recall + precision)
    auc = roc_auc_score(y_test, y_pred)

    # Print summary
    print("Confusion Matrix:\n", conf_mat)
    print("Accuracy:", accuracy)
    print("Sensitivity :", sensitivity)
    print("Specificity :", specificity)
    print("Precision:", precision)
    print("Recall:", recall)
    print("F-score:", fScore)
    print("AUC:", auc)
    print("ROC curve:")

    # Plot ROC curve (built from hard 0/1 predictions, so it has a single interior point)
    fpr, tpr, thresholds = roc_curve(y_test, y_pred)
    plot_roc_curve(fpr, tpr)

### Generate summary for logistic regression model.

In [None]:
get_summary(y_test, y_pred_logreg)
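
`roc_curve` and `roc_auc_score` also accept continuous scores, and a ROC curve traced from predicted probabilities has one point per threshold instead of a single operating point. The sketch below illustrates the difference on synthetic data from `make_classification`. The data and the variable names are assumptions for illustration and do not reproduce the notebook's results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the attrition dataset), with a minority positive class
X_toy, y_toy = make_classification(n_samples=500, n_features=8, weights=[0.84], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

hard_labels = model.predict(X_te)           # 0/1 predictions at the default 0.5 cut-off
scores = model.predict_proba(X_te)[:, 1]    # estimated probability of class 1

fpr_hard, tpr_hard, _ = roc_curve(y_te, hard_labels)   # at most one point between the corners
fpr, tpr, thresholds = roc_curve(y_te, scores)         # one point per retained threshold

print(len(fpr_hard), len(fpr))
print(roc_auc_score(y_te, hard_labels), roc_auc_score(y_te, scores))
```

For models without `predict_proba`, such as `SVC` with default settings, `decision_function` provides scores that can be used the same way.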

# SVM

### Exploratory Data Analysis (EDA)

In [None]:
# Display the first few rows of the training data
X_train.head()

In [None]:
# Display the first few rows of the target variable in the training data
y_train.head()

In [None]:
from sklearn.svm import SVC

### Instantiate SVM model.

In [None]:
svc = SVC()

### Training the SVM model.

In [None]:
svc.fit(X_train, y_train)

### Predicting with the SVM model.

In [None]:
y_pred_svc = svc.predict(X_test)

### Calculate accuracy of the SVM model.

In [None]:
SVM = accuracy_score(y_test, y_pred_svc)

# Print accuracy in percentage
print("Accuracy for Support Vector Machine model is :", SVM," and in percentage is :", SVM*100,'%')

### Create confusion matrix for SVM model.

In [None]:
SVM_confusion_matrix = confusion_matrix(y_test, y_pred_svc)
SVM_confusion_matrix

### Create labels for confusion matrix heatmap.

In [None]:
print(SVM_confusion_matrix)

group_names = ['True Neg','False Pos','False Neg','True Pos']
group_counts = ["{0:0.0f}".format(value) for value in
                SVM_confusion_matrix.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in
                     SVM_confusion_matrix.flatten()/np.sum(SVM_confusion_matrix)]
labels = [f"{v1}\n{v2}\n{v3}" for v1, v2, v3 in
          zip(group_names,group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)

# Create heatmap of confusion matrix for SVM model

sns.heatmap(SVM_confusion_matrix, annot=labels, fmt='', cmap='Reds')
plt.xlabel('Predicted')
plt.ylabel('Actual')

### Generate summary for SVM model.

In [None]:
get_summary(y_test, y_pred_svc)

#### Overall, the code trains an SVM model, predicts with the model, evaluates the model's performance using accuracy and a confusion matrix, generates a summary of evaluation metrics, and displays a heatmap of the confusion matrix.

#  Naive Bayes Model

In [None]:
from sklearn.naive_bayes import GaussianNB

### Instantiate Gaussian Naive Bayes model.

In [None]:
gnb = GaussianNB()

### Training the Naive Bayes model.

In [None]:
gnb.fit(X_train, y_train)

### Predicting with the Naive Bayes model.

In [None]:
y_pred_gnb = gnb.predict(X_test)

### Calculate accuracy of the Naive Bayes model.

In [None]:
NB = accuracy_score(y_test,y_pred_gnb)

# Print accuracy in percentage

print("Accuracy for Naive Bayes model is :", NB," and in percentage is :", NB*100,'%')

### Create confusion matrix for Naive Bayes model.

In [None]:
gnb_confusion_matrix = confusion_matrix(y_test, y_pred_gnb)
gnb_confusion_matrix

### Create labels for confusion matrix heatmap.

In [None]:
print(gnb_confusion_matrix)

group_names = ['True Neg','False Pos','False Neg','True Pos']
group_counts = ["{0:0.0f}".format(value) for value in
                gnb_confusion_matrix.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in
                     gnb_confusion_matrix.flatten()/np.sum(gnb_confusion_matrix)]
labels = [f"{v1}\n{v2}\n{v3}" for v1, v2, v3 in
          zip(group_names,group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)

# Create heatmap of confusion matrix for Naive Bayes model

sns.heatmap(gnb_confusion_matrix, annot=labels, fmt='', cmap='icefire')
plt.xlabel('Predicted')
plt.ylabel('Actual')

### Generate summary for Naive Bayes model.

In [None]:
get_summary(y_test, y_pred_gnb)

#### Overall, the code trains a Naive Bayes model, predicts with the model, evaluates the model's performance using accuracy and a confusion matrix, generates a summary of evaluation metrics, and displays a heatmap.
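
The four cells of a binary confusion matrix are enough to derive the usual evaluation metrics by hand. The sketch below assumes scikit-learn's layout (rows are actual classes, columns are predicted classes, so `ravel()` yields TN, FP, FN, TP). The helper `metrics_from_cm` and the counts are hypothetical; this is not the notebook's `get_summary` function, and the numbers are not its results.

```python
import numpy as np

def metrics_from_cm(cm):
    """Derive accuracy, precision, recall and F1 from a 2x2 confusion matrix."""
    tn, fp, fn, tp = (int(v) for v in np.asarray(cm).ravel())
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Assumes at least one predicted positive and one actual positive,
    # otherwise these ratios would divide by zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up counts for illustration only
example_metrics = metrics_from_cm([[50, 10], [5, 35]])
for name, value in example_metrics.items():
    print(f"{name}: {value:.4f}")
```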

#  K-Nearest Neighbors (KNN) Model 

In [None]:
from sklearn.neighbors import KNeighborsClassifier

### Instantiate KNN model

In [None]:
knn = KNeighborsClassifier()

### Training the KNN model

In [None]:
knn.fit(X_train, y_train)

### Predicting with the KNN model

In [None]:
y_pred_knn = knn.predict(X_test)

### Calculate accuracy of the KNN model

In [None]:
KNN = accuracy_score(y_test, y_pred_knn)


# Print accuracy in percentage

print("Accuracy for K nearest neighbour model is :", KNN," and in percentage is :", KNN*100,'%')

### Create confusion matrix for KNN model.

In [None]:
knn_confusion_matrix = confusion_matrix(y_test, y_pred_knn)
knn_confusion_matrix

### Create labels for confusion matrix heatmap

In [None]:
print(knn_confusion_matrix)

group_names = ['True Neg','False Pos','False Neg','True Pos']
group_counts = ["{0:0.0f}".format(value) for value in
               knn_confusion_matrix.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in
                     knn_confusion_matrix.flatten()/np.sum(knn_confusion_matrix)]
labels = [f"{v1}\n{v2}\n{v3}" for v1, v2, v3 in
          zip(group_names,group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)

# Create heatmap of confusion matrix for KNN model

sns.heatmap(knn_confusion_matrix, annot=labels, fmt='', cmap='coolwarm')
plt.xlabel('Predicted')
plt.ylabel('Actual')

### Generate summary for KNN model.

In [None]:
get_summary(y_test, y_pred_knn)

#### Overall, the code trains a KNN model, predicts with the model, evaluates the model's performance using accuracy and a confusion matrix, generates a summary of evaluation metrics, and displays a heatmap.
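
With its default settings, `KNeighborsClassifier()` classifies each row by a majority vote among its 5 nearest training rows under Euclidean distance. The sketch below reproduces that idea in plain NumPy on a made-up toy dataset. The function `knn_predict` and the toy arrays are illustrative assumptions, not part of the notebook, and the sketch ignores the tie-breaking and performance details of the library implementation.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=5):
    """Predict the class of x_new by majority vote among its k nearest neighbours."""
    # Euclidean distance from x_new to every training row
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training rows
    nearest = np.argsort(distances)[:k]
    # Count the class labels among those neighbours and return the most common one
    votes = np.bincount(y_train[nearest])
    return int(np.argmax(votes))

# Toy data: two well-separated clusters with integer class labels 0 and 1
X_toy = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred_near_origin = knn_predict(X_toy, y_toy, np.array([0.5, 0.5]), k=3)
pred_far_corner = knn_predict(X_toy, y_toy, np.array([5.5, 5.5]), k=3)
print(pred_near_origin, pred_far_corner)
```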

# Decision Tree Model

In [None]:
from sklearn.tree import DecisionTreeClassifier

### Instantiate Decision Tree Classifier

In [None]:
dtree = DecisionTreeClassifier()

### Training the Decision Tree model

In [None]:
dtree.fit(X_train, y_train)

### Predicting with the Decision Tree model

In [None]:
y_pred_dtree = dtree.predict(X_test)

### Calculate accuracy of the Decision Tree model

In [None]:
DT = accuracy_score(y_test, y_pred_dtree) 

# Print accuracy in percentage

print("Accuracy for decision tree model is :", DT," and in percentage is :", DT*100,'%')

### Create confusion matrix for Decision Tree model

In [None]:
dtree_confusion_matrix = confusion_matrix(y_test, y_pred_dtree)
dtree_confusion_matrix

### Create labels for confusion matrix heatmap

In [None]:
print(dtree_confusion_matrix)

group_names = ['True Neg','False Pos','False Neg','True Pos']
group_counts = ["{0:0.0f}".format(value) for value in
               dtree_confusion_matrix.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in
                     dtree_confusion_matrix.flatten()/np.sum(dtree_confusion_matrix)]
labels = [f"{v1}\n{v2}\n{v3}" for v1, v2, v3 in
          zip(group_names,group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)


# Create heatmap of confusion matrix for Decision Tree model

sns.heatmap(dtree_confusion_matrix, annot=labels, fmt='', cmap='Spectral')
plt.xlabel('Predicted')
plt.ylabel('Actual')

### Generate summary for Decision Tree model

In [None]:
get_summary(y_test, y_pred_dtree)

#### Overall, the code trains a Decision Tree model, predicts with the model, evaluates its performance using accuracy and a confusion matrix, generates a summary of evaluation metrics, and displays a heatmap.
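
`DecisionTreeClassifier()` scores candidate splits with the Gini impurity by default. The sketch below shows that calculation on made-up label lists. The helpers `gini` and `split_impurity` are hypothetical names introduced for illustration and are not scikit-learn functions.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    labels = np.asarray(labels)
    if labels.size == 0:
        return 0.0
    proportions = np.bincount(labels) / labels.size
    return float(1.0 - np.sum(proportions ** 2))

def split_impurity(left, right):
    """Size-weighted average impurity of the two child nodes of a split."""
    n_left, n_right = len(left), len(right)
    n_total = n_left + n_right
    return (n_left / n_total) * gini(left) + (n_right / n_total) * gini(right)

# Made-up labels: a parent node and one candidate split of it
parent = [0, 0, 0, 0, 1, 1]
left, right = [0, 0, 0], [0, 1, 1]

parent_impurity = gini(parent)
child_impurity = split_impurity(left, right)
print(f"parent: {parent_impurity:.4f}, after split: {child_impurity:.4f}")
```

A split is useful when the weighted impurity of the children is lower than the impurity of the parent, as it is in this toy case.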

# Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

### Instantiate Random Forest Classifier.

In [None]:
rfc = RandomForestClassifier()

### Training the Random Forest model.

In [None]:
rfc.fit(X_train, y_train)

### Predicting with the Random Forest model.

In [None]:
y_pred_rfc = rfc.predict(X_test)

### Calculate accuracy of the Random Forest model.

In [None]:
RF = accuracy_score(y_test, y_pred_rfc)

# Print accuracy in percentage
print("Accuracy for random forest model is :", RF," and in percentage is :", RF*100,'%')

### Create confusion matrix for Random Forest model.

In [None]:
RandomForest_confusion_matrix = confusion_matrix(y_test, y_pred_rfc)
RandomForest_confusion_matrix

### Create labels for confusion matrix heatmap.

In [None]:
print(RandomForest_confusion_matrix)

group_names = ['True Neg','False Pos','False Neg','True Pos']
group_counts = ["{0:0.0f}".format(value) for value in
                RandomForest_confusion_matrix.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in
                     RandomForest_confusion_matrix.flatten()/np.sum(RandomForest_confusion_matrix)]
labels = [f"{v1}\n{v2}\n{v3}" for v1, v2, v3 in
          zip(group_names,group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)

# Create heatmap of confusion matrix for Random Forest model

sns.heatmap(RandomForest_confusion_matrix, annot=labels, fmt='', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')

### Generate summary for Random Forest model

In [None]:
get_summary(y_test, y_pred_rfc)

#### Overall, the code trains a Random Forest model, predicts with the model, evaluates its performance using accuracy and a confusion matrix, generates a summary of evaluation metrics, and displays a heatmap.
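
A fitted random forest exposes `feature_importances_`, which is one way to see which inputs the model relied on. The sketch below uses synthetic data from `make_classification` rather than the attrition dataset, so the feature names and values are placeholders. It also sets `random_state`, which the `RandomForestClassifier()` call above leaves unset, so that repeated runs give the same forest.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (NOT the attrition dataset)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=3, random_state=0
)

# Fixing random_state makes the fitted forest reproducible across runs
demo_rfc = RandomForestClassifier(n_estimators=50, random_state=0)
demo_rfc.fit(X_demo, y_demo)

# Importances are normalised to sum to 1; list them from most to least important
importances = demo_rfc.feature_importances_
ranking = np.argsort(importances)[::-1]
for idx in ranking:
    print(f"feature_{idx}: {importances[idx]:.3f}")
```

Applied to this notebook, the same attribute would be read from `rfc` and, assuming `X_train` is a DataFrame, paired with `X_train.columns`.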

### Accuracy scores summary

In [None]:
LR = accuracy_score(y_test, y_pred_logreg)
SVM = accuracy_score(y_test, y_pred_svc)
NB = accuracy_score(y_test,y_pred_gnb)
KNN = accuracy_score(y_test, y_pred_knn)
DT = accuracy_score(y_test, y_pred_dtree) 
RF = accuracy_score(y_test, y_pred_rfc)

### Create a bar chart to compare the accuracy of all classification models

In [None]:
algorithms = ['LR','SVM','NB','KNN','DT', 'RF']
accuracies = [LR,SVM,NB,KNN,DT,RF]

In [None]:
c = ['red', 'yellow', 'pink', 'blue', 'orange','green']
plt.bar(algorithms, accuracies,color=c)
plt.xlabel('Algorithm')
plt.ylabel('Accuracy')
plt.title('Comparison of Classifier Accuracy')
plt.ylim([0, 1])  # Accuracies are fractions in [0, 1]; the bar labels show them as percentages
plt.xticks(rotation=45)

for i in range(len(algorithms)):
    plt.text(i, accuracies[i],f"{accuracies[i]*100:.2f}%", ha='center',va= 'bottom')
plt.show()

##### This visualization provides a comparison of the accuracy of different classification models, allowing you to easily identify the model with the highest accuracy.
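
The best model can also be picked programmatically instead of being read off the chart. A minimal sketch follows, where `rank_models` is a hypothetical helper and the names and scores are placeholders, not the notebook's results.

```python
def rank_models(names, scores):
    """Return (name, score) pairs sorted from highest to lowest score."""
    return sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)

# Placeholder values for illustration only
placeholder_names = ['model_a', 'model_b', 'model_c']
placeholder_scores = [0.70, 0.90, 0.80]

model_ranking = rank_models(placeholder_names, placeholder_scores)
best_name, best_score = model_ranking[0]
print(f"Best: {best_name} ({best_score * 100:.2f}%)")
```

In this notebook the call would be `rank_models(algorithms, accuracies)`.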

# Recommendations

## The recommendations below are based on the key findings and are aimed at reducing the attrition rate.
### 1. Age:
#### - Implement strategies to address the specific needs and career aspirations of employees across different age groups.
#### - This can include offering targeted development opportunities, mentorship programs, and flexible work arrangements to support work-life balance.
### 2. Compensation:
#### - Regularly review and benchmark compensation packages to ensure they are competitive in the market.
#### - Consider incorporating performance-based incentives and rewards to motivate employees and recognize their contributions.
### 3. Job experience:
#### - Provide opportunities for career advancement, skill development, and cross-functional training.
#### - Establish clear career paths and provide regular feedback and performance evaluations to support employee growth and engagement.
### 4. Specific job-related variables:
#### - Tailor retention strategies based on different job roles and responsibilities.
#### - This can include improving job satisfaction, providing challenging assignments, and fostering a positive work environment.
### 5. Job-related aspects:
#### - Enhance employee engagement and job satisfaction by offering a supportive work environment.
#### - Provide opportunities for professional development, promote a culture of continuous learning, and ensure fair and transparent processes for promotions and career growth.
### 6. Work-related factors:
#### - Focus on improving factors such as environment satisfaction, job involvement, job satisfaction, work-life balance, and managing overtime demands.
#### - Conduct regular employee surveys to understand their concerns and feedback, and take proactive measures to address any identified areas of improvement.
### 7. Overall:
#### - Foster a positive organizational culture that values employee well-being, work-life balance, and growth opportunities.
#### - Encourage open communication, provide avenues for feedback and suggestions, and regularly evaluate and refine retention strategies based on employee feedback and changing needs.

# Conclusion :-

#### The LOGISTIC REGRESSION model has the highest accuracy among the algorithms compared (SUPPORT VECTOR MACHINE, NAIVE BAYES, K NEAREST NEIGHBOUR, DECISION TREE and RANDOM FOREST). On this measure, it is the model best suited to predicting employee attrition.

