## Problem statement
Within the context of human resources (HR), attrition is a reduction in the workforce caused by retirement or resignation. It is a serious problem for organizations around the world: attrition is economically damaging because replacement employees must be hired and then trained, both at a cost. High rates of attrition also damage the company's brand value.
 
The dataset belongs to a very fast-growing company that has seen several employees leave in the last 3 years. The company’s HR team has always been reactive to attrition, but it now wants to be proactive and predict employee attrition using the data it has in hand.
 

 




## Objective
The goal is to predict whether an employee will leave the company, based on the variables given in the dataset.

The clip() function trims values at both ends of an interval. Specifically, values smaller than the lower threshold are set to the lower threshold, and values larger than the upper threshold are set to the upper threshold. This method is particularly useful for handling outliers in data preprocessing steps.
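
As a minimal sketch of how clipping can cap outliers, the example below applies `clip()` to a small, made-up pandas Series and uses the common 1.5 × IQR rule to choose the thresholds. The values and variable names are illustrative assumptions, not taken from the attrition dataset.

```python
import pandas as pd

# Hypothetical toy values: 100 is an obvious outlier.
s = pd.Series([10, 12, 11, 13, 12, 100])

# Quartiles and interquartile range of the toy data.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1

# Thresholds from the 1.5 * IQR rule.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values below `lower` become `lower`; values above `upper` become `upper`.
clipped = s.clip(lower=lower, upper=upper)
print(clipped.tolist())
```

Only the outlier is changed here: it is pulled down to the upper threshold, while the remaining values fall inside the interval and stay as they are.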

In [2]:
### Importing libraries

import os 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()

import warnings
warnings.filterwarnings('ignore')

from sklearn.preprocessing import OneHotEncoder,StandardScaler
from sklearn.model_selection import train_test_split

from sklearn.impute import SimpleImputer
from imblearn.over_sampling import SMOTE
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.base import BaseEstimator,TransformerMixin

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

from sklearn.metrics import classification_report,accuracy_score

In [8]:
## PIPELINE

# steps: handling missing values, outlier treatment, imbalance treatment, encoding, feature scaling

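class OutlierClipper(BaseEstimator,TransformerMixin):
    """Clip each column to [Q1 - factor*IQR, Q3 + factor*IQR], with bounds learned in fit()."""
    def __init__(self,factor = 1.5):
        self.factor = factor

    def fit(self, x,y = None):
        x_df  = pd.DataFrame(x)
        # per-column quartiles and interquartile range of the data passed to fit()
        q1 = x_df.quantile(0.25)
        q3 = x_df.quantile(0.75)
        iqr = q3 - q1
        # trailing underscore: scikit-learn convention for attributes learned during fit
        self.lower_bound_ = q1 - self.factor * iqr
        self.upper_bound_ = q3 + self.factor * iqr
        return self

    def transform(self, x,y=None):
        x_df = pd.DataFrame(x)
        # axis = 1 aligns the per-column bounds with the DataFrame's columns
        x_clipped = x_df.clip(lower = self.lower_bound_,upper = self.upper_bound_, axis = 1)
        return x_clipped.values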
                                                
                                                
## Load the data
                                                
file_path = "C:\\Users\\Hi\\Downloads\\Attrition.csv"                                                
df = pd.read_csv(file_path)                                                
                                                
## split the data into dependent and independent variables
                                                
x = df.drop("Attrition",axis = 1)
y = df['Attrition'] 
                                                
## splitting the data into train and test data
                                                
X_train,X_test,y_train,y_test = train_test_split(x,y,test_size = 0.2,random_state = 42)
                                                
                                                
## preprocessing
                                                
numerical_cols = x.select_dtypes(exclude = ["object"]).columns                                               
categorical_cols = x.select_dtypes(include = ['object']).columns                                                

## create transformers for categorical and numerical columns

numerical_transformer = Pipeline(steps =[
    ("imputer" , SimpleImputer(strategy ="median")),
    ("outlierclipper", OutlierClipper()),
    ("scaler" ,StandardScaler())
]) 
                                                
categorical_transformer = Pipeline(steps=[
    ("imputer",SimpleImputer(strategy = "most_frequent")),
    ("encoder",OneHotEncoder(handle_unknown = 'ignore'))
]) 
                                                
                                                
## create preprocessor using column transformer
                                                
preprocessor= ColumnTransformer(transformers=[
    ("num",numerical_transformer,numerical_cols),
    ("cat",categorical_transformer,categorical_cols)
])                                                

## define the pipelines with SMOTE oversampling
                                                
pipeline_1 = ImbPipeline(steps=[
    ("preprocessor",preprocessor),
    ("smote",SMOTE(random_state = 42)),
    ("classifier1",LogisticRegression(random_state=42))
])  
                                                
pipeline_2 = ImbPipeline(steps=[
    ("preprocessor",preprocessor),
    ("smote",SMOTE(random_state = 42)),
    ("classifier2",RandomForestClassifier(random_state=42))
])                                                                                                

pipeline_3 = ImbPipeline(steps=[
    ("preprocessor",preprocessor),
    ("smote",SMOTE(random_state = 42)),
    ("classifier3",SVC(random_state=42))
])                                                  

pipeline_4 = ImbPipeline(steps=[
    ("preprocessor",preprocessor),
    ("smote",SMOTE(random_state = 42)),
    ("classifier4",KNeighborsClassifier())
])                                                  
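
To make the preprocessing step more concrete, here is a hedged sketch that runs a similar `ColumnTransformer` on a tiny, hypothetical DataFrame. The column names `Age`, `MonthlyIncome` and `Department`, their values, and the `toy_*` names are assumptions for illustration only. The outlier-clipping step is left out to keep the sketch short.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; not the real attrition dataset.
toy = pd.DataFrame({
    "Age": [25.0, 32.0, np.nan, 41.0],
    "MonthlyIncome": [3000.0, 5200.0, 4100.0, 8800.0],
    "Department": ["Sales", "HR", "Sales", "HR"],
})

num_cols = ["Age", "MonthlyIncome"]
cat_cols = ["Department"]

toy_preprocessor = ColumnTransformer(transformers=[
    # numeric columns: fill missing values with the median, then standardize
    ("num", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]), num_cols),
    # categorical columns: fill missing values with the mode, then one-hot encode
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OneHotEncoder(handle_unknown="ignore")),
    ]), cat_cols),
])

toy_transformed = toy_preprocessor.fit_transform(toy)

# 2 scaled numeric columns + 1 one-hot column per distinct department
print(toy_transformed.shape)
```

The transformed matrix has one row per employee and one column per numeric feature plus one per category level, which is the form in which SMOTE and the classifiers receive the data.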
                                                

In [9]:
## train the model
pipeline_1.fit(X_train,y_train)
pipeline_2.fit(X_train,y_train)
pipeline_3.fit(X_train,y_train)
pipeline_4.fit(X_train,y_train)

In [10]:
## make prediction
y_pred1 =pipeline_1.predict(X_test)
y_pred2 =pipeline_2.predict(X_test)
y_pred3 =pipeline_3.predict(X_test)
y_pred4 =pipeline_4.predict(X_test)

In [11]:
## Evaluate the model

print("*********LOGISTIC REGRESSION**********")

print("*********Classification report**********")
print(classification_report(y_test,y_pred1))

print("accuracy_score :",accuracy_score(y_test,y_pred1))

*********LOGISTIC REGRESSION**********
*********Classification report**********
              precision    recall  f1-score   support

          No       0.92      0.80      0.86       255
         Yes       0.31      0.56      0.40        39

    accuracy                           0.77       294
   macro avg       0.61      0.68      0.63       294
weighted avg       0.84      0.77      0.80       294

accuracy_score : 0.7721088435374149
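
As a quick check on how the columns of the report relate to each other, the sketch below recomputes the `Yes` F1-score, the macro-average recall, and the number of correct predictions from the logistic regression figures printed above. The inputs are the rounded values from the report, so the results only match to two decimals. The variable names are illustrative.

```python
# Rounded values copied from the logistic regression report above.
precision_yes, recall_yes = 0.31, 0.56
recall_no = 0.80
accuracy = 0.7721088435374149
n_test = 294  # total support

# F1 is the harmonic mean of precision and recall.
f1_yes = 2 * precision_yes * recall_yes / (precision_yes + recall_yes)

# Macro average: unweighted mean over the two classes.
macro_recall = (recall_no + recall_yes) / 2

# Accuracy times the number of test rows gives the count of correct predictions.
n_correct = round(accuracy * n_test)

print(round(f1_yes, 2), round(macro_recall, 2), n_correct)
```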


In [12]:

print("*********RANDOM FOREST**********")

print("*********Classification report**********")
print(classification_report(y_test,y_pred2))

print("accuracy_score :",accuracy_score(y_test,y_pred2))

*********RANDOM FOREST**********
*********Classification report**********
              precision    recall  f1-score   support

          No       0.89      0.97      0.93       255
         Yes       0.50      0.21      0.29        39

    accuracy                           0.87       294
   macro avg       0.69      0.59      0.61       294
weighted avg       0.84      0.87      0.84       294

accuracy_score : 0.8673469387755102


In [13]:

print("*********** SVM ************")

print("*********Classification report**********")
print(classification_report(y_test,y_pred3))

print("accuracy_score :",accuracy_score(y_test,y_pred3))

*********** SVM ************
*********Classification report**********
              precision    recall  f1-score   support

          No       0.92      0.94      0.93       255
         Yes       0.54      0.49      0.51        39

    accuracy                           0.88       294
   macro avg       0.73      0.71      0.72       294
weighted avg       0.87      0.88      0.87       294

accuracy_score : 0.8775510204081632


In [14]:

print("*********  KNN  **********")

print("*********Classification report**********")
print(classification_report(y_test,y_pred4))

print("accuracy_score :",accuracy_score(y_test,y_pred4))

*********  KNN  **********
*********Classification report**********
              precision    recall  f1-score   support

          No       0.93      0.62      0.74       255
         Yes       0.22      0.69      0.33        39

    accuracy                           0.63       294
   macro avg       0.57      0.65      0.53       294
weighted avg       0.83      0.63      0.69       294

accuracy_score : 0.6258503401360545
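
Because the test set is imbalanced (255 `No` rows against 39 `Yes` rows), it can help to compare each reported accuracy with a trivial baseline that always predicts `No`. The sketch below does that arithmetic using only the supports and accuracies printed above. The variable names are illustrative.

```python
# Class supports from the reports above.
support_no, support_yes = 255, 39
total = support_no + support_yes

# Accuracy of a trivial model that always predicts the majority class "No".
baseline_accuracy = support_no / total

# Accuracies as printed above.
reported = {
    "Logistic Regression": 0.7721088435374149,
    "Random Forest": 0.8673469387755102,
    "SVM": 0.8775510204081632,
    "KNN": 0.6258503401360545,
}

print(f"always-'No' baseline: {baseline_accuracy:.4f}")
for name, acc in reported.items():
    print(f"{name}: {acc:.4f} ({acc - baseline_accuracy:+.4f} vs baseline)")
```

On these numbers only the SVM edges past the baseline on accuracy, and the random forest matches it exactly, so the recall and F1-score of the `Yes` class are probably the more informative columns when comparing the four models.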
