# Employee Retention

People are always starting new jobs, retiring, or moving between employers. Because retaining talent is essential to an organization's success, understanding what drives employees to stay or leave is key to keeping an organization healthy.

The Kaggle Employee dataset describes the employees of one company: their educational background, work history, demographics, and employment-related factors. It has been anonymized to protect privacy while still providing useful insight into the workforce.

Source: https://www.kaggle.com/datasets/tawfikelmetwally/employee-dataset/data

This dataset contains 8 features:
1. **Education**: Level of education (Bachelors, Masters, or PHD)
2. **JoiningYear**: The year the employee joined the company
3. **City**: The city where the employee is based
4. **PaymentTier**: Salary tier
5. **Age**: The age of the employee
6. **Gender**: Male or Female
7. **EverBenched**: Whether the employee has ever been benched (Yes or No)
8. **ExperienceInCurrentDomain**: Years of experience in the employee's current domain

The target column is **LeaveOrNot**, a binary column (0 or 1).

The following code imports the required libraries and loads the dataset.

```python
# Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from sklearn.svm import SVC
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import (
    RandomForestClassifier,
    GradientBoostingClassifier,
    VotingClassifier,
)
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    ConfusionMatrixDisplay,
    classification_report,
)
```


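```python
# Load the dataset. Forward slashes avoid the invalid "\d" and "\E" escape
# sequences of the Windows-style path and work on every operating system.
df = pd.read_csv('./data/Employee.csv')
df.head(10)
```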

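|   | Education | JoiningYear | City | PaymentTier | Age | Gender | EverBenched | ExperienceInCurrentDomain | LeaveOrNot |
|---|-----------|-------------|------|-------------|-----|--------|-------------|---------------------------|------------|
| 0 | Bachelors | 2017 | Bangalore | 3 | 34 | Male | No | 0 | 0 |
| 1 | Bachelors | 2013 | Pune | 1 | 28 | Female | No | 3 | 1 |
| 2 | Bachelors | 2014 | New Delhi | 3 | 38 | Female | No | 2 | 0 |
| 3 | Masters | 2016 | Bangalore | 3 | 27 | Male | No | 5 | 1 |
| 4 | Masters | 2017 | Pune | 3 | 24 | Male | Yes | 2 | 1 |
| 5 | Bachelors | 2016 | Bangalore | 3 | 22 | Male | No | 0 | 0 |
| 6 | Bachelors | 2015 | New Delhi | 3 | 38 | Male | No | 0 | 0 |
| 7 | Bachelors | 2016 | Bangalore | 3 | 34 | Female | No | 2 | 1 |
| 8 | Bachelors | 2016 | Pune | 3 | 23 | Male | No | 1 | 0 |
| 9 | Masters | 2017 | New Delhi | 2 | 37 | Male | No | 2 | 0 |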


## Phase 1: Problem understanding

In the initial phase of our proposed data science project, we aim to understand the problem thoroughly and define clear, structured objectives. This step matters because it keeps us from solving the wrong problem. The key steps and considerations are:
1. **Clearly Enunciate the Project Objectives**:
   a. Analyze the distribution of educational qualifications among employees.
   b. Investigate how length of service varies across cities.
   c. Explore the correlation between PaymentTier and ExperienceInCurrentDomain.
   d. Determine the gender distribution among employees.
   e. Identify patterns in employees leaving the company (the LeaveOrNot target).
2. **Formulate the Problem**: Convert the project objectives into a problem that data science can solve. Here that is a binary classification task: predict LeaveOrNot from the eight features.
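The objectives above map onto a handful of pandas summaries. The following is a minimal sketch, not the project's final analysis: it retypes the ten rows shown in the `df.head(10)` output so it runs without the CSV file, and `SNAPSHOT_YEAR` is a hypothetical reference year (the dataset does not state when it was collected) used only to turn JoiningYear into a length of service.

```python
import pandas as pd

# The ten rows displayed by df.head(10), retyped so this sketch is self-contained.
# For the real analysis, use the full DataFrame loaded from Employee.csv instead.
sample = pd.DataFrame({
    "Education": ["Bachelors", "Bachelors", "Bachelors", "Masters", "Masters",
                  "Bachelors", "Bachelors", "Bachelors", "Bachelors", "Masters"],
    "JoiningYear": [2017, 2013, 2014, 2016, 2017, 2016, 2015, 2016, 2016, 2017],
    "City": ["Bangalore", "Pune", "New Delhi", "Bangalore", "Pune",
             "Bangalore", "New Delhi", "Bangalore", "Pune", "New Delhi"],
    "PaymentTier": [3, 1, 3, 3, 3, 3, 3, 3, 3, 2],
    "Gender": ["Male", "Female", "Female", "Male", "Male",
               "Male", "Male", "Female", "Male", "Male"],
    "ExperienceInCurrentDomain": [0, 3, 2, 5, 2, 0, 0, 2, 1, 2],
    "LeaveOrNot": [0, 1, 0, 1, 1, 0, 0, 1, 0, 0],
})

# Assumption: a hypothetical snapshot year, needed to compute length of service.
SNAPSHOT_YEAR = 2018

# a. Distribution of educational qualifications
education_counts = sample["Education"].value_counts()

# b. Mean length of service per city
sample["YearsOfService"] = SNAPSHOT_YEAR - sample["JoiningYear"]
service_by_city = sample.groupby("City")["YearsOfService"].mean()

# c. Rank correlation between payment tier and domain experience
#    (Spearman, because PaymentTier is an ordinal category)
tier_experience_corr = (
    sample[["PaymentTier", "ExperienceInCurrentDomain"]]
    .corr(method="spearman")
    .loc["PaymentTier", "ExperienceInCurrentDomain"]
)

# d. Gender distribution as proportions
gender_share = sample["Gender"].value_counts(normalize=True)

# e. Share of employees with LeaveOrNot == 1 per city
leave_rate_by_city = sample.groupby("City")["LeaveOrNot"].mean()

print(education_counts, service_by_city, gender_share, leave_rate_by_city, sep="\n\n")
print("Spearman correlation:", round(tier_experience_corr, 3))
```

Ten rows are far too few to draw conclusions; the point is only to show which summary answers which objective.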


## Phase 2: Data preparation

In this phase, we will prepare the data so that it is clean, consistent, and free from errors before analysis. We will investigate the following potential problems.
1. **Identify Outliers**: Outliers are data points that deviate significantly from the rest of the data. Depending on the nature of the data and the problem we are solving, we will decide how to handle them. Removing them is the simplest option, but transforming them or treating them as special cases may reveal important insights.
2. **Data Transformation and Standardization**: Some attributes are on different scales, so standardizing them keeps any single attribute from dominating scale-sensitive models.
3. **Reclassifying Categorical Variables**: Identify any categorical variables that need to be transformed or processed to make them suitable for analysis or modeling.
4. **Binning Numerical Variables**: Determine whether it is valuable to bin or categorize numerical variables, which can uncover patterns that are not visible in continuous data.
5. **Adding an Index Field**: Depending on the dataset, it may be useful to add an index field for each employee, which makes records easier to track.
6. **Handling Missing Data**: Determine a strategy for missing values, which may involve replacing them with estimated values or removing the affected records.
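One way the six preparation steps could look in pandas is sketched below. The six-row frame is invented for illustration (it borrows the dataset's column names but not its values), and the specific choices, such as median imputation, the 1.5 × IQR rule, and the age bin edges, are assumptions rather than decisions the project has made.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: values are invented so that every step has something to do
# (one missing Age, one implausible Age).
raw = pd.DataFrame({
    "Education": ["Bachelors", "Masters", "PHD", "Bachelors", "Masters", "Bachelors"],
    "Age": [24, 31, np.nan, 28, 95, 36],
    "EverBenched": ["No", "Yes", "No", "No", "No", "Yes"],
    "ExperienceInCurrentDomain": [0, 3, 5, 2, 4, 1],
})
prepared = raw.copy()

# 6. Missing data: handled first so the later steps see complete columns.
#    Median imputation is one option; dropping the row is another.
prepared["Age"] = prepared["Age"].fillna(prepared["Age"].median())

# 1. Outliers: flag values outside 1.5 * IQR instead of deleting them
q1, q3 = prepared["Age"].quantile([0.25, 0.75])
iqr = q3 - q1
prepared["AgeOutlier"] = ~prepared["Age"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# 2. Standardization: z-score so the column has mean 0 and unit variance
experience = prepared["ExperienceInCurrentDomain"]
prepared["ExperienceZ"] = (experience - experience.mean()) / experience.std(ddof=0)

# 3. Categorical variables: ordinal code for Education, 0/1 flag for EverBenched
prepared["EducationLevel"] = prepared["Education"].map(
    {"Bachelors": 0, "Masters": 1, "PHD": 2}
)
prepared["EverBenchedFlag"] = prepared["EverBenched"].map({"No": 0, "Yes": 1})

# 4. Binning: group ages into bands (bin edges are an arbitrary illustration)
prepared["AgeGroup"] = pd.cut(
    prepared["Age"], bins=[0, 25, 35, 120], labels=["<=25", "26-35", ">35"]
)

# 5. Index field: a stable 1-based identifier per record
prepared.insert(0, "EmployeeIndex", np.arange(1, len(prepared) + 1))

print(prepared)
```

Flagging the outlier rather than dropping it keeps the record available for the "special case" treatment mentioned in step 1.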


## Phase 3: Exploratory data analysis

The goal in this phase is to gain preliminary insight into the data through graphical exploration, which helps uncover patterns and visualize relationships among variables. The following are some common EDA steps.
1. **Explore Univariate Relationships**: Identify the relationship between each individual predictor and the target variable LeaveOrNot.
2. **Explore Multivariate Relationships**: Identify correlations between multiple attributes.
3. **Binning for Predictive Value Improvement**: Determine whether binning based on predictive value could improve model performance. For example, we can group employees by years of experience and analyze how experience relates to leaving the company.
4. **Derive New Variables**: Combine existing attributes to create new ones, which can provide additional information.
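Before any plotting, the four steps can be checked numerically. The sketch below again retypes a few columns from the ten rows shown by `df.head(10)`; the experience bands and the `ExperienceToAgeRatio` column are hypothetical illustrations, not variables the project has committed to.

```python
import numpy as np
import pandas as pd

# Selected columns from the ten rows shown by df.head(10), retyped so the
# sketch runs without the CSV file.
sample = pd.DataFrame({
    "PaymentTier": [3, 1, 3, 3, 3, 3, 3, 3, 3, 2],
    "Age": [34, 28, 38, 27, 24, 22, 38, 34, 23, 37],
    "ExperienceInCurrentDomain": [0, 3, 2, 5, 2, 0, 0, 2, 1, 2],
    "LeaveOrNot": [0, 1, 0, 1, 1, 0, 0, 1, 0, 0],
})

# 1. One predictor at a time against the target: leave rate per payment tier
leave_rate_by_tier = sample.groupby("PaymentTier")["LeaveOrNot"].mean()

# 2. Several attributes at once: pairwise correlation matrix
correlations = sample.corr()

# 3. Binning: group experience into bands, then compare leave rates per band
sample["ExperienceBand"] = pd.cut(
    sample["ExperienceInCurrentDomain"],
    bins=[-1, 1, 3, np.inf],
    labels=["0-1", "2-3", "4+"],
)
leave_rate_by_band = sample.groupby("ExperienceBand", observed=True)["LeaveOrNot"].mean()

# 4. Derived variable (hypothetical): domain experience relative to age
sample["ExperienceToAgeRatio"] = sample["ExperienceInCurrentDomain"] / sample["Age"]

print(leave_rate_by_tier, leave_rate_by_band, correlations.round(2), sep="\n\n")
```

The same grouped series can be passed to a bar chart once a pattern looks worth visualizing; with only ten rows the numbers here are illustrative, not evidence.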


## Phase 4: Setup

By this phase, we will have explored the data enough to understand the problem and the dataset. We now prepare the data for the candidate models by:
1. **Separating the data into train and test sets to be used for validation**: The split ratio depends on how many records the dataset has.
2. **Balancing the dataset**: If the classes of the target column are not represented equally, we will apply an up-sampling or weighting technique.
3. **Establishing baseline model performance**: Defining the minimum performance that any model must reach to be accepted.
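A minimal sketch of the three setup steps follows. It runs on synthetic data generated with NumPy (not the Employee dataset), and it makes several assumptions: an 80/20 stratified split, random up-sampling of the minority class with `sklearn.utils.resample` (SMOTE from `imblearn` would be an alternative), and a majority-class `DummyClassifier` as the baseline.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic stand-in data: column names follow the dataset, values are random.
rng = np.random.default_rng(42)
n = 500
data = pd.DataFrame({
    "Age": rng.integers(22, 42, size=n),
    "PaymentTier": rng.integers(1, 4, size=n),
    "ExperienceInCurrentDomain": rng.integers(0, 8, size=n),
})
data["LeaveOrNot"] = (rng.random(n) < 0.33).astype(int)  # imbalanced target

X = data.drop(columns="LeaveOrNot")
y = data["LeaveOrNot"]

# 1. Train/test split; stratify keeps the class ratio identical in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# 2. Balance the training set only, so the test set keeps the real class ratio
train = X_train.assign(LeaveOrNot=y_train)
class_counts = train["LeaveOrNot"].value_counts()
majority = train[train["LeaveOrNot"] == class_counts.idxmax()]
minority = train[train["LeaveOrNot"] == class_counts.idxmin()]
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)
train_balanced = pd.concat([majority, minority_upsampled]).sample(
    frac=1, random_state=42
)

# 3. Baseline: always predict the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_accuracy = accuracy_score(y_test, baseline.predict(X_test))

print("Balanced class counts:", train_balanced["LeaveOrNot"].value_counts().to_dict())
print("Baseline accuracy:", round(baseline_accuracy, 3))
```

Balancing after the split avoids leaking duplicated minority rows into the test set, which would inflate the scores reported later.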


## Phase 5: Modeling

With the dataset ready, we will feed it to different algorithms/models to uncover the relationships between the feature columns and the target. In this phase, we will focus on:
1. **Selecting and implementing algorithms/models**: At least three classification models.
2. **Making sure we meet the baseline performance**: Each model must surpass the baseline; otherwise it will be dropped.
3. **Fine-tuning model algorithms**: To achieve the best possible classification scores.
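The modeling loop could be sketched as below. The data is synthetic, with an invented rule linking the features to the target, and the choice of random forest, gradient boosting, and SVC simply mirrors the classifiers imported at the top of the notebook; the parameter grid is a small hypothetical example, not a recommended search space.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data with an invented signal (not the Employee dataset).
rng = np.random.default_rng(0)
n = 600
X = pd.DataFrame({
    "Age": rng.integers(22, 42, size=n),
    "PaymentTier": rng.integers(1, 4, size=n),
    "ExperienceInCurrentDomain": rng.integers(0, 8, size=n),
})
signal = 1.5 * (X["PaymentTier"] < 3) - 0.08 * (X["Age"] - 30)
y = (signal + rng.normal(0, 1, size=n) > 0.75).astype(int)

# The test part stays untouched here; it is reserved for the evaluation phase.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Baseline: cross-validated accuracy of always predicting the majority class
baseline_score = cross_val_score(
    DummyClassifier(strategy="most_frequent"), X_train, y_train, cv=5
).mean()

# 1. At least three classification models
candidates = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
    "svc": make_pipeline(StandardScaler(), SVC()),  # SVC is scale-sensitive
}
scores = {
    name: cross_val_score(model, X_train, y_train, cv=5).mean()
    for name, model in candidates.items()
}

# 2. Drop every model that does not beat the baseline
kept = {name: score for name, score in scores.items() if score > baseline_score}

# 3. Fine-tune one of the candidates with a small grid search
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    cv=3,
)
search.fit(X_train, y_train)
best_params = search.best_params_

print("Baseline:", round(baseline_score, 3))
print("Kept models:", {name: round(score, 3) for name, score in kept.items()})
print("Best random forest parameters:", best_params)
```

Comparing models with cross-validation on the training set keeps the test set unseen until the evaluation phase.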


## Phase 6: Evaluation

In this phase, we will use the test set to determine whether any of the models are good enough and select the best-performing one by:
1. Making sure it achieves the objectives from Phase 1.
2. Applying error analysis and weighing the cost of each type of error to find the lowest-cost model.
3. Finalizing the selection of models that will be deployed.
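The cost-based part of the error analysis can be illustrated with a small worked example. The labels and predictions below are made up, and the two cost constants are hypothetical: they assume that missing an employee who leaves (false negative) is five times as costly as a retention effort spent on someone who would have stayed (false positive).

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical unit costs per error type (assumptions, not project figures)
FALSE_POSITIVE_COST = 1  # retention effort spent on an employee who stays
FALSE_NEGATIVE_COST = 5  # an employee who leaves without being flagged

# Invented test labels and predictions from two hypothetical models
y_test = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
predictions = {
    "model_a": [0, 0, 0, 0, 0, 1, 1, 1, 0, 0],
    "model_b": [0, 0, 0, 1, 1, 1, 1, 1, 1, 0],
}

results = {}
for name, y_pred in predictions.items():
    # ravel() on a 2x2 confusion matrix yields tn, fp, fn, tp in that order
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "cost": int(fp * FALSE_POSITIVE_COST + fn * FALSE_NEGATIVE_COST),
    }

# Select the model with the lowest total error cost
best_model = min(results, key=lambda name: results[name]["cost"])

for name, metrics in results.items():
    print(name, metrics)
print("Lowest-cost model:", best_model)
```

Under these assumed costs, `model_b` is preferred even though `model_a` has the higher accuracy, which is why accuracy alone is not enough to pick the model to deploy.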


## Phase 7: Deployment

In this phase, we will deploy the application to production and make it available for others to use. We will also write up our findings as a publication paper that documents all seven phases and the exact performance figures for every model.
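One small building block of deployment is saving the selected model so that a separate application can load it. The sketch below assumes `joblib` for persistence and uses a throwaway model trained on random numbers; the file name `retention_model.joblib` is hypothetical, and a real deployment would also need an interface around the model.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Throwaway model on random data, standing in for the model selected in Phase 6
rng = np.random.default_rng(1)
X = rng.integers(0, 10, size=(100, 3))
y = (X[:, 0] > 4).astype(int)
model = RandomForestClassifier(n_estimators=20, random_state=1).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "retention_model.joblib"  # hypothetical file name
    joblib.dump(model, model_path)      # what the training side would do
    restored = joblib.load(model_path)  # what the serving side would do

# The restored model should reproduce the original predictions exactly
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print("Restored model matches original:", same_predictions)
```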