# Exploratory Analysis of Employee Performance and Attrition Trends

## Project Overview
This project explores the key factors that influence **employee performance**, **job satisfaction**, and **resignation behavior**. By performing **exploratory data analysis (EDA)** on an extended employee dataset, the project aims to identify:

- Patterns in performance across departments and job roles
- Impacts of factors such as **overtime**, **sick days**, **promotions**, and **training**
- Trends and signals that predict **attrition** or resignation
- Insights that could inform **employee retention** strategies

## 🎯 Problem Definition


To improve employee management and reduce attrition, organizations must understand what drives performance and turnover. This project attempts to answer:

- What are the key variables affecting employee performance?
- Which factors are linked to high or low satisfaction?
- What patterns precede resignations across teams and job roles?

This project aims to uncover the key factors influencing employee performance, satisfaction, and resignation through exploratory data analysis (EDA). By analyzing structured employee data (including salary, workload, team size, training, overtime, promotions, and resignation status), we seek to identify patterns that drive high or low performance and job satisfaction across roles and departments. The goal is to reveal insights into resignation behavior and provide actionable recommendations to improve employee retention and performance management.


## 📁 Dataset Selection

- **File Name**: `Extended_Employee_Performance_and_Productivity_Data.csv`
- **Data Type**: Structured tabular data
- **Data Size**: 100,000 rows x 20 columns
- **Source**: https://www.kaggle.com/datasets/mexwell/employee-performance-and-productivity-data
- **Content**: Salary, working hours, promotions, team structure, performance scores, resignation status, and related fields.


## Data Summary

This dataset contains 100,000 rows of data capturing key aspects of employee performance, productivity, and demographics in a corporate environment. It includes details related to the employee's job, work habits, education, performance, and satisfaction. The dataset is designed for various purposes such as HR analytics, employee churn prediction, productivity analysis, and performance evaluation.

| Column Name                | Description                                                                 |
|---------------------------|-----------------------------------------------------------------------------|
| **Employee_ID**           | Unique identifier for each employee.                                        |
| **Department**            | The department in which the employee works (e.g., Sales, HR, IT).          |
| **Gender**                | Gender of the employee (Male, Female, Other).                              |
| **Age**                   | Employee's age (between 22 and 60).                                        |
| **Job_Title**             | The role held by the employee (e.g., Manager, Analyst, Developer).         |
| **Hire_Date**             | The date the employee was hired.                                           |
| **Years_At_Company**      | Number of years the employee has been with the company.                    |
| **Education_Level**       | Highest educational qualification (High School, Bachelor, Master, PhD).    |
| **Performance_Score**     | Employee's performance rating (1 to 5 scale).                              |
| **Monthly_Salary**        | Monthly salary in USD, linked to role and performance.                     |
| **Work_Hours_Per_Week**   | Hours worked per week.                                                     |
| **Projects_Handled**      | Total number of projects handled.                                          |
| **Overtime_Hours**        | Total overtime hours in the past year.                                     |
| **Sick_Days**             | Number of sick days taken.                                                 |
| **Remote_Work_Frequency** | Percentage of time worked remotely (0%–100%).                              |
| **Team_Size**             | Number of people in the employee's team.                                   |
| **Training_Hours**        | Number of hours spent in training.                                         |
| **Promotions**            | Number of promotions received during tenure.                               |
| **Employee_Satisfaction_Score** | Satisfaction rating (1.0 to 5.0 scale).                              |
| **Resigned**              | Whether the employee has resigned (True/False).                            |


# 🧪 Step 1: Data Loading and Initial Overview

In this step, we will:
- Import the dataset using **Pandas**
- Examine its **structure**, **column types**, and **missing values**
- Generate **summary statistics** and preview the data


## 1.1 📥 Importing Required Libraries

In [2]:
import pandas as pd
import numpy as np

## 1.2 📂 Loading the Dataset

In [3]:
df = pd.read_csv(r"C:\Users\Muhsin\Downloads\archive\Extended_Employee_Performance_and_Productivity_Data.csv")
df

Unnamed: 0,Employee_ID,Department,Gender,Age,Job_Title,Hire_Date,Years_At_Company,Education_Level,Performance_Score,Monthly_Salary,Work_Hours_Per_Week,Projects_Handled,Overtime_Hours,Sick_Days,Remote_Work_Frequency,Team_Size,Training_Hours,Promotions,Employee_Satisfaction_Score,Resigned
0,1,IT,Male,55,Specialist,2022-01-19 08:03:05.556036,2,High School,5,6750.0,33,32,22,2,0,14,66,0,2.63,False
1,2,Finance,Male,29,Developer,2024-04-18 08:03:05.556036,0,High School,5,7500.0,34,34,13,14,100,12,61,2,1.72,False
2,3,Finance,Male,55,Specialist,2015-10-26 08:03:05.556036,8,High School,3,5850.0,37,27,6,3,50,10,1,0,3.17,False
3,4,Customer Support,Female,48,Analyst,2016-10-22 08:03:05.556036,7,Bachelor,2,4800.0,52,10,28,12,100,10,0,1,1.86,False
4,5,Engineering,Female,36,Analyst,2021-07-23 08:03:05.556036,3,Bachelor,2,4800.0,38,11,29,13,100,15,9,1,1.25,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99995,99996,Finance,Male,27,Technician,2022-12-07 08:03:05.556036,1,Bachelor,4,4900.0,55,46,5,3,75,16,48,2,1.28,False
99996,99997,IT,Female,36,Consultant,2018-07-24 08:03:05.556036,6,Master,5,8250.0,39,35,7,0,0,10,77,1,3.48,True
99997,99998,Operations,Male,53,Analyst,2015-11-24 08:03:05.556036,8,High School,2,4800.0,31,13,6,5,0,5,87,1,2.60,False
99998,99999,HR,Female,22,Consultant,2015-08-03 08:03:05.556036,9,High School,5,8250.0,35,43,10,1,75,2,31,1,3.10,False
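As an optional variation on the load step, `pd.read_csv` can parse `Hire_Date` while reading instead of leaving it as text. The sketch below is a minimal illustration, not the notebook's actual cell: it reads two of the rows shown above (with a reduced set of columns) from an in-memory string, and the names `sample_csv` and `sample_df` are placeholders chosen so the example runs without the CSV on disk.

```python
import io
import pandas as pd

# Two rows copied from the preview above, reduced to five columns.
# This in-memory string is a stand-in for the real CSV file.
sample_csv = io.StringIO(
    "Employee_ID,Department,Hire_Date,Monthly_Salary,Resigned\n"
    "1,IT,2022-01-19 08:03:05.556036,6750.0,False\n"
    "2,Finance,2024-04-18 08:03:05.556036,7500.0,False\n"
)

# parse_dates converts the listed columns to datetime values while reading.
sample_df = pd.read_csv(sample_csv, parse_dates=["Hire_Date"])

print(sample_df.dtypes)
```

With the real file, the same `parse_dates=["Hire_Date"]` argument could be passed alongside the file path.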


## 1.3 🔎 Dataset Overview


### 1.3.1 Dataset Shape
 
Let's check how many rows and columns the dataset contains.

In [8]:
print("Dataset contains:", df.shape[0], "rows and", df.shape[1], "columns")

Dataset contains: 100000 rows and 20 columns


**✅ Dataset contains: 100000 rows and 20 columns**

### 1.3.2 Column Data Types

We'll now review the data type of each column to:
- Distinguish categorical, numerical, and boolean columns
- Identify `object` columns that could be stored more efficiently as `category` or `datetime`

Any needed conversions will be done in the Task 2 file.


In [10]:
df.dtypes


Employee_ID                      int64
Department                      object
Gender                          object
Age                              int64
Job_Title                       object
Hire_Date                       object
Years_At_Company                 int64
Education_Level                 object
Performance_Score                int64
Monthly_Salary                 float64
Work_Hours_Per_Week              int64
Projects_Handled                 int64
Overtime_Hours                   int64
Sick_Days                        int64
Remote_Work_Frequency            int64
Team_Size                        int64
Training_Hours                   int64
Promotions                       int64
Employee_Satisfaction_Score    float64
Resigned                          bool
dtype: object

**`Department`, `Gender`, `Job_Title`, and `Education_Level` will be converted to `category`, and `Hire_Date` will be converted to `datetime`.**
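The planned conversions could look roughly like the sketch below. It is an illustration under stated assumptions, not the Task 2 code: `demo`, `category_cols`, and `converted` are hypothetical names, and the small frame only reuses values from the first rows shown earlier.

```python
import pandas as pd

# Small illustrative frame; values are taken from the first rows shown earlier.
demo = pd.DataFrame({
    "Department": ["IT", "Finance", "Finance"],
    "Gender": ["Male", "Male", "Male"],
    "Job_Title": ["Specialist", "Developer", "Specialist"],
    "Education_Level": ["High School", "High School", "High School"],
    "Hire_Date": [
        "2022-01-19 08:03:05.556036",
        "2024-04-18 08:03:05.556036",
        "2015-10-26 08:03:05.556036",
    ],
})

category_cols = ["Department", "Gender", "Job_Title", "Education_Level"]

# Low-cardinality text columns become pandas categoricals.
converted = demo.astype({col: "category" for col in category_cols})
# Date strings become datetime values.
converted["Hire_Date"] = pd.to_datetime(converted["Hire_Date"])

print(converted.dtypes)
```

Passing a dictionary to `astype` converts only the named columns and leaves the rest untouched, which keeps the conversion explicit.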

### 1.3.3 Previewing the First 5 Rows

A quick look at the first few rows gives us an idea of:
- Column naming consistency
- Types of values in each column
- Any obviously incorrect data


In [11]:
df.head()

Unnamed: 0,Employee_ID,Department,Gender,Age,Job_Title,Hire_Date,Years_At_Company,Education_Level,Performance_Score,Monthly_Salary,Work_Hours_Per_Week,Projects_Handled,Overtime_Hours,Sick_Days,Remote_Work_Frequency,Team_Size,Training_Hours,Promotions,Employee_Satisfaction_Score,Resigned
0,1,IT,Male,55,Specialist,2022-01-19 08:03:05.556036,2,High School,5,6750.0,33,32,22,2,0,14,66,0,2.63,False
1,2,Finance,Male,29,Developer,2024-04-18 08:03:05.556036,0,High School,5,7500.0,34,34,13,14,100,12,61,2,1.72,False
2,3,Finance,Male,55,Specialist,2015-10-26 08:03:05.556036,8,High School,3,5850.0,37,27,6,3,50,10,1,0,3.17,False
3,4,Customer Support,Female,48,Analyst,2016-10-22 08:03:05.556036,7,Bachelor,2,4800.0,52,10,28,12,100,10,0,1,1.86,False
4,5,Engineering,Female,36,Analyst,2021-07-23 08:03:05.556036,3,Bachelor,2,4800.0,38,11,29,13,100,15,9,1,1.25,False


### 1.3.4 Previewing the Last 5 Rows

A quick look at the last few rows, to check that the end of the file is consistent with the start.

In [12]:
df.tail()

Unnamed: 0,Employee_ID,Department,Gender,Age,Job_Title,Hire_Date,Years_At_Company,Education_Level,Performance_Score,Monthly_Salary,Work_Hours_Per_Week,Projects_Handled,Overtime_Hours,Sick_Days,Remote_Work_Frequency,Team_Size,Training_Hours,Promotions,Employee_Satisfaction_Score,Resigned
99995,99996,Finance,Male,27,Technician,2022-12-07 08:03:05.556036,1,Bachelor,4,4900.0,55,46,5,3,75,16,48,2,1.28,False
99996,99997,IT,Female,36,Consultant,2018-07-24 08:03:05.556036,6,Master,5,8250.0,39,35,7,0,0,10,77,1,3.48,True
99997,99998,Operations,Male,53,Analyst,2015-11-24 08:03:05.556036,8,High School,2,4800.0,31,13,6,5,0,5,87,1,2.6,False
99998,99999,HR,Female,22,Consultant,2015-08-03 08:03:05.556036,9,High School,5,8250.0,35,43,10,1,75,2,31,1,3.1,False
99999,100000,Finance,Female,43,Analyst,2024-03-04 08:03:05.556036,0,PhD,1,4400.0,51,43,27,11,75,13,45,1,2.64,False


### 1.3.5 DataFrame Structure Overview

Using `df.info()`, we can review:
- Column names and their data types
- Non-null (non-missing) counts per column
- Overall memory usage

This helps identify:
- Columns with missing data
- Object columns that may be converted to more efficient types (e.g., category or datetime)


In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 20 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   Employee_ID                  100000 non-null  int64  
 1   Department                   100000 non-null  object 
 2   Gender                       100000 non-null  object 
 3   Age                          100000 non-null  int64  
 4   Job_Title                    100000 non-null  object 
 5   Hire_Date                    100000 non-null  object 
 6   Years_At_Company             100000 non-null  int64  
 7   Education_Level              100000 non-null  object 
 8   Performance_Score            100000 non-null  int64  
 9   Monthly_Salary               100000 non-null  float64
 10  Work_Hours_Per_Week          100000 non-null  int64  
 11  Projects_Handled             100000 non-null  int64  
 12  Overtime_Hours               100000 non-null  int64  
 13  

### 1.3.6 📊 Summary Statistics

This step helps summarize:
- Distributions (mean, median, quartiles)
- Range and spread (min, max, std dev)
- Potential outliers or skew


In [14]:
df.describe()

Unnamed: 0,Employee_ID,Age,Years_At_Company,Performance_Score,Monthly_Salary,Work_Hours_Per_Week,Projects_Handled,Overtime_Hours,Sick_Days,Remote_Work_Frequency,Team_Size,Training_Hours,Promotions,Employee_Satisfaction_Score
count,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0
mean,50000.5,41.02941,4.47607,2.99543,6403.211,44.95695,24.43117,14.51493,7.00855,50.0905,10.01356,49.50606,0.99972,2.999088
std,28867.657797,11.244121,2.869336,1.414726,1372.508717,8.942003,14.469584,8.664026,4.331591,35.351157,5.495405,28.890383,0.815872,1.150719
min,1.0,22.0,0.0,1.0,3850.0,30.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
25%,25000.75,31.0,2.0,2.0,5250.0,37.0,12.0,7.0,3.0,25.0,5.0,25.0,0.0,2.01
50%,50000.5,41.0,4.0,3.0,6500.0,45.0,24.0,15.0,7.0,50.0,10.0,49.0,1.0,3.0
75%,75000.25,51.0,7.0,4.0,7500.0,53.0,37.0,22.0,11.0,75.0,15.0,75.0,2.0,3.99
max,100000.0,60.0,10.0,5.0,9000.0,60.0,49.0,29.0,14.0,100.0,19.0,99.0,2.0,5.0
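To go one step beyond `describe()` when looking for skew and outliers, one option is to compute skewness and count values outside the usual 1.5 × IQR fences. The sketch below is only an illustration: `iqr_outlier_count` is a hypothetical helper, the 1.5 multiplier is a common convention rather than something specified in this project, and the `toy` values are made up.

```python
import pandas as pd

def iqr_outlier_count(series: pd.Series, k: float = 1.5) -> int:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return int(((series < lower) | (series > upper)).sum())

# Made-up values for illustration only; not taken from the dataset.
toy = pd.Series([10, 12, 11, 13, 12, 11, 50])

print("skew:", toy.skew())                  # positive: long right tail
print("outliers:", iqr_outlier_count(toy))  # only the value 50 is outside the fences
```

On the real data, the helper could be applied column by column, for example with `df.select_dtypes(include="number").apply(iqr_outlier_count)`.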


### 1.3.7 Checking for Missing Values

Missing or null values need to be handled before further analysis.
This step shows:
- Which columns have missing values
- The count of missing entries per column


In [15]:
df.isnull().sum()

Employee_ID                    0
Department                     0
Gender                         0
Age                            0
Job_Title                      0
Hire_Date                      0
Years_At_Company               0
Education_Level                0
Performance_Score              0
Monthly_Salary                 0
Work_Hours_Per_Week            0
Projects_Handled               0
Overtime_Hours                 0
Sick_Days                      0
Remote_Work_Frequency          0
Team_Size                      0
Training_Hours                 0
Promotions                     0
Employee_Satisfaction_Score    0
Resigned                       0
dtype: int64

**✅ No missing values found**

### 1.3.8 Unique Value Counts per Column

To better understand which columns are categorical or continuous, we check the number of unique values in each column.
- Columns with few unique values are good candidates for `category`
- Date strings should be parsed as `datetime`


In [17]:
df.nunique()

Employee_ID                    100000
Department                          9
Gender                              3
Age                                39
Job_Title                           7
Hire_Date                        3650
Years_At_Company                   11
Education_Level                     4
Performance_Score                   5
Monthly_Salary                     28
Work_Hours_Per_Week                31
Projects_Handled                   50
Overtime_Hours                     30
Sick_Days                          15
Remote_Work_Frequency               5
Team_Size                          19
Training_Hours                    100
Promotions                          3
Employee_Satisfaction_Score       401
Resigned                            2
dtype: int64
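The rule of thumb above can be turned into a small helper that lists non-numeric columns with few unique values. This is a sketch under assumptions: `category_candidates` and its `max_unique` threshold are hypothetical, and the `demo` frame reuses a few values from the preview with shortened dates.

```python
import pandas as pd

def category_candidates(frame: pd.DataFrame, max_unique: int = 10) -> list[str]:
    """Return non-numeric, non-boolean columns with at most max_unique distinct values."""
    text_like = frame.select_dtypes(exclude=["number", "bool"])
    counts = text_like.nunique()
    return counts[counts <= max_unique].index.tolist()

# Illustrative rows based on the preview; dates shortened for readability.
demo = pd.DataFrame({
    "Department": ["IT", "Finance", "Finance", "Customer Support"],
    "Hire_Date": ["2022-01-19", "2024-04-18", "2015-10-26", "2016-10-22"],
    "Age": [55, 29, 55, 48],
})

print(category_candidates(demo, max_unique=3))
```

Given the unique counts listed above, a threshold of 10 would select `Department`, `Gender`, `Job_Title`, and `Education_Level`, while `Hire_Date` (3650 unique values) would be left for `datetime` parsing.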

### 1.3.9 Checking for Duplicate Values

This step checks whether the dataset contains any fully duplicated rows.

In [21]:
df.duplicated().sum()

0

**✅ No duplicate rows found**
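Note that `df.duplicated()` compares every column, so two records for the same employee that differ in a single field would not be flagged. A key-based check is sketched below on made-up rows; the `toy` frame and variable names are illustrative only.

```python
import pandas as pd

# Made-up rows: the second and third share an Employee_ID but differ in Age.
toy = pd.DataFrame({"Employee_ID": [1, 2, 2], "Age": [30, 41, 42]})

# Compares all columns: no row is an exact copy of another.
full_row_dupes = int(toy.duplicated().sum())
# Compares the key column only: the repeated Employee_ID is flagged.
id_dupes = int(toy.duplicated(subset=["Employee_ID"]).sum())

print(full_row_dupes, id_dupes)
```

For this dataset, the unique-value counts above already show 100000 distinct `Employee_ID` values across 100000 rows, so the key-based check would also find nothing.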

### 1.3.10 Preview Unique Values in Object Columns

This step helps us decide:
- Which `object` columns can be converted to `category`
- Which ones are actually dates


In [19]:

object_cols = df.select_dtypes(include='object').columns    # Identify object columns

for col in object_cols:
    print(f"\n🟨 Unique values in '{col}':")
    print(df[col].unique())                                 # Print unique values for each object column



🟨 Unique values in 'Department':
['IT' 'Finance' 'Customer Support' 'Engineering' 'Marketing' 'HR'
 'Operations' 'Sales' 'Legal']

🟨 Unique values in 'Gender':
['Male' 'Female' 'Other']

🟨 Unique values in 'Job_Title':
['Specialist' 'Developer' 'Analyst' 'Manager' 'Technician' 'Engineer'
 'Consultant']

🟨 Unique values in 'Hire_Date':
['2022-01-19 08:03:05.556036' '2024-04-18 08:03:05.556036'
 '2015-10-26 08:03:05.556036' ... '2015-12-25 08:03:05.556036'
 '2016-02-03 08:03:05.556036' '2017-07-15 08:03:05.556036']

🟨 Unique values in 'Education_Level':
['High School' 'Bachelor' 'Master' 'PhD']
