# 📊 Employee Attrition Analysis and Prediction – Part 1: Data Preprocessing

This notebook presents the **data preprocessing** step in a full end-to-end pipeline to analyze and predict employee attrition using the IBM HR Analytics dataset. Preparing the data is a crucial first step before conducting exploratory analysis or building predictive models.

This notebook is **Part 1 of 3**, focusing on:
- Cleaning and transforming the data
- Handling missing values
- Encoding categorical variables
- Exporting the cleaned dataset for later analysis

---

## 🔄 Notebook Series Overview:

🧹 **Part 1 – Data Preprocessing** _(this notebook)_
- Handle missing values
- Encode categorical features
- Export clean dataset

📊 **Part 2 – Exploratory Data Analysis (EDA)** _(coming soon)_
- Visualize distributions and relationships
- Explore attrition patterns by feature

🤖 **Part 3 – Predictive Modeling & Insights** _(coming soon)_
- Build and evaluate ML models (Logistic Regression, Random Forest, XGBoost)
- Optimize performance and extract actionable HR insights

---

## 📁 Dataset:
- Source: [IBM HR Analytics Employee Attrition – Kaggle](https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset)
- Fictional dataset created by IBM data scientists

---

## 🛠️ Tools Used:
- **Python**: Pandas, NumPy, Scikit-learn
- **Environment**: Jupyter Notebook

---

🔁 After running this notebook, you’ll get a cleaned version of the dataset ready for analysis and modeling in the next parts of the project.

---

### Import libraries

In [1]:
import numpy as np 
import pandas as pd 
from sklearn.preprocessing import StandardScaler

---

## 🧼 Data Preprocessing

**Objective**: Prepare and clean the IBM Employee Attrition dataset for exploratory analysis and predictive modeling.

---
### 1. Load and Inspect Dataset

We'll load the dataset and inspect its structure, shape, types, and missing values.

In [2]:
# Load Data
attrition_data = pd.read_csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")

In [3]:
# Preview
attrition_data.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,4,80,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,2,80,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,...,3,80,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,4,80,1,6,3,3,2,2,2,2


In [4]:
# Rows and Columns
attrition_data.shape

(1470, 35)

In [5]:
# Column names
attrition_data.columns

Index(['Age', 'Attrition', 'BusinessTravel', 'DailyRate', 'Department',
       'DistanceFromHome', 'Education', 'EducationField', 'EmployeeCount',
       'EmployeeNumber', 'EnvironmentSatisfaction', 'Gender', 'HourlyRate',
       'JobInvolvement', 'JobLevel', 'JobRole', 'JobSatisfaction',
       'MaritalStatus', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked',
       'Over18', 'OverTime', 'PercentSalaryHike', 'PerformanceRating',
       'RelationshipSatisfaction', 'StandardHours', 'StockOptionLevel',
       'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance',
       'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion',
       'YearsWithCurrManager'],
      dtype='object')

In [6]:
# Data types
attrition_data.dtypes

Age                          int64
Attrition                   object
BusinessTravel              object
DailyRate                    int64
Department                  object
DistanceFromHome             int64
Education                    int64
EducationField              object
EmployeeCount                int64
EmployeeNumber               int64
EnvironmentSatisfaction      int64
Gender                      object
HourlyRate                   int64
JobInvolvement               int64
JobLevel                     int64
JobRole                     object
JobSatisfaction              int64
MaritalStatus               object
MonthlyIncome                int64
MonthlyRate                  int64
NumCompaniesWorked           int64
Over18                      object
OverTime                    object
PercentSalaryHike            int64
PerformanceRating            int64
RelationshipSatisfaction     int64
StandardHours                int64
StockOptionLevel             int64
TotalWorkingYears   

In [7]:
# Summary info
attrition_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Age                       1470 non-null   int64 
 1   Attrition                 1470 non-null   object
 2   BusinessTravel            1470 non-null   object
 3   DailyRate                 1470 non-null   int64 
 4   Department                1470 non-null   object
 5   DistanceFromHome          1470 non-null   int64 
 6   Education                 1470 non-null   int64 
 7   EducationField            1470 non-null   object
 8   EmployeeCount             1470 non-null   int64 
 9   EmployeeNumber            1470 non-null   int64 
 10  EnvironmentSatisfaction   1470 non-null   int64 
 11  Gender                    1470 non-null   object
 12  HourlyRate                1470 non-null   int64 
 13  JobInvolvement            1470 non-null   int64 
 14  JobLevel                

In [8]:
# Numeric summary
attrition_data.describe()

Unnamed: 0,Age,DailyRate,DistanceFromHome,Education,EmployeeCount,EmployeeNumber,EnvironmentSatisfaction,HourlyRate,JobInvolvement,JobLevel,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
count,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,...,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0
mean,36.92381,802.485714,9.192517,2.912925,1.0,1024.865306,2.721769,65.891156,2.729932,2.063946,...,2.712245,80.0,0.793878,11.279592,2.79932,2.761224,7.008163,4.229252,2.187755,4.123129
std,9.135373,403.5091,8.106864,1.024165,0.0,602.024335,1.093082,20.329428,0.711561,1.10694,...,1.081209,0.0,0.852077,7.780782,1.289271,0.706476,6.126525,3.623137,3.22243,3.568136
min,18.0,102.0,1.0,1.0,1.0,1.0,1.0,30.0,1.0,1.0,...,1.0,80.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
25%,30.0,465.0,2.0,2.0,1.0,491.25,2.0,48.0,2.0,1.0,...,2.0,80.0,0.0,6.0,2.0,2.0,3.0,2.0,0.0,2.0
50%,36.0,802.0,7.0,3.0,1.0,1020.5,3.0,66.0,3.0,2.0,...,3.0,80.0,1.0,10.0,3.0,3.0,5.0,3.0,1.0,3.0
75%,43.0,1157.0,14.0,4.0,1.0,1555.75,4.0,83.75,3.0,3.0,...,4.0,80.0,1.0,15.0,3.0,3.0,9.0,7.0,3.0,7.0
max,60.0,1499.0,29.0,5.0,1.0,2068.0,4.0,100.0,4.0,5.0,...,4.0,80.0,3.0,40.0,6.0,4.0,40.0,18.0,15.0,17.0


---
### 2. Handle Missing Values


In [9]:
# Check missing values
print(attrition_data.isnull().sum())

Age                         0
Attrition                   0
BusinessTravel              0
DailyRate                   0
Department                  0
DistanceFromHome            0
Education                   0
EducationField              0
EmployeeCount               0
EmployeeNumber              0
EnvironmentSatisfaction     0
Gender                      0
HourlyRate                  0
JobInvolvement              0
JobLevel                    0
JobRole                     0
JobSatisfaction             0
MaritalStatus               0
MonthlyIncome               0
MonthlyRate                 0
NumCompaniesWorked          0
Over18                      0
OverTime                    0
PercentSalaryHike           0
PerformanceRating           0
RelationshipSatisfaction    0
StandardHours               0
StockOptionLevel            0
TotalWorkingYears           0
TrainingTimesLastYear       0
WorkLifeBalance             0
YearsAtCompany              0
YearsInCurrentRole          0
YearsSince

The output shows that no column contains null values.
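The per-column listing gets long on a wide dataset. A minimal sketch that reports only the affected columns is shown below, using a small made-up frame rather than the attrition data. The names `toy`, `columns_with_missing`, and `total_missing` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with one missing entry in each column.
toy = pd.DataFrame({
    "Age": [41, np.nan, 37],
    "Department": ["Sales", "Sales", None],
})

# Count missing values per column, then keep only columns that have any.
missing_counts = toy.isnull().sum()
columns_with_missing = missing_counts[missing_counts > 0]

# A total of 0 would mean there is nothing to impute or drop.
total_missing = int(missing_counts.sum())

print(columns_with_missing)
print("Total missing cells:", total_missing)
```

On the attrition data, the same filter would presumably return an empty Series, matching the all-zero counts above.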

---
### 3. Remove Duplicates

In [10]:
# Check and drop duplicates
cleaned_attrition_data = attrition_data.drop_duplicates()
cleaned_attrition_data.shape

(1470, 35)

The shape is unchanged at (1470, 35), so the dataset contains no duplicate rows.
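Comparing shapes works, but duplicates can also be counted directly before dropping anything. The sketch below uses a small made-up frame, and the names `toy`, `n_duplicates`, and `deduplicated` are illustrative assumptions.

```python
import pandas as pd

# Toy frame (made-up values): the third row repeats the first.
toy = pd.DataFrame({
    "Age": [41, 49, 41],
    "Department": ["Sales", "Research & Development", "Sales"],
})

# duplicated() flags every row that repeats an earlier row.
n_duplicates = int(toy.duplicated().sum())

# Drop the repeats and rebuild a clean 0..n-1 index.
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print("Duplicate rows:", n_duplicates)
print("Shape after dropping:", deduplicated.shape)
```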

---
### 4. Drop Irrelevant Columns

These columns are either constant or ID-based, and do not help prediction:
- `EmployeeCount` (always 1)
- `StandardHours` (always 80)
- `Over18` (always 'Y')
- `EmployeeNumber` (unique identifier)


In [11]:
# Drop constant and ID columns (reassign rather than mutate in place)
cleaned_attrition_data = cleaned_attrition_data.drop(columns=['EmployeeCount', 'StandardHours', 'Over18', 'EmployeeNumber'])
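The constant and identifier columns listed above could also be found programmatically instead of by inspection. The sketch below runs on a small made-up frame. The names `toy`, `constant_cols`, and `id_like_cols` are illustrative assumptions, and the "one unique value per row" rule is only a heuristic that can also flag genuine continuous features.

```python
import pandas as pd

# Toy frame (made-up values) mimicking the kinds of columns dropped above.
toy = pd.DataFrame({
    "EmployeeCount": [1, 1, 1, 1],
    "Over18": ["Y", "Y", "Y", "Y"],
    "EmployeeNumber": [1, 2, 4, 5],
    "Age": [41, 49, 37, 37],
})

# Number of distinct values in each column.
unique_counts = toy.nunique()

# Constant columns carry no information for a model.
constant_cols = unique_counts[unique_counts == 1].index.tolist()

# Columns with a distinct value in every row look like identifiers.
id_like_cols = unique_counts[unique_counts == len(toy)].index.tolist()

print("Constant:", constant_cols)
print("ID-like:", id_like_cols)
```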

---
### 5. Encode Categorical Variables

- Binary variables (`Attrition`, `Gender`, `OverTime`) are label encoded.
- Nominal categorical variables are one-hot encoded:
  - `BusinessTravel`, `Department`, `EducationField`, `JobRole`, `MaritalStatus`

In [12]:
# Binary Encoding
cleaned_attrition_data['Attrition'] = cleaned_attrition_data['Attrition'].map({'Yes': 1, 'No': 0})
cleaned_attrition_data['Gender'] = cleaned_attrition_data['Gender'].map({'Male': 1, 'Female': 0})
cleaned_attrition_data['OverTime'] = cleaned_attrition_data['OverTime'].map({'Yes': 1, 'No': 0})

In [13]:
# One-Hot Encoding
cleaned_attrition_data = pd.get_dummies(cleaned_attrition_data, columns=[
    'BusinessTravel', 'Department', 'EducationField', 'JobRole', 'MaritalStatus'
], drop_first=True)
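To make the effect of the two encodings concrete, here is a minimal sketch on a small made-up frame (the name `toy` and its values are assumptions). It shows that `drop_first=True` removes the alphabetically first category, which becomes the implicit baseline.

```python
import pandas as pd

# Toy frame (made-up values) with one binary and one nominal column.
toy = pd.DataFrame({
    "OverTime": ["Yes", "No", "No"],
    "MaritalStatus": ["Single", "Married", "Divorced"],
})

# Binary mapping: any label missing from the dict would become NaN,
# so it is worth checking for NaN after mapping.
toy["OverTime"] = toy["OverTime"].map({"Yes": 1, "No": 0})

# One-hot encoding: 'Divorced' sorts first and is dropped as the baseline.
encoded = pd.get_dummies(toy, columns=["MaritalStatus"], drop_first=True)

print(encoded.columns.tolist())
```

A row with all `MaritalStatus_*` columns false (or 0) therefore represents the dropped category.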

---
### 6. Scale Numerical Features

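To prepare for machine learning, we scale the numerical columns (excluding the target `Attrition`) using `StandardScaler`. The selection covers every `int64` and `float64` column, so it also standardizes the binary-encoded `Gender` and `OverTime` columns.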


In [14]:
# Scaling
scaler = StandardScaler()
numerical_cols = cleaned_attrition_data.select_dtypes(include=['int64', 'float64']).drop('Attrition', axis=1).columns
cleaned_attrition_data[numerical_cols] = scaler.fit_transform(cleaned_attrition_data[numerical_cols])
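The cell above fits the scaler on the full dataset. If a later part holds out a test set, a common variant fits the scaler on the training rows only, so that test-set statistics do not leak into preprocessing. The sketch below illustrates that variant on made-up values. It is not the notebook's method, and the split size and `random_state` are arbitrary assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy frame (made-up values) with two features and a binary target.
toy = pd.DataFrame({
    "Age": [25, 30, 35, 40, 45, 50, 55, 60],
    "MonthlyIncome": [2000, 2500, 3100, 3800, 4600, 5500, 6500, 7600],
    "Attrition": [1, 0, 1, 0, 0, 0, 0, 0],
})
features = toy.drop(columns="Attrition")
target = toy["Attrition"]

# Hold out 25% of the rows (2 of 8) for testing.
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.25, random_state=42
)

# Learn mean and standard deviation from the training rows only,
# then apply the same transformation to the test rows.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(np.round(X_train_scaled.mean(axis=0), 6))
print(np.round(X_train_scaled.std(axis=0), 6))
```

With this approach the exported CSV would hold unscaled features, and scaling would move into the modeling notebook.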

---
### 7. Export the Cleaned Dataset

Save the cleaned dataset to a new CSV file, ready for analysis and modeling.

In [15]:
# Save Final Dataset
cleaned_attrition_data.to_csv('cleaned_attrition_dataset.csv', index=False)
print("Cleaned dataset saved successfully.")

Cleaned dataset saved successfully.
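A quick round-trip check can confirm that the exported file reloads with the same shape and columns. The sketch below writes a small made-up frame to a temporary directory rather than touching `cleaned_attrition_dataset.csv`. The file name `cleaned_toy.csv` and the variable names are illustrative assumptions.

```python
import os
import tempfile

import pandas as pd

# Toy frame (made-up values) standing in for the cleaned dataset.
toy = pd.DataFrame({"Age": [0.5, -1.2], "Attrition": [1, 0]})

# Write to a temporary folder and read the file straight back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cleaned_toy.csv")
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# index=False keeps the row index out of the file, so no extra
# 'Unnamed: 0' column should appear after reloading.
same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)

print("Shape preserved:", same_shape)
print("Columns preserved:", same_columns)
```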


---
### 8. 📄 Summary of Preprocessing Steps

| Step | Action |
|------|--------|
| Load Data | Loaded IBM attrition dataset (1470 rows, 35 columns) |
| Missing Values | Checked; none found |
| Duplicates | Checked; none found |
| Drop Columns | Removed constant and ID columns |
| Encode | Binary and one-hot encoding applied |
| Scale | Numerical features standardized |
| Output | Saved as `cleaned_attrition_dataset.csv` |

The dataset is now fully prepared for both EDA and machine learning models.

📂 Cleaned dataset saved as `cleaned_attrition_dataset.csv`.

---

## ⏭️ Next Step

Continue to **Part 2: Exploratory Data Analysis (EDA)**  
➡️ [Link to EDA Notebook](https://github.com/omarmamdouhismaiel/Employee-Attrition-Analysis-and-Prediction/blob/dddc58e1249a22c2a6583dc722a382e0906d0a41/Part%202%20-%20Exploratory%20Data%20Analysis%20(EDA).ipynb)

We will visualize and explore patterns in employee attrition to better understand which features are most relevant before modeling.