# 📊 Employee Attrition Analysis and Prediction – Part 1: Data Preprocessing

This notebook presents the **data preprocessing** step in a full end-to-end pipeline to analyze and predict employee attrition using the IBM HR Analytics dataset. Preparing the data is a crucial first step before conducting exploratory analysis or building predictive models.

This notebook is **Part 1 of 3**, focusing on:
- Cleaning and transforming the data
- Handling missing values
- Encoding categorical variables
- Exporting the cleaned dataset for later analysis

---

## 🔄 Notebook Series Overview:

🧹 **Part 1 – Data Preprocessing** _(this notebook)_
- Handle missing values
- Encode categorical features
- Export clean dataset

📊 **Part 2 – Exploratory Data Analysis (EDA)** _(coming soon)_
- Visualize distributions and relationships
- Explore attrition patterns by feature

🤖 **Part 3 – Predictive Modeling & Insights** _(coming soon)_
- Build and evaluate ML models (Logistic Regression, Random Forest, XGBoost)
- Optimize performance and extract actionable HR insights

---

## 📁 Dataset:
- Source: [IBM HR Analytics Employee Attrition – Kaggle](https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset)
- Fictional dataset created by IBM data scientists

---

## 🛠️ Tools Used:
- **Python**: Pandas, NumPy, Scikit-learn
- **Environment**: Kaggle / Jupyter Notebook

---

🔁 After running this notebook, you’ll get a cleaned version of the dataset ready for analysis and modeling in the next parts of the project.

---

### Import libraries

In [None]:
import numpy as np 
import pandas as pd 
from sklearn.preprocessing import StandardScaler

---

## 🧼 Data Preprocessing

**Objective**: Prepare and clean the IBM Employee Attrition dataset for exploratory analysis and predictive modeling.

---
### 1. Load and Inspect Dataset

We'll load the dataset and inspect its structure, shape, types, and missing values.

In [None]:
# Load Data
attrition_data = pd.read_csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")

In [None]:
# Preview
attrition_data.head()

In [None]:
# Rows and Columns
attrition_data.shape

In [None]:
# Column names
attrition_data.columns

In [None]:
# Data types
attrition_data.dtypes

In [None]:
# Summary info
attrition_data.info()

In [None]:
# Numeric summary
attrition_data.describe()

---
### 2. Handle Missing Values


In [None]:
# Check missing values
print(attrition_data.isnull().sum())

The output shows no null values in any column, so no imputation is needed.
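
On a dataset with many columns, the full `isnull().sum()` printout can be long. As a minimal sketch, the check can be narrowed to only the columns that contain missing values. The `toy` frame below is a hypothetical stand-in for `attrition_data`, and its values are illustrative only.

```python
import pandas as pd

# Hypothetical toy frame standing in for attrition_data (illustrative values)
toy = pd.DataFrame({
    "Age": [41, 49, None, 33],
    "Department": ["Sales", None, "Sales", "HR"],
    "JobLevel": [2, 2, 1, 1],
})

# Count missing values per column
missing_counts = toy.isnull().sum()

# Keep only the columns that actually contain missing values
columns_with_missing = missing_counts[missing_counts > 0]
total_missing = int(missing_counts.sum())

print(columns_with_missing)
print("Total missing values:", total_missing)
```

An empty result for `columns_with_missing` would correspond to the situation in this notebook, where nothing needs to be imputed.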

---
### 3. Remove Duplicates

In [None]:
# Check and drop duplicates
cleaned_attrition_data = attrition_data.drop_duplicates()
cleaned_attrition_data.shape

The row count matches the original dataset, so there are no duplicate rows.
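
Comparing shapes works, but the number of duplicates can also be counted directly. The sketch below uses a hypothetical `toy` frame (not the real dataset) to show `duplicated()` alongside `drop_duplicates()`.

```python
import pandas as pd

# Hypothetical toy frame: the third row repeats the first one
toy = pd.DataFrame({
    "Age": [41, 49, 41],
    "Department": ["Sales", "Research & Development", "Sales"],
})

# duplicated() marks every repeat of an earlier row as True
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row
deduplicated = toy.drop_duplicates()
rows_removed = len(toy) - len(deduplicated)

print("Duplicate rows:", n_duplicates)
print("Rows removed:", rows_removed)
```

A count of zero from `duplicated().sum()` would confirm the same conclusion as the unchanged shape.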

---
### 4. Drop Irrelevant Columns

These columns are either constant or ID-based, and do not help prediction:
- `EmployeeCount` (always 1)
- `StandardHours` (always 40)
- `Over18` (always 'Y')
- `EmployeeNumber` (unique identifier)
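
One way to confirm the claims in this list is to count the distinct values per column. This is a sketch on a hypothetical `toy` frame that reuses the column names above with made-up values. A column with one distinct value is constant, and a column with as many distinct values as rows behaves like an identifier.

```python
import pandas as pd

# Hypothetical toy frame reusing the column names above (illustrative values)
toy = pd.DataFrame({
    "EmployeeCount": [1, 1, 1],
    "StandardHours": [40, 40, 40],
    "Over18": ["Y", "Y", "Y"],
    "EmployeeNumber": [1, 2, 4],
    "Age": [41, 49, 41],
})

# Number of distinct values in each column
unique_counts = toy.nunique()

# Constant columns have exactly one distinct value
constant_cols = unique_counts[unique_counts == 1].index.tolist()

# ID-like columns have one distinct value per row
id_like_cols = unique_counts[unique_counts == len(toy)].index.tolist()

print("Constant:", constant_cols)
print("ID-like:", id_like_cols)
```

Note that the ID-like test is only a heuristic: a continuous feature can also be unique per row, so the result still needs a manual look.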


In [None]:
# Drop constant and ID columns (reassign instead of using inplace=True)
cleaned_attrition_data = cleaned_attrition_data.drop(columns=['EmployeeCount', 'StandardHours', 'Over18', 'EmployeeNumber'])

---
### 5. Encode Categorical Variables

- Binary variables (`Attrition`, `Gender`, `OverTime`) are label encoded.
- Nominal categorical variables are one-hot encoded:
  - `BusinessTravel`, `Department`, `EducationField`, `JobRole`, `MaritalStatus`

In [None]:
# Binary Encoding
cleaned_attrition_data['Attrition'] = cleaned_attrition_data['Attrition'].map({'Yes': 1, 'No': 0})
cleaned_attrition_data['Gender'] = cleaned_attrition_data['Gender'].map({'Male': 1, 'Female': 0})
cleaned_attrition_data['OverTime'] = cleaned_attrition_data['OverTime'].map({'Yes': 1, 'No': 0})

In [None]:
# One-Hot Encoding
cleaned_attrition_data = pd.get_dummies(cleaned_attrition_data, columns=[
    'BusinessTravel', 'Department', 'EducationField', 'JobRole', 'MaritalStatus'
], drop_first=True)
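
A caveat with `map` is that any value missing from the dictionary silently becomes `NaN`. The sketch below, on a hypothetical `toy` frame with a deliberately misspelled value, shows how to count such cases and which dummy columns `drop_first=True` keeps.

```python
import pandas as pd

# Hypothetical toy frame; the lowercase "yes" is deliberately not in the mapping
toy = pd.DataFrame({
    "OverTime": ["Yes", "No", "yes"],
    "MaritalStatus": ["Single", "Married", "Divorced"],
})

# Values that are not keys of the mapping become NaN
toy["OverTime"] = toy["OverTime"].map({"Yes": 1, "No": 0})
unmapped = int(toy["OverTime"].isna().sum())

# drop_first=True drops the first category in sorted order ("Divorced" here)
encoded = pd.get_dummies(toy, columns=["MaritalStatus"], drop_first=True)
dummy_columns = [c for c in encoded.columns if c.startswith("MaritalStatus_")]

print("Unmapped OverTime values:", unmapped)
print("Dummy columns:", dummy_columns)
```

Running the same `isna().sum()` check on `Attrition`, `Gender`, and `OverTime` after encoding would be a cheap way to catch unexpected labels.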

---
### 6. Scale Numerical Features

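To prepare for machine learning, we standardize the numerical columns (excluding the target `Attrition`) using `StandardScaler`. The `int64`/`float64` selection also picks up the binary-encoded `Gender` and `OverTime` columns, while the one-hot dummy columns (`bool` or `uint8`, depending on the pandas version) are left unscaled.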


In [None]:
# Scaling
scaler = StandardScaler()
numerical_cols = cleaned_attrition_data.select_dtypes(include=['int64', 'float64']).drop('Attrition', axis=1).columns
cleaned_attrition_data[numerical_cols] = scaler.fit_transform(cleaned_attrition_data[numerical_cols])
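
For intuition, standardization subtracts each column's mean and divides by its population standard deviation (`ddof=0`), which is the computation `StandardScaler` performs. The sketch below reproduces that arithmetic with pandas alone on a hypothetical `toy` column with made-up values.

```python
import pandas as pd

# Hypothetical toy column (illustrative values, not taken from the dataset)
toy = pd.DataFrame({"MonthlyIncome": [2000.0, 4000.0, 6000.0]})

# Population standard deviation (ddof=0) matches what StandardScaler uses
column_mean = toy.mean()
column_std = toy.std(ddof=0)

# z = (x - mean) / std, applied column by column
standardized = (toy - column_mean) / column_std

print(standardized)
print("Mean after scaling:", standardized["MonthlyIncome"].mean())
print("Std after scaling:", standardized["MonthlyIncome"].std(ddof=0))
```

After scaling, each column has a mean of about 0 and a standard deviation of about 1, so the values are no longer in their original units.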

---
### 7. Export the Cleaned Dataset

Save the cleaned dataset to a new CSV file, ready for analysis and modeling.

In [None]:
# Save Final Dataset
cleaned_attrition_data.to_csv('cleaned_attrition_dataset.csv', index=False)
print("Cleaned dataset saved successfully.")
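
A quick round-trip check can confirm that the export preserves the columns. This sketch writes a hypothetical `toy` frame to an in-memory buffer instead of a file and reads it back. With `index=False`, no extra index column appears on reload.

```python
import io

import pandas as pd

# Hypothetical toy frame standing in for cleaned_attrition_data
toy = pd.DataFrame({"Attrition": [1, 0], "Age": [0.5, -0.5]})

# Write to an in-memory buffer rather than to disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Rewind the buffer and read the CSV back
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(reloaded.shape)
print(list(reloaded.columns))
```

The same comparison of shape and column names could be run against `cleaned_attrition_dataset.csv` before moving on to the next notebook.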

---
### 8. 📄 Summary of Preprocessing Steps

| Step | Action |
|------|--------|
| Load Data | Loaded IBM attrition dataset |
| Missing Values | Checked; none found |
| Duplicates | Checked; none found |
| Drop Columns | Removed constant and ID columns |
| Encode | Binary and one-hot encoding applied |
| Scale | Numerical features standardized |
| Output | Saved as `cleaned_attrition_dataset.csv` |

The dataset is now fully prepared for both EDA and machine learning models.

📂 Cleaned dataset saved as `cleaned_attrition_dataset.csv`.

---

## ⏭️ Next Step

Continue to **Part 2: Exploratory Data Analysis (EDA)**  
➡️ [Link to EDA Notebook](#) *(Add the actual link once published)*

We will visualize and explore patterns in employee attrition to better understand which features are most relevant before modeling.