# 🧹 Data Preprocessing

## 📌 Introduction
Data preprocessing is one of the **first steps in the Machine Learning pipeline**, performed after data collection and before modeling.
It involves cleaning and transforming raw data into a format that a model can consume.
Without preprocessing, data may contain noise, missing values, or inconsistent formats,
which can mislead models and reduce performance.

---

## 🔑 Importance of Data Preprocessing
- Ensures **data quality** before analysis.
- Handles **missing, duplicate, and inconsistent values**.
- Converts raw data into a **structured format**.
- Improves **accuracy and reliability** of ML models.

---

## 🛠️ Common Steps in Data Preprocessing

### 1. Handling Missing Values
- **Drop rows/columns** whose share of missing values exceeds a threshold you choose.
- **Impute values** (mean, median, mode, or advanced imputation methods).
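
The two options above can be sketched with pandas. This is a minimal illustration, not a recommended recipe: the toy DataFrame, its column names, and the 50% drop threshold are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 35.0],
    "city": ["Paris", "Lyon", None, "Paris"],
    "notes": [np.nan, np.nan, np.nan, "ok"],
})

# Drop columns where more than half of the values are missing
# (the 50% threshold is an arbitrary choice for this sketch).
missing_share = df.isna().mean()
df = df.loc[:, missing_share <= 0.5].copy()

# Impute the numeric column with its median and the categorical one with its mode.
df["age"] = df["age"].fillna(df["age"].median())
df["city"] = df["city"].fillna(df["city"].mode().iloc[0])

print(df)
```

The median is often preferred over the mean when a column contains extreme values, because it is less sensitive to them.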

### 2. Handling Duplicates
- Remove duplicate rows so that repeated samples are not over-weighted in training and do not leak between the train and test splits.
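
A minimal pandas sketch of this step, assuming exact duplicates across all columns; the toy data and column names are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: the second and third rows are identical.
df = pd.DataFrame({
    "user_id": [1, 2, 2, 3],
    "score": [0.9, 0.5, 0.5, 0.7],
})

# Count rows that repeat an earlier row, then keep only the first occurrence.
n_duplicates = int(df.duplicated().sum())
deduped = df.drop_duplicates().reset_index(drop=True)

print(n_duplicates, len(deduped))
```

Passing `subset=[...]` to `drop_duplicates` restricts the comparison to chosen columns, which may be useful when only some columns identify a record.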

### 3. Handling Outliers
- Detect outliers using statistical methods (IQR, Z-score).
- Remove or cap extreme values if they are errors; keep genuine extremes that carry real signal.
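
As a rough sketch of the IQR method, the snippet below flags values outside 1.5 × IQR from the quartiles and caps them at those bounds. The sample values are made up, and the 1.5 multiplier is the conventional default rather than a requirement.

```python
import numpy as np

# Hypothetical measurements with one suspicious value.
values = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 95.0])

# Interquartile range and the usual 1.5 * IQR fences.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag points outside the fences, then cap them instead of dropping them.
is_outlier = (values < lower) | (values > upper)
capped = np.clip(values, lower, upper)

print(values[is_outlier], capped)
```

Capping keeps the row in the dataset, whereas filtering with `values[~is_outlier]` removes it; which one fits depends on whether the extreme value is an error.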

### 4. Normalization & Scaling
- Scale numerical features so they are comparable and features with large ranges do not dominate distance-based or gradient-based models.
  - **Standardization** → mean = 0, std = 1.
  - **Min-Max scaling** → values between 0 and 1.
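
Both scalings can be written directly in NumPy. The sketch below assumes a hypothetical train/test split and computes the statistics on the training values only, then reuses them for the test values, which is the usual way to avoid leaking test information.

```python
import numpy as np

# Hypothetical feature values for a train and a test split.
train = np.array([2.0, 4.0, 6.0, 8.0])
test = np.array([5.0, 10.0])

# Standardization: subtract the training mean, divide by the training std.
mu, sigma = train.mean(), train.std()
train_std = (train - mu) / sigma
test_std = (test - mu) / sigma

# Min-Max scaling: map the training range onto [0, 1].
lo, hi = train.min(), train.max()
train_mm = (train - lo) / (hi - lo)
test_mm = (test - lo) / (hi - lo)  # unseen values can fall outside [0, 1]

print(train_std, train_mm, test_mm)
```

Note that `np.std` uses the population standard deviation by default (`ddof=0`).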

### 5. Handling Class Imbalance
- If the target variable is imbalanced (e.g., 90% class A, 10% class B),
  the model may become biased toward the majority class.
- Solutions:
  - **Resampling techniques**:
    - Oversampling (e.g., SMOTE)
    - Undersampling
  - **Class weights adjustment** in models.
  - **Collect more data** for minority class.
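
Two of these options can be sketched in plain NumPy. The example below uses simple random oversampling (duplicating minority rows), not SMOTE, which synthesizes new points and needs a dedicated library. The class weights follow the common "balanced" heuristic `n_samples / (n_classes * class_count)`; the toy arrays are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical labels: 90% class 0, 10% class 1, with one dummy feature.
y = np.array([0] * 9 + [1] * 1)
X = np.arange(len(y), dtype=float).reshape(-1, 1)

classes, counts = np.unique(y, return_counts=True)

# Class weights: rarer classes get proportionally larger weights.
class_weights = {
    int(c): len(y) / (len(classes) * int(n)) for c, n in zip(classes, counts)
}

# Random oversampling: keep every original row and add minority rows,
# drawn with replacement, until each class matches the majority count.
target = int(counts.max())
extra = [
    rng.choice(np.flatnonzero(y == c), size=target - int(n), replace=True)
    for c, n in zip(classes, counts)
    if n < target
]
indices = np.concatenate([np.arange(len(y)), *extra])
X_bal, y_bal = X[indices], y[indices]

print(class_weights, np.bincount(y_bal))
```

Resampling should be applied to the training split only, so that evaluation still reflects the original class distribution.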

---

## ⚡ Key Point
- **Data Preprocessing = Cleaning & Preparing raw data**.
- It is the **foundation** of every ML project.
- Once data is preprocessed, we can perform **EDA** and **Feature Engineering** effectively.
