# **Data Transformation**

## **3. Discretization & Binning**

In [3]:
import numpy as np 
import pandas as pd 

## 1. 🧠 What It Does & When to Use It

### 📌 What?

**Discretization** (or **binning**) is the process of converting **continuous values** (such as age, income, or marks) into **discrete intervals** or **categories** (e.g., low/medium/high, age groups, grade levels).

> It’s like creating **buckets** to group similar values.

---

### 🎯 When to Use?

You use **discretization/binning** when:

* You want to **simplify continuous data** into **interpretable categories**
* You need to create **features for classification models**
* You want to **compare performance or behavior** across different value ranges
* You want to **reduce noise** in the data so that patterns are easier to identify


## 2. 🧾 Syntax & Core Parameters

pandas provides **two main functions** for binning:

### 🔹 `pd.cut()` – for **equal-width** or **custom-edge binning**

```python
pd.cut(x, bins, right=True, labels=None, include_lowest=False)
```

| Parameter        | Description                                                          |
| ---------------- | -------------------------------------------------------------------- |
| `x`              | The numeric Series or array to bin                                   |
| `bins`           | Number of equal-width bins (int) or a list of bin edges              |
| `right`          | Whether each bin includes its right edge (default `True`)            |
| `labels`         | List of labels, one per bin; `False` returns integer bin indices     |
| `include_lowest` | Whether the first bin also includes its left edge (default `False`)  |
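
The effect of `right` and `include_lowest` is easiest to see on values that sit exactly on a bin edge. The following is a minimal sketch; the `values` and `edges` names and numbers are made up for illustration.

```python
import pandas as pd

values = pd.Series([0, 10, 20])   # hypothetical values lying exactly on the edges
edges = [0, 10, 20]

# Defaults (right=True, include_lowest=False): intervals (0, 10] and (10, 20]
default = pd.cut(values, bins=edges)

# include_lowest=True also closes the first interval on the left, so 0 is kept
with_lowest = pd.cut(values, bins=edges, include_lowest=True)

# right=False: intervals [0, 10) and [10, 20), so 20 is excluded instead
left_closed = pd.cut(values, bins=edges, right=False)

print(default.isna().tolist())      # [True, False, False]
print(with_lowest.isna().tolist())  # [False, False, False]
print(left_closed.isna().tolist())  # [False, False, True]
```

A value that matches no interval is returned as `NaN` rather than raising an error, so edge handling is worth checking whenever the data can touch the outermost edges.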


### 🔹 `pd.qcut()` – for **quantile-based binning**

```python
pd.qcut(x, q, labels=None)
```

| Parameter | Description                                          |
| --------- | ---------------------------------------------------- |
| `x`       | Numeric Series or array                              |
| `q`       | Number of quantiles or list of quantile values (0–1) |
| `labels`  | Optional: labels for the bins                        |
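
When `q` is a list, the bins do not have to hold equal shares of the data. The sketch below, on a hypothetical `scores` Series, splits off the bottom 20% and top 20% and uses `retbins=True` to return the computed edges as well.

```python
import pandas as pd

scores = pd.Series(range(1, 11))  # hypothetical data: the integers 1..10

# Quantile cut points at 0%, 20%, 80% and 100%
tiers, edges = pd.qcut(
    scores,
    q=[0, 0.2, 0.8, 1.0],
    labels=['bottom', 'middle', 'top'],
    retbins=True,
)

print(tiers.value_counts(sort=False))  # bottom 2, middle 6, top 2
print(edges)                           # the value-space edges behind the quantiles
```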


## 3. 🧰 Different Methods & Techniques

### 🔸 A. **Fixed-width binning** — using `pd.cut()`

* Divide values into **equal-width intervals**
* Useful when the ranges themselves carry meaning (e.g., 0–10, 10–20)


### 🔸 B. **Quantile binning** — using `pd.qcut()`

* Divide values so that each bin contains a **roughly equal number of observations**
* Useful for **balancing** the number of records per bin, especially on skewed data


### 🔸 C. **Custom binning**

* Manually define bin edges and labels
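
One way to keep custom bins from dropping extreme values is an open-ended last edge. This is a sketch under assumed data: the `ages` Series (including the value 104) is hypothetical, and `np.inf` is used as the upper edge.

```python
import numpy as np
import pandas as pd

ages = pd.Series([18, 22, 35, 45, 63, 70, 85, 104])  # hypothetical ages

# np.inf as the last edge makes the top bin (60, inf], so no age is too large
age_band = pd.cut(
    ages,
    bins=[0, 30, 60, np.inf],
    labels=['Youth', 'Adult', 'Senior'],
)

print(age_band.tolist())
# ['Youth', 'Youth', 'Adult', 'Adult', 'Senior', 'Senior', 'Senior', 'Senior']
```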


### 🔸 D. **Label encoding or one-hot encoding**

* Convert binned categories into numeric form (optional post-processing)
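
As an illustration of this post-processing step, the sketch below encodes a binned column in two ways: ordinal codes via `.cat.codes` and indicator columns via `pd.get_dummies`. The variable names are chosen for the example only.

```python
import pandas as pd

ages = pd.Series([18, 22, 35, 45, 63, 70, 85])
age_group = pd.cut(ages, bins=[0, 30, 60, 100], labels=['Youth', 'Adult', 'Senior'])

# Label (ordinal) encoding: codes follow the bin order, and NaN would become -1
age_code = age_group.cat.codes

# One-hot encoding: one indicator column per bin
age_dummies = pd.get_dummies(age_group, prefix='age')

print(age_code.tolist())          # [0, 0, 1, 1, 2, 2, 2]
print(list(age_dummies.columns))  # ['age_Youth', 'age_Adult', 'age_Senior']
```

Ordinal codes preserve the order of the bins, which suits tree-based models; one-hot columns avoid implying a numeric distance between bins.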


## 4. ⚠️ Common Pitfalls & Best Practices

### ❌ Pitfalls

| Issue                          | Cause                                                                                                                      |
| ------------------------------ | -------------------------------------------------------------------------------------------------------------------------- |
| Failing or uneven `qcut` bins  | Many identical values make several quantile edges equal; `qcut` raises a `ValueError` unless `duplicates='drop'` is passed |
| Misinterpreting labels         | Labels correspond to intervals, not exact values                                                                           |
| Ignoring outliers              | Values outside custom bin edges are silently set to `NaN`                                                                  |
| Using wrong binning strategy   | Choose `cut` or `qcut` based on whether you care about value ranges or balanced sizes                                      |
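
The `qcut` pitfall can be reproduced with a small, hypothetical `purchases` Series in which most values are identical. The sketch shows the error and what `duplicates='drop'` returns instead.

```python
import pandas as pd

purchases = pd.Series([0, 0, 0, 0, 0, 0, 10, 20])  # hypothetical, mostly zeros

# The 0%, 25% and 50% quantiles are all 0, so the bin edges are not unique
try:
    pd.qcut(purchases, q=4)
except ValueError as err:
    print('qcut failed:', err)

# duplicates='drop' removes the repeated edges, leaving fewer bins than requested
binned = pd.qcut(purchases, q=4, duplicates='drop')
print(binned.value_counts(sort=False))
```

Here four bins were requested but only two remain, holding 6 and 2 records, so the result is no longer balanced.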

---

### ✅ Best Practices

* Use **`pd.cut()`** when you care about **value ranges**
* Use **`pd.qcut()`** to **balance** the number of observations across bins
* Always **label bins** for clarity and interpretability
* Visualize distributions (`hist`, `value_counts`) before and after binning
* Check for **NA values** (can occur if some values fall outside specified bins)
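
A quick check along these lines might look as follows. The `ages` values are hypothetical, with 120 deliberately placed above the last edge.

```python
import pandas as pd

ages = pd.Series([5, 18, 45, 70, 120])  # hypothetical; 120 exceeds the top edge
groups = pd.cut(ages, bins=[0, 30, 60, 100], labels=['Youth', 'Adult', 'Senior'])

# Count per bin, including values that landed in no bin
print(groups.value_counts(dropna=False))

# Inspect the original values that were left unbinned
unbinned = ages[groups.isna()]
print(unbinned.tolist())  # [120]
```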


## 5. 🧪 Examples on Real/Pseudo Data

In [4]:
data = {
    'age': [18, 22, 35, 45, 63, 70, 85],
    'income': [20000, 30000, 50000, 80000, 120000, 140000, 160000]
}
df = pd.DataFrame(data)

df

|   | age | income |
| - | --- | ------ |
| 0 | 18  | 20000  |
| 1 | 22  | 30000  |
| 2 | 35  | 50000  |
| 3 | 45  | 80000  |
| 4 | 63  | 120000 |
| 5 | 70  | 140000 |
| 6 | 85  | 160000 |


### ▶️ Example 1: **Equal-width binning** using `pd.cut()`

In [5]:
# Bin 'age' into 3 equal-width groups
df['age_group'] = pd.cut(df['age'], bins=3, labels=['Young', 'Middle', 'Old'])
df

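|   | age | income | age_group |
| - | --- | ------ | --------- |
| 0 | 18  | 20000  | Young     |
| 1 | 22  | 30000  | Young     |
| 2 | 35  | 50000  | Young     |
| 3 | 45  | 80000  | Middle    |
| 4 | 63  | 120000 | Old       |
| 5 | 70  | 140000 | Old       |
| 6 | 85  | 160000 | Old       |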


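This divides the `age` range (18 to 85) into 3 equally wide bins of about 22.3 years each, with cut points near 40.3 and 62.7. Equal width does not mean equal counts: three values fall in `Young`, one in `Middle`, and three in `Old`.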
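
To inspect the edges pandas computed for this example, `retbins=True` returns them alongside the binned values. This is a small sketch that rebuilds the `age` column as a standalone Series.

```python
import pandas as pd

ages = pd.Series([18, 22, 35, 45, 63, 70, 85])

# retbins=True returns a (binned values, bin edges) pair
groups, edges = pd.cut(ages, bins=3, labels=['Young', 'Middle', 'Old'], retbins=True)

print(edges.round(2))
print(groups.tolist())
```

The first edge comes out slightly below 18, because pandas extends the range by 0.1% so that the minimum value is included in the first bin.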

### ▶️ Example 2: **Quantile binning** using `pd.qcut()`

In [6]:
# Bin 'income' into 4 quantile groups
df['income_quantile'] = pd.qcut(df['income'], q=4, labels=['low', 'med', 'high', 'top'])
df

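|   | age | income | age_group | income_quantile |
| - | --- | ------ | --------- | --------------- |
| 0 | 18  | 20000  | Young     | low             |
| 1 | 22  | 30000  | Young     | low             |
| 2 | 35  | 50000  | Young     | med             |
| 3 | 45  | 80000  | Middle    | med             |
| 4 | 63  | 120000 | Old       | high            |
| 5 | 70  | 140000 | Old       | top             |
| 6 | 85  | 160000 | Old       | top             |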


This gives each bin **approximately the same number** of records. With 7 rows and 4 bins an exact split is impossible, so the bins hold 2, 2, 1, and 2 records.
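
Labels hide the intervals they stand for. Calling `qcut` without `labels` returns the intervals themselves, which helps when checking what `low` or `top` actually mean. A sketch on the same income values, rebuilt as a standalone Series:

```python
import pandas as pd

income = pd.Series([20000, 30000, 50000, 80000, 120000, 140000, 160000])

# Without labels, each value is mapped to the interval it falls into
intervals = pd.qcut(income, q=4)

print(intervals.cat.categories)                     # the four quantile intervals
print(intervals.value_counts(sort=False).tolist())  # [2, 2, 1, 2]
```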

### ▶️ Example 3: **Custom binning** with fixed edges

In [8]:
# Define bins and labels manually for 'age'
bins = [0, 30, 60, 100]
labels = ['Youth', 'Adult', 'Senior']

df['custom_age_group'] = pd.cut(df['age'], bins=bins, labels=labels)

df

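|   | age | income | age_group | income_quantile | custom_age_group |
| - | --- | ------ | --------- | --------------- | ---------------- |
| 0 | 18  | 20000  | Young     | low             | Youth            |
| 1 | 22  | 30000  | Young     | low             | Youth            |
| 2 | 35  | 50000  | Young     | med             | Adult            |
| 3 | 45  | 80000  | Middle    | med             | Adult            |
| 4 | 63  | 120000 | Old       | high            | Senior           |
| 5 | 70  | 140000 | Old       | top             | Senior           |
| 6 | 85  | 160000 | Old       | top             | Senior           |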


### ▶️ Example 4: With `include_lowest=True` and `right=False`

In [9]:
df['age_group'] = pd.cut(df['age'], bins=[18, 30, 60, 85], right=False, include_lowest=True)

df

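|   | age | income | age_group    | income_quantile | custom_age_group |
| - | --- | ------ | ------------ | --------------- | ---------------- |
| 0 | 18  | 20000  | [18.0, 30.0) | low             | Youth            |
| 1 | 22  | 30000  | [18.0, 30.0) | low             | Youth            |
| 2 | 35  | 50000  | [30.0, 60.0) | med             | Adult            |
| 3 | 45  | 80000  | [30.0, 60.0) | med             | Adult            |
| 4 | 63  | 120000 | [60.0, 85.0) | high            | Senior           |
| 5 | 70  | 140000 | [60.0, 85.0) | top             | Senior           |
| 6 | 85  | 160000 | NaN          | top             | Senior           |

Note that the last row has no `age_group`: with `right=False` the last interval is `[60, 85)`, so the value 85 falls outside every bin and becomes `NaN`. `include_lowest=True` has no visible effect here, because left-closed intervals already include 18. This cell also overwrites the labelled `age_group` column created in Example 1.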


## 6. 🌍 Real-World Use Cases

### 📦 E-commerce

* Segment users by **purchase amount**:

  ```python
  df['spender'] = pd.qcut(df['purchase'], q=3, labels=['Low', 'Medium', 'High'])
  ```

### 🏥 Healthcare

* Bin **age groups** for risk profiling:

  ```python
  df['risk_age'] = pd.cut(df['age'], bins=[0, 30, 60, 100],
                          labels=['Low', 'Medium', 'High'], include_lowest=True)
  ```

### 🏫 Education

* Convert **marks** to grade levels:

  ```python
  df['grade'] = pd.cut(df['score'], bins=[0, 50, 75, 90, 100],
                       labels=['D', 'C', 'B', 'A'], include_lowest=True)
  ```

### 💰 Finance

* Divide **credit scores** into creditworthiness levels:

  ```python
  # include_lowest=True keeps a score of exactly 300 in the 'Poor' bin
  df['credit_level'] = pd.cut(df['score'], bins=[300, 580, 670, 740, 800, 850],
                              labels=['Poor', 'Fair', 'Good', 'Very Good', 'Excellent'],
                              include_lowest=True)
  ```


## ✅ Summary Table

| Method      | Purpose                 | Best For                        |
| ----------- | ----------------------- | ------------------------------- |
| `pd.cut()`  | Equal-width binning     | Value ranges (e.g., age groups) |
| `pd.qcut()` | Quantile-based binning  | Balanced class sizes            |
| Custom bins | Define bin edges/labels | Domain-specific categories      |
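
The difference between the first two rows shows up most clearly on skewed data. The sketch below compares bin counts from `pd.cut()` and `pd.qcut()` on a hypothetical, right-skewed `income` Series (values in thousands); the names and numbers are illustrative only.

```python
import pandas as pd

# Hypothetical right-skewed incomes, in thousands
income = pd.Series([20, 22, 25, 27, 30, 35, 40, 60, 120, 400])
labels = ['b1', 'b2', 'b3', 'b4']

equal_width = pd.cut(income, bins=4, labels=labels)   # same width, uneven counts
equal_count = pd.qcut(income, q=4, labels=labels)     # uneven width, similar counts

comparison = pd.DataFrame(
    {
        'cut': equal_width.value_counts(sort=False).tolist(),
        'qcut': equal_count.value_counts(sort=False).tolist(),
    },
    index=labels,
)
print(comparison)
#     cut  qcut
# b1    8     3
# b2    1     2
# b3    0     2
# b4    1     3
```

With equal-width bins, the single large value stretches the range and leaves most records in the first bin, while quantile bins stay close to balanced.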

