
"""
# **Binning in Pandas: A Practical Guide**

This notebook is a practical demonstration of binning (also known as discretization) in pandas.
Binning is a data preprocessing technique that transforms a continuous numerical variable
into a small number of discrete, categorical intervals ("bins"). It can make patterns in the data easier to see,
dampen the influence of outliers, and prepare features for models that work well with categorical inputs, such as some classification algorithms.

---

## **1. Importing Libraries**

In [None]:
import pandas as pd
import numpy as np

## **2. Creating Sample Data**

Let's create a sample DataFrame with numerical columns to demonstrate Binning.
We'll imagine we have data about employees.

In [None]:
data = {
    'Employee_ID': range(1, 16),
    'Age': [22, 25, 30, 31, 38, 40, 55, 29, 33, 48, 27, 35, 42, 23, 58],
    'Monthly_Salary': [5000, 7500, 4800, 12000, 9000, 15000, 20000, 6000, 8000, 11000, 5500, 10000, 13000, 5200, 18000],
    'Years_at_Company': [1, 2, 3, 7, 5, 10, 20, 2, 4, 12, 1, 6, 8, 1, 15]
}
df = pd.DataFrame(data)

print("Original DataFrame Head:")
print(df.head())
print("\nDataFrame Info:")
df.info()
print("\nDataFrame Description (Numerical Columns):")
print(df.describe())

Original DataFrame Head:
   Employee_ID  Age  Monthly_Salary  Years_at_Company
0            1   22            5000                 1
1            2   25            7500                 2
2            3   30            4800                 3
3            4   31           12000                 7
4            5   38            9000                 5

DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   Employee_ID       15 non-null     int64
 1   Age               15 non-null     int64
 2   Monthly_Salary    15 non-null     int64
 3   Years_at_Company  15 non-null     int64
dtypes: int64(4)
memory usage: 612.0 bytes

DataFrame Description (Numerical Columns):
       Employee_ID        Age  Monthly_Salary  Years_at_Company
count    15.000000  15.000000       15.000000         15.000000
mean      8.000000  35.733333    10000.000000  


## **3. Binning with `pd.cut()`**

`pd.cut()` is used when you want to segment data based on **predefined bin edges** or by creating bins of **equal width**.

**Scenario 1: Binning 'Age' into custom, meaningful groups**

Let's categorize employees' ages into logical groups: 'Young', 'Adult', 'Senior'.

In [None]:
# Define the bin edges and labels
age_bins = [0, 25, 35, 50, df['Age'].max() + 1] # Ensure the last bin includes the max value
age_labels = ['Young (<=25)', 'Adult (26-35)', 'Experienced (36-50)', 'Senior (51+)']

In [None]:
# Apply pd.cut()
df['Age_Category_Cut'] = pd.cut(df['Age'], bins=age_bins, labels=age_labels, right=True, include_lowest=True)

print("\nDataFrame with 'Age_Category_Cut':")
print(df[['Employee_ID', 'Age', 'Age_Category_Cut']].head(10))

print("\nValue Counts for 'Age_Category_Cut':")
print(df['Age_Category_Cut'].value_counts())


DataFrame with 'Age_Category_Cut':
   Employee_ID  Age     Age_Category_Cut
0            1   22         Young (<=25)
1            2   25         Young (<=25)
2            3   30        Adult (26-35)
3            4   31        Adult (26-35)
4            5   38  Experienced (36-50)
5            6   40  Experienced (36-50)
6            7   55         Senior (51+)
7            8   29        Adult (26-35)
8            9   33        Adult (26-35)
9           10   48  Experienced (36-50)

Value Counts for 'Age_Category_Cut':
Age_Category_Cut
Adult (26-35)          6
Experienced (36-50)    4
Young (<=25)           3
Senior (51+)           2
Name: count, dtype: int64
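
The `right` and `include_lowest` arguments used above decide which bin a value gets when it sits exactly on an edge. The sketch below is a minimal illustration on a small, made-up `ages` series (not the employee data), showing that a value on the open side of the first interval, or beyond the last edge, becomes `NaN`.

```python
import pandas as pd

# Hypothetical ages chosen to sit on or outside the bin edges
ages = pd.Series([18, 25, 26, 35, 60])
edges = [18, 25, 35, 50]

# right=True (default): intervals are (18, 25], (25, 35], (35, 50]
# 18 is on the open left edge, so it is NaN; 60 is beyond the last edge, so it is NaN too
closed_right = pd.cut(ages, bins=edges)

# include_lowest=True closes the first interval on the left, so 18 is kept
with_lowest = pd.cut(ages, bins=edges, include_lowest=True)

# right=False: intervals are [18, 25), [25, 35), [35, 50)
# 25 now belongs to the second bin instead of the first
closed_left = pd.cut(ages, bins=edges, right=False)

print(pd.DataFrame({
    'age': ages,
    'right=True': closed_right,
    'include_lowest=True': with_lowest,
    'right=False': closed_left,
}))
```

Checking `isna()` on a binned column is a quick way to spot values that fell outside the edges you defined.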


**Scenario 2: Binning 'Monthly_Salary' into 3 equal-width bins**

Here, pandas determines the bin edges automatically by dividing the full range of 'Monthly_Salary' (minimum to maximum) into 3 intervals of equal width. Equal width does not imply equal counts, as the value counts in the output show.


In [None]:
# Apply pd.cut() with a specified number of bins
df['Salary_Category_Equal_Width'] = pd.cut(df['Monthly_Salary'], bins=3, labels=['Low', 'Medium', 'High'])

print("\nDataFrame with 'Salary_Category_Equal_Width':")
print(df[['Employee_ID', 'Monthly_Salary', 'Salary_Category_Equal_Width']].head(10))

print("\nValue Counts for 'Salary_Category_Equal_Width':")
print(df['Salary_Category_Equal_Width'].value_counts())


DataFrame with 'Salary_Category_Equal_Width':
   Employee_ID  Monthly_Salary Salary_Category_Equal_Width
0            1            5000                         Low
1            2            7500                         Low
2            3            4800                         Low
3            4           12000                      Medium
4            5            9000                         Low
5            6           15000                        High
6            7           20000                        High
7            8            6000                         Low
8            9            8000                         Low
9           10           11000                      Medium

Value Counts for 'Salary_Category_Equal_Width':
Salary_Category_Equal_Width
Low       8
Medium    4
High      3
Name: count, dtype: int64
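
To see which edges pandas actually chose for the equal-width bins, `retbins=True` returns them alongside the binned values. The sketch below repeats the salary data in a standalone `salaries` series (a name introduced here for illustration) and compares the edges with an evenly spaced grid from `np.linspace`. With the default `right=True`, pandas lowers the first edge slightly so that the minimum value is included.

```python
import numpy as np
import pandas as pd

# Same salary values as in the sample DataFrame
salaries = pd.Series([5000, 7500, 4800, 12000, 9000, 15000, 20000, 6000,
                      8000, 11000, 5500, 10000, 13000, 5200, 18000])

# retbins=True returns a tuple: (binned values, array of bin edges)
binned, edges = pd.cut(salaries, bins=3, labels=['Low', 'Medium', 'High'], retbins=True)

# Evenly spaced grid between the minimum and the maximum, for comparison
grid = np.linspace(salaries.min(), salaries.max(), 4)

print(f"Edges chosen by pd.cut(): {edges}")
print(f"Evenly spaced grid:       {grid}")
print(f"Bin widths:               {np.diff(edges)}")
```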


## **4. Binning with `pd.qcut()`**

`pd.qcut()` is used when you want to segment data so that **each bin contains approximately the same number of observations (data points)**. This is useful for dealing with skewed distributions or creating percentile-based groups.
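
To make the difference from `pd.cut()` concrete, here is a small sketch on a hypothetical skewed series (`values` is made up for illustration and is not part of the employee data). A single large value stretches the equal-width bins so that almost everything lands in the first one, while the quantile-based bins stay roughly balanced.

```python
import pandas as pd

# Hypothetical skewed data: mostly small values plus one very large value
values = pd.Series([1, 2, 2, 3, 3, 4, 5, 6, 7, 100])

# Equal-width bins: the range 1..100 is split into 4 intervals of the same width
equal_width = pd.cut(values, bins=4).value_counts(sort=False)

# Quantile-based bins: edges are placed at the 25th, 50th and 75th percentiles
equal_freq = pd.qcut(values, q=4).value_counts(sort=False)

print("Counts per bin with pd.cut():")
print(equal_width)
print("\nCounts per bin with pd.qcut():")
print(equal_freq)
```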

**Scenario: Binning 'Years_at_Company' into 4 quantile-based groups (Quartiles)**

This will divide employees into four groups based on their years at the company, with each group having roughly the same number of employees.
"""

In [None]:

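# Apply pd.qcut() with 4 quantiles (quartiles)
# The 'duplicates' parameter controls what happens when the computed quantile edges are not unique
# (for example, when many rows share the same value).
# 'raise' (the default) throws a ValueError; 'drop' removes the duplicate edges, which yields fewer bins.
# If edges are dropped, the number of labels must match the reduced number of bins.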
df['Experience_Quartile'] = pd.qcut(df['Years_at_Company'], q=4, labels=['Q1_New', 'Q2_Mid', 'Q3_Experienced', 'Q4_Veteran'], duplicates='drop')

print("\nDataFrame with 'Experience_Quartile':")
print(df[['Employee_ID', 'Years_at_Company', 'Experience_Quartile']].head(10))

print("\nValue Counts for 'Experience_Quartile':")
print(df['Experience_Quartile'].value_counts())



DataFrame with 'Experience_Quartile':
   Employee_ID  Years_at_Company Experience_Quartile
0            1                 1              Q1_New
1            2                 2              Q1_New
2            3                 3              Q2_Mid
3            4                 7      Q3_Experienced
4            5                 5              Q2_Mid
5            6                10          Q4_Veteran
6            7                20          Q4_Veteran
7            8                 2              Q1_New
8            9                 4              Q2_Mid
9           10                12          Q4_Veteran

Value Counts for 'Experience_Quartile':
Experience_Quartile
Q1_New            5
Q4_Veteran        4
Q2_Mid            3
Q3_Experienced    3
Name: count, dtype: int64
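
The employee data has unique quartile edges, so `duplicates='drop'` has no visible effect above. The following sketch uses a hypothetical `tenure` series with many tied values to show what the parameter does when the edges collide: the default raises a `ValueError`, `'drop'` silently produces fewer bins, and a fixed list of four labels then no longer fits.

```python
import pandas as pd

# Hypothetical data with many ties: the 0%, 25% and 50% quantiles are all equal to 1
tenure = pd.Series([1, 1, 1, 1, 1, 1, 2, 3, 4, 10])

# Default behaviour (duplicates='raise'): non-unique edges cause a ValueError
try:
    pd.qcut(tenure, q=4)
    raised = False
except ValueError as exc:
    raised = True
    print(f"Default raised: {exc}")

# duplicates='drop' removes the repeated edges, leaving fewer than 4 bins
binned = pd.qcut(tenure, q=4, duplicates='drop')
print(f"Bins after dropping duplicate edges: {list(binned.cat.categories)}")

# Four labels no longer match the reduced number of bins, so this fails as well
try:
    pd.qcut(tenure, q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'], duplicates='drop')
    labels_mismatch = False
except ValueError:
    labels_mismatch = True
print(f"Fixed labels failed after dropping edges: {labels_mismatch}")
```

When ties are likely, one option is to call `pd.qcut()` without labels first, inspect how many bins survive, and only then decide on label names.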



In [None]:

print("\nCheck the bin edges created by pd.qcut():")
# Because labels were passed to pd.qcut(), .cat.categories only shows the label names,
# not the numeric intervals:
# print(df['Experience_Quartile'].cat.categories)

# To get the numeric edges, call pd.qcut() again without labels and with retbins=True
_, bins_qcut = pd.qcut(df['Years_at_Company'], q=4, retbins=True, duplicates='drop')
print(f"Actual bin edges for Experience_Quartile: {bins_qcut}")




Check the bin edges created by pd.qcut():
Actual bin edges for Experience_Quartile: [ 1.  2.  5.  9. 20.]
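
One practical use of these edges is to apply the same bins to data that arrives later, so that new rows are categorized consistently. The sketch below is one possible approach, not the only one: it reuses the quartile edges with `pd.cut()` on a hypothetical `new_years` series, and replaces the outer edges with infinity so that values outside the original range still receive a bin.

```python
import numpy as np
import pandas as pd

# Same 'Years_at_Company' values as in the sample DataFrame
years = pd.Series([1, 2, 3, 7, 5, 10, 20, 2, 4, 12, 1, 6, 8, 1, 15])
labels = ['Q1_New', 'Q2_Mid', 'Q3_Experienced', 'Q4_Veteran']

# Learn the quartile edges from the existing data
_, edges = pd.qcut(years, q=4, retbins=True)

# Hypothetical new employees, including one beyond the original maximum of 20
new_years = pd.Series([1, 3, 9, 25])

# Open up the outer edges so unseen extremes do not become NaN
open_edges = edges.copy()
open_edges[0], open_edges[-1] = -np.inf, np.inf

# pd.cut() applies the fixed edges; pd.qcut() would recompute them from the new data
new_binned = pd.cut(new_years, bins=open_edges, labels=labels)
print(new_binned.tolist())
```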



## **5. Why Binning is Important for Classification Models**

After Binning, your numerical features are now categorical. This can be beneficial for:

* **Interpretability:** It's easier to explain "employees in the 'Senior' age group" than "employees with ages between 51 and 58".
* **Handling Non-linearity:** Some models, linear ones in particular, struggle to capture non-linear relationships between a feature and the target. Binning lets such a model learn a separate effect for each group of values.
* **Robustness to Outliers:** Outliers might fall into the same bin as other values, reducing their disproportionate impact.
* **Feature Engineering:** The new categorical features can be directly used in models or one-hot encoded for models that require numerical inputs.

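Binning is a useful step in transforming raw data into a format that is easier to interpret and that can improve model performance in some classification tasks. It also discards detail within each bin, so it is worth checking whether it actually helps for your data.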
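
As a minimal sketch of the feature engineering point, the snippet below bins a small hypothetical `ages` series and turns the result into model-ready inputs in two ways: one-hot columns via `pd.get_dummies()`, and integer codes that preserve the bin order. The upper edge of 120 and the short label names are assumptions made for this example.

```python
import pandas as pd

# Hypothetical ages, binned into four ordered groups
ages = pd.Series([22, 30, 38, 55, 58], name='Age')
age_group = pd.cut(ages, bins=[0, 25, 35, 50, 120],
                   labels=['Young', 'Adult', 'Experienced', 'Senior'])

# Option 1: one-hot encoding, one indicator column per bin
one_hot = pd.get_dummies(age_group, prefix='Age')

# Option 2: ordinal codes (0, 1, 2, ...) that keep the order of the bins
ordinal = age_group.cat.codes

print(one_hot)
print(ordinal.tolist())
```

One-hot columns make no assumption about the spacing between bins, while ordinal codes are more compact and tend to suit tree-based models.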

---

## **Conclusion**

This notebook demonstrated the practical application of `pd.cut()` and `pd.qcut()` for Binning. By transforming continuous numerical data into discrete categories, we can enhance our data analysis and prepare features more effectively for machine learning models.

"""