# Discretization:-

- Discretization is the process of transforming continuous data into discrete bins or intervals. This is commonly used in machine learning to convert continuous features into categorical ones, which can simplify models or make them more interpretable. Below are some common discretization methods with Python code examples.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df=pd.DataFrame({'age':[1,2,3,4,5,10,15,20,24,25,32,31,40,41,45,50,51,55,60,61,65,63,70,72,75,76,80,82,83,85,88]})

In [3]:
df

   age
0    1
1    2
2    3
3    4
4    5
5   10
6   15
7   20
8   24
9   25


## Equal-Width Binning:-

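- Divides the range of values into `k` bins of equal width, `(max - min) / k`. For the ages here that is `(88 - 1) / 4 = 21.75`. The number of rows per bin can differ a lot when the data is skewed.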

In [4]:
bins=4
df['age_ewb']=pd.cut(df['age'],bins=bins)

In [5]:
print(df)

    age         age_ewb
0     1  (0.913, 22.75]
1     2  (0.913, 22.75]
2     3  (0.913, 22.75]
3     4  (0.913, 22.75]
4     5  (0.913, 22.75]
5    10  (0.913, 22.75]
6    15  (0.913, 22.75]
7    20  (0.913, 22.75]
8    24   (22.75, 44.5]
9    25   (22.75, 44.5]
10   32   (22.75, 44.5]
11   31   (22.75, 44.5]
12   40   (22.75, 44.5]
13   41   (22.75, 44.5]
14   45   (44.5, 66.25]
15   50   (44.5, 66.25]
16   51   (44.5, 66.25]
17   55   (44.5, 66.25]
18   60   (44.5, 66.25]
19   61   (44.5, 66.25]
20   65   (44.5, 66.25]
21   63   (44.5, 66.25]
22   70   (66.25, 88.0]
23   72   (66.25, 88.0]
24   75   (66.25, 88.0]
25   76   (66.25, 88.0]
26   80   (66.25, 88.0]
27   82   (66.25, 88.0]
28   83   (66.25, 88.0]
29   85   (66.25, 88.0]
30   88   (66.25, 88.0]
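
The bin edges shown above can also be reproduced by hand, which makes the "equal width" idea concrete. The following is a minimal sketch, assuming the same ages as in `df`; the names `ages`, `k`, `manual_edges`, and `pandas_edges` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

ages = pd.Series([1, 2, 3, 4, 5, 10, 15, 20, 24, 25, 32, 31, 40, 41, 45, 50,
                  51, 55, 60, 61, 65, 63, 70, 72, 75, 76, 80, 82, 83, 85, 88])

k = 4
# Equal-width edges: k + 1 evenly spaced points between min and max.
manual_edges = np.linspace(ages.min(), ages.max(), k + 1)

# retbins=True also returns the edges pandas actually used.
binned, pandas_edges = pd.cut(ages, bins=k, retbins=True)

print(manual_edges)
print(pandas_edges)  # first edge sits slightly below the minimum so that 1 is included
print(binned.value_counts().sort_index())  # rows per bin, in bin order
```

pandas lowers the first edge a little (0.913 instead of 1) because its intervals are open on the left; without that adjustment the minimum value would fall outside every bin.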


## Equal-Frequency Binning

- Divides the data into bins that each hold approximately the same number of data points, using quantiles as bin edges.

In [6]:
bins=4
df['age_wfb']=pd.qcut(df['age'],q=bins)  # qcut bins by quantiles; pd.cut here would just repeat equal-width binning

In [7]:
df

   age         age_ewb        age_wfb
0    1  (0.913, 22.75]  (0.999, 22.0]
1    2  (0.913, 22.75]  (0.999, 22.0]
2    3  (0.913, 22.75]  (0.999, 22.0]
3    4  (0.913, 22.75]  (0.999, 22.0]
4    5  (0.913, 22.75]  (0.999, 22.0]
5   10  (0.913, 22.75]  (0.999, 22.0]
6   15  (0.913, 22.75]  (0.999, 22.0]
7   20  (0.913, 22.75]  (0.999, 22.0]
8   24   (22.75, 44.5]   (22.0, 50.0]
9   25   (22.75, 44.5]   (22.0, 50.0]
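
To see what "equal frequency" means in practice, it can help to look at the quantile edges and the number of rows per bin. This is a small sketch using `pd.qcut` on the same ages; the variable names `ages`, `quartile_bins`, and `edges` are illustrative and not part of the notebook above.

```python
import pandas as pd

ages = pd.Series([1, 2, 3, 4, 5, 10, 15, 20, 24, 25, 32, 31, 40, 41, 45, 50,
                  51, 55, 60, 61, 65, 63, 70, 72, 75, 76, 80, 82, 83, 85, 88])

# q=4 asks for quartiles, so each bin holds roughly a quarter of the rows.
quartile_bins, edges = pd.qcut(ages, q=4, retbins=True)

print(edges)                                      # quantile-based bin edges
print(quartile_bins.value_counts().sort_index())  # rows per bin, in bin order
```

With 31 rows and 4 bins the counts cannot be identical, so one bin ends up with 7 rows and the others with 8.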


## K-Means Binning:-

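- Uses k-means clustering to create bins: values are grouped around `k` cluster centers, so bin boundaries follow natural gaps in the data. The cluster labels are arbitrary identifiers, not ordered bin numbers (below, the youngest ages get label 2).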

In [8]:
from sklearn.cluster import KMeans

kmeans=KMeans(n_clusters=4,random_state=42)
df['age_kmeans']=kmeans.fit_predict(df[['age']])

In [9]:
df

   age         age_ewb        age_wfb  age_kmeans
0    1  (0.913, 22.75]  (0.999, 22.0]           2
1    2  (0.913, 22.75]  (0.999, 22.0]           2
2    3  (0.913, 22.75]  (0.999, 22.0]           2
3    4  (0.913, 22.75]  (0.999, 22.0]           2
4    5  (0.913, 22.75]  (0.999, 22.0]           2
5   10  (0.913, 22.75]  (0.999, 22.0]           2
6   15  (0.913, 22.75]  (0.999, 22.0]           2
7   20  (0.913, 22.75]  (0.999, 22.0]           0
8   24   (22.75, 44.5]   (22.0, 50.0]           0
9   25   (22.75, 44.5]   (22.0, 50.0]           0
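
Because k-means labels come out in no particular order, it is often convenient to renumber them so that bin 0 holds the smallest values. The sketch below shows one way to do that by ranking the cluster centers; `n_init=10` and the names `order`, `rank_of_label`, and `age_kmeans_ordered` are assumptions made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

ages = pd.DataFrame({'age': [1, 2, 3, 4, 5, 10, 15, 20, 24, 25, 32, 31, 40, 41,
                             45, 50, 51, 55, 60, 61, 65, 63, 70, 72, 75, 76, 80,
                             82, 83, 85, 88]})

kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
raw_labels = kmeans.fit_predict(ages[['age']])

# Rank the cluster centers from smallest to largest, then map each raw
# label to its rank so that bin 0 holds the youngest ages.
order = np.argsort(kmeans.cluster_centers_.ravel())
rank_of_label = np.empty_like(order)
rank_of_label[order] = np.arange(len(order))
ages['age_kmeans_ordered'] = rank_of_label[raw_labels]

# Age range and row count of each ordered bin.
print(ages.groupby('age_kmeans_ordered')['age'].agg(['min', 'max', 'count']))
```

After the renumbering, the bin number increases with age, which makes the column usable as an ordinal feature.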


## Custom Bins:-

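- You define the bin edges manually. With `pd.cut`, each interval includes its right edge by default, and values outside the outermost edges become `NaN`.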

In [10]:
age_bins=[0,15,20,30,60,90]
df['age_cb']=pd.cut(df['age'],bins=age_bins)

In [11]:
df

   age         age_ewb        age_wfb  age_kmeans    age_cb
0    1  (0.913, 22.75]  (0.999, 22.0]           2   (0, 15]
1    2  (0.913, 22.75]  (0.999, 22.0]           2   (0, 15]
2    3  (0.913, 22.75]  (0.999, 22.0]           2   (0, 15]
3    4  (0.913, 22.75]  (0.999, 22.0]           2   (0, 15]
4    5  (0.913, 22.75]  (0.999, 22.0]           2   (0, 15]
5   10  (0.913, 22.75]  (0.999, 22.0]           2   (0, 15]
6   15  (0.913, 22.75]  (0.999, 22.0]           2   (0, 15]
7   20  (0.913, 22.75]  (0.999, 22.0]           0  (15, 20]
8   24   (22.75, 44.5]   (22.0, 50.0]           0  (20, 30]
9   25   (22.75, 44.5]   (22.0, 50.0]           0  (20, 30]
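
Custom bins are often easier to read with names instead of interval notation. The following sketch passes `labels` to `pd.cut` with the same edges as above; the label names and the small sample of ages are hypothetical and chosen only for illustration.

```python
import pandas as pd

sample_ages = pd.Series([1, 15, 20, 24, 45, 60, 88])

age_bins = [0, 15, 20, 30, 60, 90]
# Hypothetical names, one per interval (len(labels) == len(bins) - 1).
age_labels = ['child', 'teen', 'young adult', 'adult', 'senior']

labelled = pd.cut(sample_ages, bins=age_bins, labels=age_labels)
print(labelled.tolist())

# Values outside (0, 90] are not assigned to any bin and come back as NaN.
out_of_range = pd.cut(pd.Series([0, 95]), bins=age_bins, labels=age_labels)
print(out_of_range.isna().tolist())
```

Note that 15 and 60 fall into the lower of their two neighbouring bins because the intervals are closed on the right; passing `right=False` would flip that.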


## Using KBinsDiscretizer from Scikit-learn:-

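- `KBinsDiscretizer` offers several binning strategies through its `strategy` parameter: `'uniform'` (equal width), `'quantile'` (equal frequency), and `'kmeans'`. With `encode='ordinal'` it returns the bin index of each value as a float.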

In [12]:
from sklearn.preprocessing import KBinsDiscretizer

kbd=KBinsDiscretizer(n_bins=5,encode='ordinal',strategy='uniform')
df['age_kbd']=kbd.fit_transform(df[['age']]).ravel()  # fit_transform returns a 2-D array; flatten it to one column

In [13]:
df

   age         age_ewb        age_wfb  age_kmeans    age_cb  age_kbd
0    1  (0.913, 22.75]  (0.999, 22.0]           2   (0, 15]      0.0
1    2  (0.913, 22.75]  (0.999, 22.0]           2   (0, 15]      0.0
2    3  (0.913, 22.75]  (0.999, 22.0]           2   (0, 15]      0.0
3    4  (0.913, 22.75]  (0.999, 22.0]           2   (0, 15]      0.0
4    5  (0.913, 22.75]  (0.999, 22.0]           2   (0, 15]      0.0
5   10  (0.913, 22.75]  (0.999, 22.0]           2   (0, 15]      0.0
6   15  (0.913, 22.75]  (0.999, 22.0]           2   (0, 15]      0.0
7   20  (0.913, 22.75]  (0.999, 22.0]           0  (15, 20]      1.0
8   24   (22.75, 44.5]   (22.0, 50.0]           0  (20, 30]      1.0
9   25   (22.75, 44.5]   (22.0, 50.0]           0  (20, 30]      1.0
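
The ordinal codes alone do not show where the bin boundaries lie. A fitted `KBinsDiscretizer` exposes them through its `bin_edges_` attribute, so the three strategies can be compared side by side. This is a rough sketch; `n_bins=4` and the `edges` dictionary are choices made for this example, and recent scikit-learn versions may print warnings about changing defaults.

```python
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

ages = pd.DataFrame({'age': [1, 2, 3, 4, 5, 10, 15, 20, 24, 25, 32, 31, 40, 41,
                             45, 50, 51, 55, 60, 61, 65, 63, 70, 72, 75, 76, 80,
                             82, 83, 85, 88]})

edges = {}
for strategy in ['uniform', 'quantile', 'kmeans']:
    kbd = KBinsDiscretizer(n_bins=4, encode='ordinal', strategy=strategy)
    # fit_transform returns a 2-D array with one column per input feature.
    codes = kbd.fit_transform(ages[['age']]).ravel().astype(int)
    # bin_edges_ holds one array of edges per feature.
    edges[strategy] = kbd.bin_edges_[0]
    counts = pd.Series(codes).value_counts().sort_index().tolist()
    print(strategy, edges[strategy].round(2), counts)
```

Printing the edges and counts together makes it easy to see how the same `n_bins` leads to different boundaries under each strategy.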


## Key Notes:
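- Equal-width binning is simple and works best when values are spread fairly evenly; with skewed data or outliers, some bins can end up nearly empty.
- Equal-frequency binning gives each bin roughly the same number of samples; exact equality is not guaranteed when the row count is not divisible by the number of bins or when values are tied.
- K-means binning is data-driven and places bin boundaries between clusters of values; the cluster labels themselves are not ordered.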
- Custom bins are useful when domain knowledge determines bin boundaries.
- Always check binning results to ensure they align with the problem's needs.
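
One simple way to check a binning result is to summarise how many rows land in each bin and which values they cover. The helper below, `bin_summary`, is a hypothetical function written for this note, not part of pandas; it is applied here to the custom bin edges used earlier.

```python
import pandas as pd

def bin_summary(values: pd.Series, binned: pd.Series) -> pd.DataFrame:
    """Count, minimum and maximum of `values` within each bin (hypothetical helper)."""
    # observed=False keeps bins that received no rows, so empty bins stay visible.
    return values.groupby(binned, observed=False).agg(['count', 'min', 'max'])

ages = pd.Series([1, 2, 3, 4, 5, 10, 15, 20, 24, 25, 32, 31, 40, 41, 45, 50,
                  51, 55, 60, 61, 65, 63, 70, 72, 75, 76, 80, 82, 83, 85, 88])

custom = pd.cut(ages, bins=[0, 15, 20, 30, 60, 90])
summary = bin_summary(ages, custom)
print(summary)
```

For these edges the summary shows that the `(15, 20]` bin contains a single row, which is the kind of imbalance worth spotting before the feature is used in a model.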