# Discretization of Continuous Numerical Data

* Numerical data is divided into distinct intervals.
* Through discretization, numerical features can be converted into categorical data.
* `Discretization` refers to the process of transforming a continuous variable into a variable with two or more categories.

In [29]:
import numpy as np
import pandas as pd

In [30]:
# given data
age = np.array([[6],
                [12],
                [20],
                [36],
                [65]
                ])

## np.digitize()
`np.digitize()` returns the index of the interval each value falls into, given one or more threshold values.
* By default (`right=False`), each value passed to `bins` is the inclusive left boundary of an interval.
* With `bins=[20, 30, 64]`, the data is divided into 4 intervals: (-inf, 20), [20, 30), [30, 64), [64, inf).
* Setting `right=True` makes each boundary the inclusive right edge instead: (-inf, 20], (20, 30], (30, 64], (64, inf).


In [31]:
np.digitize(age, bins=[20, 30, 64])

array([[0],
       [0],
       [1],
       [2],
       [3]])

In [32]:
np.digitize(age, bins=[20,30,64], right=True)

array([[0],
       [0],
       [0],
       [2],
       [3]])
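
The integer indices returned by `np.digitize()` can be turned into readable categories by indexing into an array of labels. The following is a minimal sketch; the label names are hypothetical and not part of the original example.

```python
import numpy as np

age = np.array([6, 12, 20, 36, 65])
bins = [20, 30, 64]

# Hypothetical label names, one per interval (len(bins) + 1 labels).
labels = np.array(["under_20", "20s", "30_to_63", "64_plus"])

idx = np.digitize(age, bins)   # interval index for each age
age_groups = labels[idx]       # integer indexing maps indices to labels
print(age_groups)
```

Because there are always `len(bins) + 1` intervals, the label array needs exactly one more entry than `bins`.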

## np.where()

`np.where(condition, x, y)` can discretize a continuous variable by evaluating a condition element-wise and assigning one of two values.

* `condition`: The condition to evaluate.
* `x`: The value to assign where the condition is true.
* `y`: The value to assign where the condition is false.


In [33]:
x = np.arange(100)
np.where(x >= x.mean(), 'high', 'low')

array(['low', 'low', 'low', 'low', 'low', 'low', 'low', 'low', 'low',
       'low', 'low', 'low', 'low', 'low', 'low', 'low', 'low', 'low',
       'low', 'low', 'low', 'low', 'low', 'low', 'low', 'low', 'low',
       'low', 'low', 'low', 'low', 'low', 'low', 'low', 'low', 'low',
       'low', 'low', 'low', 'low', 'low', 'low', 'low', 'low', 'low',
       'low', 'low', 'low', 'low', 'low', 'high', 'high', 'high', 'high',
       'high', 'high', 'high', 'high', 'high', 'high', 'high', 'high',
       'high', 'high', 'high', 'high', 'high', 'high', 'high', 'high',
       'high', 'high', 'high', 'high', 'high', 'high', 'high', 'high',
       'high', 'high', 'high', 'high', 'high', 'high', 'high', 'high',
       'high', 'high', 'high', 'high', 'high', 'high', 'high', 'high',
       'high', 'high', 'high', 'high', 'high', 'high'], dtype='<U4')
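
`np.where()` chooses between only two values. When more than two categories are needed, one option is `np.select()`, which takes a list of conditions and a matching list of values. This is a sketch under assumed cut points (30 and 70), which are not from the original example.

```python
import numpy as np

x = np.arange(100)

# Conditions are checked in order; the first one that holds wins.
conditions = [x < 30, x < 70]
choices = ["low", "mid"]

# Values matching no condition (x >= 70) receive the default label.
levels = np.select(conditions, choices, default="high")
print(levels[[0, 29, 30, 69, 70, 99]])
```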

## sklearn.preprocessing.Binarizer()

`sklearn.preprocessing.Binarizer()` converts a continuous variable into a binary variable based on a specified threshold:

* If the value is less than or equal to the threshold, it is converted to 0.
* If the value is greater than the threshold, it is converted to 1.


In [34]:
from sklearn.preprocessing import Binarizer

In [35]:
# Split the data into two categories using 20 as the threshold.
binarizer = Binarizer(threshold=20)
binarizer.fit_transform(age)

array([[0],
       [0],
       [0],
       [1],
       [1]])
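
The thresholding rule can be checked with plain NumPy. The sketch below is an equivalent comparison, not how `Binarizer` is implemented internally; it shows that the value equal to the threshold (20) maps to 0.

```python
import numpy as np

age = np.array([[6], [12], [20], [36], [65]])
threshold = 20

# Strictly greater than the threshold -> 1, otherwise (including equal) -> 0.
binary = (age > threshold).astype(int)
print(binary.ravel())
```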

## sklearn.preprocessing.KBinsDiscretizer() - New in version 0.20.

This transformer divides continuous feature values into multiple intervals; the number of intervals is set with `n_bins` (the first argument).

* `encode`:
    * The default value is 'onehot', which returns a sparse matrix with one-hot encoding.
    * 'onehot-dense' returns a dense array with one-hot encoding.
    * 'ordinal' returns the interval index of each value as an integer-valued code.
* `strategy`:
    * 'quantile': Ensures that each interval contains approximately the same number of data points.
    * 'uniform': Ensures that each interval has the same width.
* After fitting, the interval boundaries can be checked using the `bin_edges_` attribute.


In [36]:
from sklearn.preprocessing import KBinsDiscretizer

In [37]:
kb = KBinsDiscretizer(4, encode='ordinal', strategy='quantile')

kb.fit_transform(age)

array([[0.],
       [1.],
       [2.],
       [3.],
       [3.]])

In [38]:
kb = KBinsDiscretizer(4, encode='onehot-dense', strategy='quantile')

kb.fit_transform(age)

array([[1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.],
       [0., 0., 0., 1.]])

In [39]:
kb = KBinsDiscretizer(4, encode='onehot-dense', strategy='uniform')

kb.fit_transform(age)



array([[1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.]])

In [40]:
# Boundary values of the intervals
kb.bin_edges_

array([array([ 6.  , 20.75, 35.5 , 50.25, 65.  ])], dtype=object)
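
The 'uniform' edges shown above are equally spaced between the minimum and maximum of the data. The following NumPy-only sketch reproduces those edges and the resulting interval indices; it illustrates the idea and is not scikit-learn's own code.

```python
import numpy as np

age = np.array([6, 12, 20, 36, 65])
n_bins = 4

# Equal-width edges spanning [min, max]: n_bins + 1 values.
edges = np.linspace(age.min(), age.max(), n_bins + 1)

# Assign each value to a bin using only the interior edges, so the
# minimum lands in bin 0 and the maximum in the last bin.
codes = np.digitize(age, edges[1:-1])

print(edges)
print(codes)
```

The third interval, [35.5, 50.25), contains only 36, and no value falls into the second interval, which matches the one-hot output above.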

### Recall Discretization of Continuous Numerical Data

* Numerical data is divided into distinct intervals.
* Through discretization, numerical features can be converted into categorical data.
* Discretization refers to the process of transforming a continuous variable into a variable with two or more categories.


## Binning

Binning converts numerical data into categorical data by grouping the numeric values into intervals (bins).

* `pd.cut()`: Divides the data into intervals, either by specifying the boundary values explicitly or by giving a number of equal-width intervals.
* `pd.qcut()`: Divides the data into a specified number of intervals so that each interval contains approximately the same number of data points, without explicitly specifying the boundary values.


### pd.cut() - Equal-Length Buckets Categorization

* The `pd.cut()` function takes the numerical data and the interval boundaries as arguments.
* When the input is a list or array, as here, the result is a `Categorical` object; when the input is a Series, the result is a Series with a categorical dtype.


In the example below, the 5 boundary values in `bins` define 4 intervals, and each element of `ages` is assigned to one of them. Because the boundaries are given explicitly, the intervals do not need to have equal length.


In [45]:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]

# Four intervals in total: (18, 25], (25, 35], (35, 60], (60, 100]
cats = pd.cut(ages, bins)
cats

[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64, right]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

By using `cats.codes`, you can see the integer index representing which interval each element of `ages` belongs to. 
For example, 20 belongs to the first interval (index 0), and 27 belongs to the second interval (index 1).


In [43]:
cats.codes

array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)

`cats.value_counts()` returns the number of elements in each interval, listed in category order.

In [46]:
cats.value_counts()

(18, 25]     5
(25, 35]     3
(35, 60]     3
(60, 100]    1
Name: count, dtype: int64

When calling `pd.cut()`, you can pass a list to the `labels` argument to give each interval a custom category name. The list must contain one label per interval.

In [47]:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
group_names = ["Youth", "YoungAdult", "MiddleAged", "Senior"]

pd.cut(ages, bins, labels=group_names)

['Youth', 'Youth', 'Youth', 'YoungAdult', 'Youth', ..., 'YoungAdult', 'Senior', 'MiddleAged', 'MiddleAged', 'YoungAdult']
Length: 12
Categories (4, object): ['Youth' < 'YoungAdult' < 'MiddleAged' < 'Senior']
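
The intervals produced so far are closed on the right, so the age 25 falls into (18, 25]. `pd.cut()` also accepts `right=False`, which closes each interval on the left instead. A short sketch using the same `ages` and `bins`:

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]

# right=False closes each interval on the left: [18, 25), [25, 35), ...
left_closed = pd.cut(ages, bins, right=False)

# The age 25 now falls in the second interval (code 1) instead of the first.
print(left_closed.categories)
print(left_closed.codes)
```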

#### Dividing into a Number of Intervals using pd.cut()

Instead of providing specific bin boundaries as the second argument, you can pass the number of intervals (bins) you want to create.
The function then divides the range between the minimum and maximum values of the data into equal-width intervals. The lowest boundary is extended slightly (by 0.1% of the range) so that the minimum value is included in the first interval.


In [None]:
data = np.random.rand(20)
data

array([0.6054644 , 0.24505555, 0.82241768, 0.87309714, 0.72911796,
       0.02596842, 0.75510777, 0.0203316 , 0.18695353, 0.45116648,
       0.72475169, 0.36621165, 0.67065616, 0.44007087, 0.49748874,
       0.50314348, 0.22130711, 0.81164844, 0.63236241, 0.73136337])

In [19]:
# Split the 20 values in data into 4 equal-width intervals;
# precision=2 sets the number of decimal places used for the interval boundaries.
cat_data = pd.cut(data, 4, precision=2)
cat_data

[(0.45, 0.66], (0.23, 0.45], (0.66, 0.87], (0.66, 0.87], (0.66, 0.87], ..., (0.45, 0.66], (0.019, 0.23], (0.66, 0.87], (0.45, 0.66], (0.66, 0.87]]
Length: 20
Categories (4, interval[float64, right]): [(0.019, 0.23] < (0.23, 0.45] < (0.45, 0.66] < (0.66, 0.87]]

In [48]:
cat_data.value_counts()

(0.019, 0.23]    4
(0.23, 0.45]     3
(0.45, 0.66]     5
(0.66, 0.87]     8
Name: count, dtype: int64
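
To see the exact boundaries that `pd.cut()` computes from a bin count, `retbins=True` returns them alongside the categories. The sketch below uses a small made-up array (not the random `data` above) so the result is reproducible.

```python
import numpy as np
import pandas as pd

values = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 8.0])

# retbins=True additionally returns the computed bin edges.
cats, edges = pd.cut(values, 4, retbins=True)

print(edges)           # 5 edges for 4 bins
print(np.diff(edges))  # widths; the first is slightly larger than the rest
print(cats.codes)
```

The first edge sits just below the minimum value, which is why the lowest interval in the output above starts at 0.019 rather than at the smallest data value.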

### pd.qcut() - Equal-Size Buckets Categorization

Pandas also provides a function called `qcut()`.

* It defines intervals based on the specified number of intervals (bins).
* Unlike `pd.cut()`, which divides the range between the minimum and maximum values into equal widths, `pd.qcut()` takes the data distribution into account: it uses quantiles as the bin boundaries, so each interval contains approximately the same number of data points.


In [49]:
data2 = np.random.randn(100)
data2

array([-0.34912103, -0.82275297, -1.03713009,  1.13109436,  0.80933678,
        0.66246984,  0.6980762 ,  0.7334708 ,  0.3258062 , -0.79843347,
        1.20716415,  1.33652198, -0.09499702,  1.20351958,  0.11837861,
       -0.8047348 ,  1.37070788,  0.69336737, -0.18915024,  1.50274268,
       -1.47504603, -0.0075665 , -1.25401873,  0.59201201, -2.01909322,
       -0.83664949,  2.6792604 , -1.11383775,  1.29440623,  0.58176616,
       -0.56155935, -1.27711394, -0.78523135, -0.73071255,  0.03741178,
       -1.12374532, -0.18750311,  1.02515031, -0.28438394, -1.41937138,
       -0.40980026, -0.25071172, -2.52227712, -0.68133111,  0.90837629,
       -0.56007251,  0.37027034, -0.43474412, -1.72636334,  2.64008542,
        0.39306375,  1.9153212 ,  1.25005798, -1.25321645, -0.25119106,
        1.10821247,  0.09818287,  0.2097303 , -0.44229231,  0.84498387,
       -0.95915654, -1.45279646, -1.84926377,  1.09359047, -0.9590751 ,
        0.83045742,  0.44466298, -0.47401628, -0.47356698,  0.61

In [51]:
cats = pd.qcut(data2, 4)
cats

[(-0.726, -0.188], (-2.5229999999999997, -0.726], (-2.5229999999999997, -0.726], (0.707, 2.679], (0.707, 2.679], ..., (0.707, 2.679], (-0.726, -0.188], (0.707, 2.679], (-0.726, -0.188], (-0.726, -0.188]]
Length: 100
Categories (4, interval[float64, right]): [(-2.5229999999999997, -0.726] < (-0.726, -0.188] < (-0.188, 0.707] < (0.707, 2.679]]

* `cats = pd.qcut(data2, 4)` divides the data into 4 intervals.
* Instead of simply dividing the range between the minimum and maximum values into four equal parts, it considers the distribution and splits the data into quartiles.
* Unlike `pd.cut()` with a bin count, the resulting intervals generally do not have the same width.


In [52]:
cats.value_counts()

(-2.5229999999999997, -0.726]    25
(-0.726, -0.188]                 25
(-0.188, 0.707]                  25
(0.707, 2.679]                   25
Name: count, dtype: int64
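
`pd.qcut()` also accepts a list of quantiles between 0 and 1 in place of a bin count, which allows intervals of unequal size. The following is a minimal sketch; the data, cut points, and label names are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

scores = np.arange(100)

# Quantile cut points instead of a bin count:
# bottom 10%, two middle bands of 40% each, top 10%.
quantiles = [0, 0.1, 0.5, 0.9, 1.0]
labels = ["bottom", "lower_mid", "upper_mid", "top"]  # hypothetical names

groups = pd.qcut(scores, quantiles, labels=labels)
counts = groups.value_counts()
print(counts)
```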