## Discretization

Discretization is the process of transforming continuous functions, models, variables and equations into discrete counterparts. This is important because some algorithms only work with discrete inputs, not continuous ones. Discretization reduces a variable to a limited number of possible states.


### Discretization helps handle outliers and may improve value spread in skewed variables

Discretization helps handle outliers by placing these values into the lowest or highest intervals, together with the remaining inlier values of the distribution. Thus, these outlier observations no longer differ from the rest of the values at the tails of the distribution, as they are now all together in the same interval / bucket. In addition, by creating appropriate bins or intervals, discretization can help spread the values of a skewed variable across a set of bins with a similar number of observations.


### Discretization approaches

There are several approaches to transform continuous variables into discrete ones. Discretization methods fall into two categories: **supervised and unsupervised**. Unsupervised methods use no information other than the variable distribution to create the contiguous bins in which the values will be placed. Supervised methods typically use the target to create the bins or intervals.

### In this class, we will:
- Learn more about equal width discretization;
- Learn how to do it with Pandas and NumPy;
- Learn how to do it with Feature-Engine;
- Learn how to do it with Scikit-learn.

## Equal width discretization

Equal width discretization divides the range of possible values into N bins of the same width. The width is determined by the range of values in the variable and the number of bins we wish to use to divide the variable:

width = (max value - min value) / N

where N is the number of bins or intervals.

For example, if the values of the variable vary between 0 and 100 and we want 5 bins, the width is (100-0) / 5 = 20. The bins are thus 0-20, 20-40, 40-60, 60-80 and 80-100. The first and final bins (0-20 and 80-100) can be expanded to accommodate outliers (that is, values under 0 or greater than 100 would be placed in those bins as well).
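The arithmetic above can be checked with a few lines of NumPy. This is only a minimal sketch: the helper name `equal_width_edges` is made up for illustration, and `np.linspace` is one of several ways to get the edges.

```python
import numpy as np


def equal_width_edges(min_value, max_value, n_bins):
    """Return the n_bins + 1 edges of equal width bins spanning [min_value, max_value]."""
    return np.linspace(min_value, max_value, n_bins + 1)


# the example from the text: values between 0 and 100, 5 bins
width = (100 - 0) / 5
edges = equal_width_edges(0, 100, 5)

print(width)  # 20.0
print(edges)  # edges at 0, 20, 40, 60, 80 and 100
```

Five bins need six edges, which is why the helper asks for `n_bins + 1` points.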

There is no rule of thumb to define N; it is something to determine experimentally.

Let's begin!

In [0]:
pip install -U feature-engine

In [0]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

from sklearn.preprocessing import KBinsDiscretizer

from feature_engine.discretisation import EqualWidthDiscretiser

In [0]:
# load the numerical variables of the Titanic Dataset

data = pd.read_csv('/dbfs/FileStore/CDS2023/titanic.csv',
                   usecols=['age', 'fare', 'survived'])

data.head()

In [0]:
# Let's separate into train and test set

X_train, X_test, y_train, y_test = train_test_split(
    data[['age', 'fare']],
    data['survived'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

First, let's deal with the missing values. The variables age and fare contain missing data, which we will fill with values drawn at random from the observed values of each variable, so we also learn one more way to do imputation.

In [0]:
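def impute_na(data, variable):
    # fill NA in `variable` with values sampled at random from the train set
    # note: this function relies on X_train being defined in the notebook
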
    df = data.copy()

    # random sampling
    df[variable + '_random'] = df[variable]

    # extract the random sample to fill the na
    random_sample = X_train[variable].dropna().sample(
        df[variable].isnull().sum(), random_state=0)

    # pandas needs to have the same index in order to merge datasets
    random_sample.index = df[df[variable].isnull()].index
    df.loc[df[variable].isnull(), variable + '_random'] = random_sample

    return df[variable + '_random']

In [0]:
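# replace NA in both train and test sets
# (impute_na returns a Series for the full dataset; the assignment aligns on
# the index, so each set keeps only its own rows)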

X_train['age'] = impute_na(data, 'age')
X_test['age'] = impute_na(data, 'age')

X_train['fare'] = impute_na(data, 'fare')
X_test['fare'] = impute_na(data, 'fare')

## Exercise:
Plot the age distribution.

## Equal width discretisation with pandas and NumPy

First we need to determine the intervals' edges or limits.

In [0]:
# let's capture the range of the variable age

age_range = X_train['age'].max() - X_train['age'].min()

age_range

In [0]:
# let's divide the range into 10 equal width bins

age_range / 10

The width of our intervals will be roughly 7 years.

In [0]:
# now let's capture the lower and upper boundaries

min_value = int(np.floor( X_train['age'].min()))
max_value = int(np.ceil( X_train['age'].max()))

# let's round the bin width
inter_value = int(np.round(age_range / 10))

min_value, max_value, inter_value

In [0]:
# let's capture the interval limits, so we can pass them to the pandas cut function to generate the bins

intervals = list(range(min_value, max_value + inter_value, inter_value))

intervals

In [0]:
# let's make labels to label the different bins

labels = ['Bin_' + str(i) for i in range(1, len(intervals))]

labels

In [0]:
# create binned age / discretise age

# create one column with labels
X_train['Age_disc_labels'] = pd.cut(x=X_train['age'],
                                    bins=intervals,
                                    labels=labels,
                                    include_lowest=True)

# and one with bin boundaries
X_train['Age_disc'] = pd.cut(x=X_train['age'],
                             bins=intervals,
                             include_lowest=True)

X_train.head(10)

We can see in the above output how, by discretising with equal width, we placed each age observation within one interval / bin. For example, age=13 was placed in the 7-14 interval, whereas age=30 was placed in the 28-35 interval.

When performing equal width discretisation, we guarantee that the intervals are all of the same length; however, there won't necessarily be the same number of observations in each of the intervals. See below:
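The same assignment can also be done with NumPy alone. The sketch below uses `np.digitize` on made-up ages and assumed edges of width 7 starting at 0; it is meant as an illustration of the idea, not as a replacement for the `pd.cut` cells above.

```python
import numpy as np

# assumed edges for illustration: 0, 7, 14, ..., 77
edges = np.arange(0, 78, 7)

# toy ages (not taken from the Titanic data)
ages = np.array([13, 30, 0, 77])

# right=True mirrors the right-closed intervals (a, b] that pd.cut uses by default
bin_idx = np.digitize(ages, edges, right=True)

# a value equal to the lowest edge gets index 0; clipping moves it into the first bin,
# similar to include_lowest=True in pd.cut
# caution: clipping also pushes values above the last edge into the last bin,
# whereas pd.cut would return NaN for them
bin_idx = np.clip(bin_idx, 1, len(edges) - 1)

labels = ['Bin_' + str(i) for i in bin_idx]
print(labels)  # ['Bin_2', 'Bin_5', 'Bin_1', 'Bin_11']
```

Here age 13 lands in the second bin (7-14) and age 30 in the fifth (28-35), matching the intervals discussed above.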

In [0]:
X_train.groupby('Age_disc')['age'].count()

In [0]:
X_train.groupby('Age_disc')['age'].count().plot.bar()
plt.xticks(rotation=45)
plt.ylabel('Number of observations per bin')

The majority of people on the Titanic were between 14 and 42 years of age.

Now, we can discretise Age in the test set, using the same interval boundaries that we calculated for the train set:

In [0]:
X_test['Age_disc_labels'] = pd.cut(x=X_test['age'],
                                   bins=intervals,
                                   labels=labels,
                                   include_lowest=True)

X_test['Age_disc'] = pd.cut(x=X_test['age'],
                            bins=intervals,
                            include_lowest=True)

X_test.head()
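One thing to keep in mind when reusing train boundaries: `pd.cut` returns NaN for any value that falls outside the outermost edges. The sketch below uses hypothetical edges and toy values (not the Titanic data) to show the effect, and one possible workaround, which is to open the first and last bins with infinite edges.

```python
import numpy as np
import pandas as pd

# hypothetical edges learned on a train set whose values ranged from 0 to 70
train_edges = [0, 14, 28, 42, 56, 70]

# toy "test" values: 80 is above the train maximum, -1 is below the train minimum
test_values = pd.Series([5, 30, 70, 80, -1])

# with the train edges as they are, out-of-range values are left unassigned (NaN)
binned = pd.cut(test_values, bins=train_edges, include_lowest=True)
print(binned.isna().tolist())  # [False, False, False, True, True]

# opening the outer edges places those values in the first / last bin
open_edges = [-np.inf] + train_edges[1:-1] + [np.inf]
binned_open = pd.cut(test_values, bins=open_edges)
print(binned_open.cat.codes.tolist())  # [0, 2, 4, 4, 0]
```

Whether to open the outer bins or treat out-of-range values as missing depends on the use case; this is a design choice rather than a rule.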

If the distributions in the train and test sets are similar, we should expect a similar proportion of observations in the different intervals in both sets. Let's see that below:

In [0]:
t1 = X_train.groupby(['Age_disc'])['age'].count() / len(X_train)
t2 = X_test.groupby(['Age_disc'])['age'].count() / len(X_test)

tmp = pd.concat([t1, t2], axis=1)
tmp.columns = ['train', 'test']
tmp.plot.bar()
plt.xticks(rotation=45)
plt.ylabel('Proportion of observations per bin')

## Equal width discretisation with Feature-Engine
With Feature-Engine we can automate the process for many variables in one line of code!

### Exercise:
Separate the dataset into train and test sets and replace missing values for "age" and "fare" by using the function *impute_na* we defined earlier in this class.

In [0]:
disc = EqualWidthDiscretiser(bins=10, variables = ['age', 'fare'])

disc.fit(X_train)

In [0]:
disc.binner_dict_

In the binner dict, we can see the limits of the intervals. For *age*, the value increases by approximately 7 years from one bin to the next.
For *fare*, it increases by around 50 dollars from one interval to the next. Within each variable the increase is always the same, that is, all bins have the same width.

In [0]:
# transform, train and test

train_t = disc.transform(X_train)
test_t = disc.transform(X_test)

In [0]:
train_t.head()

In [0]:
t1 = train_t.groupby(['age'])['age'].count() / len(train_t)
t2 = test_t.groupby(['age'])['age'].count() / len(test_t)

tmp = pd.concat([t1, t2], axis=1)
tmp.columns = ['train', 'test']
tmp.plot.bar()
plt.xticks(rotation=0)
plt.ylabel('Proportion of observations per bin')

In [0]:
t1 = train_t.groupby(['fare'])['fare'].count() / len(train_t)
t2 = test_t.groupby(['fare'])['fare'].count() / len(test_t)

tmp = pd.concat([t1, t2], axis=1)
tmp.columns = ['train', 'test']
tmp.plot.bar()
plt.xticks(rotation=0)
plt.ylabel('Proportion of observations per bin')

We can see quite clearly that equal width discretisation does not improve the value spread. The original variable fare was skewed, and the discrete variable is also skewed.
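The same behaviour can be reproduced on synthetic data. The sketch below draws a right-skewed sample from an exponential distribution (an assumption made purely for illustration; it is not the fare variable) and counts the observations per equal width bin with `np.histogram`.

```python
import numpy as np

# synthetic right-skewed sample, for illustration only
rng = np.random.default_rng(0)
skewed = rng.exponential(scale=30, size=1000)

# np.histogram with an integer `bins` splits [min, max] into equal width bins
counts, edges = np.histogram(skewed, bins=10)

# share of observations that fall in the first bin
share_first_bin = counts[0] / counts.sum()

print(np.diff(edges))   # all widths are the same
print(counts)           # most observations pile up in the lowest bins
print(share_first_bin)
```

The bins have identical widths, but the counts stay concentrated at the low end, so the discretised variable keeps the skew of the original.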

## Equal width discretisation with Scikit-learn

In [0]:
# Let's separate into train and test set

X_train, X_test, y_train, y_test = train_test_split(
    data[['age', 'fare']],
    data['survived'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

In [0]:
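# replace NA in both train and test sets
# (impute_na returns a Series for the full dataset; the assignment aligns on
# the index, so each set keeps only its own rows)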

X_train['age'] = impute_na(data, 'age')
X_test['age'] = impute_na(data, 'age')

X_train['fare'] = impute_na(data, 'fare')
X_test['fare'] = impute_na(data, 'fare')

Now, let's use the *KBinsDiscretizer* transformer from Scikit-learn.

In [0]:
disc = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform')

disc.fit(X_train[['age', 'fare']])

In [0]:
disc.bin_edges_
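To make `bin_edges_` and the ordinal output easier to read, here is a minimal sketch on a toy single-column array (made-up values, not the Titanic data). The names `X_toy` and `toy_disc` are used only in this example.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# toy single-column data spanning 0 to 100
X_toy = np.array([[0.0], [25.0], [50.0], [75.0], [100.0]])

toy_disc = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
toy_disc.fit(X_toy)

# bin_edges_ holds one array of edges per column
toy_edges = toy_disc.bin_edges_[0]
print(toy_edges)  # edges at 0, 20, 40, 60, 80 and 100

# transform returns the ordinal bin index (0 to n_bins - 1) of each value
toy_bins = toy_disc.transform(X_toy).ravel()
print(toy_bins)  # bins 0, 1, 2, 3 and 4

# values outside the fitted range end up in the first or last bin
outside_bins = toy_disc.transform(np.array([[-10.0], [150.0]])).ravel()
print(outside_bins)  # bins 0 and 4
```

Note that the output contains bin indices rather than interval labels, so the edges in `bin_edges_` are needed to map an index back to its interval.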

In [0]:
train_t = disc.transform(X_train[['age', 'fare']])

train_t = pd.DataFrame(train_t, columns = ['age', 'fare'])

train_t.head()

In [0]:
test_t = disc.transform(X_test[['age', 'fare']])

test_t = pd.DataFrame(test_t, columns = ['age', 'fare'])

In [0]:
t1 = train_t.groupby(['age'])['age'].count() / len(train_t)
t2 = test_t.groupby(['age'])['age'].count() / len(test_t)

tmp = pd.concat([t1, t2], axis=1)
tmp.columns = ['train', 'test']
tmp.plot.bar()
plt.xticks(rotation=0)
plt.ylabel('Proportion of observations per bin')

In [0]:
t1 = train_t.groupby(['fare'])['fare'].count() / len(train_t)
t2 = test_t.groupby(['fare'])['fare'].count() / len(test_t)

tmp = pd.concat([t1, t2], axis=1)
tmp.columns = ['train', 'test']
tmp.plot.bar()
plt.xticks(rotation=0)
plt.ylabel('Proportion of observations per bin')

**Authors:** Juliana Coelho, Camila Mizokami

**References:**

- https://pandas.pydata.org/docs/reference/api/pandas.cut.html
- http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html
- https://feature-engine.trainindata.com/en/0.6.x_a/discretisers/EqualWidthDiscretiser.html
- https://www.linkedin.com/advice/1/what-advantages-disadvantages-equal-width
- https://www.geeksforgeeks.org/binning-in-data-mining/