# Feature engineering and model selection - Variable transformation

In [1]:
%pylab
%matplotlib inline

%config InlineBackend.figure_format = 'retina'

import numpy as np

Using matplotlib backend: Qt5Agg
Populating the interactive namespace from numpy and matplotlib


### Discretization

Discretization is a transformation commonly used when preprocessing data for a machine learning model. It builds an ordinal feature from another kind of feature (generally a quantitative one).

Discretization can be done with the `cut()` function provided by `pandas`, like this:

In [2]:
import pandas as pd

In [3]:
print(pd.cut(np.array([.2, 1.4, 2.5, 6.2, 9.7, 2.1]), # Data to discretize
             3,                                       # Number of groups
             labels = ['good', 'medium', 'bad'],      # Group labels
             retbins = True))                         # Also return the bin edges

([good, good, good, medium, bad, good]
Categories (3, object): [good < medium < bad], array([ 0.1905    ,  3.36666667,  6.53333333,  9.7       ]))
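To see where the bin edges in this output come from, the following minimal sketch rebuilds them by hand. It assumes that `cut()` splits the range of the data into equal-width intervals and lowers the first edge by 0.1% of the range so that the minimum falls inside the first interval, which is consistent with the `0.1905` edge printed above. The names `values`, `n_bins`, `edges` and `groups` are only illustrative.

```python
import numpy as np
import pandas as pd

values = np.array([.2, 1.4, 2.5, 6.2, 9.7, 2.1])
n_bins = 3

# Equal-width edges between the minimum and the maximum of the data
edges = np.linspace(values.min(), values.max(), n_bins + 1)

# Lower the first edge by 0.1% of the range so the minimum is included
edges[0] -= 0.001 * (values.max() - values.min())

# Passing the explicit edges to cut() assigns each value to its interval
groups = pd.cut(values, bins=edges, labels=['good', 'medium', 'bad'])

print(edges)
print(list(groups))
```

Passing explicit edges through `bins` is also the way to use a threshold chosen by other means, such as the mean or the median.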


There are several ways to choose where to split a continuous feature: at the mean or at the median, for example. Another option is the Weight of Evidence (WoE), which is defined for each category $i$ as:

$$WoE_i = \ln \frac{R_i(F)}{R_i(T)}$$

where $R_i(T)$ is the rate of true target values in category $i$ and $R_i(F)$ is the rate of false target values in category $i$. Some references invert the ratio, which only flips the sign. WoE is not built into Python or `pandas`, so we have to implement it ourselves:

In [4]:
# data   -> DataFrame that holds both columns
# var    -> name of the categorical variable to evaluate
# target -> name of the boolean target column
def get_WoE(data, var, target):
    # Rows are the target values (False/True), columns are the categories of var
    crosstab = pd.crosstab(data[target], data[var])

    print('Getting the WoE value for the', var, 'variable:')

    for col in crosstab.columns:
        n_false = crosstab.loc[False, col]
        n_true = crosstab.loc[True, col]
        if n_true == 0:
            print('  The WoE value for', col, '[', n_false + n_true, '] is infinity.')
        else:
            # Log of the false-to-true count ratio within this category
            WoE = np.log(n_false / n_true)
            print('  The WoE value for', col, '[', n_false + n_true, '] is', WoE)

In [5]:
data = pd.DataFrame({'Value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'Target': [True, True, False, True, True, False, True, False, False, False]})

# Create two categorical variables from a continuous variable
data['Cat 1'] = data['Value'] > 3
data['Cat 2'] = data['Value'] > 6

# Get WoE of each categorical variable
get_WoE(data, 'Cat 1', 'Target')
get_WoE(data, 'Cat 2', 'Target')

Getting the WoE value for the Cat 1 variable:
  The WoE value for False [ 3 ] is -0.69314718056
  The WoE value for True [ 7 ] is 0.287682072452
Getting the WoE value for the Cat 2 variable:
  The WoE value for False [ 6 ] is -0.69314718056
  The WoE value for True [ 4 ] is 1.09861228867


In both cases the WoE value below the threshold is -0.69, while the value above the threshold is 0.29 in the first case and 1.10 in the second. The two WoE values are farther apart in the second configuration, so it separates the target classes better than the first one.
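The same comparison can be scripted for any candidate threshold. The sketch below is not part of the original notebook: `woe_by_category` is a hypothetical helper that returns the same false-to-true count ratio that `get_WoE` prints, and the column `Above median` is an illustrative split of `Value` at its median.

```python
import numpy as np
import pandas as pd

def woe_by_category(frame, var, target):
    """Hypothetical helper: ln(false count / true count) per category of var."""
    is_true = frame[target].astype(bool)
    n_true = is_true.groupby(frame[var]).sum()
    n_false = (~is_true).groupby(frame[var]).sum()
    # A category without true values yields inf (NumPy warns about the division)
    return np.log(n_false / n_true)

frame = pd.DataFrame({
    'Value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Target': [True, True, False, True, True, False, True, False, False, False],
})

# Split the continuous variable at its median (5.5 for this data)
frame['Above median'] = frame['Value'] > frame['Value'].median()

woe = woe_by_category(frame, 'Above median', 'Target')
print(woe)
```

On this toy data the two values come out at about -1.39 and 1.39, which is a wider gap than either of the thresholds tried above.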

The WoE value lets us measure the discriminating power of each category of a variable, but it says nothing about the variable as a whole. For that, we will use the Information Value (IV), which will be explained later.

### Normalization

We are going to explain three normalization methods provided by `scikit-learn`:

- `MinMaxScaler`, for homogeneous variables.
- `StandardScaler`, for normally distributed variables.
- `RobustScaler`, which uses the interquartile range.
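As a rough guide, and assuming each scaler is used with its default settings, the three transformations of a value $x$ can be written as:

$$x'_{\text{minmax}} = \frac{x - \min(x)}{\max(x) - \min(x)} \qquad x'_{\text{standard}} = \frac{x - \mu}{\sigma} \qquad x'_{\text{robust}} = \frac{x - \mathrm{median}(x)}{Q_3 - Q_1}$$

where $\mu$ and $\sigma$ are the mean and the standard deviation of the variable, and $Q_1$ and $Q_3$ are its first and third quartiles.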

In [6]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler

In [7]:
data = [35.6, -26.4, 54.9, -63.4, 37.9, 45.8, 44.3, 9.2, 35.5, -12.9]
data = np.array(data).reshape(-1,1)

minmax   = MinMaxScaler().fit(data)
standard = StandardScaler().fit(data)
robust   = RobustScaler().fit(data)

print('MinMaxScaler:', minmax.transform(data))
print('\nStandard:', standard.transform(data))
print('\nRobust:', robust.transform(data))


MinMaxScaler: [[ 0.83685545]
 [ 0.31276416]
 [ 1.        ]
 [ 0.        ]
 [ 0.85629755]
 [ 0.92307692]
 [ 0.9103973 ]
 [ 0.613694  ]
 [ 0.83601014]
 [ 0.42688081]]

Standard: [[ 0.53347433]
 [-1.15836242]
 [ 1.06012673]
 [-2.16800692]
 [ 0.59623601]
 [ 0.81180876]
 [ 0.77087723]
 [-0.18692067]
 [ 0.53074556]
 [-0.78997861]]

Robust: [[  9.98502247e-04]
 [ -1.23714428e+00]
 [  3.86420369e-01]
 [ -1.97603595e+00]
 [  4.69296056e-02]
 [  2.04692961e-01]
 [  1.74737893e-01]
 [ -5.26210684e-01]
 [ -9.98502247e-04]
 [ -9.67548677e-01]]
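The values printed above can be checked by hand with plain NumPy. This is only a sketch: it assumes the default settings of each scaler (a [0, 1] target range, the population standard deviation, and the 25th and 75th percentiles), and the variable names are illustrative.

```python
import numpy as np

data = np.array([35.6, -26.4, 54.9, -63.4, 37.9, 45.8, 44.3, 9.2, 35.5, -12.9])

# Min-max: map the minimum to 0 and the maximum to 1
minmax = (data - data.min()) / (data.max() - data.min())

# Standard: subtract the mean, divide by the population standard deviation
standard = (data - data.mean()) / data.std()

# Robust: subtract the median, divide by the interquartile range
q1, q3 = np.percentile(data, [25, 75])
robust = (data - np.median(data)) / (q3 - q1)

print('MinMax:  ', minmax[:3])
print('Standard:', standard[:3])
print('Robust:  ', robust[:3])
```

Because the robust version relies on the median and the quartiles, a single extreme value such as -63.4 affects it less than it affects the other two.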
