# OrdinalEncoderTransformer
This notebook demonstrates the `OrdinalEncoderTransformer` class. The transformer maps each level of a categorical column to an integer rank, with levels ranked in ascending order of their mean response (target) value.

In [1]:
import pandas as pd
import numpy as np
from pprint import pprint
from sklearn.datasets import load_diabetes

In [2]:
import tubular
from tubular.nominal import OrdinalEncoderTransformer

In [3]:
tubular.__version__

'0.3.0'

## Load diabetes dataset from sklearn
We also create a categorical column from `bmi` and treat it as unordered for demonstration purposes in this notebook.

In [4]:
diabetes, target = load_diabetes(return_X_y=True, as_frame=True)

In [5]:
diabetes["bmi_cut"] = pd.cut(diabetes["bmi"], bins=20)

In [6]:
diabetes["target"] = target

In [7]:
diabetes.head()

| | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 | bmi_cut | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.038076 | 0.05068 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019908 | -0.017646 | (0.0532, 0.0662] | 151.0 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.06833 | -0.092204 | (-0.0642, -0.0512] | 75.0 |
| 2 | 0.085299 | 0.05068 | 0.044451 | -0.005671 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002864 | -0.02593 | (0.0401, 0.0532] | 141.0 |
| 3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022692 | -0.009362 | (-0.012, 0.00102] | 206.0 |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031991 | -0.046641 | (-0.0381, -0.0251] | 135.0 |


In [8]:
diabetes["bmi_cut"].value_counts(dropna=False) / diabetes.shape[0]

(-0.012, 0.00102]     0.117647
(-0.0381, -0.0251]    0.110860
(-0.0251, -0.012]     0.110860
(-0.0512, -0.0381]    0.085973
(0.00102, 0.0141]     0.085973
(0.0141, 0.0271]      0.081448
(-0.0642, -0.0512]    0.063348
(0.0271, 0.0401]      0.063348
(0.0532, 0.0662]      0.063348
(0.0401, 0.0532]      0.049774
(-0.0772, -0.0642]    0.049774
(0.0662, 0.0793]      0.036199
(-0.0905, -0.0772]    0.022624
(0.0923, 0.105]       0.020362
(0.0793, 0.0923]      0.015837
(0.118, 0.131]        0.009050
(0.105, 0.118]        0.006787
(0.158, 0.171]        0.004525
(0.131, 0.144]        0.002262
(0.144, 0.158]        0.000000
Name: bmi_cut, dtype: float64

## Simple usage

### Initialising OrdinalEncoderTransformer
The `response_column` argument must be specified; it names the response column that the `fit` method will use. <br>
The response column must not contain nulls; otherwise an exception is raised.

In [9]:
oe_1 = OrdinalEncoderTransformer(
    columns="bmi_cut", response_column="target", copy=True, verbose=True
)

BaseTransformer.__init__() called


### OrdinalEncoderTransformer fit
The `fit` method must be run before the `transform` method. It calculates the mean of the response column for each level and then assigns integer values to the levels in ascending order of that mean. The mappings are stored in an attribute called `mappings`.

In [10]:
oe_1.fit(diabetes)

BaseTransformer.fit() called


OrdinalEncoderTransformer(columns=['bmi_cut'], response_column='target')

In [11]:
pprint(oe_1.mappings)

{'bmi_cut': {Interval(-0.0905, -0.0772, closed='right'): 2,
             Interval(-0.0772, -0.0642, closed='right'): 1,
             Interval(-0.0642, -0.0512, closed='right'): 3,
             Interval(-0.0512, -0.0381, closed='right'): 4,
             Interval(-0.0381, -0.0251, closed='right'): 5,
             Interval(-0.0251, -0.012, closed='right'): 6,
             Interval(-0.012, 0.00102, closed='right'): 7,
             Interval(0.00102, 0.0141, closed='right'): 8,
             Interval(0.0141, 0.0271, closed='right'): 11,
             Interval(0.0271, 0.0401, closed='right'): 10,
             Interval(0.0401, 0.0532, closed='right'): 9,
             Interval(0.0532, 0.0662, closed='right'): 12,
             Interval(0.0662, 0.0793, closed='right'): 13,
             Interval(0.0793, 0.0923, closed='right'): 16,
             Interval(0.0923, 0.105, closed='right'): 15,
             Interval(0.105, 0.118, closed='right'): 17,
             Interval(0.118, 0.131, closed='right'): 19
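
The ranking described in this section can be approximated with plain pandas. The following is a minimal sketch on toy data, not the tubular implementation: the column names `level` and `target` and the variable names are illustrative assumptions, and details such as tie handling may differ from what `fit` actually does.

```python
import pandas as pd

# Toy data; column names are illustrative and not taken from tubular.
df = pd.DataFrame(
    {
        "level": ["a", "a", "b", "b", "c", "c"],
        "target": [10.0, 20.0, 1.0, 3.0, 30.0, 50.0],
    }
)

# Mean response per level, sorted in ascending order.
level_means = df.groupby("level")["target"].mean().sort_values()

# Assign 1, 2, 3, ... to the levels in order of increasing mean response.
mapping = {level: rank for rank, level in enumerate(level_means.index, start=1)}

# Apply the mapping to the column, as a transform step would.
encoded = df["level"].map(mapping)

print(mapping)
print(encoded.tolist())
```

Here level `b` has the lowest mean response and level `c` the highest, so `b` receives the smallest integer and `c` the largest.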

### OrdinalEncoderTransformer transform

In [12]:
diabetes_2 = oe_1.transform(diabetes)

BaseTransformer.transform() called


In [13]:
diabetes_2["bmi_cut"].value_counts(dropna=False)

7     52
5     49
6     49
4     38
8     38
11    36
3     28
10    28
12    28
9     22
1     22
13    16
2     10
15     9
16     7
19     4
17     3
18     2
14     1
20     0
Name: bmi_cut, dtype: int64

## Transform with nulls
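Null values are not encoded by the `OrdinalEncoderTransformer`. As the cells in this section show, `transform` raises a `ValueError` when a column contains nulls that have no entry in the mapping. There are other transformers in the package which can be used to impute nulls first.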

In [14]:
diabetes["bmi_cut_str"] = diabetes["bmi_cut"].astype(str)

In [15]:
diabetes.loc[0, "bmi_cut_str"] = np.nan

In [16]:
diabetes["bmi_cut_str"].isnull().sum()

1

In [17]:
oe_2 = OrdinalEncoderTransformer(
    columns=["bmi_cut_str"], response_column="target", copy=True, verbose=True
)

BaseTransformer.__init__() called


In [18]:
oe_2.fit(diabetes)

BaseTransformer.fit() called


OrdinalEncoderTransformer(columns=['bmi_cut_str'], response_column='target')

In [19]:
try:
    oe_2.transform(diabetes)
except Exception as err:
    print(type(err), err)

<class 'ValueError'> nulls would be introduced into column bmi_cut_str from levels not present in mapping
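
One way to avoid this error is to replace nulls with an explicit level before encoding. The sketch below uses plain pandas on toy data rather than any tubular imputer; the placeholder label `"missing"` and the column names are arbitrary assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy data with one null level; column names are illustrative.
df = pd.DataFrame(
    {
        "level": ["a", np.nan, "b", "b"],
        "target": [10.0, 20.0, 1.0, 3.0],
    }
)

# Without imputation, the null has no entry in a mapping built from the
# observed levels, so mapping the column leaves a null behind.
observed_mapping = {"b": 1, "a": 2}
nulls_without_imputation = int(df["level"].map(observed_mapping).isnull().sum())

# Replace nulls with an explicit placeholder level (the label is arbitrary).
df["level"] = df["level"].fillna("missing")

# Build the target-mean ordering with the placeholder treated as its own level.
level_means = df.groupby("level")["target"].mean().sort_values()
mapping = {level: rank for rank, level in enumerate(level_means.index, start=1)}
encoded = df["level"].map(mapping)

print(nulls_without_imputation, int(encoded.isnull().sum()))
```

Treating the placeholder as its own level means it is ranked by its mean response like any other level; whether that is appropriate depends on the problem.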


## Weights column
A weights column can be specified with the `weights_column` argument when initialising the transformer. <br>
In that case `fit` calculates a weighted mean of the response for each level, using this column as the weights.

In [20]:
diabetes["weights"] = diabetes["bp"].abs()

In [21]:
oe_3 = OrdinalEncoderTransformer(
    columns="bmi_cut", response_column="target", weights_column="weights"
)

In [22]:
oe_3.fit(diabetes)

OrdinalEncoderTransformer(columns=['bmi_cut'], response_column='target',
                          weights_column='weights')

In [23]:
diabetes_4 = oe_3.transform(diabetes)

In [24]:
diabetes_4["bmi_cut"].value_counts(dropna=False)

13    52
6     49
5     49
4     38
9     38
17    36
3     28
8     28
12    28
11    22
1     22
16    16
2     10
7      9
14     7
10     4
15     3
18     2
19     1
20     0
Name: bmi_cut, dtype: int64
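
The effect of a weights column on the ordering can be illustrated with a small pandas sketch. It assumes the weighted mean of a level is the sum of weight times response divided by the sum of weights; the data and names are illustrative and this is not the tubular implementation.

```python
import pandas as pd

# Toy data; column names are illustrative.
df = pd.DataFrame(
    {
        "level": ["a", "a", "b", "b"],
        "target": [10.0, 20.0, 12.0, 14.0],
        "weights": [3.0, 1.0, 1.0, 1.0],
    }
)

# Weighted mean per level: sum(weight * response) / sum(weight).
weighted_sum = (df["target"] * df["weights"]).groupby(df["level"]).sum()
weight_total = df.groupby("level")["weights"].sum()
weighted_means = (weighted_sum / weight_total).sort_values()

# Unweighted mean per level, for comparison.
unweighted_means = df.groupby("level")["target"].mean().sort_values()

weighted_mapping = {
    level: rank for rank, level in enumerate(weighted_means.index, start=1)
}
unweighted_mapping = {
    level: rank for rank, level in enumerate(unweighted_means.index, start=1)
}

print(weighted_mapping)
print(unweighted_mapping)
```

In this toy data the heavy weight on the low response in level `a` pulls its weighted mean below that of level `b`, so the two levels swap ranks relative to the unweighted ordering. This is the same kind of change seen between the weighted and unweighted value counts in this notebook.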