# OrdinalEncoderTransformer
This notebook demonstrates the OrdinalEncoderTransformer class. The transformer maps each level of a categorical column to an integer rank, with levels ordered by their mean response (target) value in ascending order.

In [1]:
import pandas as pd
import numpy as np
from pprint import pprint

In [2]:
import tubular
from tubular.nominal import OrdinalEncoderTransformer

In [3]:
tubular.__version__

'0.2.11'

## Load Boston house price dataset from sklearn
Note that `prepare_boston_df` modifies the original Boston dataset to include null values and pandas categorical dtypes.

In [4]:
boston_df = tubular.testing.test_data.prepare_boston_df()
boston_df.shape

(506, 17)

In [5]:
boston_df.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,target,ZN_cat,CHAS_cat,RAD_cat
0,0.00632,18.0,2.31,0.0,0.538,6.575,,4.09,,296.0,15.3,396.9,4.98,24.0,18.0,0.0,
1,0.02731,,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14,21.6,,0.0,2.0
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,,17.8,392.83,4.03,34.7,0.0,0.0,2.0
3,,,2.18,0.0,0.458,,45.8,6.0622,3.0,222.0,18.7,,,33.4,,0.0,3.0
4,0.06905,0.0,2.18,0.0,0.458,,,6.0622,3.0,222.0,18.7,396.9,5.33,36.2,0.0,0.0,3.0


In [6]:
boston_df.dtypes

CRIM         float64
ZN            object
INDUS        float64
CHAS          object
NOX          float64
RM           float64
AGE          float64
DIS          float64
RAD           object
TAX          float64
PTRATIO      float64
B            float64
LSTAT        float64
target       float64
ZN_cat      category
CHAS_cat    category
RAD_cat     category
dtype: object

## Simple usage

### Initialising OrdinalEncoderTransformer
The `response_column` argument must be specified to set the response column that the fit method will use. <br>
The response column must not contain nulls; otherwise an exception will be raised.

In [7]:
oe_1 = OrdinalEncoderTransformer(
    columns = 'CHAS', 
    response_column = 'target',
    copy = True, 
    verbose = True
)

BaseTransformer.__init__() called


### OrdinalEncoderTransformer fit
The fit method must be run before the transform method. It calculates the average response column value for each level and then assigns ascending ordinal integer values accordingly. The mappings are stored in an attribute called `mappings`.

In [8]:
oe_1.fit(boston_df)

BaseTransformer.fit() called


OrdinalEncoderTransformer(columns=['CHAS'], response_column='target')

In [9]:
pprint(oe_1.mappings)

{'CHAS': {'0.0': 1, '1.0': 2}}
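
The ranking logic described for `fit` can be sketched in plain Python. This is only an illustration on toy data, not tubular's implementation: the function name `ordinal_mapping_by_target_mean` is hypothetical, and skipping null levels is an assumption made for the sketch.

```python
from collections import defaultdict


def ordinal_mapping_by_target_mean(levels, targets):
    """Map each level to 1..k, ordered by ascending mean target value.

    Hypothetical helper for illustration only (not part of tubular).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for level, target in zip(levels, targets):
        if level is None:
            # Assumption for this sketch: null levels receive no mapping.
            continue
        sums[level] += target
        counts[level] += 1
    # Mean target per level.
    means = {level: sums[level] / counts[level] for level in sums}
    # Sort levels by mean target, then number them from 1.
    ordered = sorted(means, key=means.get)
    return {level: rank for rank, level in enumerate(ordered, start=1)}


# Toy data, not the Boston dataset.
levels = ["a", "b", "a", "c", "b", None]
targets = [10.0, 30.0, 14.0, 20.0, 34.0, 99.0]

mapping = ordinal_mapping_by_target_mean(levels, targets)
print(mapping)  # {'a': 1, 'c': 2, 'b': 3}
```

Here the level means are 12 for `a`, 20 for `c` and 32 for `b`, so the ranks follow that order.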


### OrdinalEncoderTransformer transform

In [10]:
boston_df_2 = oe_1.transform(boston_df)

BaseTransformer.transform() called


In [11]:
boston_df_2['CHAS'].value_counts(dropna = False)

1    471
2     35
Name: CHAS, dtype: int64

## Transform with nulls
Null values are not encoded by the OrdinalEncoderTransformer: calling transform on a column that contains nulls raises an exception, as the example below shows. Other transformers in the package deal with imputation.

In [12]:
oe_2 = OrdinalEncoderTransformer(
    columns = ['RAD', 'ZN_cat'], 
    response_column = 'target',
    copy = True, 
    verbose = True
)

BaseTransformer.__init__() called


In [13]:
oe_2.fit(boston_df)

BaseTransformer.fit() called


OrdinalEncoderTransformer(columns=['RAD', 'ZN_cat'], response_column='target')

In [14]:
boston_df[['RAD', 'ZN_cat']].isnull().sum()

RAD       62
ZN_cat    62
dtype: int64

In [15]:
try:
    boston_df_3 = oe_2.transform(boston_df)
except Exception as e:
    print(e)

nulls would be introduced into column RAD from levels not present in mapping
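
One way to avoid this kind of error is to replace nulls with a placeholder level before fitting, so that the placeholder receives its own rank. The sketch below shows the idea in plain Python on toy data. The helpers `fill_nulls` and `encode`, the `"missing"` label, and the hand-written mappings are all hypothetical and are not part of tubular.

```python
PLACEHOLDER = "missing"  # hypothetical label for null levels


def fill_nulls(values, placeholder=PLACEHOLDER):
    """Replace None with a placeholder level (illustrative imputation)."""
    return [placeholder if v is None else v for v in values]


def encode(values, mapping):
    """Apply a level -> integer mapping, failing on unmapped levels."""
    unmapped = {v for v in values if v not in mapping}
    if unmapped:
        raise ValueError(
            f"levels not present in mapping: {sorted(map(str, unmapped))}"
        )
    return [mapping[v] for v in values]


# Toy column with one null, and a mapping as if learned from non-null rows.
rad = ["2.0", None, "3.0", "2.0"]
mapping = {"2.0": 1, "3.0": 2}

try:
    encode(rad, mapping)
except ValueError as e:
    error_message = str(e)
    print(error_message)

# After imputation the placeholder is an ordinary level. Its rank is written
# by hand here; in practice it would come from fitting on the imputed data.
mapping_with_placeholder = {"2.0": 1, "3.0": 2, PLACEHOLDER: 3}
encoded = encode(fill_nulls(rad), mapping_with_placeholder)
print(encoded)  # [1, 3, 2, 1]
```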


## Weights column
A weights column can be specified with the `weights_column` argument when initialising the transformer. <br>
When it is set, `fit` calculates a weighted mean of the response for each level, using this column as the weights.

In [16]:
oe_3 = OrdinalEncoderTransformer(
    columns = 'CHAS', 
    response_column = 'target',
    weights_column = 'CRIM'
)

In [17]:
oe_3.fit(boston_df)

OrdinalEncoderTransformer(columns=['CHAS'], response_column='target',
                          weights_column='CRIM')

In [18]:
boston_df_4 = oe_3.transform(boston_df)

In [19]:
boston_df_4['CHAS'].value_counts(dropna = False)

1    471
2     35
Name: CHAS, dtype: int64
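
In this example the weighted and unweighted encodings give the same counts, but weights can change the ordering of levels. The plain-Python sketch below illustrates this on toy data. It assumes the weighted mean is computed as sum(weight * target) / sum(weight) per level; the function name `weighted_ordinal_mapping` is hypothetical, and the sketch is not tubular's implementation.

```python
from collections import defaultdict


def weighted_ordinal_mapping(levels, targets, weights):
    """Rank levels 1..k by ascending weighted mean target.

    Hypothetical helper; assumes every level has a positive total weight.
    """
    weighted_sums = defaultdict(float)
    weight_totals = defaultdict(float)
    for level, target, weight in zip(levels, targets, weights):
        weighted_sums[level] += target * weight
        weight_totals[level] += weight
    means = {
        level: weighted_sums[level] / weight_totals[level]
        for level in weighted_sums
    }
    ordered = sorted(means, key=means.get)
    return {level: rank for rank, level in enumerate(ordered, start=1)}


# Toy data, not the Boston dataset.
levels = ["a", "a", "b", "b"]
targets = [10.0, 50.0, 20.0, 30.0]

# Equal weights reproduce the plain mean: a = 30, b = 25.
unweighted = weighted_ordinal_mapping(levels, targets, [1.0, 1.0, 1.0, 1.0])
print(unweighted)  # {'b': 1, 'a': 2}

# A heavy weight on the first row pulls a's mean down to 14, below b's 25.
weighted = weighted_ordinal_mapping(levels, targets, [9.0, 1.0, 1.0, 1.0])
print(weighted)  # {'a': 1, 'b': 2}
```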