# NominalToIntegerTransformer
This notebook demonstrates the `NominalToIntegerTransformer` class, which converts nominal columns to integer columns. Integers are inherently ordered, but the mapping assigns them to the nominal levels in no particular order, so the order of the encoded values carries no meaning.

In [1]:
import pandas as pd
import numpy as np
from pprint import pprint
from sklearn.datasets import load_diabetes

In [2]:
import tubular
from tubular.nominal import NominalToIntegerTransformer

In [3]:
tubular.__version__

'0.3.0'

## Load diabetes dataset from sklearn
We also create a categorical column from `bmi` and treat it as unordered for demonstration purposes in this notebook.

In [4]:
diabetes = load_diabetes(return_X_y=False, as_frame=True)["data"]

In [5]:
diabetes["bmi_cut"] = pd.cut(diabetes["bmi"], bins=20)

In [6]:
diabetes.head()

,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6,bmi_cut
0,0.038076,0.05068,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019908,-0.017646,"(0.0532, 0.0662]"
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.06833,-0.092204,"(-0.0642, -0.0512]"
2,0.085299,0.05068,0.044451,-0.005671,-0.045599,-0.034194,-0.032356,-0.002592,0.002864,-0.02593,"(0.0401, 0.0532]"
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022692,-0.009362,"(-0.012, 0.00102]"
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031991,-0.046641,"(-0.0381, -0.0251]"


In [7]:
diabetes["bmi_cut"].value_counts(dropna=False) / diabetes.shape[0]

(-0.012, 0.00102]     0.117647
(-0.0381, -0.0251]    0.110860
(-0.0251, -0.012]     0.110860
(-0.0512, -0.0381]    0.085973
(0.00102, 0.0141]     0.085973
(0.0141, 0.0271]      0.081448
(-0.0642, -0.0512]    0.063348
(0.0271, 0.0401]      0.063348
(0.0532, 0.0662]      0.063348
(0.0401, 0.0532]      0.049774
(-0.0772, -0.0642]    0.049774
(0.0662, 0.0793]      0.036199
(-0.0905, -0.0772]    0.022624
(0.0923, 0.105]       0.020362
(0.0793, 0.0923]      0.015837
(0.118, 0.131]        0.009050
(0.105, 0.118]        0.006787
(0.158, 0.171]        0.004525
(0.131, 0.144]        0.002262
(0.144, 0.158]        0.000000
Name: bmi_cut, dtype: float64

## Simple usage

### Initialising NominalToIntegerTransformer
Note: it is possible to convert multiple nominal columns to integers by passing a list of column names to the `columns` argument. The same `start_encoding` value is used for all columns (see the section "Starting the mapping at a different value").

In [8]:
nom_1 = NominalToIntegerTransformer(columns="bmi_cut", copy=True, verbose=True)

BaseTransformer.__init__() called
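
As a rough illustration of the note on multiple columns, the sketch below builds one mapping per column in plain pandas, with every column's codes starting from the same value. The toy DataFrame, the `build_mappings` helper and its first-appearance ordering are assumptions made for this example; this is not the tubular implementation.

```python
import pandas as pd

# Toy data with two nominal columns (hypothetical, not the diabetes data).
toy = pd.DataFrame(
    {
        "colour": ["red", "blue", "red", "green"],
        "size": ["S", "L", "M", "S"],
    }
)


def build_mappings(df, columns, start_encoding=0):
    """Return {column: {level: integer}} with a shared starting value.

    Levels are numbered in order of first appearance, which is an
    assumption of this sketch.
    """
    return {
        col: {
            level: code
            for code, level in enumerate(df[col].unique(), start=start_encoding)
        }
        for col in columns
    }


toy_mappings = build_mappings(toy, ["colour", "size"], start_encoding=0)
print(toy_mappings)
```

Each column gets its own dictionary, and both dictionaries begin at the same starting value.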


### NominalToIntegerTransformer fit
The `fit` method sets the mappings from nominal levels to integer values for each of the specified columns. It must be run before the `transform` method.
The mappings are stored in an attribute called `mappings`.

In [9]:
nom_1.fit(diabetes)

BaseTransformer.fit() called


NominalToIntegerTransformer(columns=['bmi_cut'])

In [10]:
pprint(nom_1.mappings)

{'bmi_cut': {Interval(-0.0905, -0.0772, closed='right'): 7,
             Interval(-0.0772, -0.0642, closed='right'): 11,
             Interval(-0.0642, -0.0512, closed='right'): 1,
             Interval(-0.0512, -0.0381, closed='right'): 5,
             Interval(-0.0381, -0.0251, closed='right'): 4,
             Interval(-0.0251, -0.012, closed='right'): 9,
             Interval(-0.012, 0.00102, closed='right'): 3,
             Interval(0.00102, 0.0141, closed='right'): 10,
             Interval(0.0141, 0.0271, closed='right'): 8,
             Interval(0.0271, 0.0401, closed='right'): 6,
             Interval(0.0401, 0.0532, closed='right'): 2,
             Interval(0.0532, 0.0662, closed='right'): 0,
             Interval(0.0662, 0.0793, closed='right'): 13,
             Interval(0.0793, 0.0923, closed='right'): 14,
             Interval(0.0923, 0.105, closed='right'): 16,
             Interval(0.105, 0.118, closed='right'): 15,
             Interval(0.118, 0.131, closed='right'): 12,
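
In the mapping printed above, the intervals containing the first five rows of `diabetes` receive the codes 0 to 4. That is consistent with levels being numbered in order of first appearance rather than by interval position, although the source does not confirm how tubular orders them. The sketch below shows the same effect on a small hypothetical series; the data and the `sketch_mapping` name are assumptions for illustration only.

```python
import pandas as pd

# Small hypothetical series, binned the same way as `bmi_cut`.
values = pd.Series([0.9, 0.1, 0.5, 0.1, 0.9])
binned = pd.cut(values, bins=3)

# Observed levels in order of first appearance (an assumption of this sketch).
levels = list(binned.unique())
sketch_mapping = {level: code for code, level in enumerate(levels)}

# Listing the codes by interval position shows they do not follow the bin order.
codes_by_position = [
    sketch_mapping[level] for level in sorted(levels, key=lambda iv: iv.left)
]
print(codes_by_position)  # [1, 2, 0]
```

The highest bin is seen first and so gets code 0, which is why the encoded integers should not be read as an ordering of the bins.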

### NominalToIntegerTransformer transform

In [11]:
diabetes_2 = nom_1.transform(diabetes)

BaseTransformer.transform() called


In [12]:
diabetes_2["bmi_cut"].value_counts(dropna=False).sort_index()

0.0     28
1.0     28
2.0     22
3.0     52
4.0     49
5.0     38
6.0     28
7.0     10
8.0     36
9.0     49
10.0    38
11.0    22
12.0     4
13.0    16
14.0     7
15.0     3
16.0     9
17.0     2
18.0     1
Name: bmi_cut, dtype: int64
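
Conceptually, the transform step is a per-row dictionary lookup. A minimal pandas sketch under that assumption follows. The `toy_mapping` dictionary and the column values are hypothetical, and the handling of a level that is missing from the mapping is plain `Series.map` behaviour, not necessarily what tubular does.

```python
import pandas as pd

# Hypothetical mapping of the kind produced by a fit step.
toy_mapping = {"red": 0, "blue": 1, "green": 2}

# Every level is in the mapping, so each row is replaced by its code.
seen = pd.Series(["blue", "red", "green", "red"])
encoded_seen = seen.map(toy_mapping)
print(encoded_seen.tolist())  # [1, 0, 2, 0]

# A level absent from the mapping has no code; Series.map returns NaN for it.
unseen = pd.Series(["red", "purple"])
encoded_unseen = unseen.map(toy_mapping)
print(encoded_unseen.isna().tolist())  # [False, True]
```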

## Starting the mapping at a different value
It is possible to start the encoding at another integer value by specifying the `start_encoding` argument when initialising the `NominalToIntegerTransformer`.

In [13]:
nom_2 = NominalToIntegerTransformer(
    columns="bmi_cut", start_encoding=-10, copy=True, verbose=False
)

In [14]:
nom_2.fit(diabetes)

NominalToIntegerTransformer(columns=['bmi_cut'], start_encoding=-10)

In [15]:
diabetes_3 = nom_2.transform(diabetes)

In [16]:
diabetes_3["bmi_cut"].value_counts(dropna=False).sort_index()

-10.0    28
-9.0     28
-8.0     22
-7.0     52
-6.0     49
-5.0     38
-4.0     28
-3.0     10
-2.0     36
-1.0     49
 0.0     38
 1.0     22
 2.0      4
 3.0     16
 4.0      7
 5.0      3
 6.0      9
 7.0      2
 8.0      1
Name: bmi_cut, dtype: int64
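
Comparing the two sets of counts, each code here equals the corresponding code from the simple usage section minus 10, with identical counts. The sketch below reproduces that offset with `enumerate(..., start=...)`; the level names and the first-appearance ordering are assumptions for illustration, not tubular internals.

```python
# Hypothetical levels, listed in order of first appearance.
levels = ["low", "high", "medium"]

default_codes = {level: code for code, level in enumerate(levels)}
offset_codes = {level: code for code, level in enumerate(levels, start=-10)}

print(default_codes)  # {'low': 0, 'high': 1, 'medium': 2}
print(offset_codes)   # {'low': -10, 'high': -9, 'medium': -8}

# Only the starting point changes; every level is shifted by the same amount.
shift_is_constant = all(
    offset_codes[level] - default_codes[level] == -10 for level in levels
)
print(shift_is_constant)  # True
```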