# NominalToIntegerTransformer
This notebook demonstrates the NominalToIntegerTransformer class. This transformer converts nominal columns to integer columns. Although integers are inherently ordered, the order of the resulting codes carries no meaning: nominal levels are mapped to integers in no particular order.

In [1]:
import pandas as pd
import numpy as np
from pprint import pprint

In [2]:
import tubular
from tubular.nominal import NominalToIntegerTransformer

In [3]:
tubular.__version__

'0.2.8'

## Load Boston house price dataset from sklearn
Note that the load_boston script modifies the original Boston dataset to include null values and pandas categorical dtypes.

In [4]:
boston_df = tubular.testing.test_data.prepare_boston_df()
boston_df.shape

(506, 17)

In [5]:
boston_df.head()

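| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | target | ZN_cat | CHAS_cat | RAD_cat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | NaN | 4.09 | NaN | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 | 18.0 | 0.0 | NaN |
| 1 | 0.02731 | NaN | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 | NaN | 0.0 | 2.0 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | NaN | 17.8 | 392.83 | 4.03 | 34.7 | 0.0 | 0.0 | 2.0 |
| 3 | NaN | NaN | 2.18 | 0.0 | 0.458 | NaN | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | NaN | NaN | 33.4 | NaN | 0.0 | 3.0 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | NaN | NaN | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 | 0.0 | 0.0 | 3.0 |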


In [6]:
boston_df.dtypes

CRIM         float64
ZN            object
INDUS        float64
CHAS          object
NOX          float64
RM           float64
AGE          float64
DIS          float64
RAD           object
TAX          float64
PTRATIO      float64
B            float64
LSTAT        float64
target       float64
ZN_cat      category
CHAS_cat    category
RAD_cat     category
dtype: object

## Simple usage

### Initialising NominalToIntegerTransformer

In [7]:
nom_1 = NominalToIntegerTransformer(columns = 'ZN_cat', copy = True, verbose = True)

BaseTransformer.__init__() called


### NominalToIntegerTransformer fit
The fit method must be run before the transform method. It sets the mappings from nominal levels to integer values for each of the specified columns.
The mappings are stored in an attribute called 'mapping_'.

In [8]:
nom_1.fit(boston_df)

BaseTransformer.fit() called


NominalToIntegerTransformer(columns=['ZN_cat'], start_encoding=0)

In [9]:
pprint(nom_1.mapping_)

{'ZN_cat': {18.0: 0,
            nan: 1,
            0.0: 2,
            12.5: 3,
            17.5: 10,
            20.0: 18,
            21.0: 5,
            22.0: 17,
            25.0: 9,
            28.0: 12,
            30.0: 16,
            33.0: 23,
            34.0: 22,
            35.0: 24,
            40.0: 19,
            45.0: 13,
            52.5: 20,
            55.0: 25,
            60.0: 14,
            70.0: 21,
            75.0: 4,
            80.0: 11,
            85.0: 7,
            90.0: 6,
            95.0: 15,
            100.0: 8}}
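
The mapping printed above is consistent with levels being numbered in order of first appearance in the column: 18.0, the null and 0.0 are the ZN_cat values in rows 0 to 2 of the data, and they receive the codes 0, 1 and 2. The following is a minimal plain-Python sketch of that idea under this assumption. It is not tubular's implementation; `fit_mapping` is a hypothetical helper and `None` stands in for a null value.

```python
def fit_mapping(values, start_encoding=0):
    """Assign each distinct level an integer code in order of first appearance.

    Hypothetical helper for illustration only, not part of tubular.
    """
    mapping = {}
    for value in values:
        if value not in mapping:
            # The next free code is the start value plus the number of levels seen so far.
            mapping[value] = start_encoding + len(mapping)
    return mapping


# First five ZN_cat values from boston_df.head(); None stands in for NaN.
zn_cat_head = [18.0, None, 0.0, None, 0.0]

mapping = fit_mapping(zn_cat_head)
print(mapping)  # {18.0: 0, None: 1, 0.0: 2}
```

The codes reflect only where each level first occurs in the data, which is why they should not be read as a meaningful ordering of the levels.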


### NominalToIntegerTransformer transform

In [10]:
boston_df_2 = nom_1.transform(boston_df)

BaseTransformer.transform() called


In [11]:
boston_df_2['ZN_cat'].value_counts(dropna = False).sort_index()

2      330
3        8
10       1
0        1
18      16
5        4
17       8
9       10
12       3
16       6
23       4
22       1
24       2
19       7
13       3
20       3
25       2
14       4
21       3
4        3
11      13
7        2
6        5
15       4
8        1
NaN     62
Name: ZN_cat, dtype: int64

## Starting the mapping at a different value
It is possible to start the encoding at another integer value by using the 'start_encoding' argument when initialising the NominalToIntegerTransformer.

In [12]:
nom_2 = NominalToIntegerTransformer(
    columns = 'ZN_cat',
    start_encoding = -10,
    copy = True, 
    verbose = False
)

In [13]:
nom_2.fit(boston_df)

NominalToIntegerTransformer(columns=['ZN_cat'], start_encoding=-10)

In [14]:
boston_df_3 = nom_2.transform(boston_df)

In [15]:
boston_df_3['ZN_cat'].value_counts(dropna = False).sort_index()

-8     330
-7       8
0        1
-10      1
8       16
-5       4
7        8
-1      10
2        3
6        6
13       4
12       1
14       2
9        7
3        3
10       3
15       2
4        4
11       3
-6       3
1       13
-3       2
-4       5
5        4
-2       1
NaN     62
Name: ZN_cat, dtype: int64
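
Comparing this output with the earlier one for the default start value, every code is shifted by the same constant: the level that was encoded as 2 is now -8, 3 is now -7, and 0 is now -10. A small plain-Python sketch of that arithmetic follows. The `encode` helper is hypothetical and not part of tubular; the levels are the first four entries of the fitted mapping shown earlier, with `None` standing in for the null.

```python
def encode(levels, start_encoding=0):
    """Number the given levels consecutively, beginning at start_encoding.

    Hypothetical helper for illustration only.
    """
    return {level: code for code, level in enumerate(levels, start=start_encoding)}


# First four levels of the fitted ZN_cat mapping; None stands in for NaN.
levels = [18.0, None, 0.0, 12.5]

default = encode(levels)                       # codes start at 0
shifted = encode(levels, start_encoding=-10)   # codes start at -10

# Changing the start value shifts every code by the same amount.
differences = {shifted[level] - default[level] for level in levels}
print(shifted)      # {18.0: -10, None: -9, 0.0: -8, 12.5: -7}
print(differences)  # {-10}
```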

## Transforming multiple columns
It is possible to convert multiple nominal columns to integer by specifying a list of columns in the columns argument. Note, the same start_encoding value will be used for all columns.

In [16]:
nom_3 = NominalToIntegerTransformer(
    columns = ['CHAS', 'RAD'],
    start_encoding = -1,
    copy = True, 
    verbose = False
)

In [17]:
nom_3.fit(boston_df)

NominalToIntegerTransformer(columns=['CHAS', 'RAD'], start_encoding=-1)

In [18]:
boston_df_4 = nom_3.transform(boston_df)

In [19]:
boston_df_4['CHAS'].value_counts(dropna = False).sort_index()

-1    471
 0     35
Name: CHAS, dtype: int64

In [20]:
boston_df_4['RAD'].value_counts(dropna = False).sort_index()

-1     62
 0     20
 1     35
 2    103
 3     88
 4     21
 5     22
 6     18
 7     13
 8    124
Name: RAD, dtype: int64
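
In the outputs above both columns begin at -1, as expected when one start_encoding value is shared. As a rough illustration, assuming each column gets its own mapping numbered by first appearance, the behaviour can be sketched in plain Python. `fit_mappings` is a hypothetical helper, not tubular's code; the rows are the CHAS and RAD values from boston_df.head(), with `None` standing in for the null.

```python
def fit_mappings(rows, columns, start_encoding=0):
    """Build one level-to-integer mapping per column, all starting at the same value.

    Hypothetical helper for illustration only, not part of tubular.
    """
    mappings = {}
    for column in columns:
        mapping = {}
        for row in rows:
            level = row[column]
            if level not in mapping:
                mapping[level] = start_encoding + len(mapping)
        mappings[column] = mapping
    return mappings


# CHAS and RAD values from the first five rows; None stands in for NaN.
rows = [
    {"CHAS": 0.0, "RAD": None},
    {"CHAS": 0.0, "RAD": 2.0},
    {"CHAS": 0.0, "RAD": 2.0},
    {"CHAS": 0.0, "RAD": 3.0},
    {"CHAS": 0.0, "RAD": 3.0},
]

mappings = fit_mappings(rows, ["CHAS", "RAD"], start_encoding=-1)
# Apply each column's own mapping to every row.
encoded = [{column: mappings[column][row[column]] for column in mappings} for row in rows]

print(mappings)    # {'CHAS': {0.0: -1}, 'RAD': {None: -1, 2.0: 0, 3.0: 1}}
print(encoded[3])  # {'CHAS': -1, 'RAD': 1}
```

Because the mappings are independent, the same integer can stand for unrelated levels in different columns: here -1 means 0.0 in CHAS and a null in RAD.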