# Ordinal Encoding in Scikit-Learn

## Defining some toy data

- We start by defining some toy data here:

In [1]:
import pandas as pd

In [2]:
feature_1 = [
    1.1, 2.1, 3.1, 4.2,
    5.1, 6.1, 7.1, 8.1,
    1.2, 2.1, 3.1, 4.1
]

feature_2 = [
    'b', 'b', 'b', 'b',
    'a', 'a', 'a', 'a',
    'c', 'c', 'c', 'c'
]

df = pd.DataFrame({'numerical': feature_1, 'categorical': feature_2})

df

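| | numerical | categorical |
|---|---|---|
| 0 | 1.1 | b |
| 1 | 2.1 | b |
| 2 | 3.1 | b |
| 3 | 4.2 | b |
| 4 | 5.1 | a |
| 5 | 6.1 | a |
| 6 | 7.1 | a |
| 7 | 8.1 | a |
| 8 | 1.2 | c |
| 9 | 2.1 | c |
| 10 | 3.1 | c |
| 11 | 4.1 | c |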


## Ordinal Encoding

- Usually, we use one-hot encoding if we have categorical data without ordering information, so-called nominal data.
- An example of such data is blood type (A, B, AB, or O)

- Ordinal encoding is typically used if we have categorical data with ordering information.
- One example of such data is T-shirt sizes (XS, S, M, L, or XL)
- Now, assume that the values in the "categorical" column above are ordered categories; we can then use the `OrdinalEncoder` to encode them as integers:

In [3]:
data = df['categorical'].values.reshape(-1, 1)
data

array([['b'],
       ['b'],
       ['b'],
       ['b'],
       ['a'],
       ['a'],
       ['a'],
       ['a'],
       ['c'],
       ['c'],
       ['c'],
       ['c']], dtype=object)

In [4]:
from sklearn.preprocessing import OrdinalEncoder

# One list of categories per column, given in ascending order
ode = OrdinalEncoder(
    categories=[['a', 'b', 'c']]
)

ode.fit_transform(data)

array([[1.],
       [1.],
       [1.],
       [1.],
       [0.],
       [0.],
       [0.],
       [0.],
       [2.],
       [2.],
       [2.],
       [2.]])

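- Notice that the categories above are listed in alphabetical order, so the ordinal encoder assumes `'a: 0 < b: 1 < c: 2'`. This is also the ordering the encoder uses by default (`categories='auto'`), because it sorts the unique values it finds in each column.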
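
- As a quick check of the default behavior, here is a minimal sketch that fits an `OrdinalEncoder` without a `categories` argument. The small `toy` array is made up for illustration and is not part of the dataset above:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy column, deliberately not in alphabetical order
toy = np.array([['b'], ['a'], ['c'], ['a']], dtype=object)

auto_enc = OrdinalEncoder()  # categories='auto' is the default
encoded = auto_enc.fit_transform(toy)

# The learned ordering is stored in `categories_`, one array per column
print(auto_enc.categories_[0])
print(encoded.ravel())
```

- The fitted `categories_` attribute is a convenient way to verify which ordering the encoder actually used.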

- If we want to change that and have an ordering assumption like `'b: 0 < a: 1 < c: 2'`, we can override the category ordering via the `categories` parameter as follows:

In [5]:
ode = OrdinalEncoder(
    categories=[['b', 'a', 'c']]
)

ode.fit_transform(data)

array([[0.],
       [0.],
       [0.],
       [0.],
       [1.],
       [1.],
       [1.],
       [1.],
       [2.],
       [2.],
       [2.],
       [2.]])
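
- To connect this back to the T-shirt example, the following sketch encodes sizes with an explicit ordering and then maps the integers back to labels with `inverse_transform`. The `sizes` array and the variable names are assumptions made for illustration. An explicit ordering matters here because sorting these labels alphabetically would give L, M, S, XL, XS, which does not reflect the actual size order:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical T-shirt sizes, one column with six samples
sizes = np.array(['M', 'XS', 'XL', 'S', 'L', 'M'], dtype=object).reshape(-1, 1)

# Spell out the ordering from smallest to largest
size_enc = OrdinalEncoder(categories=[['XS', 'S', 'M', 'L', 'XL']])
size_codes = size_enc.fit_transform(sizes)

# Map the integer codes back to the original labels
recovered = size_enc.inverse_transform(size_codes)

print(size_codes.ravel())
print(recovered.ravel())
```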

## Using the `OrdinalEncoder` when other columns are present

- Below is an example using a `ColumnTransformer` to transform only specific columns via the `OrdinalEncoder` when multiple columns are present.
- For instance, considering the toy dataset at the top, assume we only want to transform the "categorical" column but not the "numerical" column:

In [6]:
import sklearn  # imported so that %watermark can report its version
from sklearn.compose import ColumnTransformer


ode = OrdinalEncoder()

# Positional index of the column(s) to encode
categorical_features = [1]

col_transformer = ColumnTransformer(
    transformers=[
        ('cat', ode, categorical_features)],
    remainder='passthrough'  # keep all other columns unchanged
)

col_transformer.fit(df)
X_t = col_transformer.transform(df)

- Note that mixed-type data can require a few extra workarounds, such as a `FloatTransformer()`, which are explained [here](sklearn-onehot-encoding-mixedtype-df.ipynb).

In [7]:
X_t

array([[1. , 1.1],
       [1. , 2.1],
       [1. , 3.1],
       [1. , 4.2],
       [0. , 5.1],
       [0. , 6.1],
       [0. , 7.1],
       [0. , 8.1],
       [2. , 1.2],
       [2. , 2.1],
       [2. , 3.1],
       [2. , 4.1]])
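
- Note that the encoded column comes first in the output and the passed-through column follows. The sketch below shows one way to keep track of this: it selects the column by name instead of by position and puts the result back into a DataFrame. The three-row `toy_df` and the hard-coded output column names are assumptions made for illustration:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical three-row version of the toy data
toy_df = pd.DataFrame({
    'numerical': [1.1, 5.1, 1.2],
    'categorical': ['b', 'a', 'c'],
})

ct = ColumnTransformer(
    transformers=[('cat', OrdinalEncoder(), ['categorical'])],
    remainder='passthrough',
)
X_out = ct.fit_transform(toy_df)

# Transformed columns are placed first, remainder columns after them,
# so the column order differs from the input DataFrame
out_df = pd.DataFrame(X_out, columns=['categorical', 'numerical'])
print(out_df)
```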

In [8]:
%load_ext watermark
%watermark --iversions

sklearn: 1.0.2
pandas : 1.4.0

