https://towardsdatascience.com/smarter-ways-to-encode-categorical-data-for-machine-learning-part-1-of-3-6dca2f71b159

https://www.kaggle.com/code/discdiver/category-encoders-examples/notebook

# Classic Encoders
The first group of classic encoders can be seen on a continuum, from embedding the information in a single column (Ordinal) up to k columns (OneHot), where k is the number of distinct categories. These are very useful encodings for machine learning practitioners to understand.
- Ordinal: convert string labels to integer values 1 through k. Suited to ordinal data.
- OneHot: one column for each value, comparing it against all other values. Suited to nominal and ordinal data.
- Binary: convert each label to an integer, then write that integer in binary digits, one column per digit. Some information loss, but fewer dimensions than OneHot. Suited to ordinal data.
- BaseN: depending on the base, reproduces Ordinal, Binary, or a higher-base encoding. Suited to nominal and ordinal data. Doesn't add much functionality, so probably avoid it.
- Hashing: like OneHot but with fewer dimensions; some information loss due to collisions. Suited to nominal and ordinal data.
- Sum: just like OneHot, except that one value is held constant and encoded as -1 across all columns.
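
To make the one-column-to-k-columns continuum concrete, here is a minimal sketch that counts the columns each encoding needs for a feature with k distinct categories. The helper names `min_digits` and `column_counts` are hypothetical, the hashing width of 8 is an assumption chosen for illustration, and the binary figure is a minimum: an actual implementation may pad with extra columns.

```python
def min_digits(k, base=2):
    """Fewest base-`base` digits needed to write every integer from 1 to k."""
    digits = 1
    while base ** digits <= k:
        digits += 1
    return digits


def column_counts(k, hash_components=8):
    """Minimum columns for one feature with k distinct categories."""
    return {
        "ordinal": 1,                     # a single integer column
        "binary": min_digits(k, base=2),  # one column per binary digit
        "hashing": hash_components,       # fixed width, independent of k
        "one_hot": k,                     # one column per category
    }


for k in (3, 10, 1000):
    print(k, column_counts(k))
```

The gap between the binary and one-hot counts grows quickly with k, which is the main reason to consider the denser encodings for high-cardinality features.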

# Binary Encoder

In [2]:
# import the packages
import numpy as np
import pandas as pd
import category_encoders as ce

# make some data
df = pd.DataFrame({
 'color':["a", "b", "a", "c"], 
 'outcome':[1, 2, 3, 2]})

# split into X and y
X = df.drop('outcome', axis = 1)
y = df.drop('color', axis = 1)

# instantiate an encoder - here we use BinaryEncoder
ce_binary = ce.BinaryEncoder(cols = ['color'])

# fit and transform and presto, you've got encoded data
ce_binary.fit_transform(X, y)

| | color_0 | color_1 | color_2 |
|---|---|---|---|
| 0 | 0 | 0 | 1 |
| 1 | 0 | 1 | 0 |
| 2 | 0 | 0 | 1 |
| 3 | 0 | 1 | 1 |
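
The mechanics behind this output can be sketched without the library. The function below is a hypothetical helper, not the `category_encoders` implementation: it assumes categories are numbered from 1 in order of first appearance, and it takes a `width` argument so the rows can be padded to the three columns shown above.

```python
def binary_encode(values, width=None):
    """Encode each category as the binary digits of its integer code."""
    # Assumption: number categories 1..k in order of first appearance.
    codes = {}
    for v in values:
        codes.setdefault(v, len(codes) + 1)
    if width is None:
        # Smallest width that can hold the largest code.
        width = max(codes.values(), default=1).bit_length()
    return [[int(bit) for bit in format(codes[v], f"0{width}b")] for v in values]


rows = binary_encode(["a", "b", "a", "c"], width=3)
for row in rows:
    print(row)
```

With `width=None` the same data needs only two columns, since the largest code is 3.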


# Ordinal Encoder

In [3]:
import numpy as np
import pandas as pd              # version 0.23.4
import category_encoders as ce   # version 1.2.8
from sklearn.preprocessing import LabelEncoder

pd.options.display.float_format = '{:.2f}'.format # to make legible

# make some data
df = pd.DataFrame({
    'color':["a", "c", "a", "a", "b", "b"], 
    'outcome':[1, 2, 0, 0, 0, 1]})

# set up X and y
X = df.drop('outcome', axis = 1)
y = df.drop('color', axis = 1)

In [4]:
X

| | color |
|---|---|
| 0 | a |
| 1 | c |
| 2 | a |
| 3 | a |
| 4 | b |
| 5 | b |


In [5]:
ce_ord = ce.OrdinalEncoder(cols = ['color'])
ce_ord.fit_transform(X, y['outcome'])

| | color |
|---|---|
| 0 | 1 |
| 1 | 2 |
| 2 | 1 |
| 3 | 1 |
| 4 | 3 |
| 5 | 3 |
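
The output above is consistent with numbering the categories in order of first appearance (a, then c, then b). A minimal pure-Python sketch of that idea follows; `ordinal_encode` is a hypothetical helper for illustration, and the first-appearance ordering is an assumption inferred from this output rather than a documented guarantee.

```python
def ordinal_encode(values):
    """Map each distinct value to an integer 1..k in order of first appearance."""
    mapping = {}
    for v in values:
        mapping.setdefault(v, len(mapping) + 1)
    return [mapping[v] for v in values], mapping


codes, mapping = ordinal_encode(["a", "c", "a", "a", "b", "b"])
print(codes)
print(mapping)
```

Because the integers imply an order, this encoding is most defensible when the categories really are ordered; for ordered data it is usually worth supplying the mapping explicitly rather than relying on appearance order.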


# One-Hot Encoder

In [6]:
ce_one_hot = ce.OneHotEncoder(cols = ['color'])
ce_one_hot.fit_transform(X, y)

| | color_1 | color_2 | color_3 |
|---|---|---|---|
| 0 | 1 | 0 | 0 |
| 1 | 0 | 1 | 0 |
| 2 | 1 | 0 | 0 |
| 3 | 1 | 0 | 0 |
| 4 | 0 | 0 | 1 |
| 5 | 0 | 0 | 1 |
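
As a rough illustration of what produces this table, the sketch below builds one indicator column per distinct category. `one_hot_encode` is a hypothetical helper, and ordering the columns by first appearance is an assumption made so the rows line up with the output above.

```python
def one_hot_encode(values):
    """Return one 0/1 indicator column per distinct category."""
    # dict.fromkeys keeps the unique values in first-appearance order.
    categories = list(dict.fromkeys(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories


rows, categories = one_hot_encode(["a", "c", "a", "a", "b", "b"])
print(categories)
for row in rows:
    print(row)
```

Each row contains exactly one 1, so the number of columns equals the number of distinct categories.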


# BaseN Encoder

In [7]:
ce_basen = ce.BaseNEncoder(cols = ['color'])
ce_basen.fit_transform(X, y)

| | color_0 | color_1 | color_2 |
|---|---|---|---|
| 0 | 0 | 0 | 1 |
| 1 | 0 | 1 | 0 |
| 2 | 0 | 0 | 1 |
| 3 | 0 | 0 | 1 |
| 4 | 0 | 1 | 1 |
| 5 | 0 | 1 | 1 |
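
This output has the same digits a binary encoding would give, which suggests the encoder ran in base 2 here. The sketch below shows how the digits change with the base. `to_digits` and `base_n_encode` are hypothetical helpers, not the library's code; first-appearance numbering and the explicit `width` argument are assumptions of the sketch.

```python
def to_digits(n, base, width):
    """Write a non-negative integer as `width` digits in the given base."""
    if base < 2:
        raise ValueError("base must be at least 2 in this sketch")
    if n >= base ** width:
        raise ValueError("width is too small for this value")
    digits = []
    for _ in range(width):
        n, remainder = divmod(n, base)
        digits.append(remainder)
    return digits[::-1]  # most significant digit first


def base_n_encode(values, base=2, width=3):
    """Encode categories as base-`base` digits of their integer codes."""
    # Assumption: number categories 1..k in order of first appearance.
    codes = {}
    for v in values:
        codes.setdefault(v, len(codes) + 1)
    return [to_digits(codes[v], base, width) for v in values]


data = ["a", "c", "a", "a", "b", "b"]
base2_rows = base_n_encode(data, base=2, width=3)
base3_rows = base_n_encode(data, base=3, width=2)
print(base2_rows)
print(base3_rows)
```

A larger base packs the same codes into fewer columns, at the cost of columns that are no longer plain 0/1 indicators.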


# Hashing Encoder

In [8]:
ce_hash = ce.HashingEncoder(cols = ['color'])
ce_hash.fit_transform(X, y)

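| | col_0 | col_1 | col_2 | col_3 | col_4 | col_5 | col_6 | col_7 |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |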
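
The idea behind this output, often called the hashing trick, can be sketched in a few lines: hash each category, then take the result modulo a fixed number of columns. `hash_encode` below is a hypothetical helper; the choice of SHA-256 and the width of 8 are assumptions for illustration, so the column positions will not match the library output above.

```python
import hashlib


def hash_encode(values, n_components=8):
    """Place a 1 in the column chosen by hashing each category."""
    rows = []
    for v in values:
        digest = hashlib.sha256(str(v).encode("utf-8")).hexdigest()
        index = int(digest, 16) % n_components  # column for this category
        row = [0] * n_components
        row[index] = 1
        rows.append(row)
    return rows


rows = hash_encode(["a", "c", "a", "a", "b", "b"])
for row in rows:
    print(row)

# With a single column every category collides, which is the information loss
# the hashing approach trades for a fixed width.
collided = hash_encode(["a", "b", "c"], n_components=1)
```

The width stays fixed no matter how many categories appear, and no mapping has to be stored, but two different categories can land in the same column.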
