## One Hot, Dummy and Ordinal Encoding

References:

[1] https://machinelearningmastery.com/one-hot-encoding-for-categorical-data/

[2] https://datascience.stackexchange.com/questions/72343/encoding-with-ordinalencoder-how-to-give-levels-as-user-input

In [1]:
# imports used by the encoding examples below
from numpy import asarray
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder

### One Hot Encoding

For categorical variables with no ordinal relationship, an integer encoding is insufficient at best and misleading to the model at worst.

Forcing an ordinal relationship via an ordinal encoding and allowing the model to assume a natural ordering between categories may result in poor performance or unexpected results (predictions halfway between categories).

In this case, a one-hot encoding can be used in place of the integer representation: one new binary variable is added for each unique value in the variable, and exactly one of them is set to 1 for each row.

In [2]:
# define data
data = asarray([['red'], ['green'], ['blue'], ['red'], ['green'], ['blue']])
print('---------Data-------------')
print(data)
# define one hot encoding
# (scikit-learn >= 1.2 uses sparse_output; older releases call this argument sparse)
encoder = OneHotEncoder(sparse_output=False)
# transform data
onehot = encoder.fit_transform(data)

print('---------Encoding-------------')
print(onehot)

---------Data-------------
[['red']
 ['green']
 ['blue']
 ['red']
 ['green']
 ['blue']]
---------Encoding-------------
[[0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]]
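
The column order in the output above follows the sorted category labels, which is why “red” maps to the last column. A minimal sketch of how to inspect that mapping and reverse the encoding, assuming the same toy data; the variable names `categories` and `recovered` are illustrative only:

```python
from numpy import asarray
from sklearn.preprocessing import OneHotEncoder

data = asarray([['red'], ['green'], ['blue'], ['red'], ['green'], ['blue']])

# Keep the default sparse output and densify explicitly, so this sketch
# does not depend on the name of the output-format argument.
encoder = OneHotEncoder()
onehot = encoder.fit_transform(data).toarray()

# Categories are stored in sorted order, one array per input column.
categories = encoder.categories_[0].tolist()
print(categories)  # ['blue', 'green', 'red']

# Map the encoded rows back to the original labels.
recovered = encoder.inverse_transform(onehot)
print(recovered.ravel().tolist())
```

Checking `categories_` before interpreting the columns avoids guessing which binary variable belongs to which label.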


### Dummy Variable Encoding

The one-hot encoding creates one binary variable for each category. The problem is that this representation includes redundancy. For example, if we know that [1, 0, 0] represents “blue” and [0, 1, 0] represents “green”, we don’t need a third binary variable to represent “red”; we can instead encode “red” as 0 in both the “blue” and “green” variables, e.g. [0, 0]. This is called a dummy variable encoding, and it always represents C categories with C-1 binary variables.

In addition to being slightly less redundant, a dummy variable representation is required for some models. For example, in a linear regression model (and other regression models that have a bias term), the one-hot columns always sum to the bias column, so a one hot encoding will cause the matrix of input data to become singular: it cannot be inverted, and the linear regression coefficients cannot be calculated with the usual closed-form linear algebra. For these types of models a dummy variable encoding should be used instead.
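
The redundancy can be checked numerically. The sketch below is an illustration rather than part of the original example: it builds a design matrix with a bias column by hand (the names `X_onehot` and `X_dummy` are assumptions) and compares the matrix rank with the number of columns.

```python
import numpy as np

# One categorical feature with three levels, one-hot encoded (blue, green, red).
onehot = np.array([[0., 0., 1.],
                   [0., 1., 0.],
                   [1., 0., 0.],
                   [0., 0., 1.],
                   [0., 1., 0.],
                   [1., 0., 0.]])
intercept = np.ones((onehot.shape[0], 1))

# Bias column plus all three one-hot columns: 4 columns in total.
X_onehot = np.hstack([intercept, onehot])
# Bias column plus two columns (first category dropped): 3 columns in total.
X_dummy = np.hstack([intercept, onehot[:, 1:]])

rank_onehot = np.linalg.matrix_rank(X_onehot)
rank_dummy = np.linalg.matrix_rank(X_dummy)
print(X_onehot.shape[1], rank_onehot)  # 4 3 -> rank deficient
print(X_dummy.shape[1], rank_dummy)    # 3 3 -> full column rank
```

With the full one-hot encoding the rank is one less than the number of columns, which is the singularity described above; dropping one category restores full column rank.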

We can use the OneHotEncoder class to implement a dummy encoding as well as a one hot encoding.

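The “drop” argument can be set to indicate which category will become the one that is assigned all zero values, called the “baseline”. With drop='first', the first category in sorted order is dropped, so in the example below “blue” is the baseline.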

In [3]:
# define data
data = asarray([['red'], ['green'], ['blue'], ['red'], ['green'], ['blue']])
print('---------Data-------------')
print(data)
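# define dummy variable encoding (drop the first category)
# (scikit-learn >= 1.2 uses sparse_output; older releases call this argument sparse)
encoder = OneHotEncoder(drop='first', sparse=False) if not hasattr(OneHotEncoder(), 'sparse_output') else OneHotEncoder(drop='first', sparse_output=False)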
# transform data
onehot = encoder.fit_transform(data)

print('---------Encoding-------------')
print(onehot)

---------Data-------------
[['red']
 ['green']
 ['blue']
 ['red']
 ['green']
 ['blue']]
---------Encoding-------------
[[0. 1.]
 [1. 0.]
 [0. 0.]
 [0. 1.]
 [1. 0.]
 [0. 0.]]
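
The baseline does not have to be the first category. As a hedged sketch on the same toy data, the “drop” argument also accepts one category per input column, so “red” can be made the baseline explicitly:

```python
from numpy import asarray
from sklearn.preprocessing import OneHotEncoder

data = asarray([['red'], ['green'], ['blue'], ['red'], ['green'], ['blue']])

# One entry per input column: drop 'red' so it becomes the all-zero baseline.
encoder = OneHotEncoder(drop=['red'])
# Densify the default sparse output explicitly.
dummy = encoder.fit_transform(data).toarray()

# Remaining columns follow the sorted categories without 'red': blue, green.
print(dummy[:3].tolist())  # [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
```

Choosing the baseline deliberately can make regression coefficients easier to read, since each one is then interpreted relative to that category.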


### Ordinal Encoding

An ordinal encoding involves mapping each unique label to an integer value.

This type of encoding is really only appropriate if there is a known order relationship between the categories. Such a relationship exists for both variables in the example below (good < better < best and low < medium < high), and ideally it should be harnessed when preparing the data.

We can use the OrdinalEncoder from scikit-learn to encode each variable to integers. This is a flexible class that allows the order of the categories to be specified through the “categories” argument when such an order is known.

In [4]:
# define data
data = asarray([['good', 'low'], ['better', 'medium'], ['best', 'high'], ['good', 'low'], ['better', 'medium'], ['best', 'high']])
df = pd.DataFrame(data, columns=['Q1', 'Q2']) # for better visualisation
df

Unnamed: 0,Q1,Q2
0,good,low
1,better,medium
2,best,high
3,good,low
4,better,medium
5,best,high


In [5]:
# define ordinal encoding with an explicit category order per column
encoder = OrdinalEncoder(categories=[['good', 'better', 'best'], ['low', 'medium', 'high']])

# transform data
ordinal = encoder.fit_transform(df)
# alternatively, pass the NumPy array directly
# ordinal = encoder.fit_transform(data)

df2 = pd.DataFrame(ordinal, columns=['Q1', 'Q2'])
df2

Unnamed: 0,Q1,Q2
0,0.0,0.0
1,1.0,1.0
2,2.0,2.0
3,0.0,0.0
4,1.0,1.0
5,2.0,2.0
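
To see why specifying the order matters, the following sketch compares the default behaviour with an explicit order on a single column. It assumes only the three quality labels used above; `default_codes` and `explicit_codes` are illustrative names.

```python
from numpy import asarray
from sklearn.preprocessing import OrdinalEncoder

data = asarray([['good'], ['better'], ['best']])

# Without `categories`, levels are sorted alphabetically: best, better, good.
default_codes = OrdinalEncoder().fit_transform(data).ravel().tolist()

# With `categories`, the integers follow the order we supply.
explicit = OrdinalEncoder(categories=[['good', 'better', 'best']])
explicit_codes = explicit.fit_transform(data).ravel().tolist()

print(default_codes)   # [2.0, 1.0, 0.0]
print(explicit_codes)  # [0.0, 1.0, 2.0]
```

The default alphabetical order reverses the intended ranking here, so the explicit list is what preserves good < better < best.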
