# Introduction
<hr style="border:2px solid black"> </hr>


**What?** Three ways to encode categorical data: ordinal, one-hot and dummy variable encoding.



# Import modules
<hr style="border:2px solid black"> </hr>

In [11]:
from numpy import asarray
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import numpy as np

# Ordinal Encoding
<hr style="border:2px solid black"> </hr>


- By default, `OrdinalEncoder` assigns integers to the sorted unique labels of each column, not in the order in which they appear in the data. If a specific order is desired, it can be specified via the `categories` argument as a list with the rank order of all expected labels.
- For strings, this means the labels are sorted alphabetically, and to be honest this **may cause some confusion**.
- In my opinion, "ordinal" should imply a meaningful order by default.
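
To see which integer each label received, the fitted encoder can be inspected. The sketch below is only an illustration on the same colour data; the `mapping` dictionary is a name made up here for readability, while `categories_` is the attribute scikit-learn fills in during `fit`.

```python
from numpy import asarray
from sklearn.preprocessing import OrdinalEncoder

# same toy data as in the cells of this section
data = asarray([['red'], ['green'], ['blue']])
encoder = OrdinalEncoder()
encoder.fit(data)

# categories_ holds one array per input column;
# the position of a label in that array is the integer it is encoded as
learned = encoder.categories_[0]
mapping = {str(label): float(index) for index, label in enumerate(learned)}
print(mapping)
```

The labels come back in alphabetical order regardless of the order in which they were passed in.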



In [4]:
# define data
data = asarray([['red'], ['green'], ['blue']])
print(data)
# define ordinal encoding
encoder = OrdinalEncoder()
# transform data
result = encoder.fit_transform(data)
print(result)

[['red']
 ['green']
 ['blue']]
[[2.]
 [1.]
 [0.]]


In [5]:
# define data
data = asarray([['blue'], ['green'], ['red']])
print(data)
# define ordinal encoding
encoder = OrdinalEncoder()
# transform data
result = encoder.fit_transform(data)
print(result)

[['blue']
 ['green']
 ['red']]
[[0.]
 [1.]
 [2.]]



- Let us assume I have three categories, "low", "medium" and "high", and I'd like to keep this order.
- If I feed these to the encoder with its defaults, the alphabetical sort gives high=0, low=1 and medium=2, which is not what I want.
- I'd like to keep the order, meaning "low" ranks below "high". To do this, I have to use the `categories` argument.



In [6]:
# define data
data = asarray([['low'], ['medium'], ['high']])
print(data)
# define ordinal encoding
encoder = OrdinalEncoder()
# transform data
result = encoder.fit_transform(data)
print(result)

[['low']
 ['medium']
 ['high']]
[[1.]
 [2.]
 [0.]]


In [16]:
# define data
data = asarray([['low'], ['medium'], ['high']])
print(data)
# define ordinal encoding with an order NOW specified!
encoder = OrdinalEncoder(
    categories=[['low', 'medium', 'high']])
# transform data
result = encoder.fit_transform(data)
print(result)

[['low']
 ['medium']
 ['high']]
[[0.]
 [1.]
 [2.]]


# One Hot Encoding
<hr style="border:2px solid black"> </hr>


- We can demonstrate the usage of the OneHotEncoder on the color categories.
- First the categories are sorted, in this case alphabetically because they are strings, then binary variables are created for each category in turn.
- This means blue is represented as [1, 0, 0], with a 1 in the first binary variable; green is [0, 1, 0] and red is [0, 0, 1].
- The one-hot encoding **creates one binary variable for each category**.
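
The sort-then-binarise steps described above can be reproduced by hand with NumPy. This is a minimal sketch for intuition only, not how `OneHotEncoder` is implemented; the variable names are made up here.

```python
import numpy as np

labels = np.array(['red', 'green', 'blue'])

# sorted unique categories, plus the index of each label in that sorted list
categories, codes = np.unique(labels, return_inverse=True)

# row i of the identity matrix is the one-hot vector of category i,
# so indexing with the codes yields one row per input label
onehot_manual = np.eye(len(categories))[codes]
print(categories)
print(onehot_manual)
```

Each row has exactly one 1, placed in the column of its category in the sorted list.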


In [3]:
# define data
data = asarray([['red'], ['green'], ['blue']])
print(data)
# define one hot encoding
# (scikit-learn >= 1.2 renames `sparse` to `sparse_output`; `sparse` was removed in 1.4)
encoder = OneHotEncoder(sparse=False)
# transform data
onehot = encoder.fit_transform(data)
print(onehot)

[['red']
 ['green']
 ['blue']]
[[0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]]


In [17]:
# define data
data = asarray([['red'], ['green'], 
                ['blue'], ["black"]])
print(data)
# define one hot encoding
encoder = OneHotEncoder(sparse=False)
# transform data
onehot = encoder.fit_transform(data)
print(onehot)

[['red']
 ['green']
 ['blue']
 ['black']]
[[0. 0. 0. 1.]
 [0. 0. 1. 0.]
 [0. 1. 0. 0.]
 [1. 0. 0. 0.]]


# Dummy Variable Encoding
<hr style="border:2px solid black"> </hr>


- The one hot encoding creates one binary variable for each category. 
- The problem is that this representation includes redundancy. 
- For example, if we know that [1, 0, 0] represents blue and [0, 1, 0] represents green, we don’t need a third binary variable to represent red. Red is implied when the other two are both 0, so two variables are enough and red can be encoded with 0 values alone, e.g. [0, 0].
- This is called a dummy variable encoding, and always represents C categories with C − 1 binary variables.
- The `drop` argument indicates which category becomes the one that is assigned all zero values, called the “baseline”. We can set this to “first” so that the first category is used. Because the labels are sorted alphabetically, “blue” comes first and becomes the baseline.
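
One way to picture the dummy variable encoding is as a one-hot encoding with the first column removed. The sketch below does this by hand with NumPy on the colour data; it is an illustration under that assumption, with variable names made up here.

```python
import numpy as np

labels = np.array(['red', 'green', 'blue'])

# full one-hot encoding: one column per sorted category (blue, green, red)
categories, codes = np.unique(labels, return_inverse=True)
onehot_full = np.eye(len(categories))[codes]

# dropping the first column leaves C - 1 columns (green, red);
# the dropped category becomes the all-zero baseline
dummy_manual = onehot_full[:, 1:]
baseline = str(categories[0])
print(baseline)
print(dummy_manual)
```

The row for the baseline category is all zeros, and every other row still has a single 1.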



In [4]:
# define data
data = asarray([['red'], ['green'], ['blue']])
print(data)
# define one hot encoding
encoder = OneHotEncoder(drop='first',
                        sparse=False)
# transform data
onehot = encoder.fit_transform(data)
print(onehot)

[['red']
 ['green']
 ['blue']]
[[0. 1.]
 [1. 0.]
 [0. 0.]]


In [18]:
# define data
data = asarray([['red'], ['green'], 
                ['blue'], ["black"]])
print(data)
# define one hot encoding
encoder = OneHotEncoder(drop='first',
                        sparse=False)
# transform data
onehot = encoder.fit_transform(data)
print(onehot)

[['red']
 ['green']
 ['blue']
 ['black']]
[[0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 0. 0.]]


# FAQs
<hr style="border:2px solid black"> </hr>


- **What if I have hundreds of categories?** Or what if I concatenate many one hot encoded vectors into an input vector with many thousands of elements? A one hot encoding can be used with thousands or even tens of thousands of categories. Large input vectors sound intimidating, but models can generally handle them. An embedding is also worth trying; it offers the benefit of a smaller vector space (a projection), and the representation can carry more meaning.

- **What encoding technique is the best?** This is unknowable in advance, as it depends on both the dataset and the algorithm used. Test each technique (and more) on your dataset with your chosen model and discover what works best for your case.

- **What do you really want to avoid?** Do not mislead the model! For categorical variables where no ordinal relationship exists, an integer encoding may be insufficient at best and misleading to the model at worst. If we assign integers, the model may treat the categories as ranked from smallest to largest, which is not something we want. Forcing an ordinal relationship via an ordinal encoding and allowing the model to assume a natural ordering between categories may result in poor performance or unexpected results (*predictions halfway between categories*).

- **Is there any case where it is wrong to use one method instead of the other?** Yes, there is one. In a linear regression model (and other regression models that have a bias term), a one hot encoding makes the input data matrix rank-deficient: the one-hot columns sum to 1 in every row, so together they duplicate the bias column. The matrix that has to be inverted is then singular, and the linear regression coefficients cannot be calculated using linear algebra. For these types of models a dummy variable encoding must be used instead.
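
The last point can be checked numerically. The sketch below builds a tiny design matrix with an explicit bias column, once with a one-hot encoding and once with a dummy encoding, and compares the matrix rank to the number of columns. The data and variable names are made up for illustration.

```python
import numpy as np

# six samples drawn from three categories, already integer-coded
codes = np.array([0, 1, 2, 0, 1, 2])
onehot = np.eye(3)[codes]
bias = np.ones((len(codes), 1))

# bias + one-hot: 4 columns, but the one-hot columns sum to the bias column
X_onehot = np.hstack([bias, onehot])
# bias + dummy (first category dropped): 3 columns
X_dummy = np.hstack([bias, onehot[:, 1:]])

rank_onehot = np.linalg.matrix_rank(X_onehot)
rank_dummy = np.linalg.matrix_rank(X_dummy)
print(X_onehot.shape[1], rank_onehot)
print(X_dummy.shape[1], rank_dummy)
```

A rank below the number of columns means the columns are linearly dependent, which is what prevents the inversion. With the dummy encoding the rank equals the number of columns.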



# Example on a real dataset
<hr style="border:2px solid black"> </hr>


- The example below shows that sometimes you have to try both methods yourself and see whether there is any notable difference.
- Here, the same logistic regression model is trained once on ordinal-encoded inputs and once on one-hot-encoded inputs.
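
Since the two cells that follow differ only in the encoder, the comparison can also be wrapped in a small helper. This is a sketch under assumptions: `evaluate_encoder` is a hypothetical name, the data is a synthetic stand-in rather than the breast-cancer file, and `handle_unknown='ignore'` is added so that one-hot encoding does not fail on labels unseen during training.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder


def evaluate_encoder(encoder, X, y, seed=1):
    """Hypothetical helper: held-out accuracy of logistic regression with a given encoder."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.33, random_state=seed)
    # the encoder is fitted on the training split only, as in the cells below
    model = make_pipeline(encoder, LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))


# synthetic stand-in data (an assumption, NOT the breast-cancer dataset)
rng = np.random.default_rng(0)
colors = rng.choice(['blue', 'green', 'red'], size=300)
sizes = rng.choice(['low', 'medium', 'high'], size=300)
X_demo = np.column_stack([colors, sizes])
# the target depends only on the alphabetically middle colour
y_demo = (colors == 'green').astype(int)

scores = {
    'ordinal': evaluate_encoder(OrdinalEncoder(), X_demo, y_demo),
    'onehot': evaluate_encoder(OneHotEncoder(handle_unknown='ignore'), X_demo, y_demo),
}
for name, score in scores.items():
    print('%s accuracy: %.2f' % (name, score * 100))
```

The synthetic target is deliberately built so that the integer ranking of the colours carries no useful signal; on another dataset the ranking between the two encoders may well be reversed, which is the point of trying both.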



## OrdinalEncoder Transform

In [5]:
# load the dataset
dataset = read_csv('../DATASETS/breast-cancer.csv', header=None)
# retrieve the array of data
data = dataset.values
# separate into input and output columns
X = data[:, :-1].astype(str)
y = data[:, -1].astype(str)
# split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1) 
# ordinal encode input variables
ordinal_encoder = OrdinalEncoder()
ordinal_encoder.fit(X_train)
X_train = ordinal_encoder.transform(X_train)
X_test = ordinal_encoder.transform(X_test)
# label encode target variable
label_encoder = LabelEncoder()
label_encoder.fit(y_train)
y_train = label_encoder.transform(y_train)
y_test = label_encoder.transform(y_test)
# define the model
model = LogisticRegression()
# fit on the training set
model.fit(X_train, y_train)
# predict on test set
yhat = model.predict(X_test)
# evaluate predictions
accuracy = accuracy_score(y_test, yhat) 
print('Accuracy: %.2f' % (accuracy*100))

Accuracy: 75.79


## OneHotEncoder Transform

In [6]:
# load the dataset
dataset = read_csv('../DATASETS/breast-cancer.csv', header=None)
# retrieve the array of data
data = dataset.values
# separate into input and output columns
X = data[:, :-1].astype(str)
y = data[:, -1].astype(str)
# split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1) 
# one-hot encode input variables
onehot_encoder = OneHotEncoder()
onehot_encoder.fit(X_train)
X_train = onehot_encoder.transform(X_train)
X_test = onehot_encoder.transform(X_test)
# label encode target variable
label_encoder = LabelEncoder()
label_encoder.fit(y_train)
y_train = label_encoder.transform(y_train)
y_test = label_encoder.transform(y_test)
# define the model
model = LogisticRegression()
# fit on the training set
model.fit(X_train, y_train)
# predict on test set
yhat = model.predict(X_test)
# evaluate predictions
accuracy = accuracy_score(y_test, yhat)
print('Accuracy: %.2f' % (accuracy*100))

Accuracy: 70.53


# References
<hr style="border:2px solid black"> </hr>


- https://machinelearningmastery.com/how-to-prepare-categorical-data-for-deep-learning-in-python/
- https://datascience.stackexchange.com/questions/39317/difference-between-ordinalencoder-and-labelencoder/64177

