In [None]:
Label Encoder

A label encoder converts string categorical values into integer ids. For example, if you have a country field
with the values "Turkey", "France", "Italy" and "Germany", the label encoder assigns one integer id to each
distinct country. After encoding, the field holds the ids 0, 1, 2 and 3, one per country, and each id maps
back to its original country name.
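
As a rough illustration of this mapping, the sketch below builds the ids by hand in plain Python. It assumes the ids follow the sorted (alphabetical) order of the unique values, which is how scikit-learn's LabelEncoder orders its classes. The variable names are illustrative only, and "France" is repeated to show that equal values share one id.

```python
# Hand-rolled label encoding (illustrative sketch, not the scikit-learn implementation)
countries = ["Turkey", "France", "Italy", "Germany", "France"]

# Assumption: ids follow the sorted order of the unique values
classes = sorted(set(countries))
country_to_id = {name: idx for idx, name in enumerate(classes)}

# Encode: replace each country name with its integer id
encoded = [country_to_id[name] for name in countries]

# Decode: each id is simply an index into the sorted class list
decoded = [classes[idx] for idx in encoded]

print(classes)
print(encoded)
print(decoded)
```

Decoding is just a list lookup, which is why the ids can always be turned back into the original names.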

In [20]:
from numpy import array
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
# define example
data = ['cold', 'cold', 'warm', 'cold', 'hot', 'hot', 'warm', 'cold', 'warm', 'hot']
values = array(data)
print(values)

['cold' 'cold' 'warm' 'cold' 'hot' 'hot' 'warm' 'cold' 'warm' 'hot']


In [21]:
# integer encode
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(values)
print(integer_encoded)

[0 0 2 0 1 1 2 0 2 1]
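
The ids in this output follow the alphabetical order of the labels ('cold' = 0, 'hot' = 1, 'warm' = 2), so they do not follow the temperature scale. If the order of the ids should carry meaning, one option is to write the mapping by hand. The sketch below compares the two in plain Python; `temperature_order` is a hypothetical mapping chosen for this example.

```python
# Alphabetical ids versus a hand-written ordinal mapping (illustrative sketch)
data = ['cold', 'cold', 'warm', 'cold', 'hot', 'hot', 'warm', 'cold', 'warm', 'hot']

# Alphabetical ids: cold=0, hot=1, warm=2
alphabetical = {label: idx for idx, label in enumerate(sorted(set(data)))}

# Hypothetical hand-written order that follows temperature: cold < warm < hot
temperature_order = {'cold': 0, 'warm': 1, 'hot': 2}

alphabetical_ids = [alphabetical[label] for label in data]
ordinal_ids = [temperature_order[label] for label in data]

print(alphabetical_ids)
print(ordinal_ids)
```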


In [26]:
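# binary (one-hot) encode
# OneHotEncoder expects a 2D array, so reshape the ids into a single column
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
# sparse_output=False returns a dense array (scikit-learn < 1.2 used sparse=False)
onehot_encoder = OneHotEncoder(sparse_output=False)
onehot_encoded = onehot_encoder.fit_transform(integer_encoded)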
print(onehot_encoded)

[[ 1.  0.  0.]
 [ 1.  0.  0.]
 [ 0.  0.  1.]
 [ 1.  0.  0.]
 [ 0.  1.  0.]
 [ 0.  1.  0.]
 [ 0.  0.  1.]
 [ 1.  0.  0.]
 [ 0.  0.  1.]
 [ 0.  1.  0.]]
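
To make the structure of this matrix concrete, here is a minimal plain-Python sketch that builds the same rows by hand and then recovers the integer ids from them. It assumes the ids run from 0 to the number of classes minus one; the helper `one_hot` is a hypothetical name, not part of scikit-learn.

```python
# Hand-rolled one-hot encoding (illustrative sketch in plain Python)
integer_encoded = [0, 0, 2, 0, 1, 1, 2, 0, 2, 1]
n_classes = max(integer_encoded) + 1  # assumes ids run from 0 to n_classes - 1

def one_hot(index, size):
    """Return a list of zeros with a single 1.0 at position `index`."""
    row = [0.0] * size
    row[index] = 1.0
    return row

onehot = [one_hot(i, n_classes) for i in integer_encoded]

# Decode: the position of the 1.0 in each row is the original integer id
recovered = [row.index(1.0) for row in onehot]

print(onehot[:3])
print(recovered == integer_encoded)
```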


In [13]:
%%html
<p style="text-align:justify">
<strong>One Hot Encoding</strong><br><br>
One hot encoding converts categorical variables into a numeric form that ML algorithms
can consume, which often helps them make better predictions.
A one hot encoding represents each categorical value as a binary vector.
This first requires that the categorical values be mapped to integer values.
Then, each integer value is represented as a binary vector that is all zeros except
at the index of the integer, which is marked with a 1.
<img src="https://www.codeproject.com/KB/AI/1146582/onehotencoding.PNG" alt="One hot encoding illustration">
</p>
<p style="text-align:justify">
A one hot encoding allows the representation of categorical data to be more expressive.
<br><br>
Many machine learning algorithms cannot work with categorical data directly. The categories must be converted
into numbers. This is required for both input and output variables that are categorical.
<br><br>
We could use an integer encoding directly, rescaled where needed. This may work for problems where there is a
natural ordinal relationship between the categories, and in turn the integer values, such as the temperature
labels ‘cold’, ‘warm’, and ‘hot’.
<br><br>
Problems arise when there is no ordinal relationship: letting the model lean on an ordering that does not
exist can hurt learning. An example is the labels ‘dog’ and ‘cat’.
<br><br>
In these cases, we would like to give the network more expressive power to learn a probability-like number
for each possible label value. This can make the problem easier for the network to model.
When a one hot encoding is used for the output variable, it may also offer a more nuanced set of predictions
than a single label.
</p>
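
One way to see why integer ids can mislead a model on unordered labels is to compare distances between labels under each encoding. The sketch below is a plain-Python illustration under stated assumptions: 'bird' is a hypothetical third label added to the 'dog' and 'cat' example, and the helper names are made up for this example.

```python
# Compare distances between labels under the two encodings (illustrative sketch)
labels = ['cat', 'dog', 'bird']  # 'bird' is a hypothetical extra label

# Integer encoding: ids in sorted order (bird=0, cat=1, dog=2)
integer_code = {label: idx for idx, label in enumerate(sorted(labels))}

# One-hot encoding: one position per label
one_hot_code = {
    label: [1 if idx == integer_code[label] else 0 for idx in range(len(labels))]
    for label in labels
}

def int_distance(a, b):
    # Absolute difference between the two integer ids
    return abs(integer_code[a] - integer_code[b])

def one_hot_distance(a, b):
    # Number of positions where the two vectors differ
    return sum(x != y for x, y in zip(one_hot_code[a], one_hot_code[b]))

print(int_distance('bird', 'cat'), int_distance('bird', 'dog'))
print(one_hot_distance('bird', 'cat'), one_hot_distance('bird', 'dog'))
```

With integer ids, 'bird' looks closer to 'cat' than to 'dog' purely because of alphabetical order. With one-hot vectors, every pair of distinct labels is equally far apart, so no ordering is implied.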