In [1]:
from sklearn.preprocessing import OneHotEncoder

# Categorical Features

Some algorithms only work when they are supplied with numerical features. For example, if you have a dataset of cars and would like to recommend similar cars, colour may be one of the aspects you look at. A dataset containing car colours would need to be encoded into numerical features before being fed into the learning algorithm.

## One-Hot Encoding

Let's say our dataset of car colours contains the following colours:
- Red
- Green
- Blue
- Orange

This feature has 4 possible values. One way to convert it to numerical form is one-hot encoding, where each colour becomes a feature vector:
- Red = [1, 0, 0, 0]
- Green = [0, 1, 0, 0]
- Blue = [0, 0, 1, 0]
- Orange = [0, 0, 0, 1]

The length of the feature vector equals the number of colours in this feature (4), and the position of the 1 in the vector identifies the colour. Note that this enlarges your dataset by increasing its dimensionality: one categorical column becomes one column per category.
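To make the mapping concrete, here is a minimal hand-rolled sketch in plain Python. The helper `one_hot` is a hypothetical name used only for illustration and is not part of sklearn; it assumes the category order given in the list above.

```python
colours = ['Red', 'Green', 'Blue', 'Orange']

def one_hot(colour, categories):
    """Return a list with a 1 at the colour's position and 0 elsewhere."""
    # list.index raises ValueError if the colour is not a known category
    position = categories.index(colour)
    return [1 if i == position else 0 for i in range(len(categories))]

for colour in colours:
    print(colour, one_hot(colour, colours))
```

Each vector has as many entries as there are categories, which is why the dimensionality grows with the number of distinct values.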

The method is implemented in sklearn:

In [4]:
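# handle_unknown='ignore' encodes categories not seen during fit as all zeros
enc = OneHotEncoder(handle_unknown='ignore')
# Each row is one sample with a single categorical feature (the colour)
X = [['Red'], ['Blue'], ['Red'], ['Orange'], ['Green']]
enc.fit(X)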

OneHotEncoder(handle_unknown='ignore')

In [5]:
enc.categories_

[array(['Blue', 'Green', 'Orange', 'Red'], dtype=object)]

The categories show the order in which the colours are encoded. `Blue` is at index `0`, meaning the blue feature vector will be `[1, 0, 0, 0]`. `Green` is encoded next, giving a green feature vector of `[0, 1, 0, 0]`. Note that this sorted order differs from the illustrative list earlier, where Red came first.
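The ordering can be reproduced by hand. The sketch below assumes the encoder's default behaviour of taking the unique values of the feature in sorted order, which matches the `enc.categories_` output shown above; `column_of` is a hypothetical helper name.

```python
X = [['Red'], ['Blue'], ['Red'], ['Orange'], ['Green']]

# Unique colours in sorted order, mirroring the categories_ output
categories = sorted({row[0] for row in X})

# Column index that each colour occupies in the encoded output
column_of = {colour: i for i, colour in enumerate(categories)}

print(categories)
print(column_of)
```

Looking up a colour in `column_of` gives the position of the 1 in its feature vector.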

In [7]:
enc.transform(X).toarray()

array([[0., 0., 0., 1.],
       [1., 0., 0., 0.],
       [0., 0., 0., 1.],
       [0., 0., 1., 0.],
       [0., 1., 0., 0.]])

In [9]:
for colour, feature_vector in zip(X, enc.transform(X).toarray()):
    print(colour, feature_vector)

['Red'] [0. 0. 0. 1.]
['Blue'] [1. 0. 0. 0.]
['Red'] [0. 0. 0. 1.]
['Orange'] [0. 0. 1. 0.]
['Green'] [0. 1. 0. 0.]
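The encoder above was created with `handle_unknown='ignore'`. As a short sketch of what that setting does, the cell below transforms a colour that was not in the training data (`'Purple'` is a made-up value for illustration) and also maps a feature vector back to its colour with `inverse_transform`.

```python
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown='ignore')
X = [['Red'], ['Blue'], ['Red'], ['Orange'], ['Green']]
enc.fit(X)

# 'Purple' is a hypothetical colour that was not present during fit
unseen = enc.transform([['Purple']]).toarray()
print(unseen)

# inverse_transform maps a one-hot vector back to its category
decoded = enc.inverse_transform([[0, 0, 0, 1]])
print(decoded)
```

With `handle_unknown='ignore'`, the unseen colour is encoded as a row of zeros rather than raising an error, and the vector with a 1 in the last position decodes back to `Red`.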
