In [1]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Categorical Features

Some algorithms only work when they are supplied with numerical features. For example, if you have a dataset of cars and would like to recommend similar cars, colour may be one of the aspects you look at. A dataset containing car colours would need to be encoded into numerical features before being fed into the learning algorithm.

## One-Hot Encoding

Let's say our dataset of car colours contains the following colours:
- Red
- Green
- Blue
- Orange

This feature has 4 possible values. One way to convert it to numbers is one-hot encoding, where each colour becomes a feature vector:
- Red = [1, 0, 0, 0]
- Green = [0, 1, 0, 0]
- Blue = [0, 0, 1, 0]
- Orange = [0, 0, 0, 1]

The length of the feature vector equals the number of colours in this feature (4), and the position of the 1 identifies the colour. Note that this enlarges your dataset by increasing its dimensionality: a single colour column becomes four columns.
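To make the mechanics concrete, here is a minimal hand-rolled sketch of the mapping above. The helper `one_hot` and the `position` lookup are made up for illustration, and the category order simply follows the list above.

```python
# Category order follows the list above; this ordering is an illustrative choice.
colours = ['Red', 'Green', 'Blue', 'Orange']
position = {colour: i for i, colour in enumerate(colours)}


def one_hot(colour):
    """Return a vector of zeros with a single 1 at the colour's position."""
    vector = [0] * len(colours)
    vector[position[colour]] = 1
    return vector


for colour in colours:
    print(colour, one_hot(colour))
```

Each vector has exactly one non-zero entry, which is where the name "one-hot" comes from.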

The method is implemented in sklearn:

In [4]:
enc = OneHotEncoder(handle_unknown='ignore')
X = [['Red'], ['Blue'], ['Red'], ['Orange'], ['Green']]
enc.fit(X)

OneHotEncoder(handle_unknown='ignore')

In [5]:
enc.categories_

[array(['Blue', 'Green', 'Orange', 'Red'], dtype=object)]

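The categories show the order in which the colours are encoded. `Blue` is at index `0`, meaning the blue feature vector is `[1, 0, 0, 0]`. `Green` is encoded next, giving a green feature vector of `[0, 1, 0, 0]`. Note that the categories are sorted alphabetically, so this order differs from the hand-written example above, where `Red` came first.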

In [7]:
enc.transform(X).toarray()

array([[0., 0., 0., 1.],
       [1., 0., 0., 0.],
       [0., 0., 0., 1.],
       [0., 0., 1., 0.],
       [0., 1., 0., 0.]])

In [9]:
for colour, feature_vector in zip(X, enc.transform(X).toarray()):
    print(colour, feature_vector)

['Red'] [0. 0. 0. 1.]
['Blue'] [1. 0. 0. 0.]
['Red'] [0. 0. 0. 1.]
['Orange'] [0. 0. 1. 0.]
['Green'] [0. 1. 0. 0.]
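The encoder above was created with `handle_unknown='ignore'`. The sketch below shows what that option does when the encoder meets a colour it was not fitted on, and how `inverse_transform` maps a vector back to its colour. The colour `Purple` is a made-up unseen value used only for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit([['Red'], ['Blue'], ['Red'], ['Orange'], ['Green']])

# 'Purple' was never seen during fit, so with handle_unknown='ignore'
# it is encoded as a vector of all zeros instead of raising an error.
unseen = enc.transform([['Purple']]).toarray()
print(unseen)

# Map a one-hot vector back to its category label.
decoded = enc.inverse_transform(np.array([[0, 0, 0, 1]]))
print(decoded)
```

An all-zero row is therefore a signal that the value was unknown at fit time, which is worth checking for before training on the encoded data.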


## Target Encoding

- [Ref](https://maxhalford.github.io/blog/target-encoding/)

One-hot encoding works well in most cases but suffers when the feature it is encoding has high cardinality, particularly when some of the values occur very infrequently. With a random forest model this can lead to favoured splits, because splitting on values with a low occurrence can yield the biggest gain in purity. One-hot encoding is therefore not always the best choice. Another way to numerically encode categorical variables is target encoding: for each distinct value in $x$, compute the average of the corresponding values in $y$, then replace each $x$ with that mean.
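As a rough illustration of the cardinality problem, the sketch below compares the width of the two encodings. The `toy` frame, with 200 distinct IDs over 1,000 rows, is invented data and not part of the dataset used in this notebook.

```python
import pandas as pd

# Hypothetical high-cardinality feature: 1,000 rows, 200 distinct IDs.
toy = pd.DataFrame({
    'x': [f'id_{i % 200}' for i in range(1000)],
    'y': [i % 2 for i in range(1000)],
})

# One-hot encoding creates one column per distinct value.
one_hot = pd.get_dummies(toy['x'])

# Target encoding keeps a single column: each value of x is replaced
# by the mean of y over the rows that share that value.
target_encoded = toy.groupby('x')['y'].transform('mean')

print(one_hot.shape)         # (1000, 200)
print(target_encoded.shape)  # (1000,)
```

The one-hot frame grows with every new category, while the target-encoded feature stays a single numeric column.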

In [17]:
# Dummy data
df = pd.DataFrame({
    'feature_1': ['apple'] * 5 + ['orange'] * 5,
    'feature_2': ['pear'] * 9 + ['grape'] * 1,
    'y': [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
})
df

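| | feature_1 | feature_2 | y |
|---|---|---|---|
| 0 | apple | pear | 1 |
| 1 | apple | pear | 1 |
| 2 | apple | pear | 1 |
| 3 | apple | pear | 1 |
| 4 | apple | pear | 0 |
| 5 | orange | pear | 1 |
| 6 | orange | pear | 0 |
| 7 | orange | pear | 0 |
| 8 | orange | pear | 0 |
| 9 | orange | grape | 0 |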


Now, let's replace the values in each column with the mean of the dependent variable $y$ for that value. Start by computing the means for `feature_1`:

In [18]:
means = df.groupby('feature_1')['y'].mean()
means

feature_1
apple     0.8
orange    0.2
Name: y, dtype: float64

Now, replace each value in `feature_1` with the matching mean:

In [19]:
df['feature_1'] = df['feature_1'].map(means)

In [20]:
df

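| | feature_1 | feature_2 | y |
|---|---|---|---|
| 0 | 0.8 | pear | 1 |
| 1 | 0.8 | pear | 1 |
| 2 | 0.8 | pear | 1 |
| 3 | 0.8 | pear | 1 |
| 4 | 0.8 | pear | 0 |
| 5 | 0.2 | pear | 1 |
| 6 | 0.2 | pear | 0 |
| 7 | 0.2 | pear | 0 |
| 8 | 0.2 | pear | 0 |
| 9 | 0.2 | grape | 0 |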


Now let's do the same for `feature_2`:

In [16]:
df['feature_2'] = df['feature_2'].map(df.groupby('feature_2')['y'].mean())
df

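| | feature_1 | feature_2 | y |
|---|---|---|---|
| 0 | 0.8 | 0.555556 | 1 |
| 1 | 0.8 | 0.555556 | 1 |
| 2 | 0.8 | 0.555556 | 1 |
| 3 | 0.8 | 0.555556 | 1 |
| 4 | 0.8 | 0.555556 | 0 |
| 5 | 0.2 | 0.555556 | 1 |
| 6 | 0.2 | 0.555556 | 0 |
| 7 | 0.2 | 0.555556 | 0 |
| 8 | 0.2 | 0.555556 | 0 |
| 9 | 0.2 | 0.0 | 0 |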


The advantage of target encoding is that it can pick up values that help explain the target. In the above example, `apple` has an average target value of 0.8, while `orange` has only 0.2.
The drawback is that relying on the average value isn’t always a good idea. It is especially unreliable when a category has only a few values, as these won’t be representative of the wider dataset.
For example, the value of `grape` is 0.0 because it occurs only once, which may not reflect the wider dataset. Is the mean of `grape` actually 0.0? We just don’t have enough data points to be sure.
To handle this, we can perform cross-validation and compute the means within each of the k folds. Another approach is to apply additive smoothing. The intuition is that as the number of samples in a category grows, we rely more on the mean of that category; if the count is low, we rely more on the global mean. Take movie ratings, for example: a new movie comes out and gets 3 ratings of 10 / 10, giving an average of 10. This is most likely too high, as we don’t have enough data to find the true mean rating of the movie. We can smooth the average rating of this movie by including the average rating over all movies. In a nutshell:
- Not many data points? Rely on global mean
- Lots of data points? Rely on local mean
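The cross-validation idea mentioned above can be sketched as follows: each row is encoded with category means computed on the other folds, so its own target never feeds its encoding. This is a minimal sketch under assumptions, not a definitive implementation; the `toy` frame copies the dummy data, and the choice of 5 folds, the `random_state`, and the fallback to the fold's overall mean are all illustrative.

```python
import pandas as pd
from sklearn.model_selection import KFold

toy = pd.DataFrame({
    'feature_1': ['apple'] * 5 + ['orange'] * 5,
    'y': [1, 1, 1, 1, 0, 1, 0, 0, 0, 0],
})

# Placeholder for the out-of-fold encoding, filled fold by fold.
encoded = pd.Series(index=toy.index, dtype=float)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fit_idx, enc_idx in kf.split(toy):
    fit_part = toy.iloc[fit_idx]
    # Category means computed only on the rows outside the current fold.
    fold_means = fit_part.groupby('feature_1')['y'].mean()
    # Categories missing from the fitting rows fall back to their overall mean.
    fallback = fit_part['y'].mean()
    encoded.iloc[enc_idx] = (
        toy['feature_1'].iloc[enc_idx].map(fold_means).fillna(fallback).to_numpy()
    )

print(encoded)
```

Rows from the same category can now receive slightly different values, because each fold sees a different subset of the targets.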

\begin{equation}
u=\frac{n \times \bar{x} + m \times w}{n+m}
\end{equation}

- $u$: Mean we're trying to compute
- $n$: Number of values of a given category
- $\bar{x}$: Estimated mean of the category
- $m$: The "weight" we assign the overall mean
- $w$: The overall mean
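As a worked instance of the formula, take `apple` from the dummy data above: it occurs $n = 5$ times with a category mean of $\bar{x} = 0.8$, and the overall mean of $y$ is $w = 0.5$. The weight $m = 10$ is chosen purely for illustration.

\begin{equation}
u_{\text{apple}}=\frac{5 \times 0.8 + 10 \times 0.5}{5+10}=\frac{9}{15}=0.6
\end{equation}

The smoothed value is pulled from the category mean of 0.8 towards the overall mean of 0.5.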

We only have a single parameter, $m$, to tune, and we can choose its value via cross-validation. Note that higher values of $m$ rely more on the global mean. A value of zero applies no smoothing and just uses the local average.
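To see the effect of $m$, the sketch below applies the formula to `grape` from the dummy data (one observation, category mean 0.0, overall mean 0.5) for a few weights. The helper `smooth_mean` and the particular values of $m$ are chosen for illustration only.

```python
def smooth_mean(n, category_mean, m, global_mean):
    """Additive smoothing: blend the category mean with the global mean."""
    return (n * category_mean + m * global_mean) / (n + m)


# grape: a single observation with y = 0; the overall mean of y is 0.5.
for m in (0, 1, 10, 300):
    print(m, round(smooth_mean(1, 0.0, m, 0.5), 4))
```

With $m = 0$ the result is the raw category mean of 0.0; as $m$ grows, the estimate moves towards the overall mean of 0.5.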

Below is a function for additive smoothing. Note that it is applied here to columns that have already been target encoded; the groups are unchanged because each category was mapped to a distinct value.

In [27]:
def calc_smooth_mean(df, feature, y, m):
    # Compute the global mean
    mean = df[y].mean()

    # Compute the number of values and the mean of each group
    agg = df.groupby(feature)[y].agg(['count', 'mean'])
    counts = agg['count']
    means = agg['mean']

    # Compute the "smoothed" means
    smooth = (counts * means + m * mean) / (counts + m)
    
    # Replace each value by the according smoothed mean
    return df[feature].map(smooth)

In [28]:
# Feature 1 smoothing
calc_smooth_mean(df, 'feature_1', 'y', 10)

0    0.6
1    0.6
2    0.6
3    0.6
4    0.6
5    0.4
6    0.4
7    0.4
8    0.4
9    0.4
Name: feature_1, dtype: float64

In [29]:
# Feature 1 with no smoothing (weight set to zero): the values match the plain target encoding.
calc_smooth_mean(df, 'feature_1', 'y', 0)

0    0.8
1    0.8
2    0.8
3    0.8
4    0.8
5    0.2
6    0.2
7    0.2
8    0.2
9    0.2
Name: feature_1, dtype: float64

As a rule of thumb, setting $m$ to something like 300 works well in most cases. It’s quite intuitive really: you’re saying that a category needs at least 300 values before its sample mean outweighs the global mean.
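One way to read this rule of thumb: rearranging the smoothing formula gives $u = \frac{n}{n+m}\bar{x} + \frac{m}{n+m}w$, so the category mean carries a weight of $n / (n + m)$. The sketch below evaluates that weight for $m = 300$; the helper name `category_weight` and the sample counts are made up for illustration.

```python
def category_weight(n, m):
    """Share of the smoothed mean that comes from the category's own mean."""
    return n / (n + m)


m = 300
for n in (3, 30, 300, 3000):
    print(n, round(category_weight(n, m), 3))
```

At $n = 300$ the weight is exactly 0.5, so the category mean and the global mean contribute equally; below that count the global mean dominates.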