Most machine learning algorithms can only work with numerical features.  
In practice, however, categorical features are everywhere.  
For example, a person’s gender, an address, a product type, and the weather are all categorical features.   
To let a machine learning algorithm use this information (features or fields), we need to transform categorical features into numerical ones. Label encoding is one method for this transformation.

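The preprocessing module in scikit-learn provides a class called LabelEncoder for label encoding. It assigns each distinct category an integer, following the sorted order of the categories. Note that scikit-learn intends LabelEncoder for target labels (y); for input features, OrdinalEncoder plays the same role.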

In [2]:
import sklearn.preprocessing as preprocessing
import numpy as np
import pandas as pd

# use LabelEncoder to transform the categorical data to numerical data
targets = np.array(["Sun", "Sun", "Moon", "Earth", "Moon", "Venus"])
labelenc = preprocessing.LabelEncoder()
labelenc.fit(targets)
targets_trans = labelenc.transform(targets)
print("The original data")
print(targets)
print("The transform data using LabelEncoder")
print(targets_trans)

The original data
['Sun' 'Sun' 'Moon' 'Earth' 'Moon' 'Venus']
The transform data using LabelEncoder
[2 2 1 0 1 3]
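
To see which integer stands for which category, the fitted encoder can be inspected. The following is a minimal sketch on its own small illustrative array (not the data from the cell above); it relies on the `classes_` attribute and the `inverse_transform` method of LabelEncoder.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Illustrative data, chosen for this sketch only.
planets = np.array(["Sun", "Moon", "Earth", "Venus", "Moon"])

encoder = LabelEncoder()
codes = encoder.fit_transform(planets)

# classes_ holds the categories in sorted order; the position of each
# category in this array is the integer it was encoded as.
mapping = {str(label): int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform maps the integer codes back to the original labels.
restored = encoder.inverse_transform(codes)
print(restored)
```

Because the integers follow the sorted order of the categories, they carry no meaning beyond identity, which is one reason one-hot encoding is often preferred for features in linear models.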


pandas can also perform label encoding: convert the column to the category dtype, then read the integer codes from the .cat.codes accessor.

In [5]:
import pandas as pd

df = pd.DataFrame({"col1": ["Sun", "Sun", "Moon", "Earth", "Moon", "Venus"]})
print("The original types of DataFrame")
print(df.dtypes)
print("*"*30)
df["col1"] = df["col1"].astype("category")
print("The new types of DataFrame")
print(df.dtypes)
print("*"*30)
df["col1_label_encoding"] = df["col1"].cat.codes
print("The new column.")
print(df)


The original types of DataFrame
col1    object
dtype: object
******************************
The new types of DataFrame
col1    category
dtype: object
******************************
The new column.
    col1  col1_label_encoding
0    Sun                    2
1    Sun                    2
2   Moon                    1
3  Earth                    0
4   Moon                    1
5  Venus                    3
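
With the category dtype, the mapping between codes and labels is available from `cat.categories`, and missing values need some care. The sketch below uses a small illustrative Series containing one missing value (an assumption made for this example) to show both points.

```python
import numpy as np
import pandas as pd

# Illustrative data with one missing value (an assumption for this sketch).
s = pd.Series(["Sun", "Moon", np.nan, "Earth"]).astype("category")

# Position i in cat.categories is the category encoded as integer i.
code_to_label = dict(enumerate(s.cat.categories))
print(code_to_label)

# Missing values are not a category; pandas assigns them the code -1.
codes = s.cat.codes
print(codes.tolist())
```

If -1 should not reach the model as an ordinary value, the missing entries need to be filled or flagged before encoding.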


One-hot encoding: each category becomes its own 0/1 column, so no artificial order is implied between categories.

In [6]:
# scikit-learn
import sklearn.preprocessing as preprocessing
import numpy as np
import pandas as pd

targets = np.array(["Sun", "Sun", "Moon", "Earth", "Moon",
                    "Venus"])
labelEnc = preprocessing.LabelEncoder()
new_target = labelEnc.fit_transform(targets)
onehotEnc = preprocessing.OneHotEncoder()
onehotEnc.fit(new_target.reshape(-1, 1))
targets_trans = onehotEnc.transform(new_target.reshape(-1, 1))
print("The original data")
print(targets)
print("The transform data using OneHotEncoder")
print(targets_trans.toarray())

The original data
['Sun' 'Sun' 'Moon' 'Earth' 'Moon' 'Venus']
The transform data using OneHotEncoder
[[0. 0. 1. 0.]
 [0. 0. 1. 0.]
 [0. 1. 0. 0.]
 [1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 0. 1.]]


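Older versions of scikit-learn required integer input, which is why a LabelEncoder is applied before the OneHotEncoder above. Since scikit-learn 0.20, OneHotEncoder accepts string categories, so if the LabelEncoder was only there to convert the categories to integers, you can drop it and use the OneHotEncoder directly.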
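
As a minimal sketch of the direct route, assuming scikit-learn 0.20 or later, the same array can be passed to OneHotEncoder without the LabelEncoder step. The `categories_` attribute shows which output column belongs to which category.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

targets = np.array(["Sun", "Sun", "Moon", "Earth", "Moon", "Venus"])

# OneHotEncoder expects a 2-D input of shape (n_samples, n_features).
encoder = OneHotEncoder()
onehot = encoder.fit_transform(targets.reshape(-1, 1))

# categories_ lists the learned categories, one array per input feature;
# their order matches the order of the output columns.
print(encoder.categories_[0])

# The default result is a SciPy sparse matrix; toarray() makes it dense.
print(onehot.toarray())
```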


In [7]:
# pandas by get_dummies

df = pd.DataFrame({"col1": ["Sun", "Sun", "Moon", "Earth", "Moon", "Venus"]})
print("The original data")
print(df)
print("*" * 30)
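# dtype=int keeps the 0/1 output on pandas 2.0 and later, where the default dtype is bool
df_new = pd.get_dummies(df, columns=["col1"], prefix="Planet", dtype=int)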
print("The transform data using get_dummies")
print(df_new)

The original data
    col1
0    Sun
1    Sun
2   Moon
3  Earth
4   Moon
5  Venus
******************************
The transform data using get_dummies
   Planet_Earth  Planet_Moon  Planet_Sun  Planet_Venus
0             0            0           1             0
1             0            0           1             0
2             0            1           0             0
3             1            0           0             0
4             0            1           0             0
5             0            0           0             1


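Count Encoding: each category is replaced by the number of times it occurs in the data. It works well with tree-based models such as XGBoost, but it does not handle categories that appear only in the test set, because no count was learned for them.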

In [8]:
import pandas as pd

df = pd.DataFrame({"col1": ["Sun", "Sun", "Moon", "Earth", "Moon", "Venus"]})
print("The original dataset")
print(df)
print("*" * 30)
df["planet_count"] = df["col1"].map(df["col1"].value_counts().to_dict())
print("The new transformed dataset.")
print(df)

The original dataset
    col1
0    Sun
1    Sun
2   Moon
3  Earth
4   Moon
5  Venus
******************************
The new transformed dataset.
    col1  planet_count
0    Sun             2
1    Sun             2
2   Moon             2
3  Earth             1
4   Moon             2
5  Venus             1
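
The weakness with unseen categories can be made concrete. The sketch below learns the counts from a training frame and applies them to a hypothetical test frame that contains a category ("Mars") absent from training. Filling the resulting gap with 0 is one possible convention, chosen here for illustration only.

```python
import pandas as pd

train = pd.DataFrame({"col1": ["Sun", "Sun", "Moon", "Earth", "Moon", "Venus"]})
# Hypothetical test set: "Mars" never appears in the training data.
test = pd.DataFrame({"col1": ["Sun", "Mars", "Earth"]})

# Learn the counts from the training data only.
counts = train["col1"].value_counts()

# map() yields NaN for categories without a learned count; filling with 0
# is one possible convention, not the only reasonable choice.
test["planet_count"] = test["col1"].map(counts).fillna(0).astype(int)
print(test)
```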


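Mean Encoding (also called target encoding): each category is replaced by the mean of the target value within that category. It is most common in classification tasks, particularly binary classification, where that mean is the fraction of positive samples. The example below uses a numeric price column to show the mechanics.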

In [9]:
import pandas as pd

df = pd.DataFrame({
    "col1": ["Sun", "Moon", "Sun", "Moon", "Moon", "Mars"],
    "price": [20, 30, 30, 35, 40, 55]
})
print("The original dataset")
print(df)
print("*" * 30)
d = df.groupby(["col1"])["price"].mean().to_dict()
df["col1_price_mean"] = df["col1"].map(d)
print("The new transformed dataset.")
print(df)

The original dataset
   col1  price
0   Sun     20
1  Moon     30
2   Sun     30
3  Moon     35
4  Moon     40
5  Mars     55
******************************
The new transformed dataset.
   col1  price  col1_price_mean
0   Sun     20             25.0
1  Moon     30             35.0
2   Sun     30             25.0
3  Moon     35             35.0
4  Moon     40             35.0
5  Mars     55             55.0
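
Since mean encoding is mostly used with binary targets, here is a minimal sketch of that case. The `clicked` column is a hypothetical 0/1 target invented for this example; the per-category mean then equals the fraction of positive samples.

```python
import pandas as pd

# Illustrative data: "clicked" is a hypothetical binary target.
df = pd.DataFrame({
    "col1": ["Sun", "Moon", "Sun", "Moon", "Moon", "Mars"],
    "clicked": [1, 0, 0, 1, 1, 0],
})

# For a 0/1 target, the per-category mean is the fraction of positives.
positive_rate = df.groupby("col1")["clicked"].mean()
df["col1_clicked_mean"] = df["col1"].map(positive_rate)
print(df)
```

Computing the means on the same rows that are later used for training leaks target information into the feature, so in practice the means are usually estimated on training data only, often out-of-fold.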


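Weight of Evidence (WoE) Encoding.  
It measures how strongly the evidence in a category supports one class over the other. The example below computes it per category as log(p1 / p0), where p1 is the fraction of samples with Target = 1 in that category and p0 = 1 - p1.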


In [10]:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "col1": ["Moon", "Sun", "Moon", "Sun", "Sun"],
    "Target": [1, 1, 0, 1, 0]
})
df["Target"] = df["Target"].astype("float64")
print("The original dataset")
print(df)
print("*" * 30)
d = df.groupby(["col1"])["Target"].mean().to_dict()
df["p1"] = df["col1"].map(d)
df["p0"] = 1 - df["p1"]
df["woe"] = np.log(df["p1"] / df["p0"])
print("The new transform dataset")
print(df)

The original dataset
   col1  Target
0  Moon     1.0
1   Sun     1.0
2  Moon     0.0
3   Sun     1.0
4   Sun     0.0
******************************
The new transform dataset
   col1  Target        p1        p0       woe
0  Moon     1.0  0.500000  0.500000  0.000000
1   Sun     1.0  0.666667  0.333333  0.693147
2  Moon     0.0  0.500000  0.500000  0.000000
3   Sun     1.0  0.666667  0.333333  0.693147
4   Sun     0.0  0.666667  0.333333  0.693147
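
The ratio p1 / p0 breaks down when a category contains only one class: p0 = 0 gives an infinite value and p1 = 0 gives minus infinity. A minimal sketch of one workaround follows, assuming additive smoothing of the per-category counts. The constant `eps` and the extra "Mars" row are hypothetical choices made for this example, not part of the cell above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": ["Moon", "Sun", "Moon", "Sun", "Sun", "Mars"],
    "Target": [1, 1, 0, 1, 0, 1],
})

# Per-category counts of positive (1) and negative (0) samples.
stats = df.groupby("col1")["Target"].agg(pos="sum", total="count")
stats["neg"] = stats["total"] - stats["pos"]

# eps is a hypothetical smoothing constant: it keeps the ratio finite
# when a category has no positives or no negatives (here, "Mars").
eps = 0.5
stats["woe"] = np.log((stats["pos"] + eps) / (stats["neg"] + eps))

df["woe"] = df["col1"].map(stats["woe"])
print(df)
```

With `eps = 0` this reduces to the log(p1 / p0) computation used above, since the ratio of counts within a category equals the ratio of the two rates.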
