# One-Hot Encoding
* Many machine learning algorithms require numerical inputs, so one-hot encoding is commonly used to handle categorical features. It is often preferred over label encoding (assigning an integer to each category) because label encoding can lead an algorithm to infer an ordering among the categories that does not exist. One-hot encoding is most often used with linear models and distance-based models, but it can also be used with tree-based models, where it can make individual splits easier to interpret, although its effect on tree depth depends on the number of categories and the implementation.
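
To make the contrast with label encoding concrete, here is a minimal sketch on a tiny hand-written sample. The `colors` series and the variable names are illustrative assumptions, not part of the notebook's data. It uses the pandas categorical codes as one possible form of label encoding.

```python
import pandas as pd

# Hypothetical four-row sample, used only for illustration
colors = pd.Series(['red', 'green', 'blue', 'green'], name='Categories')

# Label encoding: each category becomes an integer code.
# pandas sorts the categories alphabetically, so blue=0, green=1, red=2,
# which a model may read as blue < green < red.
label_encoded = colors.astype('category').cat.codes

# One-hot encoding: one 0/1 column per category, with no implied order
one_hot = pd.get_dummies(colors, dtype=int)

print(label_encoded.tolist())
print(one_hot)
```

With label encoding, a linear or distance-based model would treat red as "twice as far" from blue as green is, which is an artifact of the alphabetical codes rather than a property of the data.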

In [1]:
import numpy as np
import pandas as pd

In [2]:
# Generate 100 random samples from three categories (no seed is set, so the values differ between runs)
data = {'Categories': np.random.choice(['red', 'green', 'blue'], 100)}

In [3]:
# Turn data into a dataframe
df = pd.DataFrame(data)
df.head(10)

,Categories
0,green
1,red
2,red
3,green
4,blue
5,green
6,blue
7,blue
8,red
9,blue


In [4]:
# Use get_dummies to encode the data
encoded_df = pd.get_dummies(df, dtype=int)
encoded_df.head(10)

,Categories_blue,Categories_green,Categories_red
0,0,1,0
1,0,0,1
2,0,0,1
3,0,1,0
4,1,0,0
5,0,1,0
6,1,0,0
7,1,0,0
8,0,0,1
9,1,0,0


In [5]:
# Keeping a column for every category makes the columns perfectly collinear
# (each row sums to 1), which is a problem for linear models.
# For example: if green and red are both 0, blue is already implied to be 1.
# This is known as the dummy variable trap. Dropping one column resolves it;
# drop_first=True drops the first category (here, blue).
encoded_df2 = pd.get_dummies(df, drop_first=True, dtype=int)
encoded_df2.head(10)

,Categories_green,Categories_red
0,1,0
1,0,1
2,0,1
3,1,0
4,0,0
5,1,0
6,0,0
7,0,0
8,0,1
9,0,0
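
One way to see the dummy variable trap numerically is to compare the rank of the design matrix with and without the dropped column. The sketch below assumes a linear model with an intercept. The small `df`, the `with_intercept` helper, and the variable names are hypothetical and introduced only for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical small sample containing all three categories
df = pd.DataFrame({'Categories': ['red', 'green', 'blue', 'green', 'blue', 'red']})

full = pd.get_dummies(df, dtype=int)                       # 3 dummy columns
reduced = pd.get_dummies(df, drop_first=True, dtype=int)   # 2 dummy columns

def with_intercept(frame):
    """Prepend a column of ones, as a linear model with an intercept would."""
    return np.column_stack([np.ones(len(frame)), frame.to_numpy()])

X_full = with_intercept(full)        # 4 columns
X_reduced = with_intercept(reduced)  # 3 columns

# A matrix has full column rank only if its rank equals its column count
rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)

print(X_full.shape[1], rank_full)
print(X_reduced.shape[1], rank_reduced)
```

The full encoding has more columns than its rank, because the three dummy columns add up to the intercept column. After dropping one category the rank matches the column count, so the coefficients of a linear model are uniquely determined.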
