# One-hot encoding categorical data

We can convert a categorical variable into a collection of binary variables using a technique called "one-hot encoding".

One-hot encoding is a data pre-processing technique that converts categorical data into a binary matrix (0s and 1s), representing each category as a unique vector with a single "hot" (1) value and the rest "cold" (0). It enables machine learning algorithms to process nominal data by creating new binary columns for each distinct category.
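
To make the "single hot value per row" idea concrete, here is a minimal hand-rolled sketch that does not use pandas. The names `categories`, `values`, and `one_hot` are illustrative only, and the sample values are assumed to mirror the payment methods used in the dataset later in this notebook.

```python
import numpy as np

# Distinct categories, in a fixed order (one output column per category)
categories = ["Credit Card", "Debit Card", "PayPal"]

# Observed values, one per row
values = ["Credit Card", "PayPal", "Debit Card", "Credit Card", "PayPal"]

# Map each category to its column position
column_of = {category: i for i, category in enumerate(categories)}

# Start with all "cold" (0) and switch on exactly one "hot" (1) per row
one_hot = np.zeros((len(values), len(categories)), dtype=int)
for row, value in enumerate(values):
    one_hot[row, column_of[value]] = 1

print(one_hot)
# [[1 0 0]
#  [0 0 1]
#  [0 1 0]
#  [1 0 0]
#  [0 0 1]]
```

Every row sums to 1, which is the defining property of a one-hot vector.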

One-hot encoding is useful for models that need numeric inputs, such as:
- Logistic regression
- Linear regression

Drawbacks:
- It increases dimensionality (one new column per category)
- It is not ideal for high-cardinality columns (columns with a large number of unique values)
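
The effect of cardinality on dimensionality can be sketched with a made-up column. The `zip_code` column and its 1000 synthetic values below are assumptions for illustration, not part of the dataset used in this notebook.

```python
import pandas as pd

# Hypothetical high-cardinality column: 1000 rows, 1000 distinct values
high_card = pd.DataFrame({
    "zip_code": [f"{i:05d}" for i in range(1000)],
    "amount": range(1000),
})

encoded = pd.get_dummies(high_card, columns=["zip_code"])

print(high_card.shape)  # (1000, 2)
print(encoded.shape)    # (1000, 1001): 1 original column + 1000 dummy columns
```

Each unique value becomes its own column, so the width of the encoded frame grows with the number of categories, not with the number of rows.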

In [1]:
import pandas as pd

# Create a small dataset
data = {
    "order_id": [101, 102, 103, 104, 105],
    "payment_method": ["Credit Card", "PayPal", "Debit Card", "Credit Card", "PayPal"],
    "shipping_type": ["Standard", "Express", "Standard", "Express", "Standard"],
    "order_amount": [120.5, 75.0, 89.9, 150.2, 60.0]
}

df = pd.DataFrame(data)

print(df)

   order_id payment_method shipping_type  order_amount
0       101    Credit Card      Standard         120.5
1       102         PayPal       Express          75.0
2       103     Debit Card      Standard          89.9
3       104    Credit Card       Express         150.2
4       105         PayPal      Standard          60.0


## One-Hot Encoding with `get_dummies()`


In [9]:
encoded_df = pd.get_dummies(df,
                            columns=["payment_method"],  # columns to one-hot encode
                            dtype=int  # the default output is bool (True/False); dtype=int returns 1 and 0 instead
                            )

encoded_df

|   | order_id | shipping_type | order_amount | payment_method_Credit Card | payment_method_Debit Card | payment_method_PayPal |
|---|---|---|---|---|---|---|
| 0 | 101 | Standard | 120.5 | 1 | 0 | 0 |
| 1 | 102 | Express | 75.0 | 0 | 0 | 1 |
| 2 | 103 | Standard | 89.9 | 0 | 1 | 0 |
| 3 | 104 | Express | 150.2 | 1 | 0 | 0 |
| 4 | 105 | Standard | 60.0 | 0 | 0 | 1 |


## Encoding Multiple Columns

We can encode more than one categorical variable:

In [7]:
encoded_df = pd.get_dummies(df, columns=["payment_method", "shipping_type"])

encoded_df

|   | order_id | order_amount | payment_method_Credit Card | payment_method_Debit Card | payment_method_PayPal | shipping_type_Express | shipping_type_Standard |
|---|---|---|---|---|---|---|---|
| 0 | 101 | 120.5 | True | False | False | False | True |
| 1 | 102 | 75.0 | False | False | True | True | False |
| 2 | 103 | 89.9 | False | True | False | False | True |
| 3 | 104 | 150.2 | True | False | False | True | False |
| 4 | 105 | 60.0 | False | False | True | False | True |
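
As a side note, when `columns` is not passed, `pd.get_dummies()` encodes every column with an object, string, or category dtype. The sketch below rebuilds the same small dataset under the assumed name `orders` and checks that the automatic selection matches the explicit column list.

```python
import pandas as pd

# Same small dataset as above, rebuilt so this cell runs on its own
orders = pd.DataFrame({
    "order_id": [101, 102, 103, 104, 105],
    "payment_method": ["Credit Card", "PayPal", "Debit Card", "Credit Card", "PayPal"],
    "shipping_type": ["Standard", "Express", "Standard", "Express", "Standard"],
    "order_amount": [120.5, 75.0, 89.9, 150.2, 60.0],
})

# Explicit: name the columns to encode
explicit = pd.get_dummies(orders, columns=["payment_method", "shipping_type"])

# Automatic: let pandas pick every text/categorical column
auto = pd.get_dummies(orders)

print(auto.equals(explicit))
```

Naming the columns explicitly is usually the safer choice, because a numeric-looking ID stored as text would otherwise be encoded as well.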


## [IMPORTANT] Avoiding the Dummy Variable Trap: set `drop_first=True`
This creates `n-1` new columns, where `n` is the number of unique values in the encoded column.

For example, for a column `gender` that has two unique values, `"female"` and `"male"`, `pd.get_dummies(df, drop_first=True)` removes the `gender` column and creates a single dummy variable. pandas drops the first category in sorted order, so the result is `gender_male`, and `gender_female` is implied whenever `gender_male` is 0. Assuming gender can take only these two values, keeping both `gender_female` and `gender_male` is redundant: each one fully determines the other (`gender_female = 1 - gender_male`). This perfect correlation (multicollinearity) causes problems in regression analysis and in other models that fit a linear combination of the features.

Conclusion: if you are one-hot encoding for regression analysis or for machine learning models that are sensitive to multicollinearity, set `drop_first=True`.

In [11]:
encoded_df = pd.get_dummies(df,
                            columns=["payment_method"],
                            drop_first=True)

encoded_df

|   | order_id | shipping_type | order_amount | payment_method_Debit Card | payment_method_PayPal |
|---|---|---|---|---|---|
| 0 | 101 | Standard | 120.5 | False | False |
| 1 | 102 | Express | 75.0 | False | True |
| 2 | 103 | Standard | 89.9 | True | False |
| 3 | 104 | Express | 150.2 | False | False |
| 4 | 105 | Standard | 60.0 | False | True |
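
One way to see the redundancy numerically is to compare the rank of a regression design matrix with and without `drop_first=True`. This is a minimal sketch, assuming a model with an intercept column; the helper `design_matrix` is a hypothetical name introduced here, not a pandas function.

```python
import numpy as np
import pandas as pd

payment = pd.Series(
    ["Credit Card", "PayPal", "Debit Card", "Credit Card", "PayPal"],
    name="payment_method",
)

def design_matrix(drop_first):
    """Hypothetical helper: intercept column plus the dummy columns."""
    dummies = pd.get_dummies(payment, drop_first=drop_first, dtype=float)
    intercept = np.ones((len(dummies), 1))
    return np.hstack([intercept, dummies.to_numpy()])

full = design_matrix(drop_first=False)    # intercept + 3 dummies
reduced = design_matrix(drop_first=True)  # intercept + 2 dummies

print(full.shape, np.linalg.matrix_rank(full))        # (5, 4) 3
print(reduced.shape, np.linalg.matrix_rank(reduced))  # (5, 3) 3
```

With all three dummies the matrix has 4 columns but rank 3, because the dummy columns add up to the intercept column. After dropping the first dummy, the number of columns equals the rank, so the coefficients are uniquely determined.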
