# **Data Cleaning**

# **10. Categorical Encoding**

In [11]:
import numpy as np
import pandas as pd 

## ✅ What is Categorical Encoding?

Categorical variables represent **discrete values** or **labels** (e.g., gender, country, product type). Encoding transforms these into **numeric representations** so models can understand them.


### ✅ Techniques:

| Method                  | Type             | Use Case                      | Example                |
| ----------------------- | ---------------- | ----------------------------- | ---------------------- |
| 1. Label Encoding       | Nominal          | Compact codes, order ignored  | `A=0, B=1, C=2`        |
| 2. One-Hot Encoding     | Nominal          | When categories have no order | `Red, Blue, Green`     |
| 3. get\_dummies()       | Nominal          | Pandas shortcut for one-hot   | `Region: East, West`   |
| 4. Ordinal Encoding     | Ordinal          | Custom mapping with order     | Education level        |
| 5. Binary Encoding      | High cardinality | Memory efficient              | City names             |
| 6. Frequency Encoding   | Nominal          | Based on value counts         | Product types          |
| 7. Target/Mean Encoding | Nominal + Target | Uses target variable          | Mean `Price` per city  |
| 8. Hash Encoding        | High cardinality | Space-efficient & fast        | Hashed category values |


In [12]:
df = pd.DataFrame({
    'Color': ['Red', 'Blue', 'Green', 'Red', 'Blue'],
    'Size': ['Small', 'Large', 'Medium', 'Medium', 'Small'],
    'Product_Category': ['A', 'B', 'C', 'A', 'B'],
    'City': ['Delhi', 'Mumbai', 'Chennai', 'Delhi', 'Mumbai'],
    'Education': ['High School', 'Bachelors', 'Masters', 'PhD', 'Bachelors'],
    'Price': [100, 200, 150, 130, 220]  # Target variable for mean encoding
})

df

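|     | Color | Size   | Product_Category | City    | Education   | Price |
| --- | ----- | ------ | ---------------- | ------- | ----------- | ----- |
| 0   | Red   | Small  | A                | Delhi   | High School | 100   |
| 1   | Blue  | Large  | B                | Mumbai  | Bachelors   | 200   |
| 2   | Green | Medium | C                | Chennai | Masters     | 150   |
| 3   | Red   | Medium | A                | Delhi   | PhD         | 130   |
| 4   | Blue  | Small  | B                | Mumbai  | Bachelors   | 220   |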


## 1️⃣ Label Encoding

### ✅ Description:

Converts each category to a **unique integer**.

In [13]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

In [14]:
df['Color_LE'] = le.fit_transform(df['Color'])
df

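|     | Color | Size   | Product_Category | City    | Education   | Price | Color_LE |
| --- | ----- | ------ | ---------------- | ------- | ----------- | ----- | -------- |
| 0   | Red   | Small  | A                | Delhi   | High School | 100   | 2        |
| 1   | Blue  | Large  | B                | Mumbai  | Bachelors   | 200   | 0        |
| 2   | Green | Medium | C                | Chennai | Masters     | 150   | 1        |
| 3   | Red   | Medium | A                | Delhi   | PhD         | 130   | 2        |
| 4   | Blue  | Small  | B                | Mumbai  | Bachelors   | 220   | 0        |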


### 🔄 Mapping:

`LabelEncoder` sorts the labels alphabetically before numbering them: Blue=0, Green=1, Red=2

### 📌 Use Case:

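For tree-based models, or for encoding a classifier's target labels (what `LabelEncoder` is designed for), when you want a compact representation. The integers carry **no real order**, so linear and distance-based models may read a false ranking into them.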
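Rather than reading the mapping off the output, it can be recovered from the fitted encoder. The sketch below is a minimal illustration; the `colors` Series is a stand-in for `df['Color']`, and `mapping` and `restored` are names introduced here for the example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

colors = pd.Series(['Red', 'Blue', 'Green', 'Red', 'Blue'])  # stand-in for df['Color']

le = LabelEncoder()
codes = le.fit_transform(colors)

# classes_ holds the sorted unique labels; a label's position is its code.
mapping = {str(label): int(code) for code, label in enumerate(le.classes_)}
print(mapping)  # {'Blue': 0, 'Green': 1, 'Red': 2}

# inverse_transform recovers the original labels from the integer codes.
restored = list(le.inverse_transform(codes))
print(restored)
```

Keeping the fitted `le` object around is what makes it possible to decode predictions or encode new data with the same mapping later.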

## 2️⃣ One-Hot Encoding

### ✅ Description:

Creates **a new column for each category** with binary values.

In [15]:
from sklearn.preprocessing import OneHotEncoder
# sparse_output needs scikit-learn >= 1.2 (older releases use sparse=False)
ohe = OneHotEncoder(sparse_output=False)

In [16]:
encoded = ohe.fit_transform(df[['Size']])
pd.DataFrame(encoded, columns=ohe.get_feature_names_out(['Size']))

|     | Size_Large | Size_Medium | Size_Small |
| --- | ---------- | ----------- | ---------- |
| 0   | 0.0        | 0.0         | 1.0        |
| 1   | 1.0        | 0.0         | 0.0        |
| 2   | 0.0        | 1.0         | 0.0        |
| 3   | 0.0        | 1.0         | 0.0        |
| 4   | 0.0        | 0.0         | 1.0        |


### 📌 Use Case:

When categories are **nominal** (no order) — e.g., color, city.

## 3️⃣ `pd.get_dummies()`

### ✅ Description:

Pandas shortcut for One-Hot Encoding.

In [18]:
pd.get_dummies(df, columns=['Product_Category'], drop_first=True)

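|     | Color | Size   | City    | Education   | Price | Color_LE | Product_Category_B | Product_Category_C |
| --- | ----- | ------ | ------- | ----------- | ----- | -------- | ------------------ | ------------------ |
| 0   | Red   | Small  | Delhi   | High School | 100   | 2        | False              | False              |
| 1   | Blue  | Large  | Mumbai  | Bachelors   | 200   | 0        | True               | False              |
| 2   | Green | Medium | Chennai | Masters     | 150   | 1        | False              | True               |
| 3   | Red   | Medium | Delhi   | PhD         | 130   | 2        | False              | False              |
| 4   | Blue  | Small  | Mumbai  | Bachelors   | 220   | 0        | True               | False              |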


### 📌 Use Case:

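Quick and easy encoding during exploration. `get_dummies()` does not remember which categories it saw, so training and test data can end up with different columns; inside a reusable preprocessing pipeline, `OneHotEncoder` is usually the safer choice.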
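Because `get_dummies()` only creates columns for the categories present in the frame it is given, two frames can come out with different columns. The sketch below shows one common workaround, assuming hypothetical `train` and `test` frames: reindex the test columns to match the training columns.

```python
import pandas as pd

# Hypothetical split: category 'C' never appears in the test data.
train = pd.DataFrame({'Product_Category': ['A', 'B', 'C']})
test = pd.DataFrame({'Product_Category': ['A', 'B']})

train_enc = pd.get_dummies(train, columns=['Product_Category'])
test_enc = pd.get_dummies(test, columns=['Product_Category'])

# test_enc has no Product_Category_C column; align it to the training layout
# and fill the missing indicator with False.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=False)
print(test_aligned)
```

`reindex` also drops any test-only columns, so the result always matches the layout the model was trained on.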

## 4️⃣ Ordinal Encoding (with Order)

### ✅ Description:

Custom mapping based on order.

In [20]:
# Explicit rank for each level; values missing from the dict would map to NaN
education_order = {'High School': 1, 'Bachelors': 2, 'Masters': 3, 'PhD': 4}
df['Education_Ordinal'] = df['Education'].map(education_order)

df

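|     | Color | Size   | Product_Category | City    | Education   | Price | Color_LE | Education_Ordinal |
| --- | ----- | ------ | ---------------- | ------- | ----------- | ----- | -------- | ----------------- |
| 0   | Red   | Small  | A                | Delhi   | High School | 100   | 2        | 1                 |
| 1   | Blue  | Large  | B                | Mumbai  | Bachelors   | 200   | 0        | 2                 |
| 2   | Green | Medium | C                | Chennai | Masters     | 150   | 1        | 3                 |
| 3   | Red   | Medium | A                | Delhi   | PhD         | 130   | 2        | 4                 |
| 4   | Blue  | Small  | B                | Mumbai  | Bachelors   | 220   | 0        | 2                 |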


### 📌 Use Case:

When values have clear order — e.g., education, size (Small < Medium < Large).
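The same idea can be expressed with scikit-learn's `OrdinalEncoder`, which accepts the order explicitly and fits into a pipeline. A minimal sketch for the `Size` example; the `sizes` frame is a stand-in for `df[['Size']]`, and the codes start at 0 rather than 1.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

sizes = pd.DataFrame({'Size': ['Small', 'Large', 'Medium', 'Medium', 'Small']})

# categories takes one list per column, ordered from lowest to highest.
encoder = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])

# fit_transform returns a 2-D float array; ravel() flattens the single column.
sizes['Size_Ordinal'] = encoder.fit_transform(sizes[['Size']]).ravel()
print(sizes)
```

Without the `categories` argument the encoder would fall back to alphabetical order (Large, Medium, Small), which is not the intended ranking.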

## 5️⃣ Binary Encoding

### ✅ Description:

Assigns each category an integer, writes that integer as a **binary number**, and stores each bit in its own column.

In [21]:
import category_encoders as ce 

In [22]:
encoder = ce.BinaryEncoder(cols=['City'])
encoder.fit_transform(df)

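|     | Color | Size   | Product_Category | City_0 | City_1 | Education   | Price | Color_LE | Education_Ordinal |
| --- | ----- | ------ | ---------------- | ------ | ------ | ----------- | ----- | -------- | ----------------- |
| 0   | Red   | Small  | A                | 0      | 1      | High School | 100   | 2        | 1                 |
| 1   | Blue  | Large  | B                | 1      | 0      | Bachelors   | 200   | 0        | 2                 |
| 2   | Green | Medium | C                | 1      | 1      | Masters     | 150   | 1        | 3                 |
| 3   | Red   | Medium | A                | 0      | 1      | PhD         | 130   | 2        | 4                 |
| 4   | Blue  | Small  | B                | 1      | 0      | Bachelors   | 220   | 0        | 2                 |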


### 📌 Use Case:

For **high cardinality** (many unique categories) where one-hot would explode columns.
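To make the mechanics concrete, the steps can be sketched in plain pandas. This is an illustration of the idea, not the `category_encoders` implementation; the numbering by first appearance starting at 1 is an assumption chosen so that this small example reproduces the `City_0`/`City_1` pattern shown above.

```python
import pandas as pd

cities = pd.Series(['Delhi', 'Mumbai', 'Chennai', 'Delhi', 'Mumbai'], name='City')

# Step 1: one integer per category, in order of first appearance, starting at 1.
codes, uniques = pd.factorize(cities)
codes = codes + 1  # Delhi=1, Mumbai=2, Chennai=3

# Step 2: number of bits needed to write the largest code.
n_bits = int(codes.max()).bit_length()

# Step 3: one column per bit, most significant bit first.
binary = pd.DataFrame({
    f'City_{i}': (codes >> (n_bits - 1 - i)) & 1
    for i in range(n_bits)
})
print(binary)
```

Three cities fit in 2 columns here; in general the column count grows with the logarithm of the number of categories rather than linearly as with one-hot encoding.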

## 6️⃣ Frequency Encoding

### ✅ Description:

Replaces category with **frequency count**.

In [25]:
freq_map = df['City'].value_counts().to_dict()
df['City_Freq'] = df['City'].map(freq_map)
df

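|     | Color | Size   | Product_Category | City    | Education   | Price | Color_LE | Education_Ordinal | City_Freq |
| --- | ----- | ------ | ---------------- | ------- | ----------- | ----- | -------- | ----------------- | --------- |
| 0   | Red   | Small  | A                | Delhi   | High School | 100   | 2        | 1                 | 2         |
| 1   | Blue  | Large  | B                | Mumbai  | Bachelors   | 200   | 0        | 2                 | 2         |
| 2   | Green | Medium | C                | Chennai | Masters     | 150   | 1        | 3                 | 1         |
| 3   | Red   | Medium | A                | Delhi   | PhD         | 130   | 2        | 4                 | 2         |
| 4   | Blue  | Small  | B                | Mumbai  | Bachelors   | 220   | 0        | 2                 | 2         |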


### 📌 Use Case:

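Often used with **boosting algorithms**; simple and memory-efficient. Categories that occur equally often become indistinguishable: `Delhi` and `Mumbai` both map to 2 in the output above.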
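Two common variations are to use relative frequencies instead of raw counts and to decide what a category unseen during training should receive. A minimal sketch, assuming a hypothetical `new_city` Series and a fill value of 0 for unseen categories (both are illustrative choices):

```python
import pandas as pd

train_city = pd.Series(['Delhi', 'Mumbai', 'Chennai', 'Delhi', 'Mumbai'])
new_city = pd.Series(['Delhi', 'Kolkata'])  # 'Kolkata' was not in the training data

# normalize=True returns shares of the total instead of raw counts.
freq_map = train_city.value_counts(normalize=True)

# Categories missing from freq_map come back as NaN; treat them as frequency 0.
new_freq = new_city.map(freq_map).fillna(0.0)
print(new_freq.tolist())
```

Relative frequencies stay comparable when the training set grows, whereas raw counts change with the number of rows.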

## 7️⃣ Target/Mean Encoding

### ✅ Description:

Replace category with the **mean of the target variable** (`Price` here).

In [29]:
mean_encoding = df.groupby('City')['Price'].mean().to_dict()
df['City_TargetEnc'] = df['City'].map(mean_encoding)
df

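|     | Color | Size   | Product_Category | City    | Education   | Price | Color_LE | Education_Ordinal | City_Freq | City_TargetEnc |
| --- | ----- | ------ | ---------------- | ------- | ----------- | ----- | -------- | ----------------- | --------- | -------------- |
| 0   | Red   | Small  | A                | Delhi   | High School | 100   | 2        | 1                 | 2         | 115.0          |
| 1   | Blue  | Large  | B                | Mumbai  | Bachelors   | 200   | 0        | 2                 | 2         | 210.0          |
| 2   | Green | Medium | C                | Chennai | Masters     | 150   | 1        | 3                 | 1         | 150.0          |
| 3   | Red   | Medium | A                | Delhi   | PhD         | 130   | 2        | 4                 | 2         | 115.0          |
| 4   | Blue  | Small  | B                | Mumbai  | Bachelors   | 220   | 0        | 2                 | 2         | 210.0          |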


### 📌 Use Case:

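Common with boosting models such as XGBoost, LightGBM, and CatBoost. Be cautious of data leakage: the encoding above computes each mean from the same rows it encodes, so the target leaks into the feature (`Chennai` simply receives its own price). Compute the means out-of-fold with **cross-validation**, or on the training data only.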
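One way to follow the cross-validation advice is to compute each row's encoding only from rows in the other folds. The sketch below assumes scikit-learn's `KFold`; the 3-fold split, the `random_state`, and the fallback to the global mean for categories absent from the fitting folds are illustrative choices, not part of the original notebook.

```python
import pandas as pd
from sklearn.model_selection import KFold

data = pd.DataFrame({
    'City': ['Delhi', 'Mumbai', 'Chennai', 'Delhi', 'Mumbai'],
    'Price': [100, 200, 150, 130, 220],
})

global_mean = data['Price'].mean()
data['City_TargetEnc'] = global_mean  # placeholder, overwritten fold by fold

kf = KFold(n_splits=3, shuffle=True, random_state=0)
for fit_idx, enc_idx in kf.split(data):
    # Means are computed only from rows outside the fold being encoded.
    fold_means = data.iloc[fit_idx].groupby('City')['Price'].mean()
    encoded = data.iloc[enc_idx]['City'].map(fold_means)
    # A city absent from the fitting rows has no mean; use the global mean.
    encoded = encoded.fillna(global_mean)
    data.loc[data.index[enc_idx], 'City_TargetEnc'] = encoded

print(data)
```

No row contributes its own `Price` to its encoding. `Chennai` appears once, so it always falls back to the global mean instead of echoing its own target value.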

## 8️⃣ Hash Encoding

### ✅ Description:

Encodes categories into a fixed number of columns using a hash function.

In [32]:
encoder = ce.HashingEncoder(cols=['City'], n_components=4)
encoder.fit_transform(df)

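|     | col_0 | col_1 | col_2 | col_3 | Color | Size   | Product_Category | Education   | Price | Color_LE | Education_Ordinal | City_Freq | City_TargetEnc |
| --- | ----- | ----- | ----- | ----- | ----- | ------ | ---------------- | ----------- | ----- | -------- | ----------------- | --------- | -------------- |
| 0   | 0     | 0     | 1     | 0     | Red   | Small  | A                | High School | 100   | 2        | 1                 | 2         | 115.0          |
| 1   | 0     | 1     | 0     | 0     | Blue  | Large  | B                | Bachelors   | 200   | 0        | 2                 | 2         | 210.0          |
| 2   | 0     | 0     | 1     | 0     | Green | Medium | C                | Masters     | 150   | 1        | 3                 | 1         | 150.0          |
| 3   | 0     | 0     | 1     | 0     | Red   | Medium | A                | PhD         | 130   | 2        | 4                 | 2         | 115.0          |
| 4   | 0     | 1     | 0     | 0     | Blue  | Small  | B                | Bachelors   | 220   | 0        | 2                 | 2         | 210.0          |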


### 📌 Use Case:

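Efficient for very high cardinality, when even one-hot or binary columns would take too much memory. The price is collisions: distinct categories can share a column, as `Delhi` and `Chennai` both do in `col_2` above.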
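The hashing trick itself can be sketched with the standard library. This is a simplified illustration, not the `category_encoders` implementation: `hash_bucket` is a hypothetical helper, and the choice of md5 is an assumption made so the result is stable across runs (Python's built-in `hash()` for strings is salted per process).

```python
import hashlib
import pandas as pd

def hash_bucket(value: str, n_components: int) -> int:
    """Map a category to a column index using a stable hash (md5 here)."""
    digest = hashlib.md5(value.encode('utf-8')).hexdigest()
    return int(digest, 16) % n_components

cities = pd.Series(['Delhi', 'Mumbai', 'Chennai', 'Delhi', 'Mumbai'])
n_components = 4

buckets = cities.map(lambda c: hash_bucket(c, n_components))

# One indicator column per bucket; the column count is fixed up front,
# no matter how many distinct categories show up later.
hashed = pd.DataFrame(
    0,
    index=cities.index,
    columns=[f'col_{i}' for i in range(n_components)],
)
for row, bucket in buckets.items():
    hashed.loc[row, f'col_{bucket}'] = 1

print(hashed)
```

No mapping needs to be stored, so unseen categories are handled automatically; the trade-off is that the encoding cannot be reversed and collisions become more likely as `n_components` shrinks.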

## 📊 Summary Table

| Technique          | Handles Order? | High Cardinality? | Memory Efficient | Model Compatibility |
| ------------------ | -------------- | ----------------- | ---------------- | ------------------- |
| Label Encoding     | ❌              | ❌                 | ✅                | Tree-based models   |
| One-Hot Encoding   | ❌              | ❌                 | ❌                | All ML models       |
| get\_dummies()     | ❌              | ❌                 | ❌                | All ML models       |
| Ordinal Encoding   | ✅              | ❌                 | ✅                | Linear/Tree models  |
| Binary Encoding    | ❌              | ✅                 | ✅                | Tree/linear         |
| Frequency Encoding | ❌              | ✅                 | ✅                | Boosting models     |
| Target Encoding    | ❌              | ✅                 | ✅                | Boosting models     |
| Hash Encoding      | ❌              | ✅                 | ✅                | Any                 |

