### Demystifying Categorical Data: Types and Challenges

#### Types of Categorical Data
##### 1. Nominal Data
- Categories have **no order or ranking**
- Values are just labels, not comparable numerically
- Examples:
  - Colors: Red, Blue, Green
  - Cities: New York, London, Tokyo
  - Gender: Male, Female, Non-binary
  - Animal Species: Cat, Dog, Bird
- No category is greater or smaller than another

##### 2. Ordinal Data
- Categories have a **natural order**
- Difference between categories is **not numerically defined**
- Examples:
  - Education Level: High School < Bachelor’s < Master’s < PhD
  - Customer Satisfaction: Very Unsatisfied → Very Satisfied
  - T-shirt Sizes: Small < Medium < Large < XL
  - Likert Scale responses
- Order matters, but distance does not

---

### Why Categorical Data is a Challenge for ML

- Most ML algorithms work only with **numerical data**
- Raw categorical values cannot be used directly

#### Key Problems:
1. **Meaningless Mathematics**
   - Arbitrary numbers (Red=1, Blue=2) imply false ordering and magnitude

2. **Spurious Correlations**
   - Model may learn incorrect relationships due to numeric mapping

3. **Loss of Non-Linear Relationships**
   - Improper encoding can hide complex patterns

4. **Dimensionality Issues**
   - High-cardinality features can cause sparse and inefficient models
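
The "meaningless mathematics" problem can be made concrete with a small sketch. The mapping `color_to_int` and the sample list below are hypothetical values chosen only for illustration; the point is that ordinary arithmetic on arbitrary codes produces results with no real-world meaning.

```python
import numpy as np

# Hypothetical arbitrary mapping for a nominal feature (illustration only).
color_to_int = {"Red": 1, "Blue": 2, "Green": 3}
int_to_color = {v: k for k, v in color_to_int.items()}

colors = ["Red", "Green", "Red", "Green"]
codes = np.array([color_to_int[c] for c in colors])

# The mean of the codes is 2.0, which maps back to "Blue",
# even though no Blue item appears in the data.
mean_code = codes.mean()
print(mean_code, int_to_color[int(mean_code)])

# The codes also imply that Green is twice as far from Red as Blue is,
# a distance that has no meaning for colors.
dist_red_blue = abs(color_to_int["Red"] - color_to_int["Blue"])
dist_red_green = abs(color_to_int["Red"] - color_to_int["Green"])
print(dist_red_blue, dist_red_green)
```

A model that multiplies these codes by a weight, or measures distances between them, would inherit the same artificial structure.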

---

### Need for Encoding

- Encoding converts categorical data into **numerical form**
- Goals of encoding:
  - Avoid unintended bias
  - Preserve category information
  - Make data usable for ML algorithms


## One-Hot Encoding (For Nominal Categorical Data)

#### What is One-Hot Encoding?
- A technique to convert **categorical (nominal) features** into numerical form
- Creates **binary (0/1) columns** for each unique category
- Suitable when categories have **no natural order**

---

#### How One-Hot Encoding Works
Example feature: `Color`

Original values: Red, Blue, Green, Red, Blue

After One-Hot Encoding:

| Color_Red | Color_Blue | Color_Green |
|----------|------------|-------------|
| 1 | 0 | 0 |
| 0 | 1 | 0 |
| 0 | 0 | 1 |
| 1 | 0 | 0 |
| 0 | 1 | 0 |

- Each row has **exactly one 1**
- Remaining columns are 0
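
As a minimal sketch, the table above can be reproduced with `pd.get_dummies()`. The DataFrame name `df_colors` is an assumption made for this example, and the column reordering is only there to match the table, since pandas orders the new columns alphabetically.

```python
import pandas as pd

# Five sample rows matching the table above.
df_colors = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Red", "Blue"]})

# dtype=int yields 0/1 values instead of True/False.
encoded = pd.get_dummies(df_colors, columns=["Color"], dtype=int)

# get_dummies sorts the new columns alphabetically; reorder to match the table.
encoded = encoded[["Color_Red", "Color_Blue", "Color_Green"]]
print(encoded)

# Every row contains exactly one 1.
print(encoded.sum(axis=1).tolist())
```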

---

#### Why One-Hot Encoding is Important
- **Avoids false ordering**
  - No category is treated as greater or smaller
- **ML-compatible**
  - Converts text labels into numeric form
- **Better learning**
  - Model learns category-specific patterns
- **Can handle unseen categories**
  - Categories not seen during fitting can be encoded as all zeros (for example, with `OneHotEncoder(handle_unknown='ignore')` in scikit-learn)

---

#### When to Use One-Hot Encoding
- Nominal categorical data
- Low to medium number of categories
- Linear models, Logistic Regression, SVMs, Neural Networks

---

#### Implementation Options
- **Pandas:** `pd.get_dummies()`
- **Scikit-learn:** `OneHotEncoder`
  - Preferred for ML pipelines
  - Learns the categories from the training data only, which helps prevent data leakage
  - Applies the same learned columns to both train and test data
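
The sketch below illustrates the train-test consistency point under simple assumptions: the encoder is fitted on a small training frame and then applied to a test frame that contains a category ("Purple") absent from training. The frames `X_train` and `X_test` are hypothetical toy data.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: "Purple" never appears in the training set.
X_train = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Red"]})
X_test = pd.DataFrame({"Color": ["Blue", "Purple"]})

# Fit on training data only; unknown categories are ignored at transform time.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(X_train)

# The default output is a SciPy sparse matrix; convert it for display.
test_encoded = encoder.transform(X_test).toarray()

print(encoder.categories_[0])  # categories learned from the training data
print(test_encoded)            # the "Purple" row is all zeros
```

With `pd.get_dummies()`, the train and test frames could end up with different columns, which is one reason the fitted encoder is usually the safer choice in a pipeline.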

## Label Encoding

#### What is Label Encoding?
- Converts categorical values into **integer labels**
- Each unique category gets a unique number
- Replaces the original categorical column with numbers

**Example:** `City`
- New York → 0
- London → 1
- Tokyo → 2

---

#### Why is Label Encoding Important?
- **Simple & fast**
  - Easy to implement and understand
- **No increase in dimensions**
  - Keeps dataset compact (unlike One-Hot Encoding)
- **Works well with tree-based models**
  - Decision Trees, Random Forests, Gradient Boosting
  - These models split on feature thresholds rather than computing distances

---

#### When to Use Label Encoding
- **Ordinal categorical data**
  - Example: Small (0), Medium (1), Large (2)
- **Tree-based algorithms**
  - The artificial order usually does little harm, because repeated splits can still isolate individual categories
- **High-cardinality features**
  - Where One-Hot Encoding is impractical

---

#### When NOT to Use Label Encoding
- Nominal data with **no natural order**
- Linear models, Logistic Regression, SVMs
  - Model may assume false numerical relationships

---

#### Implementation in Scikit-learn
- `LabelEncoder`
  - Mostly used for **target variable (y)** in classification
- `OrdinalEncoder`
  - Preferred for **input features**
  - Works well inside pipelines and offers more control
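
A minimal sketch of the two encoders side by side, assuming hypothetical toy data (`y` with "spam"/"ham" labels and a one-column city feature). Note that both encoders assign integers in sorted order of the category values by default, so the resulting numbers can differ from a hand-written mapping such as the City example earlier in this section.

```python
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Target variable (1D): LabelEncoder is meant for y.
y = ["spam", "ham", "spam", "ham"]
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)
print(label_encoder.classes_)  # classes in sorted order
print(y_encoded)

# Input feature (2D, one row per sample): OrdinalEncoder is meant for X.
X = [["New York"], ["London"], ["Tokyo"], ["London"]]
feature_encoder = OrdinalEncoder()
X_encoded = feature_encoder.fit_transform(X)
print(feature_encoder.categories_[0])  # categories in sorted order
print(X_encoded.ravel())
```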


## Ordinal Encoding

#### What is Ordinal Encoding?
- A controlled encoding technique for **ordinal categorical features**
- Converts categories into integers **while preserving their natural order**
- User explicitly defines the category-to-number mapping

**Example: Education Level**
- High School → 0
- Bachelor's Degree → 1
- Master's Degree → 2
- PhD → 3

Numeric values reflect the true ranking.
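
One simple way to apply a user-defined mapping like this is a plain dictionary with pandas `Series.map`. This is only a sketch; the names `education` and `education_order` are assumptions for the example, and scikit-learn's `OrdinalEncoder` is an alternative when the encoding has to live inside a pipeline.

```python
import pandas as pd

# Hypothetical sample of education levels.
education = pd.Series(
    ["Bachelor's Degree", "PhD", "High School", "Master's Degree"]
)

# Explicit category-to-number mapping that preserves the natural order.
education_order = {
    "High School": 0,
    "Bachelor's Degree": 1,
    "Master's Degree": 2,
    "PhD": 3,
}

education_encoded = education.map(education_order)
print(education_encoded.tolist())
```

Values missing from the dictionary would become `NaN`, which makes unexpected categories easy to spot.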

---

#### Why is Ordinal Encoding Important?
- **Preserves meaningful order**
  - Ensures models understand category ranking
- **Controlled mapping**
  - Avoids incorrect ordering from automatic encoders
- **Works with many ML algorithms**
  - Suitable for models sensitive to numeric magnitude
- **No dimensionality increase**
  - More efficient than One-Hot Encoding

---

#### When to Use Ordinal Encoding
- Only for **ordinal categorical data**
  - Education level, ratings, sizes, satisfaction scales
- When category order is **clear and meaningful**
- When using algorithms sensitive to numerical order
  - Linear models, Logistic Regression, SVMs

---

#### When NOT to Use Ordinal Encoding
- Nominal features with no order
  - Color, city, gender
- When numeric distance between categories is misleading

---

#### Implementation (Scikit-learn)
- Use `OrdinalEncoder`
- Allows custom category ordering
- Designed for feature encoding inside pipelines

In [3]:
from sklearn.preprocessing import OrdinalEncoder

# Explicit order: Junior (0) < Mid-level (1) < Senior (2)
experience_categories = [['Junior', 'Mid-level', 'Senior']]
ordinal_encoder = OrdinalEncoder(categories=experience_categories)

# Each inner list is one sample with a single feature
data = [['Junior'], ['Mid-level'], ['Senior'], ['Junior'], ['Junior'], ['Mid-level'], ['Senior']]
encoded_data = ordinal_encoder.fit_transform(data)
print(encoded_data)

[[0.]
 [1.]
 [2.]
 [0.]
 [0.]
 [1.]
 [2.]]


## Choosing the Right Encoding Technique

| Feature Type | Algorithm Type | Recommended Encoding | Notes |
|--------------|----------------|----------------------|-------|
| Ordinal | All | Ordinal Encoding | Explicitly define the correct category order. |
| Nominal | Magnitude-Sensitive (Linear, SVM, Neural Networks) | One-Hot Encoding | Prevents artificial ordering; drop one column if multicollinearity is a concern. |
| Nominal | Tree-Based (Decision Tree, Random Forest, GBM) | One-Hot or Label Encoding | Tree models tolerate Label Encoding; prefer Label Encoding for high cardinality. |
| Nominal | High Cardinality | Target Encoding, Frequency Encoding, Binary Encoding, Feature Hashing | Advanced methods reduce dimensionality; careful validation required. |
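
Of the high-cardinality options in the last row, frequency encoding is the simplest to sketch: each category is replaced by its relative frequency, so the feature stays a single column. The `cities` series below is hypothetical toy data, and in practice the frequencies would be computed on the training set only.

```python
import pandas as pd

# Hypothetical toy feature.
cities = pd.Series(["London", "Tokyo", "London", "Paris", "London", "Tokyo"])

# Relative frequency of each category (here computed on all rows for brevity).
freq = cities.value_counts(normalize=True)

# Replace each category with its frequency; the result is one numeric column.
cities_freq_encoded = cities.map(freq)
print(cities_freq_encoded.round(3).tolist())
```

One known limitation of this sketch is that categories with the same frequency receive the same value and become indistinguishable to the model.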


## Implementing Categorical Encoding with Scikit-learn and Pandas

In [5]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

In [6]:
data = {
    'Color': ['Red', 'Blue', 'Green', 'Red', 'Blue', 'Green', 'Red', 'Blue', 'Green', 'Red'],
    'Size': ['Small', 'Medium', 'Large', 'Medium', 'Small', 'Large', 'Medium', 'Small', 'Large', 'Medium'],
    'Material': ['Cotton', 'Polyester', 'Cotton', 'Silk', 'Polyester', 'Cotton', 'Silk', 'Cotton', 'Polyester', 'Silk'],
    'Price': [10, 25, 15, 30, 20, 18, 28, 12, 22, 35],
    'Target': [0, 1, 0, 1, 0, 0, 1, 0, 0, 1] # 0: Low Value, 1: High Value
}
df = pd.DataFrame(data)
print(df.head())

   Color    Size   Material  Price  Target
0    Red   Small     Cotton     10       0
1   Blue  Medium  Polyester     25       1
2  Green   Large     Cotton     15       0
3    Red  Medium       Silk     30       1
4   Blue   Small  Polyester     20       0


In [17]:
# Identifying feature types
nominal_features = ['Color','Material']
ordinal_features = ['Small', 'Medium', 'Large']  # ordered categories of 'Size', lowest to highest (category values, not column names)
numerical_features = ['Price']
target = 'Target'

# Applying One-Hot Encoding to Nominal Features
onehot_encoder = OneHotEncoder(handle_unknown='ignore',sparse_output=False)
df_onehot = df.copy()
onehot_encoded_data = onehot_encoder.fit_transform(df_onehot[nominal_features])
# Create a DataFrame from the encoded data
onehot_feature_names = onehot_encoder.get_feature_names_out(nominal_features)
onehot_df = pd.DataFrame(onehot_encoded_data, columns=onehot_feature_names)
# Drop original nominal columns and concatenate with encoded ones
df_onehot = df_onehot.drop(columns=nominal_features)
df_onehot = pd.concat([df_onehot, onehot_df], axis=1)

print("DataFrame after One-Hot Encoding:")
print(df_onehot)

DataFrame after One-Hot Encoding:
     Size  Price  Target  Color_Blue  Color_Green  Color_Red  Material_Cotton  \
0   Small     10       0         0.0          0.0        1.0              1.0   
1  Medium     25       1         1.0          0.0        0.0              0.0   
2   Large     15       0         0.0          1.0        0.0              1.0   
3  Medium     30       1         0.0          0.0        1.0              0.0   
4   Small     20       0         1.0          0.0        0.0              0.0   
5   Large     18       0         0.0          1.0        0.0              1.0   
6  Medium     28       1         0.0          0.0        1.0              0.0   
7   Small     12       0         1.0          0.0        0.0              1.0   
8   Large     22       0         0.0          1.0        0.0              0.0   
9  Medium     35       1         0.0          0.0        1.0              0.0   

   Material_Polyester  Material_Silk  
0                 0.0            0.

##### OneHotEncoder(handle_unknown='ignore', sparse_output=False):
 - **handle_unknown='ignore'**: If a category that was not seen during `fit` appears at transform time, the encoder does not raise an error. Instead, all one-hot columns of that feature are set to zero for that row.
 - **sparse_output=False**: Returns a dense NumPy array instead of a SciPy sparse matrix, which is easier to inspect and to wrap in a DataFrame for small datasets. This parameter is available from scikit-learn 1.2 onward; earlier versions call it `sparse`.
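
To show the difference between the two output types, the sketch below leaves the encoder at its default (sparse) output and converts the result with `.toarray()`, which gives the same dense values that `sparse_output=False` would return. The frame `X_material` is hypothetical toy data.

```python
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy feature with three categories.
X_material = pd.DataFrame({"Material": ["Cotton", "Polyester", "Silk", "Cotton"]})

# Default behaviour: the encoded result is a SciPy sparse matrix.
sparse_result = OneHotEncoder().fit_transform(X_material)
print(sparse.issparse(sparse_result))

# Dense equivalent: a regular NumPy array with one column per category.
dense_result = sparse_result.toarray()
print(type(dense_result).__name__, dense_result.shape)
```

Sparse output stores only the non-zero entries, so it tends to pay off when a feature has many categories.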

In [18]:
# Applying Ordinal Encoding to Ordinal Features
# Explicit category order for 'Size', lowest to highest
size_order = ['Small', 'Medium', 'Large']
ordinal_encoder = OrdinalEncoder(categories=[size_order])
df_ordinal = df.copy()
# Double brackets select a one-column DataFrame, since the encoder expects 2D input
ordinal_encoded_data = ordinal_encoder.fit_transform(df_ordinal[['Size']])
df_ordinal['Size_Encoded'] = ordinal_encoded_data
df_ordinal = df_ordinal.drop(columns=['Size'])

print("DataFrame after Ordinal Encoding:")
print(df_ordinal)

DataFrame after Ordinal Encoding:
   Color   Material  Price  Target  Size_Encoded
0    Red     Cotton     10       0           0.0
1   Blue  Polyester     25       1           1.0
2  Green     Cotton     15       0           2.0
3    Red       Silk     30       1           1.0
4   Blue  Polyester     20       0           0.0
5  Green     Cotton     18       0           2.0
6    Red       Silk     28       1           1.0
7   Blue     Cotton     12       0           0.0
8  Green  Polyester     22       0           2.0
9    Red       Silk     35       1           1.0


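- **OrdinalEncoder(categories=[...])**: The list passed to `categories` explicitly sets the order for the 'Size' feature (Small < Medium < Large). Without it, scikit-learn sorts the categories alphabetically (Large, Medium, Small), which would produce the wrong ranking.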
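
The two encoders can also be combined in a single preprocessing step. The following is a minimal sketch, assuming the same toy dataset as above and a `LogisticRegression` classifier; the step names (`"nominal"`, `"ordinal"`, `"preprocess"`, `"classifier"`) are arbitrary labels chosen for this example, and the model is fitted on all ten rows only to show the mechanics, not to evaluate it.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Same toy dataset as in the cells above.
df = pd.DataFrame({
    "Color": ["Red", "Blue", "Green", "Red", "Blue", "Green", "Red", "Blue", "Green", "Red"],
    "Size": ["Small", "Medium", "Large", "Medium", "Small", "Large", "Medium", "Small", "Large", "Medium"],
    "Material": ["Cotton", "Polyester", "Cotton", "Silk", "Polyester", "Cotton", "Silk", "Cotton", "Polyester", "Silk"],
    "Price": [10, 25, 15, 30, 20, 18, 28, 12, 22, 35],
    "Target": [0, 1, 0, 1, 0, 0, 1, 0, 0, 1],
})

size_order = ["Small", "Medium", "Large"]

# One-hot encode nominal columns, ordinal-encode Size, pass Price through unchanged.
preprocessor = ColumnTransformer(
    transformers=[
        ("nominal", OneHotEncoder(handle_unknown="ignore"), ["Color", "Material"]),
        ("ordinal", OrdinalEncoder(categories=[size_order]), ["Size"]),
    ],
    remainder="passthrough",
)

model = Pipeline([
    ("preprocess", preprocessor),
    ("classifier", LogisticRegression(max_iter=1000)),
])

X = df.drop(columns=["Target"])
y = df["Target"]
model.fit(X, y)

# 3 color columns + 3 material columns + 1 size column + 1 price column
X_transformed = model.named_steps["preprocess"].transform(X)
print(X_transformed.shape)
print(model.predict(X))
```

Because the encoders are fitted inside the pipeline, calling `model.fit` on a training split and `model.predict` on a test split would apply exactly the same category mappings to both.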