# **📊 What is Categorical Data?**

- **Categorical data** refers to variables that represent **categories or labels** rather than numeric values. These categories may or may not have an **inherent order**.

- Categorical data is common in real-world datasets and often needs special handling during preprocessing (e.g., imputation, encoding).

---

## 🧩 Types of Categorical Data
- Nominal & Ordinal

    | Type              | Description                                  | Example Values                |
    |-------------------|----------------------------------------------|-------------------------------|
    | 🏷️ Nominal        | No natural order or ranking                  | `red`, `blue`, `green`        |
    | 📏 Ordinal        | Has a meaningful order, but not numeric      | `low`, `medium`, `high`       |
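
The nominal/ordinal distinction above can be made explicit in pandas. The following is a minimal sketch, assuming pandas is installed; the variable names `colors` and `levels` are illustrative and not part of the original notebook. The category order has to be declared by hand, because pandas does not infer it from the labels.

```python
import pandas as pd

# Nominal: no order is declared, so only equality checks are meaningful.
colors = pd.Categorical(["red", "blue", "green", "red"])

# Ordinal: the order low < medium < high is stated explicitly.
levels = pd.Categorical(
    ["medium", "low", "high", "low"],
    categories=["low", "medium", "high"],
    ordered=True,
)

print(colors.ordered)                # False
print(levels.ordered)                # True
print(levels.min(), levels.max())    # low high
print((levels < "high").tolist())    # [True, True, False, True]
```

Ordered comparisons such as `levels < "high"` only work because `ordered=True` was set; the same comparison on `colors` would raise a `TypeError`.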

---

## 📝 Examples of Categorical Data

### 1. 🧥 Clothing Data (Nominal)
- These variables represent **labels** with **no inherent order**.

    | Color  | Brand  | Type   |
    |--------|--------|--------|
    | Red    | Nike   | Hoodie |
    | Blue   | Adidas | Jacket |
    | Green  | Puma   | Shirt  |

---

### 2. 📦 Product Ratings (Ordinal)
- `Size` and `Rating` are **ordinal**: their values have a **ranking**, but the intervals between ranks are not necessarily equal.

    | Product | Size   | Rating  |
    |---------|--------|---------|
    | A       | Small  | Low     |
    | B       | Medium | Medium  |
    | C       | Large  | High    |

---

## 🛠️ Why Categorical Data Matters

* Categorical data:
    - Cannot be used directly in most machine learning models, which expect numeric input
    - Must be **encoded** into numbers (e.g., one-hot encoding for nominal data, ordinal or label encoding for ordered data)
    - Can contain **missing values** that need special handling
    - Might hold important **patterns and relationships**
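
As a rough illustration of the encoding step mentioned above, the sketch below one-hot encodes a nominal column and maps an ordinal column to integer ranks. It assumes pandas is available; `df_demo`, `size_order`, and `size_rank` are hypothetical names chosen for this example, and the rank values 0, 1, 2 are an assumption about the intended order.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "color": ["red", "blue", "green"],     # nominal
    "size": ["small", "large", "medium"],  # ordinal
})

# One-hot encoding: one 0/1 column per category, no order implied.
one_hot = pd.get_dummies(df_demo["color"], prefix="color", dtype=int)

# Ordinal encoding: an explicit mapping keeps small < medium < large.
size_order = {"small": 0, "medium": 1, "large": 2}
size_encoded = df_demo["size"].map(size_order).rename("size_rank")

encoded = pd.concat([one_hot, size_encoded], axis=1)
print(encoded)
```

One-hot encoding avoids implying an order among colors, while the explicit mapping preserves the order among sizes. The integer ranks still treat the gaps between sizes as equal, which may or may not suit a given model.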

---

## 🧪 Sample: Mixed Data with Missing Values
- `—` = missing values
- All columns are categorical
- Data must be cleaned before modeling

    | Color | Animal | Size   | Brand  | Type |
    |-------|--------|--------|--------|------|
    | Red   | Dog    | Small  | Nike   | A    |
    | Blue  | —      | Large  | Adidas | —    |
    | —     | Cat    | —      | Nike   | B    |
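
If a dataset marks missing entries with a placeholder such as `—`, pandas treats that placeholder as an ordinary string. The sketch below, which assumes pandas and NumPy are installed and uses the hypothetical name `sample`, converts the placeholder to `NaN` so that `isnull()` and `fillna()` can detect it.

```python
import numpy as np
import pandas as pd

# The sample table from above, with "—" standing in for missing entries.
sample = pd.DataFrame({
    "Color":  ["Red", "Blue", "—"],
    "Animal": ["Dog", "—", "Cat"],
    "Size":   ["Small", "Large", "—"],
    "Brand":  ["Nike", "Adidas", "Nike"],
    "Type":   ["A", "—", "B"],
})

# Replace the placeholder with NaN so pandas recognizes it as missing.
sample = sample.replace("—", np.nan)

# Count missing values per column.
print(sample.isnull().sum())
```

When the data comes from a file, passing the placeholder through the `na_values` argument of `pd.read_csv` achieves the same result at load time.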

---

## 📌 Summary
- Learn how to handle missing categorical values
- Explore encoding techniques like **One-Hot**, **Label**, or **Ordinal Encoding**

    | Concept       | Key Point                                   |
    |---------------|---------------------------------------------|
    | 📊 Categorical | Represents labels, not numbers              |
    | 🧷 Nominal     | No order (e.g., `red`, `dog`)               |
    | 📏 Ordinal     | Ordered (e.g., `small` < `medium` < `large`)|
    | ⚠️ Must Encode | Cannot be fed directly into most ML models  |
    | 🧼 Imputation  | Missing values should be filled appropriately|

---

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.impute import SimpleImputer

In [None]:
# Toy dataset: five categorical columns, each with missing values (np.nan)

df = pd.DataFrame({
    'color': ['red', 'blue', np.nan, 'green', 'red', np.nan],      
    'animal': ['dog', np.nan, 'cat', 'dog', np.nan, 'cat'],        
    'size': ['small', 'large', np.nan, 'medium', 'small', np.nan], 
    'brand': ['nike', 'adidas', 'nike', np.nan, 'puma', np.nan],   
    'type': ['A', np.nan, 'B', 'A', 'B', np.nan]                   
})

print("Original DataFrame:")
print(df)

In [None]:
# Count missing values per column
df.isnull().sum()

In [None]:
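# Mode imputation: fill missing 'color' values with the most frequent category.
# mode() returns a Series (ties are possible), so [0] takes the first value.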

df['color'] = df['color'].fillna(df['color'].mode()[0])

print(df)

In [None]:
# Constant label: treat "missing" as its own category in 'animal'

df['animal'] = df['animal'].fillna('Missing')

print(df)

In [None]:
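# Missing indicator: record which 'size' rows were missing (1 = missing)
# before filling them, so the information is not lost after imputation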

df['size_missing'] = df['size'].isnull().astype(int)
df['size'] = df['size'].fillna('Missing')

print(df)

In [None]:
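# One-hot encode 'brand'. Missing values are not imputed here: with the
# default dummy_na=False, rows where 'brand' is NaN get 0 in every brand_* column.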

df = pd.get_dummies(df, columns=['brand'], drop_first=False, dtype=int)

print(df)

In [None]:
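# Mode imputation with scikit-learn's SimpleImputer.
# Note: strategy='most_frequent' fills with a per-column statistic;
# it is not a model-based (predictive) imputation.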

imputer = SimpleImputer(strategy='most_frequent')
df[['type']] = imputer.fit_transform(df[['type']])

print(df)