
## Q1. What is Data Encoding?

**Data encoding** refers to converting categorical values (non-numeric) into numerical form so they can be used by machine learning models.

📌 **Why it’s useful**:
- Most ML algorithms require **numerical input**.
- A suitable encoding lets the model use the information in categorical features when learning their relationship to the target variable.

---

## Q2. What is Nominal Encoding?

**Nominal encoding** assigns arbitrary integers to categories.

🧠 **Important**: The integers are only identifiers. They carry **no** meaningful order or magnitude, although a model may still treat them as ordered numbers.

### 🔹 Example:
Suppose a dataset contains colors: `['Red', 'Green', 'Blue']`.

Nominal Encoding (in alphabetical order, which is what `LabelEncoder` produces):
- Blue → 0
- Green → 1
- Red → 2

```python
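from sklearn.preprocessing import LabelEncoder

colors = ['Red', 'Green', 'Blue', 'Green', 'Red']

# LabelEncoder sorts the unique labels, so Blue=0, Green=1, Red=2.
# scikit-learn intends it for target labels; OrdinalEncoder is the
# equivalent for feature columns.
le = LabelEncoder()
encoded = le.fit_transform(colors)
print("Encoded Colors:", encoded)  # Encoded Colors: [2 1 0 1 2]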
```
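
The integer assigned to each category depends on the sorted order of the labels, not on the order in which they appear. A minimal sketch for inspecting and reversing the mapping, reusing the same `colors` list (the `mapping` and `decoded` names are just illustrative):

```python
from sklearn.preprocessing import LabelEncoder

colors = ['Red', 'Green', 'Blue', 'Green', 'Red']
le = LabelEncoder()
encoded = le.fit_transform(colors)

# classes_ holds the sorted unique labels; each label's index is its code
mapping = {label: code for code, label in enumerate(le.classes_)}
print("Mapping:", mapping)

# inverse_transform recovers the original labels from the integer codes
decoded = le.inverse_transform(encoded)
print("Decoded:", list(decoded))
```

Checking the mapping this way is a cheap safeguard before the codes are fed to a model or stored.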

---

## Q3. When is Nominal Encoding Preferred Over One-Hot Encoding?

🔸 Nominal encoding is preferred when:
- The feature has **many unique categories** (e.g., zip codes, product IDs).
- Memory efficiency is crucial (One-Hot creates many columns).

### 🔹 Practical Example:
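One-hot encoding 10,000 unique product IDs creates 10,000 columns. Label encoding with integers 0–9999 keeps a single column and is far more memory-efficient. The trade-off is that the integers imply an order that does not exist, so this approach suits tree-based models better than linear or distance-based ones.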
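
To make the size difference concrete, here is a rough sketch that encodes 10,000 made-up product IDs both ways. The `product_id` column and the `P00000`-style IDs are hypothetical, and the one-hot result is kept as a sparse matrix (the `OneHotEncoder` default) so it fits in memory.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: 10,000 unique product IDs, one row each
product_ids = pd.DataFrame(
    {'product_id': [f'P{i:05d}' for i in range(10_000)]}
)

# Integer codes: a single column of small integers
int_codes = product_ids['product_id'].astype('category').cat.codes.to_numpy()

# One-hot: one column per unique ID, returned as a SciPy sparse matrix
one_hot = OneHotEncoder().fit_transform(product_ids[['product_id']])

print("Integer codes shape:", int_codes.shape)
print("One-hot shape:", one_hot.shape)
print("Stored non-zeros:", one_hot.nnz)

# Size of the same one-hot matrix if it were stored densely as float64
dense_mb = one_hot.shape[0] * one_hot.shape[1] * 8 / 1e6
print(f"Dense one-hot would need about {dense_mb:.0f} MB")
```

A sparse matrix softens the memory cost, but the model still has to handle 10,000 input features.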

---

## Q4. Dataset with 5 Unique Categorical Values – Which Encoding to Use?

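If the 5 values are **nominal (no intrinsic order)**, like colors or animals → use **One-Hot Encoding**. With only 5 categories it adds just 5 columns, which is easily manageable.

If the 5 values are **ordinal (a natural ranking)**, like low < medium < high → use **Ordinal Encoding** so the integers preserve that ranking.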

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Red', 'Blue']})

# `sparse_output` replaced the `sparse` argument in scikit-learn 1.2
ohe = OneHotEncoder(sparse_output=False)
encoded = ohe.fit_transform(df[['Color']])
print("Categories:", ohe.categories_[0])
print("One-Hot Encoded:\n", encoded)
```
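
If the five values did have a natural ranking, integer codes that follow that ranking would be the more fitting choice. A minimal sketch using an ordered pandas categorical; the `Rating` column and its five levels are assumptions made up for illustration:

```python
import pandas as pd

# Hypothetical ordinal feature with five ranked levels
levels = ['very low', 'low', 'medium', 'high', 'very high']
df = pd.DataFrame({'Rating': ['low', 'very high', 'medium', 'very low', 'high']})

# An ordered categorical fixes the ranking explicitly instead of
# relying on alphabetical order
df['Rating'] = pd.Categorical(df['Rating'], categories=levels, ordered=True)

# cat.codes gives each value its position in `levels`
df['Rating_code'] = df['Rating'].cat.codes
print(df)
```

Listing the levels explicitly matters: left to alphabetical sorting, `high` would receive a lower code than `low`.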

---

## Q5. Dataset with 1000 Rows, 5 Columns (2 Categorical, 3 Numerical)

Assume:
- Categorical Column A → 4 unique values
- Categorical Column B → 3 unique values

If using **One-Hot Encoding**:
- Column A → 4 new columns
- Column B → 3 new columns  
➡️ **Total: 7 columns added**

Final dataset = 3 (numerical) + 7 (encoded) = **10 columns**

```python
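import pandas as pd

# Simulate one-hot encoding: A has 4 unique values, B has 3
categorical_data = pd.DataFrame({
    'A': ['w', 'x', 'y', 'z'],
    'B': ['p', 'q', 'p', 'r']
})
encoded_df = pd.get_dummies(categorical_data)
print(encoded_df.head())
print("Columns added:", encoded_df.shape[1])  # 7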
```
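
The column arithmetic can be checked on a full-size table. The sketch below builds a made-up 1000-row dataset (column names and values are hypothetical) and also shows the effect of `drop_first=True`, which drops one dummy per categorical column to avoid redundant columns.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 1000

# Hypothetical dataset: 2 categorical columns (4 and 3 unique values)
# plus 3 numerical columns
df = pd.DataFrame({
    'A': np.resize(['a1', 'a2', 'a3', 'a4'], n_rows),
    'B': np.resize(['b1', 'b2', 'b3'], n_rows),
    'num1': rng.normal(size=n_rows),
    'num2': rng.normal(size=n_rows),
    'num3': rng.normal(size=n_rows),
})

# Full one-hot: 3 numerical + 4 + 3 encoded columns
full = pd.get_dummies(df, columns=['A', 'B'])

# drop_first removes one dummy per feature: 3 numerical + 3 + 2
reduced = pd.get_dummies(df, columns=['A', 'B'], drop_first=True)

print("Full one-hot shape:", full.shape)
print("With drop_first:", reduced.shape)
```

Dropping the first level is mostly relevant for linear models, where the full set of dummies is perfectly collinear with the intercept.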

---

## Q6. Encoding for Animal Dataset (Species, Habitat, Diet)

- Species: Nominal
- Habitat: Nominal
- Diet: Possibly Ordinal if you can rank it (e.g., herbivore < omnivore < carnivore)

✅ Recommended: **One-Hot Encoding** for all three features if each has fewer than about 10 categories.

Why? It keeps the features interpretable and avoids the artificial ordering that integer (label) encoding would introduce.
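
If you do decide to treat Diet as ordinal, the two strategies can be mixed in one step. The following is a sketch only: the animal records are invented, and the ranking herbivore < omnivore < carnivore is an assumption you would need to justify for your own data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical animal records (values invented for illustration)
animals = pd.DataFrame({
    'Species': ['Lion', 'Deer', 'Bear', 'Deer'],
    'Habitat': ['Savanna', 'Forest', 'Forest', 'Grassland'],
    'Diet': ['carnivore', 'herbivore', 'omnivore', 'herbivore'],
})

# Assumed ranking for Diet; only meaningful if you accept this ordering
diet_order = [['herbivore', 'omnivore', 'carnivore']]

preprocessor = ColumnTransformer(
    [
        ('nominal', OneHotEncoder(handle_unknown='ignore'), ['Species', 'Habitat']),
        ('ordinal', OrdinalEncoder(categories=diet_order), ['Diet']),
    ],
    sparse_threshold=0,  # always return a dense array
)

encoded = preprocessor.fit_transform(animals)
print("Encoded shape:", encoded.shape)
print(encoded)
```

The first six columns are the one-hot indicators for Species and Habitat, and the last column holds the Diet rank.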

---

## Q7. Predicting Customer Churn – Encoding Strategy

📊 Features:
- Gender (categorical, 2 values) → One-Hot or Binary Encoding
- Age (numerical) → No encoding needed
- Contract Type (categorical, 3+ values) → One-Hot Encoding
- Monthly Charges (numerical) → No encoding needed
- Tenure (numerical) → No encoding needed

### 🔹 Step-by-step Encoding:
```python
import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Female'],
    'Contract': ['Month-to-month', 'One year', 'Two year'],
    'Age': [23, 45, 36],
    'MonthlyCharges': [70.3, 89.5, 65.0],
    'Tenure': [12, 24, 36]
})

# Encoding
df_encoded = pd.get_dummies(df, columns=['Gender', 'Contract'])
print(df_encoded)
```
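
For a two-valued feature such as Gender, a single 0/1 column carries the same information as two one-hot columns. A small sketch of that alternative on the same sample data; the choice of which value maps to 1 is arbitrary:

```python
import pandas as pd

df = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Female'],
    'Contract': ['Month-to-month', 'One year', 'Two year'],
    'Age': [23, 45, 36],
    'MonthlyCharges': [70.3, 89.5, 65.0],
    'Tenure': [12, 24, 36]
})

df_encoded = df.copy()

# Binary encoding for Gender (mapping chosen arbitrarily)
df_encoded['Gender'] = df_encoded['Gender'].map({'Female': 0, 'Male': 1})

# One-hot encoding for Contract, stored as integers instead of booleans
df_encoded = pd.get_dummies(df_encoded, columns=['Contract'], dtype=int)
print(df_encoded)
```

This yields 7 columns instead of the 8 produced by one-hot encoding both features.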

