# 🧩 Feature Encoding (Label vs One-Hot)

## 💡 Why Encoding?
Most machine learning models can’t work with raw text (like “Delhi” or “Mumbai”); they operate on numbers.  
So we convert text features into numerical form so that models can learn from them.

> But how we do it depends on the type of text data.

| Type of text data     | Example                | Encoding Type      |
| --------- | ---------------------- | ------------------ |
| Ordered   | Low, Medium, High      | ✅ Label Encoding   |
| Unordered | Delhi, Mumbai, Chennai | ✅ One-Hot Encoding |



---

## 🔹 1️⃣ Label Encoding
Assigns each category a unique number.

Example 1:

| City     | Price | Encoded City |
|-----------|--------|--------------|
| Delhi     | 80     | 0 |
| Mumbai    | 100    | 1 |
| Chennai   | 60     | 2 |

Example 2:

| shirt-size     | Price | Encoded size |
|-----------|--------|--------------|
| Small     | 60     | 0 |
| Large    | 100    | 2 |
| Medium   | 80     | 1 |

✅ **Best for:** When categories have some order (e.g., “Low”, “Medium”, “High”)  
❌ **Avoid for:** Unordered text (like cities, colors), since numbers may imply ranking.


- In Example 2 above, “Large” > “Medium” > “Small”: there is a clear order.
- So Label Encoding is fine here, because the model can learn that a higher number means a bigger size.

- In Example 1, however, there is no natural order. Saying “Mumbai > Delhi” makes no sense.
- If we label-encode the cities as Delhi=0, Mumbai=1, Chennai=2, the model might mistakenly treat them as “Chennai > Mumbai > Delhi”, which is meaningless. 😅
- So for such unordered categories, we use One-Hot Encoding, which treats every label equally, with no ranking implied.
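
To make the ranking problem concrete, here is a minimal sketch that compares "distances" between cities under both encodings. The helper names `label_distance` and `one_hot_distance` are made up for this illustration, and the one-hot vectors are written by hand rather than produced by a library.

```python
import numpy as np

# Label-encoded cities: the integers are arbitrary IDs,
# but arithmetic on them treats them as magnitudes.
label_codes = {"Delhi": 0, "Mumbai": 1, "Chennai": 2}

# Hand-written one-hot vectors, in the column order [Delhi, Mumbai, Chennai].
one_hot = {
    "Delhi": np.array([1, 0, 0]),
    "Mumbai": np.array([0, 1, 0]),
    "Chennai": np.array([0, 0, 1]),
}

def label_distance(a, b):
    """Absolute difference between two label codes (hypothetical helper)."""
    return abs(label_codes[a] - label_codes[b])

def one_hot_distance(a, b):
    """Euclidean distance between two one-hot vectors (hypothetical helper)."""
    return float(np.linalg.norm(one_hot[a] - one_hot[b]))

# With label codes, Chennai looks "twice as far" from Delhi as Mumbai does.
print(label_distance("Delhi", "Mumbai"))    # 1
print(label_distance("Delhi", "Chennai"))   # 2

# With one-hot vectors, every pair of cities is equally far apart.
print(round(one_hot_distance("Delhi", "Mumbai"), 3))
print(round(one_hot_distance("Delhi", "Chennai"), 3))
```

Any model that relies on numeric magnitude or distance would pick up the artificial gap in the label codes, while the one-hot version gives it nothing to rank.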

---

## 🔹 2️⃣ One-Hot Encoding
Creates a new binary (1/0) column for each category.

| City     | Price | Delhi | Mumbai | Chennai |
|-----------|--------|--------|--------|----------|
| Delhi     | 80     | 1 | 0 | 0 |
| Mumbai    | 100    | 0 | 1 | 0 |
| Chennai   | 60     | 0 | 0 | 1 |

✅ **Best for:** Unordered (nominal) data — cities, gender, color  
❌ **Downside:** Adds one column per category, so the column count grows quickly when a feature has many distinct values.
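
As a rough illustration of that downside, the sketch below one-hot encodes a column with 1,000 distinct values. The `user_id` column and its values are invented for this example; they are not part of the data used elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical data: 1,000 rows, each with a different made-up category label.
n_categories = 1000
df_wide = pd.DataFrame({"user_id": [f"user_{i}" for i in range(n_categories)]})

# One-hot encoding creates one new column per distinct value.
encoded = pd.get_dummies(df_wide, columns=["user_id"], dtype=int)

print(df_wide.shape)   # (1000, 1)
print(encoded.shape)   # (1000, 1000)
```

A single text column turns into 1,000 numeric columns, which is why high-cardinality features usually need a different treatment.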

---

In [44]:
import pandas as pd

## 1. Label Encoding (Simple Numbering)

Converts categories into **integer labels**.

In [45]:
df = pd.DataFrame({"Shirt-size": ["small", "medium", "large"]})

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df["size_encoded"] = le.fit_transform(df["Shirt-size"])

'''
    Shirt-size  size_encoded
0      small             2
1     medium             1
2      large             0
'''

# And here we got fooled.
# sklearn's LabelEncoder doesn't know that "small < medium < large".
# It assigns numbers in sorted (alphabetical) order: 👉 "large" → 0, "medium" → 1, "small" → 2
# It's purely based on string sorting, not meaning.
# The result is "small (2) > medium (1) > large (0)" 😭, the opposite of the real-world order.

# Moral of the story:
# If your categorical data has a true order, don't trust LabelEncoder blindly.
# (LabelEncoder is intended for target labels y, not for input features.)
# Assign a number to each category manually instead:
size_map = {'small': 1, 'medium': 2, 'large': 3}
df['size_encoded'] = df['Shirt-size'].map(size_map)

# Now the machine will correctly understand: small < medium < large ✅
df

  Shirt-size  size_encoded
0      small             1
1     medium             2
2      large             3
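
If you would rather stay inside scikit-learn than write a mapping dict, one option is `OrdinalEncoder` with an explicit `categories` list. This is a sketch of an alternative, not the method used above; the `df_sizes` and `ordinal` names are introduced only for this example.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df_sizes = pd.DataFrame({"Shirt-size": ["small", "medium", "large"]})

# The categories list fixes the order: each value's position becomes its code.
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]])

# fit_transform expects 2D input and returns a 2D float array,
# so flatten it and cast to int before storing it as a column.
codes = ordinal.fit_transform(df_sizes[["Shirt-size"]])
df_sizes["size_encoded"] = codes.ravel().astype(int)

print(df_sizes)
```

The codes start at 0 here (small=0, medium=1, large=2) rather than 1 as in the manual map, but the order small < medium < large is preserved, which is the part that matters.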


## 2. One-Hot Encoding

In [46]:
from sklearn.preprocessing import OneHotEncoder

# Step 1: Create a DataFrame with categorical data
df = pd.DataFrame({'City': ['Delhi', 'Bhubaneswar', 'Bangalore', 'Mumbai']})

# Step 2: Initialize the OneHotEncoder
# drop='first' drops the first category to avoid multicollinearity;
# sparse_output=False returns a dense NumPy array (requires scikit-learn >= 1.2)
encoder = OneHotEncoder(drop='first', sparse_output=False)

# Step 3: Fit and transform the data
encoded_City = encoder.fit_transform(df[['City']])  # pass the column as a DataFrame (2D)

# Step 4: Convert the result back to a DataFrame
encoded_df = pd.DataFrame(encoded_City, columns=encoder.get_feature_names_out(['City']))

print(encoded_df)

   City_Bhubaneswar  City_Delhi  City_Mumbai
0               0.0         1.0          0.0
1               1.0         0.0          0.0
2               0.0         0.0          0.0
3               0.0         0.0          1.0


## 3. Pandas One-Hot Encoding Shortcut

In [47]:
df = pd.DataFrame({"City":['Delhi','Mumbai','Bangalore','Kolkata']})

df_encoded = pd.get_dummies(df, columns=["City"], drop_first=True, dtype=int)
print(df_encoded)

''' 
        City_Delhi  City_Kolkata  City_Mumbai
0           1             0            0
1           0             0            1
2           0             0            0
3           0             1            0
'''

# Without dtype=int, i.e. pd.get_dummies(df, columns=["City"], drop_first=True),
# pandas 2.0+ returns True/False columns instead of 1/0.
# drop_first=True means one of the categories is dropped to avoid multicollinearity (redundancy in the data).

# 🤖 How drop_first=True works:
# With drop_first=True, pandas drops the first category in sorted (alphabetical) order.
# The categories in our City column are: ['Bangalore', 'Delhi', 'Kolkata', 'Mumbai']
# Alphabetically the order is: 'Bangalore' < 'Delhi' < 'Kolkata' < 'Mumbai'
# Therefore, pd.get_dummies() drops Bangalore as the first category.
# Bangalore becomes the baseline category (implicitly represented by all zeros).
# The remaining categories (Delhi, Kolkata, Mumbai) are explicitly encoded.

# Why does this happen ❓
# The drop_first=True parameter ensures that one category is dropped to avoid redundancy. For example:
# - Without drop_first=True, all four cities would be encoded, and the columns in each row would always sum to 1,
#   so any one column can be computed from the other three.
# - With drop_first=True, one category (alphabetically first) is dropped, and the model interprets the other categories relative to the dropped one.

   City_Delhi  City_Kolkata  City_Mumbai
0           1             0            0
1           0             0            1
2           0             0            0
3           0             1            0
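
The redundancy argument in the comments above can be checked directly. The following sketch encodes the same four cities with and without `drop_first=True`; the variable names `full`, `reduced`, and `rebuilt` are introduced only for this illustration.

```python
import pandas as pd

df_city = pd.DataFrame({"City": ["Delhi", "Mumbai", "Bangalore", "Kolkata"]})

# All four categories encoded versus the first (Bangalore) dropped.
full = pd.get_dummies(df_city, columns=["City"], dtype=int)
reduced = pd.get_dummies(df_city, columns=["City"], drop_first=True, dtype=int)

# Without drop_first, every row contains exactly one 1, so each row sums to 1.
print(full.sum(axis=1).tolist())   # [1, 1, 1, 1]

# The dropped column carries no extra information:
# it can be rebuilt from the three columns that remain.
rebuilt = 1 - reduced.sum(axis=1)
print((rebuilt == full["City_Bangalore"]).all())   # True
```

Because the Bangalore column is fully determined by the others, dropping it loses nothing, and it removes the exact linear dependence that causes trouble for models such as linear regression.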



> This pd.get_dummies() shortcut is 🔑 in many ML workflows: it’s fast, simple, and model-friendly.