# 📘 Encoding Categorical Variables

### 🔹 Why Encoding is Needed
- Most machine learning models require **numerical input**.
- Categorical variables (strings/labels) must be **converted to numbers**.

---

### 🔹 Common Methods

1. **Label Encoding**
   - Converts each category to a **unique integer**.
   - Example: `["Red", "Green", "Blue"] → [0, 1, 2]` (illustrative; scikit-learn's `LabelEncoder` numbers the labels in sorted order, so it gives `Blue=0, Green=1, Red=2`).
   - The integers imply an order, so on features use it only for **ordinal categories** (with order); it is mainly meant for the target column.

2. **One-Hot Encoding**
   - Creates one **binary column** per category.
   - Example: `["Red", "Green", "Blue"] → Red=[1,0,0], Green=[0,1,0], Blue=[0,0,1]`
   - Use for **nominal categories** (no order).

3. **Ordinal Encoding**
   - Similar to label encoding, but preserves a **custom order** that you specify.
   - Example: `["Low", "Medium", "High"] → [0, 1, 2]`
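The three methods above can be sketched with scikit-learn as follows. This is a minimal illustration, not the only way to do it; the arrays `colors` and `levels` are made-up sample data, and the outputs in the comments assume scikit-learn's default behaviour of sorting labels alphabetically.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder

colors = np.array(["Red", "Green", "Blue"])  # made-up nominal data

# Label encoding: codes follow the sorted label order, not the input order
label_codes = LabelEncoder().fit_transform(colors)
print(label_codes)  # [2 1 0] -> Blue=0, Green=1, Red=2

# One-hot encoding: one binary column per category (columns sorted: Blue, Green, Red)
one_hot = OneHotEncoder().fit_transform(colors.reshape(-1, 1)).toarray()
print(one_hot)

# Ordinal encoding: the caller supplies the order explicitly
levels = np.array([["Low"], ["High"], ["Medium"]])  # made-up ordinal data
ordinal = OrdinalEncoder(categories=[["Low", "Medium", "High"]]).fit_transform(levels)
print(ordinal.ravel())  # [0. 2. 1.]
```

Note that the label codes differ from the `[0, 1, 2]` in the conceptual example because the encoder sorts the labels first.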

---

### 🔹 Libraries in Python
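```python
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import pandas as pd

# Small illustrative frame so the snippet runs on its own
df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue']})

# Label Encoding: integers follow the sorted label order (Blue=0, Green=1, Red=2)
le = LabelEncoder()
df['Color_encoded'] = le.fit_transform(df['Color'])

# One-Hot Encoding: replaces 'Color' with one indicator column per category
df = pd.get_dummies(df, columns=['Color'])
```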

In [196]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


In [197]:
df = pd.read_csv(r'C:\Basic_Datascience_4ML\assets\data\customer.csv')
df.head()

   age  gender   review education purchased
0   30  Female  Average    School        No
1   68  Female     Poor        UG        No
2   70  Female     Good        PG        No
3   72  Female     Good        PG        No
4   16  Female  Average        UG        No


In [198]:
# Drop the first two columns (age, gender); keep review, education, purchased
df = df.iloc[:, 2:]

In [199]:
df.head()

    review education purchased
0  Average    School        No
1     Poor        UG        No
2     Good        PG        No
3     Good        PG        No
4  Average        UG        No


In [200]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df.iloc[:, 0:2],    # Selects all rows, columns 0 and 1 (features)
    df.iloc[:, -1],     # Selects all rows, last column (target/label)
    test_size=0.2
)
print(X_train.head())

     review education
18     Good    School
38     Good    School
10     Good        UG
24  Average        PG
0   Average    School


# 📘 OrdinalEncoder vs LabelEncoder

### 🔹 LabelEncoder
- Used for **1-D target labels (y)** in supervised learning.
- Converts categories into integers:
  `["cat","dog","mouse"] → [0,1,2]`
- Works only on a **single column**.
- Example: encoding `y_train` in classification.
- If applied to a nominal feature instead, it creates a **fake order** (e.g., Cat=0, Dog=1, Mouse=2), which can **mislead ML models**.
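As a small sketch of the target-label use case, the snippet below encodes a made-up list `y` (not the notebook's data) and inspects the learned mapping through `classes_` and `inverse_transform`.

```python
from sklearn.preprocessing import LabelEncoder

y = ["dog", "cat", "mouse", "cat"]  # hypothetical target labels

le = LabelEncoder()
y_encoded = le.fit_transform(y)

print(le.classes_)                      # learned labels, in sorted order
print(y_encoded)                        # [1 0 2 0]
print(le.inverse_transform(y_encoded))  # back to the original strings
```

The integer assigned to each label is its position in `le.classes_`, which is why `cat` becomes 0 even though `dog` appears first in the input.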

### 🔹 OrdinalEncoder
- Used for **2-D feature columns (X)**.
- Converts each categorical column into integers.
- Example:
  | Size     | → | Encoded |
  |----------|--|----------|
  | Small    | → | 0        |
  | Medium   | → | 1        |
  | Large    | → | 2        |
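The size mapping in the table can be reproduced roughly like this. The frame `X` is hypothetical sample data; passing `categories` is what fixes the order Small < Medium < Large, since the encoder would otherwise sort the values alphabetically.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

X = pd.DataFrame({"Size": ["Medium", "Small", "Large", "Small"]})  # hypothetical feature

# One list of categories per column, in ascending order
oe = OrdinalEncoder(categories=[["Small", "Medium", "Large"]])
X_encoded = oe.fit_transform(X)

print(oe.categories_)     # the order used for each column
print(X_encoded.ravel())  # [1. 0. 2. 0.]
```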

✅ **Rule of Thumb**
- Use **LabelEncoder → y (target variable)**
- Use **OrdinalEncoder → X (features with ordinal categories)**


In [201]:
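from sklearn.preprocessing import OrdinalEncoder

# One category list per column, in ascending order: review first, then education
oe = OrdinalEncoder(categories=[['Poor','Average','Good'],['School','UG','PG']])
oe.fit(X_train)                        # learn the mapping from the training split only
X_train_Encoded = oe.transform(X_train)
X_test_Encoded = oe.transform(X_test)  # reuse the same mapping on the test split
print(X_train)                         # raw (unencoded) training features, for reference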

     review education
18     Good    School
38     Good    School
10     Good        UG
24  Average        PG
0   Average    School
17     Poor        UG
15     Poor        UG
43     Poor        PG
39     Poor        PG
7      Poor    School
49     Good        UG
1      Poor        UG
11     Good        UG
35     Poor    School
30  Average        UG
22     Poor        PG
27     Poor        PG
42     Good        PG
16     Poor        UG
44  Average        UG
32  Average        UG
37  Average        PG
8   Average        UG
19     Poor        PG
20  Average    School
23     Good    School
5   Average    School
9      Good        UG
26     Poor        PG
46     Poor        PG
48     Good        UG
29  Average        UG
4   Average        UG
13  Average    School
33     Good        PG
6      Good    School
40     Good    School
21  Average        PG
28     Poor    School
3      Good        PG


In [202]:
X_train_Encoded = pd.DataFrame(X_train_Encoded,columns=['review','education'])
print(df.head())

    review education purchased
0  Average    School        No
1     Poor        UG        No
2     Good        PG        No
3     Good        PG        No
4  Average        UG        No


In [203]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(y_train)                          # learn the label mapping from the training target
y_train_Encoded = le.transform(y_train)
y_test_Encoded = le.transform(y_test)    # encode the test target with the same mapping
print(y_train_Encoded)


[0 0 0 0 1 0 0 0 1 1 0 1 0 1 1 0 0 1 1 1 1 0 0 1 0 0 1 0 0 1 0 1 1 0 1 0 0
 0 1 1]


In [204]:
y_train_Encoded=pd.Series(y_train_Encoded,name='Purchased')

In [205]:
final_df = pd.concat([X_train_Encoded, y_train_Encoded], axis=1)
print(final_df.head())

   review  education  Purchased
0     2.0        0.0          0
1     2.0        0.0          0
2     2.0        1.0          1
3     1.0        2.0          1
4     1.0        0.0          0
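
The workflow above can be condensed into one self-contained sketch. It makes several assumptions: the small `df` below is an invented stand-in for `customer.csv` with the same three columns, `random_state=42` is added only so the split is repeatable, and the original row index is kept on both encoded objects so that `pd.concat` pairs each feature row with its own label.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical stand-in for customer.csv (values invented for illustration)
df = pd.DataFrame({
    "review":    ["Average", "Poor", "Good", "Good", "Average",
                  "Poor", "Good", "Average", "Poor", "Good"],
    "education": ["School", "UG", "PG", "PG", "UG",
                  "School", "UG", "PG", "School", "UG"],
    "purchased": ["No", "No", "Yes", "No", "Yes",
                  "Yes", "No", "Yes", "No", "Yes"],
})

X_train, X_test, y_train, y_test = train_test_split(
    df[["review", "education"]], df["purchased"],
    test_size=0.2, random_state=42,  # fixed seed is an assumption, for repeatability
)

oe = OrdinalEncoder(categories=[["Poor", "Average", "Good"], ["School", "UG", "PG"]])
le = LabelEncoder()

# Keep the original index so features and target stay aligned row by row
X_train_enc = pd.DataFrame(
    oe.fit_transform(X_train), columns=X_train.columns, index=X_train.index
)
y_train_enc = pd.Series(le.fit_transform(y_train), index=y_train.index, name="purchased")

final_df = pd.concat([X_train_enc, y_train_enc], axis=1)
print(final_df.head())
```

Because `pd.concat(axis=1)` aligns on the index, giving both pieces the same index avoids silently pairing a row with the wrong label; resetting both to a fresh `0..n-1` index, as the notebook does, works as well.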
