# 📘 Encoding Categorical Variables

### 🔹 Why Encoding is Needed
- Machine learning models require **numerical input**.
- Categorical variables (strings/labels) must be **converted to numbers**.

---

### 🔹 Common Methods

1. **Label Encoding**
- Converts each category to a **unique integer**.
- Example: `["Red", "Green", "Blue"] → [0, 1, 2]`
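- Use for **target labels**. On a feature column the integers imply an order that may not exist; for ordered features, use Ordinal Encoding (method 3).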

2. **One-Hot Encoding**
- Creates **binary columns** for each category.
- Example: `["Red", "Green", "Blue"] → Red=[1,0,0], Green=[0,1,0], Blue=[0,0,1]`
- Use for **nominal categories** (no order).

3. **Ordinal Encoding**
- Similar to label encoding but preserves **custom order**.
- Example: `["Low", "Medium", "High"] → [0, 1, 2]`
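
The three methods above can be compared side by side on a tiny example. This is only a sketch: the `toy` DataFrame and its values are made up for illustration and are not part of the notebook's data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical toy data, not taken from the notebook's CSV files
toy = pd.DataFrame({
    "color": ["Red", "Green", "Blue", "Green"],
    "level": ["Low", "High", "Medium", "Low"],
})

# Label encoding: scikit-learn assigns integers in sorted (alphabetical) order
color_labels = LabelEncoder().fit_transform(toy["color"])

# One-hot encoding: one 0/1 column per category
color_onehot = pd.get_dummies(toy["color"], dtype=int)

# Ordinal encoding: the order is given explicitly, lowest first
level_encoder = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
level_codes = level_encoder.fit_transform(toy[["level"]]).ravel()

print(color_labels)
print(color_onehot)
print(level_codes)
```

Note that scikit-learn's `LabelEncoder` picks the integers by sorting the category names, so the exact numbers depend on the labels rather than on the order in which they appear.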

---

### 🔹 Libraries in Python
```python
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import pandas as pd

# `df` is assumed to be a DataFrame that has a 'Color' column

# Label Encoding
le = LabelEncoder()
df['Color_encoded'] = le.fit_transform(df['Color'])

# One-Hot Encoding
df = pd.get_dummies(df, columns=['Color'])
```

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


In [None]:
df = pd.read_csv(r'C:\Basic_Datascience_4ML\assets\data\customer.csv')
print(df.head())
# Number of distinct categories in the 'gender' column
a = df['gender'].nunique()
print(a)

In [None]:
df = df.iloc[:,2:]

In [None]:
df.head()

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df.iloc[:, 0:2],    # Selects all rows, columns 0 and 1 (features)
    df.iloc[:, -1],     # Selects all rows, last column (target/label)
    test_size=0.2
)
print(X_train.head())

# 📘 OrdinalEncoder vs LabelEncoder

### 🔹 LabelEncoder
- Used for **1-D target labels (y)** in supervised learning.
- Converts categories into integers:
  `["cat","dog","mouse"] → [0,1,2]`
- Works only on a **single column**.
- Example: encoding `y_train` in classification.
- Avoid it on nominal **features**: it creates a **fake order** (e.g., Cat=0, Dog=1, Mouse=2), which can **mislead ML models**.

### 🔹 OrdinalEncoder
- Used for **2-D feature columns (X)**.
- Converts each categorical column into integers.
- Example:

  | Size   | Encoded |
  |--------|---------|
  | Small  | 0       |
  | Medium | 1       |
  | Large  | 2       |

✅ **Rule of Thumb**
- Use **LabelEncoder → y (target variable)**
- Use **OrdinalEncoder → X (features with ordinal categories)**
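
A small sketch of why the rule of thumb matters, assuming a made-up `sizes` array: `LabelEncoder` sorts the classes alphabetically, while `OrdinalEncoder` accepts the intended order through its `categories` argument.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

sizes = np.array(["Small", "Medium", "Large", "Medium"])  # hypothetical values

# LabelEncoder sorts the classes alphabetically: Large=0, Medium=1, Small=2
le_demo = LabelEncoder()
label_codes = le_demo.fit_transform(sizes)

# OrdinalEncoder takes the intended order and expects a 2-D input
oe_demo = OrdinalEncoder(categories=[["Small", "Medium", "Large"]])
ordinal_codes = oe_demo.fit_transform(sizes.reshape(-1, 1)).ravel()

print(le_demo.classes_)   # learned classes in sorted order
print(label_codes)
print(ordinal_codes)
```

With `LabelEncoder`, "Small" ends up with the largest code, which is the opposite of the real order; that is harmless for a target variable but misleading for an ordinal feature.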


In [None]:
from sklearn.preprocessing import OrdinalEncoder
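# One list per column, each in the intended order (lowest to highest)
oe = OrdinalEncoder(categories=[['Poor','Average','Good'],['School','UG','PG']])
oe.fit(X_train)  # learn the mapping from the training set only
X_train_Encoded = oe.transform(X_train)
X_test_Encoded = oe.transform(X_test)
print(X_train_Encoded)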

In [None]:
X_train_Encoded = pd.DataFrame(X_train_Encoded,columns=['review','education'])
print(X_train_Encoded.head())

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(y_train)
y_train_Encoded = le.transform(y_train)
y_test_Encoded = le.transform(y_test)
print(y_test_Encoded)


In [None]:
y_train_Encoded=pd.Series(y_train_Encoded,name='Purchased')

In [None]:
final_df = pd.concat([X_train_Encoded, y_train_Encoded], axis=1)
print(final_df.head())

# 📘 One-Hot Encoding (OHE)

### 🔹 Theory
- One-Hot Encoding is used for **nominal categorical data** (categories without order).
- It creates a new **binary column (0/1)** for each category.
- Example: "Red", "Green", "Blue"

| Red | Green | Blue |
|-----|-------|------|
|  1  |   0   |   0  |
|  0  |   1   |   0  |
|  0  |   0   |   1  |

---

### 🔹 Why use **n-1 Columns (Dummy Variable Trap)?**
- If we keep all `n` dummy variables, they become **linearly dependent**: in every row they sum to exactly 1.
  - Example: If you know `Red=0` and `Green=0`, then `Blue` must be `1`.
  - This introduces **multicollinearity** in models like **Linear Regression**.
- To avoid this, we drop one column → use **n-1 dummies**.
- The dropped category is still represented implicitly.
  - Example: If we drop "Blue", then:
    - Red=1 → Red
    - Green=1 → Green
    - Both Red=0 and Green=0 → Blue

✅ Dropping one column in this way avoids the **dummy variable trap**.
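
The linear dependence can be checked numerically. The sketch below uses a made-up `colors` series and adds an intercept column of ones, as a linear regression would; it then compares the rank of the design matrix with its number of columns.

```python
import numpy as np
import pandas as pd

colors = pd.Series(["Red", "Green", "Blue", "Green", "Red"])  # hypothetical data

full = pd.get_dummies(colors, dtype=float)                      # n = 3 columns
reduced = pd.get_dummies(colors, drop_first=True, dtype=float)  # n - 1 = 2 columns

intercept = np.ones((len(colors), 1))
X_full = np.hstack([intercept, full.to_numpy()])        # 4 columns
X_reduced = np.hstack([intercept, reduced.to_numpy()])  # 3 columns

# The full design matrix is rank-deficient: its dummy columns sum to the intercept
rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)
print(X_full.shape[1], rank_full)
print(X_reduced.shape[1], rank_reduced)
```

When the rank is lower than the number of columns, the columns are redundant; with `drop_first=True` the rank and the column count match.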

---

### 🔹 Summary
- **OHE** → For nominal categorical features.
- **n columns** → safe for tree-based models (no collinearity issue).
- **n-1 columns** → better for linear models (avoids redundancy).
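
As a rough illustration of the two options in scikit-learn, the sketch below encodes a made-up `cars` frame once with all `n` columns and once with `drop="first"`, which keeps `n-1` columns.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

cars = pd.DataFrame({"fuel": ["Petrol", "Diesel", "CNG", "Diesel"]})  # hypothetical data

# n columns: one per category
all_cols = OneHotEncoder().fit_transform(cars).toarray()

# n - 1 columns: the first category (in sorted order) is dropped
n_minus_1 = OneHotEncoder(drop="first").fit_transform(cars).toarray()

print(all_cols.shape, n_minus_1.shape)
```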


In [None]:
df = pd.read_csv(r'C:\Basic_Datascience_4ML\assets\data\cars.csv')
# Number of distinct brands
df['brand'].nunique()

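### 🔹 OHE Using Only pandas
`pd.get_dummies` adds new columns named `<column>_<category>`. It is usually avoided in ML pipelines because it does not remember which columns it created, so categories that are added or removed in later data produce a different set of columns.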

In [None]:
pd.get_dummies(data=df,columns=['fuel','owner'],drop_first=True).head()
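
A minimal sketch of the column-mismatch problem with `pd.get_dummies`, using made-up `train` and `test` frames in which the test set contains a category that the training set never saw. A fitted `OneHotEncoder` is shown next to it for comparison.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical train/test frames; "CNG" appears only in the test set
train = pd.DataFrame({"fuel": ["Petrol", "Diesel", "Petrol"]})
test = pd.DataFrame({"fuel": ["Diesel", "CNG"]})

# get_dummies encodes each frame independently, so the columns differ
train_cols = list(pd.get_dummies(train, columns=["fuel"]).columns)
test_cols = list(pd.get_dummies(test, columns=["fuel"]).columns)
print(train_cols)
print(test_cols)

# OneHotEncoder stores the categories seen during fit and reuses them
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train[["fuel"]])
test_encoded = encoder.transform(test[["fuel"]]).toarray()
print(encoder.categories_)
print(test_encoded)  # the unseen "CNG" row is encoded as all zeros
```

Because the encoder keeps the training categories, the test output always has the same columns as the training output, which is what a downstream model expects.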


### 🔹 One-Hot Encoding Using scikit-learn

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:, 0:4], df.iloc[:, -1], test_size=0.2)
print(X_train)

In [None]:
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(handle_unknown='ignore')
# Fit on the training set only, then reuse the fitted encoder on the test set
X_train_Encoded = ohe.fit_transform(X_train[['fuel','owner']]).toarray()
X_test_Encoded = ohe.transform(X_test[['fuel','owner']]).toarray()
print(X_train_Encoded)
print(X_train)

In [None]:
Table = np.hstack((X_train[['brand','km_driven']].values, X_train_Encoded))


To combine the pieces, everything must first be a NumPy array: use `.values` on a pandas object and `.toarray()` on a sparse matrix. Once both are arrays, join them with `np.hstack((x, y))`.

---
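
The conversion steps can be sketched on a tiny example. The `frame` values are made up, and the `encoded` sparse matrix is built by hand here only to stand in for what `OneHotEncoder` returns by default.

```python
import numpy as np
import pandas as pd
from scipy import sparse

frame = pd.DataFrame({"km_driven": [1000, 2500]})      # hypothetical values
encoded = sparse.csr_matrix([[1.0, 0.0], [0.0, 1.0]])  # stand-in for encoder output

left = frame.values        # pandas -> NumPy array
right = encoded.toarray()  # sparse matrix -> dense NumPy array

# Both pieces are now plain arrays with the same number of rows
combined = np.hstack((left, right))
print(combined.shape)
print(combined)
```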

In [None]:
d = pd.DataFrame(Table)
print(d.head())
# Columns 2 onward are the one-hot columns; cast from object dtype before averaging
print(d.iloc[:,2:].astype(float).mean())

Now create a new column in which car brands that appear fewer than 100 times are grouped into one category.

In [None]:
# Count brands
count = df['brand'].value_counts()
print(count)

# Rare brands (freq < 100)
rare_brands = count[count < 100].index
print(rare_brands)

# Replace rare brands with "uncommon"
df['brand_clean'] = df['brand'].replace(rare_brands, 'uncommon')

# One-hot encoding
new = pd.get_dummies(df['brand_clean'])
print(new.head())
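
The same idea can be wrapped in a small helper so the threshold is easy to change. This is only a sketch: `group_rare`, its parameters, and the `brands` series are hypothetical names and values introduced for illustration.

```python
import pandas as pd

def group_rare(values: pd.Series, min_count: int, label: str = "uncommon") -> pd.Series:
    """Replace categories that occur fewer than `min_count` times with `label`."""
    counts = values.value_counts()
    rare = counts[counts < min_count].index
    # Keep frequent categories as they are, overwrite the rare ones
    return values.where(~values.isin(rare), label)

# Hypothetical data; a threshold of 2 plays the role of 100 in the cell above
brands = pd.Series(["Maruti", "Maruti", "Maruti", "Audi", "Jeep", "Maruti"])
cleaned = group_rare(brands, min_count=2)

print(cleaned.value_counts())
print(pd.get_dummies(cleaned, dtype=int).columns.tolist())
```

Grouping rare categories first keeps the number of one-hot columns small when a feature has many infrequent values.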
