# 📘 Encoding Categorical Variables

### 🔹 Why Encoding is Needed
- Machine learning models require **numerical input**.
- Categorical variables (strings/labels) must be **converted to numbers**.

---

### 🔹 Common Methods

1. **Label Encoding**
- Converts each category to a **unique integer**.
- Example: `["Red", "Green", "Blue"] → [0, 1, 2]`
- Use for **target labels** (`y`); the integers are arbitrary, so on nominal features they imply an order that does not exist.

2. **One-Hot Encoding**
- Creates **binary columns** for each category.
- Example: `["Red", "Green", "Blue"] → Red=[1,0,0], Green=[0,1,0], Blue=[0,0,1]`
- Use for **nominal categories** (no order).

3. **Ordinal Encoding**
- Similar to label encoding but preserves **custom order**.
- Example: `["Low", "Medium", "High"] → [0, 1, 2]`
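
The custom order used by ordinal encoding can be sketched with pandas alone. This is a minimal illustration on made-up data; the `levels` series and the `order` mapping are assumptions, not part of the notebook's dataset.

```python
import pandas as pd

# Toy data (hypothetical, for illustration only)
levels = pd.Series(["Low", "High", "Medium", "Low"])

# Explicit mapping: the order is chosen by us, not inferred from the data
order = {"Low": 0, "Medium": 1, "High": 2}
encoded_map = levels.map(order)

# Equivalent approach with an ordered Categorical
cat = pd.Categorical(levels, categories=["Low", "Medium", "High"], ordered=True)
encoded_cat = pd.Series(cat.codes)

print(encoded_map.tolist())  # [0, 2, 1, 0]
print(encoded_cat.tolist())  # [0, 2, 1, 0]
```

Both approaches give the same integers because the category order is stated explicitly rather than left to alphabetical sorting.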

---

### 🔹 Libraries in Python
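```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Illustrative frame; any DataFrame with a 'Color' column works
df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue']})

# Label Encoding: one integer per category
le = LabelEncoder()
df['Color_encoded'] = le.fit_transform(df['Color'])

# One-Hot Encoding: one binary column per category
df = pd.get_dummies(df, columns=['Color'])
```
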

In [186]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


In [187]:
df = pd.read_csv(r'C:\Basic_Datascience_4ML\assets\data\customer.csv')
df.head()
a = df['gender'].nunique()  # number of distinct categories
print(a)

2


In [188]:
df = df.iloc[:,2:]

In [189]:
df.head()

|   | review  | education | purchased |
|---|---------|-----------|-----------|
| 0 | Average | School    | No        |
| 1 | Poor    | UG        | No        |
| 2 | Good    | PG        | No        |
| 3 | Good    | PG        | No        |
| 4 | Average | UG        | No        |


In [190]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df.iloc[:, 0:2],    # Selects all rows, columns 0 and 1 (features)
    df.iloc[:, -1],     # Selects all rows, last column (target/label)
    test_size=0.2
)
print(X_train.head())

     review education
37  Average        PG
24  Average        PG
10     Good        UG
36     Good        UG
41     Good        PG


# 📘 OrdinalEncoder vs LabelEncoder

### 🔹 LabelEncoder
- Used for **1-D target labels (y)** in supervised learning.
- Converts categories into integers:
  `["cat","dog","mouse"] → [0,1,2]`
- Works only on a **single column**.
- Example: encoding `y_train` in classification.
- Not meant for nominal features: there it creates a **fake order** (e.g., Cat=0, Dog=1, Mouse=2), which can **mislead ML models**.

### 🔹 OrdinalEncoder
- Used for **2-D feature columns (X)**.
- Converts each categorical column into integers.
- Example:
  | Size     | → | Encoded |
  |----------|--|----------|
  | Small    | → | 0        |
  | Medium   | → | 1        |
  | Large    | → | 2        |

✅ **Rule of Thumb**
- Use **LabelEncoder → y (target variable)**
- Use **OrdinalEncoder → X (features with ordinal categories)**
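
A minimal sketch of the difference, assuming scikit-learn is installed. The `labels` and `sizes` data are made up for illustration. `LabelEncoder` takes a 1-D input and numbers the classes in sorted order, while `OrdinalEncoder` takes a 2-D input and accepts an explicit order through `categories`.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Toy data (hypothetical, for illustration only)
labels = ["dog", "cat", "mouse", "cat"]                # 1-D target
sizes = np.array([["Small"], ["Large"], ["Medium"]])   # 2-D feature column

# LabelEncoder: 1-D input, classes sorted alphabetically
le = LabelEncoder()
y_encoded = le.fit_transform(labels)
print(le.classes_)   # ['cat' 'dog' 'mouse']
print(y_encoded)     # [1 0 2 0]

# OrdinalEncoder: 2-D input, order supplied explicitly
oe = OrdinalEncoder(categories=[["Small", "Medium", "Large"]])
X_encoded = oe.fit_transform(sizes)
print(X_encoded.ravel())   # [0. 2. 1.]
```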


In [191]:
from sklearn.preprocessing import OrdinalEncoder
oe = OrdinalEncoder(categories=[['Poor','Average','Good'],['School','UG','PG']])
oe.fit(X_train)
X_train_Encoded = oe.transform(X_train)
X_test_Encoded = oe.transform(X_test)
print(X_train)

     review education
37  Average        PG
24  Average        PG
10     Good        UG
36     Good        UG
41     Good        PG
4   Average        UG
30  Average        UG
14     Poor        PG
42     Good        PG
7      Poor    School
22     Poor        PG
35     Poor    School
5   Average    School
15     Poor        UG
45     Poor        PG
34  Average    School
11     Good        UG
18     Good    School
17     Poor        UG
21  Average        PG
0   Average    School
48     Good        UG
12     Poor    School
2      Good        PG
46     Poor        PG
23     Good    School
8   Average        UG
29  Average        UG
19     Poor        PG
49     Good        UG
6      Good    School
9      Good        UG
3      Good        PG
47     Good        PG
27     Poor        PG
32  Average        UG
20  Average    School
39     Poor        PG
13  Average    School
43     Poor        PG


In [192]:
X_train_Encoded = pd.DataFrame(X_train_Encoded,columns=['review','education'])
print(df.head())

    review education purchased
0  Average    School        No
1     Poor        UG        No
2     Good        PG        No
3     Good        PG        No
4  Average        UG        No


In [193]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(y_train)
y_train_Encoded = le.transform(y_train)
y_test_Encoded = le.transform(y_test)
print(y_train_Encoded)


[1 1 1 1 1 0 0 1 1 1 1 1 1 0 1 0 1 0 1 0 0 1 0 0 0 0 0 1 1 0 0 1 0 1 0 1 1
 0 0 0]


In [194]:
y_train_Encoded=pd.Series(y_train_Encoded,name='Purchased')

In [195]:
final_df = pd.concat([X_train_Encoded, y_train_Encoded], axis=1)
print(final_df.head())

   review  education  Purchased
0     1.0        2.0          1
1     1.0        2.0          1
2     2.0        1.0          1
3     2.0        1.0          1
4     2.0        2.0          1


# 📘 One-Hot Encoding (OHE)

### 🔹 Theory
- One-Hot Encoding is used for **nominal categorical data** (categories without order).
- It creates a new **binary column (0/1)** for each category.
- Example: "Red", "Green", "Blue"

| Red | Green | Blue |
|-----|-------|------|
|  1  |   0   |   0  |
|  0  |   1   |   0  |
|  0  |   0   |   1  |

---

### 🔹 Why use **n-1 Columns (Dummy Variable Trap)?**
- If we keep all `n` dummy variables, they always sum to 1, so they are **linearly dependent** (together with the model's intercept).
  - Example: If you know `Red=0` and `Green=0`, then `Blue` must be `1`.
  - This introduces **multicollinearity** in models like **Linear Regression**.
- To avoid this, we drop one column → use **n-1 dummies**.
- The dropped category is still represented implicitly.
  - Example: If we drop "Blue", then:
    - Red=1 → Red
    - Green=1 → Green
    - Both Red=0 and Green=0 → Blue

✅ This is called avoiding the **dummy variable trap**.
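
The linear dependence can be checked numerically. The sketch below uses a made-up `colors` series and assumes a model with an intercept column; it compares the rank of the design matrix with all `n` dummies against the one with `n-1` dummies.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical, for illustration only)
colors = pd.Series(["Red", "Green", "Blue", "Green", "Red"], name="color")

# All n dummy columns: every row sums to 1
full = pd.get_dummies(colors, dtype=int)
row_sums = full.sum(axis=1)

# With an intercept column, the design matrix loses full column rank
intercept = np.ones((len(colors), 1))
X_full = np.hstack([intercept, full.to_numpy()])
rank_full = np.linalg.matrix_rank(X_full)

# Dropping one dummy (n-1 columns) restores full column rank
reduced = pd.get_dummies(colors, drop_first=True, dtype=int)
X_reduced = np.hstack([intercept, reduced.to_numpy()])
rank_reduced = np.linalg.matrix_rank(X_reduced)

print(X_full.shape[1], rank_full)        # 4 3
print(X_reduced.shape[1], rank_reduced)  # 3 3
```

A rank lower than the number of columns means at least one column is redundant, which is exactly what the dummy variable trap describes.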

---

### 🔹 Summary
- **OHE** → For nominal categorical features.
- **n columns** → safe for tree-based models (no collinearity issue).
- **n-1 columns** → better for linear models (avoids redundancy).


In [196]:
df = pd.read_csv(r'C:\Basic_Datascience_4ML\assets\data\cars.csv')
df['brand'].nunique()

32

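One-hot encoding using only pandas

`pd.get_dummies` creates new columns named `<column>_<category>`. We do not rely on it for modelling because it does not remember which columns it created, so categories that are added or removed in later data produce a different set of columns.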
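
A minimal sketch of why `pd.get_dummies` is awkward for train/test workflows, using made-up `train` and `test` frames. Each `get_dummies` call derives its columns from the data it receives, whereas a fitted `OneHotEncoder` reuses the categories it learned.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical, for illustration only)
train = pd.DataFrame({"fuel": ["Petrol", "Diesel", "Petrol"]})
test = pd.DataFrame({"fuel": ["Diesel", "CNG"]})   # "CNG" never seen in training

# pandas: each call derives its columns from the data it is given
train_dummies = pd.get_dummies(train, columns=["fuel"])
test_dummies = pd.get_dummies(test, columns=["fuel"])
print(list(train_dummies.columns))  # ['fuel_Diesel', 'fuel_Petrol']
print(list(test_dummies.columns))   # ['fuel_CNG', 'fuel_Diesel']

# sklearn: categories learned in fit() are reused in transform()
ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit(train[["fuel"]])
test_encoded = ohe.transform(test[["fuel"]]).toarray()
print(ohe.categories_[0])   # ['Diesel' 'Petrol']
print(test_encoded)         # the unseen "CNG" row is all zeros
```

With `handle_unknown="ignore"`, an unseen category is encoded as a row of zeros instead of raising an error, so the column layout stays the same for training and test data.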

In [197]:
pd.get_dummies(data=df,columns=['fuel','owner'],drop_first=True).head()


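|   | brand   | km_driven | selling_price | fuel_Diesel | fuel_LPG | fuel_Petrol | owner_Fourth & Above Owner | owner_Second Owner | owner_Test Drive Car | owner_Third Owner |
|---|---------|-----------|---------------|-------------|----------|-------------|----------------------------|--------------------|----------------------|-------------------|
| 0 | Maruti  | 145500    | 450000        | True        | False    | False       | False                      | False              | False                | False             |
| 1 | Skoda   | 120000    | 370000        | True        | False    | False       | False                      | True               | False                | False             |
| 2 | Honda   | 140000    | 158000        | False       | False    | True        | False                      | False              | False                | True              |
| 3 | Hyundai | 127000    | 225000        | True        | False    | False       | False                      | False              | False                | False             |
| 4 | Maruti  | 120000    | 130000        | False       | False    | True        | False                      | False              | False                | False             |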


One-hot encoding using sklearn

In [198]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:, 0:4], df.iloc[:, -1], test_size=0.2)
print(X_train)

         brand  km_driven    fuel         owner
1244    Maruti     120000  Petrol   First Owner
7319  Mahindra      30000  Diesel  Second Owner
1767      Jeep      80000  Diesel   First Owner
5606      Tata     155000  Diesel  Second Owner
5143    Maruti     110000  Diesel  Second Owner
...        ...        ...     ...           ...
2699    Maruti      35000  Petrol   First Owner
639       Tata      50000  Diesel   First Owner
6022    Maruti      55403  Petrol   First Owner
7067    Maruti      80000  Petrol   Third Owner
1357    Maruti      60000  Diesel   First Owner

[6502 rows x 4 columns]


In [199]:
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(handle_unknown='ignore')
# Fit on the training data only, then reuse the learned categories for the test data
X_train_Encoded = ohe.fit_transform(X_train[['fuel','owner']]).toarray()
X_test_Encoded = ohe.transform(X_test[['fuel','owner']]).toarray()
print(X_train_Encoded)
print(X_train)

[[0. 0. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 1. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 1.]
 [0. 1. 0. ... 0. 0. 0.]]
         brand  km_driven    fuel         owner
1244    Maruti     120000  Petrol   First Owner
7319  Mahindra      30000  Diesel  Second Owner
1767      Jeep      80000  Diesel   First Owner
5606      Tata     155000  Diesel  Second Owner
5143    Maruti     110000  Diesel  Second Owner
...        ...        ...     ...           ...
2699    Maruti      35000  Petrol   First Owner
639       Tata      50000  Diesel   First Owner
6022    Maruti      55403  Petrol   First Owner
7067    Maruti      80000  Petrol   Third Owner
1357    Maruti      60000  Diesel   First Owner

[6502 rows x 4 columns]


In [200]:
Table = np.hstack((X_train[['brand','km_driven']].values, X_train_Encoded))


If the data is a pandas object, use `.values` to get a NumPy array. If it is a sparse matrix, call `.toarray()`. Once both are arrays, join them side by side with `np.hstack((x, y))`.
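
The conversion steps above can be sketched on a tiny example. The `df_part` frame and the hand-built `encoded` sparse matrix are stand-ins (assumptions) for the notebook's `X_train` columns and the `OneHotEncoder` output.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Toy stand-ins (hypothetical) for the real columns and encoder output
df_part = pd.DataFrame({"brand": ["Maruti", "Skoda"], "km_driven": [145500, 120000]})
encoded = sparse.csr_matrix([[1.0, 0.0], [0.0, 1.0]])

left = df_part.values        # DataFrame -> NumPy array
right = encoded.toarray()    # sparse matrix -> dense NumPy array

# Both are plain arrays now, so they can be stacked column-wise
table = np.hstack((left, right))
print(table.shape)   # (2, 4)
print(table.dtype)   # object
```

Because the left block mixes strings and numbers, the stacked array has dtype `object`, which is why later numeric summaries on such a table also report `dtype: object`.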

In [201]:
d = pd.DataFrame(Table)
print(df.head())
print(d.iloc[:,2:].mean())

     brand  km_driven    fuel         owner  selling_price
0   Maruti     145500  Diesel   First Owner         450000
1    Skoda     120000  Diesel  Second Owner         370000
2    Honda     140000  Petrol   Third Owner         158000
3  Hyundai     127000  Diesel   First Owner         225000
4   Maruti     120000  Petrol   First Owner         130000
2      0.00646
3     0.542295
4      0.00446
5     0.446786
6     0.650261
7     0.021839
8     0.260535
9     0.000769
10    0.066595
dtype: object


Now create a new column in which brands that appear fewer than 100 times are grouped into a single "uncommon" category.

In [202]:
# Count brands
count = df['brand'].value_counts()
print(count)

# Rare brands (freq < 100)
rare_brands = count[count < 100].index
print(rare_brands)

# Replace rare brands with "uncommon"
df['brand_clean'] = df['brand'].replace(rare_brands, 'uncommon')

# One-hot encoding
new = pd.get_dummies(df['brand_clean'])
print(new.head())


brand
Maruti           2448
Hyundai          1415
Mahindra          772
Tata              734
Toyota            488
Honda             467
Ford              397
Chevrolet         230
Renault           228
Volkswagen        186
BMW               120
Skoda             105
Nissan             81
Jaguar             71
Volvo              67
Datsun             65
Mercedes-Benz      54
Fiat               47
Audi               40
Lexus              34
Jeep               31
Mitsubishi         14
Force               6
Land                6
Isuzu               5
Kia                 4
Ambassador          4
Daewoo              3
MG                  3
Ashok               1
Opel                1
Peugeot             1
Name: count, dtype: int64
Index(['Nissan', 'Jaguar', 'Volvo', 'Datsun', 'Mercedes-Benz', 'Fiat', 'Audi',
       'Lexus', 'Jeep', 'Mitsubishi', 'Force', 'Land', 'Isuzu', 'Kia',
       'Ambassador', 'Daewoo', 'MG', 'Ashok', 'Opel', 'Peugeot'],
      dtype='object', name='brand')
     BMW  Ch