
In [None]:
import pandas as pd
import sklearn

In [None]:
df = pd.read_csv("/content/customer.csv")

In [None]:
df.head()

Unnamed: 0,age,gender,review,education,purchased
0,30,Female,Average,School,No
1,68,Female,Poor,UG,No
2,70,Female,Good,PG,No
3,72,Female,Good,PG,No
4,16,Female,Average,UG,No




> Ordinal Encoding
---
Here, education is **categorical but ordered**: PG > UG > School. The same holds for review
(Good > Average > Poor). So, we use ordinal encoding for both columns.



In [None]:
df["review"].unique(), df["education"].unique()

(array(['Average', 'Poor', 'Good'], dtype=object),
 array(['School', 'UG', 'PG'], dtype=object))

In [None]:
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split

In [None]:
X_train = df[["review", "education"]]

In [None]:
oe = OrdinalEncoder(categories = [["Poor", "Average", "Good"], ["School", "UG", "PG"]])

In [None]:
oe.fit(X_train)

In [None]:
X_train = oe.transform(X_train)

In [None]:
pd.DataFrame(X_train).head()

Unnamed: 0,0,1
0,1.0,0.0
1,0.0,1.0
2,2.0,2.0
3,2.0,2.0
4,1.0,1.0
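
To make the mapping above easier to check, here is a minimal, self-contained sketch that rebuilds the five preview rows by hand (the `toy` frame and the `toy_oe` name are illustrative, not part of the original notebook), inspects the stored category order, and decodes the integers back to labels.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frame mirroring the first five rows of customer.csv shown earlier
toy = pd.DataFrame({
    "review": ["Average", "Poor", "Good", "Good", "Average"],
    "education": ["School", "UG", "PG", "PG", "UG"],
})

# Explicit category lists fix the order: position 0 is the lowest rank
toy_oe = OrdinalEncoder(categories=[["Poor", "Average", "Good"], ["School", "UG", "PG"]])
encoded = toy_oe.fit_transform(toy)
print(encoded)

# categories_ records the order that was used for each column
print(toy_oe.categories_)

# inverse_transform maps the integer codes back to the original labels
decoded = toy_oe.inverse_transform(encoded)
print(decoded)
```

Passing `categories` explicitly matters here: without it, the encoder would sort the labels alphabetically, which would not match the intended ranking.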


> Label Encoding
---
Label Encoding should only be used on **output (target) columns and NOT on input columns**
---


Label Encoding converts categorical values into integers, assigned in sorted order of the labels rather than by any meaningful rank. That is fine for a target column, where the integers are just class IDs. ⚠️ Be careful if you apply it to input features: in linear models or distance-based models, the numbers might imply an order/size relationship that doesn’t exist. Tree-based models (Random Forest, XGBoost, etc.) are usually less sensitive to this.

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
df = pd.read_csv("/content/customer.csv")

In [None]:
y_train = df["purchased"]

In [None]:
le = LabelEncoder()

In [None]:
le.fit(y_train)

In [None]:
y_train_encoded = le.transform(y_train)

In [None]:
y_train_encoded

array([0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0,
       1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0,
       0, 1, 0, 1, 1, 0])
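
The integer codes above are only useful if they can be mapped back to labels. A small sketch of how that might look, using a made-up target list (`toy_y` is hypothetical, and the label "Yes" is an assumption, since only "No" is visible in the preview of the purchased column):

```python
from sklearn.preprocessing import LabelEncoder

# Toy target standing in for df["purchased"]; "Yes" is an assumed label
toy_y = ["No", "Yes", "No", "Yes", "Yes"]

toy_le = LabelEncoder()
toy_codes = toy_le.fit_transform(toy_y)

# classes_ is sorted; a label's position in this array is its integer code
print(toy_le.classes_)
print(toy_codes)

# Map integer predictions back to the original labels
print(toy_le.inverse_transform([1, 0]))
```

Because the codes follow sorted label order, they carry no ranking information, which is why this encoder is best kept for targets.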

> One Hot Encoding

---
Unlike Label Encoding (which assigns integers to categories), One-Hot Encoding creates a binary column for each category → avoids implying any order between categories.




```
Original DataFrame:
   Color  Size
0    Red     S
1  Green     M
2   Blue     L
3  Green     S

After One-Hot Encoding:
   Color_Blue  Color_Green  Color_Red  Size_L  Size_M  Size_S
0           0            0          1       0       0       1
1           0            1          0       0       1       0
2           1            0          0       1       0       0
3           0            1          0       0       0       1

```



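Drop one dummy column per feature to avoid multicollinearity (the “dummy variable trap”): the dummy columns of a feature always sum to 1, so any one of them can be derived from the others. The problem arises mainly with linear models such as Linear Regression.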
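
One way to see the dummy variable trap concretely is to check the rank of the design matrix a linear regression would use. The sketch below reuses the Color column from the example above; the helper `design_matrix` is a hypothetical name introduced only for illustration.

```python
import numpy as np
import pandas as pd

colors = pd.DataFrame({"Color": ["Red", "Green", "Blue", "Green"]})

# Full encoding (K columns) versus K-1 encoding (first column dropped)
full = pd.get_dummies(colors, columns=["Color"], dtype=int)
reduced = pd.get_dummies(colors, columns=["Color"], drop_first=True, dtype=int)

# Every row of the full encoding sums to 1, duplicating an intercept column
print(full.sum(axis=1).tolist())

def design_matrix(dummies):
    """Prepend an intercept column of ones, as a linear regression would."""
    return np.column_stack([np.ones(len(dummies)), dummies.to_numpy()])

# 4 columns but rank 3: one column is redundant
rank_full = np.linalg.matrix_rank(design_matrix(full))
# 3 columns and rank 3: no redundancy left
rank_reduced = np.linalg.matrix_rank(design_matrix(reduced))
print(rank_full, rank_reduced)
```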

In [None]:
df = pd.read_csv("/content/cars.csv")

In [None]:
df.columns

Index(['brand', 'km_driven', 'fuel', 'owner', 'selling_price'], dtype='object')



> One Hot Encoding using Pandas



In [None]:
pd.get_dummies(df, columns = ["fuel", "owner"])

Unnamed: 0,brand,km_driven,selling_price,fuel_CNG,fuel_Diesel,fuel_LPG,fuel_Petrol,owner_First Owner,owner_Fourth & Above Owner,owner_Second Owner,owner_Test Drive Car,owner_Third Owner
0,Maruti,145500,450000,False,True,False,False,True,False,False,False,False
1,Skoda,120000,370000,False,True,False,False,False,False,True,False,False
2,Honda,140000,158000,False,False,False,True,False,False,False,False,True
3,Hyundai,127000,225000,False,True,False,False,True,False,False,False,False
4,Maruti,120000,130000,False,False,False,True,True,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...
8123,Hyundai,110000,320000,False,False,False,True,True,False,False,False,False
8124,Hyundai,119000,135000,False,True,False,False,False,True,False,False,False
8125,Maruti,120000,382000,False,True,False,False,True,False,False,False,False
8126,Tata,25000,290000,False,True,False,False,True,False,False,False,False




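> K-1 Hot Encoding: avoids multicollinearity by dropping the first dummy column of each feature
Pandas does not remember the column sequence of the encoded columns, so the same call on new data can produce different columns. Hence, use scikit-learn when the encoding has to be reapplied.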



In [None]:
df = pd.get_dummies(df, columns = ["fuel", "owner"], drop_first = True, dtype = "int")
df

Unnamed: 0,brand,km_driven,selling_price,fuel_Diesel,fuel_LPG,fuel_Petrol,owner_Fourth & Above Owner,owner_Second Owner,owner_Test Drive Car,owner_Third Owner
0,Maruti,145500,450000,1,0,0,0,0,0,0
1,Skoda,120000,370000,1,0,0,0,1,0,0
2,Honda,140000,158000,0,0,1,0,0,0,1
3,Hyundai,127000,225000,1,0,0,0,0,0,0
4,Maruti,120000,130000,0,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...
8123,Hyundai,110000,320000,0,0,1,0,0,0,0
8124,Hyundai,119000,135000,1,0,0,1,0,0,0
8125,Maruti,120000,382000,1,0,0,0,0,0,0
8126,Tata,25000,290000,1,0,0,0,0,0,0
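
The point about pandas not remembering the encoded columns can be illustrated with two tiny frames. This is only a sketch: `train` and `new` are made-up data, and the `reindex` step is one possible manual workaround, not something the notebook itself uses.

```python
import pandas as pd

train = pd.DataFrame({"fuel": ["Diesel", "Petrol", "CNG"]})
new = pd.DataFrame({"fuel": ["Petrol", "Petrol"]})  # hypothetical later batch

train_dummies = pd.get_dummies(train, columns=["fuel"], dtype=int)
new_dummies = pd.get_dummies(new, columns=["fuel"], dtype=int)

# The two calls are independent, so the column sets differ
print(list(train_dummies.columns))
print(list(new_dummies.columns))

# Manual workaround: force the new frame onto the training columns
aligned = new_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(aligned)
```

A fitted scikit-learn encoder avoids this bookkeeping because it stores the categories it saw during `fit`.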




> One Hot Encoding using scikit-learn


In [None]:
df = pd.read_csv("/content/cars.csv")

In [None]:
from sklearn.preprocessing import OneHotEncoder

In [None]:
X = df.drop(columns = ["selling_price"])
y = df["selling_price"]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

In [None]:
X_train.shape, X_test.shape

((6502, 4), (1626, 4))

In [None]:
X_train.head()

Unnamed: 0,brand,km_driven,fuel,owner
6518,Tata,2560,Petrol,First Owner
6144,Honda,80000,Petrol,Second Owner
6381,Hyundai,150000,Diesel,Fourth & Above Owner
438,Maruti,120000,Diesel,Second Owner
5939,Maruti,25000,Petrol,First Owner


In [None]:

# Initialize encoder
encoder = OneHotEncoder(sparse_output=False, drop=None)
# drop="first" → would drop first category to avoid dummy variable trap

# Fit + Transform
X_train_new = encoder.fit_transform(X_train[["fuel", "owner"]])
X_test_new = encoder.transform(X_test[["fuel", "owner"]])


In [None]:
# Convert the encoded array back to a DataFrame
encoded_df = pd.DataFrame(X_train_new, columns=encoder.get_feature_names_out(["fuel", "owner"]))

# X_train_new holds only the shuffled training rows, so combine it with
# X_train (not the full df); reset both indexes so rows align by position
X_train_reset = X_train.reset_index(drop=True)
encoded_df = encoded_df.reset_index(drop=True)

# Combine
df_final = pd.concat([X_train_reset, encoded_df], axis=1)

In [None]:
df_final.head()

Unnamed: 0,brand,km_driven,fuel,owner,fuel_CNG,fuel_Diesel,fuel_LPG,fuel_Petrol,owner_First Owner,owner_Fourth & Above Owner,owner_Second Owner,owner_Test Drive Car,owner_Third Owner
0,Tata,2560,Petrol,First Owner,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0
1,Honda,80000,Petrol,Second Owner,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
2,Hyundai,150000,Diesel,Fourth & Above Owner,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
3,Maruti,120000,Diesel,Second Owner,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,Maruti,25000,Petrol,First Owner,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0
