# **One Hot Encoding**
One-hot encoding is a common preprocessing technique for categorical data in machine learning. It represents a categorical variable as a set of binary columns, one per category: each row gets a 1 in the column of its own category and a 0 in all the others.

This transformation is needed because many machine learning algorithms require numerical input and cannot use categorical data in its raw form.
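
To make the mapping concrete, here is a minimal from-scratch sketch in plain Python. The helper `one_hot_encode` and the toy `fuels` list are made up for illustration; the rest of this notebook uses pandas and scikit-learn instead.

```python
def one_hot_encode(values):
    """Return (categories, rows); each row is a 0/1 list containing a single 1."""
    # Sort the distinct values so the column order is deterministic
    categories = sorted(set(values))
    position = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[position[value]] = 1  # mark the column of this value's category
        rows.append(row)
    return categories, rows


# Toy data, made up for illustration
fuels = ["Diesel", "Petrol", "CNG", "Diesel"]
categories, encoded = one_hot_encode(fuels)

print(categories)
for fuel, row in zip(fuels, encoded):
    print(fuel, row)
```

Each category becomes one column, and every encoded row contains exactly one 1.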

<img src="https://miro.medium.com/v2/resize:fit:1358/1*ggtP4a5YaRx6l09KQaYOnw.png">

## **Import Required Libraries**

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## **Read the Data**

In [2]:
# Raw string so the backslashes in the Windows path are not treated as escape sequences
df = pd.read_csv(r"D:\Coding\Datasets\cars.csv")
df.head()

Unnamed: 0,brand,km_driven,fuel,owner,selling_price
0,Maruti,145500,Diesel,First Owner,450000
1,Skoda,120000,Diesel,Second Owner,370000
2,Honda,140000,Petrol,Third Owner,158000
3,Hyundai,127000,Diesel,First Owner,225000
4,Maruti,120000,Petrol,First Owner,130000


In [3]:
df.shape

(8128, 5)

In [4]:
# Check the number of unique brand names
df["brand"].nunique()

32

In [5]:
# Count the values for each brand in 'brand' column
df["brand"].value_counts()

Maruti           2448
Hyundai          1415
Mahindra          772
Tata              734
Toyota            488
Honda             467
Ford              397
Chevrolet         230
Renault           228
Volkswagen        186
BMW               120
Skoda             105
Nissan             81
Jaguar             71
Volvo              67
Datsun             65
Mercedes-Benz      54
Fiat               47
Audi               40
Lexus              34
Jeep               31
Mitsubishi         14
Force               6
Land                6
Isuzu               5
Kia                 4
Ambassador          4
Daewoo              3
MG                  3
Ashok               1
Opel                1
Peugeot             1
Name: brand, dtype: int64

In [6]:
# Count the values for each unique name in 'fuel' column
df["fuel"].value_counts()

Diesel    4402
Petrol    3631
CNG         57
LPG         38
Name: fuel, dtype: int64

In [7]:
# Count the values for each unique name in 'owner' column
df["owner"].value_counts()

First Owner             5289
Second Owner            2105
Third Owner              555
Fourth & Above Owner     174
Test Drive Car             5
Name: owner, dtype: int64

## **One Hot Encoding with Pandas**

In [8]:
# Applying One Hot Encoding on 'fuel' and 'owner' columns
# Note: pandas >= 2.0 returns boolean dummy columns by default; pass dtype=int for 0/1 values
pd.get_dummies(data=df, columns=["fuel", "owner"])

Unnamed: 0,brand,km_driven,selling_price,fuel_CNG,fuel_Diesel,fuel_LPG,fuel_Petrol,owner_First Owner,owner_Fourth & Above Owner,owner_Second Owner,owner_Test Drive Car,owner_Third Owner
0,Maruti,145500,450000,0,1,0,0,1,0,0,0,0
1,Skoda,120000,370000,0,1,0,0,0,0,1,0,0
2,Honda,140000,158000,0,0,0,1,0,0,0,0,1
3,Hyundai,127000,225000,0,1,0,0,1,0,0,0,0
4,Maruti,120000,130000,0,0,0,1,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
8123,Hyundai,110000,320000,0,0,0,1,1,0,0,0,0
8124,Hyundai,119000,135000,0,1,0,0,0,1,0,0,0
8125,Maruti,120000,382000,0,1,0,0,1,0,0,0,0
8126,Tata,25000,290000,0,1,0,0,1,0,0,0,0


## **K-1 One Hot Encoding with Pandas**

With k categories, the k dummy columns always sum to 1, so any one of them can be derived from the others. In models with an intercept, such as linear regression, this redundancy causes perfect multicollinearity (the "dummy variable trap"). The pd.get_dummies() function avoids it through the drop_first parameter: setting drop_first=True drops the first category of each encoded variable, leaving k-1 columns. A row with 0 in all the remaining columns belongs to the dropped category.
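
The redundancy can be checked numerically. The sketch below uses a made-up toy column (not the cars dataset) and a hypothetical helper `rank_with_intercept` to compare the rank of the design matrix with and without drop_first. It passes dtype=int so the output is 0/1 on any pandas version.

```python
import numpy as np
import pandas as pd

# Toy column, made up for illustration
toy = pd.DataFrame({"fuel": ["Diesel", "Petrol", "CNG", "LPG", "Diesel", "Petrol"]})

full = pd.get_dummies(toy, columns=["fuel"], dtype=int)
reduced = pd.get_dummies(toy, columns=["fuel"], drop_first=True, dtype=int)

# Every row of the full encoding sums to 1, which duplicates an intercept column
row_sums = full.sum(axis=1)


def rank_with_intercept(frame):
    """Rank of the design matrix [column of ones | dummy columns]."""
    design = np.column_stack([np.ones(len(frame)), frame.to_numpy()])
    return np.linalg.matrix_rank(design)


print(list(full.columns))
print(list(reduced.columns))
# Full encoding: 5 design columns but rank 4, so one column is redundant
print(full.shape[1] + 1, rank_with_intercept(full))
# K-1 encoding: 4 design columns and rank 4, so nothing is redundant
print(reduced.shape[1] + 1, rank_with_intercept(reduced))
```

Tree-based models are generally unaffected by this redundancy, so whether to drop a column depends on the model being trained.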

In [9]:
# Applying One Hot Encoding on 'fuel' and 'owner' columns
# Dropping the first category of each variable to avoid multicollinearity
pd.get_dummies(data=df, columns=["fuel", "owner"], drop_first=True)

Unnamed: 0,brand,km_driven,selling_price,fuel_Diesel,fuel_LPG,fuel_Petrol,owner_Fourth & Above Owner,owner_Second Owner,owner_Test Drive Car,owner_Third Owner
0,Maruti,145500,450000,1,0,0,0,0,0,0
1,Skoda,120000,370000,1,0,0,0,1,0,0
2,Honda,140000,158000,0,0,1,0,0,0,1
3,Hyundai,127000,225000,1,0,0,0,0,0,0
4,Maruti,120000,130000,0,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...
8123,Hyundai,110000,320000,0,0,1,0,0,0,0
8124,Hyundai,119000,135000,1,0,0,1,0,0,0
8125,Maruti,120000,382000,1,0,0,0,0,0,0
8126,Tata,25000,290000,1,0,0,0,0,0,0


## **One Hot Encoding using Sklearn**

### **Train Test Split**

In [10]:
from sklearn.model_selection import train_test_split

In [11]:
# Print the dataframe
df.head()

Unnamed: 0,brand,km_driven,fuel,owner,selling_price
0,Maruti,145500,Diesel,First Owner,450000
1,Skoda,120000,Diesel,Second Owner,370000
2,Honda,140000,Petrol,Third Owner,158000
3,Hyundai,127000,Diesel,First Owner,225000
4,Maruti,120000,Petrol,First Owner,130000


In [12]:
x_train, x_test, y_train, y_test = train_test_split(df.drop("selling_price", axis=1),
                                                    df["selling_price"],
                                                    test_size=0.3,
                                                    random_state=0)
x_train.shape, y_train.shape

((5689, 4), (5689,))

In [13]:
x_train.head()

Unnamed: 0,brand,km_driven,fuel,owner
5224,Tata,20000,Petrol,First Owner
520,Maruti,30000,Petrol,First Owner
36,Maruti,15000,Petrol,First Owner
5782,Ford,53000,Diesel,First Owner
6522,Chevrolet,120000,Diesel,First Owner


In [14]:
x_train.shape

(5689, 4)

### **Apply OHE on 'fuel' and 'owner' Columns**

In [15]:
from sklearn.preprocessing import OneHotEncoder

In [16]:
# Creating an object of the OneHotEncoder class
one_hot_encoder = OneHotEncoder(drop="first", sparse_output=False, dtype=np.int8)

# Separating the 'fuel' and 'owner' columns from the x_train dataframe
# Fit the separated training data
one_hot_encoder.fit(x_train[["fuel", "owner"]])

# Transform the separated training data
x_train_encoded = one_hot_encoder.transform(x_train[["fuel", "owner"]])
x_train_encoded

array([[0, 0, 1, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0],
       ...,
       [0, 0, 1, ..., 1, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0]], dtype=int8)

In [17]:
x_train_encoded.shape

(5689, 7)

In [18]:
# Merge the x_train_encoded columns with the 'brand' and 'km_driven' columns
x_train_merged = np.hstack((x_train[["brand", "km_driven"]], x_train_encoded))
x_train_merged

array([['Tata', 20000, 0, ..., 0, 0, 0],
       ['Maruti', 30000, 0, ..., 0, 0, 0],
       ['Maruti', 15000, 0, ..., 0, 0, 0],
       ...,
       ['Hyundai', 90000, 0, ..., 1, 0, 0],
       ['Volkswagen', 90000, 1, ..., 0, 0, 0],
       ['Hyundai', 110000, 0, ..., 0, 0, 0]], dtype=object)

In [19]:
x_train_merged.shape

(5689, 9)

In [20]:
# Print the column names of the encoded x_train data
one_hot_encoder.get_feature_names_out()

array(['fuel_Diesel', 'fuel_LPG', 'fuel_Petrol',
       'owner_Fourth & Above Owner', 'owner_Second Owner',
       'owner_Test Drive Car', 'owner_Third Owner'], dtype=object)

In [21]:
# Define the column names in an array
column_names = np.concatenate((x_train.columns[0:2], one_hot_encoder.get_feature_names_out()), axis=0)
print(len(column_names))
column_names

9


array(['brand', 'km_driven', 'fuel_Diesel', 'fuel_LPG', 'fuel_Petrol',
       'owner_Fourth & Above Owner', 'owner_Second Owner',
       'owner_Test Drive Car', 'owner_Third Owner'], dtype=object)

In [22]:
# Convert the x_train_merged array into pandas dataframe
x_train_encoded = pd.DataFrame(data=x_train_merged, columns=column_names)
x_train_encoded

Unnamed: 0,brand,km_driven,fuel_Diesel,fuel_LPG,fuel_Petrol,owner_Fourth & Above Owner,owner_Second Owner,owner_Test Drive Car,owner_Third Owner
0,Tata,20000,0,0,1,0,0,0,0
1,Maruti,30000,0,0,1,0,0,0,0
2,Maruti,15000,0,0,1,0,0,0,0
3,Ford,53000,1,0,0,0,0,0,0
4,Chevrolet,120000,1,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...
5684,Tata,70000,1,0,0,0,0,0,1
5685,Ford,100000,1,0,0,0,1,0,0
5686,Hyundai,90000,0,0,1,0,1,0,0
5687,Volkswagen,90000,1,0,0,0,0,0,0


In [23]:
x_train_encoded.shape

(5689, 9)

In [24]:
# Print the x_test data
x_test.head()

Unnamed: 0,brand,km_driven,fuel,owner
3558,Hyundai,40000,Diesel,First Owner
233,Mahindra,70000,Diesel,First Owner
7952,Maruti,5000,Petrol,First Owner
572,Maruti,120000,Petrol,Third Owner
6960,Lexus,20000,Petrol,First Owner


In [25]:
# Encode x_test data
x_test_encoded = one_hot_encoder.transform(x_test[["fuel", "owner"]])
x_test_encoded

array([[1, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0],
       ...,
       [1, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 1, 0, 0],
       [1, 0, 0, ..., 0, 0, 0]], dtype=int8)

In [26]:
# Merge the x_test_encoded columns with the 'brand' and 'km_driven' columns
x_test_merged = np.hstack((x_test.iloc[:, 0:2], x_test_encoded))
x_test_merged

array([['Hyundai', 40000, 1, ..., 0, 0, 0],
       ['Mahindra', 70000, 1, ..., 0, 0, 0],
       ['Maruti', 5000, 0, ..., 0, 0, 0],
       ...,
       ['Ford', 35000, 1, ..., 0, 0, 0],
       ['Mahindra', 110000, 1, ..., 1, 0, 0],
       ['Maruti', 80000, 1, ..., 0, 0, 0]], dtype=object)

In [27]:
# Convert the x_test_merged array into pandas dataframe
x_test_encoded = pd.DataFrame(data=x_test_merged, columns=column_names)
x_test_encoded

Unnamed: 0,brand,km_driven,fuel_Diesel,fuel_LPG,fuel_Petrol,owner_Fourth & Above Owner,owner_Second Owner,owner_Test Drive Car,owner_Third Owner
0,Hyundai,40000,1,0,0,0,0,0,0
1,Mahindra,70000,1,0,0,0,0,0,0
2,Maruti,5000,0,0,1,0,0,0,0
3,Maruti,120000,0,0,1,0,0,0,1
4,Lexus,20000,0,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...
2434,Maruti,5000,0,0,1,0,0,0,0
2435,Ford,9500,0,0,1,0,0,0,0
2436,Ford,35000,1,0,0,0,0,0,0
2437,Mahindra,110000,1,0,0,0,1,0,0


## **Apply OHE on 'brand' Column using Pandas**

In [41]:
# Count the values for each brand in 'brand' column
counts = df["brand"].value_counts()
counts

Maruti           2448
Hyundai          1415
Mahindra          772
Tata              734
Toyota            488
Honda             467
Ford              397
Chevrolet         230
Renault           228
Volkswagen        186
BMW               120
Skoda             105
Nissan             81
Jaguar             71
Volvo              67
Datsun             65
Mercedes-Benz      54
Fiat               47
Audi               40
Lexus              34
Jeep               31
Mitsubishi         14
Force               6
Land                6
Isuzu               5
Kia                 4
Ambassador          4
Daewoo              3
MG                  3
Ashok               1
Opel                1
Peugeot             1
Name: brand, dtype: int64

In [31]:
# Check the total number of unique brands
df["brand"].nunique()

32

In [32]:
# Define a threshold
threshold = 100

In [46]:
# Store the names of the brands whose value count is at or below the threshold
repl = counts[counts <= threshold].index
repl

Index(['Nissan', 'Jaguar', 'Volvo', 'Datsun', 'Mercedes-Benz', 'Fiat', 'Audi',
       'Lexus', 'Jeep', 'Mitsubishi', 'Force', 'Land', 'Isuzu', 'Kia',
       'Ambassador', 'Daewoo', 'MG', 'Ashok', 'Opel', 'Peugeot'],
      dtype='object')

In [50]:
# Replace the rare brand names with 'Others'
# (df.replace scans every column; here these names only occur in 'brand')
new_df = df.replace(to_replace=repl, value="Others")
new_df

Unnamed: 0,brand,km_driven,fuel,owner,selling_price
0,Maruti,145500,Diesel,First Owner,450000
1,Skoda,120000,Diesel,Second Owner,370000
2,Honda,140000,Petrol,Third Owner,158000
3,Hyundai,127000,Diesel,First Owner,225000
4,Maruti,120000,Petrol,First Owner,130000
...,...,...,...,...,...
8123,Hyundai,110000,Petrol,First Owner,320000
8124,Hyundai,119000,Diesel,Fourth & Above Owner,135000
8125,Maruti,120000,Diesel,First Owner,382000
8126,Tata,25000,Diesel,First Owner,290000


In [51]:
new_df["brand"].value_counts()

Maruti        2448
Hyundai       1415
Mahindra       772
Tata           734
Others         538
Toyota         488
Honda          467
Ford           397
Chevrolet      230
Renault        228
Volkswagen     186
BMW            120
Skoda          105
Name: brand, dtype: int64

In [57]:
# Apply OHE on 'brand' column of the new dataframe
pd.get_dummies(data=new_df["brand"]).sample(20)

Unnamed: 0,BMW,Chevrolet,Ford,Honda,Hyundai,Mahindra,Maruti,Others,Renault,Skoda,Tata,Toyota,Volkswagen
3461,0,0,0,0,0,0,0,1,0,0,0,0,0
332,0,0,0,0,0,0,0,1,0,0,0,0,0
2543,0,1,0,0,0,0,0,0,0,0,0,0,0
1146,0,0,0,0,0,0,1,0,0,0,0,0,0
3243,0,0,0,0,0,0,0,0,0,1,0,0,0
2511,0,0,0,0,1,0,0,0,0,0,0,0,0
7321,0,0,0,0,0,1,0,0,0,0,0,0,0
6232,0,0,0,1,0,0,0,0,0,0,0,0,0
3085,0,0,0,0,0,0,1,0,0,0,0,0,0
6899,0,0,0,0,0,0,1,0,0,0,0,0,0
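
The rare-category grouping can also be restricted to the 'brand' column alone, which avoids touching other columns by accident. The sketch below shows one way to do this with Series.where on a made-up toy frame. The names `toy` and `toy_threshold` are hypothetical and chosen for illustration only.

```python
import pandas as pd

# Toy data, made up for illustration
toy = pd.DataFrame({"brand": ["Maruti"] * 5 + ["Hyundai"] * 3 + ["Opel", "Kia"]})
toy_threshold = 2  # hypothetical threshold for this toy example

counts = toy["brand"].value_counts()
rare = counts[counts <= toy_threshold].index

# Keep frequent brands as they are and replace rare ones with 'Others'
grouped = toy["brand"].where(~toy["brand"].isin(rare), "Others")

# One column per remaining brand, plus a single 'Others' column
dummies = pd.get_dummies(grouped, dtype=int)

print(grouped.value_counts())
print(dummies.columns.tolist())
```

Grouping rare brands first keeps the number of dummy columns small, which is the same reason the notebook reduces 32 brands to 13 before encoding.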
