# One-Hot Encoding

One-hot encoding converts categorical data into numerical data and is typically used when the categories have no intrinsic order (nominal data). It creates one binary column per unique category, and each column indicates whether that category is present for a given data point. This lets us use categorical data in machine learning models that require numerical input.
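As a minimal sketch of the idea, the snippet below encodes a small hypothetical `fuel` column (the values are made up for illustration and are not read from `cars.csv`). The `dtype=int` argument is used so the result shows 0/1 rather than booleans, which newer pandas versions return by default.

```python
import pandas as pd

# Hypothetical toy data, not taken from cars.csv
toy = pd.DataFrame({"fuel": ["Diesel", "Petrol", "CNG", "Petrol"]})

# One binary column per unique category; dtype=int gives 0/1 instead of booleans
encoded = pd.get_dummies(toy, columns=["fuel"], dtype=int)
print(encoded)
```

Each row has exactly one 1, placed in the column for that row's category.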

<img src="ohe.png" height="150px">

## Dummy Variable Trap

The dummy variable trap occurs when we represent a categorical variable with dummy variables and include all of them in the model. Because exactly one dummy is 1 in every row, the dummy columns always sum to 1, so any one of them can be computed exactly from the others. This is perfect multicollinearity, the extreme case of two or more predictor variables being highly correlated with each other.

To avoid this problem, we remove one of the dummy variables. For example, a categorical variable with three categories (A, B, and C) produces three dummy variables, one per category. Including all three leads to multicollinearity, because the column for A always equals 1 minus the columns for B and C. If we drop one of them (e.g., the one for category A) and keep only the remaining two (B and C), no information is lost: a row with B = 0 and C = 0 must belong to category A, which becomes the reference category.

By doing this, we ensure that no dummy variable is an exact linear combination of the others, and we avoid the problem of multicollinearity.
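The linear dependence can be checked numerically. The sketch below uses a hypothetical three-category series and assumes a model with an intercept column of ones. It compares the rank of the design matrix with all dummy columns against the rank with one column dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical data with three categories
cats = pd.Series(["A", "B", "C", "A", "B"], name="cat")

full = pd.get_dummies(cats, dtype=int)                      # columns A, B, C
reduced = pd.get_dummies(cats, drop_first=True, dtype=int)  # columns B, C

# Intercept column, as used by a typical linear model
intercept = np.ones((len(cats), 1), dtype=int)

X_full = np.hstack([intercept, full.to_numpy()])        # 4 columns
X_reduced = np.hstack([intercept, reduced.to_numpy()])  # 3 columns

rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)

print(X_full.shape[1], rank_full)        # more columns than rank: redundant column
print(X_reduced.shape[1], rank_reduced)  # columns equal rank: no redundancy
```

With all three dummies the matrix has four columns but rank three, because the intercept equals A + B + C. After dropping one dummy, the number of columns matches the rank.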

## One Hot Encoding using Most Frequent Categories

Sometimes one category of a categorical variable is significantly more frequent than the others. In this case, we can apply one-hot encoding to only that category and treat all other categories as a single category.

For example, consider a dataset of car brands where the majority of the cars are Toyota and the remaining brands are relatively uncommon. Here we can one-hot encode only the Toyota brand and group all other brands into a single category. The same idea extends to keeping every category above a frequency threshold and grouping the rest. This approach reduces the number of columns created by one-hot encoding, which may also help model performance when rare categories have too few rows to learn from.
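A minimal sketch of the Toyota example, using a hypothetical brand series rather than the real dataset. The label `"other"` for the grouped category is an arbitrary choice made for this illustration.

```python
import pandas as pd

# Hypothetical data: Toyota dominates, other brands are rare
brands = pd.Series(
    ["Toyota"] * 6 + ["Honda"] * 2 + ["Jeep", "Isuzu"], name="brand"
)

# Most frequent category
top_brand = brands.value_counts().idxmax()

# Keep the top brand, replace every other value with a single label
grouped = brands.where(brands == top_brand, "other")

# Two columns instead of four
encoded = pd.get_dummies(grouped, dtype=int)
print(encoded.sum())
```

Four distinct brands collapse into two columns, one for the dominant brand and one for everything else.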

## Example

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv('cars.csv')

In [3]:
df.head()

| | brand | km_driven | fuel | owner | selling_price |
|---|---|---|---|---|---|
| 0 | Maruti | 145500 | Diesel | First Owner | 450000 |
| 1 | Skoda | 120000 | Diesel | Second Owner | 370000 |
| 2 | Honda | 140000 | Petrol | Third Owner | 158000 |
| 3 | Hyundai | 127000 | Diesel | First Owner | 225000 |
| 4 | Maruti | 120000 | Petrol | First Owner | 130000 |


In [4]:
df['owner'].value_counts()

First Owner             5289
Second Owner            2105
Third Owner              555
Fourth & Above Owner     174
Test Drive Car             5
Name: owner, dtype: int64

### 1. OneHotEncoding using Pandas

In [6]:
pd.get_dummies(df,columns=['fuel','owner'])

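| | brand | km_driven | selling_price | fuel_CNG | fuel_Diesel | fuel_LPG | fuel_Petrol | owner_First Owner | owner_Fourth & Above Owner | owner_Second Owner | owner_Test Drive Car | owner_Third Owner |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Maruti | 145500 | 450000 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | Skoda | 120000 | 370000 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | Honda | 140000 | 158000 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 3 | Hyundai | 127000 | 225000 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | Maruti | 120000 | 130000 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8123 | Hyundai | 110000 | 320000 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| 8124 | Hyundai | 119000 | 135000 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 8125 | Maruti | 120000 | 382000 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 8126 | Tata | 25000 | 290000 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |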


### 2. K-1 OneHotEncoding

In [7]:
pd.get_dummies(df,columns=['fuel','owner'],drop_first=True)

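| | brand | km_driven | selling_price | fuel_Diesel | fuel_LPG | fuel_Petrol | owner_Fourth & Above Owner | owner_Second Owner | owner_Test Drive Car | owner_Third Owner |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Maruti | 145500 | 450000 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | Skoda | 120000 | 370000 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | Honda | 140000 | 158000 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 3 | Hyundai | 127000 | 225000 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | Maruti | 120000 | 130000 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8123 | Hyundai | 110000 | 320000 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 8124 | Hyundai | 119000 | 135000 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 8125 | Maruti | 120000 | 382000 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8126 | Tata | 25000 | 290000 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |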


### 3. OneHotEncoding using Sklearn

In [8]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(df.iloc[:,0:4],df.iloc[:,-1],test_size=0.2,random_state=2)

In [9]:
X_train.head()

| | brand | km_driven | fuel | owner |
|---|---|---|---|---|
| 5571 | Hyundai | 35000 | Diesel | First Owner |
| 2038 | Jeep | 60000 | Diesel | First Owner |
| 2957 | Hyundai | 25000 | Petrol | First Owner |
| 7618 | Mahindra | 130000 | Diesel | Second Owner |
| 6684 | Hyundai | 155000 | Diesel | First Owner |


In [10]:
from sklearn.preprocessing import OneHotEncoder

In [11]:
# scikit-learn 1.2 renamed `sparse` to `sparse_output`; on older versions use sparse=False
ohe = OneHotEncoder(drop='first',sparse_output=False,dtype=np.int32)
X_train_new = ohe.fit_transform(X_train[['fuel','owner']])
X_test_new = ohe.transform(X_test[['fuel','owner']])



In [12]:
X_train_new.shape

(6502, 7)

In [13]:
np.hstack((X_train[['brand','km_driven']].values,X_train_new))

array([['Hyundai', 35000, 1, ..., 0, 0, 0],
       ['Jeep', 60000, 1, ..., 0, 0, 0],
       ['Hyundai', 25000, 0, ..., 0, 0, 0],
       ...,
       ['Tata', 15000, 0, ..., 0, 0, 0],
       ['Maruti', 32500, 1, ..., 1, 0, 0],
       ['Isuzu', 121000, 1, ..., 0, 0, 0]], dtype=object)

### 4. OneHotEncoding with Top Categories

In [14]:
counts = df['brand'].value_counts()

In [15]:
df['brand'].nunique()
threshold = 100

In [16]:
repl = counts[counts <= threshold].index

In [17]:
pd.get_dummies(df['brand'].replace(repl, 'uncommon')).sample(5)

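| | BMW | Chevrolet | Ford | Honda | Hyundai | Mahindra | Maruti | Renault | Skoda | Tata | Toyota | Volkswagen | uncommon |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3925 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1093 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 476 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 305 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3314 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |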


## Conclusion
One-hot encoding is a widely used technique for converting categorical data into numerical data, particularly in machine learning applications. Because each category gets its own binary column, it does not impose an artificial order or magnitude on the categories, and it gives models the numerical input they require. However, care must be taken to avoid the dummy variable trap by dropping one encoded column per categorical variable so that the remaining columns are not collinear. Additionally, it may be beneficial to use the most frequent category as the reference category to improve the interpretability of the results. Overall, one-hot encoding is a valuable tool in the data scientist's arsenal for preparing data for machine learning applications.