# One Hot Encoding Explained

We will create a small dataset and one-hot encode its categorical `City` column using two methods: `pandas.get_dummies` and scikit-learn's `OneHotEncoder`.

In [1]:

import pandas as pd

# Create dataset
df = pd.DataFrame({
    'City': ['Mumbai', 'Delhi', 'Pune', 'Mumbai', 'Delhi'],
    'Price': [100, 150, 120, 130, 170]
})

df


|  | City | Price |
|---|---|---|
| 0 | Mumbai | 100 |
| 1 | Delhi | 150 |
| 2 | Pune | 120 |
| 3 | Mumbai | 130 |
| 4 | Delhi | 170 |


## One Hot Encoding using pandas.get_dummies

In [2]:

df_encoded_pandas = pd.get_dummies(df, columns=['City'])
df_encoded_pandas


|  | Price | City_Delhi | City_Mumbai | City_Pune |
|---|---|---|---|---|
| 0 | 100 | False | True | False |
| 1 | 150 | True | False | False |
| 2 | 120 | False | False | True |
| 3 | 130 | False | True | False |
| 4 | 170 | True | False | False |


In [3]:
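# drop_first=True drops the first category (Delhi); a row with all
# remaining dummy columns False then stands for the dropped category
df_encoded_drop_first = pd.get_dummies(df, columns=['City'], drop_first=True)
df_encoded_drop_first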

|  | Price | City_Mumbai | City_Pune |
|---|---|---|---|
| 0 | 100 | True | False |
| 1 | 150 | False | False |
| 2 | 120 | False | True |
| 3 | 130 | True | False |
| 4 | 170 | False | False |
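
To see why dropping a column loses no information, the check below is a minimal sketch that rebuilds the same dataset and confirms that the rows with no remaining `City_*` flag set are exactly the `Delhi` rows. The names `dummies`, `city_cols`, and `implied_delhi` are illustrative and not part of the original notebook.

```python
import pandas as pd

df = pd.DataFrame({
    'City': ['Mumbai', 'Delhi', 'Pune', 'Mumbai', 'Delhi'],
    'Price': [100, 150, 120, 130, 170]
})

# Drop the first category in sorted order ('Delhi') to remove the redundant column
dummies = pd.get_dummies(df, columns=['City'], drop_first=True)

# Collect the dummy columns that remain after the drop
city_cols = [c for c in dummies.columns if c.startswith('City_')]

# A row with no remaining City_* flag set must belong to the dropped category
implied_delhi = ~dummies[city_cols].any(axis=1)

# Compare the implied rows with the rows that were originally 'Delhi'
print((implied_delhi == (df['City'] == 'Delhi')).all())
```

This redundancy is why `drop_first=True` is often used with linear models, where a full set of dummy columns is perfectly collinear with the intercept.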


## One Hot Encoding using sklearn

In [4]:

from sklearn.preprocessing import OneHotEncoder

In [8]:
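# sparse_output=False returns a dense NumPy array instead of a sparse matrix
# (requires scikit-learn >= 1.2; older versions use sparse=False)
encoder = OneHotEncoder(sparse_output=False)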

In [9]:
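# Pass a 2D selection (df[['City']]); the encoder expects a table, not a Series
encoded = encoder.fit_transform(df[['City']])
# Reuse df's index so the later concat aligns rows correctly
encoded_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out(['City']), index=df.index)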


In [10]:
final_df = pd.concat([encoded_df, df[['Price']]], axis=1)
final_df

|  | City_Delhi | City_Mumbai | City_Pune | Price |
|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 0.0 | 100 |
| 1 | 1.0 | 0.0 | 0.0 | 150 |
| 2 | 0.0 | 0.0 | 1.0 | 120 |
| 3 | 0.0 | 1.0 | 0.0 | 130 |
| 4 | 1.0 | 0.0 | 0.0 | 170 |


## Observation
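Both methods produce the same one-hot columns for this dataset; the visible differences are the output dtype (`bool` from `get_dummies`, `float` from `OneHotEncoder`) and the column order after concatenation. The scikit-learn version is generally preferred in ML pipelines because the fitted encoder remembers the categories it saw during `fit` and produces the same columns when applied to new data.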