Data Encoding

1. Nominal/OHE Encoding

2. Label and Ordinal Encoding

3. Target Guided Ordinal Encoding

Nominal/OHE Encoding

One-hot encoding (OHE), also known as nominal encoding, represents categorical data as numerical data, the form most machine learning algorithms expect. Each category is represented as a binary vector with one position per unique category: the position for the row's category is set to 1 and every other position is 0. For example, a categorical variable "color" with three possible values (red, green, blue) can be one-hot encoded as follows:

1. Red: [1, 0, 0]

2. Green: [0, 1, 0]

3. Blue: [0, 0, 1]
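
To make the mapping above concrete, here is a minimal sketch in plain Python. The helper name `one_hot` is hypothetical and not part of any library. It sorts the categories alphabetically (blue, green, red), which is an assumption chosen to match the column order scikit-learn produces later in this notebook, so red becomes [0, 0, 1] here rather than [1, 0, 0]. The order is arbitrary as long as it stays fixed.

```python
def one_hot(values):
    """Map each value to a binary vector containing a single 1."""
    # Fix the category order once; alphabetical is an arbitrary but stable choice.
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}

    vectors = []
    for value in values:
        row = [0] * len(categories)   # start with all zeros
        row[index[value]] = 1         # switch on the position of this category
        vectors.append(row)
    return categories, vectors


categories, vectors = one_hot(['red', 'green', 'blue', 'green', 'red'])
print(categories)   # ['blue', 'green', 'red']
print(vectors[0])   # first value is 'red' -> [0, 0, 1]
```

In practice a library encoder is preferable because it remembers the fitted categories and can apply them to new data.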

In [None]:
# Categorical to numerical

In [2]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

In [3]:
# Create a DataFrame
df = pd.DataFrame({
    'color' : ['red', 'green', 'blue', 'green', 'red']
})

In [4]:
df

Unnamed: 0,color
0,red
1,green
2,blue
3,green
4,red


In [5]:
# Create an instance of OneHotEncoder
encoder = OneHotEncoder()

In [6]:
# fit and then transform
encoder.fit_transform(df[['color']])

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 5 stored elements and shape (5, 3)>
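
The output above is a SciPy sparse matrix, which stores only the non-zero entries. The sketch below inspects such a result and converts it to a dense array. The variable names `df_demo`, `sparse`, and `dense` are illustrative, and the data simply repeats the five colors used above.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Same five colors as in the example above.
df_demo = pd.DataFrame({'color': ['red', 'green', 'blue', 'green', 'red']})

# By default the encoder returns a sparse matrix.
sparse = OneHotEncoder().fit_transform(df_demo[['color']])

print(sparse.shape)  # (5, 3): five rows, three categories
print(sparse.nnz)    # 5: exactly one stored 1 per row

# Convert to a regular NumPy array when a dense view is needed.
dense = sparse.toarray()
print(dense)
```

Because each row holds a single 1, the sparse form saves a lot of memory when a column has many categories.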

In [9]:
encoded = encoder.fit_transform(df[['color']]).toarray()
# Columns follow alphabetical category order: blue, green, red
encoded

array([[0., 0., 1.],
       [0., 1., 0.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

In [12]:
encoder_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out())
encoder_df

Unnamed: 0,color_blue,color_green,color_red
0,0.0,0.0,1.0
1,0.0,1.0,0.0
2,1.0,0.0,0.0
3,0.0,1.0,0.0
4,0.0,0.0,1.0


In [13]:
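# Encode new data; a DataFrame with the same column name matches what the encoder was fitted on
encoder.transform(pd.DataFrame({'color': ['blue']})).toarray()
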
array([[1., 0., 0.]])
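
New data may contain a category the encoder never saw during fitting. The sketch below shows what happens in that case, assuming the default `handle_unknown='error'` versus `handle_unknown='ignore'`. The category 'yellow' and the names `train`, `strict`, and `lenient` are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'color': ['red', 'green', 'blue', 'green', 'red']})
new = pd.DataFrame({'color': ['yellow']})  # category not present in training data

# Default behaviour: an unseen category raises an error at transform time.
strict = OneHotEncoder().fit(train)
try:
    strict.transform(new)
    raised = False
except ValueError as err:
    raised = True
    print('strict encoder failed:', err)

# With handle_unknown='ignore', the unseen category becomes an all-zero row.
lenient = OneHotEncoder(handle_unknown='ignore').fit(train)
unknown_row = lenient.transform(new).toarray()
print(unknown_row)
```

Which behaviour is appropriate depends on the application: failing loudly surfaces data problems early, while ignoring keeps a deployed pipeline running.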

In [14]:
pd.concat([df, encoder_df], axis=1) # combine the original data with the encoded data

Unnamed: 0,color,color_blue,color_green,color_red
0,red,0.0,0.0,1.0
1,green,0.0,1.0,0.0
2,blue,1.0,0.0,0.0
3,green,0.0,1.0,0.0
4,red,0.0,0.0,1.0


In [17]:
import seaborn as sns
df = sns.load_dataset('tips')
df

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.50,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4
...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3
240,27.18,2.00,Female,Yes,Sat,Dinner,2
241,22.67,2.00,Male,Yes,Sat,Dinner,2
242,17.82,1.75,Male,No,Sat,Dinner,2


In [25]:
# Categorical: sex, smoker, day, time
df_cat = df[['sex', 'smoker', 'day', 'time']]
df_cat

Unnamed: 0,sex,smoker,day,time
0,Female,No,Sun,Dinner
1,Male,No,Sun,Dinner
2,Male,No,Sun,Dinner
3,Male,No,Sun,Dinner
4,Female,No,Sun,Dinner
...,...,...,...,...
239,Male,No,Sat,Dinner
240,Female,Yes,Sat,Dinner
241,Male,Yes,Sat,Dinner
242,Male,No,Sat,Dinner


In [26]:
encoder = OneHotEncoder()

In [27]:
encoder.fit_transform(df[['sex', 'smoker', 'day', 'time']])

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 976 stored elements and shape (244, 10)>

In [30]:
encoded = encoder.fit_transform(df[['sex', 'smoker', 'day', 'time']]).toarray()
encoded

array([[1., 0., 1., ..., 0., 1., 0.],
       [0., 1., 1., ..., 0., 1., 0.],
       [0., 1., 1., ..., 0., 1., 0.],
       ...,
       [0., 1., 0., ..., 0., 1., 0.],
       [0., 1., 1., ..., 0., 1., 0.],
       [1., 0., 1., ..., 1., 1., 0.]])

In [31]:
encoder_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out())
encoder_df

Unnamed: 0,sex_Female,sex_Male,smoker_No,smoker_Yes,day_Fri,day_Sat,day_Sun,day_Thur,time_Dinner,time_Lunch
0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
1,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
2,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
3,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
4,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...
239,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
240,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
241,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
242,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0


In [33]:
pd.concat([df_cat, encoder_df], axis=1)

Unnamed: 0,sex,smoker,day,time,sex_Female,sex_Male,smoker_No,smoker_Yes,day_Fri,day_Sat,day_Sun,day_Thur,time_Dinner,time_Lunch
0,Female,No,Sun,Dinner,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
1,Male,No,Sun,Dinner,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
2,Male,No,Sun,Dinner,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
3,Male,No,Sun,Dinner,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
4,Female,No,Sun,Dinner,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
239,Male,No,Sat,Dinner,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
240,Female,Yes,Sat,Dinner,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
241,Male,Yes,Sat,Dinner,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
242,Male,No,Sat,Dinner,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0


In [34]:
pd.concat([df, encoder_df], axis=1)

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,sex_Female,sex_Male,smoker_No,smoker_Yes,day_Fri,day_Sat,day_Sun,day_Thur,time_Dinner,time_Lunch
0,16.99,1.01,Female,No,Sun,Dinner,2,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
1,10.34,1.66,Male,No,Sun,Dinner,3,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
2,21.01,3.50,Male,No,Sun,Dinner,3,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
3,23.68,3.31,Male,No,Sun,Dinner,2,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
4,24.59,3.61,Female,No,Sun,Dinner,4,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
240,27.18,2.00,Female,Yes,Sat,Dinner,2,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
241,22.67,2.00,Male,Yes,Sat,Dinner,2,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
242,17.82,1.75,Male,No,Sat,Dinner,2,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
