Data Encoding


  1. Nominal/OHE Encoding
  2. Label and Ordinal Encoding
  3. Target Guided Ordinal Encoding

Nominal/OHE Encoding
One-hot encoding (OHE), also known as nominal encoding, represents categorical data as numerical data, the form most machine learning algorithms require. It is used for nominal variables, meaning categories with no inherent order. Each category becomes a binary vector with one position per unique category: the position for that category is 1 and every other position is 0. For example, a categorical variable "color" with three possible values, taken in the order red, green, blue, can be one-hot encoded as follows:

Red: [1, 0, 0]
Green: [0, 1, 0]
Blue: [0, 0, 1]
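
The mapping above can be sketched by hand before turning to a library. This is a minimal illustration, not the scikit-learn implementation; the helper name one_hot and the fixed category order are assumptions made for this example.

```python
import numpy as np

# Category order chosen to match the example above (an assumption;
# scikit-learn sorts categories alphabetically instead).
categories = ["red", "green", "blue"]

def one_hot(value, categories):
    """Return a binary vector with a 1 at the position of `value`."""
    vector = np.zeros(len(categories), dtype=int)
    vector[categories.index(value)] = 1
    return vector

# Print the vector for every category
for name in categories:
    print(name, one_hot(name, categories))
```

Each vector has exactly one 1, so no category is treated as larger or smaller than another.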

In [1]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

In [2]:
# create the one-hot encoder object
onehotencoder = OneHotEncoder()

In [3]:
df = pd.DataFrame({
     'color' : ['red', 'blue', 'green', 'green', 'red', 'blue']
})

In [4]:
df

   color
0    red
1   blue
2  green
3  green
4    red
5   blue


In [9]:
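# fit the encoder and transform the column in one step;
# fit_transform returns a sparse matrix, so convert it to a dense array
encoded = onehotencoder.fit_transform(df[['color']]).toarray()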

In [10]:
encoded

array([[0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.]])
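
The column order in this array follows the categories the encoder learned, which scikit-learn sorts alphabetically (blue, green, red), so "red" appears as [0, 0, 1] here rather than [1, 0, 0]. The following self-contained sketch rebuilds the same data under a separate variable name (encoder, an assumption for this example) to inspect that order and to map the rows back to labels.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"color": ["red", "blue", "green", "green", "red", "blue"]})

encoder = OneHotEncoder()
encoded = encoder.fit_transform(df[["color"]]).toarray()

# Learned categories: one array per input column, sorted alphabetically
print(encoder.categories_)

# Map each one-hot row back to its original label
decoded = encoder.inverse_transform(encoded)
print(decoded.ravel())
```

Checking categories_ is a quick way to confirm which column corresponds to which category before using the encoded array.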

In [12]:
# build a DataFrame from the encoded array, naming columns after the categories
encoder_df = pd.DataFrame(encoded, columns=onehotencoder.get_feature_names_out())

In [13]:
encoder_df

   color_blue  color_green  color_red
0         0.0          0.0        1.0
1         1.0          0.0        0.0
2         0.0          1.0        0.0
3         0.0          1.0        0.0
4         0.0          0.0        1.0
5         1.0          0.0        0.0


In [14]:
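# encode new data; a DataFrame with the same column name avoids the
# feature-name warning raised when a plain list is passed
onehotencoder.transform(pd.DataFrame({'color': ['blue']})).toarray()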

array([[1., 0., 0.]])

In [16]:
onehotencoder.transform(pd.DataFrame({'color': ['green']})).toarray()

array([[0., 1., 0.]])
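
The values tested above are categories the encoder has already seen. With default settings, OneHotEncoder raises a ValueError for a category that was not present during fitting. One option, sketched below, is handle_unknown="ignore", which encodes an unseen category as all zeros. The "yellow" value and the train/new_data names are hypothetical and used only for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"color": ["red", "blue", "green"]})
# "yellow" is a hypothetical category that never appears in the training data
new_data = pd.DataFrame({"color": ["blue", "yellow"]})

# handle_unknown="ignore" encodes unseen categories as a row of zeros
# instead of raising an error at transform time
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train[["color"]])

result = encoder.transform(new_data[["color"]]).toarray()
print(result)
```

Whether an all-zero row is acceptable depends on the downstream model, so treat this as one possible choice rather than a default recommendation.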

In [17]:
# attach the encoded columns to the original DataFrame, side by side
pd.concat([df, encoder_df], axis=1)

   color  color_blue  color_green  color_red
0    red         0.0          0.0        1.0
1   blue         1.0          0.0        0.0
2  green         0.0          1.0        0.0
3  green         0.0          1.0        0.0
4    red         0.0          0.0        1.0
5   blue         1.0          0.0        0.0
