# Exploring Different Encoding Techniques in Machine Learning

# Abstract:

In this notebook, we explore various encoding techniques commonly used in machine learning to handle categorical data. The methods covered include One-Hot Encoding, Dummy Encoding, Label Encoding, Ordinal Encoding, Binary Encoding, Count Encoding, and Target Encoding. These techniques are demonstrated with code examples and explanations for their appropriate use cases. The dataset used is a small sample with categorical features, and each encoding method is applied step-by-step. The notebook aims to help users understand when and how to use these encoding techniques in their machine learning workflows.


# Theory and Background:

Encoding categorical data is a fundamental preprocessing step in machine learning. Most models operate on numbers, so categorical features must be converted into a numerical format before training. The right encoding depends on the nature of the data (nominal or ordinal), the number of distinct categories, and the problem at hand (classification, regression, etc.).

- One-Hot Encoding creates one binary column per category.
- Dummy Encoding is one-hot encoding with one category dropped to avoid multicollinearity.
- Label Encoding assigns an arbitrary integer to each category.
- Ordinal Encoding assigns integers that preserve the order of ordinal categories.
- Binary Encoding writes each category's integer code in binary digits, one column per digit.
- Count Encoding replaces each category with its frequency of occurrence.
- Target Encoding replaces each category with a statistic of the target variable for that category, typically the mean.
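
To make the differences in output width concrete, the sketch below counts how many columns each method produces for a single feature. The function name `encoded_width` is hypothetical, and the binary figure assumes integer codes that start at 1 and are written in base 2; a library implementation may use a slightly different number of columns.

```python
def encoded_width(n_categories: int) -> dict:
    """Columns produced for one feature with n_categories distinct values."""
    return {
        'one_hot': n_categories,
        'dummy': n_categories - 1,
        'label_or_ordinal': 1,
        # Assumption: bits needed to write the codes 1..n_categories in base 2
        'binary': n_categories.bit_length(),
        'count': 1,
        'target': 1,
    }

for k in (3, 100, 10_000):
    print(k, encoded_width(k))
```

Under these assumptions, only one-hot and dummy encoding grow linearly with the number of categories; binary encoding grows logarithmically, and the remaining methods always yield one column.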


In [2]:
# Import necessary libraries
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, OrdinalEncoder
# Install category_encoders if not already installed
!pip install category_encoders
# Then import
import category_encoders as ce


# Sample dataset
data = {'Fruit': ['Apple', 'Banana', 'Orange', 'Banana', 'Apple', 'Orange'],
        'Size': ['Small', 'Medium', 'Large', 'Medium', 'Small', 'Large'],
        'Price': [100, 150, 200, 150, 100, 200],
        'Target': [1, 0, 0, 1, 1, 0]}
df = pd.DataFrame(data)
print("Original Data:")
print(df)

Original Data:
    Fruit    Size  Price  Target
0   Apple   Small    100       1
1  Banana  Medium    150       0
2  Orange   Large    200       0
3  Banana  Medium    150       1
4   Apple   Small    100       1
5  Orange   Large    200       0


# 1. One-Hot Encoding

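One-hot encoding is suitable for nominal data, where categories have no natural order. It creates one binary column per unique category and sets exactly one of them to 1 in each row. A feature with many categories therefore produces many columns, which can make the feature matrix very wide and sparse.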


In [4]:
one_hot_encoder = OneHotEncoder(sparse_output=False)
one_hot_encoded = one_hot_encoder.fit_transform(df[['Fruit']])
one_hot_encoded_df = pd.DataFrame(one_hot_encoded, columns=one_hot_encoder.get_feature_names_out())
df_one_hot = pd.concat([df, one_hot_encoded_df], axis=1)
print("Original Data:")
print(df)
print("\nOne-Hot Encoded Data:")
print(df_one_hot)

Original Data:
    Fruit    Size  Price  Target
0   Apple   Small    100       1
1  Banana  Medium    150       0
2  Orange   Large    200       0
3  Banana  Medium    150       1
4   Apple   Small    100       1
5  Orange   Large    200       0

One-Hot Encoded Data:
    Fruit    Size  Price  Target  Fruit_Apple  Fruit_Banana  Fruit_Orange
0   Apple   Small    100       1          1.0           0.0           0.0
1  Banana  Medium    150       0          0.0           1.0           0.0
2  Orange   Large    200       0          0.0           0.0           1.0
3  Banana  Medium    150       1          0.0           1.0           0.0
4   Apple   Small    100       1          1.0           0.0           0.0
5  Orange   Large    200       0          0.0           0.0           1.0


# 2. Dummy Encoding
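Dummy encoding is one-hot encoding with one category dropped. The dropped category (Apple in the output below, which `drop_first=True` removes because it comes first in sorted order) becomes the baseline and is represented by all remaining columns being False. Dropping it avoids perfect multicollinearity with the intercept, which is why the method is commonly used in linear regression models.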

In [5]:
dummy_encoded_df = pd.get_dummies(df['Fruit'], drop_first=True)
df_dummy = pd.concat([df, dummy_encoded_df], axis=1)
print("Original Data:")
print(df)
print("\nDummy Encoded Data:")
print(df_dummy)

Original Data:
    Fruit    Size  Price  Target
0   Apple   Small    100       1
1  Banana  Medium    150       0
2  Orange   Large    200       0
3  Banana  Medium    150       1
4   Apple   Small    100       1
5  Orange   Large    200       0

Dummy Encoded Data:
    Fruit    Size  Price  Target  Banana  Orange
0   Apple   Small    100       1   False   False
1  Banana  Medium    150       0    True   False
2  Orange   Large    200       0   False    True
3  Banana  Medium    150       1    True   False
4   Apple   Small    100       1   False   False
5  Orange   Large    200       0   False    True
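
The multicollinearity argument can be checked numerically. The following sketch builds a design matrix with an intercept column and compares its rank with and without the dropped category. The helper `design_matrix` is a hypothetical name introduced here, not part of pandas or NumPy.

```python
import numpy as np
import pandas as pd

fruit = pd.Series(['Apple', 'Banana', 'Orange', 'Banana', 'Apple', 'Orange'], name='Fruit')

def design_matrix(dummies):
    """Prepend an intercept column of ones to the encoded columns."""
    intercept = np.ones((len(dummies), 1))
    return np.hstack([intercept, dummies.to_numpy(dtype=float)])

# Intercept + 3 one-hot columns: the one-hot columns sum to the intercept
full = design_matrix(pd.get_dummies(fruit))
# Intercept + 2 dummy columns: no column is a combination of the others
reduced = design_matrix(pd.get_dummies(fruit, drop_first=True))

full_rank = np.linalg.matrix_rank(full)
reduced_rank = np.linalg.matrix_rank(reduced)
print("full:   ", full.shape[1], "columns, rank", full_rank)
print("reduced:", reduced.shape[1], "columns, rank", reduced_rank)
```

The full matrix has more columns than its rank, so its columns are linearly dependent; the reduced matrix has full column rank.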


# 3. Label Encoding
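Label encoding assigns an integer to each category. This is compact, but the integers imply an order and a distance between categories (for example, Orange = 2 appears "greater" than Apple = 0), which is misleading for nominal data, particularly in linear and distance-based models. Note that scikit-learn's `LabelEncoder` is intended for encoding target labels; for input features, `OrdinalEncoder` is the equivalent tool.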


In [18]:
print("\nInput for Label Encoding:")
print(df[['Fruit']])

label_encoder = LabelEncoder()
df['Fruit_Label_Encoded'] = label_encoder.fit_transform(df['Fruit'])

print("\nOutput for Label Encoding:")
print(df[['Fruit', 'Fruit_Label_Encoded']])



Input for Label Encoding:
    Fruit
0   Apple
1  Banana
2  Orange
3  Banana
4   Apple
5  Orange

Output for Label Encoding:
    Fruit  Fruit_Label_Encoded
0   Apple                    0
1  Banana                    1
2  Orange                    2
3  Banana                    1
4   Apple                    0
5  Orange                    2
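
It can help to inspect which integer was assigned to which category and to map codes back to labels. The sketch below uses the `classes_` attribute and `inverse_transform` of `LabelEncoder`; the variable names are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

fruits = ['Apple', 'Banana', 'Orange', 'Banana', 'Apple', 'Orange']

encoder = LabelEncoder()
codes = encoder.fit_transform(fruits)

# classes_ holds the categories in sorted order; index i is the code for classes_[i]
mapping = {category: int(code) for code, category in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes
restored = encoder.inverse_transform(codes)
print(list(restored))
```

Because the codes follow sorted order, they reflect alphabetical position only and carry no meaning about the categories themselves.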


# 4. Ordinal Encoding
Ordinal encoding is used when the data has an inherent order, such as 'Small', 'Medium', 'Large'. This technique preserves the order of the categories.


In [19]:
print("\nInput for Ordinal Encoding (Size column):")
print(df[['Size']])

ordinal_encoder = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
df['Size_Ordinal_Encoded'] = ordinal_encoder.fit_transform(df[['Size']])

print("\nOutput for Ordinal Encoding:")
print(df[['Size', 'Size_Ordinal_Encoded']])



Input for Ordinal Encoding (Size column):
     Size
0   Small
1  Medium
2   Large
3  Medium
4   Small
5   Large

Output for Ordinal Encoding:
     Size  Size_Ordinal_Encoded
0   Small                   0.0
1  Medium                   1.0
2   Large                   2.0
3  Medium                   1.0
4   Small                   0.0
5   Large                   2.0
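
Passing the category order explicitly matters. As a minimal sketch, the code below compares the default behaviour of `OrdinalEncoder`, which sorts the categories, with the explicit order used above. The column names `Default` and `Explicit` are chosen here for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

sizes = pd.DataFrame({'Size': ['Small', 'Medium', 'Large', 'Medium', 'Small', 'Large']})

# Default behaviour: categories are sorted, so Large=0, Medium=1, Small=2
default_codes = OrdinalEncoder().fit_transform(sizes).ravel()

# Explicit order: Small=0, Medium=1, Large=2
ordered_codes = OrdinalEncoder(
    categories=[['Small', 'Medium', 'Large']]
).fit_transform(sizes).ravel()

comparison = sizes.assign(Default=default_codes, Explicit=ordered_codes)
print(comparison)
```

Without the explicit list, the codes follow alphabetical order and the intended Small < Medium < Large relationship is lost.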


# 5. Binary Encoding
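Binary encoding first maps each category to an integer and then writes that integer in binary, with one column per binary digit. A feature with n categories therefore needs roughly log2(n) columns instead of n, which makes the method useful for high-cardinality data. In the output below, Apple, Banana, and Orange are encoded as 01, 10, and 11.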

In [20]:
print("\nInput for Binary Encoding:")
print(df[['Fruit']])

binary_encoder = ce.BinaryEncoder(cols=['Fruit'])
df_binary_encoded = binary_encoder.fit_transform(df)

print("\nOutput for Binary Encoding:")
print(df_binary_encoded)


Input for Binary Encoding:
    Fruit
0   Apple
1  Banana
2  Orange
3  Banana
4   Apple
5  Orange

Output for Binary Encoding:
   Fruit_0  Fruit_1    Size  Price  Target  Fruit_Label_Encoded  \
0        0        1   Small    100       1                    0   
1        1        0  Medium    150       0                    1   
2        1        1   Large    200       0                    2   
3        1        0  Medium    150       1                    1   
4        0        1   Small    100       1                    0   
5        1        1   Large    200       0                    2   

   Size_Ordinal_Encoded  
0                   0.0  
1                   1.0  
2                   2.0  
3                   1.0  
4                   0.0  
5                   2.0  
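
To show the two steps behind binary encoding, here is a minimal sketch in plain pandas. It reproduces the `Fruit_0` and `Fruit_1` columns shown above for this sample, but it is an illustration under the assumption that codes start at 1 in order of first appearance, not the `category_encoders` implementation.

```python
import pandas as pd

fruit = pd.Series(['Apple', 'Banana', 'Orange', 'Banana', 'Apple', 'Orange'], name='Fruit')

# Step 1: integer codes in order of first appearance, starting at 1 (assumption)
codes, uniques = pd.factorize(fruit)
codes = codes + 1

# Step 2: number of bits needed to write the largest code
n_bits = int(codes.max()).bit_length()

# Step 3: one column per bit, most significant bit first
binary_df = pd.DataFrame(
    {f'Fruit_{i}': (codes >> (n_bits - 1 - i)) & 1 for i in range(n_bits)},
    index=fruit.index,
)
print(binary_df)
```

Three categories fit into two bit columns; a feature with 1,000 categories would need only ten under the same scheme.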


# 6. Count Encoding
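Count encoding replaces each category with the number of times it occurs in the dataset. It produces a single column regardless of cardinality, which makes it efficient for high-cardinality features, and it carries information about how common each category is. Its weakness is that categories with the same count become indistinguishable: in this balanced sample every fruit appears twice, so the encoded column is constant.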


In [22]:
print("\nInput for Count Encoding:")
print(df[['Fruit']])

count_encoding = df['Fruit'].value_counts()
df['Fruit_Count_Encoded'] = df['Fruit'].map(count_encoding)

print("\nOutput for Count Encoding:")
print(df[['Fruit', 'Fruit_Count_Encoded']])


Input for Count Encoding:
    Fruit
0   Apple
1  Banana
2  Orange
3  Banana
4   Apple
5  Orange

Output for Count Encoding:
    Fruit  Fruit_Count_Encoded
0   Apple                    2
1  Banana                    2
2  Orange                    2
3  Banana                    2
4   Apple                    2
5  Orange                    2
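
Because the balanced sample yields a constant column, the sketch below uses a hypothetical, deliberately unbalanced sample instead. It also shows relative frequencies via `value_counts(normalize=True)` and one possible way to handle a category ('Mango', also hypothetical) that was absent when the counts were computed.

```python
import pandas as pd

# Hypothetical, deliberately unbalanced sample so the counts differ
train = pd.Series(['Apple', 'Apple', 'Apple', 'Banana', 'Banana', 'Orange'], name='Fruit')
new_data = pd.Series(['Apple', 'Orange', 'Mango'], name='Fruit')  # 'Mango' is unseen

counts = train.value_counts()                     # absolute counts
frequencies = train.value_counts(normalize=True)  # relative frequencies

train_counts = train.map(counts)

# Categories absent from the training data map to NaN; fill them with 0
new_counts = new_data.map(counts).fillna(0).astype(int)
new_freqs = new_data.map(frequencies).fillna(0.0)

print(train_counts.tolist())
print(new_counts.tolist())
print(new_freqs.round(3).tolist())
```

Filling unseen categories with 0 is a design choice, not the only option; treating them as missing values is equally defensible.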


# 7. Target Encoding
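Target encoding replaces each category with a statistic of the target variable for that category, typically the mean. The `TargetEncoder` from `category_encoders` additionally smooths each category mean toward the overall target mean, which is why Apple (raw mean 1.0) is encoded as about 0.57 and Orange (raw mean 0.0) as about 0.43 in the output below. Because the encoding is computed from the target, fitting it on the same rows that are later used for training or evaluation leaks target information; it should be fitted on training data only, ideally out of fold.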


In [23]:
print("\nInput for Target Encoding:")
print(df[['Fruit', 'Target']])

target_encoder = ce.TargetEncoder(cols=['Fruit'])
df['Fruit_Target_Encoded'] = target_encoder.fit_transform(df['Fruit'], df['Target'])

print("\nOutput for Target Encoding:")
print(df[['Fruit', 'Fruit_Target_Encoded']])


Input for Target Encoding:
    Fruit  Target
0   Apple       1
1  Banana       0
2  Orange       0
3  Banana       1
4   Apple       1
5  Orange       0

Output for Target Encoding:
    Fruit  Fruit_Target_Encoded
0   Apple              0.570926
1  Banana              0.500000
2  Orange              0.429074
3  Banana              0.500000
4   Apple              0.570926
5  Orange              0.429074
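
One common way to limit leakage is out-of-fold encoding, where each row is encoded with category means computed from the other folds. The sketch below illustrates the mechanics with plain pandas and scikit-learn's `KFold`. It uses unsmoothed means, so it is not equivalent to `ce.TargetEncoder`, and the column name `Fruit_Target_OOF` is hypothetical. With only six rows the resulting values are noisy and serve purely as an illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame({
    'Fruit': ['Apple', 'Banana', 'Orange', 'Banana', 'Apple', 'Orange'],
    'Target': [1, 0, 0, 1, 1, 0],
})

global_mean = df['Target'].mean()
encoded = pd.Series(np.nan, index=df.index, name='Fruit_Target_OOF')

# Each row is encoded with category means computed on the other folds only
kfold = KFold(n_splits=3, shuffle=False)
for fit_idx, enc_idx in kfold.split(df):
    fold_means = df.iloc[fit_idx].groupby('Fruit')['Target'].mean()
    encoded.iloc[enc_idx] = df['Fruit'].iloc[enc_idx].map(fold_means).to_numpy()

# Categories missing from the fitting folds fall back to the global mean
encoded = encoded.fillna(global_mean)
print(pd.concat([df, encoded], axis=1))
```

No row's own target value contributes to its encoding, which is the property that reduces leakage; smoothing can be layered on top when categories are rare.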


# Conclusion:

In this notebook, we demonstrated several encoding techniques that are essential for handling categorical data in machine learning.
Each method has its strengths and appropriate use cases depending on the type of data (nominal vs ordinal) and the nature of the problem
(supervised vs unsupervised learning; Target Encoding, for example, requires a target variable). We also highlighted how One-Hot Encoding
adds one column per category and can greatly increase dimensionality, while Count and Target Encoding keep each feature in a single column
and Binary Encoding needs only a logarithmic number of columns, which makes them better suited to high-cardinality data. Future work may
explore using these techniques on larger datasets and in more complex machine learning models.


# References:

- Scikit-learn documentation: https://scikit-learn.org/stable/
- Category Encoders documentation: https://contrib.scikit-learn.org/category_encoders/
- Machine Learning Mastery: How to Handle Categorical Data for Machine Learning


# License:

MIT License

Copyright (c) 2024 Ansh Vaghela

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

MIT License

Copyright (c) 2024 dhirthacker7neu

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
