# 1. Encoding Categorical Features (Label Encoding)

Label Encoding is a straightforward technique used to convert categorical features into numerical form. In this method, each unique category in a categorical column is assigned a unique integer label. These labels are typically assigned in alphabetical order or based on the order of appearance of the categories in the column.

How Label Encoding Works:

1. Identify Unique Categories: The algorithm first identifies all the distinct categories present in the categorical column.

2. Assign Numerical Labels: Each unique category is then mapped to an integer. For example, if a column 'Color' has categories 'Red', 'Green', and 'Blue', they might be encoded as:

'Blue': 0
'Green': 1
'Red': 2

The assignment is usually based on sorted (alphabetical) order, as in scikit-learn's LabelEncoder, but some implementations, such as pandas' factorize, use the order in which the categories first appear in the data.

3. Replace Categorical Values: The original categorical values in the column are then replaced with their corresponding numerical labels.
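The three steps above can be sketched by hand with pandas. This is a minimal illustration, not how any particular library implements it; the `colors` series is a hypothetical example and is not part of the dataset used later in this notebook.

```python
import pandas as pd

# Hypothetical example column (assumption: not part of the notebook's dataset)
colors = pd.Series(['Red', 'Green', 'Blue', 'Green', 'Red'], name='Color')

# Step 1: identify the unique categories (sorted, so alphabetical here)
categories = sorted(colors.unique())

# Step 2: assign an integer label to each category
mapping = {category: label for label, category in enumerate(categories)}

# Step 3: replace the original values with their labels
encoded = colors.map(mapping)

print(mapping)           # {'Blue': 0, 'Green': 1, 'Red': 2}
print(encoded.tolist())  # [2, 1, 0, 1, 2]

# Order-of-appearance variant: pandas.factorize labels categories as they first occur
codes, uniques = pd.factorize(colors)
print(codes.tolist())    # [0, 1, 2, 1, 0]
print(list(uniques))     # ['Red', 'Green', 'Blue']
```

The same column gets different integers under the two conventions, so it is worth checking which one a given tool uses before comparing encoded values across pipelines.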

Why Use Label Encoding?

1. Simplicity: It's a very easy and quick method to implement.

2. Suitable for Ordinal Data: Label encoding can be appropriate for ordinal categorical features, where the categories have a meaningful order (e.g., 'Small' < 'Medium' < 'Large'), provided the integer labels are assigned in that same order. The numerical labels then let the model use the ordering.

3. Preserves Dimensionality: It replaces a categorical column with a single numerical column, so the number of features in the dataset does not grow.
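One caveat for ordinal data is that sorted labels rarely match the meaningful order. The sketch below uses a hypothetical `sizes` frame (an assumption, not part of this notebook's data). It compares `LabelEncoder`, which scikit-learn documents as intended for target labels, with `OrdinalEncoder`, its feature-oriented counterpart that accepts an explicit category order.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical ordinal feature (assumption: illustrative only)
sizes = pd.DataFrame({'Size': ['Medium', 'Small', 'Large', 'Small']})

# LabelEncoder sorts the categories: Large=0, Medium=1, Small=2
alphabetical = LabelEncoder().fit_transform(sizes['Size'])
print(alphabetical.tolist())  # [1, 2, 0, 2]

# OrdinalEncoder with an explicit order: Small=0, Medium=1, Large=2
ordinal_encoder = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
ordered = ordinal_encoder.fit_transform(sizes[['Size']]).ravel().astype(int)
print(ordered.tolist())       # [1, 0, 2, 0]
```

Only the second encoding reflects 'Small' < 'Medium' < 'Large'; the alphabetical one would rank 'Large' below 'Small'.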

Limitations of Label Encoding:

1. Introduces Ordinality for Nominal Data: A major drawback of label encoding is that it can introduce an artificial ordinal relationship between categories that are inherently nominal (i.e., have no specific order). For example, encoding 'Red', 'Green', and 'Blue' as 2, 1, and 0 might lead a model to incorrectly assume that 'Green' is somehow "less than" 'Red' or "greater than" 'Blue'.

2. Can Confuse Models: Machine learning models might interpret the numerical labels as having a continuous or ordered relationship, which can negatively impact their performance, especially for nominal data.

When to Consider Label Encoding:

1. Ordinal Categorical Features: When the categories have a clear and meaningful order.

2. Binary Categorical Features: For features with only two categories, label encoding to 0 and 1 is often acceptable and doesn't introduce a misleading ordinal relationship.

3. As a Preprocessing Step for Certain Algorithms: Some tree-based algorithms (like Decision Trees and Random Forests) can work directly with label-encoded features, and in some cases, it might be sufficient. However, be mindful of the potential for the model to exploit the artificial order.

In summary, label encoding is a simple way to convert categorical data to numerical form, but it should be used judiciously, especially for nominal data, due to the risk of introducing artificial ordinality.

# Import necessary dependencies

In [63]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Create sample dataset

In [67]:
# Sample DataFrame with a 'Brand' categorical feature (many categories)
data = pd.DataFrame({
    'Brand': ['Puma', 'Adidas', 'Nike', 'Reebok', 'Under Armour', 'Fila', 'Asics', 'New Balance',
              'Skechers', 'Columbia', 'The North Face', 'Lotto', 'Kappa', 'Yonex', 'Li-Ning',
              'HRX by Hrithik Roshan', 'Decathlon', 'Wildcraft', 'Nivia', 'Vector X',
              'US Polo Assn', 'Monte Carlo', 'Duke', 'Shiv Naresh', 'Cosco']
})

print("Original Data:")
data

Original Data:


Unnamed: 0,Brand
0,Puma
1,Adidas
2,Nike
3,Reebok
4,Under Armour
5,Fila
6,Asics
7,New Balance
8,Skechers
9,Columbia


In [69]:
data.Brand.nunique()

25

# Categorical Features (Label Encoding) Implementation

In [70]:
# Initialize the LabelEncoder
label_encoder = LabelEncoder()

In [71]:
# Fit and transform the 'Brand' column
data['Brand_Encoded'] = label_encoder.fit_transform(data['Brand'])

print("\nData with Label Encoding:")
data


Data with Label Encoding:


Unnamed: 0,Brand,Brand_Encoded
0,Puma,15
1,Adidas,0
2,Nike,13
3,Reebok,16
4,Under Armour,21
5,Fila,6
6,Asics,1
7,New Balance,12
8,Skechers,18
9,Columbia,2


In [72]:
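# See the mapping of categories to labels.
# classes_ holds the sorted categories; the position of each one is its integer label.
print("\nCategory to Label Mapping:")
for label, category in enumerate(label_encoder.classes_):
    print(f"{category}: {label}")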


Category to Label Mapping:
Adidas: 0
Asics: 1
Columbia: 2
Cosco: 3
Decathlon: 4
Duke: 5
Fila: 6
HRX by Hrithik Roshan: 7
Kappa: 8
Li-Ning: 9
Lotto: 10
Monte Carlo: 11
New Balance: 12
Nike: 13
Nivia: 14
Puma: 15
Reebok: 16
Shiv Naresh: 17
Skechers: 18
The North Face: 19
US Polo Assn: 20
Under Armour: 21
Vector X: 22
Wildcraft: 23
Yonex: 24


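Each of the 25 unique brand names is assigned an integer from 0 to 24. LabelEncoder sorts the categories before numbering them, and the sort is case-sensitive (uppercase letters come before lowercase ones), which is why 'US Polo Assn' (20) is placed before 'Under Armour' (21).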
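Two practical consequences of a fitted mapping are worth seeing on a small scale: labels can be decoded back to category names, and a category that was not present at fit time cannot be encoded. The sketch below uses a five-brand subset of the list above, plus 'Crocs' as a hypothetical unseen brand (an assumption for illustration).

```python
from sklearn.preprocessing import LabelEncoder

# Subset of the brands used in this notebook
brands = ['Puma', 'Adidas', 'Nike', 'US Polo Assn', 'Under Armour']
encoder = LabelEncoder().fit(brands)

# Sorted, case-sensitive order: 'US Polo Assn' precedes 'Under Armour'
print(list(encoder.classes_))
# ['Adidas', 'Nike', 'Puma', 'US Polo Assn', 'Under Armour']

# Decode integer labels back to the original categories
decoded = encoder.inverse_transform([0, 2])
print(list(decoded))  # ['Adidas', 'Puma']

# A category that was not seen during fit raises a ValueError
unseen_rejected = False
try:
    encoder.transform(['Crocs'])  # hypothetical brand, not in the fitted data
except ValueError as error:
    unseen_rejected = True
    print(f"Cannot encode unseen category: {error}")
```

In a train/test workflow this means the encoder should be fitted on the training data and the handling of new categories decided explicitly.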

Challenges and Implications of Label Encoding with Many Categories:

1. Artificial Ordinality: As emphasized before, label encoding implies an order ('Adidas' < 'Asics' < 'Columbia', etc.) even though there's no inherent ranking or relationship between these brands in terms of preference or any other meaningful scale for a machine learning model. The model might mistakenly learn patterns based on this arbitrary numerical order.

2. Increased Risk of Misinterpretation: With a larger number of categories, the range of assigned numerical labels also increases. A model might treat 'Puma' (encoded as 15) as being "closer" to 'Nike' (encoded as 13) than to 'Adidas' (encoded as 0) simply based on the numerical values, even though the brands have no such relationship in reality.

3. Impact on Linear Models and Distance-Based Algorithms: Linear models will try to find a linear relationship based on these numerical labels, which is likely meaningless for nominal categories like brands. Distance-based algorithms (like KNN) will calculate distances based on these arbitrary numerical values, leading to potentially incorrect neighbor relationships.

4. Tree-Based Models: While tree-based models can handle label-encoded features, they split on thresholds over the numerical order, so isolating a single category from the middle of an arbitrary order can take more than one split. One-hot encoding lets tree-based models split directly on the presence or absence of a specific category, at the cost of adding one column per category.
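The distance problem described in points 2 and 3 can be made concrete with a few lines of NumPy. This is a simplified sketch: the labels are taken from the mapping printed earlier, while the one-hot vectors are built over just these three brands (an assumption that keeps the example short).

```python
import numpy as np

# Labels from the category-to-label mapping shown earlier
labels = {'Adidas': 0, 'Nike': 13, 'Puma': 15}

def label_distance(a, b):
    """Absolute difference between two label-encoded brands."""
    return abs(labels[a] - labels[b])

# One-hot vectors over these three brands only (simplifying assumption)
names = list(labels)
one_hot = np.eye(len(names))

def one_hot_distance(a, b):
    """Euclidean distance between two one-hot encoded brands."""
    return float(np.linalg.norm(one_hot[names.index(a)] - one_hot[names.index(b)]))

print(label_distance('Puma', 'Nike'))    # 2
print(label_distance('Puma', 'Adidas'))  # 15

print(f"{one_hot_distance('Puma', 'Nike'):.3f}")    # 1.414
print(f"{one_hot_distance('Puma', 'Adidas'):.3f}")  # 1.414
```

Under label encoding, a distance-based model such as KNN would see 'Puma' as much nearer to 'Nike' than to 'Adidas'; under one-hot encoding every pair of distinct brands is equally far apart.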

Conclusion:

While label encoding can handle categorical features with many categories by assigning a unique number to each, it's crucial to be aware of the significant risk of introducing artificial ordinality, which can mislead machine learning models, especially those sensitive to numerical relationships or distances. For nominal data, one-hot encoding is generally preferred; when the cardinality is high enough that one column per category becomes impractical, other encoding techniques designed for high-cardinality features are worth considering. Always evaluate the impact of your chosen encoding method on your model's performance.
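For comparison, here is a minimal sketch of one-hot encoding with `pandas.get_dummies`, using a small hypothetical frame (an assumption; the notebook's full `data` frame would work the same way and produce 25 columns).

```python
import pandas as pd

# Small hypothetical sample with three distinct brands
brands = pd.DataFrame({'Brand': ['Puma', 'Adidas', 'Nike', 'Puma']})

# One indicator column per category; dtype=int gives 0/1 instead of booleans
one_hot = pd.get_dummies(brands['Brand'], prefix='Brand', dtype=int)

print(list(one_hot.columns))  # ['Brand_Adidas', 'Brand_Nike', 'Brand_Puma']
print(one_hot.shape)          # (4, 3)
```

Each row has exactly one 1, so no order between brands is implied, but the column count grows with the number of categories. That is the trade-off against the single column produced by label encoding.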