# 1. Encoding Categorical Features (Count Encoding)

Count Encoding, also known as Frequency Encoding, converts a categorical feature into a numerical one by replacing each category with the number of times it occurs in the dataset. For each unique value in a categorical column, we count how many rows contain it and use that count as the numerical representation of the category. (The name Frequency Encoding is also used for the closely related variant that stores the proportion of rows rather than the raw count.)

How Count Encoding Works:

1. Calculate Category Counts: For each categorical column you want to encode, iterate through the column and count the occurrences of each unique category.
2. Create a Mapping: Create a dictionary or a similar mapping where the keys are the unique categories and the values are their corresponding counts.
3. Replace Categories with Counts: Iterate through the original categorical column again and replace each category with its count obtained from the mapping.

Example (Men's Sports Apparel Colors):

Consider a 'Color' column in a dataset of men's sports apparel reviews:

| Color |
|-------|
| Red   |
| Blue  |
| Red   |
| Green |
| Blue  |
| Blue  |
| Red   |

Applying count encoding would result in:

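Category Counts:

| Category | Count |
|----------|-------|
| Red      | 3     |
| Blue     | 3     |
| Green    | 1     |

Mapping:

{'Red': 3, 'Blue': 3, 'Green': 1}

Encoded Column:

| Color_Encoded |
|---------------|
| 3             |
| 3             |
| 3             |
| 1             |
| 3             |
| 3             |
| 3             |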
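
The three steps can be sketched in plain Python for the 'Color' example above. This is a minimal illustration that assumes the column is held in a list; the variable names are chosen here for readability and are not part of any library.

```python
from collections import Counter

# The 'Color' column from the example above.
colors = ["Red", "Blue", "Red", "Green", "Blue", "Blue", "Red"]

# Steps 1 and 2: count each category and keep the result as a mapping.
color_counts = dict(Counter(colors))

# Step 3: replace every category with its count from the mapping.
color_encoded = [color_counts[c] for c in colors]

print(color_counts)
print(color_encoded)
```

The printed mapping and encoded list should match the counts and the 'Color_Encoded' column shown in the example.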

Why Use Count Encoding?

1. Simplicity: It's a very straightforward and easy technique to implement.
2. Captures Category Popularity: The encoded values directly reflect how common each category is in the dataset. This can help a model when how often a category occurs is itself related to the target variable.
3. Handles Nominal and Ordinal Data: Count encoding can be applied to both nominal and ordinal categorical features, although it discards any ordering an ordinal feature carries.
4. Keeps Dimensionality Fixed: It replaces a categorical column with a single numerical column, so, unlike one-hot encoding, it does not add a column per category.

Limitations of Count Encoding:

1. Loss of Information: Count encoding loses the distinction between different categories that have the same frequency. In the example above, 'Red' and 'Blue' both appear 3 times, so both are encoded as 3 even though they are distinct categories. This can be problematic if the identity of the category itself is important.
2. Can Be Problematic for Rare Categories: Rare categories will have very low counts. If there are many such rare categories, they all get similar low encoded values and provide little discriminatory power.
3. Potential for Data Leakage (if not careful): If you have a train-test split, or are working with time-series data, calculate the counts from the training data only and reuse that mapping on the test data. Counting over the full dataset leaks information from the test set into the features.
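
The leakage point can be sketched as follows. This is a minimal illustration, assuming a hypothetical train/test split (the `train` and `test` frames are invented for this example) and assuming that a category never seen in training should be encoded as 0; other fallbacks, such as a dedicated "unknown" value, are equally possible.

```python
import pandas as pd

# Hypothetical split for illustration; not part of the original dataset.
train = pd.DataFrame({"Color": ["Grey", "Grey", "Black", "Navy Blue", "Grey"]})
test = pd.DataFrame({"Color": ["Grey", "Black", "Red"]})

# Learn the mapping from the training rows only.
train_counts = train["Color"].value_counts()

# Apply the same mapping to both sets.
train["Color_Encoded"] = train["Color"].map(train_counts)

# 'Red' never appears in train, so map() yields NaN for it.
# Assumption: treat unseen categories as having a count of 0.
test["Color_Encoded"] = test["Color"].map(train_counts).fillna(0).astype(int)

print(test)
```

Because the mapping is built once from `train`, the test rows never influence the encoded values.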

When to Consider Count Encoding:

1. As a Simple Baseline: It can be a good starting point to see if the frequency of categories alone provides any predictive power.
2. When Category Frequency is Expected to Be Informative: In some domains, the popularity or prevalence of a category might be a strong predictor.
3. As a Complementary Feature: Count encoding can sometimes be used in combination with other encoding techniques like one-hot encoding to provide additional information to the model.
4. For Tree-Based Models (Potentially): Tree-based models split on thresholds, so they can use a count-encoded feature without assuming that the target changes linearly with the count.
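
As a rough sketch of the complementary-feature idea, the snippet below places a count-encoded column next to one-hot columns for the same feature. The small `df` frame and the `Size_Count` column name are assumptions made for this illustration.

```python
import pandas as pd

# Small illustrative frame; the values are made up for this sketch.
df = pd.DataFrame({"Size": ["M", "S", "L", "M", "S", "M"]})

# Count-encoded column: how common each size is.
df["Size_Count"] = df["Size"].map(df["Size"].value_counts())

# One-hot columns: which size each row has.
one_hot = pd.get_dummies(df["Size"], prefix="Size")

# Combine both views of the same feature into one feature table.
features = pd.concat([df[["Size_Count"]], one_hot], axis=1)

print(features)
```

The one-hot columns preserve category identity, while the count column adds the popularity signal that one-hot encoding alone does not carry.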

Conclusion:

Count encoding is a simple technique that captures the frequency of categories without adding columns to the dataset. While it can be useful where category popularity is informative, it loses information because it cannot distinguish between categories with the same frequency. Weigh its limitations against alternative encodings based on the nature of your categorical data and the requirements of your machine learning task.

# Import necessary dependencies

In [83]:
import pandas as pd

# Create sample dataset

In [84]:
# Sample DataFrame representing customer interactions for Men's Sports Apparel
data = pd.DataFrame({
    'Product_Type': ['T-shirt', 'Shorts', 'T-shirt', 'Track Pants', 'Shorts',
                     'T-shirt', 'Joggers', 'Shorts', 'T-shirt', 'Cap'],
    'Color': ['Navy Blue', 'Grey', 'Navy Blue', 'Black', 'Grey',
              'White', 'Black', 'Grey', 'Navy Blue', 'Red'],
    'Size': ['M', 'S', 'L', 'M', 'S', 'M', 'L', 'M', 'S', 'Free']
})

print("Original Data:")
data

Original Data:


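|   | Product_Type | Color | Size |
|---|--------------|-------|------|
| 0 | T-shirt | Navy Blue | M |
| 1 | Shorts | Grey | S |
| 2 | T-shirt | Navy Blue | L |
| 3 | Track Pants | Black | M |
| 4 | Shorts | Grey | S |
| 5 | T-shirt | White | M |
| 6 | Joggers | Black | L |
| 7 | Shorts | Grey | M |
| 8 | T-shirt | Navy Blue | S |
| 9 | Cap | Red | Free |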


# Categorical Features (Count Encoding) Implementation

In [85]:
# 1. Count Encoding for 'Product_Type'

product_type_counts = data['Product_Type'].value_counts().to_dict()
data['Product_Type_Encoded'] = data['Product_Type'].map(product_type_counts)
print("\nData with Count Encoding for Product_Type:")
data


Data with Count Encoding for Product_Type:


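|   | Product_Type | Color | Size | Product_Type_Encoded |
|---|--------------|-------|------|----------------------|
| 0 | T-shirt | Navy Blue | M | 4 |
| 1 | Shorts | Grey | S | 3 |
| 2 | T-shirt | Navy Blue | L | 4 |
| 3 | Track Pants | Black | M | 1 |
| 4 | Shorts | Grey | S | 3 |
| 5 | T-shirt | White | M | 4 |
| 6 | Joggers | Black | L | 1 |
| 7 | Shorts | Grey | M | 3 |
| 8 | T-shirt | Navy Blue | S | 4 |
| 9 | Cap | Red | Free | 1 |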


In [86]:
# 2. Count Encoding for 'Color'

color_counts = data['Color'].value_counts().to_dict()
data['Color_Encoded'] = data['Color'].map(color_counts)
print("\nData with Count Encoding for Color:")
data


Data with Count Encoding for Color:


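|   | Product_Type | Color | Size | Product_Type_Encoded | Color_Encoded |
|---|--------------|-------|------|----------------------|---------------|
| 0 | T-shirt | Navy Blue | M | 4 | 3 |
| 1 | Shorts | Grey | S | 3 | 3 |
| 2 | T-shirt | Navy Blue | L | 4 | 3 |
| 3 | Track Pants | Black | M | 1 | 2 |
| 4 | Shorts | Grey | S | 3 | 3 |
| 5 | T-shirt | White | M | 4 | 1 |
| 6 | Joggers | Black | L | 1 | 2 |
| 7 | Shorts | Grey | M | 3 | 3 |
| 8 | T-shirt | Navy Blue | S | 4 | 3 |
| 9 | Cap | Red | Free | 1 | 1 |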


In [87]:
# 3. Count Encoding for 'Size'

size_counts = data['Size'].value_counts().to_dict()
data['Size_Encoded'] = data['Size'].map(size_counts)
print("\nData with Count Encoding for Size:")
data


Data with Count Encoding for Size:


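|   | Product_Type | Color | Size | Product_Type_Encoded | Color_Encoded | Size_Encoded |
|---|--------------|-------|------|----------------------|---------------|--------------|
| 0 | T-shirt | Navy Blue | M | 4 | 3 | 4 |
| 1 | Shorts | Grey | S | 3 | 3 | 3 |
| 2 | T-shirt | Navy Blue | L | 4 | 3 | 2 |
| 3 | Track Pants | Black | M | 1 | 2 | 4 |
| 4 | Shorts | Grey | S | 3 | 3 | 3 |
| 5 | T-shirt | White | M | 4 | 1 | 4 |
| 6 | Joggers | Black | L | 1 | 2 | 2 |
| 7 | Shorts | Grey | M | 3 | 3 | 4 |
| 8 | T-shirt | Navy Blue | S | 4 | 3 | 3 |
| 9 | Cap | Red | Free | 1 | 1 | 1 |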


The outputs above show the original DataFrame and then the DataFrame after count encoding is applied to each categorical column in turn. For example, in the 'Product_Type_Encoded' column:

1. 'T-shirt' is replaced by 4 (as it appears 4 times).
2. 'Shorts' is replaced by 3 (as it appears 3 times).
3. 'Track Pants', 'Joggers', and 'Cap' are each replaced by 1 (as each appears once).

The same logic applies to the 'Color_Encoded' and 'Size_Encoded' columns. Each category in those columns will be replaced by its respective count in the DataFrame.

This implementation demonstrates how to perform count encoding for multiple categorical features in a Pandas DataFrame using Python. Remember that the encoded values represent the frequency of each category within the entire dataset.
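
The three cells above repeat the same two lines per column. One possible way to fold them into a loop is sketched below; `count_encode` is a hypothetical helper written for this illustration, not a pandas function, and it returns a copy instead of modifying the input.

```python
import pandas as pd

def count_encode(df, columns):
    """Return a copy of df with a '<col>_Encoded' count column per listed column."""
    encoded = df.copy()
    for col in columns:
        # Count occurrences of each category, then map them back onto the rows.
        counts = encoded[col].value_counts()
        encoded[f"{col}_Encoded"] = encoded[col].map(counts)
    return encoded

# Same sample data as in the cells above.
data = pd.DataFrame({
    'Product_Type': ['T-shirt', 'Shorts', 'T-shirt', 'Track Pants', 'Shorts',
                     'T-shirt', 'Joggers', 'Shorts', 'T-shirt', 'Cap'],
    'Color': ['Navy Blue', 'Grey', 'Navy Blue', 'Black', 'Grey',
              'White', 'Black', 'Grey', 'Navy Blue', 'Red'],
    'Size': ['M', 'S', 'L', 'M', 'S', 'M', 'L', 'M', 'S', 'Free']
})

encoded = count_encode(data, ['Product_Type', 'Color', 'Size'])
print(encoded)
```

The resulting columns should match the 'Product_Type_Encoded', 'Color_Encoded', and 'Size_Encoded' columns produced step by step above.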