# 1. Grouping Categories

Grouping categories in categorical feature engineering involves combining multiple less frequent or semantically similar categories into a single, broader category. This is done to simplify the feature, reduce dimensionality, and potentially improve the performance and interpretability of machine learning models.

Why Group Categories?

1. Reduce Dimensionality: Categorical features with a large number of unique categories (high cardinality) can lead to a high-dimensional feature space after one-hot encoding. This can cause issues like the curse of dimensionality, making models more complex, prone to overfitting, and slower to train. Grouping reduces the number of distinct categories.
2. Handle Infrequent Categories: Categories with very few occurrences might not provide enough information for the model to learn reliable patterns. They can also be more susceptible to noise in the data. Grouping these infrequent categories into a common "Other" or "Rare" category can make the model more robust.
3. Capture Semantic Similarity: Sometimes, different categories might represent conceptually similar things. Combining them can create a more meaningful and general feature. For example, different shades of blue in clothing color might be grouped into a single "Blue" category if the distinction between shades is not crucial for the prediction task.
4. Improve Model Generalization: With fewer sparsely populated categories, the model has fewer parameters to estimate from only a handful of examples, so it may generalize better to unseen data.
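
To make the dimensionality point concrete, here is a minimal sketch that counts one-hot columns before and after grouping. The `colors` Series and the cut-off of 2 occurrences are hypothetical choices for illustration and are not part of the dataset used later.

```python
import pandas as pd

# Hypothetical toy column: two frequent categories and four that occur once.
colors = pd.Series(
    ["Blue"] * 6 + ["Black"] * 4 + ["Teal", "Mauve", "Olive", "Coral"],
    name="Color",
)

# One-hot encode the raw column: one indicator column per distinct category.
raw_encoded = pd.get_dummies(colors)

# Group every category seen fewer than 2 times into "Other", then encode again.
counts = colors.value_counts()
rare = counts[counts < 2].index
grouped = colors.where(~colors.isin(rare), "Other")
grouped_encoded = pd.get_dummies(grouped)

print(raw_encoded.shape[1])      # indicator columns before grouping
print(grouped_encoded.shape[1])  # indicator columns after grouping
```

In this toy case the encoded width drops from six columns to three, because the four single-occurrence colors share one "Other" column.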

How to Group Categories:

Identify Categories to Group:

1. Frequency-Based: Analyze the frequency distribution of the categories. Categories whose counts (or share of rows) fall below a chosen threshold can be grouped.
2. Domain Knowledge: Use your understanding of the domain to identify categories that are similar in meaning or impact.
3. Impact on Target Variable: Analyze how different categories relate to the target variable. Categories with similar patterns or average target values might be candidates for grouping.
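
The frequency and target-based checks above can be run side by side with a per-category summary. This is only a sketch: the `city` and `price` columns and the 15% share cut-off are made up for illustration.

```python
import pandas as pd

# Hypothetical data: 'city' is the categorical feature, 'price' the target.
df = pd.DataFrame({
    "city": ["A", "A", "A", "B", "B", "B", "C", "D", "E", "E"],
    "price": [10, 12, 11, 30, 28, 32, 11, 29, 10, 12],
})

# Per-category summary: row count and mean target value.
summary = df.groupby("city")["price"].agg(count="count", mean_target="mean")
# Share of all rows covered by each category.
summary["share"] = summary["count"] / len(df)

# Frequency-based candidates: categories covering less than 15% of rows
# (the 15% cut-off is an arbitrary choice for this sketch).
rare_candidates = summary.index[summary["share"] < 0.15].tolist()

# Sorting by mean target puts categories with similar target behavior
# next to each other, which helps spot candidates for merging.
print(summary.sort_values("mean_target"))
print(rare_candidates)
```

Here "C" and "D" are rare by frequency, while "A", "C", and "E" have nearly identical mean prices, so domain knowledge would decide whether merging them makes sense.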

Determine the New Group(s):

1. "Other" or "Rare" Category: A common approach is to group all infrequent categories into a single "Other" or "Rare" category.
2. Semantic Grouping: Combine categories based on their meaning. For example, "Track Pants," "Lounge Shorts," and "Men Sports Joggers" might be grouped into a broader "Pants & Shorts" category if the distinction between them is not important for the prediction task.
3. Data-Driven Grouping: Use techniques like clustering based on the relationship of categories with the target variable.

Implement the Grouping:

Use programming techniques (like conditional statements or dictionary mapping in Python) to replace the original categories with the new, grouped categories.
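
As a rough sketch of the data-driven option combined with dictionary mapping, the example below bins categories by their mean target value and then applies the resulting mapping. The `city` and `price` columns, the two equal-width bins, and the group labels are all assumptions made for illustration. Clustering on richer statistics would follow the same pattern.

```python
import pandas as pd

# Hypothetical data: 'city' is the categorical feature, 'price' the target.
df = pd.DataFrame({
    "city": ["A", "A", "B", "B", "C", "C", "D", "D"],
    "price": [10, 12, 30, 28, 11, 13, 29, 31],
})

# Mean target per category.
mean_price = df.groupby("city")["price"].mean()

# Split the categories into two equal-width bins of mean price.
bins = pd.cut(mean_price, bins=2, labels=["Low price", "High price"])

# Turn the bin assignment into a plain dictionary and apply it with map().
mapping = bins.astype(str).to_dict()
df["city_grouped"] = df["city"].map(mapping)

print(mapping)
print(df)
```

When the statistic comes from the target, it is safer to derive the mapping from training data only and then reuse it on validation and test data.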



# Import necessary dependencies

In [164]:
import pandas as pd

# Create sample dataset

In [165]:
# Sample DataFrame representing Men's Sports Apparel data
data = pd.DataFrame({
    'Category': ['Shorts', 'Track Pants', 'Shorts', 'Lounge Shorts', 'Track Pants',
                 'Freck Panty', 'Shorts', 'This', 'Lounge Shorts', 'Shorts',
                 'Men Sports Track Reits', 'Mumming Sptete Track Pants',
                 'Men Hamind Shorts', 'Men Sports Joggers', 'Men Sports Shorts',
                 'Men Sports Joggers', 'Men Sports Shorts', 'Men Sports Track Reits',
                 'Mumming Sptete Track Pants', 'This'],
    'Color': ['Navy Blue', 'Grey', 'White', 'Navy Blue', 'Black',
              'Red', 'Grey', 'White', 'Black', 'Navy Blue',
              'Navy Blue', 'Black', 'Grey', 'Black', 'White',
              'Red', 'Navy Blue', 'Grey', 'White', 'Grim'],
    'Brand': ['Puma', 'Alcis', 'ADIDAS', 'HRX by Hrithik Roshan', 'Sports52 wear',
              'Bushirt', 'Anow Sport', 'ADIDAS', 'Slazenger', 'FUAARK',
              'Puma', 'VIGOSTING', 'HRX by Hrithik Roshan', 'Sports52 wear',
              'Bushirt', 'Anow Sport', 'ADIDAS', 'Slazenger', 'FUAARK', 'Puma'],
    'Sales': [100, 150, 120, 80, 180, 30, 90, 15, 70, 110,
              200, 160, 95, 170, 130, 115, 140, 190, 155, 25]
})

print("Original Data:")
data.head(10)  # display the first 10 of the 20 rows

Original Data:


,Category,Color,Brand,Sales
0,Shorts,Navy Blue,Puma,100
1,Track Pants,Grey,Alcis,150
2,Shorts,White,ADIDAS,120
3,Lounge Shorts,Navy Blue,HRX by Hrithik Roshan,80
4,Track Pants,Black,Sports52 wear,180
5,Freck Panty,Red,Bushirt,30
6,Shorts,Grey,Anow Sport,90
7,This,White,ADIDAS,15
8,Lounge Shorts,Black,Slazenger,70
9,Shorts,Navy Blue,FUAARK,110


In [166]:
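# 1. Grouping Infrequent Categories in 'Category'

# Count how often each category occurs
category_counts = data['Category'].value_counts()
# Categories seen fewer than 3 times are treated as infrequent
infrequent_categories = category_counts[category_counts < 3].index
# Replace every infrequent category with a single 'Other Bottoms' label
data['Category_Grouped'] = data['Category'].replace(infrequent_categories, 'Other Bottoms')
print("\nData with Grouped Categories:")
data.head(10)  # display the first 10 of the 20 rows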


Data with Grouped Categories:


,Category,Color,Brand,Sales,Category_Grouped
0,Shorts,Navy Blue,Puma,100,Shorts
1,Track Pants,Grey,Alcis,150,Other Bottoms
2,Shorts,White,ADIDAS,120,Shorts
3,Lounge Shorts,Navy Blue,HRX by Hrithik Roshan,80,Other Bottoms
4,Track Pants,Black,Sports52 wear,180,Other Bottoms
5,Freck Panty,Red,Bushirt,30,Other Bottoms
6,Shorts,Grey,Anow Sport,90,Shorts
7,This,White,ADIDAS,15,Other Bottoms
8,Lounge Shorts,Black,Slazenger,70,Other Bottoms
9,Shorts,Navy Blue,FUAARK,110,Shorts


In [169]:
data['Category'].value_counts()

Category,count
Shorts,4
Track Pants,2
Lounge Shorts,2
This,2
Men Sports Track Reits,2
Mumming Sptete Track Pants,2
Men Sports Shorts,2
Men Sports Joggers,2
Freck Panty,1
Men Hamind Shorts,1


In [170]:
data['Category_Grouped'].value_counts()

Category_Grouped,count
Other Bottoms,16
Shorts,4
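
In a modeling workflow, the list of frequent categories is usually learned once and then reused on new data, so that categories never seen before also land in the catch-all group. The following is a minimal sketch of that idea. The helper names `fit_frequent_categories` and `group_rare`, the `min_count` threshold, and the toy Series are hypothetical and not part of the notebook above.

```python
import pandas as pd

def fit_frequent_categories(series, min_count):
    """Return the set of categories that appear at least min_count times."""
    counts = series.value_counts()
    return set(counts[counts >= min_count].index)

def group_rare(series, frequent, other_label="Other"):
    """Keep categories in `frequent`; replace everything else with other_label."""
    return series.where(series.isin(frequent), other_label)

# Hypothetical training data and new data; "Leggings" never occurs in training.
train = pd.Series(["Shorts", "Shorts", "Shorts", "Track Pants", "Track Pants", "Joggers"])
new = pd.Series(["Shorts", "Joggers", "Leggings"])

# Learn the frequent categories from the training data only.
frequent = fit_frequent_categories(train, min_count=2)

train_grouped = group_rare(train, frequent)
new_grouped = group_rare(new, frequent)

print(train_grouped.tolist())
print(new_grouped.tolist())
```

Using `where` with `isin` instead of `replace` means anything outside the learned set is grouped, including values that were absent when the counts were computed.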


In [171]:
# 2. Grouping Colors based on Semantic Similarity (Hypothetical)

color_mapping = {
    'Navy Blue': 'Blue',
    'Black': 'Dark',
    'Grey': 'Neutral',
    'White': 'Light',
    'Red': 'Bright',
    'Grim': 'Other',
}
# Map each color to its group; colors missing from the mapping become 'Other'
data['Color_Grouped'] = data['Color'].map(color_mapping).fillna('Other')
print("\nData with Semantically Grouped Colors:")
data.head(10)  # display the first 10 of the 20 rows


Data with Semantically Grouped Colors:


,Category,Color,Brand,Sales,Category_Grouped,Color_Grouped
0,Shorts,Navy Blue,Puma,100,Shorts,Blue
1,Track Pants,Grey,Alcis,150,Other Bottoms,Neutral
2,Shorts,White,ADIDAS,120,Shorts,Light
3,Lounge Shorts,Navy Blue,HRX by Hrithik Roshan,80,Other Bottoms,Blue
4,Track Pants,Black,Sports52 wear,180,Other Bottoms,Dark
5,Freck Panty,Red,Bushirt,30,Other Bottoms,Bright
6,Shorts,Grey,Anow Sport,90,Shorts,Neutral
7,This,White,ADIDAS,15,Other Bottoms,Light
8,Lounge Shorts,Black,Slazenger,70,Other Bottoms,Dark
9,Shorts,Navy Blue,FUAARK,110,Shorts,Blue


In [172]:
data['Color'].value_counts()

Color,count
Navy Blue,5
Grey,4
White,4
Black,4
Red,2
Grim,1


In [173]:
data['Color_Grouped'].value_counts()

Color_Grouped,count
Blue,5
Neutral,4
Light,4
Dark,4
Bright,2
Other,1
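
Because `fillna('Other')` silently absorbs anything missing from the mapping, it can be useful to list the unmapped values first. Below is a small sketch of that check, using a hypothetical `colors` Series and a deliberately incomplete mapping.

```python
import pandas as pd

# Hypothetical colors; "Teal" is intentionally missing from the mapping.
colors = pd.Series(["Navy Blue", "Black", "Teal", "Grey", "Teal"])
color_mapping = {"Navy Blue": "Blue", "Black": "Dark", "Grey": "Neutral"}

# map() returns NaN for every value that is not a key in the dictionary.
mapped = colors.map(color_mapping)

# List the original values that had no mapping before filling them in.
unmapped = sorted(colors[mapped.isna()].unique())
print(unmapped)

# Only now send the unmapped values to the catch-all group.
grouped = mapped.fillna("Other")
print(grouped.tolist())
```

Inspecting `unmapped` makes it easier to notice typos or new colors that deserve their own group rather than "Other".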


In [174]:
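# 3. Grouping Brands based on Popularity (Hypothetical - based on sales)

# Total sales per brand
brand_sales = data.groupby('Brand')['Sales'].sum()
# Brands with total sales below 300 are treated as low-selling
low_selling_brands = brand_sales[brand_sales < 300].index
# Replace every low-selling brand with a single 'Other Brands' label
data['Brand_Grouped'] = data['Brand'].replace(low_selling_brands, 'Other Brands')
print("\nData with Grouped Brands (based on total sales):")
data.head(10)  # display the first 10 of the 20 rows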


Data with Grouped Brands (based on total sales):


,Category,Color,Brand,Sales,Category_Grouped,Color_Grouped,Brand_Grouped
0,Shorts,Navy Blue,Puma,100,Shorts,Blue,Puma
1,Track Pants,Grey,Alcis,150,Other Bottoms,Neutral,Other Brands
2,Shorts,White,ADIDAS,120,Shorts,Light,Other Brands
3,Lounge Shorts,Navy Blue,HRX by Hrithik Roshan,80,Other Bottoms,Blue,Other Brands
4,Track Pants,Black,Sports52 wear,180,Other Bottoms,Dark,Sports52 wear
5,Freck Panty,Red,Bushirt,30,Other Bottoms,Bright,Other Brands
6,Shorts,Grey,Anow Sport,90,Shorts,Neutral,Other Brands
7,This,White,ADIDAS,15,Other Bottoms,Light,Other Brands
8,Lounge Shorts,Black,Slazenger,70,Other Bottoms,Dark,Other Brands
9,Shorts,Navy Blue,FUAARK,110,Shorts,Blue,Other Brands


In [175]:
data['Brand'].value_counts()

Brand,count
Puma,3
ADIDAS,3
Anow Sport,2
HRX by Hrithik Roshan,2
Sports52 wear,2
Bushirt,2
FUAARK,2
Slazenger,2
Alcis,1
VIGOSTING,1


In [176]:
data['Brand_Grouped'].value_counts()

Brand_Grouped,count
Other Brands,15
Puma,3
Sports52 wear,2
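
If 'Sales' were also the prediction target, computing brand totals on the full dataset would leak target information into the feature. One way to avoid that, sketched here under the assumption of a simple train/test split with made-up brands and the same 300 threshold, is to derive the brand list from the training rows only.

```python
import pandas as pd

# Hypothetical split: the test rows carry no 'Sales' values.
train = pd.DataFrame({
    "Brand": ["P", "P", "Q", "Q", "R"],
    "Sales": [200, 150, 40, 30, 90],
})
test = pd.DataFrame({"Brand": ["P", "Q", "S"]})

# Total sales per brand, computed from the training rows only.
brand_sales = train.groupby("Brand")["Sales"].sum()

# Brands that reach the threshold keep their own label.
top_brands = brand_sales[brand_sales >= 300].index

# Apply the same rule to both frames; brands unseen in training
# (such as "S") fall into 'Other Brands' as well.
for frame in (train, test):
    frame["Brand_Grouped"] = frame["Brand"].where(
        frame["Brand"].isin(top_brands), "Other Brands"
    )

print(train["Brand_Grouped"].tolist())
print(test["Brand_Grouped"].tolist())
```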


This code demonstrates three common ways to group categorical features:

1. Frequency-based grouping: Combining categories that appear infrequently.
2. Semantic grouping: Combining categories based on their meaning or similarity.
3. Grouping based on relationship with another variable: Combining categories based on their aggregate behavior with respect to another variable, such as the target (in this case, total 'Sales' per brand as a proxy for popularity).

Remember to choose the grouping strategy that makes the most sense for your specific data and the goals of your machine learning task.