# 1. Creating Binary Features: One-Hot Encoding

Creating binary features through one-hot encoding of specific categories involves taking a categorical feature and, for a chosen subset of its categories, creating new binary (0 or 1) columns. Each new column represents one of the selected categories. If a data point's original categorical feature matches the category represented by the new binary column, that column will have a value of 1; otherwise, it will have a value of 0.

Why One-Hot Encode Specific Categories?

1. Focus on Important Categories: You might have domain knowledge or observed patterns that suggest certain categories within a feature are particularly important for predicting your target variable. One-hot encoding only these key categories allows your model to specifically learn their impact.
2. Reduce Dimensionality: If a categorical feature has many categories, one-hot encoding all of them adds one column per category and can significantly increase the dimensionality of your dataset. If only a few categories are truly relevant, one-hot encoding just those keeps the feature space manageable and helps mitigate the "curse of dimensionality."
3. Handle Nominal Data: Many machine learning algorithms work best with numerical input. One-hot encoding converts categorical data into a numerical format that these algorithms can process. Binary features (0 and 1) are a simple and effective way to represent the presence or absence of a specific category.
4. Avoid Ordinality Assumption: If the categories in your feature don't have a natural order (nominal data, like colors or product categories), one-hot encoding avoids implicitly imposing an order that might be misinterpreted by the model (which could happen if you simply assigned numerical labels to the categories).
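
The ordinality point in item 4 can be illustrated with a minimal sketch. The colors Series below is hypothetical, and the integer codes come from pandas' category dtype, which sorts string categories alphabetically. This is only one of several ways to assign integer labels.

```python
import pandas as pd

# Hypothetical nominal feature; the values are illustrative only.
colors = pd.Series(['Red', 'Grey', 'White', 'Grey'], name='Color')

# Integer label encoding: categories are sorted alphabetically, so
# Grey -> 0, Red -> 1, White -> 2. A model may read this as a real ordering.
label_encoded = colors.astype('category').cat.codes

# Binary indicator for a single category: no ordering is implied.
is_grey = (colors == 'Grey').astype(int)

print(label_encoded.tolist())  # [1, 0, 2, 0]
print(is_grey.tolist())        # [0, 1, 0, 1]
```

The integer codes suggest that White (2) is "greater than" Grey (0), which has no meaning for colors. The binary column only states whether a row is Grey or not.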

How to One-Hot Encode Specific Categories:

1. Identify the Categorical Feature: Choose the categorical column you want to work with.
2. Select the Specific Categories: Determine which categories within that feature you want to create binary features for. This selection can be based on frequency, domain knowledge, or initial exploratory analysis showing a strong relationship with the target.
3. Create Binary Columns: For each selected category, create a new column in your dataset.
4. Populate Binary Columns: Iterate through your data. For each row, if the value in the original categorical feature matches the category represented by the new binary column, set the value in that new column to 1. Otherwise, set it to 0.
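
As a rough sketch of these four steps, the snippet below selects categories by frequency (one of the selection criteria mentioned in step 2) and then builds the binary columns. The DataFrame df and the cutoff top_n = 2 are assumptions made for illustration. In pandas, the row-by-row check in step 4 is normally written as a single vectorized comparison rather than an explicit loop over rows.

```python
import pandas as pd

# Step 1: hypothetical data with one categorical column.
df = pd.DataFrame({'Category': ['Shorts', 'Track Pants', 'Shorts',
                                'Lounge Shorts', 'Track Pants', 'Shorts']})

# Step 2: pick the most frequent categories (top_n = 2 is an arbitrary choice).
top_n = 2
selected = df['Category'].value_counts().nlargest(top_n).index.tolist()

# Steps 3 and 4: one vectorized comparison per selected category.
for cat in selected:
    df[f'Is_{cat.replace(" ", "_")}'] = (df['Category'] == cat).astype(int)

print(selected)
print(df)
```

Here 'Shorts' appears three times and 'Track Pants' twice, so those two categories are selected and 'Lounge Shorts' gets no column of its own.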

# Import necessary dependencies

In [1]:
import pandas as pd

# Create sample dataset

In [2]:
# Sample DataFrame representing Men's Sports Apparel data
data = pd.DataFrame({
    'Category': ['Shorts', 'Track Pants', 'Shorts', 'Lounge Shorts', 'Track Pants',
                 'Freck Panty', 'Shorts', 'This', 'Lounge Shorts'],
    'Color': ['Navy Blue', 'Grey', 'White', 'Navy Blue', 'Black',
              'Red', 'Grey', 'White', 'Black'],
    'Brand': ['Puma', 'Alcis', 'ADIDAS', 'HRX by Hrithik Roshan', 'Sports52 wear',
              'Bushirt', 'Anow Sport', 'ADIDAS', 'Slazenger']
})

print("Original Data:")
data

Original Data:


| | Category | Color | Brand |
|---|---|---|---|
| 0 | Shorts | Navy Blue | Puma |
| 1 | Track Pants | Grey | Alcis |
| 2 | Shorts | White | ADIDAS |
| 3 | Lounge Shorts | Navy Blue | HRX by Hrithik Roshan |
| 4 | Track Pants | Black | Sports52 wear |
| 5 | Freck Panty | Red | Bushirt |
| 6 | Shorts | Grey | Anow Sport |
| 7 | This | White | ADIDAS |
| 8 | Lounge Shorts | Black | Slazenger |


### A. One-Hot Encoding Specific Categories:

1. We define a list selected_categories containing the categories we want to create binary features for ('Shorts', 'Track Pants').
2. We iterate through this list. For each category, we create a new column with a name like 'Is_Shorts' or 'Is_Track_Pants'.
3. For each row in the DataFrame, we check whether the value in the 'Category' column matches the current category (the loop variable cat). The comparison produces a boolean, and .astype(int) converts it to an integer: 1 for a match and 0 otherwise.

In [3]:
# 1. One-Hot Encoding Specific Categories in 'Category'

selected_categories = ['Shorts', 'Track Pants']
for cat in selected_categories:
    data[f'Is_{cat.replace(" ", "_")}'] = (data['Category'] == cat).astype(int)

print("\nData with One-Hot Encoded Specific Categories (Shorts, Track Pants):")
data


Data with One-Hot Encoded Specific Categories (Shorts, Track Pants):


| | Category | Color | Brand | Is_Shorts | Is_Track_Pants |
|---|---|---|---|---|---|
| 0 | Shorts | Navy Blue | Puma | 1 | 0 |
| 1 | Track Pants | Grey | Alcis | 0 | 1 |
| 2 | Shorts | White | ADIDAS | 1 | 0 |
| 3 | Lounge Shorts | Navy Blue | HRX by Hrithik Roshan | 0 | 0 |
| 4 | Track Pants | Black | Sports52 wear | 0 | 1 |
| 5 | Freck Panty | Red | Bushirt | 0 | 0 |
| 6 | Shorts | Grey | Anow Sport | 1 | 0 |
| 7 | This | White | ADIDAS | 0 | 0 |
| 8 | Lounge Shorts | Black | Slazenger | 0 | 0 |


### B. One-Hot Encoding Specific Colors:

We repeat the same process for the 'Color' column, selecting 'Navy Blue', 'White', and 'Grey' to create binary features.

In [4]:
# 2. One-Hot Encoding Specific Colors in 'Color'

selected_colors = ['Navy Blue', 'White', 'Grey']
for color in selected_colors:
    data[f'Is_{color.replace(" ", "_")}'] = (data['Color'] == color).astype(int)

print("\nData with One-Hot Encoded Specific Colors (Navy Blue, White, Grey):")
data


Data with One-Hot Encoded Specific Colors (Navy Blue, White, Grey):


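| | Category | Color | Brand | Is_Shorts | Is_Track_Pants | Is_Navy_Blue | Is_White | Is_Grey |
|---|---|---|---|---|---|---|---|---|
| 0 | Shorts | Navy Blue | Puma | 1 | 0 | 1 | 0 | 0 |
| 1 | Track Pants | Grey | Alcis | 0 | 1 | 0 | 0 | 1 |
| 2 | Shorts | White | ADIDAS | 1 | 0 | 0 | 1 | 0 |
| 3 | Lounge Shorts | Navy Blue | HRX by Hrithik Roshan | 0 | 0 | 1 | 0 | 0 |
| 4 | Track Pants | Black | Sports52 wear | 0 | 1 | 0 | 0 | 0 |
| 5 | Freck Panty | Red | Bushirt | 0 | 0 | 0 | 0 | 0 |
| 6 | Shorts | Grey | Anow Sport | 1 | 0 | 0 | 0 | 1 |
| 7 | This | White | ADIDAS | 0 | 0 | 0 | 1 | 0 |
| 8 | Lounge Shorts | Black | Slazenger | 0 | 0 | 0 | 0 | 0 |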


### C. One-Hot Encoding Specific Brands:

We do the same for the 'Brand' column, selecting 'ADIDAS', 'Puma', and 'HRX by Hrithik Roshan'.

In [5]:
# 3. One-Hot Encoding Specific Brands in 'Brand'

selected_brands = ['ADIDAS', 'Puma', 'HRX by Hrithik Roshan']
for brand in selected_brands:
    data[f'Is_{brand.replace(" ", "_")}'] = (data['Brand'] == brand).astype(int)

print("\nData with One-Hot Encoded Specific Brands (ADIDAS, Puma, HRX by Hrithik Roshan):")
data


Data with One-Hot Encoded Specific Brands (ADIDAS, Puma, HRX by Hrithik Roshan):


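| | Category | Color | Brand | Is_Shorts | Is_Track_Pants | Is_Navy_Blue | Is_White | Is_Grey | Is_ADIDAS | Is_Puma | Is_HRX_by_Hrithik_Roshan |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Shorts | Navy Blue | Puma | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 1 | Track Pants | Grey | Alcis | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | Shorts | White | ADIDAS | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 3 | Lounge Shorts | Navy Blue | HRX by Hrithik Roshan | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 4 | Track Pants | Black | Sports52 wear | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | Freck Panty | Red | Bushirt | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | Shorts | Grey | Anow Sport | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 7 | This | White | ADIDAS | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 8 | Lounge Shorts | Black | Slazenger | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |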
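
The three loops in sections A, B, and C follow the same pattern, so they could be folded into one small helper. The function add_binary_features below is a hypothetical name introduced here for illustration, not part of pandas, and the sample DataFrame is a reduced stand-in for the data above.

```python
import pandas as pd

def add_binary_features(df, column, categories, prefix='Is_'):
    """Return a copy of df with one 0/1 column per selected category.

    Hypothetical helper: the name and signature are illustrative only.
    """
    out = df.copy()  # leave the caller's DataFrame untouched
    for cat in categories:
        new_col = prefix + str(cat).replace(' ', '_')
        out[new_col] = (out[column] == cat).astype(int)
    return out

# Reduced stand-in for the sample data used in this notebook.
sample = pd.DataFrame({
    'Category': ['Shorts', 'Track Pants', 'Lounge Shorts'],
    'Brand': ['Puma', 'ADIDAS', 'HRX by Hrithik Roshan'],
})

encoded = add_binary_features(sample, 'Category', ['Shorts', 'Track Pants'])
encoded = add_binary_features(encoded, 'Brand', ['ADIDAS', 'Puma'])
print(encoded)
```

Returning a copy rather than modifying the input in place is a design choice made here so that the original DataFrame stays available for comparison. The loops above modify data directly, which is also fine in a notebook.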


# When to One-Hot Encode Specific Categories Instead of the Whole Column

You should consider one-hot encoding specific categories over one-hot encoding the entire category column in the following situations:

1. High Cardinality with Few Dominant Categories: A categorical feature may have a large number of unique categories while only a small subset of them is frequent or particularly important for your prediction task. One-hot encoding the entire column would create a very high-dimensional space, potentially leading to the curse of dimensionality, increased computational cost, and difficulty in model interpretation. Focusing on the dominant or most relevant categories keeps the feature space manageable.

2. Focus on Specific Business Insights: Your domain knowledge may suggest that the presence or absence of certain categories has a strong and direct impact on the target variable, while other categories are less significant or have a more diffuse effect. By creating binary features for these key categories, you allow the model to learn their specific influence directly.

3. Reducing Dimensionality for Linear Models: Linear models (like linear regression or logistic regression) can become unstable and hard to interpret with a very large number of one-hot encoded features. Encoding every category of a column alongside an intercept also makes the columns perfectly collinear (the "dummy variable trap"). If only a few categories are deemed crucial, encoding just them can help maintain a more stable and interpretable linear model.

4. Handling Rare Categories: If you have many infrequent categories in a column, one-hot encoding all of them can lead to many binary features with very few '1' values. These rare categories might not provide enough statistical power for the model to learn effectively and can contribute noise. Instead, you could create binary features for the more frequent categories and group the rest into an "Other" category (which might or might not be one-hot encoded as a binary feature).

5. Feature Selection Strategy: If you plan to perform feature selection later, one-hot encoding only the categories you hypothesize to be important can be a way to proactively guide the model and the selection process.

6. Maintaining Interpretability: A large number of one-hot encoded features can make it harder to interpret the model's coefficients or feature importances. Focusing on a smaller set of key categories can lead to a more understandable model.

7. Computational Constraints: If you are working with limited computational resources or need to train models quickly, reducing the number of features through selective one-hot encoding can be beneficial.
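
To make the dimensionality argument in the list above concrete, the following sketch compares the number of columns produced by encoding every brand against encoding only a chosen subset. The brands DataFrame reuses some brand names from the sample data, and the choice of selected_brands is an assumption for illustration.

```python
import pandas as pd

# A few brand values taken from the sample data in this notebook.
brands = pd.DataFrame({'Brand': ['Puma', 'Alcis', 'ADIDAS', 'HRX by Hrithik Roshan',
                                 'Sports52 wear', 'ADIDAS', 'Slazenger']})

# Full one-hot encoding: one column per unique brand.
full = pd.get_dummies(brands['Brand'], prefix='Is', dtype=int)

# Selective encoding: one column per brand we chose to keep (illustrative choice).
selected_brands = ['ADIDAS', 'Puma']
selective = pd.DataFrame(
    {f'Is_{b}': (brands['Brand'] == b).astype(int) for b in selected_brands}
)

print(full.shape[1], selective.shape[1])  # 6 2
```

With only six unique brands the difference is small, but the column count of the full encoding grows with the number of unique values while the selective encoding stays at the size of the chosen list.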

Example Scenario:

Let's say you are trying to predict the sales of men's sports apparel. The "Brand" column might have many different brands, but "ADIDAS," "Puma," and "HRX by Hrithik Roshan" are likely to be popular and have a significant impact on sales. Instead of one-hot encoding every brand, you might choose to create binary features like Is_ADIDAS, Is_Puma, and Is_HRX_by_Hrithik_Roshan. Other less frequent brands could be left as their original categorical values or grouped into an "Other Brands" category, depending on your modeling approach.
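
A minimal sketch of this scenario, assuming a hypothetical sales DataFrame with a 'Brand' column: the key brands get their own binary features, and every other brand is grouped under "Other Brands". The column names Brand_Grouped and Is_Other_Brands are introduced here for illustration only.

```python
import pandas as pd

# Hypothetical data; brand names are borrowed from the sample above.
sales = pd.DataFrame({'Brand': ['Puma', 'Alcis', 'ADIDAS',
                                'HRX by Hrithik Roshan', 'Slazenger']})
key_brands = ['ADIDAS', 'Puma', 'HRX by Hrithik Roshan']

# Keep key brands as they are and relabel everything else as 'Other Brands'.
is_key = sales['Brand'].isin(key_brands)
sales['Brand_Grouped'] = sales['Brand'].where(is_key, 'Other Brands')

# Binary features for the key brands.
for brand in key_brands:
    sales[f'Is_{brand.replace(" ", "_")}'] = (sales['Brand'] == brand).astype(int)

# Optional: a single indicator covering all remaining brands.
sales['Is_Other_Brands'] = (~is_key).astype(int)

print(sales)
```

Whether to keep the Is_Other_Brands column depends on the modeling approach: for a linear model with an intercept, including it together with all key-brand indicators makes the columns sum to 1 in every row, so one of them is usually dropped.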

In summary, consider one-hot encoding specific categories when you have a good reason to believe that those particular categories are especially relevant, when the overall cardinality of the feature is high, or when you need to manage dimensionality for model performance and interpretability.