# Feature Engineering Examples: Binning Categorical Features

In this notebook, we explore techniques for binning categorical features, following the article [Feature Engineering Examples: Binning Categorical Features](https://towardsdatascience.com/feature-engineering-examples-binning-categorical-features-9f8d582455da). Binning a categorical feature means grouping several of its categories into one. This reduces the number of distinct levels (and therefore the number of columns produced by one-hot encoding) and can improve model performance when many categories are too rare to learn from reliably.

We will cover the following techniques:
1. Frequency Binning
2. Binning by Similarity
3. Custom Binning
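
To make the dimensionality point concrete, here is a minimal sketch that counts one-hot columns before and after grouping. The `colors` series and the choice of which values to merge into `'Other'` are made up for illustration and do not come from the article.

```python
import pandas as pd

# Hypothetical feature with five distinct categories, three of them rare
colors = pd.Series(['red', 'red', 'red', 'blue', 'blue', 'teal', 'navy', 'pink'], name='color')

# One-hot encoding creates one column per distinct category
n_before = pd.get_dummies(colors).shape[1]

# Merge the three rare categories into a single 'Other' bin (illustrative mapping)
binned = colors.replace({'teal': 'Other', 'navy': 'Other', 'pink': 'Other'})
n_after = pd.get_dummies(binned).shape[1]

print(n_before, n_after)  # 5 3
```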


## 1. Frequency Binning

Frequency binning groups categories by how often they occur in the dataset. Categories that occur less often than a chosen threshold are merged into a single group (here, `'Other'`), which reduces the number of distinct categories.


In [1]:
import pandas as pd

# Sample data
data = pd.DataFrame({
    'Category': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
    'Frequency': [50, 45, 40, 35, 30, 25, 20, 15, 10, 5]
})

# Frequency binning: keep categories whose frequency meets the threshold,
# and group all remaining categories as 'Other'
threshold = 20
data['Binned'] = data['Category'].where(data['Frequency'] >= threshold, 'Other')

# Display the result
print(data)


  Category  Frequency Binned
0        A         50      A
1        B         45      B
2        C         40      C
3        D         35      D
4        E         30      E
5        F         25      F
6        G         20      G
7        H         15  Other
8        I         10  Other
9        J          5  Other
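
The example above starts from a table of precomputed frequencies. Raw data usually has one row per observation instead, so the counts have to be derived first. The following is a minimal sketch under that assumption; the `raw` frame and the `min_count` value are hypothetical.

```python
import pandas as pd

# Hypothetical raw data: one row per observation, no precomputed counts
raw = pd.DataFrame({'Category': ['A'] * 5 + ['B'] * 4 + ['C'] * 2 + ['D']})

# Count how often each category occurs
counts = raw['Category'].value_counts()

# Categories seen fewer than min_count times are treated as rare
# (min_count = 3 is an illustrative choice)
min_count = 3
rare = counts[counts < min_count].index

# Keep frequent categories and replace rare ones with 'Other'
raw['Binned'] = raw['Category'].where(~raw['Category'].isin(rare), 'Other')

binned_counts = raw['Binned'].value_counts().to_dict()
print(binned_counts)
```

Here `C` (2 rows) and `D` (1 row) fall below the threshold and end up together in `'Other'`.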


## 2. Binning by Similarity

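Binning by similarity groups categories that resemble one another. The notion of similarity can come from domain knowledge, from semantic similarity (for example, category names with related meanings), or from statistical similarity (for example, categories that behave alike with respect to another variable).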


In [2]:
# Sample data with precomputed similarity group IDs
# (for simplicity, categories that share a numeric ID are treated as similar)
similarity_data = pd.DataFrame({
    'Category': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],
    'Similarity': [1, 1, 2, 2, 3, 3, 4, 4]
})

# Binning by similarity
similarity_data['Binned'] = similarity_data['Similarity'].apply(lambda x: f'Group {x}')

# Display the result
print(similarity_data)


  Category  Similarity   Binned
0        A           1  Group 1
1        B           1  Group 1
2        C           2  Group 2
3        D           2  Group 2
4        E           3  Group 3
5        F           3  Group 3
6        G           4  Group 4
7        H           4  Group 4
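
The similarity IDs above are given directly. As one possible way to obtain such groups from data, the sketch below uses statistical similarity: categories are grouped by their mean value of a binary target. The `df` frame, the bin edges, and the group labels are assumptions made for illustration, not part of the original example.

```python
import pandas as pd

# Hypothetical data: a categorical feature and a binary target
df = pd.DataFrame({
    'Category': ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D'],
    'Target':   [0,   0,   0,   1,   1,   1,   0,   1],
})

# Mean target per category: A = 0.0, B = 0.5, C = 1.0, D = 0.5
target_mean = df.groupby('Category')['Target'].mean()

# Group categories whose target means fall in the same interval
# (the edges and labels are illustrative choices)
groups = pd.cut(
    target_mean,
    bins=[-0.01, 0.33, 0.66, 1.0],
    labels=['Low rate', 'Mid rate', 'High rate'],
)

# Turn the result into a plain category -> group lookup and apply it
mapping = groups.astype(str).to_dict()
df['Binned'] = df['Category'].map(mapping)

print(mapping)
```

`B` and `D` share a target mean of 0.5, so they land in the same group. When the target is what the model will predict, a grouping like this should be computed on training data only to avoid leaking target information.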


## 3. Custom Binning

Custom binning allows you to define your own rules for grouping categories based on domain knowledge or specific requirements.


In [3]:
# Sample data
custom_data = pd.DataFrame({
    'Category': ['Cat1', 'Cat2', 'Cat3', 'Cat4', 'Cat5', 'Cat6', 'Cat7', 'Cat8'],
    'Value': [10, 15, 10, 20, 25, 20, 30, 35]
})

# Custom binning based on domain knowledge
bins = {
    'Low': ['Cat1', 'Cat3'],
    'Medium': ['Cat2', 'Cat4', 'Cat6'],
    'High': ['Cat5', 'Cat7', 'Cat8']
}

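# Return the bin whose list contains `category`;
# categories not listed in any bin fall back to 'Other'
def custom_bin(category):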
    for key, value in bins.items():
        if category in value:
            return key
    return 'Other'

custom_data['Binned'] = custom_data['Category'].apply(custom_bin)

# Display the result
print(custom_data)


  Category  Value  Binned
0     Cat1     10     Low
1     Cat2     15  Medium
2     Cat3     10     Low
3     Cat4     20  Medium
4     Cat5     25    High
5     Cat6     20  Medium
6     Cat7     30    High
7     Cat8     35    High
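
An alternative sketch of the same rule inverts the `bins` dictionary once into a category-to-bin lookup and applies it with `Series.map`, which avoids scanning every bin for each row. The `Cat9` value is a hypothetical unseen category added to show the `'Other'` fallback.

```python
import pandas as pd

# Same bin definitions as in the example above
bins = {
    'Low': ['Cat1', 'Cat3'],
    'Medium': ['Cat2', 'Cat4', 'Cat6'],
    'High': ['Cat5', 'Cat7', 'Cat8'],
}

# Invert to a flat lookup: category -> bin label
category_to_bin = {cat: label for label, cats in bins.items() for cat in cats}

# 'Cat9' is not listed in any bin, so map() yields NaN and fillna() assigns 'Other'
categories = pd.Series(['Cat1', 'Cat2', 'Cat5', 'Cat9'])
binned = categories.map(category_to_bin).fillna('Other')

print(binned.tolist())  # ['Low', 'Medium', 'High', 'Other']
```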
