Function to group rare categories into a single category "OTHER"

Problems that rare category values pose for a logistic regression model:
    
    - non-ordered, high-cardinality categorical variables are typically one-hot encoded (equivalently, converted to dummy variables), leading to sparse feature vectors and more coefficients to fit (one coefficient per category)

    - in parametric methods such as Logistic Regression, coefficients of categories with few samples become unstable and there is increased risk of overfitting to noise
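
To make the first point concrete, here is a minimal pure-Python sketch of one-hot encoding. The helper `one_hot_encode` and the sample data are hypothetical and only for illustration; in practice a library encoder would normally be used.

```python
def one_hot_encode(values):
    """Return the sorted distinct categories and one 0/1 row per sample."""
    categories = sorted(set(values))
    column_of = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[column_of[v]] = 1  # exactly one active column per sample
        rows.append(row)
    return categories, rows

countries = ["US", "US", "UK", "India", "India", "Brasil", "US"]
categories, rows = one_hot_encode(countries)

# One column, and hence one logistic regression coefficient, per distinct category
n_columns = len(categories)

# Number of samples available to estimate each coefficient
support = {c: sum(r[i] for r in rows) for i, c in enumerate(categories)}
```

Here `support` shows that the coefficients for "UK" and "Brasil" would each be estimated from a single sample, which is the instability described above.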

Advantages of grouping rare category values into a single category:
   
    - reduce variance (at the expense of increased bias) of coefficients of rare categories, thereby reducing overall model variance - particularly relevant for parametric methods such as Logistic Regression
    
    - reduce dimensionality of model

    - make the model capable of handling values of that feature that were not seen during training (they can be mapped to "OTHER")
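
The last point only holds if the set of frequent values is learned on the training data and then reused at prediction time. The following is a rough sketch of that idea; the helper names `fit_frequent_categories` and `apply_grouping` are hypothetical and are not part of the function defined later in this notebook.

```python
from collections import Counter

def fit_frequent_categories(train_values, min_frequency=0.05):
    """Learn, from training data only, which values are frequent enough to keep."""
    counts = Counter(train_values)
    n_samples = len(train_values)
    return {v for v, c in counts.items() if c / n_samples >= min_frequency}

def apply_grouping(values, frequent, other_label="OTHER"):
    """Map every value outside the learned set (rare or unseen) to other_label."""
    return [v if v in frequent else other_label for v in values]

train = ["US", "US", "UK", "India", "India", "Brasil", "US", "US"]
frequent = fit_frequent_categories(train, min_frequency=0.2)

# "Japan" never appeared in training, "UK" was rare: both fall into "OTHER"
new_values = apply_grouping(["US", "Japan", "UK", "India"], frequent)
```

Because unseen values land in the same bucket as rare ones, the model always receives a category it has a coefficient for.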


Alternative method to handle high-cardinality categorical features - Target mean encoding:
    
    - How it works: replace each category with the mean target variable value for that category
    
    - Advantages: 
        
        - still captures the relationship between each category and the target variable

        - the categorical feature effectively becomes a numerical feature, avoiding dimensionality growth and feature vector sparsity

    - Vulnerabilities:

        - still prone to overfitting for rare categories (e.g. category with one sample with positive target class will have mean = 1)

        - introduces target leakage (each sample's own target value contributes to the encoded feature used to predict it)
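
A minimal sketch of plain target mean encoding, assuming a binary 0/1 target. The function name `target_mean_encode` and the toy data are hypothetical; the example reproduces the single-sample vulnerability listed above.

```python
from collections import defaultdict

def target_mean_encode(categories, targets):
    """Replace each category by the mean target value over the whole training set."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1
    means = {c: sums[c] / counts[c] for c in sums}
    return [means[c] for c in categories], means

cats = ["US", "US", "US", "US", "UK", "India", "India"]
y = [1, 0, 0, 1, 1, 0, 1]
encoded, means = target_mean_encode(cats, y)

# "UK" has a single sample, so its encoding is just that sample's own target
uk_encoding = means["UK"]
```

For the single "UK" sample the encoded feature equals the target itself, which illustrates both the overfitting and the leakage problem.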

A popular algorithm that addresses the problem of categorical features with numerous distinct values is CatBoost, or Categorical Boosting.
In short, CatBoost is a type of Gradient Boosted Decision Tree ensemble. It successively fits new trees to the gradient of the loss (roughly, the remaining error) of the previously fitted trees, and its prediction is the sum of the outputs of all those trees, each scaled by a learning rate.
CatBoost handles categorical features without any prior feature transformation (e.g. one-hot encoding). It does so with a technique called Ordered Target Statistics (OTS).
OTS is similar to the target mean encoding described above, with one key difference:

    - each sample's category is still encoded with the mean target value for that category, but the mean is computed only from other samples, never from the sample being encoded (in CatBoost, from the samples that precede it in a random permutation of the training set). Because a sample's own target never enters its encoding, this avoids target leakage.
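
The idea can be sketched as follows. This is a simplified illustration and not CatBoost's actual implementation: the function name, the `prior` and `prior_weight` parameters and their defaults are assumptions made here, and it uses a single random permutation in which each sample sees only the samples that come before it.

```python
import random

def ordered_target_statistics(categories, targets, prior=0.5, prior_weight=1.0, seed=0):
    """Encode each sample from same-category samples that precede it in a random order."""
    n_samples = len(categories)
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)  # random permutation of the training set

    sums, counts = {}, {}
    encoded = [0.0] * n_samples
    for i in order:
        c = categories[i]
        s = sums.get(c, 0.0)
        k = counts.get(c, 0)
        # Smoothed mean over earlier samples only; with no history this is the prior
        encoded[i] = (s + prior_weight * prior) / (k + prior_weight)
        # Only now does the current sample's target become visible to later samples
        sums[c] = s + targets[i]
        counts[c] = k + 1
    return encoded

cats = ["US", "US", "UK", "US", "UK"]
y = [1, 0, 1, 1, 0]
encoded = ordered_target_statistics(cats, y)
```

The first sample of each category in the permutation has no history and receives the prior, so a category with a single sample is no longer encoded with its own target.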

In [None]:
import math

FUNCTION TO GROUP RARE CATEGORIES:

In [None]:
from collections import Counter

def group_rare_categories(category_values, min_frequency=0.05):
    """
    Groups category values with frequency smaller than min_frequency into one category "OTHER"
    """
    n_samples = len(category_values)
    # smallest number of occurrences a value needs in order to be kept
    min_occurrences = math.ceil(n_samples * min_frequency)
    # count each distinct value once instead of calling list.count() per sample
    value_counts = Counter(category_values)
    return ["OTHER" if value_counts[v] < min_occurrences else v for v in category_values]


Test function

In [None]:
categorical_values = ["US","US","UK","India","India","Brasil","US","Brasil","Portugal","Portugal","US","US"]
thr = 0.1
grouped_categorical_values = group_rare_categories(categorical_values,thr)
grouped_categorical_values

In [None]:
thr = 0.2
grouped_categorical_values = group_rare_categories(categorical_values,thr)
grouped_categorical_values