* **Pros**: Count encoding adds no new columns: each categorical feature stays a single numeric column, whereas one-hot encoding adds one column per category. It also retains how often each category occurs in the dataset.

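* **Cons**: Count encoding preserves frequency information but discards any other meaningful information or relationships between categories. Categories that occur equally often receive the same value and become indistinguishable (in the example below, `Small` and `Medium` both map to 3). Because the counts depend on the category distribution of the data the encoder was fit on, the encoding can also be sensitive to data imbalances.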

* **When to use**: Count encoding can be useful when the frequency of a category correlates with the target variable. It also suits categorical features with many categories. Fit the encoder on the training dataset only, then use the fitted object to transform the test and out-of-time (OOT) datasets.
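
The fit-on-train, transform-elsewhere rule can be sketched with plain pandas (used here instead of `category_encoders` so the mechanics are visible). The `train` and `test` frames, the `Color_count` column name, and the choice to encode unseen categories as 0 are all assumptions made for illustration, not part of the original example.

```python
import pandas as pd

# Hypothetical train/test split; the data is illustrative only.
train = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Red", "Red", "Blue"]})
test = pd.DataFrame({"Color": ["Red", "Green", "Purple"]})  # "Purple" never appears in train

# "Fit": learn the category counts from the training data only.
color_counts = train["Color"].value_counts()

# "Transform": apply the training counts to both datasets.
train["Color_count"] = train["Color"].map(color_counts)

# Assumption: categories unseen during fitting are encoded as 0.
test["Color_count"] = test["Color"].map(color_counts).fillna(0).astype(int)

print(test)
```

Counting on the test set itself would leak its distribution into the features and give the same category different values across datasets, which is why the counts learned from `train` are reused.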

In [None]:
!pip install category_encoders

Collecting category_encoders
  Downloading category_encoders-2.6.4-py2.py3-none-any.whl.metadata (8.0 kB)
Downloading category_encoders-2.6.4-py2.py3-none-any.whl (82 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m82.0/82.0 kB[0m [31m1.8 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: category_encoders
Successfully installed category_encoders-2.6.4


In [None]:
import category_encoders as ce
import pandas as pd

In [None]:
data = {
    'Color': ['Red', 'Blue', 'Green', 'Red', 'Red', 'Blue', 'Green'],
    'Size': ['Small', 'Medium', 'Large', 'Medium', 'Small', 'Small', 'Medium'],
    'Label': [1, 0, 1, 1, 0, 0, 1]
}

df = pd.DataFrame(data)
df

,Color,Size,Label
0,Red,Small,1
1,Blue,Medium,0
2,Green,Large,1
3,Red,Medium,1
4,Red,Small,0
5,Blue,Small,0
6,Green,Medium,1


In [None]:
# Collect the names of the categorical (object-dtype) columns
column_names = df.select_dtypes(include=['object']).columns.to_list()

In [None]:
# Fit the encoder and replace each category with its count in the data
encoder = ce.CountEncoder()
df[column_names] = encoder.fit_transform(df[column_names])


In [None]:
df

,Color,Size,Label
0,3,3,1
1,2,3,0
2,2,1,1
3,3,3,1
4,3,3,0
5,2,3,0
6,2,3,1
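
In the encoded output, `Blue` and `Green` both become 2, and `Small` and `Medium` both become 3, so the model can no longer tell those categories apart. The following is a minimal sketch, in plain pandas, of how such collisions could be detected before encoding. The `raw` frame simply rebuilds the original categorical columns, and the `collisions` dictionary is a hypothetical helper structure.

```python
import pandas as pd

# Rebuild the original (unencoded) categorical columns from the example.
raw = pd.DataFrame({
    "Color": ["Red", "Blue", "Green", "Red", "Red", "Blue", "Green"],
    "Size": ["Small", "Medium", "Large", "Medium", "Small", "Small", "Medium"],
})

collisions = {}
for column in raw.columns:
    # Group category names by how often they occur in this column.
    names_by_count = {}
    for name, count in raw[column].value_counts().items():
        names_by_count.setdefault(int(count), []).append(name)

    # Keep only counts shared by more than one category.
    collisions[column] = {
        count: sorted(names)
        for count, names in names_by_count.items()
        if len(names) > 1
    }

print(collisions)
```

If a column shows many collisions, count encoding alone may lose too much information for that feature.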
