## Count or frequency encoding

In count encoding, we replace each category with the number of observations that show that category in the dataset. Similarly, we can replace the category with its frequency, that is, the fraction (or percentage) of observations in the dataset that show it. For example, if 10 of our 100 observations show the colour blue, we would replace blue with 10 if doing count encoding, or with 0.1 if doing frequency encoding. These techniques capture how well each label is represented in the dataset, but the encoding is not necessarily predictive of the outcome. They are, however, very popular encoding methods in Kaggle competitions.

The assumption of this technique is that the number of observations shown by each category is somewhat informative of the predictive power of that category.
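
As a minimal sketch of the blue example above, the snippet below builds a toy `colour` variable (the data and the names `colour`, `count_encoded` and `frequency_encoded` are made up for illustration) and derives both encodings with pandas:

```python
import pandas as pd

# Toy data (illustrative only): 10 blue, 30 red and 60 green observations
colour = pd.Series(['blue'] * 10 + ['red'] * 30 + ['green'] * 60, name='colour')

# Count encoding: category -> number of observations
count_map = colour.value_counts().to_dict()

# Frequency encoding: category -> fraction of observations
frequency_map = colour.value_counts(normalize=True).to_dict()

count_encoded = colour.map(count_map)
frequency_encoded = colour.map(frequency_map)

print(count_map['blue'], frequency_map['blue'])
```

Here blue is replaced by 10 under count encoding and by 0.1 under frequency encoding, as described in the text.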


### Advantages

- Simple
- Does not expand the feature space

### Disadvantages

- If 2 different categories appear the same number of times in the dataset, they will be replaced by the same number, so the distinction between them, which may be valuable information, is lost.

For example, if there are 10 observations for the category blue and 10 observations for the category red, both will be replaced by 10 and, after the encoding, will appear to be the same thing.
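
The collision described above can be reproduced, and detected before encoding, with a few lines of pandas. This is only a sketch on made-up data; the variable names are illustrative:

```python
import pandas as pd

# Toy data (illustrative only): blue and red both appear 10 times
colour = pd.Series(['blue'] * 10 + ['red'] * 10 + ['green'] * 5)

counts = colour.value_counts()
encoded = colour.map(counts.to_dict())

# Three categories go in, but only two distinct values come out
n_categories_before = colour.nunique()
n_values_after = encoded.nunique()

# Categories that share their count with at least one other category
tied = sorted(counts[counts.duplicated(keep=False)].index)
print(n_categories_before, n_values_after, tied)
```

Checking for tied counts like this before encoding is one way to judge how much information a given variable would lose.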


Follow this [thread in Kaggle](https://www.kaggle.com/general/16927) for more information.



## In this demo:

We will see how to perform count or frequency encoding with:
- pandas
- Feature-Engine

We will also look at the advantages and limitations of each implementation, using the House Prices dataset.

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from feature_engine.encoding import CountFrequencyEncoder

In [2]:
data = pd.read_excel(
    'HousingPrices.xls', usecols=['Neighborhood', 'Exterior1st', 'Exterior2nd', 'SalePrice'])
data.head()

| | Neighborhood | Exterior1st | Exterior2nd | SalePrice |
|---|---|---|---|---|
| 0 | CollgCr | VinylSd | VinylSd | 208500.0 |
| 1 | Veenker | MetalSd | MetalSd | 181500.0 |
| 2 | CollgCr | VinylSd | VinylSd | 223500.0 |
| 3 | Crawfor | Wd Sdng | Wd Shng | 140000.0 |
| 4 | NoRidge | VinylSd | VinylSd | 250000.0 |


**Check how many labels each variable has!**

In [3]:
for col in data.columns:
    print(col, ': ', len(data[col].unique()), ' labels')

Neighborhood :  25  labels
Exterior1st :  16  labels
Exterior2nd :  17  labels
SalePrice :  664  labels


### Important

When doing count transformation of categorical variables, it is important to calculate the count (or frequency = count / total observations) **over the training set**, and then use those numbers to replace the labels in the test set.
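
To illustrate the point, here is a small sketch on made-up data (the `train` and `test` frames are hypothetical). The mapping is learned from the training set only, so the test set receives the training frequencies even when its own distribution is different:

```python
import pandas as pd

# Toy train and test sets (illustrative only)
train = pd.DataFrame({'colour': ['blue', 'blue', 'blue', 'red', 'red', 'green']})
test = pd.DataFrame({'colour': ['red', 'red', 'red', 'blue']})

# Learn the frequencies from the training set only
frequency_map = (train['colour'].value_counts() / len(train)).to_dict()

# Apply the same mapping to both sets
train_encoded = train['colour'].map(frequency_map)
test_encoded = test['colour'].map(frequency_map)

# red is 75% of the test set, but it is encoded with its training frequency (2/6)
print(test_encoded.tolist())
```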

In [4]:
X_train, X_test, y_train, y_test = train_test_split(
    data[['Neighborhood', 'Exterior1st', 'Exterior2nd']], # predictors
    data['SalePrice'],  # target
    test_size=0.3,  # percentage of obs in test set
    random_state=0)  # seed to ensure reproducibility
X_train.shape, X_test.shape

((2043, 3), (876, 3))

## Count and Frequency encoding with pandas

**Obtain the counts for each one of the labels in the variable Neighborhood!**

In [5]:
count_map = X_train['Neighborhood'].value_counts().to_dict()
count_map

{'NAmes': 321,
 'CollgCr': 184,
 'OldTown': 165,
 'Edwards': 138,
 'Somerst': 120,
 'NridgHt': 118,
 'Gilbert': 118,
 'Sawyer': 110,
 'SawyerW': 91,
 'NWAmes': 82,
 'BrkSide': 77,
 'Crawfor': 75,
 'Mitchel': 74,
 'IDOTRR': 63,
 'Timber': 52,
 'NoRidge': 51,
 'SWISU': 35,
 'StoneBr': 34,
 'ClearCr': 32,
 'MeadowV': 26,
 'BrDale': 24,
 'Blmngtn': 17,
 'NPkVill': 16,
 'Veenker': 15,
 'Blueste': 5}

The dictionary contains the number of observations per category in Neighborhood.

**Replace the labels with the counts!**

In [6]:
X_train['Neighborhood'] = X_train['Neighborhood'].map(count_map)
X_test['Neighborhood'] = X_test['Neighborhood'].map(count_map)

**Explore the result!**

In [7]:
X_train['Neighborhood'].head(10)

1448    138
1397     77
1        15
384      32
530      52
588      32
1027     52
2779    165
453     120
2057    321
Name: Neighborhood, dtype: int64

**For the frequency, we need only divide the count by the total number of observations:**

In [8]:
frequency_map = (X_train['Exterior1st'].value_counts() / len(X_train)).to_dict()
frequency_map

{'VinylSd': 0.34801762114537443,
 'HdBoard': 0.1527165932452276,
 'MetalSd': 0.15222711698482624,
 'Wd Sdng': 0.14341654429760156,
 'Plywood': 0.07293196279980421,
 'CemntBd': 0.04552129221732746,
 'BrkFace': 0.03181595692608909,
 'WdShing': 0.01957905041605482,
 'Stucco': 0.015173764072442487,
 'AsbShng': 0.014684287812041116,
 'BrkComm': 0.0014684287812041115,
 'Stone': 0.0009789525208027412,
 'AsphShn': 0.0009789525208027412}

**Replace the labels with the frequencies!**

In [9]:
X_train['Exterior1st'] = X_train['Exterior1st'].map(frequency_map)
X_test['Exterior1st'] = X_test['Exterior1st'].map(frequency_map)

We can then put these commands into 2 functions as we did in the previous 3 notebooks, and loop over all the categorical variables. If you don't know how to do this, please check any of the previous notebooks.
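
For reference, one possible shape for those 2 functions is sketched below. The function names `find_count_mappings` and `encode_with_mappings` and the toy data are assumptions made for this example, not necessarily the ones used in the previous notebooks:

```python
import pandas as pd

def find_count_mappings(df, variable, as_frequency=False):
    """Return a dict of category -> count (or frequency) learned from df."""
    return df[variable].value_counts(normalize=as_frequency).to_dict()

def encode_with_mappings(train, test, variable, mapping):
    """Replace the labels of `variable` in both sets, in place."""
    train[variable] = train[variable].map(mapping)
    test[variable] = test[variable].map(mapping)

# Toy train and test sets (illustrative only)
train = pd.DataFrame({'Neighborhood': ['NAmes', 'NAmes', 'CollgCr'],
                      'Exterior1st': ['VinylSd', 'HdBoard', 'VinylSd']})
test = pd.DataFrame({'Neighborhood': ['CollgCr', 'NAmes'],
                     'Exterior1st': ['HdBoard', 'VinylSd']})

# Loop over the categorical variables, always learning from the train set
for variable in ['Neighborhood', 'Exterior1st']:
    mapping = find_count_mappings(train, variable)
    encode_with_mappings(train, test, variable, mapping)

print(test)
```

Passing `as_frequency=True` to the first function would yield frequency encoding instead of count encoding.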

## Count or Frequency Encoding with Feature-Engine

In [16]:
# keep the variables of interest and drop rows with missing values
data1 = data[['Neighborhood', 'Exterior1st', 'Exterior2nd', 'SalePrice']].dropna()

In [17]:
X_train, X_test, y_train, y_test = train_test_split(
    data1[['Neighborhood', 'Exterior1st', 'Exterior2nd']], # predictors
    data1['SalePrice'],  # target
    test_size=0.3,  # percentage of obs in test set
    random_state=0)  # seed to ensure reproducibility
X_train.shape, X_test.shape

((1022, 3), (438, 3))

In [18]:
count_enc = CountFrequencyEncoder(
    encoding_method='count', # to do frequency ==> encoding_method='frequency'
    variables=['Neighborhood', 'Exterior1st', 'Exterior2nd'])
count_enc.fit(X_train)

CountFrequencyEncoder(variables=['Neighborhood', 'Exterior1st', 'Exterior2nd'])

**In the encoder dict we can observe the number of observations per category for each variable!**

In [19]:
count_enc.encoder_dict_

{'Neighborhood': {'NAmes': 151,
  'CollgCr': 105,
  'OldTown': 73,
  'Edwards': 71,
  'Sawyer': 61,
  'Somerst': 56,
  'Gilbert': 55,
  'NWAmes': 51,
  'NridgHt': 51,
  'SawyerW': 45,
  'BrkSide': 41,
  'Mitchel': 36,
  'Crawfor': 35,
  'Timber': 30,
  'NoRidge': 30,
  'ClearCr': 24,
  'IDOTRR': 24,
  'SWISU': 18,
  'StoneBr': 16,
  'Blmngtn': 12,
  'MeadowV': 12,
  'BrDale': 10,
  'NPkVill': 7,
  'Veenker': 6,
  'Blueste': 2},
 'Exterior1st': {'VinylSd': 364,
  'HdBoard': 153,
  'Wd Sdng': 148,
  'MetalSd': 138,
  'Plywood': 86,
  'CemntBd': 39,
  'BrkFace': 35,
  'WdShing': 21,
  'Stucco': 17,
  'AsbShng': 15,
  'Stone': 2,
  'AsphShn': 1,
  'BrkComm': 1,
  'ImStucc': 1,
  'CBlock': 1},
 'Exterior2nd': {'VinylSd': 353,
  'Wd Sdng': 142,
  'HdBoard': 141,
  'MetalSd': 136,
  'Plywood': 112,
  'CmentBd': 39,
  'Wd Shng': 29,
  'BrkFace': 18,
  'AsbShng': 17,
  'Stucco': 16,
  'ImStucc': 8,
  'Stone': 4,
  'Brk Cmn': 4,
  'AsphShn': 1,
  'CBlock': 1,
  'Other': 1}}

**Explore the result!**

In [20]:
X_train = count_enc.transform(X_train)
X_test = count_enc.transform(X_test)
X_train.head()

| | Neighborhood | Exterior1st | Exterior2nd |
|---|---|---|---|
| 64 | 105 | 364 | 353 |
| 682 | 24 | 148 | 142 |
| 960 | 41 | 148 | 112 |
| 1384 | 71 | 21 | 29 |
| 1100 | 18 | 148 | 142 |


**Note**

If the `variables` argument is left as None, the encoder will automatically identify all categorical variables. Is that not sweet?

The encoder will not encode numerical variables. So if some of your numerical variables are in fact categories, you will need to re-cast them as object before using the encoder.
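
As a rough sketch of that re-casting step, the snippet below uses pandas only. The column `MSSubClass` (numeric codes that stand for dwelling types) and the toy values are assumptions made for this example, since that column is not loaded in this notebook:

```python
import pandas as pd

# Toy data (illustrative only): MSSubClass holds numeric codes that are really categories
df = pd.DataFrame({'MSSubClass': [20, 60, 20, 70],
                   'LotArea': [8450, 9600, 11250, 9550]})

# Before the cast, no column has dtype object
before = df.select_dtypes(include='object').columns.tolist()

# Re-cast the numeric codes as object so they are treated as categories
df['MSSubClass'] = df['MSSubClass'].astype('object')

after = df.select_dtypes(include='object').columns.tolist()
print(before, after)
```

After the cast, `MSSubClass` has dtype object, which is the kind of variable the encoder is described as picking up, while `LotArea` stays numerical.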

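Note that if the test set contains a category that was not seen in the train set, the encoder has no number to assign to it. Depending on the Feature-engine version and on how the encoder is configured, it will then either raise an error or return missing values for that category.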
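
With the pandas approach, a category that is absent from the train set is silently mapped to NaN, so it can help to check for such categories explicitly. The sketch below uses made-up data, and filling with 0 is only one possible choice (an assumption for this example), read as "seen zero times in the train set":

```python
import pandas as pd

# Toy data (illustrative only): green never appears in the train set
train = pd.Series(['blue', 'blue', 'red'])
test = pd.Series(['red', 'green'])

count_map = train.value_counts().to_dict()

# Categories present in the test set but missing from the learned mapping
unseen = sorted(set(test.dropna().unique()) - set(count_map))

# map() leaves unseen categories as NaN
test_encoded = test.map(count_map)

# One option: treat unseen categories as having a count of 0
test_encoded_filled = test_encoded.fillna(0).astype(int)
print(unseen, test_encoded_filled.tolist())
```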