
Target and Count encodings for categorical features #3234

Closed
wants to merge 140 commits

Conversation


@shiyu1994 shiyu1994 commented Jul 16, 2020

This pull request adds support for converting categorical features into CTR (target) and count values. The CTR values are calculated by dividing the training data into folds, in a cross-validation style.

The performance evaluation has been added in #3234 (comment). Note that this is a basic version without the ensemble trick. We will have a separate PR to implement the ensemble trick and further boost the performance of this categorical encoding method.

Detailed descriptions and guidelines for review can be found below.

Description

Idea

LightGBM can handle categorical features internally, without any manual encoding into numerical values by users. The approach LightGBM currently uses for categorical features is described here: https://lightgbm.readthedocs.io/en/latest/Features.html#optimal-split-for-categorical-features.
However, this approach has two known drawbacks:

  1. Strong regularization is often required for this approach to be effective when the number of categories is high. The 3 hyperparameters cat_l2 (https://lightgbm.readthedocs.io/en/latest/Parameters.html#cat_l2), cat_smooth (https://lightgbm.readthedocs.io/en/latest/Parameters.html#cat_smooth) and max_cat_to_onehot (https://lightgbm.readthedocs.io/en/latest/Parameters.html#max_cat_to_onehot) need to be tuned carefully.
  2. The current approach requires sorting all categories of every categorical feature at every node, which can be slow with a large number of categories.

Inspired by CatBoost, we implement two new approaches for internal encoding of categorical features:

  1. Encode the categories by target (label) values (known as target encoding). A category value $c$ of feature $j$ is encoded as

     $$E_j(c) = \frac{\sum_{i} \mathbb{1}\{x_{i,j} = c\}\, y_i + w \cdot p}{\sum_{i} \mathbb{1}\{x_{i,j} = c\} + w}$$

     where $x_{i,j}$ is the value of feature $j$ for data $i$ in the training set, $\mathbb{1}\{\cdot\}$ is the indicator function, $y_i$ is the label for data $i$, $p$ is the prior value (which is the mean of labels over the whole dataset, by default), and $w$ is the weight for the prior value.
  2. Encode the categories by the number of appearances in the whole training data (known as count encoding), i.e.,

     $$C_j(c) = \sum_{i} \mathbb{1}\{x_{i,j} = c\}$$

For better generalization ability of the target encoding, we use a cross-validation style approach. The training data is first divided into K folds. When encoding the category values in fold k, we only use the other K-1 folds (excluding fold k) to calculate the encoded value. Thus, for each category c, we have K different encoded values under a target encoding approach.
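As a rough illustration of the scheme above, the out-of-fold target encoding could be computed like this (a minimal sketch; `TargetEncodeOutOfFold` and its signature are invented for illustration and are not LightGBM's actual API):

```cpp
#include <cmath>
#include <map>
#include <vector>

// For each data point in fold k, the encoded value of its category is built
// from label statistics accumulated over the other K-1 folds only, then
// smoothed with the prior p and prior weight w as in the formula above.
std::vector<double> TargetEncodeOutOfFold(const std::vector<int>& category,
                                          const std::vector<double>& label,
                                          const std::vector<int>& fold_of,
                                          int num_folds,
                                          double prior,
                                          double prior_weight) {
  // per-fold accumulators: label_sum[k][c] and count[k][c]
  std::vector<std::map<int, double>> label_sum(num_folds);
  std::vector<std::map<int, double>> count(num_folds);
  for (size_t i = 0; i < category.size(); ++i) {
    label_sum[fold_of[i]][category[i]] += label[i];
    count[fold_of[i]][category[i]] += 1.0;
  }
  std::vector<double> encoded(category.size());
  for (size_t i = 0; i < category.size(); ++i) {
    double s = 0.0, n = 0.0;
    for (int k = 0; k < num_folds; ++k) {
      if (k == fold_of[i]) continue;  // exclude the data point's own fold
      auto it = label_sum[k].find(category[i]);
      if (it != label_sum[k].end()) {
        s += it->second;
        n += count[k][category[i]];
      }
    }
    encoded[i] = (s + prior_weight * prior) / (n + prior_weight);
  }
  return encoded;
}
```

With K=2, 4 rows of one category split evenly across the folds, labels {1,0,1,0}, p=0.5 and w=1, every row is encoded as (1 + 0.5) / (2 + 1) = 0.5.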

New Parameters

We add a new parameter category_encoders to specify the encoding methods for categorical features. Users can specify multiple encoding methods at the same time, separated by commas. For example, with category_encoders=target:0.5,count,target, we use 3 encoding methods at the same time: target encoding with prior value p=0.5, count encoding, and target encoding with the default prior value. Each encoding method creates a new feature for each categorical feature for training. We also allow a raw encoding method, which indicates LightGBM's current approach of handling categorical features.
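For illustration, a specification string such as `target:0.5,count,target` could be split into (method, prior) pairs roughly as follows (a hypothetical sketch; `ParseCategoryEncoders` and `EncoderSpec` are invented names, not the PR's actual parser):

```cpp
#include <sstream>
#include <string>
#include <vector>

// One entry per comma-separated token of the category_encoders parameter.
struct EncoderSpec {
  std::string method;  // "target", "count", or "raw"
  double prior;        // only meaningful for "target" with an explicit prior
  bool has_prior;      // false means "use the default prior (label mean)"
};

std::vector<EncoderSpec> ParseCategoryEncoders(const std::string& spec) {
  std::vector<EncoderSpec> out;
  std::stringstream ss(spec);
  std::string token;
  while (std::getline(ss, token, ',')) {
    EncoderSpec e{token, 0.0, false};
    const size_t colon = token.find(':');
    if (colon != std::string::npos) {
      // "target:0.5" -> method "target", prior 0.5
      e.method = token.substr(0, colon);
      e.prior = std::stod(token.substr(colon + 1));
      e.has_prior = true;
    }
    out.push_back(e);
  }
  return out;
}
```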

Besides, we also add two parameters prior_weight and num_target_encoding_folds to allow users to specify the p and K in the target encoding.

Implementation

The core of this PR is a new class CategoryEncodingProvider, which is defined in https://github.com/shiyu1994/LightGBM/blob/ctr/include/LightGBM/category_encoding_provider.hpp and https://github.com/shiyu1994/LightGBM/blob/ctr/src/io/category_encoding_provider.cpp. These two files amount to 2/3 of the code changes in the PR.
CategoryEncodingProvider works in 2 steps:

  1. Takes the raw data as input (a LibSVM file, or numpy/pandas/scipy.sparse matrices), calculates the encoding values, and stores them inside its own members.
  2. Expands the raw data with an additional new feature for each categorical feature and each encoding method, before the data is fed into any of LightGBM's internal data preprocessing modules (including bin finding, exclusive feature bundling, and so on).
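Step 2 above can be sketched as follows for a single data row (an illustrative sketch; `ExpandRowWithEncodings` is an invented name, and the real code operates on whole datasets rather than one row at a time):

```cpp
#include <functional>
#include <vector>

// Before a row reaches bin finding, append one new column per
// (categorical feature, encoder) pair; original columns are untouched.
std::vector<double> ExpandRowWithEncodings(
    const std::vector<double>& raw_row,
    const std::vector<int>& cat_feature_indices,
    const std::vector<std::function<double(int, double)>>& encoders) {
  std::vector<double> extended = raw_row;  // keep the original features
  for (const auto& encode : encoders) {
    for (int fid : cat_feature_indices) {
      // encode(fid, value) maps a raw category value to its encoded value
      extended.push_back(encode(fid, raw_row[fid]));
    }
  }
  return extended;
}
```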

Since we need to support multiple data input formats, CategoryEncodingProvider is used in the corresponding functions in src/c_api.cpp that are called when processing inputs of these formats, including:

  1. For LibSVM files: LGBM_DatasetCreateFromFile.
  2. For numpy/pandas matrix: LGBM_DatasetCreateFromMat and LGBM_DatasetCreateFromMats.
  3. For row-wise sparse matrix: LGBM_DatasetCreateFromCSR.
  4. For col-wise sparse matrix: LGBM_DatasetCreateFromCSC.

}

inline double EncodeCatValueForValidation(const int fid, const double feature_value) const {
const auto& ctr_encoding = ctr_encodings_.at(fid)[config_.num_ctr_folds];

Avoid using .at(), which is slow.

@jameslamb

At your earliest convenience, can you please merge in master? Some changes have been made to our CI jobs. I apologize for the inconvenience.


guolinke commented Sep 6, 2020

@shiyu1994 can you fix the conflict?


@jameslamb jameslamb left a comment


@shiyu1994 for the R side, I see you have this error in CI:

[screenshot of the CI error]

Can you please update

{"LGBM_DatasetCreateFromCSC_R" , (DL_FUNC) &LGBM_DatasetCreateFromCSC_R , 10},
? Now that you've added two more arguments to this function, it should be

  {"LGBM_DatasetCreateFromCSC_R"      , (DL_FUNC) &LGBM_DatasetCreateFromCSC_R      , 12},


shiyu1994 commented Sep 14, 2020

Comparison between CTR and the old categorical feature approach:
We use 7 datasets with both categorical and numerical features. We tune the hyperparameters of LightGBM with HyperOpt.
The following table lists the tuning ranges for parameters shared by the CTR version and the original version.

| Parameter | Range |
| --- | --- |
| learning_rate | [e^-7, 1] |
| num_leaves | [e^1, e^7] |
| max_bin | [2^4, 2^10] |
| feature_fraction | [0.5, 1] |
| bagging_fraction | [0.5, 1] |
| min_data_in_leaf | [1, e^6] |
| min_sum_hessian_in_leaf | [e^-16, e^5] |
| lambda_l1 | [0, e^2] |
| lambda_l2 | [0, e^2] |

The hyperparameters specific to the old categorical method are

| Parameter | Range |
| --- | --- |
| cat_smooth | [1, e^8] |
| cat_l2 | [0, e^6] |
| max_cat_threshold | [0, max_cat_count / 2] |

where max_cat_count is the maximum number of category values over the categorical features in the training set.

For the CTR version, we fix the number of folds for CTR calculation to 4, and try both CTR with count and CTR only.

We use 5-fold cross-validation. For hyperparameter tuning, we allow the old categorical method 500 trials, and 200 trials each for CTR with count and CTR only. The tuning for each algorithm and each dataset is repeated 5 times (with 5 different CV fold partitions).

The following table shows the AUC on test sets:

| Method | Adult | Amazon | Appetency | Click | Internet | Kick | Upselling |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Old Categorical | 0.929203 ± 0.000331 | 0.856392 ± 0.000738 | 0.841982 ± 0.002691 | 0.720245 ± 0.000060 | 0.959978 ± 0.000445 | 0.665910 ± 0.003899 | 0.864470 ± 0.000858 |
| CTR and Count | 0.929112 ± 0.000394 | 0.863328 ± 0.001924 | 0.850926 ± 0.001570 | 0.741018 ± 0.000200 | 0.960743 ± 0.000411 | 0.644784 ± 0.009142 | 0.864755 ± 0.000763 |
| CTR only | 0.929004 ± 0.000435 | 0.852688 ± 0.002328 | 0.853207 ± 0.001834 | 0.735202 ± 0.000144 | 0.960847 ± 0.000529 | 0.656863 ± 0.002550 | 0.864416 ± 0.000881 |

// transform categorical features to encoded numerical values before the bin construction process
class CategoryEncodingProvider {
public:
class CatConverter {

Can we put CatConverter as a separate class in a separate file to reduce the complexity of CategoryEncodingProvider? It looks like we might need to access properties of CategoryEncodingProvider; in that case we can pass the CategoryEncodingProvider to the CatConverter.


Could we have separate unit tests for every converter?


virtual std::string DumpToString() const = 0;

virtual json11::Json DumpToJSONObject() const = 0;

To make sure we have a common pattern for serialization, could we use JSON as the general pattern? JSON is just a special string, so we would not need to define our own custom format.


And I believe we should not do serialization & deserialization very often?

return cat_fid_to_convert_fid_.at(cat_fid);
}

static CatConverter* CreateFromCharPointer(const char* char_pointer, size_t* used_len, double prior_weight) {

We could have a separate file for the class initialization logic.

return str_stream.str();
}

json11::Json DumpToJSONObject() const override {

It looks like this is a second serialization implementation split across the sub/base classes; could we have a pattern like the one below:

DumpToJSONObject(json11::Json context)
{
  base.DumpToJSONObject(context);

  // dump-to-JSON logic for the subclass
}
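A minimal sketch of this base-then-subclass pattern, with a `std::map` standing in for the JSON object (all names here are illustrative, not the PR's actual classes):

```cpp
#include <map>
#include <string>

// The subclass override delegates to the base class first, so base fields
// are serialized in one place; the subclass then appends or overrides its
// own fields in the shared context.
struct CatConverterBase {
  virtual ~CatConverterBase() = default;
  virtual void DumpTo(std::map<std::string, std::string>* context) const {
    (*context)["type"] = "base";
    (*context)["prior_weight"] = "1.0";
  }
};

struct TargetEncoder : CatConverterBase {
  void DumpTo(std::map<std::string, std::string>* context) const override {
    CatConverterBase::DumpTo(context);  // base fields first
    (*context)["type"] = "target";      // subclass overrides the type
    (*context)["prior"] = "0.5";        // and adds its own fields
  }
};
```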

}
};

class TargetEncoderLabelMean: public CatConverter {

For these converters, it looks like separate classes would be clearer?

}

// parameter configuration
Config config_;

It looks like we should not keep the config after initialization?

Config config_;

// size of training data
data_size_t num_data_;

For the properties below, we'd better have only one unified pair of serialization & deserialization functions?

bool accumulated_from_file_;
};

class CategoryEncodingParser : public Parser {

Using a parser as middleware looks weird; it is more like a step in pipeline processing. Is it possible to refactor the processing steps into a pipeline pattern?

/*!
* \brief Constructor for customized parser. The constructor accepts content rather than a path because we need to save/load the config along with the model string
*/
explicit Parser(std::string) {}

Having a constructor that takes a string looks weird; I'm not sure what purpose this serves in the base class?


namespace LightGBM {

CategoryEncodingProvider::CategoryEncodingProvider(Config* config) {
@tongwu-sh tongwu-sh Dec 6, 2021


Should we have one unified constructor for CategoryEncodingProvider, with the other constructors just transforming their inputs into that one? Something like:

CategoryEncodingProvider(a, b, c, d, e)

CategoryEncodingProvider(a, b, c, d) {
  CategoryEncodingProvider(a, b, c, d, default_value);
}

tmp_parser_ = nullptr;
}

std::string CategoryEncodingProvider::DumpToJSON() const {

Same as for the converter: it looks like a single unified serialization/deserialization pair would reduce a lot of complexity.

@shiyu1994

Thanks @tongwu-msft for the valuable suggestions for refactoring the code. We can have follow-up PRs for the refactor after this is merged.

@StrikerRUS

We can have follow-up PRs for the refactor after this is merged.

TBH, I'm not sure this is a good approach to adding new features in an open source project, where there are no deadlines and the work is based on collaboration among multiple people. We may end up with a lot of such "valuable suggestions for refactoring the code" that never get enough attention in the future. As a result, the whole LightGBM codebase will become less readable, unoptimized, and extremely hard for outside contributors to work on.

@shiyu1994

@StrikerRUS Thanks for the reminder. I've discussed with @tongwu-msft, and he will do the refactor and then push to this branch.

@jameslamb

I agree with @StrikerRUS 's comments in #3234 (comment). Right now, while attention is focused on this PR, is the best time to address review comments.

@guolinke

@shiyu1994 should we close this PR? I think this is hard to merge now.

@jameslamb

It's been 6 months since @guolinke asked if we should close this PR with no response, and more than a year since the most recent commit.

@shiyu1994 I'm closing this. If you'd like to continue this work, please propose a new PR and we can start a new review cycle.

@github-actions

This pull request has been automatically locked since there has not been any recent activity since it was closed.
To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues
including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 15, 2023