
Target and Count encodings for categorical features #3234

Closed
wants to merge 140 commits

Conversation


@shiyu1994 shiyu1994 commented Jul 16, 2020

This pull request adds support for converting categorical features into CTR (target) and count values. The CTR values are calculated by dividing the training data into folds, in a cross-validation style.

The performance evaluation has been added in #3234 (comment). Note that this is a basic version without the ensemble trick. We will have a separate PR to implement the ensemble trick and further boost the performance of this categorical encoding method.

Detailed descriptions and guidelines for review can be found below.

Description

Idea

LightGBM can handle categorical features internally, without any manual encoding into numerical values by users. The approach LightGBM currently uses for categorical features is described here: https://lightgbm.readthedocs.io/en/latest/Features.html#optimal-split-for-categorical-features.
However, this approach has two known drawbacks:

  1. Strong regularization is often required for this approach to be effective when the number of categories is high. The 3 hyperparameters cat_l2 (https://lightgbm.readthedocs.io/en/latest/Parameters.html#cat_l2), cat_smooth (https://lightgbm.readthedocs.io/en/latest/Parameters.html#cat_smooth) and max_cat_to_onehot (https://lightgbm.readthedocs.io/en/latest/Parameters.html#max_cat_to_onehot) need to be tuned carefully.
  2. The current approach requires sorting all categories of every categorical feature at every node, which can be slow with a large number of categories.

Inspired by CatBoost, we implement two new approaches for internal encoding of categorical features:

  1. Encode the categories by target (label) values (known as target encoding). A category value $c$ of feature $j$ is encoded as

     $$E_j(c) = \frac{\sum_{i} \mathbb{1}\{x_{i,j} = c\}\, y_i + w \cdot p}{\sum_{i} \mathbb{1}\{x_{i,j} = c\} + w}$$

     where $x_{i,j}$ is the value of feature $j$ for data $i$ in the training set, $\mathbb{1}\{\cdot\}$ is the indicator function, $y_i$ is the label for data $i$, $p$ is the prior value (which is the mean of labels over the whole dataset, by default), and $w$ is the weight for the prior value.
  2. Encode the categories by the number of appearances in the whole training data (known as count encoding), i.e.,

     $$C_j(c) = \sum_{i} \mathbb{1}\{x_{i,j} = c\}$$

For better generalization ability of the target encoding, we use a cross-validation style approach. The training data is first divided into K folds. When encoding the category values in fold k, we only use the other K-1 folds (excluding fold k) to calculate the encoded value. Thus, for each category c, we have K different encoded values under a target encoding approach.
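As a rough illustration of the scheme above, the out-of-fold target encoding could be computed like this (a minimal sketch; `TargetEncodeOutOfFold` and its signature are invented for illustration and are not LightGBM's actual API):

```cpp
#include <cmath>
#include <map>
#include <vector>

// For each data point in fold k, the encoded value of its category is built
// from label statistics accumulated over the other K-1 folds only, then
// smoothed with the prior p and prior weight w as in the formula above.
std::vector<double> TargetEncodeOutOfFold(const std::vector<int>& category,
                                          const std::vector<double>& label,
                                          const std::vector<int>& fold_of,
                                          int num_folds,
                                          double prior,
                                          double prior_weight) {
  // per-fold accumulators: label_sum[k][c] and count[k][c]
  std::vector<std::map<int, double>> label_sum(num_folds);
  std::vector<std::map<int, double>> count(num_folds);
  for (size_t i = 0; i < category.size(); ++i) {
    label_sum[fold_of[i]][category[i]] += label[i];
    count[fold_of[i]][category[i]] += 1.0;
  }
  std::vector<double> encoded(category.size());
  for (size_t i = 0; i < category.size(); ++i) {
    double s = 0.0, n = 0.0;
    for (int k = 0; k < num_folds; ++k) {
      if (k == fold_of[i]) continue;  // exclude the data point's own fold
      auto it = label_sum[k].find(category[i]);
      if (it != label_sum[k].end()) {
        s += it->second;
        n += count[k][category[i]];
      }
    }
    encoded[i] = (s + prior_weight * prior) / (n + prior_weight);
  }
  return encoded;
}
```

With K=2, 4 rows of one category split evenly across the folds, labels {1,0,1,0}, p=0.5 and w=1, every row is encoded as (1 + 0.5) / (2 + 1) = 0.5.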

New Parameters

We add a new parameter category_encoders to specify the encoding methods for categorical features. Users can specify multiple encoding methods at the same time, separated by commas. For example, with category_encoders=target:0.5,count,target, we use 3 encoding methods at the same time: target encoding with prior value p=0.5, count encoding, and target encoding with the default prior value. Each encoding method creates a new feature for each categorical feature for training. We also allow a raw encoding method, which indicates LightGBM's current approach of handling categorical features.
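For illustration, a specification string such as `target:0.5,count,target` could be split into (method, prior) pairs roughly as follows (a hypothetical sketch; `ParseCategoryEncoders` and `EncoderSpec` are invented names, not the PR's actual parser):

```cpp
#include <sstream>
#include <string>
#include <vector>

// One entry per comma-separated token of the category_encoders parameter.
struct EncoderSpec {
  std::string method;  // "target", "count", or "raw"
  double prior;        // only meaningful for "target" with an explicit prior
  bool has_prior;      // false means "use the default prior (label mean)"
};

std::vector<EncoderSpec> ParseCategoryEncoders(const std::string& spec) {
  std::vector<EncoderSpec> out;
  std::stringstream ss(spec);
  std::string token;
  while (std::getline(ss, token, ',')) {
    EncoderSpec e{token, 0.0, false};
    const size_t colon = token.find(':');
    if (colon != std::string::npos) {
      // "target:0.5" -> method "target", prior 0.5
      e.method = token.substr(0, colon);
      e.prior = std::stod(token.substr(colon + 1));
      e.has_prior = true;
    }
    out.push_back(e);
  }
  return out;
}
```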

Besides, we also add two parameters prior_weight and num_target_encoding_folds to allow users to specify the p and K in the target encoding.

Implementation

The core of this PR is a new class CategoryEncodingProvider, which is defined in https://github.com/shiyu1994/LightGBM/blob/ctr/include/LightGBM/category_encoding_provider.hpp and https://github.com/shiyu1994/LightGBM/blob/ctr/src/io/category_encoding_provider.cpp. These two files amount to 2/3 of the code changes in the PR.
CategoryEncodingProvider works in 2 steps:

  1. Takes the raw data as input (a LibSVM file, or numpy/pandas/scipy.sparse matrices), calculates the encoding values, and stores them inside its own members.
  2. Expands the raw data with an additional new feature for each categorical feature and each encoding method, before the data is fed into any of LightGBM's internal data preprocessing modules (including bin finding, exclusive feature bundling, and so on).
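Step 2 above can be sketched as follows for a single data row (an illustrative sketch; `ExpandRowWithEncodings` is an invented name, and the real code operates on whole datasets rather than one row at a time):

```cpp
#include <functional>
#include <vector>

// Before a row reaches bin finding, append one new column per
// (categorical feature, encoder) pair; original columns are untouched.
std::vector<double> ExpandRowWithEncodings(
    const std::vector<double>& raw_row,
    const std::vector<int>& cat_feature_indices,
    const std::vector<std::function<double(int, double)>>& encoders) {
  std::vector<double> extended = raw_row;  // keep the original features
  for (const auto& encode : encoders) {
    for (int fid : cat_feature_indices) {
      // encode(fid, value) maps a raw category value to its encoded value
      extended.push_back(encode(fid, raw_row[fid]));
    }
  }
  return extended;
}
```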

Since we need to support multiple data input formats, CategoryEncodingProvider is used in the corresponding functions in src/c_api.cpp that are called when processing inputs of these formats, including:

  1. For LibSVM files: LGBM_DatasetCreateFromFile.
  2. For numpy/pandas matrix: LGBM_DatasetCreateFromMat and LGBM_DatasetCreateFromMats.
  3. For row-wise sparse matrix: LGBM_DatasetCreateFromCSR.
  4. For col-wise sparse matrix: LGBM_DatasetCreateFromCSC.

}

inline double EncodeCatValueForValidation(const int fid, const double feature_value) const {
const auto& ctr_encoding = ctr_encodings_.at(fid)[config_.num_ctr_folds];

Avoid using .at(), which is slow.

@jameslamb

At your earliest convenience, can you please merge in master? Some changes have been made to our CI jobs. I apologize for the inconvenience.


guolinke commented Sep 6, 2020

@shiyu1994 can you fix the conflict?


@jameslamb jameslamb left a comment


@shiyu1994 for the R side, I see you have this error in CI:

[screenshot of the CI error]

Can you please update

{"LGBM_DatasetCreateFromCSC_R" , (DL_FUNC) &LGBM_DatasetCreateFromCSC_R , 10},
? Now that you've added two more arguments to this function, it should be

  {"LGBM_DatasetCreateFromCSC_R"      , (DL_FUNC) &LGBM_DatasetCreateFromCSC_R      , 12},


shiyu1994 commented Sep 14, 2020

Comparison between CTR and the old categorical feature approach:
We use 7 datasets with both categorical and numerical features. We tune the hyperparameters of LightGBM with HyperOpt.
The following table lists the tuning ranges for parameters shared by the CTR version and the original version.

| Parameter | Range |
| --- | --- |
| learning_rate | [e^-7, 1] |
| num_leaves | [e^1, e^7] |
| max_bin | [2^4, 2^10] |
| feature_fraction | [0.5, 1] |
| bagging_fraction | [0.5, 1] |
| min_data_in_leaf | [1, e^6] |
| min_sum_hessian_in_leaf | [e^-16, e^5] |
| lambda_l1 | [0, e^2] |
| lambda_l2 | [0, e^2] |

The hyperparameters specific to the old categorical method are

| Parameter | Range |
| --- | --- |
| cat_smooth | [1, e^8] |
| cat_l2 | [0, e^6] |
| max_cat_threshold | [0, max_cat_count / 2] |

where max_cat_count is the maximum number of category values over the categorical features in the training set.

For the CTR version, we fix the number of folds for CTR calculation to 4, and try both CTR with count and CTR only.

We use 5-fold cross-validation. For hyperparameter tuning, we allow the old categorical method 500 trials, and 200 trials each for CTR with count and CTR only. The tuning for each algorithm and each dataset is repeated 5 times (with 5 different CV fold partitions).

The following table shows the AUC on test sets:

| Method | Adult | Amazon | Appetency | Click | Internet | Kick | Upselling |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Old Categorical | 0.929203 ± 0.000331 | 0.856392 ± 0.000738 | 0.841982 ± 0.002691 | 0.720245 ± 0.000060 | 0.959978 ± 0.000445 | 0.665910 ± 0.003899 | 0.864470 ± 0.000858 |
| CTR and Count | 0.929112 ± 0.000394 | 0.863328 ± 0.001924 | 0.850926 ± 0.001570 | 0.741018 ± 0.000200 | 0.960743 ± 0.000411 | 0.644784 ± 0.009142 | 0.864755 ± 0.000763 |
| CTR only | 0.929004 ± 0.000435 | 0.852688 ± 0.002328 | 0.853207 ± 0.001834 | 0.735202 ± 0.000144 | 0.960847 ± 0.000529 | 0.656863 ± 0.002550 | 0.864416 ± 0.000881 |

// transform categorical features to encoded numerical values before the bin construction process
class CategoryEncodingProvider {
public:
class CatConverter {

Can we put CatConverter as a separate class in a separate file to reduce the complexity of CategoryEncodingProvider? It looks like we might need to access properties of CategoryEncodingProvider; in that case we can pass the CategoryEncodingProvider to the CatConverter.


Could we have separate unit tests for every converter?


virtual std::string DumpToString() const = 0;

virtual json11::Json DumpToJSONObject() const = 0;

To make sure we have a common pattern for serialization, could we use JSON as the general pattern? JSON is just a special string, so we would not need to define our own custom format.


And I believe we should not do serialization & deserialization very often?

return cat_fid_to_convert_fid_.at(cat_fid);
}

static CatConverter* CreateFromCharPointer(const char* char_pointer, size_t* used_len, double prior_weight) {

We could have a separate file for the class initialization logic.

return str_stream.str();
}

json11::Json DumpToJSONObject() const override {

It looks like this is a second serialization implementation split across the sub/base classes; could we have a pattern like the one below:

DumpToJSONObject(json11::Json context)
{
  base.DumpToJSONObject(context);

  // dump-to-JSON logic for the subclass
}
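A minimal sketch of this base-then-subclass pattern, with a `std::map` standing in for the JSON object (all names here are illustrative, not the PR's actual classes):

```cpp
#include <map>
#include <string>

// The subclass override delegates to the base class first, so base fields
// are serialized in one place; the subclass then appends or overrides its
// own fields in the shared context.
struct CatConverterBase {
  virtual ~CatConverterBase() = default;
  virtual void DumpTo(std::map<std::string, std::string>* context) const {
    (*context)["type"] = "base";
    (*context)["prior_weight"] = "1.0";
  }
};

struct TargetEncoder : CatConverterBase {
  void DumpTo(std::map<std::string, std::string>* context) const override {
    CatConverterBase::DumpTo(context);  // base fields first
    (*context)["type"] = "target";      // subclass overrides the type
    (*context)["prior"] = "0.5";        // and adds its own fields
  }
};
```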

}
};

class TargetEncoderLabelMean: public CatConverter {

For these converters, it looks like separate classes would be clearer?

}

// parameter configuration
Config config_;

It looks like we should not keep the config after initialization?

Config config_;

// size of training data
data_size_t num_data_;

For the properties below, we'd better have only one unified pair of serialization & deserialization functions?

bool accumulated_from_file_;
};

class CategoryEncodingParser : public Parser {

Using a parser as middleware looks weird; it is more like a step in pipeline processing. Is it possible to refactor the processing steps into a pipeline pattern?

/*!
* \brief Constructor for customized parser. The constructor accepts content rather than a path because we need to save/load the config along with the model string
*/
explicit Parser(std::string) {}

Having a constructor that takes a string looks weird; I'm not sure what purpose this serves in the base class?


namespace LightGBM {

CategoryEncodingProvider::CategoryEncodingProvider(Config* config) {
@tongwu-sh tongwu-sh Dec 6, 2021


Should we have one unified constructor for CategoryEncodingProvider, with the other constructors just transforming their inputs into that one? Something like:

CategoryEncodingProvider(a, b, c, d, e)

CategoryEncodingProvider(a, b, c, d) {
  CategoryEncodingProvider(a, b, c, d, default_value);
}

tmp_parser_ = nullptr;
}

std::string CategoryEncodingProvider::DumpToJSON() const {

Same as for the converter: it looks like a single unified serialization/deserialization pair would reduce a lot of complexity.

@shiyu1994

Thanks @tongwu-msft for the valuable suggestions for refactoring the code. We can have follow-up PRs for the refactor after this is merged.

@StrikerRUS

We can have follow-up PRs for the refactor after this is merged.

TBH, I'm not sure this is a good approach to adding new features in an open source project, where there are no deadlines and the work is based on collaboration among multiple people. We may end up with a lot of such "valuable suggestions for refactoring the code" that never get enough attention in the future. As a result, the whole LightGBM codebase will become less readable, unoptimized, and extremely hard for outside contributors to work on.

@shiyu1994

@StrikerRUS Thanks for the reminder. I've discussed with @tongwu-msft, and he will do the refactor and then push to this branch.

@jameslamb

I agree with @StrikerRUS 's comments in #3234 (comment). Right now, while attention is focused on this PR, is the best time to address review comments.

@guolinke

@shiyu1994 should we close this PR? I think this is hard to merge now.

@jameslamb

It's been 6 months since @guolinke asked if we should close this PR with no response, and more than a year since the most recent commit.

@shiyu1994 I'm closing this. If you'd like to continue this work, please propose a new PR and we can start a new review cycle.

@github-actions

This pull request has been automatically locked since there has not been any recent activity since it was closed.
To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues
including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 15, 2023