Target and Count encodings for categorical features #3234

Closed
wants to merge 140 commits into from
Changes from 3 commits
Commits
3d3d899
support categorical feature converters (CTR and count)
shiyu1994 Sep 7, 2020
02fe28b
merge master into CTR
shiyu1994 Sep 7, 2020
da9a051
sort categorical features in ctr provider (for distributed learning)
shiyu1994 Sep 8, 2020
b8b5747
store convert feature index map in model for prediction
shiyu1994 Sep 10, 2020
5eda890
fix R interface
shiyu1994 Sep 14, 2020
788dc77
Merge branch 'master' into ctr
shiyu1994 Sep 14, 2020
c9ded18
add prior weight
shiyu1994 Sep 17, 2020
dc0a7e2
Merge branch 'ctr' of https://github.com/shiyu1994/LightGBM into ctr
shiyu1994 Sep 17, 2020
271197a
used fold wise prior for training data
shiyu1994 Sep 27, 2020
7cfcd9a
Merge branch 'master' into ctr
shiyu1994 Sep 27, 2020
0d23835
recover prior_weight parameter
shiyu1994 Sep 27, 2020
e631b28
fix memory
shiyu1994 Oct 21, 2020
f25d9e9
remove useless code
shiyu1994 Oct 21, 2020
26ef4b9
fix CTRProvider copy constructor
shiyu1994 Oct 21, 2020
d12ead5
make copy constructor public
shiyu1994 Oct 21, 2020
d6eba9f
copy data pushing functions in copy constructor of CTRProvider
shiyu1994 Oct 21, 2020
f630ce9
fix copy constructor of CTRProvider
shiyu1994 Oct 21, 2020
cf728af
add destructor for CTRProvider
shiyu1994 Oct 21, 2020
19e0153
fix memory leak in CTRProvider of GBDT
shiyu1994 Oct 22, 2020
1e4a923
fix CatConverter Copy function
shiyu1994 Oct 28, 2020
abd6990
add ctr file parser
shiyu1994 Oct 29, 2020
205cc60
remove code changes in dataset
shiyu1994 Nov 18, 2020
ec17a68
wrap data reading functions with CTRProvider
shiyu1994 Nov 19, 2020
e8ec5f0
use ctr csc col iterators
shiyu1994 Nov 22, 2020
ffb3eb1
pad zeros when CTR is used
shiyu1994 Nov 25, 2020
a3ae6ab
wrap parser and row functions for prediction with CTRProvider
shiyu1994 Nov 26, 2020
9b6a157
rename parser.h in include/LightGBM to parser_base.h
shiyu1994 Nov 26, 2020
391c2eb
remove changes in src/main.cpp
shiyu1994 Nov 26, 2020
702c72f
merge LightGBM/master into shiyu1994/ctr
shiyu1994 Nov 26, 2020
a9cddcb
merge src/c_api.cpp from LightGBM/master
shiyu1994 Nov 26, 2020
e5db275
change data_size_t to INDEX_T in TextReader::SampleFromFileWithIndices
shiyu1994 Nov 26, 2020
32344ca
add virtual destructor for CTRProvider::CatConverter
shiyu1994 Nov 26, 2020
7558207
mark SetPrior as override for CatConverter sub-classes
shiyu1994 Nov 26, 2020
28206d3
replace keep_old_cat_method with raw cat_converters option
shiyu1994 Nov 27, 2020
dc70eec
check label function exists after extracting cat converters
shiyu1994 Nov 27, 2020
34ec34e
check can load from binary before constructing ctr provider
shiyu1994 Nov 30, 2020
49c78be
add train.conf before switching branch
shiyu1994 Nov 30, 2020
03f5f9b
update data create function calls in c_api_test/test_.py
shiyu1994 Dec 2, 2020
066e9e9
add simple ctr test
shiyu1994 Dec 2, 2020
dbdb028
fix doc for LGBM_DatasetCreate functions
shiyu1994 Dec 2, 2020
c17a9c6
add doc string for c_float_label in basic.py
shiyu1994 Dec 3, 2020
af3f30b
fix c++ linter problems
shiyu1994 Dec 3, 2020
392d1e2
fix doc string indent for the description of cat_converters in python
shiyu1994 Dec 3, 2020
0916948
fix c++ linter problems and regenerate Parameters.rst
shiyu1994 Dec 3, 2020
3d69501
fix explicit constructor issue
shiyu1994 Dec 3, 2020
d9c6fd0
use count lines when constructing CTRProvider from file
shiyu1994 Dec 22, 2020
c304fac
Merge branch 'master' into ctr
shiyu1994 Dec 22, 2020
08aaf8d
fix corner case bug when subcol is used
shiyu1994 Dec 23, 2020
cee63dc
Merge branch 'ctr' of https://github.com/shiyu1994/LightGBM into ctr
shiyu1994 Dec 23, 2020
84e9951
fix training data fold id bug
shiyu1994 Dec 24, 2020
8aa1146
Merge with LightGBM/master
shiyu1994 Dec 24, 2020
fe3f608
merge master
shiyu1994 Jan 10, 2021
3bef245
remove redundant LGBM_BoosterGetLinear in c_api.h
shiyu1994 Jan 10, 2021
8431ebf
add support of CTR for R package
shiyu1994 Jan 13, 2021
2bf9a69
use template to reduce redundant code
shiyu1994 Jan 18, 2021
74d80d5
remove label type and avoid strange warnings when using categorical feat…
shiyu1994 Jan 20, 2021
fa7bb6b
merge master
shiyu1994 Jan 20, 2021
c1c0839
fix python test
shiyu1994 Jan 20, 2021
255f546
fix R test
shiyu1994 Jan 20, 2021
5595c36
fix python Dataset _extract_categorical_info_from_params
shiyu1994 Jan 21, 2021
b5f9302
test where the R test failed (will remove this commit)
shiyu1994 Jan 21, 2021
a4caf2c
replace array with vector when converting R label
shiyu1994 Jan 21, 2021
3a3ef94
test raw cat_converters
shiyu1994 Jan 21, 2021
6cd5cf1
check null of label in R Dataset
shiyu1994 Jan 21, 2021
eb2b5d3
remove R test for CTR
shiyu1994 Jan 21, 2021
8b237db
remove R support for CTR
shiyu1994 Jan 22, 2021
fac5a9d
remove changes in install.libs.R
shiyu1994 Jan 22, 2021
db1eb1b
fix R label type for CTR construction (can be integer)
shiyu1994 Jan 22, 2021
267058b
recover R tests for CTR
shiyu1994 Jan 22, 2021
10402f3
save prior_weight_ in model of CTRProvider
shiyu1994 Jan 22, 2021
742d4aa
check R test output
shiyu1994 Jan 22, 2021
f53824c
directly check CTR output
shiyu1994 Jan 22, 2021
aceb122
use prior_weight_ instead of config_.prior_weight when recovering CTR…
shiyu1994 Jan 22, 2021
abb20b4
try wrong ctr string
shiyu1994 Jan 22, 2021
b08507e
try wrong ctr string
shiyu1994 Jan 22, 2021
2bf9228
test that features are expanded with CTR
shiyu1994 Jan 22, 2021
b434784
skip tests due to CTR inconsistency across platforms
shiyu1994 Jan 22, 2021
fd6d8db
accumulate CTR statistics when sampling from file
shiyu1994 Jan 26, 2021
15b8bb0
Merge branch 'master' into ctr
shiyu1994 Jan 26, 2021
10c97e8
dynamic adjustment of num_original_features_ when load CTRProvider fr…
shiyu1994 Jan 26, 2021
5382b8b
add test for CTR in multi-class tasks
shiyu1994 Jan 26, 2021
0eeef3f
fix the case when categorical_feature is empty but cat_converters is …
shiyu1994 Jan 26, 2021
fd39a9f
rearrange headers in alphabetic order
shiyu1994 Jan 26, 2021
2cfbbfc
fix linter problems
shiyu1994 Jan 26, 2021
ae060c8
Merge branch 'master' into ctr
shiyu1994 Jan 26, 2021
c724fb8
fix two round loading from file
shiyu1994 Jan 26, 2021
d2c9b32
fix two round loading from file
shiyu1994 Jan 26, 2021
94aae86
clean-up code and fix load file bug
shiyu1994 Jan 27, 2021
595a2a4
Merge branch 'master' into ctr
shiyu1994 Jan 27, 2021
e2531f3
fix python linter problem
shiyu1994 Jan 27, 2021
b2260cd
Merge branch 'master' of https://github.com/microsoft/LightGBM into ctr
shiyu1994 Feb 2, 2021
dd41228
rename cat_converters to category_encoders and ctr to target encoding
shiyu1994 Apr 16, 2021
55920c2
remove model files in tests/python_package_test
shiyu1994 Apr 16, 2021
cad72eb
Merge branch 'master' into ctr
shiyu1994 Apr 16, 2021
18ffb3a
Add description for category_encoders
shiyu1994 Apr 16, 2021
e5379e8
Fix format
shiyu1994 Apr 16, 2021
fd6fa1e
remove redundant session name
shiyu1994 Apr 16, 2021
a544a16
Apply suggestions from code review
shiyu1994 Apr 22, 2021
b416916
Add CommonC::UnorderedMapToString
shiyu1994 Apr 22, 2021
a25f323
improve the format of CategoryEncodingProvider and add tests for vali…
shiyu1994 May 25, 2021
cdd96f7
merge master into ctr
shiyu1994 May 25, 2021
75a7fdd
fix R interface
shiyu1994 May 26, 2021
66826a2
increment model version and add backward compatibility
shiyu1994 May 26, 2021
60daa2d
merge LightGBM/master into ctr
shiyu1994 Oct 29, 2021
2d00fad
remove useless files
shiyu1994 Oct 29, 2021
45b208e
use multi_error instead of multi_logloss for test for stability
shiyu1994 Oct 29, 2021
e9fd7db
remove comment
shiyu1994 Oct 29, 2021
39f08de
document that add_features_from cannot be used with non-default categ…
shiyu1994 Oct 29, 2021
2ed6a84
add support for category_encoders with monotone constraints
shiyu1994 Nov 1, 2021
e139fd1
ignore category_encoders when labels are not provided
shiyu1994 Nov 1, 2021
e84a330
add check for interaction constraints range in CategoryEncodingProvider
shiyu1994 Nov 9, 2021
17c3f29
comment new category encoding tests
shiyu1994 Nov 9, 2021
51a43e3
pull LightGBM/master into shiyu1994/ctr
shiyu1994 Nov 9, 2021
47edf6d
check whether force splits are specified
shiyu1994 Nov 9, 2021
a24da60
comment out CheckForcedSplitsForCategoryEncoding
shiyu1994 Nov 9, 2021
5f2aac2
recover gbdt.cpp
shiyu1994 Nov 9, 2021
f997fe0
change return type of CategoryEncodingProvider::CheckForcedSplitsForC…
shiyu1994 Nov 10, 2021
67bf9f8
merge master into ctr
shiyu1994 Nov 10, 2021
1116879
remove useless file
shiyu1994 Nov 10, 2021
6f50504
fix python linter errors
shiyu1994 Nov 10, 2021
4e7d8ba
remove white space
shiyu1994 Nov 10, 2021
a4ec01f
keep old C APIs
shiyu1994 Nov 15, 2021
936526d
Merge remote-tracking branch 'shiyu1994/ctr' into ctr
shiyu1994 Nov 15, 2021
4b95942
add blank lines
shiyu1994 Nov 15, 2021
500a426
test only no label data creation APIs in tests/c_api_test/test_.py
shiyu1994 Nov 15, 2021
cc58f19
test data creation APIs in tests/c_api_test/test_.py with label
shiyu1994 Nov 15, 2021
3d23c3b
fix return statements
shiyu1994 Nov 15, 2021
773c59b
fix solaris compatibility
shiyu1994 Nov 15, 2021
9dad619
fix lint errors
shiyu1994 Nov 15, 2021
6510a34
fix solaris compatibility
shiyu1994 Nov 15, 2021
b8d6ff3
fix linter error
shiyu1994 Nov 15, 2021
9fec031
fix conflict label_t definition with Solaris in c_api.cpp
shiyu1994 Nov 16, 2021
1bf6e94
fix return statement
shiyu1994 Nov 16, 2021
bbfc279
Merge remote-tracking branch 'LightGBM/master' into ctr
shiyu1994 Nov 16, 2021
f7370d2
fix conflicts with LightGBM/master
shiyu1994 Nov 16, 2021
fab88db
resolve R code issues
shiyu1994 Nov 17, 2021
880fbf8
Merge branch 'master' of https://github.com/microsoft/LightGBM into ctr
shiyu1994 Nov 17, 2021
8431411
merge master
shiyu1994 Dec 7, 2021
bcea451
merge master
shiyu1994 Dec 7, 2021
3e91f68
add more comments for CategoryEncodingProvider::SyncEncodingStat
shiyu1994 Dec 7, 2021
2 changes: 2 additions & 0 deletions R-package/R/lgb.Dataset.R
Original file line number Diff line number Diff line change
@@ -209,6 +209,7 @@ Dataset <- R6::R6Class(
"LGBM_DatasetCreateFromMat_R"
, ret = handle
, private$raw_data
+, self$getinfo("label")
, nrow(private$raw_data)
, ncol(private$raw_data)
, params_str
@@ -226,6 +227,7 @@ Dataset <- R6::R6Class(
, private$raw_data@p
, private$raw_data@i
, private$raw_data@x
+, self$getinfo("label")
, length(private$raw_data@p)
, length(private$raw_data@x)
, nrow(private$raw_data)
7 changes: 5 additions & 2 deletions R-package/src/lightgbm_R.cpp
@@ -72,6 +72,7 @@ LGBM_SE LGBM_DatasetCreateFromFile_R(LGBM_SE filename,
LGBM_SE LGBM_DatasetCreateFromCSC_R(LGBM_SE indptr,
LGBM_SE indices,
LGBM_SE data,
+LGBM_SE label,
LGBM_SE num_indptr,
LGBM_SE nelem,
LGBM_SE num_row,
@@ -83,13 +84,14 @@ LGBM_SE LGBM_DatasetCreateFromCSC_R(LGBM_SE indptr,
const int* p_indptr = R_INT_PTR(indptr);
const int* p_indices = R_INT_PTR(indices);
const double* p_data = R_REAL_PTR(data);
+const double* p_label = R_REAL_PTR(label);

int64_t nindptr = static_cast<int64_t>(R_AS_INT(num_indptr));
int64_t ndata = static_cast<int64_t>(R_AS_INT(nelem));
int64_t nrow = static_cast<int64_t>(R_AS_INT(num_row));
DatasetHandle handle = nullptr;
CHECK_CALL(LGBM_DatasetCreateFromCSC(p_indptr, C_API_DTYPE_INT32, p_indices,
-p_data, C_API_DTYPE_FLOAT64, nindptr, ndata,
+p_data, p_label, C_API_DTYPE_FLOAT64, C_API_DTYPE_FLOAT64, nindptr, ndata,
nrow, R_CHAR_PTR(parameters), R_GET_PTR(reference), &handle));
R_SET_PTR(out, handle);
R_API_END();
@@ -106,8 +108,9 @@ LGBM_SE LGBM_DatasetCreateFromMat_R(LGBM_SE data,
int32_t nrow = static_cast<int32_t>(R_AS_INT(num_row));
int32_t ncol = static_cast<int32_t>(R_AS_INT(num_col));
double* p_mat = R_REAL_PTR(data);
+const double* p_label = R_REAL_PTR(label);
DatasetHandle handle = nullptr;
-CHECK_CALL(LGBM_DatasetCreateFromMat(p_mat, C_API_DTYPE_FLOAT64, nrow, ncol, COL_MAJOR,
+CHECK_CALL(LGBM_DatasetCreateFromMat(p_mat, p_label, C_API_DTYPE_FLOAT64, C_API_DTYPE_FLOAT64, nrow, ncol, COL_MAJOR,
R_CHAR_PTR(parameters), R_GET_PTR(reference), &handle));
R_SET_PTR(out, handle);
R_API_END();
18 changes: 10 additions & 8 deletions include/LightGBM/boosting.h
@@ -131,10 +131,10 @@ class LIGHTGBM_EXPORT Boosting {
* \param output Prediction result for this record
* \param early_stop Early stopping instance. If nullptr, no early stopping is applied and all models are evaluated.
*/
-virtual void PredictRaw(const double* features, double* output,
+virtual void PredictRaw(double* features, double* output,
const PredictionEarlyStopInstance* early_stop) const = 0;

-virtual void PredictRawByMap(const std::unordered_map<int, double>& features, double* output,
+virtual void PredictRawByMap(std::unordered_map<int, double>& features, double* output,
const PredictionEarlyStopInstance* early_stop) const = 0;


@@ -144,10 +144,10 @@ class LIGHTGBM_EXPORT Boosting {
* \param output Prediction result for this record
* \param early_stop Early stopping instance. If nullptr, no early stopping is applied and all models are evaluated.
*/
-virtual void Predict(const double* features, double* output,
+virtual void Predict(double* features, double* output,
const PredictionEarlyStopInstance* early_stop) const = 0;

-virtual void PredictByMap(const std::unordered_map<int, double>& features, double* output,
+virtual void PredictByMap(std::unordered_map<int, double>& features, double* output,
const PredictionEarlyStopInstance* early_stop) const = 0;


@@ -157,19 +157,19 @@ class LIGHTGBM_EXPORT Boosting {
* \param output Prediction result for this record
*/
virtual void PredictLeafIndex(
-const double* features, double* output) const = 0;
+double* features, double* output) const = 0;

virtual void PredictLeafIndexByMap(
-const std::unordered_map<int, double>& features, double* output) const = 0;
+std::unordered_map<int, double>& features, double* output) const = 0;

/*!
* \brief Feature contributions for the model's prediction of one record
* \param feature_values Feature value on this record
* \param output Prediction result for this record
*/
-virtual void PredictContrib(const double* features, double* output) const = 0;
+virtual void PredictContrib(double* features, double* output) const = 0;

-virtual void PredictContribByMap(const std::unordered_map<int, double>& features,
+virtual void PredictContribByMap(std::unordered_map<int, double>& features,
std::vector<std::unordered_map<int, double>>* output) const = 0;

/*!
@@ -295,6 +295,8 @@ class LIGHTGBM_EXPORT Boosting {
*/
virtual const char* SubModelName() const = 0;

+virtual int num_extra_features() const { return 0; }
+
Boosting() = default;
/*! \brief Disable copy */
Boosting& operator=(const Boosting&) = delete;
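
Note on the hunks above: the prediction entry points lose `const` on their feature buffers, and `num_extra_features()` reports how many columns a model adds beyond the raw input. One plausible reading, which is an assumption because the implementation is not part of the diff shown here, is that encoded values are written into the caller's row in place. A minimal sketch of that idea follows; `ExpandRowInPlace`, the buffer layout, and the count tables are hypothetical and are not LightGBM code.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical helper, not part of LightGBM.
// Assumed layout: `features` holds num_original raw values followed by one
// spare slot per categorical column. Each spare slot is filled with the
// training-set count of the category found in the matching raw column.
void ExpandRowInPlace(double* features, int num_original,
                      const std::vector<int>& cat_columns,
                      const std::vector<std::unordered_map<int, double>>& counts) {
  assert(cat_columns.size() == counts.size());
  for (std::size_t i = 0; i < cat_columns.size(); ++i) {
    const int category = static_cast<int>(features[cat_columns[i]]);
    const auto it = counts[i].find(category);
    // Categories never seen in training get a count of zero.
    features[num_original + static_cast<int>(i)] =
        (it == counts[i].end()) ? 0.0 : it->second;
  }
}
```

Under this reading, a caller would have to allocate `num_original + num_extra_features()` slots per row, which is why a `const double*` input would no longer be sufficient.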
20 changes: 20 additions & 0 deletions include/LightGBM/c_api.h
@@ -28,6 +28,7 @@ typedef void* FastConfigHandle; /*!< \brief Handle of FastConfig. */
#define C_API_DTYPE_FLOAT64 (1) /*!< \brief float64 (double precision float). */
#define C_API_DTYPE_INT32 (2) /*!< \brief int32. */
#define C_API_DTYPE_INT64 (3) /*!< \brief int64. */
+#define C_API_DTYPE_NONE (4) /*!< \brief None. */

#define C_API_PREDICT_NORMAL (0) /*!< \brief Normal prediction, with transform (if needed). */
#define C_API_PREDICT_RAW_SCORE (1) /*!< \brief Predict raw score. */
@@ -161,7 +162,9 @@ LIGHTGBM_C_EXPORT int LGBM_DatasetCreateFromCSR(const void* indptr,
int indptr_type,
const int32_t* indices,
const void* data,
+const void* label,
int data_type,
+int label_type,
int64_t nindptr,
int64_t nelem,
int64_t num_col,
@@ -206,7 +209,9 @@ LIGHTGBM_C_EXPORT int LGBM_DatasetCreateFromCSC(const void* col_ptr,
int col_ptr_type,
const int32_t* indices,
const void* data,
+const void* label,
int data_type,
+int label_type,
int64_t ncol_ptr,
int64_t nelem,
int64_t num_row,
@@ -227,7 +232,9 @@ LIGHTGBM_C_EXPORT int LGBM_DatasetCreateFromCSC(const void* col_ptr,
* \return 0 when succeed, -1 when failure happens
*/
LIGHTGBM_C_EXPORT int LGBM_DatasetCreateFromMat(const void* data,
+const void* label,
int data_type,
+int label_type,
int32_t nrow,
int32_t ncol,
int is_row_major,
@@ -250,7 +257,9 @@ LIGHTGBM_C_EXPORT int LGBM_DatasetCreateFromMat(const void* data,
*/
LIGHTGBM_C_EXPORT int LGBM_DatasetCreateFromMats(int32_t nmat,
const void** data,
+const void* label,
int data_type,
+int label_type,
int32_t* nrow,
int32_t ncol,
int is_row_major,
@@ -389,6 +398,15 @@ LIGHTGBM_C_EXPORT int LGBM_DatasetGetNumData(DatasetHandle handle,
LIGHTGBM_C_EXPORT int LGBM_DatasetGetNumFeature(DatasetHandle handle,
int* out);

+/*!
+* \brief Get number of original features.
+* \param handle Handle of dataset
+* \param[out] out The address to hold number of original features
+* \return 0 when succeed, -1 when failure happens
+*/
+LIGHTGBM_C_EXPORT int LGBM_DatasetGetNumOriginalFeature(DatasetHandle handle,
+int* out);
+
/*!
* \brief Add features from ``source`` to ``target``.
* \param target The handle of the dataset to add features to
@@ -422,6 +440,8 @@ LIGHTGBM_C_EXPORT int LGBM_BoosterCreateFromModelfile(const char* filename,
int* out_num_iterations,
BoosterHandle* out);

+LIGHTGBM_C_EXPORT int LGBM_PrintCTRStatus(const DatasetHandle dataset);
+
/*!
* \brief Load an existing booster from string.
* \param model_str Model string
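
Note on the `c_api.h` hunks above: the dataset-creation functions now take an untyped `label` pointer paired with a `label_type` code, and `C_API_DTYPE_NONE` presumably marks the case where no label is supplied. The sketch below shows one way such a pair could be decoded into a float vector. It is illustrative only: `ReadLabels` and `CopyAs` are hypothetical names, the constants are local copies rather than the real macros, and the value 0 for float32 is taken from upstream `c_api.h` because it is not visible in this hunk.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Local stand-ins for the C_API_DTYPE_* macros (assumed values, see note above).
constexpr int kDtypeFloat32 = 0;
constexpr int kDtypeFloat64 = 1;
constexpr int kDtypeInt32 = 2;
constexpr int kDtypeInt64 = 3;
constexpr int kDtypeNone = 4;  // mirrors the new C_API_DTYPE_NONE

// Reinterpret the untyped buffer as T and convert each element to float.
template <typename T>
std::vector<float> CopyAs(const void* label, int num_data) {
  const T* typed = static_cast<const T*>(label);
  std::vector<float> out(num_data);
  for (int i = 0; i < num_data; ++i) out[i] = static_cast<float>(typed[i]);
  return out;
}

// Hypothetical helper: returns an empty vector when no label is supplied.
std::vector<float> ReadLabels(const void* label, int label_type, int num_data) {
  if (label == nullptr || label_type == kDtypeNone) return {};
  switch (label_type) {
    case kDtypeFloat32: return CopyAs<float>(label, num_data);
    case kDtypeFloat64: return CopyAs<double>(label, num_data);
    case kDtypeInt32: return CopyAs<std::int32_t>(label, num_data);
    case kDtypeInt64: return CopyAs<std::int64_t>(label, num_data);
    default:
      assert(false && "unknown label_type");
      return {};
  }
}
```

Treating "null pointer" and "type code NONE" as the same condition keeps callers that have no label at hand from having to fabricate a dummy buffer.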
20 changes: 20 additions & 0 deletions include/LightGBM/config.h
@@ -970,6 +970,26 @@ struct Config {

#pragma endregion

+#pragma region CTR Parameters
+
+// desc = ways to convert categorical features, currently supports:
+// desc = ctr[:prior], where prior is a real number used to smooth the calculation of CTR values
+// desc = count, the count of the categorical feature value in the dataset
+// desc = for example, "ctr:0.5,ctr:0.0,count" will convert each categorical feature into 3 numerical features, with the 3 different ways separated by ','.
+std::string cat_converters = "";
+
+// desc = whether to keep the original feature values after the dataset is constructed
+// desc = if set to false, then once the dataset is constructed, cat_converters cannot be changed through parameters in the train method
+bool keep_raw_cat_data = false;

// desc = number of folds that training data is divided into, to calculate ctr values
int num_ctr_folds = 4;

// desc = whether to use the old categorical handling
bool keep_old_cat_method = false;

#pragma endregion

#pragma endregion
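
Note on `cat_converters`: the description above defines a small grammar, a comma-separated list whose items are either `ctr`, `ctr:<prior>`, or `count`. A minimal parsing sketch under that reading is given below. `ConverterSpec` and `ParseCatConverters` are hypothetical names used for illustration; the PR's own parser is not shown in this diff and may accept more forms.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical representation of one parsed converter; not LightGBM's own type.
struct ConverterSpec {
  std::string kind;        // "ctr" or "count"
  bool has_prior = false;  // true only when "ctr:<prior>" was given
  double prior = 0.0;
};

// Split on ',' and interpret each item as "ctr", "ctr:<prior>", or "count".
std::vector<ConverterSpec> ParseCatConverters(const std::string& text) {
  std::vector<ConverterSpec> specs;
  std::stringstream stream(text);
  std::string token;
  while (std::getline(stream, token, ',')) {
    if (token.empty()) continue;  // tolerate stray commas
    ConverterSpec spec;
    const std::size_t colon = token.find(':');
    spec.kind = token.substr(0, colon);
    if (spec.kind == "ctr") {
      if (colon != std::string::npos) {
        spec.has_prior = true;
        spec.prior = std::stod(token.substr(colon + 1));
      }
    } else if (spec.kind != "count" || colon != std::string::npos) {
      throw std::invalid_argument("unknown converter: " + token);
    }
    specs.push_back(spec);
  }
  return specs;
}
```

With this sketch, a string such as `ctr:0.5,count` yields two specs, so each categorical feature would expand into two numerical features.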

size_t file_load_progress_interval_bytes = size_t(10) * 1024 * 1024 * 1024;
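
Note on `num_ctr_folds` and the `prior`: the parameters above describe a smoothed target statistic computed over folds of the training data. The exact formula is not part of this diff, so the sketch below uses a common form, `(label_sum + prior * prior_weight) / (count + prior_weight)`, and computes it out of fold so that a row's own label never contributes to its encoding. Everything here is an assumption made for illustration: the function name, the round-robin fold assignment (`row % num_folds`), and the formula itself may differ from what the PR implements.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Running label sum and row count for one category value.
struct CategoryStat {
  double label_sum = 0.0;
  double count = 0.0;
};

// Out-of-fold smoothed target encoding for a single categorical column.
// Assumed fold assignment: row i belongs to fold i % num_folds.
std::vector<double> TargetEncodeOutOfFold(const std::vector<int>& categories,
                                          const std::vector<double>& labels,
                                          int num_folds, double prior,
                                          double prior_weight) {
  assert(categories.size() == labels.size());
  assert(num_folds >= 2 && prior_weight > 0.0);
  const std::size_t n = categories.size();
  const std::size_t folds = static_cast<std::size_t>(num_folds);
  // Accumulate per-fold and overall statistics for every category.
  std::vector<std::unordered_map<int, CategoryStat>> fold_stats(folds);
  std::unordered_map<int, CategoryStat> total;
  for (std::size_t i = 0; i < n; ++i) {
    CategoryStat& in_fold = fold_stats[i % folds][categories[i]];
    in_fold.label_sum += labels[i];
    in_fold.count += 1.0;
    CategoryStat& overall = total[categories[i]];
    overall.label_sum += labels[i];
    overall.count += 1.0;
  }
  std::vector<double> encoded(n);
  for (std::size_t i = 0; i < n; ++i) {
    const CategoryStat& overall = total[categories[i]];
    const CategoryStat& in_fold = fold_stats[i % folds][categories[i]];
    // Remove the row's own fold, then smooth towards the prior.
    const double sum = overall.label_sum - in_fold.label_sum;
    const double count = overall.count - in_fold.count;
    encoded[i] = (sum + prior * prior_weight) / (count + prior_weight);
  }
  return encoded;
}
```

For example, with a single category, labels `{1, 0, 1, 0}`, two folds, `prior = 0.5` and `prior_weight = 2`, rows in fold 0 see only the two zero labels of fold 1 and are encoded as `(0 + 1) / (2 + 2) = 0.25`, while rows in fold 1 are encoded as `(2 + 1) / (2 + 2) = 0.75`.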