Target and Count encodings for categorical features #3234

Closed
wants to merge 140 commits into from

140 commits
3d3d899
support categorical feature converters (CTR and count)
shiyu1994 Sep 7, 2020
02fe28b
merge master into CTR
shiyu1994 Sep 7, 2020
da9a051
sort categorical features in ctr provider (for distributed learning)
shiyu1994 Sep 8, 2020
b8b5747
store convert feature index map in model for prediction
shiyu1994 Sep 10, 2020
5eda890
fix R interface
shiyu1994 Sep 14, 2020
788dc77
Merge branch 'master' into ctr
shiyu1994 Sep 14, 2020
c9ded18
add prior weight
shiyu1994 Sep 17, 2020
dc0a7e2
Merge branch 'ctr' of https://github.com/shiyu1994/LightGBM into ctr
shiyu1994 Sep 17, 2020
271197a
used fold wise prior for training data
shiyu1994 Sep 27, 2020
7cfcd9a
Merge branch 'master' into ctr
shiyu1994 Sep 27, 2020
0d23835
recover prior_weight parameter
shiyu1994 Sep 27, 2020
e631b28
fix memory
shiyu1994 Oct 21, 2020
f25d9e9
remove useless code
shiyu1994 Oct 21, 2020
26ef4b9
fix CTRProvider copy constructor
shiyu1994 Oct 21, 2020
d12ead5
make copy constructor public
shiyu1994 Oct 21, 2020
d6eba9f
copy data pushing functions in copy constructor of CTRProvider
shiyu1994 Oct 21, 2020
f630ce9
fix copy constructor of CTRProvider
shiyu1994 Oct 21, 2020
cf728af
add destructor for CTRProvider
shiyu1994 Oct 21, 2020
19e0153
fix memory leak in CTRProvider of GBDT
shiyu1994 Oct 22, 2020
1e4a923
fix CatConverter Copy function
shiyu1994 Oct 28, 2020
abd6990
add ctr file parser
shiyu1994 Oct 29, 2020
205cc60
remove code changes in dataset
shiyu1994 Nov 18, 2020
ec17a68
wrap data reading functions with CTRProvider
shiyu1994 Nov 19, 2020
e8ec5f0
use ctr csc col iterators
shiyu1994 Nov 22, 2020
ffb3eb1
pad zeros when CTR is used
shiyu1994 Nov 25, 2020
a3ae6ab
wrap parser and row functions for prediction with CTRProvider
shiyu1994 Nov 26, 2020
9b6a157
rename parser.h in include/LightGBM to parser_base.h
shiyu1994 Nov 26, 2020
391c2eb
remove changes in src/main.cpp
shiyu1994 Nov 26, 2020
702c72f
merge LightGBM/master into shiyu1994/ctr
shiyu1994 Nov 26, 2020
a9cddcb
merge src/c_api.cpp from LightGBM/master
shiyu1994 Nov 26, 2020
e5db275
change data_size_t to INDEX_T in TextReader::SampleFromFileWithIndices
shiyu1994 Nov 26, 2020
32344ca
add virtual destructor for CTRProvider::CatConverter
shiyu1994 Nov 26, 2020
7558207
mark SetPrior as override for CatConverter sub-classes
shiyu1994 Nov 26, 2020
28206d3
replace keep_old_cat_method with raw cat_converters option
shiyu1994 Nov 27, 2020
dc70eec
check label function exists after extracting cat converters
shiyu1994 Nov 27, 2020
34ec34e
check can load from binary before constructing ctr provider
shiyu1994 Nov 30, 2020
49c78be
add train.conf before switching branch
shiyu1994 Nov 30, 2020
03f5f9b
update data create function calls in c_api_test/test_.py
shiyu1994 Dec 2, 2020
066e9e9
add simple ctr test
shiyu1994 Dec 2, 2020
dbdb028
fix doc for LGBM_DatasetCreate functions
shiyu1994 Dec 2, 2020
c17a9c6
add doc string for c_float_label in basic.py
shiyu1994 Dec 3, 2020
af3f30b
fix c++ linter problems
shiyu1994 Dec 3, 2020
392d1e2
fix doc string indent for the description of cat_converters in python
shiyu1994 Dec 3, 2020
0916948
fix c++ linter problems and regenerate Parameters.rst
shiyu1994 Dec 3, 2020
3d69501
fix explicit constructor issue
shiyu1994 Dec 3, 2020
d9c6fd0
use count lines when constructing CTRProvider from file
shiyu1994 Dec 22, 2020
c304fac
Merge branch 'master' into ctr
shiyu1994 Dec 22, 2020
08aaf8d
fix corner case bug when subcol is used
shiyu1994 Dec 23, 2020
cee63dc
Merge branch 'ctr' of https://github.com/shiyu1994/LightGBM into ctr
shiyu1994 Dec 23, 2020
84e9951
fix training data fold id bug
shiyu1994 Dec 24, 2020
8aa1146
Merge with LightGBM/master
shiyu1994 Dec 24, 2020
fe3f608
merge master
shiyu1994 Jan 10, 2021
3bef245
remove redundant LGBM_BoosterGetLinear in c_api.h
shiyu1994 Jan 10, 2021
8431ebf
add support of CTR for R package
shiyu1994 Jan 13, 2021
2bf9a69
use template to reduce redundant code
shiyu1994 Jan 18, 2021
74d80d5
remove label type and avoid strange warns when using categorical feat…
shiyu1994 Jan 20, 2021
fa7bb6b
merge master
shiyu1994 Jan 20, 2021
c1c0839
fix python test
shiyu1994 Jan 20, 2021
255f546
fix R test
shiyu1994 Jan 20, 2021
5595c36
fix python Dataset _extract_categorical_info_from_params
shiyu1994 Jan 21, 2021
b5f9302
test where the R test failed (will remove this commit)
shiyu1994 Jan 21, 2021
a4caf2c
replace array with vector when converting R label
shiyu1994 Jan 21, 2021
3a3ef94
test raw cat_converters
shiyu1994 Jan 21, 2021
6cd5cf1
check null of label in R Dataset
shiyu1994 Jan 21, 2021
eb2b5d3
remove R test for CTR
shiyu1994 Jan 21, 2021
8b237db
remove R support for CTR
shiyu1994 Jan 22, 2021
fac5a9d
remove changes in install.libs.R
shiyu1994 Jan 22, 2021
db1eb1b
fix R label type for CTR construction (can be integer)
shiyu1994 Jan 22, 2021
267058b
recover R tests for CTR
shiyu1994 Jan 22, 2021
10402f3
save prior_weight_ in model of CTRProvider
shiyu1994 Jan 22, 2021
742d4aa
check R test output
shiyu1994 Jan 22, 2021
f53824c
directly check CTR output
shiyu1994 Jan 22, 2021
aceb122
use prior_weight_ instead of config_.prior_weight when recovering CTR…
shiyu1994 Jan 22, 2021
abb20b4
try wrong ctr string
shiyu1994 Jan 22, 2021
b08507e
try wrong ctr string
shiyu1994 Jan 22, 2021
2bf9228
test that features are expanded with CTR
shiyu1994 Jan 22, 2021
b434784
skip tests due to CTR inconsistency across platforms
shiyu1994 Jan 22, 2021
fd6d8db
accumulate CTR statistics when sampling from file
shiyu1994 Jan 26, 2021
15b8bb0
Merge branch 'master' into ctr
shiyu1994 Jan 26, 2021
10c97e8
dynamic adjustment of num_original_features_ when load CTRProvider fr…
shiyu1994 Jan 26, 2021
5382b8b
add test for CTR in multi-class tasks
shiyu1994 Jan 26, 2021
0eeef3f
fix the case when categorical_feature is empty but cat_converters is …
shiyu1994 Jan 26, 2021
fd39a9f
rearrange headers in alphabetic order
shiyu1994 Jan 26, 2021
2cfbbfc
fix linter problems
shiyu1994 Jan 26, 2021
ae060c8
Merge branch 'master' into ctr
shiyu1994 Jan 26, 2021
c724fb8
fix two round loading from file
shiyu1994 Jan 26, 2021
d2c9b32
fix two round loading from file
shiyu1994 Jan 26, 2021
94aae86
clean-up code and fix load file bug
shiyu1994 Jan 27, 2021
595a2a4
Merge branch 'master' into ctr
shiyu1994 Jan 27, 2021
e2531f3
fix python linter problem
shiyu1994 Jan 27, 2021
b2260cd
Merge branch 'master' of https://github.com/microsoft/LightGBM into ctr
shiyu1994 Feb 2, 2021
dd41228
rename cat_converters to category_encoders and ctr to target encoding
shiyu1994 Apr 16, 2021
55920c2
remove model files in tests/python_package_test
shiyu1994 Apr 16, 2021
cad72eb
Merge branch 'master' into ctr
shiyu1994 Apr 16, 2021
18ffb3a
Add description for category_encoders
shiyu1994 Apr 16, 2021
e5379e8
Fix format
shiyu1994 Apr 16, 2021
fd6fa1e
remove redundant session name
shiyu1994 Apr 16, 2021
a544a16
Apply suggestions from code review
shiyu1994 Apr 22, 2021
b416916
Add CommonC::UnorderedMapToString
shiyu1994 Apr 22, 2021
a25f323
improve the format of CategoryEncodingProvider and add tests for vali…
shiyu1994 May 25, 2021
cdd96f7
merge master into ctr
shiyu1994 May 25, 2021
75a7fdd
fix R interface
shiyu1994 May 26, 2021
66826a2
increment model version and add backward compatibility
shiyu1994 May 26, 2021
60daa2d
merge LightGBM/master into ctr
shiyu1994 Oct 29, 2021
2d00fad
remove useless files
shiyu1994 Oct 29, 2021
45b208e
use multi_error instead of multi_logloss for test stability
shiyu1994 Oct 29, 2021
e9fd7db
remove comment
shiyu1994 Oct 29, 2021
39f08de
document that add_features_from cannot be used with non-default categ…
shiyu1994 Oct 29, 2021
2ed6a84
add support for category_encoders with monotone constraints
shiyu1994 Nov 1, 2021
e139fd1
ignore category_encoders when labels are not provided
shiyu1994 Nov 1, 2021
e84a330
add check for interaction constraints range in CategoryEncodingProvider
shiyu1994 Nov 9, 2021
17c3f29
comment new category encoding tests
shiyu1994 Nov 9, 2021
51a43e3
pull LightGBM/master into shiyu1994/ctr
shiyu1994 Nov 9, 2021
47edf6d
check whether force splits are specified
shiyu1994 Nov 9, 2021
a24da60
comment out CheckForcedSplitsForCategoryEncoding
shiyu1994 Nov 9, 2021
5f2aac2
recover gbdt.cpp
shiyu1994 Nov 9, 2021
f997fe0
change return type of CategoryEncodingProvider::CheckForcedSplitsForC…
shiyu1994 Nov 10, 2021
67bf9f8
merge master into ctr
shiyu1994 Nov 10, 2021
1116879
remove useless file
shiyu1994 Nov 10, 2021
6f50504
fix python linter errors
shiyu1994 Nov 10, 2021
4e7d8ba
remove white space
shiyu1994 Nov 10, 2021
a4ec01f
keep old C APIs
shiyu1994 Nov 15, 2021
936526d
Merge remote-tracking branch 'shiyu1994/ctr' into ctr
shiyu1994 Nov 15, 2021
4b95942
add blank lines
shiyu1994 Nov 15, 2021
500a426
test only no label data creation APIs in tests/c_api_test/test_.py
shiyu1994 Nov 15, 2021
cc58f19
test data creation APIs in tests/c_api_test/test_.py with label
shiyu1994 Nov 15, 2021
3d23c3b
fix return statements
shiyu1994 Nov 15, 2021
773c59b
fix solaris compatibility
shiyu1994 Nov 15, 2021
9dad619
fix lint errors
shiyu1994 Nov 15, 2021
6510a34
fix solaris compatibility
shiyu1994 Nov 15, 2021
b8d6ff3
fix linter error
shiyu1994 Nov 15, 2021
9fec031
fix conflict label_t definition with Solaris in c_api.cpp
shiyu1994 Nov 16, 2021
1bf6e94
fix return statement
shiyu1994 Nov 16, 2021
bbfc279
Merge remote-tracking branch 'LightGBM/master' into ctr
shiyu1994 Nov 16, 2021
f7370d2
fix conflicts with LightGBM/master
shiyu1994 Nov 16, 2021
fab88db
resolve R code issues
shiyu1994 Nov 17, 2021
880fbf8
Merge branch 'master' of https://github.com/microsoft/LightGBM into ctr
shiyu1994 Nov 17, 2021
8431411
merge master
shiyu1994 Dec 7, 2021
bcea451
merge master
shiyu1994 Dec 7, 2021
3e91f68
add more comments for CategoryEncodingProvider::SyncEncodingStat
shiyu1994 Dec 7, 2021
1 change: 1 addition & 0 deletions R-package/R/aliases.R
@@ -30,6 +30,7 @@
, "use_missing"
, "weight_column"
, "zero_as_missing"
, "category_encoders"
)])
}

14 changes: 12 additions & 2 deletions R-package/R/lgb.Dataset.R
@@ -195,7 +195,6 @@ Dataset <- R6::R6Class(
}

} else {

# Check if more categorical features were output over the feature space
if (max(private$categorical_feature) > length(private$colnames)) {
stop(
@@ -249,18 +248,28 @@
)

} else if (is.matrix(private$raw_data)) {

if (is.null(private$info[["label"]])) {
label <- NULL
} else {
label <- as.numeric(private$info[["label"]])
}
# Are we using a matrix?
handle <- .Call(
LGBM_DatasetCreateFromMat_R
, private$raw_data
, label
, nrow(private$raw_data)
, ncol(private$raw_data)
, params_str
, ref_handle
)

} else if (methods::is(private$raw_data, "dgCMatrix")) {
if (is.null(private$info[["label"]])) {
label <- NULL
} else {
label <- as.numeric(private$info[["label"]])
}
if (length(private$raw_data@p) > 2147483647L) {
stop("Cannot support large CSC matrix")
}
@@ -270,6 +279,7 @@
, private$raw_data@p
, private$raw_data@i
, private$raw_data@x
, label
, length(private$raw_data@p)
, length(private$raw_data@x)
, nrow(private$raw_data)
1 change: 1 addition & 0 deletions R-package/src/Makevars.in
@@ -29,6 +29,7 @@ OBJECTS = \
io/bin.o \
io/config.o \
io/config_auto.o \
io/category_encoding_provider.o \
io/dataset.o \
io/dataset_loader.o \
io/file_io.o \
1 change: 1 addition & 0 deletions R-package/src/Makevars.win.in
@@ -30,6 +30,7 @@ OBJECTS = \
io/bin.o \
io/config.o \
io/config_auto.o \
io/category_encoding_provider.o \
io/dataset.o \
io/dataset_loader.o \
io/file_io.o \
34 changes: 29 additions & 5 deletions R-package/src/lightgbm_R.cpp
@@ -145,6 +145,7 @@ SEXP LGBM_DatasetCreateFromFile_R(SEXP filename,
SEXP LGBM_DatasetCreateFromCSC_R(SEXP indptr,
SEXP indices,
SEXP data,
SEXP label,
SEXP num_indptr,
SEXP nelem,
SEXP num_row,
@@ -155,17 +156,28 @@ SEXP LGBM_DatasetCreateFromCSC_R(SEXP indptr,
const int* p_indptr = INTEGER(indptr);
const int* p_indices = INTEGER(indices);
const double* p_data = REAL(data);
const double* p_label = Rf_isNull(label) ? nullptr : REAL(label);
int64_t nindptr = static_cast<int64_t>(Rf_asInteger(num_indptr));
int64_t ndata = static_cast<int64_t>(Rf_asInteger(nelem));
int64_t nrow = static_cast<int64_t>(Rf_asInteger(num_row));
const char* parameters_ptr = CHAR(PROTECT(Rf_asChar(parameters)));
const float* float_p_label = nullptr;
std::vector<float> float_label_vec;
if (p_label != nullptr) {
float_label_vec.resize(nrow);
#pragma omp parallel for schedule(static) if (nrow >= 1024)
for (int i = 0; i < nrow; ++i) {
float_label_vec[i] = static_cast<float>(p_label[i]);
}
float_p_label = float_label_vec.data();
}
DatasetHandle handle = nullptr;
DatasetHandle ref = nullptr;
if (!Rf_isNull(reference)) {
ref = R_ExternalPtrAddr(reference);
}
CHECK_CALL(LGBM_DatasetCreateFromCSC(p_indptr, C_API_DTYPE_INT32, p_indices,
p_data, C_API_DTYPE_FLOAT64, nindptr, ndata,
CHECK_CALL(LGBM_DatasetCreateFromCSCWithLabel(p_indptr, C_API_DTYPE_INT32, p_indices,
p_data, float_p_label, C_API_DTYPE_FLOAT64, nindptr, ndata,
nrow, parameters_ptr, ref, &handle));
R_SetExternalPtrAddr(ret, handle);
R_RegisterCFinalizerEx(ret, _DatasetFinalizer, TRUE);
Expand All @@ -175,6 +187,7 @@ SEXP LGBM_DatasetCreateFromCSC_R(SEXP indptr,
}

SEXP LGBM_DatasetCreateFromMat_R(SEXP data,
SEXP label,
SEXP num_row,
SEXP num_col,
SEXP parameters,
@@ -184,13 +197,24 @@ SEXP LGBM_DatasetCreateFromMat_R(SEXP data,
int32_t nrow = static_cast<int32_t>(Rf_asInteger(num_row));
int32_t ncol = static_cast<int32_t>(Rf_asInteger(num_col));
double* p_mat = REAL(data);
double* p_label = Rf_isNull(label) ? nullptr : REAL(label);
const float* float_p_label = nullptr;
std::vector<float> float_label_vec;
if (p_label != nullptr) {
float_label_vec.resize(nrow);
#pragma omp parallel for schedule(static) if (nrow >= 1024)
for (int i = 0; i < nrow; ++i) {
float_label_vec[i] = static_cast<float>(p_label[i]);
}
float_p_label = float_label_vec.data();
}
const char* parameters_ptr = CHAR(PROTECT(Rf_asChar(parameters)));
DatasetHandle handle = nullptr;
DatasetHandle ref = nullptr;
if (!Rf_isNull(reference)) {
ref = R_ExternalPtrAddr(reference);
}
CHECK_CALL(LGBM_DatasetCreateFromMat(p_mat, C_API_DTYPE_FLOAT64, nrow, ncol, COL_MAJOR,
CHECK_CALL(LGBM_DatasetCreateFromMatWithLabel(p_mat, float_p_label, C_API_DTYPE_FLOAT64, nrow, ncol, COL_MAJOR,
parameters_ptr, ref, &handle));
R_SetExternalPtrAddr(ret, handle);
R_RegisterCFinalizerEx(ret, _DatasetFinalizer, TRUE);
@@ -926,8 +950,8 @@ SEXP LGBM_DumpParamAliases_R() {
static const R_CallMethodDef CallEntries[] = {
{"LGBM_HandleIsNull_R" , (DL_FUNC) &LGBM_HandleIsNull_R , 1},
{"LGBM_DatasetCreateFromFile_R" , (DL_FUNC) &LGBM_DatasetCreateFromFile_R , 3},
{"LGBM_DatasetCreateFromCSC_R" , (DL_FUNC) &LGBM_DatasetCreateFromCSC_R , 8},
{"LGBM_DatasetCreateFromMat_R" , (DL_FUNC) &LGBM_DatasetCreateFromMat_R , 5},
{"LGBM_DatasetCreateFromCSC_R" , (DL_FUNC) &LGBM_DatasetCreateFromCSC_R , 9},
{"LGBM_DatasetCreateFromMat_R" , (DL_FUNC) &LGBM_DatasetCreateFromMat_R , 6},
{"LGBM_DatasetGetSubset_R" , (DL_FUNC) &LGBM_DatasetGetSubset_R , 4},
{"LGBM_DatasetSetFeatureNames_R" , (DL_FUNC) &LGBM_DatasetSetFeatureNames_R , 2},
{"LGBM_DatasetGetFeatureNames_R" , (DL_FUNC) &LGBM_DatasetGetFeatureNames_R , 1},
4 changes: 4 additions & 0 deletions R-package/src/lightgbm_R.h
@@ -46,6 +46,7 @@ LIGHTGBM_C_EXPORT SEXP LGBM_DatasetCreateFromFile_R(
* \param indptr pointer to row headers
* \param indices findex
* \param data fvalue
* \param label label
* \param num_indptr number of cols in the matrix + 1
* \param nelem number of nonzero elements in the matrix
* \param num_row number of rows
@@ -57,6 +58,7 @@ LIGHTGBM_C_EXPORT SEXP LGBM_DatasetCreateFromCSC_R(
SEXP indptr,
SEXP indices,
SEXP data,
SEXP label,
SEXP num_indptr,
SEXP nelem,
SEXP num_row,
@@ -67,6 +69,7 @@ LIGHTGBM_C_EXPORT SEXP LGBM_DatasetCreateFromCSC_R(
/*!
* \brief create Dataset from dense matrix
* \param data matrix data
* \param label label
* \param num_row number of rows
* \param num_col number columns
* \param parameters additional parameters
@@ -75,6 +78,7 @@ LIGHTGBM_C_EXPORT SEXP LGBM_DatasetCreateFromCSC_R(
*/
LIGHTGBM_C_EXPORT SEXP LGBM_DatasetCreateFromMat_R(
SEXP data,
SEXP label,
SEXP num_row,
SEXP num_col,
SEXP parameters,
107 changes: 107 additions & 0 deletions R-package/tests/testthat/test_basic.R
@@ -2235,6 +2235,113 @@ test_that(paste0("lgb.train() gives same results when using interaction_constrai

})

test_that("Category encoding for R package works", {
# test category_encoders
set.seed(1L)
dtrain <- lgb.Dataset(train$data, label = train$label)
dtest <- lgb.Dataset(test$data, label = test$label, reference = dtrain)
cat_fid <- c(1L, 2L, 3L, 4L)
# ``` category_encoders = "" ``` is equal to ``` category_encoders = "raw" ```
params <- list(objective = "binary", categorical_feature = cat_fid, category_encoders = "")
bst <- lightgbm(
data = dtrain
, params = params
, nrounds = 10L
, verbose = 2L
, valids = list("valid1" = dtest)
)
pred1 <- bst$predict(test$data)

# treat the first 4 features as categorical features
dtrain <- lgb.Dataset(
train$data
, label = train$label
, categorical_feature = cat_fid
, category_encoders = "raw"
)
dtest <- lgb.Dataset(
test$data
, label = test$label
, categorical_feature = cat_fid
, reference = dtrain
)
params <- list(objective = "binary")
bst <- lightgbm(
data = dtrain
, params = params
, nrounds = 10L
, verbose = 2L
, valids = list("valid1" = dtest)
)
pred2 <- bst$predict(test$data)
expect_equal(pred1, pred2)

dtrain <- lgb.Dataset(
train$data
, label = train$label
, categorical_feature = cat_fid
)
dtest <- lgb.Dataset(
test$data
, label = test$label
, categorical_feature = cat_fid
, reference = dtrain
)
params <- list(objective = "binary", category_encoders = "target,count,raw")
bst <- lightgbm(
data = dtrain
, params = params
, nrounds = 10L
, verbose = 2L
, valids = list("valid1" = dtest)
)
pred3 <- bst$predict(test$data)
# one new "count" and "target" feature is added per categorical feature
num_new_cat_features <- length(cat_fid) * 2L
expect_equal(dim(dtrain), c(nrow(train$data), ncol(train$data) + num_new_cat_features))

# test gbdt model with category_encoders
model_file <- tempfile(fileext = ".model")
lgb.save(bst, model_file)
# finalize the booster and destroy it so you know we aren't cheating
bst$finalize()
expect_null(bst$.__enclos_env__$private$handle)
rm(bst)

bst2 <- lgb.load(
filename = model_file
)
pred4 <- predict(bst2, test$data)
expect_equal(pred3, pred4)


# test Dataset binary store with category_encoders
tmp_file <- tempfile(pattern = "lgb.Dataset_Category_Encoding_")
lgb.Dataset.save(
dataset = dtrain
, fname = tmp_file
)
dtrain_read_in <- lgb.Dataset(data = tmp_file)

tmp_file <- tempfile(pattern = "lgb.Dataset_Category_Encoding_2_")
lgb.Dataset.save(
dataset = dtest
, fname = tmp_file
)
dtest_read_in <- lgb.Dataset(data = tmp_file)

bst <- lightgbm(
data = dtrain_read_in
, params = params
, nrounds = 10L
, verbose = 2L
, valids = list("valid1" = dtest_read_in)
)
pred5 <- bst$predict(test$data)
expect_equal(pred3, pred5)
})


context("monotone constraints")

.generate_trainset_for_monotone_constraints_tests <- function(x3_to_categorical) {
1 change: 1 addition & 0 deletions R-package/tests/testthat/test_dataset.R
@@ -235,6 +235,7 @@ test_that("lgb.Dataset: Dataset should be able to construct from matrix and retu
handle <- .Call(
LGBM_DatasetCreateFromMat_R
, rawData
, NULL
, nrow(rawData)
, ncol(rawData)
, lightgbm:::lgb.params2str(params = list())
44 changes: 36 additions & 8 deletions docs/Advanced-Topics.rst
@@ -15,22 +15,50 @@ Missing Value Handle
Categorical Feature Support
---------------------------

- LightGBM offers good accuracy with integer-encoded categorical features. LightGBM applies
`Fisher (1958) <https://www.tandfonline.com/doi/abs/10.1080/01621459.1958.10501479>`_
to find the optimal split over categories as
`described here <./Features.rst#optimal-split-for-categorical-features>`_. This often performs better than one-hot encoding.
- LightGBM offers good accuracy with integer-encoded categorical features. It provides the following approaches for dealing with categorical features:

- Method 1: Apply `Fisher (1958) <https://www.tandfonline.com/doi/abs/10.1080/01621459.1958.10501479>`__ to find the optimal split over categories as `described here <./Features.rst#optimal-split-for-categorical-features>`__.

- Method 2: Encode categorical features as numerical values. LightGBM provides two encoding options:

- **Target encoding**: encode each categorical feature value by the mean of the labels of training data points with that feature value. The encoding easily overfits if the encoded value for a training data point uses that point's own label, so LightGBM randomly divides the training data into folds and, when calculating the target encoding for data in one fold, only considers data in the other folds.

- **Count encoding**: encode each categorical feature value by the number of training data points with that feature value.

These methods often perform better than one-hot encoding.

- Use ``categorical_feature`` to specify the categorical features.
Refer to the parameter ``categorical_feature`` in `Parameters <./Parameters.rst#categorical_feature>`__.

- Categorical features must be encoded as non-negative integers (``int``) less than ``Int32.MaxValue`` (2147483647).
It is best to use a contiguous range of integers started from zero.

- Use ``min_data_per_group``, ``cat_smooth`` to deal with over-fitting (when ``#data`` is small or ``#category`` is large).
- Use ``category_encoders`` to specify the methods used to deal with categorical features. Use

- ``raw`` to indicate method 1.

- ``target[:prior]`` to indicate target encoding in method 2. ``prior`` is a real number used to smooth the encoded values, which are calculated as ``(sum_label + prior * prior_weight) / (count + prior_weight)``. Here ``sum_label`` is the sum of the labels of training data points with the same categorical feature value, ``count`` is the number of training data points with that feature value (i.e. the count encoding value), and ``prior_weight`` is a hyper-parameter. If ``prior`` is omitted, the mean of all training labels is used as the default prior. A small worked sketch of this calculation appears after this list.

- ``count`` to indicate count encoding in method 2.

Note that these methods can be used simultaneously; separate different methods with commas.
For example, ``category_encoders=target:0.5,target,count,raw`` enables splits with method 1 and, in addition, converts each categorical feature into 3 numerical features: the first uses target encoding with prior ``0.5``, the second uses target encoding with the default prior (the mean of the training labels), and the third uses count encoding.
When ``category_encoders`` is empty, ``raw`` is used by default. The number and names of features change when ``category_encoders`` is not ``raw``.
Suppose the original name of a feature is ``NAME``; the naming rules for its target and count encoding features are:

- For the encoder ``target`` (without a user-specified prior), the new feature is named ``NAME_label_mean_prior_target_encoding_<label_mean>``, where ``<label_mean>`` is the mean of all labels in the training set.

- For the encoder ``target:<prior>`` (with a user-specified prior), the new feature is named ``NAME_target_encoding_<prior>``.

- For the encoder ``count``, the new feature is named ``NAME_count_encoding``.

Use ``get_feature_name()`` of the Python Booster class or ``feature_name()`` of the Python Dataset class after training to get the actual feature names used when ``category_encoders`` is set.

- Use ``num_target_encoding_folds`` to specify the number of folds into which the training data is divided when using target encoding.

- Use ``prior_weight`` to specify the weight of the prior in the target encoding calculation. A higher value enforces stronger regularization on the target encoding.

- For a categorical feature with high cardinality (``#category`` is large), it often works best to
treat the feature as numeric, either by simply ignoring the categorical interpretation of the integers or
by embedding the categories in a low-dimensional numeric space.
- When using method 1 (in other words, when ``raw`` is enabled in ``category_encoders``), use ``min_data_per_group`` and ``cat_smooth`` to deal with over-fitting (when ``#data`` is small or ``#category`` is large).
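
- For illustration, the following is a minimal sketch (not code from this PR) of how the fold-wise target encoding described above could be computed in R. The function name ``target_encode``, its arguments, and the default ``prior_weight`` of ``2.0`` are illustrative assumptions, not part of the proposed API.

  .. code-block:: r

     # Sketch of out-of-fold target encoding with a smoothing prior.
     # `x` is an integer-encoded categorical vector, `y` is the label vector, and
     # `folds` assigns each row to a fold (as `num_target_encoding_folds` would).
     target_encode <- function(x, y, folds, prior = mean(y), prior_weight = 2.0) {
       enc <- numeric(length(x))
       for (k in unique(folds)) {
         in_fold <- folds == k
         # statistics come only from the *other* folds, to limit overfitting
         sum_label <- tapply(y[!in_fold], x[!in_fold], sum)
         count <- tapply(y[!in_fold], x[!in_fold], length)
         key <- as.character(x[in_fold])
         num <- sum_label[key] + prior * prior_weight
         den <- count[key] + prior_weight
         # a category unseen in the other folds falls back to the prior
         num[is.na(num)] <- prior * prior_weight
         den[is.na(den)] <- prior_weight
         enc[in_fold] <- num / den
       }
       enc
     }

     # Worked example: a category seen 3 times in the other folds with labels 1, 0, 1,
     # prior = 0.5 and prior_weight = 2 encodes to (2 + 0.5 * 2) / (3 + 2) = 0.6.

  Count encoding simply corresponds to the ``count`` statistic above, used directly as the feature value.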

LambdaRank
----------