Refine categorical features (#993)
* many fixes for categorical features

* add l2 to categorical split.

* remove useless file

* update version

* add cat_l2

* update appveyor version

* remove file

* fix tests.

* change default cat_l2 value

* fix a bug in bin finder

* change default cat_smooth_ratio
guolinke committed Oct 16, 2017
1 parent c44507b commit eadc7b9
Showing 11 changed files with 71 additions and 70 deletions.
4 changes: 2 additions & 2 deletions R-package/DESCRIPTION
@@ -1,8 +1,8 @@
Package: lightgbm
Type: Package
Title: Light Gradient Boosting Machine
Version: 2.0.8
Date: 2017-10-14
Version: 2.0.9
Date: 2017-10-15
Author: Guolin Ke <guolin.ke@microsoft.com>
Maintainer: Guolin Ke <guolin.ke@microsoft.com>
Description: LightGBM is a gradient boosting framework that uses tree based learning algorithms.
2 changes: 1 addition & 1 deletion VERSION.txt
@@ -1 +1 @@
2.0.8
2.0.9
2 changes: 1 addition & 1 deletion appveyor.yml
@@ -1,4 +1,4 @@
version: 2.0.7.{build}
version: 2.0.9.{build}

configuration: # A trick to construct a build matrix
- 3.5
2 changes: 1 addition & 1 deletion docs/Advanced-Topics.rst
@@ -24,7 +24,7 @@ Categorical Feature Support
- Converting to ``int`` type is needed first, and only non-negative numbers are supported.
  It is better to convert them into continuous ranges.

- Use ``max_cat_group``, ``cat_smooth_ratio`` to deal with over-fitting
- Use ``min_data_per_group``, ``cat_smooth_ratio`` to deal with over-fitting
(when ``#data`` is small or ``#category`` is large).

- For categorical features with high cardinality (``#category`` is large), it is better to convert it to numerical features.
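[Editor's note] A minimal usage sketch of the over-fitting controls mentioned above, assuming the Python package and the parameter names touched by this commit; the data and values are illustrative, not recommendations.

import numpy as np
import lightgbm as lgb

# Toy data: column 0 is an integer-coded categorical feature.
X = np.column_stack([np.random.randint(0, 50, 1000),
                     np.random.rand(1000)])
y = np.random.rand(1000)
train_set = lgb.Dataset(X, label=y, categorical_feature=[0])

params = {
    'objective': 'regression',
    'min_data_per_group': 100,  # new default in this commit (was 10)
    'cat_smooth_ratio': 0.01,
    'cat_l2': 1.0,              # L2 term added by this commit
}
booster = lgb.train(params, train_set, num_boost_round=10)
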
17 changes: 6 additions & 11 deletions docs/Parameters.rst
@@ -249,18 +249,11 @@ Learning Control Parameters

- only used in ``goss``, the retain ratio of small gradient data

- ``max_cat_group``, default=\ ``64``, type=int

- use for the categorical features

- when ``#category`` is large, finding the split point on it easily over-fits.
So LightGBM merges them into ``max_cat_group`` groups, and finds the split points on the group boundaries

- ``min_data_per_group``, default=\ ``10``, type=int
- ``min_data_per_group``, default=\ ``100``, type=int

- min number of data per categorical group

- ``max_cat_threshold``, default=\ ``256``, type=int
- ``max_cat_threshold``, default=\ ``128``, type=int

- use for the categorical features

@@ -272,7 +265,7 @@ Learning Control Parameters

- refer to the description of the parameter ``cat_smooth_ratio``

- ``max_cat_smooth``, default=\ ``100``, type=double
- ``max_cat_smooth``, default=\ ``50``, type=double

- use for the categorical features

@@ -286,7 +279,9 @@ Learning Control Parameters

- the smooth denominator is ``a = min(max_cat_smooth, max(min_cat_smooth, num_data / num_category * cat_smooth_ratio))``

- the smooth numerator is ``b = a * sum_gradient / sum_hessian``
- ``cat_l2``, default=\ ``1``, type=double

- L2 regularization in categorical split

IO Parameters
-------------
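[Editor's note] The smoothing denominator above is easy to misread, so here is a tiny sketch that just evaluates the documented formula with this commit's defaults (``cat_smooth_ratio=0.01``, ``min_cat_smooth=5``, ``max_cat_smooth=50``); the function name is made up.

def cat_smooth_denominator(num_data, num_category, cat_smooth_ratio=0.01,
                           min_cat_smooth=5.0, max_cat_smooth=50.0):
    # a = min(max_cat_smooth, max(min_cat_smooth,
    #         num_data / num_category * cat_smooth_ratio))
    return min(max_cat_smooth,
               max(min_cat_smooth, num_data / num_category * cat_smooth_ratio))

print(cat_smooth_denominator(100000, 20))  # 5000 * 0.01 = 50.0, hits the cap
print(cat_smooth_denominator(1000, 100))   # 10 * 0.01 = 0.1, floored to 5.0
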
10 changes: 5 additions & 5 deletions include/LightGBM/config.h
@@ -225,12 +225,12 @@ struct TreeConfig: public ConfigBase {
int gpu_device_id = -1;
/*! \brief Set to true to use double precision math on GPU (default using single precision) */
bool gpu_use_dp = false;
int max_cat_group = 64;
int min_data_per_group = 10;
int max_cat_threshold = 256;
int min_data_per_group = 100;
int max_cat_threshold = 128;
double cat_smooth_ratio = 0.01;
double cat_l2 = 1;
double min_cat_smooth = 5;
double max_cat_smooth = 100;
double max_cat_smooth = 50;
LIGHTGBM_EXPORT void Set(const std::unordered_map<std::string, std::string>& params) override;
};

@@ -473,7 +473,7 @@ struct ParameterAlias {
"max_conflict_rate", "poisson_max_delta_step", "gaussian_eta",
"histogram_pool_size", "output_freq", "is_provide_training_metric", "machine_list_filename", "machines",
"zero_as_missing", "init_score_file", "valid_init_score_file", "is_predict_contrib",
"max_cat_threshold", "max_cat_group", "cat_smooth_ratio", "min_cat_smooth", "max_cat_smooth", "min_data_per_group"
"max_cat_threshold", "cat_smooth_ratio", "min_cat_smooth", "max_cat_smooth", "min_data_per_group", "cat_l2"
});
std::unordered_map<std::string, std::string> tmp_map;
for (const auto& pair : *params) {
3 changes: 3 additions & 0 deletions python-package/lightgbm/basic.py
@@ -620,6 +620,9 @@ def _lazy_init(self, data, label=None, max_bin=255, reference=None,
if data is None:
self.handle = None
return
if reference is not None:
self.pandas_categorical = reference.pandas_categorical
categorical_feature = reference.categorical_feature
data, feature_name, categorical_feature, self.pandas_categorical = _data_from_pandas(data, feature_name, categorical_feature, self.pandas_categorical)
label = _label_from_pandas(label)
self.data_has_header = False
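[Editor's note] The three added lines make a Dataset built with ``reference=`` reuse the training set's pandas category mapping and categorical feature list instead of re-deriving its own, so validation data holding only a subset of the categories still encodes consistently. A hedged usage sketch (frames and labels are made up):

import pandas as pd
import lightgbm as lgb

train_df = pd.DataFrame({'cat': pd.Categorical(['a', 'b', 'c', 'a'] * 25),
                         'num': range(100)})
valid_df = pd.DataFrame({'cat': pd.Categorical(['c', 'b'] * 10),
                         'num': range(20)})

train_set = lgb.Dataset(train_df, label=[0, 1] * 50,
                        categorical_feature=['cat'])
# After this fix, pandas_categorical and categorical_feature are inherited
# from the reference, so 'b' and 'c' get the same integer codes here.
valid_set = lgb.Dataset(valid_df, label=[0, 1] * 10, reference=train_set)
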
36 changes: 17 additions & 19 deletions src/io/bin.cpp
@@ -293,20 +293,24 @@ namespace LightGBM {
// convert to int type first
std::vector<int> distinct_values_int;
std::vector<int> counts_int;
distinct_values_int.push_back(static_cast<int>(distinct_values[0]));
counts_int.push_back(counts[0]);
for (size_t i = 1; i < distinct_values.size(); ++i) {
if (static_cast<int>(distinct_values[i]) != distinct_values_int.back()) {
distinct_values_int.push_back(static_cast<int>(distinct_values[i]));
counts_int.push_back(counts[i]);
for (size_t i = 0; i < distinct_values.size(); ++i) {
int val = static_cast<int>(distinct_values[i]);
if (val < 0) {
na_cnt += counts[i];
Log::Warning("Met negative value in categorical features, will convert it to NaN");
} else {
counts_int.back() += counts[i];
if (distinct_values_int.empty() || val != distinct_values_int.back()) {
distinct_values_int.push_back(val);
counts_int.push_back(counts[i]);
} else {
counts_int.back() += counts[i];
}
}
}
// sort by counts
Common::SortForPair<int, int>(counts_int, distinct_values_int, 0, true);
// avoid first bin is zero
if (distinct_values_int[0] == 0 || (counts_int.size() == 1 && na_cnt > 0)) {
if (distinct_values_int[0] == 0) {
if (counts_int.size() == 1) {
counts_int.push_back(0);
distinct_values_int.push_back(distinct_values_int[0] + 1);
@@ -325,17 +329,11 @@
cnt_in_bin.clear();
while (cur_cat < distinct_values_int.size()
&& (used_cnt < cut_cnt || num_bin_ < max_bin)) {
if (distinct_values_int[cur_cat] < 0) {
na_cnt += counts_int[cur_cat];
cut_cnt -= counts_int[cur_cat];
Log::Warning("Met negative value in categorical features, will convert it to NaN");
} else {
bin_2_categorical_.push_back(distinct_values_int[cur_cat]);
categorical_2_bin_[distinct_values_int[cur_cat]] = static_cast<unsigned int>(num_bin_);
used_cnt += counts_int[cur_cat];
cnt_in_bin.push_back(counts_int[cur_cat]);
++num_bin_;
}
bin_2_categorical_.push_back(distinct_values_int[cur_cat]);
categorical_2_bin_[distinct_values_int[cur_cat]] = static_cast<unsigned int>(num_bin_);
used_cnt += counts_int[cur_cat];
cnt_in_bin.push_back(counts_int[cur_cat]);
++num_bin_;
++cur_cat;
}
// need an additional bin for NaN
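[Editor's note] The rewritten loop above is the interesting part of this file: negative categories are now folded into the NaN count while distinct values are collapsed to ints, instead of being skipped later during bin assignment. A Python sketch of that logic, under the same assumptions (``distinct_values`` sorted ascending, ``counts`` aligned with it):

def to_int_categories(distinct_values, counts):
    distinct_values_int, counts_int, na_cnt = [], [], 0
    for value, count in zip(distinct_values, counts):
        val = int(value)  # truncate toward zero, like static_cast<int>
        if val < 0:
            na_cnt += count          # negative category -> treated as NaN
        elif not distinct_values_int or val != distinct_values_int[-1]:
            distinct_values_int.append(val)
            counts_int.append(count)
        else:
            counts_int[-1] += count  # same int after truncation: merge
    return distinct_values_int, counts_int, na_cnt
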
4 changes: 2 additions & 2 deletions src/io/config.cpp
@@ -380,15 +380,15 @@ void TreeConfig::Set(const std::unordered_map<std::string, std::string>& params)
GetInt(params, "gpu_platform_id", &gpu_platform_id);
GetInt(params, "gpu_device_id", &gpu_device_id);
GetBool(params, "gpu_use_dp", &gpu_use_dp);
GetInt(params, "max_cat_group", &max_cat_group);
GetInt(params, "max_cat_threshold", &max_cat_threshold);
GetDouble(params, "cat_smooth_ratio", &cat_smooth_ratio);
GetDouble(params, "cat_l2", &cat_l2);
GetDouble(params, "min_cat_smooth", &min_cat_smooth);
GetDouble(params, "max_cat_smooth", &max_cat_smooth);
GetInt(params, "min_data_per_group", &min_data_per_group);
CHECK(max_cat_group > 1);
CHECK(max_cat_threshold > 0);
CHECK(cat_smooth_ratio >= 0);
CHECK(cat_l2 >= 0.0f);
CHECK(min_cat_smooth >= 1);
CHECK(max_cat_smooth > min_cat_smooth);
CHECK(min_data_per_group > 0);
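[Editor's note] For reference, the CHECK(...) calls above boil down to the following value constraints; this is a plain-Python restatement for readability, not LightGBM API.

def validate_categorical_params(p):
    # Mirrors TreeConfig::Set after this commit; note CHECK(max_cat_group > 1)
    # is gone along with the parameter itself.
    assert p['max_cat_threshold'] > 0
    assert p['cat_smooth_ratio'] >= 0
    assert p['cat_l2'] >= 0.0
    assert p['min_cat_smooth'] >= 1
    assert p['max_cat_smooth'] > p['min_cat_smooth']
    assert p['min_data_per_group'] > 0

validate_categorical_params({'max_cat_threshold': 128, 'cat_smooth_ratio': 0.01,
                             'cat_l2': 1.0, 'min_cat_smooth': 5,
                             'max_cat_smooth': 50, 'min_data_per_group': 100})
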
55 changes: 27 additions & 28 deletions src/treelearner/feature_histogram.hpp
@@ -109,49 +109,53 @@ class FeatureHistogram {
double best_sum_left_gradient = 0;
double best_sum_left_hessian = 0;
double gain_shift = GetLeafSplitGain(sum_gradient, sum_hessian, meta_->tree_config->lambda_l1, meta_->tree_config->lambda_l2);

double min_gain_shift = gain_shift + meta_->tree_config->min_gain_to_split;
bool is_full_categorical = meta_->missing_type == MissingType::None;
int used_bin = meta_->num_bin - 1;

if (is_full_categorical) ++used_bin;

std::vector<int> sorted_idx(used_bin);
for (int i = 0; i < used_bin; ++i) sorted_idx[i] = i;
const double smooth_hess = std::max(meta_->tree_config->min_cat_smooth,
std::min(meta_->tree_config->cat_smooth_ratio * num_data, meta_->tree_config->max_cat_smooth));

const int min_data_per_cat = static_cast<int>(smooth_hess);
std::vector<int> sorted_idx;
for (int i = 0; i < used_bin; ++i) {
if (data_[i].cnt >= min_data_per_cat) {
sorted_idx.push_back(i);
}
}
used_bin = static_cast<int>(sorted_idx.size());

double smooth_hess = meta_->tree_config->cat_smooth_ratio * num_data;
smooth_hess = std::min(meta_->tree_config->max_cat_smooth, std::max(smooth_hess, meta_->tree_config->min_cat_smooth));
const double smooth_grad = smooth_hess * sum_gradient / sum_hessian;
const double l2 = meta_->tree_config->lambda_l2 + meta_->tree_config->cat_l2;

auto ctr_fun = [&smooth_hess, &smooth_grad](double sum_grad, double sum_hess) {
return (sum_grad + smooth_grad) / (sum_hess + smooth_hess);
auto ctr_fun = [&smooth_hess](double sum_grad, double sum_hess) {
return (sum_grad) / (sum_hess + smooth_hess);
};
std::sort(sorted_idx.begin(), sorted_idx.end(),
[this, &ctr_fun](int i, int j) {
return ctr_fun(data_[i].sum_gradients, data_[i].sum_hessians) < ctr_fun(data_[j].sum_gradients, data_[j].sum_hessians);
return ctr_fun(data_[i].sum_gradients, data_[i].cnt) < ctr_fun(data_[j].sum_gradients, data_[j].cnt);
});

std::vector<int> find_direction(1, 1);
std::vector<int> start_position(1, 0);
if (!is_full_categorical
|| meta_->tree_config->max_cat_threshold * 2 < meta_->num_bin) {
find_direction.push_back(-1);
start_position.push_back(used_bin - 1);
}
find_direction.push_back(-1);
start_position.push_back(used_bin - 1);
const int max_num_cat = std::min(meta_->tree_config->max_cat_threshold, (used_bin + 1) / 2);

is_splittable_ = false;
int best_threshold = -1;
int best_dir = 1;
for (size_t out_i = 0; out_i < find_direction.size(); ++out_i) {
auto dir = find_direction[out_i];
auto start_pos = start_position[out_i];
data_size_t rest_group = meta_->tree_config->max_cat_group;
data_size_t min_data_per_group = std::max(meta_->tree_config->min_data_per_group, num_data / rest_group);
data_size_t min_data_per_group = meta_->tree_config->min_data_per_group;
data_size_t cnt_cur_group = 0;
double sum_left_gradient = 0.0f;
double sum_left_hessian = kEpsilon;
data_size_t left_count = 0;
for (int i = 0; i < used_bin && i < meta_->tree_config->max_cat_threshold; ++i) {
for (int i = 0; i < used_bin && i < max_num_cat; ++i) {
auto t = sorted_idx[start_pos];
start_pos += dir;

@@ -171,11 +175,10 @@ class FeatureHistogram {
if (cnt_cur_group < min_data_per_group) continue;

cnt_cur_group = 0;
if (--rest_group > 0) min_data_per_group = std::max(meta_->tree_config->min_data_per_group, right_count / rest_group);

double sum_right_gradient = sum_gradient - sum_left_gradient;
double current_gain = GetLeafSplitGain(sum_left_gradient, sum_left_hessian, meta_->tree_config->lambda_l1, meta_->tree_config->lambda_l2)
+ GetLeafSplitGain(sum_right_gradient, sum_right_hessian, meta_->tree_config->lambda_l1, meta_->tree_config->lambda_l2);
double current_gain = GetLeafSplitGain(sum_left_gradient, sum_left_hessian, meta_->tree_config->lambda_l1, l2)
+ GetLeafSplitGain(sum_right_gradient, sum_right_hessian, meta_->tree_config->lambda_l1, l2);
if (current_gain <= min_gain_shift) continue;
is_splittable_ = true;
if (current_gain > best_gain) {
@@ -191,13 +194,13 @@

if (is_splittable_) {
output->left_output = CalculateSplittedLeafOutput(best_sum_left_gradient, best_sum_left_hessian,
meta_->tree_config->lambda_l1, meta_->tree_config->lambda_l2);
meta_->tree_config->lambda_l1, l2);
output->left_count = best_left_count;
output->left_sum_gradient = best_sum_left_gradient;
output->left_sum_hessian = best_sum_left_hessian - kEpsilon;
output->right_output = CalculateSplittedLeafOutput(sum_gradient - best_sum_left_gradient,
sum_hessian - best_sum_left_hessian,
meta_->tree_config->lambda_l1, meta_->tree_config->lambda_l2);
meta_->tree_config->lambda_l1, l2);
output->right_count = num_data - best_left_count;
output->right_sum_gradient = sum_gradient - best_sum_left_gradient;
output->right_sum_hessian = sum_hessian - best_sum_left_hessian - kEpsilon;
Expand All @@ -207,16 +210,12 @@ class FeatureHistogram {
if (best_dir == 1) {
for (int i = 0; i < output->num_cat_threshold; ++i) {
auto t = sorted_idx[i];
if (data_[t].cnt > 0) {
output->cat_threshold[i] = t;
}
output->cat_threshold[i] = t;
}
} else {
for (int i = 0; i < output->num_cat_threshold; ++i) {
auto t = sorted_idx[used_bin - 1 - i];
if (data_[t].cnt > 0) {
output->cat_threshold[i] = t;
}
output->cat_threshold[i] = t;
}
}
}
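[Editor's note] This file carries the core algorithm change, so a compact restatement may help. After this commit the categorical split finder: (1) drops category bins with fewer data points than the smoothing constant, (2) sorts the survivors by smoothed mean gradient (the denominator is now the bin count, not the hessian sum), (3) scans from both ends of that order, at most min(max_cat_threshold, (used_bin + 1) / 2) categories deep, with a fixed min_data_per_group floor, and (4) adds cat_l2 on top of lambda_l2 in the gain. The Python below is a simplified sketch under those assumptions; lambda_l1, min_gain_to_split, and the NaN bin are omitted, and the gain is the standard g^2 / (h + l2) rather than the exact GetLeafSplitGain.

from collections import namedtuple

BinStats = namedtuple('BinStats', 'sum_gradients sum_hessians cnt')

def leaf_gain(g, h, l2):
    return g * g / (h + l2)

def best_categorical_split(data_, num_data, cfg):
    smooth_hess = max(cfg['min_cat_smooth'],
                      min(cfg['cat_smooth_ratio'] * num_data,
                          cfg['max_cat_smooth']))
    # (1) ignore rare categories outright
    sorted_idx = [i for i, b in enumerate(data_) if b.cnt >= smooth_hess]
    used_bin = len(sorted_idx)
    # (2) order categories by smoothed average gradient
    sorted_idx.sort(key=lambda i: data_[i].sum_gradients
                    / (data_[i].cnt + smooth_hess))
    l2 = cfg['lambda_l2'] + cfg['cat_l2']
    max_num_cat = min(cfg['max_cat_threshold'], (used_bin + 1) // 2)
    total_g = sum(b.sum_gradients for b in data_)
    total_h = sum(b.sum_hessians for b in data_)
    best_gain, best = leaf_gain(total_g, total_h, l2), None
    for direction in (1, -1):          # (3) scan from both ends
        pos = 0 if direction == 1 else used_bin - 1
        g = h = 0.0
        group = 0
        for i in range(min(used_bin, max_num_cat)):
            b = data_[sorted_idx[pos]]
            pos += direction
            g += b.sum_gradients
            h += b.sum_hessians
            group += b.cnt
            if group < cfg['min_data_per_group']:
                continue               # current group still too small
            group = 0
            gain = (leaf_gain(g, h, l2)                        # (4) cat_l2
                    + leaf_gain(total_g - g, total_h - h, l2))
            if gain > best_gain:
                best_gain, best = gain, (direction, i + 1)
    return best_gain, best

The returned (direction, count) pair loosely corresponds to best_dir and num_cat_threshold in the C++ code.
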
6 changes: 6 additions & 0 deletions tests/python_package_test/test_engine.py
@@ -229,6 +229,8 @@ def test_categorical_handle(self):
'learning_rate': 1,
'min_data_in_bin': 1,
'min_data_per_group': 1,
'min_cat_smooth': 1,
'cat_l2': 0,
'zero_as_missing': True,
'categorical_column': 0
}
@@ -260,6 +262,8 @@ def test_categorical_handle2(self):
'learning_rate': 1,
'min_data_in_bin': 1,
'min_data_per_group': 1,
'min_cat_smooth': 1,
'cat_l2': 0,
'zero_as_missing': False,
'categorical_column': 0
}
@@ -291,6 +295,8 @@ def test_categorical_handle3(self):
'learning_rate': 1,
'min_data_in_bin': 1,
'min_data_per_group': 1,
'min_cat_smooth': 1,
'cat_l2': 0,
'zero_as_missing': False,
'categorical_column': 0
}
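[Editor's note] All three fixtures add ``min_cat_smooth: 1`` and ``cat_l2: 0``, presumably because with the new defaults (``min_cat_smooth=5``, ``cat_l2=1``) the tiny synthetic datasets would be smoothed and regularized too hard to yield a categorical split at all (bins with fewer than min_cat_smooth points are now skipped entirely). A standalone sketch in the same spirit; parameters beyond those shown in the diff are illustrative.

import numpy as np
import lightgbm as lgb

X = np.random.randint(0, 2, size=(100, 1)).astype(float)
y = X[:, 0]                        # label equals the category
params = {'objective': 'regression', 'num_leaves': 2, 'learning_rate': 1,
          'min_data': 1, 'min_data_in_bin': 1, 'min_data_per_group': 1,
          'min_cat_smooth': 1, 'cat_l2': 0,
          'categorical_column': 0, 'verbose': -1}
booster = lgb.train(params, lgb.Dataset(X, y), num_boost_round=1)
pred = booster.predict(X)
print(np.mean((pred - y) ** 2))    # near zero once the split is found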
