Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix percentile computation for regression objectives #5848

Merged
merged 6 commits into from Aug 15, 2023

Conversation

zachary62
Copy link
Contributor

@zachary62 zachary62 commented Apr 24, 2023

Fixes #5847

@jameslamb
Copy link
Collaborator

@shiyu1994 @guolinke can you review this please?

Maybe still need a (1.0 - alpha)
@Gso-1

This comment was marked as spam.

@guolinke
Copy link
Collaborator

sorry for missing this PR. I think this PR indeed fixes the bug.
cc @shiyu1994 to confirm it.

@jameslamb jameslamb added the fix label Aug 10, 2023
@jameslamb
Copy link
Collaborator

@zachary62 please merge latest master into this branch and push a new commit, so we can see the CI run again.

In the future, I recommend creating a new branch on your fork when contributing to open source projects here on GitHub, instead of committing to master / main, as that makes that operation a bit easier. Let us know if you need help.

@guolinke
Copy link
Collaborator

It seems we need to fix some tests

@jameslamb jameslamb changed the title Simple one-line fix to correct the percentile fix initial score computation for regression objectives Aug 13, 2023
Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just renamed this. Pull request titles automatically become release note items in this project. Users of LightGBM reading the release notes won't care that the fix was "simple" or how many lines the change was.

I have a few questions for you @guolinke @shiyu1994

  1. Is my reworded title correct?
  2. If it is... should this be labeled breaking? If I understand correctly, won't this change the results of any models using regression objectives? (as indicated by some of those breaking tests )
  3. Are you sure this fix is correct? I understand that the initial score being chosen right now is not exactly the median and is supposed to be...but the failing tests are showing that as of this change lightgbm produces a worse in-sample fit. Isn't that a problem?

e.g.

ret = mean_squared_error(y_test, gbm.predict(X_test))
if objective == 'huber':
assert ret < 430
elif objective == 'fair':
assert ret < 296
elif objective == 'poisson':
assert ret < 193
elif objective == 'quantile':
assert ret < 1311
else:
assert ret < 338
assert evals_result['valid_0']['l2'][-1] == pytest.approx(ret)

FAILED tests/python_package_test/test_engine.py::test_regression[regression_l1] - assert 368.9313884458552 < 338
FAILED tests/python_package_test/test_engine.py::test_regression[quantile] - assert 1314.2383346017639 < 1311

(build link)

@shiyu1994 shiyu1994 changed the title fix initial score computation for regression objectives fix percentile computation for regression objectives Aug 14, 2023
@shiyu1994
Copy link
Collaborator

shiyu1994 commented Aug 14, 2023

I have pushed a fix for the positions in percentile computation. Also, I increase the threshold value for l1 objective in the test_regression. Note that what we test in the test cases are the metrics on test set, not training set. So, it is possible that after the fix the performance on test set would downgrade. Also, the metrics are MSE, instead of MAE. For the l1_regression, it is optimizing the MAE. So, fix a bug used by the l1_regression doesn't ensure that we will get a better MSE value.

To show that the fix makes sense, the MAE on the training set of test_regression for l1_regression before the fix is 12.631088946445049, and after the fix is 12.588430387696558.

@jameslamb jameslamb self-requested a review August 14, 2023 04:23
@jameslamb
Copy link
Collaborator

Ok thanks @shiyu1994 . I've dismissed my review so it doesn't prevent merging.

For my own understanding, can you explain what was incorrect about my proposed title fix initial score computation for regression objectives?

Is it not true that this affects the initial score that is set at the beginning of boosting? I thought that's what was happening based on this log line in #5847:

[LightGBM] [Info] Start training from score 3.500000

@shiyu1994
Copy link
Collaborator

Is it not true that this affects the initial score that is set at the beginning of boosting?

Yes, it is true. But not only the initial score. In l1_regression and quantile regression. The leaf values are refit after the tree is built. And the refit is according to the percentile calculation.

double RenewTreeOutput(double, std::function<double(const label_t*, int)> residual_getter,
const data_size_t* index_mapper,
const data_size_t* bagging_mapper,
data_size_t num_data_in_leaf) const override {
const double alpha = 0.5;
if (weights_ == nullptr) {
if (bagging_mapper == nullptr) {
#define data_reader(i) (residual_getter(label_, index_mapper[i]))
PercentileFun(double, data_reader, num_data_in_leaf, alpha);
#undef data_reader
} else {
#define data_reader(i) (residual_getter(label_, bagging_mapper[index_mapper[i]]))
PercentileFun(double, data_reader, num_data_in_leaf, alpha);
#undef data_reader
}
} else {
if (bagging_mapper == nullptr) {
#define data_reader(i) (residual_getter(label_, index_mapper[i]))
#define weight_reader(i) (weights_[index_mapper[i]])
WeightedPercentileFun(double, data_reader, weight_reader, num_data_in_leaf, alpha);
#undef data_reader
#undef weight_reader
} else {
#define data_reader(i) (residual_getter(label_, bagging_mapper[index_mapper[i]]))
#define weight_reader(i) (weights_[bagging_mapper[index_mapper[i]]])
WeightedPercentileFun(double, data_reader, weight_reader, num_data_in_leaf, alpha);
#undef data_reader
#undef weight_reader
}
}
}

@shiyu1994 shiyu1994 merged commit 37c3d3f into microsoft:master Aug 15, 2023
41 checks passed
@jameslamb
Copy link
Collaborator

ohhhh I see. Ok thanks for explaining that!

Copy link

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 15, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Median wrongly computed
5 participants