fix percentile computation for regression objectives #5848

zachary62 · 2023-04-24T23:17:11Z

Fixes #5847

jameslamb · 2023-05-09T01:57:34Z

@shiyu1994 @guolinke can you review this please?

Maybe still need a (1.0 - alpha)

guolinke · 2023-08-10T04:24:20Z

sorry for missing this PR. I think this PR indeed fixes the bug.
cc @shiyu1994 to confirm it.

jameslamb · 2023-08-11T13:46:25Z

@zachary62 please merge latest master into this branch and push a new commit, so we can see the CI run again.

In the future, I recommend creating a new branch on your fork when contributing to open source projects here on GitHub, instead of committing to master / main, as that makes that operation a bit easier. Let us know if you need help.

guolinke · 2023-08-12T00:17:05Z

It seems we need to fix some tests

jameslamb

I just renamed this. Pull request titles automatically become release note items in this project. Users of LightGBM reading the release notes won't care that the fix was "simple" or how many lines the change was.

I have a few questions for you @guolinke @shiyu1994

Is my reworded title correct?
If it is... should this be labeled breaking? If I understand correctly, won't this change the results of any models using regression objectives? (as indicated by some of those breaking tests )
Are you sure this fix is correct? I understand that the initial score being chosen right now is not exactly the median and is supposed to be...but the failing tests are showing that as of this change lightgbm produces a worse in-sample fit. Isn't that a problem?

e.g.

LightGBM/tests/python_package_test/test_engine.py

Lines 135 to 146 in 5c9e61d

    
           ret = mean_squared_error(y_test, gbm.predict(X_test)) 
        
           if objective == 'huber': 
        
               assert ret < 430 
        
           elif objective == 'fair': 
        
               assert ret < 296 
        
           elif objective == 'poisson': 
        
               assert ret < 193 
        
           elif objective == 'quantile': 
        
               assert ret < 1311 
        
           else: 
        
               assert ret < 338 
        
           assert evals_result['valid_0']['l2'][-1] == pytest.approx(ret)

FAILED tests/python_package_test/test_engine.py::test_regression[regression_l1] - assert 368.9313884458552 < 338
FAILED tests/python_package_test/test_engine.py::test_regression[quantile] - assert 1314.2383346017639 < 1311

(build link)

shiyu1994 · 2023-08-14T04:02:21Z

I have pushed a fix for the positions in percentile computation. Also, I increase the threshold value for l1 objective in the test_regression. Note that what we test in the test cases are the metrics on test set, not training set. So, it is possible that after the fix the performance on test set would downgrade. Also, the metrics are MSE, instead of MAE. For the l1_regression, it is optimizing the MAE. So, fix a bug used by the l1_regression doesn't ensure that we will get a better MSE value.

To show that the fix makes sense, the MAE on the training set of test_regression for l1_regression before the fix is 12.631088946445049, and after the fix is 12.588430387696558.

addressed

jameslamb · 2023-08-14T04:25:07Z

Ok thanks @shiyu1994 . I've dismissed my review so it doesn't prevent merging.

For my own understanding, can you explain what was incorrect about my proposed title fix initial score computation for regression objectives?

Is it not true that this affects the initial score that is set at the beginning of boosting? I thought that's what was happening based on this log line in #5847:

[LightGBM] [Info] Start training from score 3.500000

shiyu1994 · 2023-08-15T02:10:24Z

Is it not true that this affects the initial score that is set at the beginning of boosting?

Yes, it is true. But not only the initial score. In l1_regression and quantile regression. The leaf values are refit after the tree is built. And the refit is according to the percentile calculation.

LightGBM/src/objective/regression_objective.hpp

Lines 253 to 283 in 5c9e61d

    
           double RenewTreeOutput(double, std::function<double(const label_t*, int)> residual_getter, 
        
                                  const data_size_t* index_mapper, 
        
                                  const data_size_t* bagging_mapper, 
        
                                  data_size_t num_data_in_leaf) const override { 
        
             const double alpha = 0.5; 
        
             if (weights_ == nullptr) { 
        
               if (bagging_mapper == nullptr) { 
        
                 #define data_reader(i) (residual_getter(label_, index_mapper[i])) 
        
                 PercentileFun(double, data_reader, num_data_in_leaf, alpha); 
        
                 #undef data_reader 
        
               } else { 
        
                 #define data_reader(i) (residual_getter(label_, bagging_mapper[index_mapper[i]])) 
        
                 PercentileFun(double, data_reader, num_data_in_leaf, alpha); 
        
                 #undef data_reader 
        
               } 
        
             } else { 
        
               if (bagging_mapper == nullptr) { 
        
                 #define data_reader(i) (residual_getter(label_, index_mapper[i])) 
        
                 #define weight_reader(i) (weights_[index_mapper[i]]) 
        
                 WeightedPercentileFun(double, data_reader, weight_reader, num_data_in_leaf, alpha); 
        
                 #undef data_reader 
        
                 #undef weight_reader 
        
               } else { 
        
                 #define data_reader(i) (residual_getter(label_, bagging_mapper[index_mapper[i]])) 
        
                 #define weight_reader(i) (weights_[bagging_mapper[index_mapper[i]]]) 
        
                 WeightedPercentileFun(double, data_reader, weight_reader, num_data_in_leaf, alpha); 
        
                 #undef data_reader 
        
                 #undef weight_reader 
        
               } 
        
             } 
        
           }

jameslamb · 2023-08-15T02:32:10Z

ohhhh I see. Ok thanks for explaining that!

github-actions · 2023-11-15T00:20:14Z

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

Update regression_objective.hpp

8655aa1

zachary62 requested review from guolinke and shiyu1994 as code owners April 24, 2023 23:17

jameslamb added the awaiting review label May 9, 2023

Update regression_objective.hpp

d19e4b7

Maybe still need a (1.0 - alpha)

This comment was marked as spam.

Sign in to view

jameslamb added the fix label Aug 10, 2023

shiyu1994 approved these changes Aug 11, 2023

View reviewed changes

Merge branch 'master' into master

fc093a7

shiyu1994 requested review from jameslamb and jmoralez as code owners August 11, 2023 22:19

jameslamb changed the title ~~Simple one-line fix to correct the percentile~~ fix initial score computation for regression objectives Aug 13, 2023

jameslamb previously requested changes Aug 13, 2023

View reviewed changes

jameslamb added the in progress label Aug 13, 2023

shiyu1994 changed the title ~~fix initial score computation for regression objectives~~ fix percentile computation for regression objectives Aug 14, 2023

shiyu1994 added 2 commits August 14, 2023 03:45

fix position in percentile calculation

80a2a98

fix regression metric threshold for l1

bd1ae3c

Merge branch 'master' into master

bdcbfb8

jameslamb self-requested a review August 14, 2023 04:23

jameslamb removed the awaiting review label Aug 14, 2023

shiyu1994 removed the in progress label Aug 15, 2023

shiyu1994 merged commit 37c3d3f into microsoft:master Aug 15, 2023
41 checks passed

github-actions bot locked as resolved and limited conversation to collaborators Nov 15, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix percentile computation for regression objectives #5848

fix percentile computation for regression objectives #5848

zachary62 commented Apr 24, 2023 •

edited by jameslamb

jameslamb commented May 9, 2023

This comment was marked as spam.

guolinke commented Aug 10, 2023

jameslamb commented Aug 11, 2023

guolinke commented Aug 12, 2023

jameslamb left a comment •

edited

shiyu1994 commented Aug 14, 2023 •

edited

jameslamb commented Aug 14, 2023

shiyu1994 commented Aug 15, 2023

jameslamb commented Aug 15, 2023

github-actions bot commented Nov 15, 2023

	ret = mean_squared_error(y_test, gbm.predict(X_test))
	if objective == 'huber':
	assert ret < 430
	elif objective == 'fair':
	assert ret < 296
	elif objective == 'poisson':
	assert ret < 193
	elif objective == 'quantile':
	assert ret < 1311
	else:
	assert ret < 338
	assert evals_result['valid_0']['l2'][-1] == pytest.approx(ret)

fix percentile computation for regression objectives #5848

fix percentile computation for regression objectives #5848

Conversation

zachary62 commented Apr 24, 2023 • edited by jameslamb

jameslamb commented May 9, 2023

This comment was marked as spam.

guolinke commented Aug 10, 2023

jameslamb commented Aug 11, 2023

guolinke commented Aug 12, 2023

jameslamb left a comment • edited

Choose a reason for hiding this comment

shiyu1994 commented Aug 14, 2023 • edited

jameslamb commented Aug 14, 2023

shiyu1994 commented Aug 15, 2023

jameslamb commented Aug 15, 2023

github-actions bot commented Nov 15, 2023

zachary62 commented Apr 24, 2023 •

edited by jameslamb

jameslamb left a comment •

edited

shiyu1994 commented Aug 14, 2023 •

edited