
[dask] Include support for raw_score in predict (fixes #3793) #4024

Merged: 10 commits into microsoft:master on Mar 27, 2021

Conversation

jmoralez
Collaborator

This attempts to solve #3793 by testing the predict method of DaskLGBMRegressor and DaskLGBMClassifier with raw_score=True. The tests for DaskLGBMClassifier are performed with both binary and multi-class classification, since predict_proba=True and raw_score=True return a 1d array/pandas Series for binary classification, as opposed to the multi-class variants, which return one column per class.

I found that using drop_axis=1 for dask.array.map_blocks works every time and avoids having many if statements testing the conditions described above, but I'm happy to discuss the best way to do this. I also replaced the logic that decides whether to return a pd.Series or a pd.DataFrame with a check on the shape of the input array, which I think should be discussed as well.
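
(For reference, a rough sketch of the kind of map_blocks call being described; _predict_part, predict_raw, and the keyword arguments here are placeholder names, not necessarily the code in this PR.)

import dask.array as da
import numpy as np

def _predict_part(part, model, raw_score):
    # hypothetical per-chunk helper: delegate to the local, non-Dask model
    return model.predict(part, raw_score=raw_score)

def predict_raw(model, X):
    # X is a 2D dask array of features; drop_axis=1 declares that the feature
    # axis of the input chunks is consumed by _predict_part, which (per the
    # observation above) handled both the 1d and the 2d prediction outputs
    return da.map_blocks(
        _predict_part,
        X,
        model=model,
        raw_score=True,
        drop_axis=1,
        dtype=np.float64,
    )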

Collaborator

@jameslamb left a comment

Thanks for taking this on! Please see my first round of comments

python-package/lightgbm/dask.py (review thread, resolved)
@@ -1255,3 +1255,59 @@ def test_sklearn_integration(estimator, check, client):
def test_parameters_default_constructible(estimator):
name, Estimator = estimator.__class__.__name__, estimator.__class__
sklearn_checks.check_parameters_default_constructible(name, Estimator)


@pytest.mark.parametrize('task', ['binary_classification', 'multi-class_classification', 'regression'])
Collaborator

Can you please add ranking tests as well?

Collaborator

actually this change also makes me realize that the other tests using @pytest.mark.parametrize('task', tasks) are not currently testing multi-class classification.

Are you interested in submitting a separate PR to add multi-class-classification to tasks?

tasks = ['classification', 'regression', 'ranking']

Doing that would improve our test coverage AND reduce the risk of mistakes like this, where a task is forgotten.
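
(For illustration, the expanded parametrization might look something like the sketch below; the exact string used for the new task is a placeholder, not necessarily what the test suite settled on.)

import pytest

tasks = ['classification', 'multi-class-classification', 'regression', 'ranking']

@pytest.mark.parametrize('task', tasks)
def test_some_dask_behavior(task):
    # every test parametrized over `tasks` would then cover multi-class
    # classification as well
    ...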

Collaborator Author

Sure. Should this modify the _create_data function to add an objective of multi-class-classification?

Collaborator

yes please! based on this discussion: #4024 (comment)

when you do that, I think it would make sense to remove the centers argument from that function, and just use the objective to set centers=2 or centers=3. We don't have any tests right now that use more than 3 classes or that really care about the exact numeric values of the cluster centers coming out of make_blobs().
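
(A minimal sketch of what that might look like, assuming a hypothetical signature for _create_data; this is not the exact helper from test_dask.py, and the ranking case is omitted.)

from sklearn.datasets import make_blobs, make_regression

def _create_data(objective, n_samples=100):
    # derive the number of blob centers from the objective instead of taking a
    # separate `centers` argument: 2 classes for binary, 3 for multi-class
    if objective == 'classification':
        X, y = make_blobs(n_samples=n_samples, centers=2, random_state=42)
    elif objective == 'multi-class-classification':
        X, y = make_blobs(n_samples=n_samples, centers=3, random_state=42)
    elif objective == 'regression':
        X, y = make_regression(n_samples=n_samples, n_features=4, random_state=42)
    else:
        raise ValueError(f'Unknown objective: {objective}')
    return X, y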

Collaborator Author

Hi James. I'm having trouble with the tests for ranking. The check for the expected mean is there because, if you have 25 samples in one leaf with value 0 and 75 samples in another leaf with value 1, you expect your mean to be (25 * 0 + 75 * 1) / (25 + 75), right? That would be roughly equivalent to checking that there are 25 predictions with the value 0 and 75 with the value 1. But for ranking I'm seeing that these don't match:
[screenshot showing that the two values don't match for the ranking task]
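
(For concreteness, the arithmetic behind that check as a tiny standalone sketch; the numbers are just the ones from the example above.)

import numpy as np

# 25 samples fall in a leaf with value 0 and 75 samples in a leaf with value 1,
# so the mean raw prediction should equal the leaf-count-weighted mean of the
# leaf values
leaf_values = np.array([0.0, 1.0])
leaf_counts = np.array([25, 75])
expected_mean = (leaf_counts * leaf_values).sum() / leaf_counts.sum()
print(expected_mean)  # 0.75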

Collaborator Author

Or I could directly compare the count column against the unique counts of the predictions. It's not very clear to me why we don't use that here; why are we using the weights?

Collaborator

You know, now that I think about it... it would be enough to compare the unique values of the leaf nodes to the unique values of the preds. If those are the same, you know you got the raw scores.

Collaborator Author

So the tests would check set(raw_predictions) == set(trees_df['value'])?

Collaborator

yes
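
(For reference, a minimal sketch of that check, assuming the model is fit with a single tree, e.g. n_estimators=1, so that every raw score is exactly one leaf value; dask_model and dX are placeholder names, not the fixtures used in the tests.)

import numpy as np

def check_raw_score_matches_leaf_values(dask_model, dX):
    # raw_score=True returns untransformed scores; compute() materializes the
    # distributed predictions locally
    raw_predictions = dask_model.predict(dX, raw_score=True).compute()
    trees_df = dask_model.booster_.trees_to_dataframe()
    # leaf rows in trees_to_dataframe() have no children
    leaf_values = trees_df.loc[trees_df['left_child'].isnull(), 'value']
    assert set(np.unique(raw_predictions)) == set(leaf_values)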

Collaborator Author

Done, please let me know what you think.

tests/python_package_test/test_dask.py (five review threads, outdated and resolved)
@jameslamb self-requested a review on March 18, 2021 at 21:08
Collaborator

@jameslamb left a comment

Looks great to me, thanks very much! I'm going to wait to merge this until after LightGBM 3.2.0 is released (#3872), assuming that happens in the next day or two.

@jameslamb changed the title from "[dask] Include support for raw_score in predict" to "[dask] Include support for raw_score in predict (fixes #3793)" on Mar 18, 2021
@jmoralez
Collaborator Author

Sounds good. What would be a good next contribution? I'd like to work on the dataset, but I'm happy to contribute something else if you consider it more important.

@jameslamb
Collaborator

> Sounds good. What would be a good next contribution? I'd like to work on the dataset, but I'm happy to contribute something else if you consider it more important.

Thanks very much! Could you try adding tests for voting_parallel learning, as part of #3834? I added some notes on how to get started at #3834 (comment).

I don't want to start on DaskLGBMDataset yet (#3944), since I think that will take a while and will probably conflict with @ffineis's work on adding eval sets (#3952 (comment)).

@jmoralez
Collaborator Author

Wow, I hadn't seen voting parallel learning; it looks awesome. I'll definitely work on that and hope that it doesn't suffer from what I've seen in #4026.

@StrikerRUS
Collaborator

@jmoralez Could you please sync with the latest master branch?

@StrikerRUS merged commit fe1b80a into microsoft:master on Mar 27, 2021
@jmoralez deleted the raw_score branch on March 27, 2021 at 16:10
@github-actions

This pull request has been automatically locked since there has not been any recent activity after it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues, including a reference to this.

@github-actions bot locked as resolved and limited conversation to collaborators on Aug 23, 2023