Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[dask] test training when a worker has no data #3897

Merged
merged 8 commits into from
Feb 9, 2021

Conversation

jmoralez
Copy link
Collaborator

@jmoralez jmoralez commented Feb 3, 2021

This includes a test to check that the training process succeeds even when a worker has no data, e.g. verify that #3882 is solved.

I wanted to persist the collections to a specific worker with something like:

for worker in workers:
    dX, dy, dw = client.persist([dX, dy, dw], workers=worker)

However I got: TypeError: unhashable type: 'Array'.

So in the test I just create a collection with one chunk, persist it and check that it is only in one worker. Any feedback to make this test more robust is welcome.

Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks very much! I have some ideas to simplify this, without needing to introduce a new function to check that the data is only on one worker.

  1. Assert that client.nworkers > 1, just to be sure the client fixture is for a multi-worker cluster.
  2. Just repartition the data to 1 partition (.repartition(npartitions=1) for data frame, .rechunk((X.shape[0], X.shape[1])) for array). assert that dX.npartitions == 1. If you have only 1 partition, you know for sure only one worker has data!

other requests

Please make sure this test is run for ranking, classification, and regression. You can borrow from

@pytest.mark.parametrize('task', ['classification', 'regression', 'ranking'])
def test_training_works_if_client_not_provided_or_set_after_construction(task, listen_port, client):
if task == 'ranking':
_, _, _, _, dX, dy, _, dg = _create_ranking_data(
output='array',
group=None
)
model_factory = lgb.DaskLGBMRanker
else:
_, _, _, dX, dy, _ = _create_data(
objective=task,
output='array',
)
dg = None
if task == 'classification':
model_factory = lgb.DaskLGBMClassifier
elif task == 'regression':
model_factory = lgb.DaskLGBMRegressor
params = {
"time_out": 5,
"local_listen_port": listen_port,
"n_estimators": 1,
"num_leaves": 2
}
to see how to do that.

For training on a single worker, the result should be identical to non-Dask training. So please also train a regular lightgbm.sklearn.LGBM[Classifier/Regressor/Ranker] and then check that assert_eq(dask_model.predict(dX).compute(), local_model.predict(X)). You can see the other tests in this file for references on how to do that, or @ me if you have any questions.

@jmoralez
Copy link
Collaborator Author

jmoralez commented Feb 4, 2021

Hi James. I've included the different tasks and outputs. I just saw the PR you made for the categoricals and saw the new output type, I'll wait til that gets merged to include it.

Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much! I like the changes you've made. We're starting to have enough of these tests where tasks is parameterized, makes sense to concentrate it at the top of tetst_dask.py like you did.

I left a few minor suggestions. Other than those, I agree with your proposal to wait until #3908 is merged.

tests/python_package_test/test_dask.py Outdated Show resolved Hide resolved
tests/python_package_test/test_dask.py Outdated Show resolved Hide resolved
tests/python_package_test/test_dask.py Outdated Show resolved Hide resolved
tests/python_package_test/test_dask.py Outdated Show resolved Hide resolved
tests/python_package_test/test_dask.py Outdated Show resolved Hide resolved
@jmoralez
Copy link
Collaborator Author

jmoralez commented Feb 9, 2021

Hi James. I've included your comments and the dataframe-with-categorical data output. Looking forward to your review.

@jameslamb jameslamb self-requested a review February 9, 2021 02:47
Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This look great! Thanks very much for the help

@jameslamb jameslamb merged commit 7b47ab8 into microsoft:master Feb 9, 2021
@jmoralez jmoralez deleted the workers-without-data branch February 9, 2021 02:51
@github-actions
Copy link

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 24, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants