
[dask] Add support for early stopping in Dask interface #3712

Closed
jameslamb opened this issue Jan 3, 2021 · 15 comments

@jameslamb
Collaborator

Summary

DaskLGBMClassifier and DaskLGBMRegressor in the Python package should support early stopping.

Motivation

Early stopping is generally useful with gradient boosting algorithms, to avoid wasted training iterations or unnecessary growth in the model size once desirable performance has been achieved. This feature is available in the non-Dask interfaces for LightGBM, and should be available with the Dask one.

Description

This should mimic the approach XGBoost took (https://github.com/dmlc/xgboost/blob/516a93d25c3b6899558700430ffc99a29ea21e1a/python-package/xgboost/dask.py#L1386), where eval_set contains Dask collections (Dask Array or Dask DataFrame).
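
A hedged sketch of what the user-facing call might look like, assuming the Dask estimators end up accepting the same `eval_set` / `early_stopping_rounds` arguments as the sklearn interface (those parameter names are carried over from the non-Dask API and are an assumption here, not a confirmed design):

```python
# Illustrative only: assumes DaskLGBMRegressor will accept eval_set /
# early_stopping_rounds like the sklearn interface; at the time of this
# issue the Dask estimators do not support these arguments.
import dask.array as da
from distributed import Client
from lightgbm import DaskLGBMRegressor

client = Client()

X = da.random.random((10_000, 20), chunks=(1_000, 20))
y = da.random.random((10_000,), chunks=(1_000,))
X_valid = da.random.random((2_000, 20), chunks=(1_000, 20))
y_valid = da.random.random((2_000,), chunks=(1_000,))

model = DaskLGBMRegressor(n_estimators=500)
model.fit(
    X,
    y,
    eval_set=[(X_valid, y_valid)],  # Dask collections, per the proposal
    early_stopping_rounds=10,
)
```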

@jameslamb
Collaborator Author

Closing in favor of tracking this in #2302 with the other Dask feature requests. Please leave a comment here if you'd like to work on this.

@ffineis
Contributor

ffineis commented Jan 12, 2021

Hey, I'd like to take this if that's cool - I'm planning to mainly base this on the changes called out in #3515 and try to bring it in line with xgboost.dask.

@jameslamb
Collaborator Author

sure, thank you! I'm really close to having a small reproducible example for the random "cannot bind to port XXXX" issue, will link that here when I've written it up.

@ffineis
Contributor

ffineis commented Jan 12, 2021

> sure, thank you! I'm really close to having a small reproducible example for the random "cannot bind to port XXXX" issue, will link that here when I've written it up.

Omigod lifesaver, thank you!

jameslamb referenced this issue Jan 17, 2021
…twork (fixes #3753) (#3766)

* starting work

* fixed port-binding issue on localhost

* minor cleanup

* updates

* getting closer

* definitely working for LocalCluster

* it works, it works

* docs

* add tests

* removing testing-only files

* linting

* Apply suggestions from code review

Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* remove duplicated code

* remove unnecessary listen()

Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
@jameslamb
Collaborator Author

@ffineis do you think you'll have time to work on this this week? We're planning to do a 3.2.0 release of LightGBM in the next week or two. I didn't include this in my list of must-have Dask features for the next release (#3872 (comment)), but I'd love to try to get this change in if we can since it can have such a big impact on training runtime.

If you don't have time this week, could I take this back from you and try it out?

Thanks so much for all your help with the Dask module so far!!

@ffineis
Contributor

ffineis commented Feb 6, 2021 via email

@jameslamb
Collaborator Author

oh yeah no problem, thanks!

@jmoralez
Collaborator

jmoralez commented Feb 8, 2021

Hi. I've been playing around with this and have a working (although horrible) implementation. The approach I took was to use the futures of the persisted collections instead of turning the collections into lists of delayeds; this avoids recomputing the training set when it also appears in the eval_set (because both futures share the same key). Would you guys be interested in discussing this approach?
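
A minimal sketch of the key-matching idea described here, not LightGBM's actual implementation; it only shows that persisting a collection yields futures whose keys can identify reuse of the training data:

```python
import dask.array as da
from distributed import Client, futures_of

client = Client()

# persist the training data; each chunk becomes a future on the cluster
X = client.persist(da.random.random((1_000, 10), chunks=(100, 10)))
train_keys = {f.key for f in futures_of(X)}

# an eval_set entry built from the same persisted collection produces
# futures with identical keys, so it can be detected instead of recomputed
eval_keys = {f.key for f in futures_of(X)}
assert eval_keys == train_keys
```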

@jameslamb
Collaborator Author

Interesting! I'll leave it to @ffineis to comment on that. Right now, I think our highest priority is supporting early stopping, and it would be ok if the first implementation of that merged to master computes the training data twice.

@ffineis
Contributor

ffineis commented Feb 8, 2021

Hey @jmoralez, thanks for the ideation! Honestly, don't feel bad - I think any implementation of early stopping will be pretty hairy given that we're using lists of delayed partitions instead of distributed lgbm.Datasets. I'm a fan of how xgboost.dask attempts to accomplish what you've mentioned via id - I'm just going to check whether each X in an (X, y) eval set shares the same id as data. If so, this will instruct _train_part (the local worker training function) to just use the local data for any eval set parts that had originally called for data, so I'm hoping we'll avoid re-computing the entire training set.

This method works when the entirety of an eval X equals data, but I'm guessing your approach goes even further in that it would also cover cases where an eval X is a subset of the partitions of the training data? I'd say, in the spirit of how the PRs have been going (modular, one change at a time) - hold off until a naive early stopping approach gets merged, then open an issue and follow up with an improvement PR - does that work?
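
A hypothetical sketch of the id-based check described above; the helper name and sentinel value are made up for illustration, and _train_part is referenced only as the discussion names it:

```python
def _mark_shared_eval_sets(data, eval_set):
    """Flag eval_set entries whose X is literally the training data,
    so the per-worker training function can reuse its local partitions
    instead of recomputing them."""
    marked = []
    for eval_X, eval_y in eval_set:
        if id(eval_X) == id(data):
            marked.append(("__train__", eval_y))  # hypothetical sentinel
        else:
            marked.append((eval_X, eval_y))
    return marked
```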

@jmoralez
Collaborator

jmoralez commented Feb 8, 2021

Yeah that sounds fair. Is there a plan to create an lgb.DaskDataset?

@ffineis
Contributor

ffineis commented Feb 12, 2021

@jmoralez my guess is yes - I think so; check out #2302? @jameslamb has a running list of Dask features somewhere.

PR for this coming tomorrow night.

@jameslamb
Collaborator Author

yes but I haven't written it up. Will do that right now. I like what xgboost did with DaskDMatrix. https://github.com/dmlc/xgboost/blob/9a0399e8981b2279d921fe2312f7ab1b880fd3c3/python-package/xgboost/dask.py#L185
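
For context, a purely illustrative sketch of the DaskDMatrix-style pattern being referenced; LightGBM has no DaskDataset at this point, and every name below is hypothetical:

```python
class DaskDataset:
    """Hold references to Dask collections, deferring construction of
    per-worker lightgbm.Dataset objects until training time."""

    def __init__(self, data, label=None, weight=None):
        self.data = data      # Dask Array or Dask DataFrame of features
        self.label = label    # Dask collection of targets
        self.weight = weight  # optional sample weights
```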

@jameslamb
Collaborator Author

I've added #3944 to track the idea of a DaskDataset. That's a fairly invasive change, so I don't think it should be done before the 3.2.0 release (#3872).

@StrikerRUS
Collaborator

Closing for now due to the lack of active work on this feature and the absence of any open PRs.
