Turn on multinode XGBoost support in AutoML by default #9128

Closed
exalate-issue-sync bot opened this issue May 12, 2023 · 7 comments

Comments

@exalate-issue-sync

Let's turn on multinode XGBoost support by default in AutoML. Currently, you have to start H2O with a special option to enable it: pass the JVM system property -Dsys.ai.h2o.automl.xgboost.multinode.enabled=true on the command line when launching the H2O process, on every node of the H2O cluster.

We also need to update the documentation to reflect this:

@exalate-issue-sync
Author

Sebastien Poirier commented: Activation on multinode seems to work most of the time on Linux, but we still need more intensive testing to check its reliability:

  • after running it 8-10 times locally on Docker, it hung at least once (apparently a communication issue between nodes) with a “larger” dataset like {{diabetes}}; "larger" here means larger than the datasets commonly used in our basic test suite.
  • further tests are therefore required to confirm reliability: we could use the {{automlbenchmark}} app, configure it for multinode, and run several larger datasets against it on AWS.

If it passes the tests and reliability is confirmed, then nothing will prevent us from activating it by default.

Note that users who want to run XGBoost on multinode can still enable it when starting the nodes by passing the JVM parameter

{{-Dsys.ai.h2o.automl.xgboost.multinode.enabled=true}}
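For illustration only, here is a minimal Python sketch of how the launch command for one node might be assembled with that JVM parameter. The jar path and cluster name are placeholders rather than values from this issue, and the sketch assumes a plain `java -jar h2o.jar` launch; adapt it to however your nodes are actually started.

```python
import shlex

# JVM system property quoted in this thread.
MULTINODE_FLAG = "-Dsys.ai.h2o.automl.xgboost.multinode.enabled=true"


def h2o_node_command(jar_path="h2o.jar", cluster_name="automl-cluster"):
    """Build the argv for starting one H2O node with the flag set.

    jar_path and cluster_name are hypothetical placeholders.
    """
    # JVM options must precede -jar; arguments after the jar go to H2O itself.
    return ["java", MULTINODE_FLAG, "-jar", jar_path, "-name", cluster_name]


# The same command has to be used on every node of the cluster.
print(shlex.join(h2o_node_command()))
```

The property is placed before `-jar` because the JVM treats everything after the jar path as application arguments, so a `-D` option placed later would not be set as a system property.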

@exalate-issue-sync
Author

Ruslan Dautkhanov commented: Please let us know once the multi-node XGBoost instability is fixed.

Thank you Sebastien

@exalate-issue-sync
Author

Jan Sterba commented: fixed in PUBDEV-6793

@exalate-issue-sync
Author

Angela Bartz commented: FYI, the documentation task listed in the description was not addressed. I will ask [~accountid:5d1185d4f46aa30c271c7cc6] to fix this as part of her fix to PUBDEV-7141.

@exalate-issue-sync
Author

Sebastien Poirier commented: Reopened as we had to re-disable it for {{3.28.0.1}} due to issues with XGBoost/rabit.

@exalate-issue-sync
Author

Michal Kurka commented: This was reopened because we discovered an issue in our automated tests where XGBoost was crashing one of the H2O nodes. The bug seems to be in the XGBoost codebase, not the H2O codebase. We need to investigate further before we can make the switch. We are targeting one of the fix releases in 3.28.0.x.

@h2o-ops
Collaborator

h2o-ops commented May 14, 2023

JIRA Issue Migration Info

Jira Issue: PUBDEV-6502
Assignee: Jan Sterba
Reporter: Erin LeDell
State: Resolved
Fix Version: 3.28.0.3
Attachments: N/A
Development PRs: Available

Linked PRs from JIRA

#3547
#4242
#3853
