First invocation of XGBoost in external cluster mode fails when CV is enabled #6962

Closed
exalate-issue-sync bot opened this issue May 11, 2023 · 2 comments

Comments

@exalate-issue-sync

This appears to be caused by a concurrency issue: two CV models compete to start the external cluster, and one of them fails:

{noformat}
07-05 23:08:20.457 10.1.31.71:54321 1 FJ-2-11 INFO hex.tree.xgboost.remote.SteamExecutorStarter: Starting external cluster for model xgboost_fit1_cv_2.
07-05 23:08:21.291 10.1.31.71:54321 1 4236636-20 INFO water.default: GET /3/SteamMetrics, parms: {}
07-05 23:08:25.481 10.1.31.71:54321 1 4236636-21 INFO hex.tree.xgboost.remote.SteamExecutorStarter: Received message response 10.1.31.71:54321_startXGBoost_response
07-05 23:08:25.482 10.1.31.71:54321 1 FJ-2-11 INFO hex.tree.xgboost.remote.SteamExecutorStarter: Continuing to wait for external cluster to start.
07-05 23:08:25.693 10.1.31.71:54321 1 4236636-21 INFO water.default: GET /3/SteamMetrics, parms: {}
07-05 23:08:27.488 10.1.31.71:54321 1 4236636-21 INFO water.default: GET /3/SteamMetrics, parms: {}
07-05 23:08:29.501 10.1.31.71:54321 1 4236636-20 INFO water.default: GET /3/SteamMetrics, parms: {}
07-05 23:08:30.490 10.1.31.71:54321 1 4236636-20 INFO hex.tree.xgboost.remote.SteamExecutorStarter: Received message response 10.1.31.71:54321_startXGBoost_response
07-05 23:08:30.491 10.1.31.71:54321 1 FJ-2-11 INFO hex.tree.xgboost.remote.SteamExecutorStarter: Continuing to wait for external cluster to start.
07-05 23:08:31.487 10.1.31.71:54321 1 4236636-21 INFO water.default: GET /3/SteamMetrics, parms: {}
07-05 23:08:33.486 10.1.31.71:54321 1 4236636-20 INFO water.default: GET /3/SteamMetrics, parms: {}
07-05 23:08:35.489 10.1.31.71:54321 1 4236636-21 INFO water.default: GET /3/SteamMetrics, parms: {}
07-05 23:08:35.499 10.1.31.71:54321 1 4236636-20 INFO hex.tree.xgboost.remote.SteamExecutorStarter: Received message response 10.1.31.71:54321_startXGBoost_response
07-05 23:08:35.499 10.1.31.71:54321 1 FJ-2-11 INFO hex.tree.xgboost.remote.SteamExecutorStarter: Continuing to wait for external cluster to start.
07-05 23:08:38.252 10.1.31.71:54321 1 4236636-21 INFO hex.tree.xgboost.remote.SteamExecutorStarter: Received stop request check-if-xgboost-needed
07-05 23:08:48.289 10.1.31.71:54321 1 4236636-20 INFO water.default: GET /3/SteamMetrics, parms: {}
07-05 23:08:55.500 10.1.31.71:54321 1 4236636-21 INFO hex.tree.xgboost.remote.SteamExecutorStarter: Responding to stop request with allowed=true
07-05 23:08:55.501 10.1.31.71:54321 1 FJ-2-15 INFO hex.tree.xgboost.remote.SteamExecutorStarter: Starting external cluster for model xgboost_fit1_cv_1.
07-05 23:08:55.501 10.1.31.71:54321 1 4236636-21 INFO hex.tree.xgboost.remote.SteamExecutorStarter: Received message response 10.1.31.71:54321_startXGBoost_response
07-05 23:08:55.501 10.1.31.71:54321 1 FJ-2-15 INFO hex.tree.xgboost.remote.SteamExecutorStarter: Continuing to wait for external cluster to start.
07-05 23:08:55.501 10.1.31.71:54321 1 4236636-21 INFO hex.tree.xgboost.remote.SteamExecutorStarter: Received message response 10.1.31.71:54321_startXGBoost_response
07-05 23:08:55.502 10.1.31.71:54321 1 FJ-2-15 INFO hex.tree.xgboost.remote.SteamExecutorStarter: External cluster started at 10.1.179.98:54321/proxy/h2o-k8s/344.
07-05 23:08:55.502 10.1.31.71:54321 1 FJ-2-11 INFO water.default: Completing model xgboost_fit1_cv_2
{noformat}

{noformat}
java.lang.RuntimeException: Error while training XGBoost model
    at hex.tree.xgboost.XGBoost$XGBoostDriver.buildModelImpl(XGBoost.java:451) ~[h2o.jar:?]
    at hex.tree.xgboost.XGBoost$XGBoostDriver.buildModel(XGBoost.java:403) ~[h2o.jar:?]
    at hex.tree.xgboost.XGBoost$XGBoostDriver.computeImpl(XGBoost.java:389) ~[h2o.jar:?]
    at hex.ModelBuilder$Driver.compute2(ModelBuilder.java:252) ~[h2o.jar:?]
    at water.H2O$H2OCountedCompleter.compute(H2O.java:1677) ~[h2o.jar:?]
    at jsr166y.CountedCompleter.exec(CountedCompleter.java:468) ~[h2o.jar:?]
    at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263) [h2o.jar:?]
    at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:976) [h2o.jar:?]
    at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1479) [h2o.jar:?]
    at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104) [h2o.jar:?]
Caused by: java.lang.IllegalStateException: No response received from Steam.
    at hex.tree.xgboost.remote.SteamExecutorStarter.startCluster(SteamExecutorStarter.java:88) ~[h2o.jar:?]
    at hex.tree.xgboost.remote.SteamExecutorStarter.getRemoteExecutor(SteamExecutorStarter.java:56) ~[h2o.jar:?]
    at hex.tree.xgboost.XGBoost$XGBoostDriver.makeExecutor(XGBoost.java:409) ~[h2o.jar:?]
    at hex.tree.xgboost.XGBoost$XGBoostDriver.buildModelImpl(XGBoost.java:447) ~[h2o.jar:?]
    ... 9 more
{noformat}
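
For illustration only: this issue does not show the actual fix, so the following is a minimal sketch of one common way to avoid this kind of race, not the change that was merged. It assumes a hypothetical `SingleStartGuard` class (not part of H2O) that serializes the start sequence, so that when two CV models ask for the external cluster at the same time only one start request is issued and the other caller reuses the result. The sleep stands in for the slow external start; every name in the block is made up for the example.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch, not H2O code: all class and method names are illustrative.
class SingleStartGuard {
    private final AtomicInteger startRequests = new AtomicInteger();
    private String clusterAddress; // guarded by "this"

    // Only one caller at a time runs the start sequence; later callers reuse its result.
    synchronized String getOrStart(String modelId) throws InterruptedException {
        if (clusterAddress == null) {
            startRequests.incrementAndGet();
            Thread.sleep(50); // stands in for the slow external cluster start
            clusterAddress = "cluster-started-for-" + modelId;
        }
        return clusterAddress;
    }

    int startRequestCount() {
        return startRequests.get();
    }
}

class ExternalClusterStartSketch {
    // Simulates two CV models requesting the cluster concurrently and
    // returns how many start requests were actually issued.
    static int runTwoCvModels() throws Exception {
        SingleStartGuard guard = new SingleStartGuard();
        ExecutorService pool = Executors.newFixedThreadPool(2);
        CountDownLatch go = new CountDownLatch(1);
        for (String id : new String[] {"cv_1", "cv_2"}) {
            pool.submit(() -> {
                go.await(); // release both tasks at the same moment
                return guard.getOrStart(id);
            });
        }
        go.countDown();
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        return guard.startRequestCount();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("start requests: " + runTwoCvModels());
    }
}
```

In this sketch both tasks receive the same cluster address and the guard records a single start request, regardless of which task acquires the lock first.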

@h2o-ops
Collaborator

h2o-ops commented May 14, 2023

JIRA Issue Details

Jira Issue: PUBDEV-8757
Assignee: Michal Kurka
Reporter: Michal Kurka
State: Resolved
Fix Version: 3.36.1.4
Attachments: N/A
Development PRs: Available

@h2o-ops
Collaborator

h2o-ops commented May 14, 2023

Linked PRs from JIRA

#6242

@h2o-ops h2o-ops closed this as completed May 14, 2023