It seems like this is caused by a concurrency issue: two CV models compete to start the external cluster, and one of them fails:
{noformat}07-05 23:08:20.457 10.1.31.71:54321 1 FJ-2-11 INFO hex.tree.xgboost.remote.SteamExecutorStarter: Starting external cluster for model xgboost_fit1_cv_2.
07-05 23:08:21.291 10.1.31.71:54321 1 4236636-20 INFO water.default: GET /3/SteamMetrics, parms: {}
07-05 23:08:25.481 10.1.31.71:54321 1 4236636-21 INFO hex.tree.xgboost.remote.SteamExecutorStarter: Received message response 10.1.31.71:54321_startXGBoost_response
07-05 23:08:25.482 10.1.31.71:54321 1 FJ-2-11 INFO hex.tree.xgboost.remote.SteamExecutorStarter: Continuing to wait for external cluster to start.
07-05 23:08:25.693 10.1.31.71:54321 1 4236636-21 INFO water.default: GET /3/SteamMetrics, parms: {}
07-05 23:08:27.488 10.1.31.71:54321 1 4236636-21 INFO water.default: GET /3/SteamMetrics, parms: {}
07-05 23:08:29.501 10.1.31.71:54321 1 4236636-20 INFO water.default: GET /3/SteamMetrics, parms: {}
07-05 23:08:30.490 10.1.31.71:54321 1 4236636-20 INFO hex.tree.xgboost.remote.SteamExecutorStarter: Received message response 10.1.31.71:54321_startXGBoost_response
07-05 23:08:30.491 10.1.31.71:54321 1 FJ-2-11 INFO hex.tree.xgboost.remote.SteamExecutorStarter: Continuing to wait for external cluster to start.
07-05 23:08:31.487 10.1.31.71:54321 1 4236636-21 INFO water.default: GET /3/SteamMetrics, parms: {}
07-05 23:08:33.486 10.1.31.71:54321 1 4236636-20 INFO water.default: GET /3/SteamMetrics, parms: {}
07-05 23:08:35.489 10.1.31.71:54321 1 4236636-21 INFO water.default: GET /3/SteamMetrics, parms: {}
07-05 23:08:35.499 10.1.31.71:54321 1 4236636-20 INFO hex.tree.xgboost.remote.SteamExecutorStarter: Received message response 10.1.31.71:54321_startXGBoost_response
07-05 23:08:35.499 10.1.31.71:54321 1 FJ-2-11 INFO hex.tree.xgboost.remote.SteamExecutorStarter: Continuing to wait for external cluster to start.
07-05 23:08:38.252 10.1.31.71:54321 1 4236636-21 INFO hex.tree.xgboost.remote.SteamExecutorStarter: Received stop request check-if-xgboost-needed
07-05 23:08:48.289 10.1.31.71:54321 1 4236636-20 INFO water.default: GET /3/SteamMetrics, parms: {}
07-05 23:08:55.500 10.1.31.71:54321 1 4236636-21 INFO hex.tree.xgboost.remote.SteamExecutorStarter: Responding to stop request with allowed=true
07-05 23:08:55.501 10.1.31.71:54321 1 FJ-2-15 INFO hex.tree.xgboost.remote.SteamExecutorStarter: Starting external cluster for model xgboost_fit1_cv_1.
07-05 23:08:55.501 10.1.31.71:54321 1 4236636-21 INFO hex.tree.xgboost.remote.SteamExecutorStarter: Received message response 10.1.31.71:54321_startXGBoost_response
07-05 23:08:55.501 10.1.31.71:54321 1 FJ-2-15 INFO hex.tree.xgboost.remote.SteamExecutorStarter: Continuing to wait for external cluster to start.
07-05 23:08:55.501 10.1.31.71:54321 1 4236636-21 INFO hex.tree.xgboost.remote.SteamExecutorStarter: Received message response 10.1.31.71:54321_startXGBoost_response
07-05 23:08:55.502 10.1.31.71:54321 1 FJ-2-15 INFO hex.tree.xgboost.remote.SteamExecutorStarter: External cluster started at 10.1.179.98:54321/proxy/h2o-k8s/344.
07-05 23:08:55.502 10.1.31.71:54321 1 FJ-2-11 INFO water.default: Completing model xgboost_fit1_cv_2{noformat}
{noformat}java.lang.RuntimeException: Error while training XGBoost model
at hex.tree.xgboost.XGBoost$XGBoostDriver.buildModelImpl(XGBoost.java:451) ~[h2o.jar:?]
at hex.tree.xgboost.XGBoost$XGBoostDriver.buildModel(XGBoost.java:403) ~[h2o.jar:?]
at hex.tree.xgboost.XGBoost$XGBoostDriver.computeImpl(XGBoost.java:389) ~[h2o.jar:?]
at hex.ModelBuilder$Driver.compute2(ModelBuilder.java:252) ~[h2o.jar:?]
at water.H2O$H2OCountedCompleter.compute(H2O.java:1677) ~[h2o.jar:?]
at jsr166y.CountedCompleter.exec(CountedCompleter.java:468) ~[h2o.jar:?]
at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263) [h2o.jar:?]
at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:976) [h2o.jar:?]
at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1479) [h2o.jar:?]
at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104) [h2o.jar:?]
Caused by: java.lang.IllegalStateException: No response received from Steam.
at hex.tree.xgboost.remote.SteamExecutorStarter.startCluster(SteamExecutorStarter.java:88) ~[h2o.jar:?]
at hex.tree.xgboost.remote.SteamExecutorStarter.getRemoteExecutor(SteamExecutorStarter.java:56) ~[h2o.jar:?]
at hex.tree.xgboost.XGBoost$XGBoostDriver.makeExecutor(XGBoost.java:409) ~[h2o.jar:?]
at hex.tree.xgboost.XGBoost$XGBoostDriver.buildModelImpl(XGBoost.java:447) ~[h2o.jar:?]
... 9 more{noformat}
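To make the suspected race easier to reason about, here is a minimal sketch of one way to serialize cluster startup so that concurrent CV models share a single start request instead of each waiting for its own response. It is an illustration under assumptions, not the actual `SteamExecutorStarter` code or the fix that was shipped: the class `SingleStartSketch`, its methods, and the returned string are all hypothetical, and the call to Steam is replaced by a short sleep.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch; names and behavior are assumptions, not H2O code. */
final class SingleStartSketch {
    private final Object lock = new Object();
    private final AtomicInteger startCount = new AtomicInteger();
    private String clusterUrl; // guarded by lock

    /** Returns the shared cluster URL, starting the cluster at most once. */
    String getOrStartCluster(String modelId) {
        synchronized (lock) {
            if (clusterUrl == null) {
                // Only the first caller sends a start request; later callers
                // block on the lock and then reuse the stored result.
                clusterUrl = startCluster(modelId);
            }
            return clusterUrl;
        }
    }

    /** Stand-in for the slow start request/response round trip. */
    private String startCluster(String modelId) {
        startCount.incrementAndGet();
        try {
            Thread.sleep(50);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while starting cluster", e);
        }
        return "cluster-requested-by-" + modelId;
    }

    /** Number of start requests actually issued. */
    int startCount() {
        return startCount.get();
    }

    /** Simulates several CV models asking for the executor concurrently. */
    static List<String> runConcurrently(SingleStartSketch starter, int models) {
        ExecutorService pool = Executors.newFixedThreadPool(models);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (int i = 1; i <= models; i++) {
                final String modelId = "xgboost_fit1_cv_" + i;
                Callable<String> task = () -> starter.getOrStartCluster(modelId);
                futures.add(pool.submit(task));
            }
            List<String> urls = new ArrayList<>();
            for (Future<String> future : futures) {
                urls.add(future.get());
            }
            return urls;
        } catch (Exception e) {
            throw new IllegalStateException("Concurrent start failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        SingleStartSketch starter = new SingleStartSketch();
        List<String> urls = runConcurrently(starter, 2);
        System.out.println(urls + " starts=" + starter.startCount());
    }
}
```

With the lock in place, both simulated CV models receive the same cluster URL and only one start request is issued, whichever thread gets there first. Whether the real code can hold a lock across the Steam round trip is a separate design question that this sketch does not answer.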
Jira Issue: PUBDEV-8757
Assignee: Michal Kurka
Reporter: Michal Kurka
State: Resolved
Fix Version: 3.36.1.4
Attachments: N/A
Development PRs: Available