Ensure external backend (REST API) is stopped in automatic mode if the Spark app is killed (avoid zombie clusters) #5374

exalate-issue-sync · 2023-05-22T19:22:33Z

Basically, re-implement the behaviour of the watchdog client, but via REST.
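The following is a minimal sketch of what a REST-based watchdog could look like. It assumes a heartbeat protocol, in which the client (the Spark driver) periodically pings the H2O backend and the backend stops itself once the pings stop arriving, for example after the Spark app is killed with `kill -9`. Per the discussion below, such a check would apply only to the automatic start mode. The class `HeartbeatWatchdog`, its parameters, and the `on_timeout` callback are hypothetical illustrations, not the Sparkling Water or H2O API.

```python
import threading
import time


class HeartbeatWatchdog:
    """Hypothetical backend-side watchdog (not the Sparkling Water API).

    Invokes `on_timeout` (e.g. "stop the H2O cluster") once no heartbeat has
    been received from the client for longer than `timeout_s` seconds.
    """

    def __init__(self, timeout_s, on_timeout, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.on_timeout = on_timeout
        self.clock = clock          # injectable so the logic can be tested
        self.last_beat = clock()    # treat construction as the first heartbeat
        self.stopped = False
        self._lock = threading.Lock()

    def heartbeat(self):
        # In a real backend this would be triggered by a (hypothetical)
        # REST endpoint that the Spark driver calls periodically.
        with self._lock:
            self.last_beat = self.clock()

    def check(self):
        # Called periodically on the backend side. Stops the cluster at most
        # once and returns True if the cluster has been stopped.
        with self._lock:
            if not self.stopped and self.clock() - self.last_beat > self.timeout_s:
                self.stopped = True
                self.on_timeout()
            return self.stopped
```

Injecting the clock keeps the timeout logic deterministic in tests, and the default `time.monotonic` is unaffected by wall-clock adjustments on the backend host.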

exalate-issue-sync · 2023-05-22T19:22:36Z

Ruslan Dautkhanov commented: Do I understand correctly that this Jira implies we would need to restart the backend cluster every time a SW client reconnects?

We have multiple SW connections to the same backend/external H2O cluster, each Spark application with its own lifecycle. If I understand this upcoming change correctly, it will break some of the workflows in which we use H2O / SW.

exalate-issue-sync · 2023-05-22T19:22:37Z

Jakub Hava commented: No, it means that when Spark is killed, the external H2O backend is stopped as well, to avoid running zombie H2O clusters. It affects only automatic cluster start, not manual. Users of automatic mode want the Spark and H2O apps tied together and want to ensure that if one part is killed (e.g. with kill -9), the other is stopped as well.

exalate-issue-sync · 2023-05-22T19:22:39Z

Ruslan Dautkhanov commented: Thanks, Kuba. I understand that now.

Is the “automatic cluster start” a new feature coming in 3.28? That seems interesting. Where can I read more about it?

exalate-issue-sync · 2023-05-22T19:22:41Z

Jakub Hava commented: No problem. Nope, it has been there almost since the beginning of the external backend. We are just trying to ensure feature parity with the original solution via the REST API. More info can be found here: http://docs.h2o.ai/sparkling-water/2.4/latest-stable/doc/deployment/backends.html?highlight=backends#automatic-mode-of-external-backend

exalate-issue-sync · 2023-05-22T19:22:42Z

Ruslan Dautkhanov commented: Got it. Thanks for the link. Now I remember why we can’t use automatic mode. First, we allow non-preemptable YARN resource queues only for service accounts, not for regular users, and the H2O cluster doesn’t handle YARN preemption well. Also, we normally run a multi-tenant H2O cluster (multiple SW users connect to the same H2O backend cluster). It would be much easier if SW supported dynamic allocation one day, and perhaps if H2O could survive losing some of its nodes / YARN containers to YARN preemption.

exalate-issue-sync · 2023-05-22T19:22:44Z

Jakub Hava commented: Depends on https://0xdata.atlassian.net/browse/PUBDEV-7096

DinukaH2O · 2023-05-23T12:30:54Z

JIRA Issue Migration Info

Jira Issue: SW-1722
Assignee: Jakub Hava
Reporter: Jakub Hava
State: Resolved
Fix Version: 3.28.0.1-1
Attachments: N/A
Development PRs: Available

Linked PRs from JIRA

#1646

hasithjp · 2023-05-29T15:48:10Z

JIRA Issue Migration Info Cont'd

Jira Issue Created Date: 2019-11-19T15:41:32.118-0800

DinukaH2O assigned jakubhava May 23, 2023

DinukaH2O closed this as completed May 23, 2023

DinukaH2O added the fixVersion/3.28.0.1-1 label May 23, 2023
