Ensure external backend (REST API) is stopped in automatic mode if the Spark app is killed (avoid zombie clusters) #5374

Closed
exalate-issue-sync bot opened this issue May 22, 2023 · 8 comments

@exalate-issue-sync

Basically, re-implement the behaviour used by the watchdog client, but via REST.

@exalate-issue-sync
Author

Ruslan Dautkhanov commented: Do I understand correctly that this Jira implies we would need to restart the backend cluster every time a SW client reconnects?

We have multiple SW connections to the same backend/external H2O cluster, each a Spark application with its own lifecycle. If I understand the upcoming change correctly, it will break some of the workflows we use with H2O / SW.

@exalate-issue-sync
Author

Jakub Hava commented: No, it means that when Spark is killed, the external H2O backend is stopped as well, to avoid leaving zombie H2O clusters running. It affects only automatic cluster start, not manual. Users of automatic mode want the Spark and H2O apps tied together and want to ensure that if one part is killed (e.g. with kill -9), the other is stopped as well.
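
Editorial aside: one generic way to get this "stop when the other side dies" behaviour, even after a kill -9 where the client cannot send a shutdown request, is a heartbeat timeout on the backend side. The sketch below is only an illustration under that assumption. It is not taken from Sparkling Water, and `HeartbeatWatchdog`, `FakeClock`, and `simulate` are hypothetical names introduced here.

```python
import time


class HeartbeatWatchdog:
    """Hypothetical backend-side watchdog: stop once the client goes silent."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock
        self._last_beat = clock()
        self.stopped = False

    def beat(self):
        """Record a check-in from the client (e.g. triggered by a REST call)."""
        if not self.stopped:
            self._last_beat = self._clock()

    def check(self):
        """Run periodically on the backend; mark it stopped after a long silence."""
        if not self.stopped and self._clock() - self._last_beat > self.timeout_s:
            self.stopped = True
        return self.stopped


class FakeClock:
    """Manually advanced clock so the simulation does not need to sleep."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


def simulate():
    clock = FakeClock()
    watchdog = HeartbeatWatchdog(timeout_s=10, clock=clock)
    states = []
    # Client alive: it checks in every 5 seconds.
    for _ in range(3):
        clock.advance(5)
        watchdog.beat()
        states.append(watchdog.check())
    # Client killed: no more check-ins arrive.
    for _ in range(3):
        clock.advance(5)
        states.append(watchdog.check())
    return states


if __name__ == "__main__":
    print(simulate())
```

In this simulation the backend keeps running while check-ins arrive and is marked stopped only once the silence exceeds the timeout. A manually started, shared backend would simply not run such a watchdog, which matches the point that only automatic mode is affected.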

@exalate-issue-sync
Author

Ruslan Dautkhanov commented: Thanks, Kuba. I understand that now.

Is the “automatic cluster start” a new feature that’s coming up in 3.28? That seems interesting. Where can I read more about it?

@exalate-issue-sync
Author

Jakub Hava commented: No problem. Nope, it has been there almost from the beginning of the external backend. We are just trying to ensure feature parity with the original solution via the REST API. More info can be found here: http://docs.h2o.ai/sparkling-water/2.4/latest-stable/doc/deployment/backends.html?highlight=backends#automatic-mode-of-external-backend

@exalate-issue-sync
Author

Ruslan Dautkhanov commented: Got it. Thanks for the link. Now I remember why we can’t use automatic mode. For one thing, we allow non-preemptable YARN resource queues only for service accounts, not for regular users, and an H2O cluster doesn’t like YARN preemption… Also, we normally run a multi-tenant H2O cluster (multiple SW users connect to the same H2O backend cluster). It would be much easier if SW supported dynamic allocation one day, and perhaps H2O could then survive losing some of its nodes / YARN containers to YARN preemption.

@exalate-issue-sync
Author

Jakub Hava commented: Depends on https://0xdata.atlassian.net/browse/PUBDEV-7096

@DinukaH2O

JIRA Issue Migration Info

Jira Issue: SW-1722
Assignee: Jakub Hava
Reporter: Jakub Hava
State: Resolved
Fix Version: 3.28.0.1-1
Attachments: N/A
Development PRs: Available

Linked PRs from JIRA

#1646

@hasithjp
Member

JIRA Issue Migration Info Cont'd

Jira Issue Created Date: 2019-11-19T15:41:32.118-0800
