Ensure external backend (REST API) is stopped in automatic mode if the Spark app is killed (avoid zombie clusters) #5374
Comments
Ruslan Dautkhanov commented: Do I understand correctly that this Jira implies we would need to restart the backend cluster every time a SW client reconnects? We have multiple SW connections to the same external H2O backend cluster, each one a Spark application with its own lifecycle. If I understand the upcoming change correctly, it will break some of the workflows we use with H2O / SW.
Jakub Hava commented: No, it means that when Spark is killed, the external H2O backend is stopped as well, to avoid leaving zombie H2O clusters running. It affects only automatic cluster start, not manual. Users of automatic mode want the Spark and H2O apps tied together and want to ensure that if one part is killed (for example with kill -9), the other is stopped as well.
Ruslan Dautkhanov commented: Thanks Kuba, I understand that now. Is the “automatic cluster start” a new feature that’s coming up in 3.28? That seems interesting. Where can I read more about it?
Jakub Hava commented: No problem. Nope, it has been there almost from the beginning of the external backend. We are just trying to ensure feature parity with the original solution via the REST API. More info can be found here: http://docs.h2o.ai/sparkling-water/2.4/latest-stable/doc/deployment/backends.html?highlight=backends#automatic-mode-of-external-backend
Ruslan Dautkhanov commented: Got it. Thanks for the link. Now I remember why we can’t use automatic mode. For one thing, we allow non-preemptable YARN resource queues only for service accounts, not for regular users, and an H2O cluster doesn’t like YARN preemption… Also, we normally run a multi-tenant H2O cluster (multiple SW users connect to the same H2O backend cluster). It would be much easier if SW supported dynamic allocation one day, and perhaps H2O could then survive losing some of its nodes / YARN containers to YARN preemption.
Jakub Hava commented: Depends on https://0xdata.atlassian.net/browse/PUBDEV-7096
JIRA Issue Migration Info Jira Issue: SW-1722 Linked PRs from JIRA
JIRA Issue Migration Info Cont'd Jira Issue Created Date: 2019-11-19T15:41:32.118-0800
Basically, re-implement the behaviour used by the watchdog client, but via REST.
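For readers unfamiliar with the watchdog idea, here is a minimal sketch of a heartbeat-timeout watchdog. It is not the Sparkling Water implementation: the class name `HeartbeatWatchdog`, the `stop_cluster` callback and the `timeout_s` parameter are all hypothetical, and it assumes the Spark side pings the backend periodically (for example over REST) and that `stop_cluster` wraps whatever call actually shuts the H2O cluster down.

```python
import time
from typing import Callable


class HeartbeatWatchdog:
    """Hypothetical sketch: stop a cluster once client heartbeats go silent."""

    def __init__(
        self,
        timeout_s: float,
        stop_cluster: Callable[[], None],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.timeout_s = timeout_s        # max allowed silence, in seconds
        self.stop_cluster = stop_cluster  # assumed to shut the backend down
        self.clock = clock                # injectable so the logic is testable
        self.last_beat = clock()
        self.stopped = False

    def heartbeat(self) -> None:
        # Called each time the Spark application pings the backend.
        self.last_beat = self.clock()

    def check(self) -> bool:
        # Stop the cluster at most once, when the silence exceeds the timeout.
        # Returns True once the cluster has been stopped.
        if not self.stopped and self.clock() - self.last_beat > self.timeout_s:
            self.stop_cluster()
            self.stopped = True
        return self.stopped
```

In this sketch `check()` would be called periodically on the backend side. A Spark app killed with kill -9 simply stops sending heartbeats, so the cluster is stopped after `timeout_s` seconds without relying on any clean shutdown hook in the client.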