Restarting h2o cluster makes all Spark Sessions connected to it unusable #4422

Closed
exalate-issue-sync bot opened this issue May 22, 2023 · 10 comments

@exalate-issue-sync

This is a long-standing problem.

If we have a Spark session (SW) connected to an H2O backend cluster and that cluster restarts, the Spark session (or multiple Spark sessions) fails, even though it may not need any H2O / SW functionality for its downstream processing.

All users are currently forced to restart their Spark sessions, again even though they don't need SW / H2O connectivity.

This issue always reproduces for us.

Users get the following error when this happens:

```
org.apache.thrift.transport.TTransportException – Broken Pipe
```

It would be nice if shutting down the H2O cluster didn't result in an error for all H2O-connected users.

@exalate-issue-sync

Ruslan Dautkhanov commented: It seems specific to Zeppelin.

We can’t reproduce this outside of Zeppelin.

@exalate-issue-sync

Jakub Hava commented: Just for reference (even though this is the opposite direction):
We worked with Michal Kurka on a solution that helps with the opposite direction: the H2O external cluster stays in a consistent state when a new Spark session (H2O client) connects to it. That has been in the released versions for a while already.

This issue is the opposite direction → we want to use the same H2O client to connect to a restarted H2O external cluster.

I did a simple test. When the external backend is killed, the client discovers the unhealthy state of the cluster and kills the H2O client on the Spark session with “Exiting! External H2O cloud not healthy!!”

I think that what we are looking for here is the option
`spark.ext.h2o.external.kill.on.unhealthy=false`

This ensures that if the external H2O cloud is stopped, the Spark session can continue to work. You can later start the cluster nodes again and the client will be able to reconnect to them.
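
For reference, here is a minimal sketch (not from this thread) of where that flag would be set when building a Spark session for Sparkling Water's external backend in manual start mode; the cloud name and node address are placeholders:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.h2o.H2OContext

val spark = SparkSession.builder()
  .appName("sw-external-backend")
  // External H2O backend that is started and managed outside of Sparkling Water:
  .config("spark.ext.h2o.backend.cluster.mode", "external")
  .config("spark.ext.h2o.external.start.mode", "manual")
  .config("spark.ext.h2o.cloud.name", "my-h2o-cluster")              // placeholder cloud name
  .config("spark.ext.h2o.cloud.representative", "h2o-node-1:54321")  // placeholder node address
  // Keep the Spark session alive when the external H2O cloud becomes unhealthy:
  .config("spark.ext.h2o.external.kill.on.unhealthy", "false")
  .getOrCreate()

// Attach the H2O client for this Spark session to the external cluster.
val hc = H2OContext.getOrCreate(spark)
```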

@exalate-issue-sync

Jakub Hava commented: Also note: running h2oContext.stop() in external manual mode does not kill the external cluster, but the H2O client, and therefore also the Spark driver (the H2O client can't be restarted). If you need to stop the external backend in manual mode, you need to manage the cluster directly (that is the point of manual mode → it is not managed by SW, but SW can connect to it).
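
Since the thread doesn't show how to manage the cluster directly, here is a hedged sketch of one way to shut an external cluster down from outside Sparkling Water, assuming H2O's REST shutdown endpoint (POST /3/Shutdown); the node address is a placeholder:

```scala
import java.net.{HttpURLConnection, URL}

// Any node of the external H2O cluster can receive the shutdown request (placeholder address).
val url = new URL("http://h2o-node-1:54321/3/Shutdown")
val conn = url.openConnection().asInstanceOf[HttpURLConnection]
conn.setRequestMethod("POST")
conn.setDoOutput(true)
conn.getOutputStream.close()   // send an empty request body
println(s"Shutdown request returned HTTP ${conn.getResponseCode}")
conn.disconnect()
```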

@exalate-issue-sync

Ruslan Dautkhanov commented: Thank you @jakub - we will give spark.ext.h2o.external.kill.on.unhealthy=false a try and let you know if this fixes things.

@exalate-issue-sync

Ruslan Dautkhanov commented: Users confirm that spark.ext.h2o.external.kill.on.unhealthy=false solves the problem.

It would be great to make false the default.

Thank you.

@exalate-issue-sync

Jakub Hava commented: Hi Ruslan,
great to hear this fixed the issue. I was thinking about the default value.

We can’t change the default value, but I think that disabling this check completely for manual start mode makes sense. In a manual cluster, Spark & H2O are really separate, and killing one should not affect the other. We will still produce warning logs that the external backend is not healthy, as we need to log this information, but the user won’t be affected.

@exalate-issue-sync

Ruslan Dautkhanov commented: Awesome - that’s even better!

Thank you Kuba

@exalate-issue-sync

Ruslan Dautkhanov commented: Jakub, the story hasn’t ended here for us.

Since we started using spark.ext.h2o.external.kill.on.unhealthy=false, we see that if a user is connected to a cluster, we can’t restart that cluster.

I will send more details in an email.

Thanks.

@DinukaH2O

JIRA Issue Migration Info

Jira Issue: SW-1337
Assignee: Jakub Hava
Reporter: Ruslan Dautkhanov
State: Resolved
Fix Version: 3.26.2
Attachments: N/A
Development PRs: Available

Linked PRs from JIRA

#1303

@hasithjp

JIRA Issue Migration Info Cont'd

Jira Issue Created Date: 2019-06-11T21:40:31.823-0700
