Restarting h2o cluster makes all Spark Sessions connected to it unusable #4422

Closed
exalate-issue-sync bot opened this issue May 22, 2023 · 10 comments

@exalate-issue-sync

This is a long-standing problem.

If we have a Spark session (SW) connected to an H2O backend cluster and that cluster restarts, the Spark session (or multiple Spark sessions) fails, even though it may not need any H2O / SW functionality for its downstream processing.

All users are currently forced to restart their Spark sessions, again even though they don't need SW / H2O connectivity.

This issue always reproduces for us.

Users get the following error when this happens:

```
org.apache.thrift.transport.TTransportException – Broken Pipe
```

It would be nice if shutting down the H2O cluster didn't result in an error for all H2O-connected users.

@exalate-issue-sync

Ruslan Dautkhanov commented: It seems specific to Zeppelin.

We can’t reproduce this outside of Zeppelin.

@exalate-issue-sync

Jakub Hava commented: Just for reference (even though this is the opposite direction):
We worked with Michal Kurka on a solution that helps with the opposite direction: the H2O external cluster stays in a consistent state when a new Spark session (H2O client) connects to it. That has been in the released versions for a while already.

This issue is the opposite direction → we want to use the same H2O client to connect to a restarted H2O external cluster.

I did a simple test. When the external backend is killed, the client discovers the unhealthy state of the cluster and kills the H2O client on the Spark session with “Exiting! External H2O cloud not healthy!!”

I think that what we are looking for here is the option
`spark.ext.h2o.external.kill.on.unhealthy=false`

This ensures that if the external H2O cloud is stopped, the Spark session can continue to work. You can later start the cluster nodes again and the client will be able to reconnect to them.
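
For reference, here is a minimal sketch (not from this thread) of where that flag would be set when building a Spark session for Sparkling Water's external backend in manual start mode; the cloud name and node address are placeholders:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.h2o.H2OContext

val spark = SparkSession.builder()
  .appName("sw-external-backend")
  // External H2O backend that is started and managed outside of Sparkling Water:
  .config("spark.ext.h2o.backend.cluster.mode", "external")
  .config("spark.ext.h2o.external.start.mode", "manual")
  .config("spark.ext.h2o.cloud.name", "my-h2o-cluster")              // placeholder cloud name
  .config("spark.ext.h2o.cloud.representative", "h2o-node-1:54321")  // placeholder node address
  // Keep the Spark session alive when the external H2O cloud becomes unhealthy:
  .config("spark.ext.h2o.external.kill.on.unhealthy", "false")
  .getOrCreate()

// Attach the H2O client for this Spark session to the external cluster.
val hc = H2OContext.getOrCreate(spark)
```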

@exalate-issue-sync

Jakub Hava commented: Also note: running h2oContext.stop() in external manual mode does not kill the external cluster, but the H2O client, and therefore also the Spark driver (the H2O client can't be restarted). If you need to stop the external backend in manual mode, you need to manage the cluster directly (that is the point of manual mode → it is not managed by SW, but SW can connect to it).
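
Since the thread doesn't show how to manage the cluster directly, here is a hedged sketch of one way to shut an external cluster down from outside Sparkling Water, assuming H2O's REST shutdown endpoint (POST /3/Shutdown); the node address is a placeholder:

```scala
import java.net.{HttpURLConnection, URL}

// Any node of the external H2O cluster can receive the shutdown request (placeholder address).
val url = new URL("http://h2o-node-1:54321/3/Shutdown")
val conn = url.openConnection().asInstanceOf[HttpURLConnection]
conn.setRequestMethod("POST")
conn.setDoOutput(true)
conn.getOutputStream.close()   // send an empty request body
println(s"Shutdown request returned HTTP ${conn.getResponseCode}")
conn.disconnect()
```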

@exalate-issue-sync

Ruslan Dautkhanov commented: Thank you @jakub - we will give spark.ext.h2o.external.kill.on.unhealthy=false a try and let you know if this fixes things.

@exalate-issue-sync

Ruslan Dautkhanov commented: Users confirm that spark.ext.h2o.external.kill.on.unhealthy=false solves the problem.

It would be great to make false the default.

Thank you.

@exalate-issue-sync

Jakub Hava commented: Hi Ruslan,
great to hear this fixed the issue. I was thinking about the default value.

We can’t change the default value, but I think that disabling this check completely for manual start mode makes sense. In a manual cluster, Spark & H2O are really separate, and killing one should not affect the other. We will still produce warning logs that the external backend is not healthy, as we need to log this information, but the user won’t be affected.

@exalate-issue-sync

Ruslan Dautkhanov commented: Awesome - that’s even better!

Thank you Kuba

@exalate-issue-sync

Ruslan Dautkhanov commented: Jakub, the story hasn’t ended here for us.

Since we started using spark.ext.h2o.external.kill.on.unhealthy=false, we see that if a user is connected to a cluster, we can’t restart that cluster.

I will send more details in an email.

Thanks.

@DinukaH2O

JIRA Issue Migration Info

Jira Issue: SW-1337
Assignee: Jakub Hava
Reporter: Ruslan Dautkhanov
State: Resolved
Fix Version: 3.26.2
Attachments: N/A
Development PRs: Available

Linked PRs from JIRA

#1303

@hasithjp

JIRA Issue Migration Info Cont'd

Jira Issue Created Date: 2019-06-11T21:40:31.823-0700
