Restarting h2o cluster makes all Spark Sessions connected to it unusable #4422
Comments
Ruslan Dautkhanov commented: It seems specific to Zeppelin. We can’t reproduce this outside of Zeppelin.
Jakub Hava commented: Just for reference (even though this is the opposite direction): we want to use the same H2O client to connect to a restarted external H2O cluster. I did a simple test. When the external backend is killed, the client discovers the unhealthy state of the cluster and kills the client on the Spark session with “Exiting! External H2O cloud not healthy!!”. I think that what we are looking for here is the option spark.ext.h2o.external.kill.on.unhealthy=false. This ensures that if the external H2O cloud is stopped, the Spark session can continue to work. You can later start the cluster nodes and the client will be able to reconnect to them.
Jakub Hava commented: Also note: running h2oContext.stop() in external manual mode does not kill the external cluster, but it does kill the H2O client and therefore also the Spark driver (the H2O client can’t be restarted). If you need to stop the external backend in manual mode, you need to manage the cluster directly (that is the point of manual mode: the cluster is not managed by Sparkling Water, but Sparkling Water can connect to it).
Ruslan Dautkhanov commented: Thank you @jakub - we will give spark.ext.h2o.external.kill.on.unhealthy=false a try and let you know if this fixes things.
Ruslan Dautkhanov commented: Users confirm that spark.ext.h2o.external.kill.on.unhealthy=false solves the problem. It would be great to make false the default. Thank you.
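For reference, the workaround property can be passed like any other Spark configuration at submit time. A minimal sketch, assuming a typical external-backend submission (the kill-on-unhealthy key is taken from this thread; the backend-mode property and application jar name are illustrative assumptions):

```shell
# Sketch: submit a Sparkling Water job with the unhealthy-kill check disabled,
# so the Spark session can survive a restart of the external H2O cluster.
# spark.ext.h2o.external.kill.on.unhealthy comes from this thread; the
# backend-mode property and jar name below are assumptions for illustration.
spark-submit \
  --conf spark.ext.h2o.backend.cluster.mode=external \
  --conf spark.ext.h2o.external.kill.on.unhealthy=false \
  sparkling-water-app.jar
```

The same properties can equally be set on the SparkConf before the H2OContext is created.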
Jakub Hava commented: Hi Ruslan, we can’t change the default value, but I think that disabling this check completely for the manual start mode makes sense. In a manual cluster, Spark and H2O are really separate, and killing one should not affect the other. We will still produce warning logs that the external backend is not healthy, as we need to log this information, but the user won’t be affected.
Ruslan Dautkhanov commented: Awesome - that’s even better! Thank you Kuba
Ruslan Dautkhanov commented: Jakub, the story hasn’t ended here for us. When we started using spark.ext.h2o.external.kill.on.unhealthy=false, we see that if a user is connected… I will send more details in an email. Thanks.
JIRA Issue Migration Info
Jira Issue: SW-1337
Jira Issue Created Date: 2019-06-11T21:40:31.823-0700
Linked PRs from JIRA
This is a long-standing problem.
If we have a Spark session (Sparkling Water) connected to an H2O backend cluster, and the cluster restarts,
that Spark session (or multiple Spark sessions) fails, even though it may not need
H2O / SW functionality for some downstream processing.
All users are currently forced to restart their Spark sessions, again even though they don't need
SW / H2O connectivity.
This issue always reproduces for us.
Users see the following error when this happens:
{code:java}
org.apache.thrift.transport.TTransportException – Broken Pipe
{code}
It would be nice if shutting down the H2O cluster didn’t result in an error for all H2O connected users.