[ERROR] Executor without H2O instance discovered, killing the cloud! #32
Hi Dom-nik, In Sparkling Water we try to discover all Spark executors at the start of H2OContext and start H2O on them. But if Spark for some reason launches a new executor, it does not have an H2O instance running, which then leads to an error during computation. So what we do in this case is throw an exception on Spark topology changes and kill the cloud. We are working on a new Sparkling Water architecture which should solve these issues.
Hi MadMan0708, Thanks for the prompt reply! From what you are saying I get the impression that running H2O on Hadoop is a better idea than treating Sparkling Water as an H2O backend. Am I right?
Hi Dom-nik, Regarding H2O & Sparkling Water: it depends on your needs. If you don't use Spark to do, for example, some feature engineering or data munging, then there is probably no reason for you to use Sparkling Water. However, if you already use Spark in your existing application, then I recommend Sparkling Water. In most cases it starts fine (we have problems on clusters with 60 or more nodes, and we are working on a new solution to this problem as fast as we can). Also, there is a tuning guide which should help you set up Spark so that it works better with Sparkling Water: https://github.com/h2oai/sparkling-water/blob/master/DEVEL.md#SparklingWaterTuning Can you please try to start the H2OContext using your spark-submit one or two more times? Does this happen all over again?
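For reference, the kind of settings the linked tuning guide discusses can be sketched as extra spark-submit flags. This is only an illustration based on that guide; the property values below are assumptions, not recommendations made in this thread:

```shell
# Illustrative sketch: Spark properties commonly tuned for Sparkling Water so
# that all executors register before H2O forms its cloud. Values are
# assumptions; consult the tuning guide linked above for your cluster.
spark-submit \
  --class water.SparklingWaterDriver \
  --master yarn-client \
  --num-executors 3 \
  --conf spark.scheduler.minRegisteredResourcesRatio=1 \
  --conf spark.task.maxFailures=1 \
  --conf spark.locality.wait=3000 \
  /opt/sparkling-water/sparkling-water-1.5.14/assembly/build/libs/*.jar
```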
Thank you for this answer, but it seems that the command you provided is exactly the same as the one I posted at the beginning :] I'm still willing to test out your solution, but I thought I'd also give you some context: what I actually wanted to achieve was to have a single H2O instance that would serve as a backend for Python- and R-based H2O calls, something like a server for many users. I'm not sure if that's the way H2O was meant to be used. Is it? I was also considering using JupyterHub as the main GUI for end users and giving them access to H2O via Python and R instead of Flow, as it seems there is no multi-user operation built into it.
Hi, H2O is perfect for what you want to achieve. You can start an H2O cloud of arbitrary size and then access it using our R/Python/Java/REST API. You can make one call via the R API and another via the Python API. I'm not the main Flow developer; let me ask our team regarding the Flow question.
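As an illustration of the REST access mentioned here, a running H2O cloud can be queried over plain HTTP. This sketch assumes a cloud listening on the default H2O port 54321 on localhost:

```shell
# Hypothetical check against a running H2O node (default REST port 54321).
# The /3/Cloud endpoint reports the cloud name, size, and node health as JSON,
# which is a quick way to confirm all nodes joined the cloud.
curl -s http://localhost:54321/3/Cloud
```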
Hi,
Hi, thanks for trying! |
Ok, there you go. These are the logs for one run; they present the most common type of error I'm getting: Here is also a log from a different run that gave a different error; it occurred only once:
Hello MadMan0708, did you have a moment to take a look at the logs? |
Hi Dom-nik, I'll check the logs today and let you know. Thanks for your patience,
Thanks! Looking forward to any news! |
Hi Dom-nik, so after looking at the logs this is what I get: the H2O cluster of size 3 is successfully created (per the H2O executor logs in the YARN log), but it seems like the H2O client in the driver is not able to communicate with the rest of the cluster. There are two things you can do:
and Sparkling Water provides a configuration property for this. You can set this property, for example, as a Spark configuration property when starting sparkling-shell in the normal way. Please let me know if that helps! Kuba
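The name of the configuration property Kuba mentions did not survive in this thread, so as a sketch of the general pattern only (the property shown is an illustrative Sparkling Water option, not necessarily the one he meant):

```shell
# General pattern: pass a spark.ext.h2o.* property when launching
# sparkling-shell. The property name and value here are illustrative
# assumptions; substitute the property Kuba recommended.
bin/sparkling-shell --conf "spark.ext.h2o.cloud.timeout=60000"
```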
@madman0708 I ran into the same issue and tried all the tips provided in this conversation, but without success. I tried it on two different CDH clusters. I even get this error when I use a single-node Cloudera Quickstart VM (CDH 5.5.0, Spark 1.5.0, Sparkling Water 1.5.14). Could you @madman0708 confirm that Sparkling Water 1.5.14 works fine with CDH 5.5.X or Spark 1.5.X? Alternatively, can you provide the versions that should integrate smoothly?
We have some valuable debugging results. It seems that H2O doesn't support multihoming, which is quite a typical thing, as it is not supported by Hadoop in general. Context: we have our Cloudera Hadoop cluster deployed on specialized hardware called Big Data Appliance (BDA), an Oracle product. Multihoming is used in Big Data Appliance: cluster nodes communicate with each other via InfiniBand using their internal network (INTERNAL IP addresses), and they communicate with the rest of the P&G intranet using EXTERNAL IP addresses. CDH (and Hadoop in general) doesn't support multihoming (cluster nodes belonging to multiple networks). Multihoming is supported for some appliances (BDA being one of them), but our edge nodes are not within the BDA, which is a non-standard setup. So when you add non-BDA nodes you are outside the supported/recommended configuration from both the Oracle side and the Cloudera side. It is not that this is a sub-optimal setup; it is just that Hadoop and related technologies (unfortunately) have not really been designed with multi-homed networking in mind. This causes connectivity issues, as (according to a Cloudera expert):
This hypothesis was confirmed by running Sparkling Water directly on one of the cluster nodes: Do you have any comments to add? Do you plan to dig deeper into a case like this or is it totally outside your scope? |
Hi Dominik, is it possible to share the logs from the Spark run privately? My point is that if Spark is communicating (can see executors and send/receive messages), then in H2O we should follow the same communication paths. If not, we need to help H2O to share the same IP/port. You can try to specify
Hi Michal, Thanks for your reply. It seems that it got cut in the middle :] You can find a new batch of YARN logs here: sparkling.yarn.logs.27062016.tar.gz I tried running the application with
but it behaved exactly the same. I'm not 100% sure if the mask was specified correctly. |
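The exact mask flag was cut off above; Sparkling Water versions of this era expose network-mask properties, so here is a hedged sketch of what such a run might look like. The property names and the CIDR value are assumptions; verify them against the documentation for your Sparkling Water version:

```shell
# Assumed property names for restricting H2O to one network interface;
# replace 10.0.0.0/24 with the internal (InfiniBand) subnet of the cluster.
spark-submit \
  --class water.SparklingWaterDriver \
  --master yarn-client \
  --conf "spark.ext.h2o.node.network.mask=10.0.0.0/24" \
  --conf "spark.ext.h2o.client.network.mask=10.0.0.0/24" \
  /opt/sparkling-water/sparkling-water-1.5.14/assembly/build/libs/*.jar
```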
Just to close the case with some relevant info: there was some debugging that we've done with H2O, and a custom patch was developed (released with Sparkling Water 1.5.16). It enables the use of a new parameter Here's a way to run the tool so that it works:
e.g. |
Hi @Dom-nik, thank you again for writing up the outcome!
Despite applying all these settings, I am receiving the same error:
Here are my configs in the Notebook's kernel:
I am using Spark 2.2.0 with Sparkling Water 2.2.2. In the Spark app I clearly see that it started one driver and 10 executors, and (as you see) the number of executors is explicitly configured. Despite that, this annoying error simply doesn't allow H2O to run. I'll be very grateful for any ideas on how to run it.
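The kernel configuration itself was not preserved above. Assuming dynamic allocation or executor churn is the culprit (a common cause of this error in the internal backend), a minimal sketch of pinning the executor count would look like this; the jar name and values are illustrative:

```shell
# Sketch: disable dynamic allocation and fix the executor count so the
# H2O cloud topology cannot change after startup. Values and the jar
# name (your-app.jar) are illustrative placeholders.
spark-submit \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.executor.instances=10 \
  --conf spark.scheduler.minRegisteredResourcesRatio=1 \
  your-app.jar
```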
Hello @jakubhava, does Sparkling Water already support spark.dynamicAllocation.enabled=true? We want to use it on Spark, but scaling the cluster up and down is very important for us. Thanks.
Hi @idoshichor, in the internal backend this option is not allowed, and we think that it won't be available there for several technical reasons. If you need to use dynamic allocation, I would advise looking at the external backend solution.
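A hedged sketch of switching to the external backend mentioned here, using the property names documented for Sparkling Water 2.x; the auto-start mode is an assumption about what fits a YARN setup:

```shell
# Illustrative: run H2O as an external cluster instead of inside Spark
# executors, so dynamic allocation no longer kills the H2O cloud.
bin/sparkling-shell \
  --conf "spark.ext.h2o.backend.cluster.mode=external" \
  --conf "spark.ext.h2o.external.start.mode=auto"
```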
I'm getting the error mentioned in the title. No clue why.
The command I use to run Sparkling Water is:
spark-submit --class water.SparklingWaterDriver --master yarn-client --num-executors 3 --driver-memory 4g --executor-memory 2g --executor-cores 1 /opt/sparkling-water/sparkling-water-1.5.14/assembly/build/libs/*.jar
Full error stacktrace looks like this: