
[ERROR] Executor without H2O instance discovered, killing the cloud! #32

Closed
Dom-nik opened this issue May 16, 2016 · 21 comments

@Dom-nik

Dom-nik commented May 16, 2016

I'm getting the error mentioned in the title. No clue why.

The command I use to run Sparkling Water is: spark-submit --class water.SparklingWaterDriver --master yarn-client --num-executors 3 --driver-memory 4g --executor-memory 2g --executor-cores 1 /opt/sparkling-water/sparkling-water-1.5.14/assembly/build/libs/*.jar

Full error stacktrace looks like this:

16/05/16 09:24:15 ERROR LiveListenerBus: Listener anon1 threw an exception
java.lang.IllegalArgumentException: Executor without H2O instance discovered, killing the cloud!
at org.apache.spark.h2o.H2OContext$$anon$1.onExecutorAdded(H2OContext.scala:180)
at org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:58)
at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:56)
at org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:79)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1136)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63)
16/05/16 09:24:16 INFO BlockManagerMasterEndpoint: Registering block manager bda1node05.na.pg.com:17644 with 1060.0 MB RAM, BlockManagerId(4, bda1node05.na.pg.com, 17644)
Exception in thread "main" java.lang.RuntimeException: Cloud size under 3
at water.H2O.waitForCloudSize(H2O.java:1547)
at org.apache.spark.h2o.H2OContext.start(H2OContext.scala:223)
at org.apache.spark.h2o.H2OContext$.getOrCreate(H2OContext.scala:337)
at org.apache.spark.h2o.H2OContext$.getOrCreate(H2OContext.scala:363)
at water.SparklingWaterDriver$.main(SparklingWaterDriver.scala:38)
at water.SparklingWaterDriver.main(SparklingWaterDriver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

@jakubhava
Contributor

Hi Dom-nik,
this is caused by Spark launching a new executor after the initialisation of H2OContext.

In Sparkling Water we try to discover all Spark executors at the start of H2OContext and start H2O on them. But if Spark for some reason launches a new executor later, that executor does not have an H2O instance running, which then leads to errors during computation.

So what we do in this case is throw an exception on Spark topology changes and kill the cloud.
You can turn this listener off by setting spark.ext.h2o.topology.change.listener.enabled to false, but that still won't prevent the problem I described above (it's also explained here: #4)

We are working on a new sparkling-water architecture which should solve these issues.
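[Editor's note] The gist of that topology-change check can be sketched in plain Python. This is a toy model of the behaviour described above, not the actual Scala listener in H2OContext:

```python
# Toy sketch: H2OContext records the executors present at start-up
# (H2O runs only on those), and any executor that appears afterwards
# has no H2O node, so the listener kills the cloud.

class TopologyChangeListener:
    def __init__(self, executors_at_start):
        # Executors discovered when H2OContext started.
        self.h2o_executors = set(executors_at_start)

    def on_executor_added(self, executor_id):
        # A late-arriving executor has no H2O instance on it.
        if executor_id not in self.h2o_executors:
            raise RuntimeError(
                "Executor without H2O instance discovered, killing the cloud!")

listener = TopologyChangeListener(["exec-1", "exec-2", "exec-3"])
listener.on_executor_added("exec-2")      # already known: no problem
try:
    listener.on_executor_added("exec-4")  # launched after H2OContext init
    late_executor_detected = False
except RuntimeError:
    late_executor_detected = True
print(late_executor_detected)  # True
```

Disabling the listener only silences this exception; the late executor still has no H2O node on it.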

@Dom-nik
Author

Dom-nik commented May 16, 2016

Hi MadMan0708,

Thanks for prompt reply!
How can I set this parameter? Can it be treated as a valid workaround?

From what you are saying I get the impression that running H2O on Hadoop is a better idea than treating Sparkling Water as a H2O backend. Am I right?

@jakubhava
Contributor

jakubhava commented May 16, 2016

Hi Dom-nik,
you can set the property like this: spark-submit --class water.SparklingWaterDriver --conf "spark.ext.h2o.topology.change.listener.enabled=false" --master yarn-client --num-executors 3 --driver-memory 4g --executor-memory 2g --executor-cores 1 /opt/sparkling-water/sparkling-water-1.5.14/assembly/build/libs/*.jar

Regarding H2O & Sparkling Water.

It depends on your needs. If you don't use Spark to do, for example, some feature engineering or data munging, then there is probably no reason for you to use Sparkling Water.

However, if you already use Spark in your existing application, then I recommend Sparkling Water. In most cases it starts fine (we have problems on clusters with 60 or more nodes and are working on a new solution to this problem as fast as we can).

There is also a tuning guide which should help you set up Spark so it works better with Sparkling Water: https://github.com/h2oai/sparkling-water/blob/master/DEVEL.md#SparklingWaterTuning

Can you please try to start the H2OContext using your spark-submit command one or two more times? Does this happen every time?
I see that you use 3 executors, so it should work fine (sorry, I forgot to write this in the first reply: the problem explained there generally occurs for a higher number of executors). It could also be caused by another problem; can you share your YARN and H2O logs?

@Dom-nik
Author

Dom-nik commented May 16, 2016

Thank you for this answer, but it seems that the command you provided is exactly the same as the one I posted at the beginning :]

I'm still willing to test out your solution, but I thought I'd also give you some context: what I actually wanted to achieve was to have a single H2O instance that would serve as a backend for Python- and R-based H2O calls, something like a server for many users. I'm not sure if that's the way H2O was meant to be used. Is it?

I was also considering using JupyterHub as the main GUI for end users and giving them access to H2O via Python and R instead of Flow, as it seems there is no multi-user operation built into it.
Is there? Can you have any authentication on the Flow side?

@jakubhava
Contributor

Hi,
the command is different, please have a look at it one more time :)

H2O is perfect for what you want to achieve. You can start an H2O cloud of arbitrary size and then access it using our R/Python/Java/REST API. You can make one call via the R API and another via the Python API.

I'm not the main Flow developer; let me ask our team about the Flow question.

@Dom-nik
Author

Dom-nik commented May 17, 2016

Hi,
I've tried the updated command and it still gives me Exception in thread "main" java.lang.RuntimeException: Cloud size under 3 after some time, and I'm not able to view the GUI on port 54321. Do you have any other suggestions? :]

@jakubhava
Contributor

Hi, thanks for trying!
In order to debug your problem further, it would be great to see your YARN and H2O logs. Can you share them here? I'll have a look at them and then we can decide where to go next.

@Dom-nik
Author

Dom-nik commented May 17, 2016

OK, there you go. These are the logs for one run; they show the most common type of error I'm getting:
sparkling-water.yarn.log.zip
sparkling-water.zip

Here is also a log from a different run that gave a different error; it occurred only once:
sparkling-water.log.2.zip

@Dom-nik
Author

Dom-nik commented May 23, 2016

Hello MadMan0708, did you have a moment to take a look at the logs?

@jakubhava
Contributor

Hi Dom-nik,
really sorry for the delay. I've been trying to finish a few changes over the last few weeks, which has taken most of my time.

I'll check the logs today and let you know.

Thanks for your patience,
Kuba

@Dom-nik
Author

Dom-nik commented May 23, 2016

Thanks! Looking forward to any news!

@jakubhava
Contributor

jakubhava commented May 23, 2016

Hi Dom-nik,

so after looking at the logs, this is what I see:

The H2O cluster of size 3 is created successfully (per the H2O executor logs in the YARN log), but it seems the H2O client in the driver is not able to communicate with the rest of the cluster.

There are 2 things you can do:

  1. Check that your firewall allows H2O communication. It may be that your firewall rules are very strict and allow only Spark communication.
  2. Set H2O's `-network` option:

-network <IPv4network1>[,<IPv4network2>, ...]: Specify a range (where applicable) of IP addresses (where <IPv4network1> represents the first network, <IPv4network2> the second, and so on). The IP address discovery code binds to the first interface that matches one of the networks in the comma-separated list. For example, 10.1.2.0/24 supports 256 possibilities.

Sparkling Water provides the configuration property spark.ext.h2o.network.mask where you can set the desired network. This value is then passed as the value of -network when starting H2O nodes inside Spark.

You can set this property, for example, as a Spark configuration property when starting sparkling-shell in the normal way: ./bin/sparkling-shell --conf spark.ext.h2o.network.mask=10.1.2.0/24
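[Editor's note] The interface-selection rule behind -network can be illustrated with Python's ipaddress module. This is an illustration of the matching rule, not H2O's actual discovery code:

```python
import ipaddress

def pick_bind_address(candidate_ips, networks):
    """Return the first candidate IP that falls inside one of the
    comma-separated CIDR networks, mimicking the -network rule."""
    nets = [ipaddress.ip_network(n) for n in networks.split(",")]
    for ip in candidate_ips:
        addr = ipaddress.ip_address(ip)
        if any(addr in net for net in nets):
            return ip
    return None  # no interface matches: the node cannot bind

# A multihomed node with an internal (InfiniBand) and an external address:
print(pick_bind_address(["10.100.5.7", "192.168.7.12"], "192.168.7.0/24"))
# -> 192.168.7.12
```

On a multihomed host this is exactly the knob that decides whether H2O binds to the internal or the external interface.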

Let me please know if that helps!

Kuba

@kawaa

kawaa commented May 30, 2016

@madman0708 I ran into the same issue and tried all the tips provided in this conversation, but without success. I tried it on two different CDH clusters.

I get this error even when I use a single-node Cloudera Quickstart VM (CDH 5.5.0, Spark 1.5.0, Sparkling Water 1.5.14). Could you @madman0708 confirm that Sparkling Water 1.5.14 works fine with CDH 5.5.X or Spark 1.5.X? Alternatively, can you provide the versions that should integrate smoothly?

@Dom-nik
Author

Dom-nik commented Jun 22, 2016

We have some valuable debugging results. It seems that H2O doesn't support multihoming, which is quite typical, as it is not supported by Hadoop in general.

Context: we have our Cloudera Hadoop cluster deployed on specialized hardware called Big Data Appliance (BDA), an Oracle product. Multihoming is used in the Big Data Appliance: cluster nodes communicate with each other over InfiniBand on their internal network, using INTERNAL IP addresses, and they communicate with the rest of the P&G intranet using EXTERNAL IP addresses.

CDH (and Hadoop in general) doesn't support multihoming (cluster nodes belonging to multiple networks). Multihoming is supported for some appliances (BDA being one of them), but our edge nodes are not within the BDA, which is a non-standard setup. So when you add non-BDA nodes, you are outside the supported/recommended configuration from both the Oracle side and the Cloudera side. It is not a sub-optimal setup; it is just that Hadoop and related technologies (unfortunately) have not really been designed with multihomed networking in mind.

This causes connectivity issues, as (according to a Cloudera expert):

Historically we have had issues running pyspark from non-BDA nodes because of similar issues. We have also had issues running spark shell that we have worked around by specifying IP addresses instead of hostnames.

This hypothesis was confirmed by running Sparkling Water directly on one of the cluster nodes:
We tried to run Sparkling Water on a BDA node and it seems to work fine. We used sparkling-water-1.5.6 and steps from http://h2o-release.s3.amazonaws.com/sparkling-water/rel-1.5/6/index.html (the RUN ON HADOOP tab). An example command like:
$ spark-submit --class water.SparklingWaterDriver --master yarn-client --num-executors 8 --driver-memory 8g --executor-memory 4g --executor-cores 1 assembly/build/libs/*.jar
worked fine.

Do you have any comments to add? Do you plan to dig deeper into a case like this or is it totally outside your scope?

@mmalohlava
Member

Hi Dominik,

is it possible to share the logs from the Spark run privately?

My point is that if Spark is communicating (it can see executors and send/receive messages), then H2O should follow the same communication paths. If not, we need to help H2O share the same IP/port.
My theory is that the driver H2O (living in the same JVM as the Spark driver)

You can try specifying spark.ext.h2o.network.mask to force the H2O driver (living in the Spark driver) to select the right IP on the right interface...

@Dom-nik
Author

Dom-nik commented Jun 27, 2016

Hi Michal,

Thanks for your reply. It seems that it got cut in the middle :]

You can find new batch of YARN logs here: sparkling.yarn.logs.27062016.tar.gz
(Sparkling Water failed with the following error after I tried to connect to Flow on port 54321):
Exception in thread "main" java.lang.RuntimeException: Cloud size under 3

I tried running the application with spark.ext.h2o.network.mask:

spark-submit \
--class water.SparklingWaterDriver \
--master yarn-client \
--num-executors 3 \
--driver-memory 4g \
--executor-memory 2g \
--executor-cores 1 /opt/sparkling-water/sparkling-water-1.5.14/assembly/build/libs/*.jar \
--conf "spark.scheduler.minRegisteredResourcesRatio=1" "spark.ext.h2o.topology.change.listener.enabled=false" "spark.ext.h2o.network.mask=192.168.7.0/255"

but it behaved exactly the same.
The YARN logs from this run are here:
sparkling.yarn.logs.27062016.2.tar.gz

I'm not 100% sure if the mask was specified correctly.
EDIT: I know it was not, but I've tried with 192.168.7.0/24 too and it failed.
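[Editor's note] Malformed masks like 192.168.7.0/255 can be caught before submitting the job; Python's ipaddress module rejects them. This is a quick sanity check on the string, separate from anything Sparkling Water itself does:

```python
import ipaddress

def valid_mask(mask):
    """True if mask is a well-formed CIDR network like 192.168.7.0/24."""
    try:
        ipaddress.ip_network(mask)
        return True
    except ValueError:
        return False

print(valid_mask("192.168.7.0/255"))  # False: 255 is not a valid prefix length
print(valid_mask("192.168.7.0/24"))   # True
```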

@Dom-nik
Author

Dom-nik commented Aug 18, 2016

Just to close the case with some relevant info: there was some debugging done together with H2O, and a custom patch was developed (released with Sparkling Water 1.5.16). It adds a new parameter, spark.ext.h2o.node.network.mask, for specifying a mask for internal IPs.

Here's a way to run the tool so that it works:

spark-submit \
--class water.SparklingWaterDriver \
--master yarn-client \
--num-executors 3 \
--driver-memory 4g \
--executor-memory 2g \
--executor-cores 1 \
--conf "spark.scheduler.minRegisteredResourcesRatio=1" \
--conf "spark.ext.h2o.topology.change.listener.enabled=false" \
--conf "spark.ext.h2o.node.network.mask=<IP_NUMBER>/<MASK>" \
--conf "spark.ext.h2o.fail.on.unsupported.spark.param=false" \
/opt/sparkling-water/sparkling-water-1.5.16/assembly/build/libs/*.jar

e.g. "spark.ext.h2o.node.network.mask=10.0.0.0/24"

@Dom-nik Dom-nik closed this as completed Aug 18, 2016
@jakubhava
Contributor

Hi @Dom-nik,

thank you again for writing up the outcome!

@ibobak

ibobak commented Oct 24, 2017

Despite applying all these settings, I am receiving the same error:

Py4JJavaError: An error occurred while calling z:org.apache.spark.h2o.JavaH2OContext.getOrCreate.
: java.lang.RuntimeException: Cloud size under 11
	at water.H2O.waitForCloudSize(H2O.java:1689)
	at org.apache.spark.h2o.backends.internal.InternalH2OBackend.init(InternalH2OBackend.scala:117)
	at org.apache.spark.h2o.H2OContext.init(H2OContext.scala:121)
	at org.apache.spark.h2o.H2OContext$.getOrCreate(H2OContext.scala:355)
	at org.apache.spark.h2o.H2OContext$.getOrCreate(H2OContext.scala:371)
	at org.apache.spark.h2o.H2OContext.getOrCreate(H2OContext.scala)
	at org.apache.spark.h2o.JavaH2OContext.getOrCreate(JavaH2OContext.java:228)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:280)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:748)

Here are my configs in the Notebook's kernel:

"PYSPARK_SUBMIT_ARGS":" --py-files /usr/local/share/jupyter/kernels/sparkling-water-2.2.2/py/build/dist/h2o_pysparkling_2.2-2.2.2.zip 
  --conf \"spark.scheduler.minRegisteredResourcesRatio=1\" 
  --conf \"spark.ext.h2o.topology.change.listener.enabled=false\" 
  --conf \"spark.ext.h2o.fail.on.unsupported.spark.param=false\" 
  --conf \"spark.ext.h2o.node.network.mask=10.5.33.0/24\" 
  --jars /usr/local/share/jupyter/kernels/aws-lib/hadoop-aws-2.7.3.jar,/usr/local/share/jupyter/kernels/aws-lib/aws-java-sdk-1.7.4.jar   
  --driver-memory 8G 
  --executor-memory 24G   
  --conf \"spark.dynamicAllocation.enabled=false\" 
  --num-executors 10 
  --executor-cores 2 
  --master spark://10.5.33.36:7077 pyspark-shell"    

I am using Spark 2.2.0 with sparkling water 2.2.2.

In the Spark app I clearly see that it started one driver and 10 executors, and (as you can see) the number of executors is explicitly configured. Despite that, this annoying error simply doesn't allow H2O to run.

I'll be very grateful for any ideas on how to run it.

@idoshichor

Hello @jakubhava ,

Does Sparkling Water already support spark.dynamicAllocation.enabled=true?

We want to use it on Spark, but scaling the cluster up and down is very important for us.

Thanks.

@jakubhava
Contributor

Hi @idoshichor,
There are two backends in Sparkling Water: internal and external. With the external backend, you can use the spark.dynamicAllocation.enabled=true option, and Spark can kill or add executors without affecting H2O.

In the internal backend, this option is not allowed, and we think it won't become available there for several technical reasons. If you need dynamic allocation, I would advise looking at the external backend.
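[Editor's note] Per the Sparkling Water documentation, the backend is selected via the spark.ext.h2o.backend.cluster.mode property (verify the exact property name against your version's docs). A small Python sketch assembling the extra --conf flags for an external-backend submit:

```python
# Sketch: build the spark-submit --conf flags that select Sparkling Water's
# external backend, which tolerates dynamic allocation. The property name
# spark.ext.h2o.backend.cluster.mode comes from the Sparkling Water docs.

conf = {
    "spark.ext.h2o.backend.cluster.mode": "external",
    "spark.dynamicAllocation.enabled": "true",
}
flags = " ".join('--conf "{}={}"'.format(k, v) for k, v in sorted(conf.items()))
print(flags)
```

The resulting string can be appended to a spark-submit or PYSPARK_SUBMIT_ARGS invocation like the ones shown earlier in this thread.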
