
[ERROR] Executor without H2O instance discovered, killing the cloud! #32

Closed
Dom-nik opened this issue May 16, 2016 · 21 comments

@Dom-nik

Dom-nik commented May 16, 2016

I'm getting the error mentioned in the title. No clue why.

The command I use to run Sparkling Water is: spark-submit --class water.SparklingWaterDriver --master yarn-client --num-executors 3 --driver-memory 4g --executor-memory 2g --executor-cores 1 /opt/sparkling-water/sparkling-water-1.5.14/assembly/build/libs/*.jar

Full error stacktrace looks like this:

16/05/16 09:24:15 ERROR LiveListenerBus: Listener anon1 threw an exception
java.lang.IllegalArgumentException: Executor without H2O instance discovered, killing the cloud!
at org.apache.spark.h2o.H2OContext$$anon$1.onExecutorAdded(H2OContext.scala:180)
at org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:58)
at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:56)
at org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:79)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1136)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63)
16/05/16 09:24:16 INFO BlockManagerMasterEndpoint: Registering block manager bda1node05.na.pg.com:17644 with 1060.0 MB RAM, BlockManagerId(4, bda1node05.na.pg.com, 17644)
Exception in thread "main" java.lang.RuntimeException: Cloud size under 3
at water.H2O.waitForCloudSize(H2O.java:1547)
at org.apache.spark.h2o.H2OContext.start(H2OContext.scala:223)
at org.apache.spark.h2o.H2OContext$.getOrCreate(H2OContext.scala:337)
at org.apache.spark.h2o.H2OContext$.getOrCreate(H2OContext.scala:363)
at water.SparklingWaterDriver$.main(SparklingWaterDriver.scala:38)
at water.SparklingWaterDriver.main(SparklingWaterDriver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

@jakubhava
Contributor

Hi Dom-nik,
this is caused by Spark launching a new executor after the initialisation of H2OContext.

In Sparkling Water we try to discover all Spark executors at the start of H2OContext and start H2O on them. But if Spark for some reason launches a new executor later, that executor does not have an H2O instance running, which then leads to errors during computation.

So what we do in this case is throw an exception on Spark topology changes and kill the cloud.
You can turn this listener off by setting spark.ext.h2o.topology.change.listener.enabled to false, but that still won't prevent the problem I described above (it's also explained here: #4)

We are working on a new sparkling-water architecture which should solve these issues.
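[Editor's note] The gist of that topology-change check can be sketched in plain Python. This is a toy model of the behaviour described above, not the actual Scala listener in H2OContext:

```python
# Toy sketch: H2OContext records the executors present at start-up
# (H2O runs only on those), and any executor that appears afterwards
# has no H2O node, so the listener kills the cloud.

class TopologyChangeListener:
    def __init__(self, executors_at_start):
        # Executors discovered when H2OContext started.
        self.h2o_executors = set(executors_at_start)

    def on_executor_added(self, executor_id):
        # A late-arriving executor has no H2O instance on it.
        if executor_id not in self.h2o_executors:
            raise RuntimeError(
                "Executor without H2O instance discovered, killing the cloud!")

listener = TopologyChangeListener(["exec-1", "exec-2", "exec-3"])
listener.on_executor_added("exec-2")      # already known: no problem
try:
    listener.on_executor_added("exec-4")  # launched after H2OContext init
    late_executor_detected = False
except RuntimeError:
    late_executor_detected = True
print(late_executor_detected)  # True
```

Disabling the listener only silences this exception; the late executor still has no H2O node on it.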

@Dom-nik
Author

Dom-nik commented May 16, 2016

Hi MadMan0708,

Thanks for prompt reply!
How can I set this parameter? Can it be treated as a valid workaround?

From what you are saying I get the impression that running H2O on Hadoop is a better idea than treating Sparkling Water as a H2O backend. Am I right?

@jakubhava
Contributor

jakubhava commented May 16, 2016

Hi Dom-nik,
you can set the property like this: spark-submit --class water.SparklingWaterDriver --conf "spark.ext.h2o.topology.change.listener.enabled=false" --master yarn-client --num-executors 3 --driver-memory 4g --executor-memory 2g --executor-cores 1 /opt/sparkling-water/sparkling-water-1.5.14/assembly/build/libs/*.jar

Regarding H2O & Sparkling Water.

It depends on your needs. If you don't use Spark to do, for example, some feature engineering or data munging, then there is probably no reason for you to use Sparkling Water.

However, if you already use Spark in your existing application, then I recommend Sparkling Water. In most cases it starts fine (we have problems on clusters with 60 or more nodes and are working on a new solution to this problem as fast as we can).

There is also a tuning guide which should help you set up Spark so it works better with Sparkling Water: https://github.com/h2oai/sparkling-water/blob/master/DEVEL.md#SparklingWaterTuning

Can you please try to start the H2OContext using your spark-submit command one or two more times? Does this happen every time?
I see that you use 3 executors, so it should work fine (sorry, I forgot to write this in the first reply: the problem explained there generally occurs for a higher number of executors). It could also be caused by another problem; can you share your YARN and H2O logs?

@Dom-nik
Author

Dom-nik commented May 16, 2016

Thank you for this answer, but it seems that the command you provided is exactly the same as the one I posted at the beginning :]

I'm still willing to test out your solution, but I thought I'd also give you some context: what I actually wanted to achieve was to have a single H2O instance that would serve as a backend for Python- and R-based H2O calls, something like a server for many users. I'm not sure if that's the way H2O was meant to be used. Is it?

I was also considering using JupyterHub as the main GUI for end users and giving them access to H2O via Python and R instead of Flow, as it seems there is no multi-user operation built into it.
Is there? Can you have any authentication on the Flow side?

@jakubhava
Contributor

Hi,
the command is different, please have a look at it one more time :)

H2O is perfect for what you want to achieve. You can start an H2O cloud of arbitrary size and then access it using our R/Python/Java/REST API. You can make one call via the R API and another via the Python API.

I'm not the main Flow developer; let me ask our team about the Flow question.

@Dom-nik
Author

Dom-nik commented May 17, 2016

Hi,
I've tried the updated command and it still gives me Exception in thread "main" java.lang.RuntimeException: Cloud size under 3 after some time, and I'm not able to view the GUI on port 54321. Do you have any other suggestions? :]

@jakubhava
Contributor

Hi, thanks for trying!
In order to debug your problem further, it would be great to see your YARN and H2O logs. Can you share them here? I'll have a look at them and then we can decide where to go next.

@Dom-nik
Author

Dom-nik commented May 17, 2016

OK, there you go. These are the logs for one run; they show the most common type of error I'm getting:
sparkling-water.yarn.log.zip
sparkling-water.zip

Here is also a log from a different run that gave a different error; it occurred only once:
sparkling-water.log.2.zip

@Dom-nik
Author

Dom-nik commented May 23, 2016

Hello MadMan0708, did you have a moment to take a look at the logs?

@jakubhava
Contributor

Hi Dom-nik,
really sorry for the delay. I've been trying to finish a few changes over the last few weeks, which has taken most of my time.

I'll check the logs today and let you know.

Thanks for your patience,
Kuba

@Dom-nik
Author

Dom-nik commented May 23, 2016

Thanks! Looking forward to any news!

@jakubhava
Contributor

jakubhava commented May 23, 2016

Hi Dom-nik,

so after looking at the logs, this is what I see:

The H2O cluster of size 3 is created successfully (per the H2O executor logs in the YARN log), but it seems the H2O client in the driver is not able to communicate with the rest of the cluster.

There are 2 things you can do:

  1. Check that your firewall allows H2O communication. It may be that your firewall rules are very strict and allow only Spark communication.
  2. Set H2O's `-network` option:

-network <IPv4network1>[,<IPv4network2>, ...]: Specify a range (where applicable) of IP addresses (where <IPv4network1> represents the first network, <IPv4network2> the second, and so on). The IP address discovery code binds to the first interface that matches one of the networks in the comma-separated list. For example, 10.1.2.0/24 supports 256 possibilities.

Sparkling Water provides the configuration property spark.ext.h2o.network.mask where you can set the desired network. This value is then passed as the value of -network when starting H2O nodes inside Spark.

You can set this property, for example, as a Spark configuration property when starting sparkling-shell in the normal way: ./bin/sparkling-shell --conf spark.ext.h2o.network.mask=10.1.2.0/24
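[Editor's note] The interface-selection rule behind -network can be illustrated with Python's ipaddress module. This is an illustration of the matching rule, not H2O's actual discovery code:

```python
import ipaddress

def pick_bind_address(candidate_ips, networks):
    """Return the first candidate IP that falls inside one of the
    comma-separated CIDR networks, mimicking the -network rule."""
    nets = [ipaddress.ip_network(n) for n in networks.split(",")]
    for ip in candidate_ips:
        addr = ipaddress.ip_address(ip)
        if any(addr in net for net in nets):
            return ip
    return None  # no interface matches: the node cannot bind

# A multihomed node with an internal (InfiniBand) and an external address:
print(pick_bind_address(["10.100.5.7", "192.168.7.12"], "192.168.7.0/24"))
# -> 192.168.7.12
```

On a multihomed host this is exactly the knob that decides whether H2O binds to the internal or the external interface.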

Let me please know if that helps!

Kuba

@kawaa

kawaa commented May 30, 2016

@madman0708 I ran into the same issue and tried all the tips provided in this conversation, but without success. I tried it on two different CDH clusters.

I get this error even when I use a single-node Cloudera Quickstart VM (CDH 5.5.0, Spark 1.5.0, Sparkling Water 1.5.14). Could you @madman0708 confirm that Sparkling Water 1.5.14 works fine with CDH 5.5.X or Spark 1.5.X? Alternatively, can you provide the versions that should integrate smoothly?

@Dom-nik
Author

Dom-nik commented Jun 22, 2016

We have some valuable debugging results. It seems that H2O doesn't support multihoming, which is quite typical, as it is not supported by Hadoop in general.

Context: we have our Cloudera Hadoop cluster deployed on specialized hardware called Big Data Appliance (BDA), an Oracle product. Multihoming is used in the Big Data Appliance: cluster nodes communicate with each other over InfiniBand on their internal network, using INTERNAL IP addresses, and they communicate with the rest of the P&G intranet using EXTERNAL IP addresses.

CDH (and Hadoop in general) doesn't support multihoming (cluster nodes belonging to multiple networks). Multihoming is supported for some appliances (BDA being one of them), but our edge nodes are not within the BDA, which is a non-standard setup. So when you add non-BDA nodes, you are outside the supported/recommended configuration from both the Oracle side and the Cloudera side. It is not a sub-optimal setup; it is just that Hadoop and related technologies (unfortunately) have not really been designed with multihomed networking in mind.

This causes connectivity issues, as (according to a Cloudera expert):

Historically we have had issues running pyspark from non-BDA nodes because of similar issues. We have also had issues running spark shell that we have worked around by specifying IP addresses instead of hostnames.

This hypothesis was confirmed by running Sparkling Water directly on one of the cluster nodes:
We tried to run Sparkling Water on a BDA node and it seems to work fine. We used sparkling-water-1.5.6 and steps from http://h2o-release.s3.amazonaws.com/sparkling-water/rel-1.5/6/index.html (the RUN ON HADOOP tab). An example command like:
$ spark-submit --class water.SparklingWaterDriver --master yarn-client --num-executors 8 --driver-memory 8g --executor-memory 4g --executor-cores 1 assembly/build/libs/*.jar
worked fine.

Do you have any comments to add? Do you plan to dig deeper into a case like this or is it totally outside your scope?

@mmalohlava
Member

Hi Dominik,

is it possible to share the logs from the Spark run privately?

My point is that if Spark is communicating (it can see executors and send/receive messages), then H2O should follow the same communication paths. If not, we need to help H2O share the same IP/port.
My theory is that the driver H2O (living in the same JVM as the Spark driver)

You can try specifying spark.ext.h2o.network.mask to force the H2O driver (living in the Spark driver) to select the right IP on the right interface...

@Dom-nik
Author

Dom-nik commented Jun 27, 2016

Hi Michal,

Thanks for your reply. It seems that it got cut in the middle :]

You can find new batch of YARN logs here: sparkling.yarn.logs.27062016.tar.gz
(Sparkling Water failed with the following error after I tried to connect to Flow on port 54321):
Exception in thread "main" java.lang.RuntimeException: Cloud size under 3

I tried running the application with spark.ext.h2o.network.mask:

spark-submit \
--class water.SparklingWaterDriver \
--master yarn-client \
--num-executors 3 \
--driver-memory 4g \
--executor-memory 2g \
--executor-cores 1 /opt/sparkling-water/sparkling-water-1.5.14/assembly/build/libs/*.jar \
--conf "spark.scheduler.minRegisteredResourcesRatio=1" "spark.ext.h2o.topology.change.listener.enabled=false" "spark.ext.h2o.network.mask=192.168.7.0/255"

but it behaved exactly the same.
The YARN logs from this run are here:
sparkling.yarn.logs.27062016.2.tar.gz

I'm not 100% sure if the mask was specified correctly.
EDIT: I know it was not, but I've tried with 192.168.7.0/24 too and it failed.
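[Editor's note] Malformed masks like 192.168.7.0/255 can be caught before submitting the job; Python's ipaddress module rejects them. This is a quick sanity check on the string, separate from anything Sparkling Water itself does:

```python
import ipaddress

def valid_mask(mask):
    """True if mask is a well-formed CIDR network like 192.168.7.0/24."""
    try:
        ipaddress.ip_network(mask)
        return True
    except ValueError:
        return False

print(valid_mask("192.168.7.0/255"))  # False: 255 is not a valid prefix length
print(valid_mask("192.168.7.0/24"))   # True
```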

@Dom-nik
Author

Dom-nik commented Aug 18, 2016

Just to close the case with some relevant info: there was some debugging done together with H2O, and a custom patch was developed (released with Sparkling Water 1.5.16). It adds a new parameter, spark.ext.h2o.node.network.mask, for specifying a mask for internal IPs.

Here's a way to run the tool so that it works:

spark-submit \
--class water.SparklingWaterDriver \
--master yarn-client \
--num-executors 3 \
--driver-memory 4g \
--executor-memory 2g \
--executor-cores 1 \
--conf "spark.scheduler.minRegisteredResourcesRatio=1" \
--conf "spark.ext.h2o.topology.change.listener.enabled=false" \
--conf "spark.ext.h2o.node.network.mask=<IP_NUMBER>/<MASK>" \
--conf "spark.ext.h2o.fail.on.unsupported.spark.param=false" \
/opt/sparkling-water/sparkling-water-1.5.16/assembly/build/libs/*.jar

e.g. "spark.ext.h2o.node.network.mask=10.0.0.0/24"

@Dom-nik Dom-nik closed this as completed Aug 18, 2016
@jakubhava
Contributor

Hi @Dom-nik,

thank you again for writing up the outcome!

@ibobak

ibobak commented Oct 24, 2017

Despite applying all these settings, I am receiving the same error:

Py4JJavaError: An error occurred while calling z:org.apache.spark.h2o.JavaH2OContext.getOrCreate.
: java.lang.RuntimeException: Cloud size under 11
	at water.H2O.waitForCloudSize(H2O.java:1689)
	at org.apache.spark.h2o.backends.internal.InternalH2OBackend.init(InternalH2OBackend.scala:117)
	at org.apache.spark.h2o.H2OContext.init(H2OContext.scala:121)
	at org.apache.spark.h2o.H2OContext$.getOrCreate(H2OContext.scala:355)
	at org.apache.spark.h2o.H2OContext$.getOrCreate(H2OContext.scala:371)
	at org.apache.spark.h2o.H2OContext.getOrCreate(H2OContext.scala)
	at org.apache.spark.h2o.JavaH2OContext.getOrCreate(JavaH2OContext.java:228)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:280)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:748)

Here are my configs in the Notebook's kernel:

"PYSPARK_SUBMIT_ARGS":" --py-files /usr/local/share/jupyter/kernels/sparkling-water-2.2.2/py/build/dist/h2o_pysparkling_2.2-2.2.2.zip 
  --conf \"spark.scheduler.minRegisteredResourcesRatio=1\" 
  --conf \"spark.ext.h2o.topology.change.listener.enabled=false\" 
  --conf \"spark.ext.h2o.fail.on.unsupported.spark.param=false\" 
  --conf \"spark.ext.h2o.node.network.mask=10.5.33.0/24\" 
  --jars /usr/local/share/jupyter/kernels/aws-lib/hadoop-aws-2.7.3.jar,/usr/local/share/jupyter/kernels/aws-lib/aws-java-sdk-1.7.4.jar   
  --driver-memory 8G 
  --executor-memory 24G   
  --conf \"spark.dynamicAllocation.enabled=false\" 
  --num-executors 10 
  --executor-cores 2 
  --master spark://10.5.33.36:7077 pyspark-shell"    

I am using Spark 2.2.0 with sparkling water 2.2.2.

In the Spark app I clearly see that it started one driver and 10 executors, and (as you can see) the number of executors is explicitly configured. Despite that, this annoying error simply doesn't allow H2O to run.

I'll be very grateful for any ideas on how to run it.

@idoshichor

Hello @jakubhava ,

Does Sparkling Water already support spark.dynamicAllocation.enabled=true?

We want to use it on Spark, but scaling the cluster up and down is very important for us.

Thanks.

@jakubhava
Contributor

Hi @idoshichor,
There are two backends in Sparkling Water: internal and external. With the external backend, you can use the spark.dynamicAllocation.enabled=true option, and Spark can kill or add executors without affecting H2O.

In the internal backend, this option is not allowed, and we think it won't become available there for several technical reasons. If you need dynamic allocation, I would advise looking at the external backend.
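[Editor's note] Per the Sparkling Water documentation, the backend is selected via the spark.ext.h2o.backend.cluster.mode property (verify the exact property name against your version's docs). A small Python sketch assembling the extra --conf flags for an external-backend submit:

```python
# Sketch: build the spark-submit --conf flags that select Sparkling Water's
# external backend, which tolerates dynamic allocation. The property name
# spark.ext.h2o.backend.cluster.mode comes from the Sparkling Water docs.

conf = {
    "spark.ext.h2o.backend.cluster.mode": "external",
    "spark.dynamicAllocation.enabled": "true",
}
flags = " ".join('--conf "{}={}"'.format(k, v) for k, v in sorted(conf.items()))
print(flags)
```

The resulting string can be appended to a spark-submit or PYSPARK_SUBMIT_ARGS invocation like the ones shown earlier in this thread.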
