"IPs are not equal" error when starting H2OContext with Spark Context in Zeppelin #291

Closed
lordlinus opened this issue Jun 2, 2017 · 24 comments

@lordlinus

lordlinus commented Jun 2, 2017

Hi, similar to #37: trying to start an H2OContext via Zeppelin using the latest Sparkling Water assembly (sparkling-water-assembly_2.10-1.6.11-all.jar), I got the error below.

spark version -> 1.6.2
sparkling water version -> 1.6.11
deployment type (spark MASTER variable - local, yarn) -> Spark YARN client mode (Zeppelin)
date on which this exception happened -> today
reproducible code ->
import org.apache.spark.h2o._
val h2oContext = H2OContext.getOrCreate(sc)
import h2oContext._
import h2oContext.implicits._

Appreciate any help

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 58, datanode-045.domain.com): java.lang.AssertionError: assertion failed: SpreadRDD failure - IPs are not equal: (1,datanode-060.domain.com,-1) != (2, datanode-045.domain.com)
	at scala.Predef$.assert(Predef.scala:179)
	at org.apache.spark.h2o.backends.internal.InternalBackendUtils$$anonfun$7.apply(InternalBackendUtils.scala:103)
	at org.apache.spark.h2o.backends.internal.InternalBackendUtils$$anonfun$7.apply(InternalBackendUtils.scala:102)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
	at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
	at scala.collection.AbstractIterator.to(Iterator.scala:1157)
	at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
	at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
	at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
	at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:934)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:934)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1882)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1882)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
	at org.apache.spark.scheduler.Task.run(Task.scala:89)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1433)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1421)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1420)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1420)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:801)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:801)
	at scala.Option.foreach(Option.scala:236)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:801)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1642)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1601)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1590)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:622)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1856)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1869)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1882)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1953)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:934)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:323)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:933)
	at org.apache.spark.h2o.backends.internal.InternalBackendUtils$class.startH2O(InternalBackendUtils.scala:165)
	at org.apache.spark.h2o.backends.internal.InternalBackendUtils$.startH2O(InternalBackendUtils.scala:263)
	at org.apache.spark.h2o.backends.internal.InternalH2OBackend.init(InternalH2OBackend.scala:103)
	at org.apache.spark.h2o.H2OContext.init(H2OContext.scala:112)
	at org.apache.spark.h2o.H2OContext$.getOrCreate(H2OContext.scala:294)
	at org.apache.spark.h2o.H2OContext$.getOrCreate(H2OContext.scala:316)
	at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:32)
	at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:37)
	at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:39)
	at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:41)
	at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:43)
	at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:45)
	at $iwC$$iwC$$iwC$$iwC.<init>(<console>:47)
	at $iwC$$iwC$$iwC.<init>(<console>:49)
	at $iwC$$iwC.<init>(<console>:51)
	at $iwC.<init>(<console>:53)
	at <init>(<console>:55)
	at .<init>(<console>:59)
	at .<clinit>(<console>)
	at .<init>(<console>:7)
	at .<clinit>(<console>)
	at $print(<console>)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
	at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
	at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
	at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
	at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.zeppelin.spark.Utils.invokeMethod(Utils.java:38)
	at org.apache.zeppelin.spark.SparkInterpreter.interpret(SparkInterpreter.java:972)
	at org.apache.zeppelin.spark.SparkInterpreter.interpretInput(SparkInterpreter.java:1198)
	at org.apache.zeppelin.spark.SparkInterpreter.interpret(SparkInterpreter.java:1144)
	at org.apache.zeppelin.spark.SparkInterpreter.interpret(SparkInterpreter.java:1137)
	at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:95)
	at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:490)
	at org.apache.zeppelin.scheduler.Job.run(Job.java:175)
	at org.apache.zeppelin.scheduler.FIFOScheduler$1.run(FIFOScheduler.java:139)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.AssertionError: assertion failed: SpreadRDD failure - IPs are not equal: (1,datanode-060.domain.com,-1) != (2, datanode-045.domain.com)
	at scala.Predef$.assert(Predef.scala:179)
	at org.apache.spark.h2o.backends.internal.InternalBackendUtils$$anonfun$7.apply(InternalBackendUtils.scala:103)
	at org.apache.spark.h2o.backends.internal.InternalBackendUtils$$anonfun$7.apply(InternalBackendUtils.scala:102)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
	at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
	at scala.collection.AbstractIterator.to(Iterator.scala:1157)
	at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
	at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
	at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
	at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:934)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:934)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1882)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1882)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
	at org.apache.spark.scheduler.Task.run(Task.scala:89)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
	... 3 more
@jakubhava
Contributor

Hi @lordlinus, thanks for the report! Quick question: are the nodes on which you're trying to start Sparkling Water connected to multiple network interfaces?

@lordlinus
Author

lordlinus commented Jun 4, 2017

@jakubhava only one network interface for the datanodes.

@jakubhava
Contributor

@lordlinus And for the master node?

This can also happen when the H2O instance running inside Spark picks up a wrong IP address (based on the Spark address) that is not in the same network as the rest of the workers.

Can you please verify whether the driver node has a single network interface or multiple ones? Thanks!

@lordlinus
Author

@jakubhava The master node has a single network interface too... I have also specified the options below, but I get the same error:

spark.ext.h2o.client.network.mask="10.0.0.0/8"
spark.ext.h2o.node.network.mask="10.0.0.0/8"
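(For reference, a minimal sketch - not my actual notebook code, and the app name is a placeholder - of one way these keys could be set on the SparkConf before the SparkContext and H2OContext are created; in Zeppelin itself, where sc already exists, the same keys go into the Spark interpreter settings instead.)

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.h2o._

// Set the H2O network masks on the SparkConf before the SparkContext exists,
// then build the H2OContext from that context.
val conf = new SparkConf()
  .setAppName("sparkling-water-network-mask-example")      // placeholder app name
  .set("spark.ext.h2o.client.network.mask", "10.0.0.0/8")
  .set("spark.ext.h2o.node.network.mask", "10.0.0.0/8")

val sc = new SparkContext(conf)
val h2oContext = H2OContext.getOrCreate(sc)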

@jakubhava
Contributor

jakubhava commented Jun 13, 2017

OK, thanks for sharing. In that case we would need the full YARN and H2O logs for this run to see what might have gone wrong.

YARN logs can be obtained using the following shell command: yarn logs -applicationId <application ID>

@jakubhava
Contributor

Also, it would help a lot if you could try the same code on Sparkling Water for Spark 2.1 or Spark 2.0 and see if the error occurs there as well. Those are the two major versions we support at the moment (very critical fixes can still go into Sparkling Water for Spark 1.6 as well).

@jakubhava
Contributor

@lordlinus Just pinging about the progress - have you tried Sparkling Water 2.1.x? Do you need any help with obtaining the logs?

@jakubhava
Contributor

Closing this issue for now. @lordlinus, please feel free to re-open with the logs attached.

@raveeram

Hi @jakubhava, I'm hitting the same exception on YARN, and I think I have an understanding of why it's happening, but I'd like to confirm it. Is there a private channel through which I can send you the logs and all related information?

@jakubhava
Contributor

Hi @raveeram, you can either share it here or send all the relevant information to my work mail, which is jakub[at]h2o[dot]ai.

Please also share your Sparkling Water version, Spark version, and deployment mode.

Thanks! Kuba

@Du-Li

Du-Li commented Oct 23, 2017

@jakubhava I am also hitting this issue when running my Sparkling Water job on YARN. I tried several different versions, from Spark 1.6.2+ and Sparkling Water 1.6.8+. Do you know how to fix this problem? Thanks.

@mmalohlava
Member

@Du-Li In short, it is an assertion on our side concerning the locality of job scheduling in Spark. Do you have the same problem with Spark 2.x?

@Du-Li

Du-Li commented Oct 23, 2017

@mmalohlava Yes. I tried Spark 1.6.2, 2.1.0, and 2.2.0, and they all had this unequal-IPs error on SpreadRDD.

@mmalohlava
Member

Do you have elasticity enabled for the YARN queue you are submitting into? We have a tuning guide for YARN here: https://github.com/h2oai/sparkling-water/blob/master/doc/configuration/internal_backend_tuning.rst

The most interesting options for you could be (see the sketch after the list):

  • spark.scheduler.minRegisteredResourcesRatio=1 - make sure that all Spark resources are available
  • spark.yarn.max.executor.failures=1 - fail early if a Spark executor dies
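(A minimal sketch - an illustration, not taken from the tuning guide - of setting these two options programmatically on the SparkConf; the same keys can equally be passed to spark-submit via --conf.)

import org.apache.spark.SparkConf

// Fail fast instead of letting H2O start on a partially allocated cluster.
val conf = new SparkConf()
  .set("spark.scheduler.minRegisteredResourcesRatio", "1")  // wait until all executors register
  .set("spark.yarn.max.executor.failures", "1")             // give up early if an executor dies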

@Du-Li

Du-Li commented Oct 23, 2017

@mmalohlava I tried these two options in the spark-submit CLI (--conf) but still got the same error. The same Scala script (without creating a SparkContext) worked with sparkling-shell, though. The Spark and Sparkling Water version I tried was 2.2.

Do you have any further suggestions? Thanks.

@jakubhava
Contributor

Hi @Du-Li, so this happened also with dynamic allocation disabled?

Could you please share a bit more information about your environment? Are you running on YARN (cluster or client mode)? Are your physical nodes connected to multiple network interfaces?

Thanks! Kuba

@Mageswaran1989

Hi Team,

I am also facing the same issue when I hit the line here.

Versions used:
H2O: http://search.maven.org/#artifactdetails%7Cai.h2o%7Csparkling-water-ml_2.11%7C2.2.2%7Cjar
Spark: 2.2
Zeppelin: 0.7.3

17/10/31 17:58:30 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.1 KB, free 6.2 GB)
17/10/31 17:58:30 INFO NativeLibrary: Loaded XGBoost library from lib/linux_64/libxgboost4j.so (/tmp/libxgboost4j4983128618196471307.so)
17/10/31 17:58:31 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[H2O Launcher thread,5,main]
java.lang.NoClassDefFoundError: org/eclipse/jetty/io/EofException
	at water.init.AbstractBuildVersion.getLatestH2OVersion(AbstractBuildVersion.java:84)
	at water.H2O.printAndLogVersion(H2O.java:1430)
	at water.H2O.main(H2O.java:1892)
	at water.H2OStarter.start(H2OStarter.java:21)
	at water.H2OStarter.start(H2OStarter.java:46)
	at org.apache.spark.h2o.backends.internal.InternalBackendUtils$$anonfun$7$$anon$1.run(InternalBackendUtils.scala:138)
Caused by: java.lang.ClassNotFoundException: org.eclipse.jetty.io.EofException
	at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	... 6 more
17/10/31 17:58:31 INFO DiskBlockManager: Shutdown hook called
17/10/31 17:58:31 INFO ShutdownHookManager: Shutdown hook called
17/10/31 17:58:31 INFO ShutdownHookManager: Deleting directory /tmp/spark-5c64550c-e5cc-4213-9292-fbee323b6cbd/executor-4c0ad4d4-d21f-44ea-b06a-c239eb812959/spark-9ff393e4-9ca6-47e1-8a42-013a6a5028e2

Thanks.

@jakubhava
Contributor

Hi @Mageswaran1989, this seems to be a different error.

Can you please tell us more about your environment (YARN client, YARN cluster, standalone, local)?
Are you using vanilla Spark or CDH/HDP/MapR?
Can you please share the code you use to start the Sparkling Water app/shell?

Thanks, Kuba

@Mageswaran1989

@jakubhava
It works in local mode!

On vanilla Spark, in standalone cluster mode, it throws the above error.

When I first encountered the IP mismatch, I observed that the old executors went down and new executors came up for my Zeppelin application on the cluster.

In the Zeppelin dependencies, I have also added "no.priv.garshol.duke:duke:1.2", as directed by the H2O logs.

I was following the example code from the examples folder, and it stops at https://github.com/h2oai/sparkling-water/blob/master/examples/pipelines/hamOrSpam.script.scala#L41

Please find the logs attached:
h2o_sparkling_error_log.txt

@jakubhava
Contributor

jakubhava commented Oct 31, 2017

Thanks for the info @Mageswaran1989.

One additional question - can you please share the shell command you use to start your Sparkling Water example? That would help a lot; in particular, we could see how you get the Sparkling Water artefacts onto the cluster. Using Sparkling Water via --packages or from the Maven package is currently broken (cc: @mmalohlava just for reference) because we use a different Jetty than Spark. The error Caused by: java.lang.ClassNotFoundException: org.eclipse.jetty.io.EofException is probably caused by Spark adding a different Jetty version to the class-path.

The solution for now is to use the artefacts from the Sparkling Water distribution downloadable from our web page - https://www.h2o.ai/download/. It contains the assembly JAR with all the correct dependencies, including the correct (shaded) Jetty version.

@Mageswaran1989

Mageswaran1989 commented Oct 31, 2017

@jakubhava
I used Zeppelin for this try-out.

I added the Maven coordinates of the Sparkling Water ML and examples artifacts, along with "no.priv.garshol.duke:duke:1.2".

Can overriding the existing Jetty version in Zeppelin solve this problem? If so, please provide the Maven coordinates of the right one.

@jakubhava
Contributor

OK, this is the problem. You need to use the Sparkling Water assembly JAR available in the Sparkling Water distribution downloadable from our page. The artefacts on Maven currently contain a wrong Jetty version, which prevents using Sparkling Water via the Maven artefacts.

@Mageswaran1989

@jakubhava It works with the assembly JAR! Thank you very much for the quick replies!

@jakubhava
Contributor

Thanks for the update! We already have a PR in progress which should unlock using Sparkling Water via Maven - #352. You can track that PR to see the progress.
