"IPs are not equal" error when starting H2OContext with Spark Context in Zeppelin #291

Closed
lordlinus opened this issue Jun 2, 2017 · 24 comments

@lordlinus

lordlinus commented Jun 2, 2017

Hi, similar to #37: trying to start an H2OContext via Zeppelin using the latest Sparkling Water assembly (sparkling-water-assembly_2.10-1.6.11-all.jar), I got the error below.

spark version -> 1.6.2
sparkling water version -> 1.6.11
deployment type (spark MASTER variable - local, yarn) -> Spark YARN client mode (Zeppelin)
date on which this exception happened -> today
reproducible code ->
import org.apache.spark.h2o._
val h2oContext = H2OContext.getOrCreate(sc)
import h2oContext._
import h2oContext.implicits._

Appreciate any help

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 58, datanode-045.domain.com): java.lang.AssertionError: assertion failed: SpreadRDD failure - IPs are not equal: (1,datanode-060.domain.com,-1) != (2, datanode-045.domain.com)
	at scala.Predef$.assert(Predef.scala:179)
	at org.apache.spark.h2o.backends.internal.InternalBackendUtils$$anonfun$7.apply(InternalBackendUtils.scala:103)
	at org.apache.spark.h2o.backends.internal.InternalBackendUtils$$anonfun$7.apply(InternalBackendUtils.scala:102)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
	at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
	at scala.collection.AbstractIterator.to(Iterator.scala:1157)
	at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
	at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
	at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
	at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:934)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:934)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1882)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1882)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
	at org.apache.spark.scheduler.Task.run(Task.scala:89)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1433)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1421)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1420)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1420)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:801)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:801)
	at scala.Option.foreach(Option.scala:236)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:801)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1642)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1601)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1590)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:622)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1856)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1869)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1882)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1953)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:934)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:323)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:933)
	at org.apache.spark.h2o.backends.internal.InternalBackendUtils$class.startH2O(InternalBackendUtils.scala:165)
	at org.apache.spark.h2o.backends.internal.InternalBackendUtils$.startH2O(InternalBackendUtils.scala:263)
	at org.apache.spark.h2o.backends.internal.InternalH2OBackend.init(InternalH2OBackend.scala:103)
	at org.apache.spark.h2o.H2OContext.init(H2OContext.scala:112)
	at org.apache.spark.h2o.H2OContext$.getOrCreate(H2OContext.scala:294)
	at org.apache.spark.h2o.H2OContext$.getOrCreate(H2OContext.scala:316)
	at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:32)
	at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:37)
	at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:39)
	at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:41)
	at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:43)
	at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:45)
	at $iwC$$iwC$$iwC$$iwC.<init>(<console>:47)
	at $iwC$$iwC$$iwC.<init>(<console>:49)
	at $iwC$$iwC.<init>(<console>:51)
	at $iwC.<init>(<console>:53)
	at <init>(<console>:55)
	at .<init>(<console>:59)
	at .<clinit>(<console>)
	at .<init>(<console>:7)
	at .<clinit>(<console>)
	at $print(<console>)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
	at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
	at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
	at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
	at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.zeppelin.spark.Utils.invokeMethod(Utils.java:38)
	at org.apache.zeppelin.spark.SparkInterpreter.interpret(SparkInterpreter.java:972)
	at org.apache.zeppelin.spark.SparkInterpreter.interpretInput(SparkInterpreter.java:1198)
	at org.apache.zeppelin.spark.SparkInterpreter.interpret(SparkInterpreter.java:1144)
	at org.apache.zeppelin.spark.SparkInterpreter.interpret(SparkInterpreter.java:1137)
	at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:95)
	at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:490)
	at org.apache.zeppelin.scheduler.Job.run(Job.java:175)
	at org.apache.zeppelin.scheduler.FIFOScheduler$1.run(FIFOScheduler.java:139)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.AssertionError: assertion failed: SpreadRDD failure - IPs are not equal: (1,datanode-060.domain.com,-1) != (2, datanode-045.domain.com)
	at scala.Predef$.assert(Predef.scala:179)
	at org.apache.spark.h2o.backends.internal.InternalBackendUtils$$anonfun$7.apply(InternalBackendUtils.scala:103)
	at org.apache.spark.h2o.backends.internal.InternalBackendUtils$$anonfun$7.apply(InternalBackendUtils.scala:102)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
	at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
	at scala.collection.AbstractIterator.to(Iterator.scala:1157)
	at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
	at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
	at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
	at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:934)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:934)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1882)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1882)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
	at org.apache.spark.scheduler.Task.run(Task.scala:89)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
	... 3 more
@jakubhava
Contributor

Hi @lordlinus, thanks for the report! Quick question: are the nodes on which you're trying to start Sparkling Water connected to multiple network interfaces?

@lordlinus
Author

lordlinus commented Jun 4, 2017

@jakubhava only one network interface for the datanodes.

@jakubhava
Contributor

@lordlinus And for the master node?

This can also happen when the H2O instance running inside Spark picks up a wrong IP address (based on the Spark address) that is not in the same network as the rest of the workers.

Can you please verify whether the driver node has a single network interface or multiple ones? Thanks!

@lordlinus
Author

@jakubhava The master node has a single network interface too... I have also specified the options below, but I get the same error:

spark.ext.h2o.client.network.mask="10.0.0.0/8"
spark.ext.h2o.node.network.mask="10.0.0.0/8"
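(For reference, a minimal sketch - not my actual notebook code, and the app name is a placeholder - of one way these keys could be set on the SparkConf before the SparkContext and H2OContext are created; in Zeppelin itself, where sc already exists, the same keys go into the Spark interpreter settings instead.)

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.h2o._

// Set the H2O network masks on the SparkConf before the SparkContext exists,
// then build the H2OContext from that context.
val conf = new SparkConf()
  .setAppName("sparkling-water-network-mask-example")      // placeholder app name
  .set("spark.ext.h2o.client.network.mask", "10.0.0.0/8")
  .set("spark.ext.h2o.node.network.mask", "10.0.0.0/8")

val sc = new SparkContext(conf)
val h2oContext = H2OContext.getOrCreate(sc)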

@jakubhava
Contributor

jakubhava commented Jun 13, 2017

OK, thanks for sharing. In that case we would need the full YARN and H2O logs for this run to see what might have gone wrong.

YARN logs can be obtained using the following shell command: yarn logs -applicationId <application ID>

@jakubhava
Contributor

Also, it would help a lot if you could try the same code on Sparkling Water for Spark 2.1 or Spark 2.0 and see if the error occurs there as well. Those are the two major versions we support at the moment (very critical fixes can still go into Sparkling Water for Spark 1.6 as well).

@jakubhava
Contributor

@lordlinus Just pinging about the progress - have you tried Sparkling Water 2.1.x? Do you need any help with obtaining the logs?

@jakubhava
Contributor

Closing this issue for now. @lordlinus, please feel free to re-open with the logs attached.

@raveeram

Hi @jakubhava, I'm hitting the same exception on YARN, and I think I have an understanding of why it's happening, but I'd like to confirm it. Is there a private channel through which I can send you the logs and all related information?

@jakubhava
Contributor

Hi @raveeram, you can either share it here or send all the relevant information to my work mail, which is jakub[at]h2o[dot]ai.

Please also share your Sparkling Water version, Spark version, and deployment mode.

Thanks! Kuba

@Du-Li

Du-Li commented Oct 23, 2017

@jakubhava I am also hitting this issue when running my Sparkling Water job on YARN. I tried several different versions, from Spark 1.6.2+ and Sparkling Water 1.6.8+. Do you know how to fix this problem? Thanks.

@mmalohlava
Member

@Du-Li In short, it is an assertion on our side concerning the locality of job scheduling in Spark. Do you have the same problem with Spark 2.x?

@Du-Li

Du-Li commented Oct 23, 2017

@mmalohlava Yes. I tried Spark 1.6.2, 2.1.0, and 2.2.0, and they all had this unequal-IPs error on SpreadRDD.

@mmalohlava
Member

Do you have elasticity enabled for the YARN queue you are submitting into? We have a tuning guide for YARN here: https://github.com/h2oai/sparkling-water/blob/master/doc/configuration/internal_backend_tuning.rst

The most interesting options for you could be (see the sketch after the list):

  • spark.scheduler.minRegisteredResourcesRatio=1 - make sure that all Spark resources are available
  • spark.yarn.max.executor.failures=1 - fail early if a Spark executor dies
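(A minimal sketch - an illustration, not taken from the tuning guide - of setting these two options programmatically on the SparkConf; the same keys can equally be passed to spark-submit via --conf.)

import org.apache.spark.SparkConf

// Fail fast instead of letting H2O start on a partially allocated cluster.
val conf = new SparkConf()
  .set("spark.scheduler.minRegisteredResourcesRatio", "1")  // wait until all executors register
  .set("spark.yarn.max.executor.failures", "1")             // give up early if an executor dies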

@Du-Li

Du-Li commented Oct 23, 2017

@mmalohlava I tried these two options in the spark-submit CLI (--conf) but still got the same error. The same Scala script (without creating a SparkContext) worked with sparkling-shell, though. The Spark and Sparkling Water version I tried was 2.2.

Do you have any further suggestions? Thanks.

@jakubhava
Contributor

Hi @Du-Li, so this happened also with dynamic allocation disabled?

Could you please share a bit more information about your environment? Are you running on YARN (cluster or client mode)? Are your physical nodes connected to multiple network interfaces?

Thanks! Kuba

@Mageswaran1989

Hi Team,

I am also facing the same issue when I hit the line here.

Versions used:
H2O: http://search.maven.org/#artifactdetails%7Cai.h2o%7Csparkling-water-ml_2.11%7C2.2.2%7Cjar
Spark: 2.2
Zeppelin: 0.7.3

17/10/31 17:58:30 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.1 KB, free 6.2 GB)
17/10/31 17:58:30 INFO NativeLibrary: Loaded XGBoost library from lib/linux_64/libxgboost4j.so (/tmp/libxgboost4j4983128618196471307.so)
17/10/31 17:58:31 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[H2O Launcher thread,5,main]
java.lang.NoClassDefFoundError: org/eclipse/jetty/io/EofException
	at water.init.AbstractBuildVersion.getLatestH2OVersion(AbstractBuildVersion.java:84)
	at water.H2O.printAndLogVersion(H2O.java:1430)
	at water.H2O.main(H2O.java:1892)
	at water.H2OStarter.start(H2OStarter.java:21)
	at water.H2OStarter.start(H2OStarter.java:46)
	at org.apache.spark.h2o.backends.internal.InternalBackendUtils$$anonfun$7$$anon$1.run(InternalBackendUtils.scala:138)
Caused by: java.lang.ClassNotFoundException: org.eclipse.jetty.io.EofException
	at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	... 6 more
17/10/31 17:58:31 INFO DiskBlockManager: Shutdown hook called
17/10/31 17:58:31 INFO ShutdownHookManager: Shutdown hook called
17/10/31 17:58:31 INFO ShutdownHookManager: Deleting directory /tmp/spark-5c64550c-e5cc-4213-9292-fbee323b6cbd/executor-4c0ad4d4-d21f-44ea-b06a-c239eb812959/spark-9ff393e4-9ca6-47e1-8a42-013a6a5028e2

Thanks.

@jakubhava
Contributor

Hi @Mageswaran1989, this seems to be a different error.

Can you please tell us more about your environment (YARN client, YARN cluster, standalone, local)?
Are you using vanilla Spark or CDH/HDP/MapR?
Can you please share the code you use to start the Sparkling Water app/shell?

Thanks, Kuba

@Mageswaran1989

@jakubhava
It works in local mode!

On vanilla Spark, in standalone cluster mode, it throws the above error.

When I first encountered the IP mismatch, I observed that the old executors went down and new executors came up for my Zeppelin application on the cluster.

In the Zeppelin dependencies, I have also added "no.priv.garshol.duke:duke:1.2", as directed by the H2O logs.

I was following the example code from the examples folder, and it stops at https://github.com/h2oai/sparkling-water/blob/master/examples/pipelines/hamOrSpam.script.scala#L41

Please find the logs attached:
h2o_sparkling_error_log.txt

@jakubhava
Contributor

jakubhava commented Oct 31, 2017

Thanks for the info @Mageswaran1989.

One additional question - can you please share the shell command you use to start your Sparkling Water example? That would help a lot; in particular, we could see how you get the Sparkling Water artefacts onto the cluster. Using Sparkling Water via --packages or from the Maven package is currently broken (cc: @mmalohlava just for reference) because we use a different Jetty than Spark. The error Caused by: java.lang.ClassNotFoundException: org.eclipse.jetty.io.EofException is probably caused by Spark adding a different Jetty version to the class-path.

The solution for now is to use the artefacts from the Sparkling Water distribution downloadable from our web page - https://www.h2o.ai/download/. It contains the assembly JAR with all the correct dependencies, including the correct (shaded) Jetty version.

@Mageswaran1989

Mageswaran1989 commented Oct 31, 2017

@jakubhava
I used Zeppelin for this try-out.

I added the Maven coordinates of the Sparkling Water ML and examples artifacts, along with "no.priv.garshol.duke:duke:1.2".

Can overriding the existing Jetty version in Zeppelin solve this problem? If so, please provide the Maven coordinates of the right one.

@jakubhava
Contributor

OK, this is the problem. You need to use the Sparkling Water assembly JAR available in the Sparkling Water distribution downloadable from our page. The artefacts on Maven currently contain a wrong Jetty version, which prevents using Sparkling Water via the Maven artefacts.

@Mageswaran1989

@jakubhava It works with the assembly JAR! Thank you very much for the quick replies!

@jakubhava
Contributor

Thanks for the update! We already have a PR in progress which should unlock using Sparkling Water via Maven - #352. You can track that PR to see the progress.
