IPs are not equal" error when starting H2OContext with Spark Context in Zeppelin #291
Comments
Hi @lordlinus, thanks for the report! Quick question: are the nodes on which you're trying to start Sparkling Water connected to multiple network interfaces?
@jakubhava only one network interface for the datanodes.
@lordlinus and for the master node? This can also happen when the H2O running inside Spark picks up a wrong IP address (based on the Spark address) that is not on the same network as the rest of the workers. Can you please verify whether the driver node has a single network interface or multiple? Thanks!
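(For anyone following along: a quick way to verify this on each node is to list the interfaces with standard Linux commands - nothing Sparkling-Water-specific:)

ip addr show    # lists all interfaces and their IP addresses
ifconfig -a     # equivalent on older systems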
@jakubhava the master node has a single network interface too... I have also specified the options below, but I get the same error.
Oki, thanks for sharing. In that case we would need the full YARN and H2O logs for this run to see what might have gone wrong. YARN logs can be obtained using the following shell command:
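(The exact command did not survive in this copy of the thread; the standard YARN CLI invocation for collecting aggregated application logs looks like this, with the output file name just an example:)

yarn logs -applicationId <application_id> > yarn_logs.txt

where <application_id> can be taken from the Spark UI or from yarn application -list.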
It would also help a lot if you could try the same code on Sparkling Water for Spark 2.1 or Spark 2.0 and see if the error occurs there as well. Those are the two major versions we support at the moment (very critical fixes can still go to Sparkling Water for Spark 1.6 as well).
@lordlinus Just pinging about progress - have you tried Sparkling Water 2.1.x? Do you need any help obtaining the logs?
Closing this issue for now. @lordlinus, please feel free to re-open with the logs attached.
Hi @jakubhava, I'm hitting the same exception on YARN and I think I understand why it's happening, but I'd like to confirm it. Is there a private channel through which I can send you logs and all related information?
Hi @raveeram, you can either share it here or send all the relevant information to my work mail, which is jakub[at]h2o[dot]ai. Please also share your Sparkling Water version, Spark version, and deployment mode. Thanks! Kuba
@jakubhava I am also hitting this issue when running my Sparkling Water job on YARN. I tried several different versions, from Spark 1.6.2+ and Sparkling Water 1.6.8+. Do you know how to fix this problem? Thanks.
@Du-Li in short, it is our assertion checking the locality of job scheduling in Spark. Do you have the same problem with Spark 2.x?
@mmalohlava yes. I tried Spark 1.6.2, 2.1.0, and 2.2.0, and they all had this unequal-IPs error on SpreadRDD.
Do you have elasticity enabled for the YARN queue you are submitting into? We have a tuning guide for YARN here: https://github.com/h2oai/sparkling-water/blob/master/doc/configuration/internal_backend_tuning.rst. The most interesting options for you could be the ones sketched below:
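(A sketch of the kind of settings the tuning guide recommends for the internal backend - not necessarily the exact options from the original comment; both properties are standard Spark configuration:)

spark-submit \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.scheduler.minRegisteredResourcesRatio=1 \
  ...

Disabling dynamic allocation keeps the executor set stable, and minRegisteredResourcesRatio=1 makes Spark wait for all executors to register before scheduling tasks; both help the H2O cloud form on a fixed set of nodes.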
@mmalohlava I tried these two options in the spark-submit CLI (--conf) but still got the same error. The same Scala script (without creating a SparkContext) worked with sparkling-shell, though. The Spark and Sparkling Water versions I tried were 2.2. Do you have any further suggestions? Thanks.
Hi @Du-Li, so this happened also with dynamic allocation disabled? Could you please share a bit more information about your environment? Are you running on YARN (cluster|client mode)? Are your physical nodes connected to multiple network interfaces? Thanks! Kuba
Hi team, I am also facing the same issue when I hit the line here. Versions used:
Thanks. |
Hi @Mageswaran1989, this seems to be a different error. Can you please tell us more about your environment (YARN client, YARN master, standalone, local)? Thanks, Kuba
@jakubhava On vanilla Spark, in standalone cluster mode, it throws the above error. When I first encountered the IP mismatch, I observed that the old executors went down and new executors came up for my Zeppelin application from the cluster. In the Zeppelin dependencies, I have also added "no.priv.garshol.duke:duke:1.2", as directed by the H2O logs. I was following the example code from the examples folder, and it stops at https://github.com/h2oai/sparkling-water/blob/master/examples/pipelines/hamOrSpam.script.scala#L41 Please find the logs attached.
Thanks for the info @Mageswaran1989. One additional question - can you please share the shell code you use to start your Sparkling Water example? That would help a lot; in particular, we could see how you get the Sparkling Water artefacts to the cluster. Using Sparkling Water via the Maven artefacts is currently problematic. The solution for now would be to use the artefacts from the Sparkling Water distribution downloadable from our web page - https://www.h2o.ai/download/. It contains the JAR with all the correct dependencies and also with the correct (shadowed) Jetty version.
@jakubhava I added the Maven coordinates of the Sparkling Water ML and examples modules along with "no.priv.garshol.duke:duke:1.2". Can overriding the existing Jetty version in Zeppelin solve this problem? If so, please provide the Maven coordinates of the right one.
Oki, this is the problem. You need to use the Sparkling Water assembly JAR available in the Sparkling Water distribution downloadable from our page. The artefacts on Maven currently contain a wrong Jetty version, which prevents using Sparkling Water via the Maven artefacts.
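(For reference, a minimal sketch of wiring the assembly JAR into Zeppelin through SPARK_SUBMIT_OPTIONS in zeppelin-env.sh; the unpack location and version below are hypothetical, so adjust the path to wherever you extracted the distribution:)

# zeppelin-env.sh - path and version are illustrative only
export SPARK_SUBMIT_OPTIONS="--jars /opt/sparkling-water-2.2.2/assembly/build/libs/sparkling-water-assembly_2.11-2.2.2-all.jar"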
@jakubhava it works with the assembly JAR! Thank you very much for the quick replies!
Thanks for the update! We already have a PR in progress which should unlock using Sparkling Water via Maven - #352. You can track that PR to see the progress.
Hi, similar to #37: when trying to start an H2OContext via Zeppelin using the latest Sparkling Water assembly (sparkling-water-assembly_2.10-1.6.11-all.jar), I got the error below.
spark version ->
1.6.2
sparkling water version ->
1.6.11
deployment type (Spark MASTER variable - local, yarn) ->
Spark YARN client mode (Zeppelin)
date on which this exception happened ->
today
reproducible code ->
import org.apache.spark.h2o._
val h2oContext = H2OContext.getOrCreate(sc)
import h2oContext._
import h2oContext.implicits._
Appreciate any help