Sparkling Water fails to create h2oContext in simple spark project #466

Closed · renegademonkey opened this issue Nov 17, 2017 · 17 comments

@renegademonkey

I am setting up Sparkling Water for the first time on a standalone cluster running Spark 2.2. I have run Sparkling Water on such a cluster before via R (using rsparkling + sparklyr + h2o), but am having issues setting it up as a Spark application (in Scala).

The app is built with Maven, so I have added the latest Sparkling Water dependency:

<dependency>
    <groupId>ai.h2o</groupId>
    <artifactId>sparkling-water-core_2.11</artifactId>
    <version>2.2.2</version>
</dependency>

Then the app code is as follows:

package com.me.app

import org.apache.spark.sql.{DataFrame, SparkSession}        
import org.apache.spark.h2o._
import water.Key
import water.fvec.Frame

object sparklingWaterH2o {

  def main(args: Array[String]): Unit = {

    val sparkSession = SparkSession
      .builder()
      .master("spark://cluster.address:7077")
      .appName("sparklingWaterH2o")
      .config("spark.executor.memory", "32G")
      .config("spark.executor.cores", "5")
      .config("spark.cores.max", "40")
      .config("spark.ext.h2o.nthreads", "40")
      .config("spark.jars", "/path/to/fat/jar/app-1.0-SNAPSHOT-jar-with-dependencies.jar")
      .getOrCreate()

    // needed for the .toDF call below
    import sparkSession.implicits._

    val h2oContext = H2OContext.getOrCreate(sparkSession)

    import h2oContext._

    val df = Seq(
      (1, "2014/07/31 23:00:01"),
      (1, "2016/12/09 10:12:43")).toDF("id", "date")

    val h2oTrainFrame = h2oContext.asH2OFrame(df)

    println(s"h2oContext = ${h2oContext.toString()}")
  }
}

I then compile the fat jar and submit it to the cluster; however, the h2oContext never gets created and the SparkContext gets shut down with exit code 255. The app exits with no errors in the log before an h2o context is created - the only potentially useful message is IP address not found on this machine.

I've tried this with Sparkling Water version 2.2.0 and get the same issue. I have also tried adding dependencies for sparkling-water-ml and sparkling-water-repl, as well as all the h2o core dependencies (though I assume these are not needed, as they are already bundled into sparkling water?).

The strange thing is that I get almost exactly the same issue when trying to connect via R (using rsparkling and sparklyr, see here) - and that method worked correctly up until a few weeks ago.

See log below.


objc[39611]: Class JavaLaunchHelper is implemented in both /Library/Java/JavaVirtualMachines/jdk1.8.0_131.jdk/Contents/Home/bin/java (0x10ab4b4c0) and /Library/Java/JavaVirtualMachines/jdk1.8.0_131.jdk/Contents/Home/jre/lib/libinstrument.dylib (0x10bb724e0). One of the two will be used. Which one is undefined.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/Users/username/.m2/repository/org/slf4j/slf4j-log4j12/1.7.16/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/username/.m2/repository/org/apache/logging/log4j/log4j-slf4j-impl/2.6.2/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
17/11/17 10:16:01 INFO SparkContext: Running Spark version 2.2.0
17/11/17 10:16:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/11/17 10:16:02 INFO SparkContext: Submitted application: sparklingWaterH2o
17/11/17 10:16:02 INFO SecurityManager: Changing view acls to: username
17/11/17 10:16:02 INFO SecurityManager: Changing modify acls to: username
17/11/17 10:16:02 INFO SecurityManager: Changing view acls groups to: 
17/11/17 10:16:02 INFO SecurityManager: Changing modify acls groups to: 
17/11/17 10:16:02 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(username); groups with view permissions: Set(); users  with modify permissions: Set(username); groups with modify permissions: Set()
17/11/17 10:16:03 INFO Utils: Successfully started service 'sparkDriver' on port 53775.
17/11/17 10:16:03 INFO SparkEnv: Registering MapOutputTracker
17/11/17 10:16:03 INFO SparkEnv: Registering BlockManagerMaster
17/11/17 10:16:03 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
17/11/17 10:16:03 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
17/11/17 10:16:03 INFO DiskBlockManager: Created local directory at /private/var/folders/gl/vgw262w9227cwqvzk595rbvjygdzh8/T/blockmgr-d29de5c5-9116-4abf-812c-04ca680781fe
17/11/17 10:16:03 INFO MemoryStore: MemoryStore started with capacity 1002.3 MB
17/11/17 10:16:03 INFO SparkEnv: Registering OutputCommitCoordinator
17/11/17 10:16:03 INFO Utils: Successfully started service 'SparkUI' on port 4040.
17/11/17 10:16:03 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://192.168.103.46:4040
17/11/17 10:16:03 INFO SparkContext: Added JAR /path/to/app/target/app-1.0-SNAPSHOT-jar-with-dependencies.jar at spark://192.168.103.46:53775/jars/app-1.0-SNAPSHOT-jar-with-dependencies.jar with timestamp 1510913763424
17/11/17 10:16:03 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://rnd-centos7-ben-31.nominet.org.uk:7077...
17/11/17 10:16:03 INFO TransportClientFactory: Successfully created connection to rnd-centos7-ben-31.nominet.org.uk/XXX.XXX.211.31:7077 after 26 ms (0 ms spent in bootstraps)
17/11/17 10:16:03 INFO StandaloneSchedulerBackend: Connected to Spark cluster with app ID app-20171117101603-0031
17/11/17 10:16:03 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20171117101603-0031/0 on worker-20171013100055-XXX.XXX.211.30-33565 (XXX.XXX.211.30:33565) with 5 cores
17/11/17 10:16:03 INFO StandaloneSchedulerBackend: Granted executor ID app-20171117101603-0031/0 on hostPort XXX.XXX.211.30:33565 with 5 cores, 32.0 GB RAM
17/11/17 10:16:03 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20171117101603-0031/1 on worker-20171013100055-XXX.XXX.211.33-34424 (XXX.XXX.211.33:34424) with 5 cores
17/11/17 10:16:03 INFO StandaloneSchedulerBackend: Granted executor ID app-20171117101603-0031/1 on hostPort XXX.XXX.211.33:34424 with 5 cores, 32.0 GB RAM
17/11/17 10:16:03 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20171117101603-0031/2 on worker-20171013100055-XXX.XXX.211.31-37513 (XXX.XXX.211.31:37513) with 5 cores
17/11/17 10:16:03 INFO StandaloneSchedulerBackend: Granted executor ID app-20171117101603-0031/2 on hostPort XXX.XXX.211.31:37513 with 5 cores, 32.0 GB RAM
17/11/17 10:16:03 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20171117101603-0031/3 on worker-20171013100054-XXX.XXX.211.32-36797 (XXX.XXX.211.32:36797) with 5 cores
17/11/17 10:16:03 INFO StandaloneSchedulerBackend: Granted executor ID app-20171117101603-0031/3 on hostPort XXX.XXX.211.32:36797 with 5 cores, 32.0 GB RAM
17/11/17 10:16:03 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20171117101603-0031/2 is now RUNNING
17/11/17 10:16:03 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20171117101603-0031/1 is now RUNNING
17/11/17 10:16:03 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20171117101603-0031/3 is now RUNNING
17/11/17 10:16:03 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20171117101603-0031/0 is now RUNNING
17/11/17 10:16:03 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 53777.
17/11/17 10:16:03 INFO NettyBlockTransferService: Server created on 192.168.103.46:53777
17/11/17 10:16:03 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
17/11/17 10:16:03 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 192.168.103.46, 53777, None)
17/11/17 10:16:03 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.103.46:53777 with 1002.3 MB RAM, BlockManagerId(driver, 192.168.103.46, 53777, None)
17/11/17 10:16:03 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 192.168.103.46, 53777, None)
17/11/17 10:16:03 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 192.168.103.46, 53777, None)
17/11/17 10:16:05 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (XXX.XXX.211.31:46906) with ID 2
17/11/17 10:16:05 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (XXX.XXX.211.30:54738) with ID 0
17/11/17 10:16:05 INFO BlockManagerMasterEndpoint: Registering block manager XXX.XXX.211.31:45376 with 8.4 GB RAM, BlockManagerId(2, XXX.XXX.211.31, 45376, None)
17/11/17 10:16:05 INFO BlockManagerMasterEndpoint: Registering block manager XXX.XXX.211.30:34172 with 8.4 GB RAM, BlockManagerId(0, XXX.XXX.211.30, 34172, None)
17/11/17 10:16:05 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (XXX.XXX.211.32:53076) with ID 3
17/11/17 10:16:05 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (XXX.XXX.211.33:47478) with ID 1
17/11/17 10:16:05 INFO BlockManagerMasterEndpoint: Registering block manager XXX.XXX.211.32:34360 with 8.4 GB RAM, BlockManagerId(3, XXX.XXX.211.32, 34360, None)
17/11/17 10:16:05 INFO BlockManagerMasterEndpoint: Registering block manager XXX.XXX.211.33:34342 with 8.4 GB RAM, BlockManagerId(1, XXX.XXX.211.33, 34342, None)
17/11/17 10:16:33 INFO StandaloneSchedulerBackend: SchedulerBackend is ready for scheduling beginning after waiting maxRegisteredResourcesWaitingTime: 30000(ms)
17/11/17 10:16:33 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/path/to/app/spark-warehouse/').
17/11/17 10:16:33 INFO SharedState: Warehouse path is 'file:/path/to/app/spark-warehouse/'.
17/11/17 10:16:34 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
17/11/17 10:16:34 WARN InternalH2OBackend: Increasing 'spark.locality.wait' to value 30000
17/11/17 10:16:34 WARN InternalH2OBackend: Due to non-deterministic behavior of Spark broadcast-based joins
We recommend to disable them by
configuring `spark.sql.autoBroadcastJoinThreshold` variable to value `-1`:
sqlContext.sql("SET spark.sql.autoBroadcastJoinThreshold=-1")
17/11/17 10:16:34 INFO InternalH2OBackend: Starting H2O services: Sparkling Water configuration:
  backend cluster mode : internal
  workers              : None
  cloudName            : sparkling-water-username_app-20171117101603-0031
  flatfile             : true
  clientBasePort       : 54321
  nodeBasePort         : 54321
  cloudTimeout         : 60000
  h2oNodeLog           : INFO
  h2oClientLog         : WARN
  nthreads             : 40
  drddMulFactor        : 10
17/11/17 10:16:34 INFO SparkContext: Starting job: collect at SpreadRDDBuilder.scala:105
17/11/17 10:16:34 INFO DAGScheduler: Got job 0 (collect at SpreadRDDBuilder.scala:105) with 41 output partitions
17/11/17 10:16:34 INFO DAGScheduler: Final stage: ResultStage 0 (collect at SpreadRDDBuilder.scala:105)
17/11/17 10:16:34 INFO DAGScheduler: Parents of final stage: List()
17/11/17 10:16:34 INFO DAGScheduler: Missing parents: List()
17/11/17 10:16:34 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at mapPartitionsWithIndex at SpreadRDDBuilder.scala:102), which has no missing parents
17/11/17 10:16:34 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 2.1 KB, free 1002.3 MB)
17/11/17 10:16:34 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1379.0 B, free 1002.3 MB)
17/11/17 10:16:34 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.103.46:53777 (size: 1379.0 B, free: 1002.3 MB)
17/11/17 10:16:34 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1006
17/11/17 10:16:34 INFO DAGScheduler: Submitting 41 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at mapPartitionsWithIndex at SpreadRDDBuilder.scala:102) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
17/11/17 10:16:34 INFO TaskSchedulerImpl: Adding task set 0.0 with 41 tasks
17/11/17 10:16:34 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, XXX.XXX.211.31, executor 2, partition 0, PROCESS_LOCAL, 4829 bytes)
17/11/17 10:16:34 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, XXX.XXX.211.30, executor 0, partition 1, PROCESS_LOCAL, 4829 bytes)
...
17/11/17 10:16:34 INFO TaskSetManager: Starting task 19.0 in stage 0.0 (TID 19, XXX.XXX.211.33, executor 1, partition 19, PROCESS_LOCAL, 4829 bytes)
17/11/17 10:16:43 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on XXX.XXX.211.30:34172 (size: 1379.0 B, free: 8.4 GB)
17/11/17 10:16:43 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on XXX.XXX.211.32:34360 (size: 1379.0 B, free: 8.4 GB)
17/11/17 10:16:43 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on XXX.XXX.211.33:34342 (size: 1379.0 B, free: 8.4 GB)
17/11/17 10:16:43 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on XXX.XXX.211.31:45376 (size: 1379.0 B, free: 8.4 GB)
17/11/17 10:16:43 INFO BlockManagerInfo: Added rdd_0_13 in memory on XXX.XXX.211.30:34172 (size: 32.0 B, free: 8.4 GB)
...
17/11/17 10:16:43 INFO TaskSetManager: Finished task 40.0 in stage 0.0 (TID 40) in 29 ms on XXX.XXX.211.33 (executor 1) (41/41)
17/11/17 10:16:43 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
17/11/17 10:16:43 INFO DAGScheduler: ResultStage 0 (collect at SpreadRDDBuilder.scala:105) finished in 8.913 s
17/11/17 10:16:43 INFO DAGScheduler: Job 0 finished: collect at SpreadRDDBuilder.scala:105, took 9.072610 s
17/11/17 10:16:43 INFO ParallelCollectionRDD: Removing RDD 0 from persistence list
17/11/17 10:16:43 INFO BlockManager: Removing RDD 0
17/11/17 10:16:43 INFO SpreadRDDBuilder: Detected 4 spark executors for 4 H2O workers!
17/11/17 10:16:43 INFO InternalH2OBackend: Launching H2O on following 4 nodes: (0,XXX.XXX.211.30,-1),(1,XXX.XXX.211.33,-1),(2,XXX.XXX.211.31,-1),(3,XXX.XXX.211.32,-1)
17/11/17 10:16:43 INFO SparkContext: Starting job: collect at InternalBackendUtils.scala:163
17/11/17 10:16:43 INFO DAGScheduler: Got job 1 (collect at InternalBackendUtils.scala:163) with 4 output partitions
17/11/17 10:16:43 INFO DAGScheduler: Final stage: ResultStage 1 (collect at InternalBackendUtils.scala:163)
17/11/17 10:16:43 INFO DAGScheduler: Parents of final stage: List()
17/11/17 10:16:43 INFO DAGScheduler: Missing parents: List()
17/11/17 10:16:43 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[3] at map at InternalBackendUtils.scala:100), which has no missing parents
17/11/17 10:16:43 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.1 KB, free 1002.3 MB)
17/11/17 10:16:43 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2029.0 B, free 1002.3 MB)
17/11/17 10:16:43 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.103.46:53777 (size: 2029.0 B, free: 1002.3 MB)
17/11/17 10:16:43 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
17/11/17 10:16:43 INFO DAGScheduler: Submitting 4 missing tasks from ResultStage 1 (MapPartitionsRDD[3] at map at InternalBackendUtils.scala:100) (first 15 tasks are for partitions Vector(0, 1, 2, 3))
17/11/17 10:16:43 INFO TaskSchedulerImpl: Adding task set 1.0 with 4 tasks
17/11/17 10:16:43 INFO TaskSetManager: Starting task 2.0 in stage 1.0 (TID 41, XXX.XXX.211.31, executor 2, partition 2, NODE_LOCAL, 4821 bytes)
17/11/17 10:16:43 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 42, XXX.XXX.211.30, executor 0, partition 0, NODE_LOCAL, 4821 bytes)
17/11/17 10:16:43 INFO TaskSetManager: Starting task 3.0 in stage 1.0 (TID 43, XXX.XXX.211.32, executor 3, partition 3, NODE_LOCAL, 4821 bytes)
17/11/17 10:16:43 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 44, XXX.XXX.211.33, executor 1, partition 1, NODE_LOCAL, 4821 bytes)
17/11/17 10:16:43 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on XXX.XXX.211.30:34172 (size: 2029.0 B, free: 8.4 GB)
17/11/17 10:16:43 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on XXX.XXX.211.31:45376 (size: 2029.0 B, free: 8.4 GB)
17/11/17 10:16:43 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on XXX.XXX.211.33:34342 (size: 2029.0 B, free: 8.4 GB)
17/11/17 10:16:43 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on XXX.XXX.211.32:34360 (size: 2029.0 B, free: 8.4 GB)
17/11/17 10:16:43 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 42) in 349 ms on XXX.XXX.211.30 (executor 0) (1/4)
17/11/17 10:16:43 INFO TaskSetManager: Finished task 2.0 in stage 1.0 (TID 41) in 358 ms on XXX.XXX.211.31 (executor 2) (2/4)
17/11/17 10:16:43 INFO TaskSetManager: Finished task 3.0 in stage 1.0 (TID 43) in 394 ms on XXX.XXX.211.32 (executor 3) (3/4)
17/11/17 10:16:43 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 44) in 408 ms on XXX.XXX.211.33 (executor 1) (4/4)
17/11/17 10:16:43 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 
17/11/17 10:16:43 INFO DAGScheduler: ResultStage 1 (collect at InternalBackendUtils.scala:163) finished in 0.411 s
17/11/17 10:16:43 INFO DAGScheduler: Job 1 finished: collect at InternalBackendUtils.scala:163, took 0.428038 s
17/11/17 10:16:43 INFO SparkContext: Starting job: foreach at InternalBackendUtils.scala:175
17/11/17 10:16:43 INFO DAGScheduler: Got job 2 (foreach at InternalBackendUtils.scala:175) with 4 output partitions
17/11/17 10:16:43 INFO DAGScheduler: Final stage: ResultStage 2 (foreach at InternalBackendUtils.scala:175)
17/11/17 10:16:43 INFO DAGScheduler: Parents of final stage: List()
17/11/17 10:16:43 INFO DAGScheduler: Missing parents: List()
17/11/17 10:16:43 INFO DAGScheduler: Submitting ResultStage 2 (InvokeOnNodesRDD[2] at RDD at InvokeOnNodesRDD.scala:27), which has no missing parents
17/11/17 10:16:43 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 1832.0 B, free 1002.3 MB)
17/11/17 10:16:43 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1209.0 B, free 1002.3 MB)
17/11/17 10:16:43 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.103.46:53777 (size: 1209.0 B, free: 1002.3 MB)
17/11/17 10:16:43 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1006
17/11/17 10:16:43 INFO DAGScheduler: Submitting 4 missing tasks from ResultStage 2 (InvokeOnNodesRDD[2] at RDD at InvokeOnNodesRDD.scala:27) (first 15 tasks are for partitions Vector(0, 1, 2, 3))
17/11/17 10:16:43 INFO TaskSchedulerImpl: Adding task set 2.0 with 4 tasks
17/11/17 10:16:43 INFO TaskSetManager: Starting task 2.0 in stage 2.0 (TID 45, XXX.XXX.211.31, executor 2, partition 2, NODE_LOCAL, 4821 bytes)
17/11/17 10:16:43 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 46, XXX.XXX.211.30, executor 0, partition 0, NODE_LOCAL, 4821 bytes)
17/11/17 10:16:43 INFO TaskSetManager: Starting task 3.0 in stage 2.0 (TID 47, XXX.XXX.211.32, executor 3, partition 3, NODE_LOCAL, 4821 bytes)
17/11/17 10:16:43 INFO TaskSetManager: Starting task 1.0 in stage 2.0 (TID 48, XXX.XXX.211.33, executor 1, partition 1, NODE_LOCAL, 4821 bytes)
17/11/17 10:16:43 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on XXX.XXX.211.31:45376 (size: 1209.0 B, free: 8.4 GB)
17/11/17 10:16:43 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on XXX.XXX.211.33:34342 (size: 1209.0 B, free: 8.4 GB)
17/11/17 10:16:43 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on XXX.XXX.211.32:34360 (size: 1209.0 B, free: 8.4 GB)
17/11/17 10:16:43 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on XXX.XXX.211.30:34172 (size: 1209.0 B, free: 8.4 GB)
17/11/17 10:16:43 INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID 46) in 28 ms on XXX.XXX.211.30 (executor 0) (1/4)
17/11/17 10:16:43 INFO TaskSetManager: Finished task 1.0 in stage 2.0 (TID 48) in 28 ms on XXX.XXX.211.33 (executor 1) (2/4)
17/11/17 10:16:43 INFO TaskSetManager: Finished task 2.0 in stage 2.0 (TID 45) in 30 ms on XXX.XXX.211.31 (executor 2) (3/4)
17/11/17 10:16:43 INFO TaskSetManager: Finished task 3.0 in stage 2.0 (TID 47) in 32 ms on XXX.XXX.211.32 (executor 3) (4/4)
17/11/17 10:16:43 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool 
17/11/17 10:16:43 INFO DAGScheduler: ResultStage 2 (foreach at InternalBackendUtils.scala:175) finished in 0.034 s
17/11/17 10:16:43 INFO DAGScheduler: Job 2 finished: foreach at InternalBackendUtils.scala:175, took 0.043737 s
17/11/17 10:16:43 INFO InternalH2OBackend: Starting H2O client on the Spark Driver (192.168.103.46): -name sparkling-water-username_app-20171117101603-0031 -nthreads 40 -ga_opt_out -quiet -log_level WARN -log_dir /path/to/app/h2ologs/app-20171117101603-0031 -baseport 54321 -client -ip 192.168.103.46 -flatfile /var/folders/gl/vgw262w9227cwqvzk595rbvjygdzh8/T/1510913803950-0/flatfile.txt
17/11/17 10:16:44 INFO NativeLibrary: Loaded XGBoost library from lib/osx_64/libxgboost4j.dylib (/var/folders/gl/vgw262w9227cwqvzk595rbvjygdzh8/T/libxgboost4j2584224510491657515.dylib)
Found XGBoost backend with library: xgboost4j
Your system supports only minimal version of XGBoost (no GPUs, no multithreading)!
IP address not found on this machine
17/11/17 10:16:45 INFO SparkContext: Invoking stop() from shutdown hook
17/11/17 10:16:45 INFO SparkUI: Stopped Spark web UI at http://192.168.103.46:4040
17/11/17 10:16:45 INFO StandaloneSchedulerBackend: Shutting down all executors
17/11/17 10:16:45 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down
17/11/17 10:16:45 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
17/11/17 10:16:45 INFO MemoryStore: MemoryStore cleared
17/11/17 10:16:45 INFO BlockManager: BlockManager stopped
17/11/17 10:16:45 INFO BlockManagerMaster: BlockManagerMaster stopped
17/11/17 10:16:45 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
17/11/17 10:16:45 INFO SparkContext: Successfully stopped SparkContext
17/11/17 10:16:45 INFO ShutdownHookManager: Shutdown hook called
17/11/17 10:16:45 INFO ShutdownHookManager: Deleting directory /private/var/folders/gl/vgw262w9227cwqvzk595rbvjygdzh8/T/spark-51594e29-1ea0-4a4d-9aa0-dd65ef5146dd
@jakubhava
Contributor

Hi @renegademonkey, thanks for the report. It seems that for some reason H2O couldn't find an IP address on the local machine. It has happened to me as well on some networks.

Can we verify this in a slightly simpler environment, please? Could you download standalone H2O from the following link - http://h2o-release.s3.amazonaws.com/h2o/rel-weierstrass/7/index.html - and start H2O as instructed on the page? A single node is fine. I'm interested in whether this is also reproducible with pure H2O, without Sparkling Water.

Thanks!

Kuba

@renegademonkey
Author

renegademonkey commented Nov 17, 2017

Hi @jakubhava ! Thanks for the prompt response - this has been driving me mad for days. The strangest thing is that every few days sparkling water WILL work once with the same setup, but then doesn't work again :-/ .

So running h2o in local standalone mode works fine and I can start up Flow (see full log below). I can also run h2o from R in local mode without problems. Any ideas what could be causing this, or a workaround?

h2o connect log:

INFO: Found XGBoost backend with library: xgboost4j
INFO: Your system supports only minimal version of XGBoost (no GPUs, no multithreading)!
INFO: ----- H2O started  -----
INFO: Build git branch: rel-weierstrass
INFO: Build git hash: f1f65fe382ac4a0dc4631b9ee1468eb8eef7b27e
INFO: Build git describe: jenkins-3.14.0.6-6-gf1f65fe38
INFO: Build project version: 3.14.0.7 (latest version: 3.14.0.7)
INFO: Build age: 27 days
INFO: Built by: 'jenkins'
INFO: Built on: '2017-10-20 18:37:17'
INFO: Watchdog Build git branch: (unknown)
INFO: Watchdog Build git hash: (unknown)
INFO: Watchdog Build git describe: (unknown)
INFO: Watchdog Build project version: (unknown)
INFO: Watchdog Built by: (unknown)
INFO: Watchdog Built on: (unknown)
INFO: XGBoost Build git branch: (unknown)
INFO: XGBoost Build git hash: (unknown)
INFO: XGBoost Build git describe: (unknown)
INFO: XGBoost Build project version: (unknown)
INFO: XGBoost Built by: (unknown)
INFO: XGBoost Built on: (unknown)
INFO: KrbStandalone Build git branch: (unknown)
INFO: KrbStandalone Build git hash: (unknown)
INFO: KrbStandalone Build git describe: (unknown)
INFO: KrbStandalone Build project version: (unknown)
INFO: KrbStandalone Built by: (unknown)
INFO: KrbStandalone Built on: (unknown)
INFO: Processed H2O arguments: []
INFO: Java availableProcessors: 4
INFO: Java heap totalMemory: 245.5 MB
INFO: Java heap maxMemory: 3.56 GB
INFO: Java version: Java 1.8.0_131 (from Oracle Corporation)
INFO: JVM launch parameters: []
INFO: OS version: Mac OS X 10.13.1 (x86_64)
INFO: Machine physical memory: 16.00 GB
INFO: X-h2o-cluster-id: 1510932531000
INFO: User name: 'user'
INFO: IPv6 stack selected: false
INFO: Network address/interface is not reachable in 150ms: /fe80:0:0:0:34c2:b426:f4c3:4752%utun0/name:utun0 (utun0)
INFO: Network address/interface is not reachable in 150ms: /fe80:0:0:0:1462:cde6:7089:b299%en4/name:en4 (en4)
INFO: Possible IP Address: en4 (en4), 192.168.103.95
INFO: Network address/interface is not reachable in 150ms: /fe80:0:0:0:0:0:0:1%lo0/name:lo0 (lo0)
INFO: Network address/interface is not reachable in 150ms: /0:0:0:0:0:0:0:1/name:lo0 (lo0)
INFO: Possible IP Address: lo0 (lo0), 127.0.0.1
INFO: H2O node running in unencrypted mode.
INFO: Internal communication uses port: 54324
INFO: Listening for HTTP and REST traffic on http://192.168.103.95:54323/
INFO: H2O cloud name: 'user' on /192.168.103.95:54323, discovery address /XXX.XXX.117.70:60887
INFO: If you have trouble connecting, try SSH tunneling from your local machine (e.g., via port 55555):
INFO:   1. Open a terminal and run 'ssh -L 55555:localhost:54323 user@192.168.103.95'
INFO:   2. Point your browser to http://localhost:55555
INFO: Log dir: '/tmp/h2o-user/h2ologs'
INFO: Cur dir: '/Users/user/h2o-3.14.0.7'
INFO: HDFS subsystem successfully initialized
INFO: S3 subsystem successfully initialized
INFO: Flow dir: '/Users/user/h2oflows'
INFO: Cloud of size 1 formed [/192.168.103.95:54323]
INFO: Registered parsers: [GUESS, ARFF, XLS, SVMLight, AVRO, PARQUET, CSV]
INFO: Watchdog extension initialized
INFO: XGBoost extension initialized
INFO: KrbStandalone extension initialized
INFO: Registered 3 core extensions in: 58ms
INFO: Registered H2O core extensions: [Watchdog, XGBoost, KrbStandalone]
INFO: Registered: 161 REST APIs in: 237ms
INFO: Registered REST API extensions: [XGBoost, Algos, AutoML, Core V3, Core V4]
INFO: Registered: 232 schemas in 109ms
INFO: H2O started in 2230ms
INFO: 
INFO: Open H2O Flow in your web browser: http://192.168.103.95:54323
INFO: 
INFO: GET /, parms: {}
INFO: GET /flow/index.html, parms: {}
INFO: GET /flow/fonts/fontawesome-webfont.woff, parms: {v=4.2.0}
INFO: GET /flow/index.html, parms: {}
INFO: GET /3/Metadata/endpoints, parms: {}
INFO: Locking cloud to new members, because hex.schemas.XGBoostV3
INFO: GET /3/NodePersistentStorage/notebook, parms: {}
INFO: GET /3/NodePersistentStorage/categories/environment/names/clips/exists, parms: {}
INFO: GET /3/About, parms: {}
INFO: GET /flow/help/catalog.json, parms: {}
INFO: GET /3/ModelBuilders, parms: {}

@jakubhava
Contributor

Good to know that H2O works. The next step is to try Sparkling Water in local mode.

Run export MASTER=local and see whether the IP can be found or not (it's always easier to work with the simplest case where the issue can be reproduced).

In Sparkling Water we use the IP address provided by Spark, which in some environments is not what we want. We can override this with a network and mask for the h2o client and for the h2o nodes. More info is here: https://github.com/h2oai/sparkling-water/blob/master/doc/configuration/configuration_properties.rst. Look for spark.ext.h2o.client.network.mask for the client node (h2o running on the Spark driver) and spark.ext.h2o.node.network.mask for the h2o nodes in Spark executors. So if the issue remains, I would try to start the H2OContext like this, which sets these two configuration properties:

val conf = new H2OConf(sparkSession)
  .setClientNetworkMask("mask_for_h2o_client")
  .setNodeNetworkMask("mask_for_h2o_nodes")
val h2oContext = H2OContext.getOrCreate(sparkSession, conf)
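
For example (the CIDR values below are hypothetical - substitute the subnets your driver and executor machines actually sit on):

val conf = new H2OConf(sparkSession)
  .setClientNetworkMask("192.168.0.0/16") // hypothetical subnet of the Spark driver
  .setNodeNetworkMask("10.0.0.0/16")      // hypothetical subnet of the Spark executors
val h2oContext = H2OContext.getOrCreate(sparkSession, conf)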

Please let me know if you need more help configuring this or if the issue remains.
thanks!
Kuba

@renegademonkey
Author

trying it with the new conf settings leads to several repeated errors:

WARN TransportChannelHandler: Exception in connection from /xxx.xxx.241.183:37564
java.io.IOException: Connection reset by peer

ERROR LiveListenerBus: Listener ExecutorAddNotSupportedListener threw an exception
java.lang.IllegalArgumentException: Executor without H2O instance discovered, killing the cloud!

WARN TaskSetManager: Lost task 3.2 in stage 1.0 (TID 103, xxx.xxx.241.186, executor 9): java.lang.AssertionError: assertion failed: SpreadRDD failure - IPs are not equal: (3,xxx.xxx.241.184,-1) != (9, xxx.xxx.241.186)

@renegademonkey
Author

renegademonkey commented Nov 17, 2017

running sparkling water locally with a local spark cluster seems to start up fine - but no h2o commands are available in the sparkling-water shell.

@renegademonkey
Author

I'm still having trouble with sparkling water on our cluster, though I'm now not sure if this is still related to the IP address not found on this machine problem.

Since I thought it might be a firewall issue, I switched my driver to a different network switch. My spark connection still terminates before starting an H2O context, but now the only non-INFO log message is

Jar check fails; my hash=[-112, -97, 98, 77, -15, 10, 25, -47, -14, 0, 19, -4, 36, -123, -32, -48]
Jar check fails; received hash=[-50, -93, -29, -52, 36, -79, 79, -124, 6, 69, -97, -21, -26, 1, -116, 65]

Any ideas what might be going on here?

@jakubhava
Contributor

Are you starting Sparkling Water with the internal backend? (It's the default, and you would definitely know if you were using the other one - the external backend.)

This message says that there is a collision in jar hashes - you may have several H2O clusters with the same name started, or your cluster may be using different sparkling-water jar artefacts.

Can you please verify that you don't have extra clusters running and that your Spark driver and executors are using the exact same Sparkling Water jar?

@renegademonkey
Author

renegademonkey commented Nov 21, 2017

OK, the jar mismatch was the issue - I was simultaneously testing sparkling water via rsparkling on the same cluster (which is having the same connection issues), hence the clash.

Now I'm back to my original error, IP address not found on this machine :-/

@renegademonkey
Author

It seems like it could be a communication issue between the spark master / executors / driver when starting the h2o context. But this is strange, as I can run regular spark jobs on this cluster without any communication issues.

@renegademonkey
Author

renegademonkey commented Nov 28, 2017

OK, after days of debugging, I've figured out that the IP address not found error was arising because the h2o InetAddress.isReachable call was returning false with the default timeout on my system. I managed to get it to return true by manually setting the timeout to a higher value in scala:

System.setProperty("sys.ai.h2o.network.ip.ping.timeout", "6000")
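
Concretely, the call has to run on the driver before H2OContext.getOrCreate, so that H2O's interface probing picks up the longer timeout. A minimal sketch (the object name and master URL are placeholders, and System.setProperty only sets the property in the driver JVM):

import org.apache.spark.sql.SparkSession
import org.apache.spark.h2o._

object PingTimeoutFix {  // hypothetical object name
  def main(args: Array[String]): Unit = {
    // Set the ping timeout before any H2O code runs, so that H2O's
    // InetAddress.isReachable probing uses the longer 6-second timeout.
    System.setProperty("sys.ai.h2o.network.ip.ping.timeout", "6000")

    val sparkSession = SparkSession
      .builder()
      .master("spark://cluster.address:7077") // placeholder master URL as above
      .appName("sparklingWaterH2o")
      .getOrCreate()

    // By the time H2O probes the network interfaces, the property is in place.
    val h2oContext = H2OContext.getOrCreate(sparkSession)
    println(h2oContext)
  }
}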

And it works in Scala! However, I now need to set the same value for sparkling water in R - how do I set this property in the config file when connecting in R via rsparkling?

@jakubhava
Contributor

jakubhava commented Nov 28, 2017

Amazing @renegademonkey, thank you very much for the time you put into this and the debugging. We will make sure this is easily configurable in sparkling water.

For now, you can use spark.executor.extraJavaOptions=-Dsys.ai.h2o.network.ip.ping.timeout=6000 and spark.driver.extraJavaOptions=-Dsys.ai.h2o.network.ip.ping.timeout=6000. These two properties ensure that the value is available as a system property on the driver and on the executor nodes. More info on how to tune sparklyr is at https://spark.rstudio.com/deployment/.
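
For a spark-submit launch, the same two settings can be passed like this (a sketch; the class name and jar path are taken from the earlier post):

spark-submit \
  --class com.me.app.sparklingWaterH2o \
  --conf "spark.driver.extraJavaOptions=-Dsys.ai.h2o.network.ip.ping.timeout=6000" \
  --conf "spark.executor.extraJavaOptions=-Dsys.ai.h2o.network.ip.ping.timeout=6000" \
  /path/to/fat/jar/app-1.0-SNAPSHOT-jar-with-dependencies.jar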

Kuba

@renegademonkey
Author

renegademonkey commented Nov 28, 2017

Hi @jakubhava - thanks once again for replying so quickly. I had discovered those settings a little while ago and have been trying different combinations of them for the past hour, but in R I still get IP address not found on this machine :-(

My configs have been in a .yml file rather than in a spark config file created in R, i.e. with the following entries:

spark.executor.extraJavaOptions: -Dsys.ai.h2o.network.ip.ping.timeout=6000 
spark.driver.extraJavaOptions: -Dsys.ai.h2o.network.ip.ping.timeout=6000

I have tried this with and without quotes around the -D settings. I imagine a yml file shouldn't make a difference - other settings specified in the yml always work.

@renegademonkey
Author

aaaaaand also just tried setting it in an R script config file - same error.

@renegademonkey
Author

renegademonkey commented Dec 5, 2017

@jakubhava, any further suggestions on what to try / how to troubleshoot this?

@jakubhava
Contributor

@renegademonkey, sorry, I haven't had time to look at the RSparkling side of the problem yet. The good news is that it works using the Scala API. Have you figured it out in RSparkling in the meantime? Can you please have a look here: sparklyr/sparklyr#895? It's not the same issue, but a very similar one related to setting a system property.

Thanks,
Kuba

@jakubhava
Contributor

Hi @renegademonkey, I will close this issue for now, but feel free to comment/reopen. Please also have a look at the latest comment there, where they set the system property from sparklyr.

@renegademonkey
Author

So for some reason it is all working again in R after the Christmas break, with no real config changes that I recall making. Not sure I will ever get to the bottom of this one... thanks for your help again @jakubhava!
