H2OContext.getOrCreate() error on CDH #1446
Comments
@GlockGao What exactly is the issue? There are no signs of errors in the logs.
Actually, the code pasted here is just a test - start the session and then stop it.
19/08/16 07:11:38 INFO spark.ExecutorAllocationManager: New executor 20 has registered (new total is 9)
Please run the code without
and send us the YARN logs.
Thanks. Attached. The command is: "spark2-submit pysparkling_test.py"
I see in the logs. What is the full code you are trying to run, please? Can you please point me to the exact line in your code and the failure?
Hi Jakubhava, Thanks very much for your help. I did several tests. (2) BTW, the line with the error is "hc = H2OContext.getOrCreate(spark, conf=h2oConf)". The code is below (copied from the GitHub examples):
from pysparkling import *
import h2o
spark = SparkSession.builder.appName('SparklingWaterApp').getOrCreate()
spark_conf = spark.sparkContext._conf.getAll()
h2oConf = H2OConf(spark)
hc = H2OContext.getOrCreate(spark, conf=h2oConf)
frame = h2o.import_file('prostate.csv')
h2o.cluster().shutdown()
Hi Jakubhava, Another question: if we want to run Sparkling Water (we use Python) on CDH (client or cluster mode), should we install the h2o_pysparkling_<version> package on all the cluster worker nodes? Thanks very much! Cheers
I see, thanks for the explanation! Regarding the second question -> if you start Spark and add the pysparkling package, Kuba
Hi, Thanks very much for your help. BTW, it still fails on the line "hc = H2OContext.getOrCreate(spark, conf=h2oConf)". Cheers,
Good to hear! Can you please also apply
Hi, Thanks very much! Sorry, I have no resources to run the Spark job now. Cheers,
Hi, I'm not sure whether it is configured properly. All the worker nodes' IPs are in 10.156.4.*/24, so I set 'spark.ext.h2o.node.network.mask=10.156.4.0/24', but encountered the same error. After that, I changed the configuration a little, but got a different error. The full trace is below (sorry about the many lines):
Hi Jakubhava, Thanks. I have a question about resource allocation. I already copied/pasted the code before; it actually just imports a simple file, 'prostate.csv'. The job has been running for more than 30 minutes (and has not finished yet). The resource usage is huge: more than 1000 vcores and 2 TB of memory. Does Sparkling Water or native Spark have a configuration that limits resource usage? And based on my understanding, once the job is preempted by another job, the whole process fails? Because the next time I re-submitted the job, I also encountered the exception: Cheers,
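As a hedged sketch (not from the thread itself): one common way to cap the footprint is to turn dynamic allocation off and fix the executor resources explicitly, instead of letting Spark keep requesting executors while tasks are backlogged. All values below are illustrative and the code needs a Spark environment to run:

```python
from pyspark.sql import SparkSession

# Cap the job's footprint explicitly; the exact numbers depend on
# the cluster and workload and are only placeholders here.
spark = SparkSession.builder \
    .appName('SparklingWaterApp') \
    .config('spark.dynamicAllocation.enabled', 'false') \
    .config('spark.executor.instances', '4') \
    .config('spark.executor.cores', '2') \
    .config('spark.executor.memory', '4g') \
    .getOrCreate()
```

With dynamic allocation disabled, YARN grants at most the requested executors rather than scaling toward the task backlog seen in the log above.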
@GlockGao thanks, we have some progress! Could you please send us the complete YARN logs? From the last message I can see that the timeout for starting the worker nodes has been reached; in the full logs we should be able to see what prevented the worker nodes from starting. Thank you! Kuba
Hi Kuba, I checked the log; it seems to be an OOM issue. The exception is below: I hope we can run pysparkling with limited resource usage :) Cheers,
Hi Kuba, Thanks very much in advance! In the meantime, I have two other questions: they are similar dependency-distribution issues.
If the solution is available, I'm not sure how to zip the packages. I listed the information; can you help check?
Cheers,
Yup, it's OOM -> Regarding the distribution: if you use
Hi Kuba, For the distribution issue, maybe I didn't find the right way to do it.
Log Upload Time: Fri Aug 23 15:29:22 +1000 2019
Log Length: 164
Traceback (most recent call last):
Cheers,
Please give it a try with the official pysparkling zip file downloaded from here: https://www.h2o.ai/download/#sparkling-water (instructions are on the download page on the PySparkling tab). I can see that yours does not have the right name, so I assume it was either renamed or built manually, which is always a bit harder to support. Also, you can mimic what we do in our pysparkling launcher (adding the zip file to the Python path):
We have seen issues on older versions where the Python path configuration was necessary.
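What the launcher does can be sketched roughly like this; the zip path is illustrative and depends on where the downloaded distribution was extracted:

```python
import sys

# Illustrative path; the real zip ships inside the downloaded
# sparkling-water distribution.
pysparkling_zip = '/path/to/h2o_pysparkling_2.3-3.26.2-2.3.zip'

# Prepend the zip so Python resolves the `pysparkling` package from
# it, mimicking what the pysparkling launcher does.
if pysparkling_zip not in sys.path:
    sys.path.insert(0, pysparkling_zip)

# from pysparkling import H2OConf, H2OContext  # would now resolve from the zip
```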
Hi, I downloaded sparkling-water-3.26.2-2.3.zip from https://www.h2o.ai/download/#sparkling-water. I got the same errors :(
with the following commands:
Hi, Thanks very much. I unzipped 'h2o_pysparkling_2.3-3.26.2-2.3.zip' from 'sparkling-water-3.26.2-2.3.zip' and executed the following command: It still fails, but with a different error: Do you have any ideas? Cheers,
@GlockGao Yes, previously you were putting the wrong file on --py-files. You need to extract the pysparkling zip file from the distribution. The last exception just says that you don't have the Python dependencies for PySparkling installed. The dependencies have to be installed on each node of the cluster as well, or passed to --py-files. See the list of dependencies in the PySparkling docs: http://docs.h2o.ai/sparkling-water/2.4/latest-stable/doc/pysparkling.html
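One way to avoid installing the dependencies on every node is to ship them with the job. A hedged sketch, assuming the dependencies have been packaged as zips (all paths and archive names below are illustrative placeholders; the actual dependency list is in the PySparkling docs linked above):

```python
from pyspark.sql import SparkSession

# Ship the pysparkling zip and its Python dependency archives with
# the job via spark.submit.pyFiles (the programmatic equivalent of
# --py-files on spark-submit). Paths are hypothetical.
deps = ','.join([
    '/path/to/h2o_pysparkling_2.3-3.26.2-2.3.zip',  # pysparkling itself
    '/path/to/dependency1.zip',  # placeholder dependency archives
    '/path/to/dependency2.zip',
])

spark = SparkSession.builder \
    .appName('SparklingWaterApp') \
    .config('spark.submit.pyFiles', deps) \
    .getOrCreate()
```

Spark distributes the listed archives to each executor and puts them on the workers' Python path, so nothing needs to be pre-installed on the nodes.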
Hi Kuba, Thanks very much for your support!! Finally I can run Sparkling Water in local mode on the edge node (even after modifying the source code of the 'future' package). But there are still problems in client or cluster mode.
Actually, I ran the Spark job on the same cluster; why is the H2O worker count (cluster size) sometimes 14 and sometimes 21? Currently I just used a very small dataset; can I configure the worker number? My configuration code is below: Once I configured the last line, it raised the exception: h2o.exceptions.H2OServerError: Cluster cannot reach consensus
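A hedged sketch of one way to make the cluster size deterministic: with the internal backend, H2O workers run inside Spark executors, so with dynamic allocation enabled (which the log above flags as an unsupported option) the worker count can vary between runs. Disabling it and pinning the executor count should pin the H2O cluster size; the numbers are illustrative and a Spark environment is required:

```python
from pyspark.sql import SparkSession
from pysparkling import H2OConf, H2OContext

# Fix the executor count so the internal-backend H2O cluster size
# does not vary between runs (numbers are illustrative).
spark = SparkSession.builder \
    .appName('SparklingWaterApp') \
    .config('spark.dynamicAllocation.enabled', 'false') \
    .config('spark.executor.instances', '3') \
    .getOrCreate()

hc = H2OContext.getOrCreate(spark, conf=H2OConf(spark))
```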
Hi Kuba, After the configuration: I can run the Sparkling Water job in client mode.
But during the context close process, it still raises some exceptions:
Cheers,
Hi Kuba, Thanks very much for your support. Cheers
When trying to start H2OContext on CDSW (Cloudera Data Science Workbench).
The command is: spark2-submit pysparkling_test.py
I got the following error; it repeatedly prints the logs:
19/08/16 06:57:32 INFO spark.SparkContext: Added JAR /home/cdsw/.local/lib/python2.7/site-packages/sparkling_water/sparkling_water_assembly.jar at spark://10.156.4.64:24583/jars/sparkling_water_assembly.jar with timestamp 1565938652817
19/08/16 06:57:32 WARN internal.InternalH2OBackend: Increasing 'spark.locality.wait' to value 0 (Infinitive) as we need to ensure we run on the nodes with H2O
19/08/16 06:57:32 WARN internal.InternalBackendUtils: Unsupported options spark.dynamicAllocation.enabled detected!
19/08/16 06:57:32 INFO internal.InternalH2OBackend: Starting H2O services: Sparkling Water configuration:
backend cluster mode : internal
workers : None
cloudName : sparkling-water-cdsw_application_1563277926760_191096
clientBasePort : 54321
nodeBasePort : 54321
cloudTimeout : 60000
h2oNodeLog : INFO
h2oClientLog : INFO
nthreads : -1
drddMulFactor : 10
19/08/16 06:57:33 INFO spark.SparkContext: Starting job: collect at SpreadRDDBuilder.scala:62
19/08/16 06:57:33 INFO scheduler.DAGScheduler: Registering RDD 2 (distinct at SpreadRDDBuilder.scala:62)
19/08/16 06:57:33 INFO scheduler.DAGScheduler: Got job 0 (collect at SpreadRDDBuilder.scala:62) with 201 output partitions
19/08/16 06:57:33 INFO scheduler.DAGScheduler: Final stage: ResultStage 1 (collect at SpreadRDDBuilder.scala:62)
19/08/16 06:57:33 INFO scheduler.DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
19/08/16 06:57:33 INFO scheduler.DAGScheduler: Missing parents: List(ShuffleMapStage 0)
19/08/16 06:57:33 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[2] at distinct at SpreadRDDBuilder.scala:62), which has no missing parents
19/08/16 06:57:33 INFO memory.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 9.3 KB, free 7.6 GB)
19/08/16 06:57:33 INFO memory.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 4.9 KB, free 7.6 GB)
19/08/16 06:57:33 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.156.4.64:21432 (size: 4.9 KB, free: 7.6 GB)
19/08/16 06:57:33 INFO spark.SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1039
19/08/16 06:57:33 INFO scheduler.DAGScheduler: Submitting 201 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[2] at distinct at SpreadRDDBuilder.scala:62) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
19/08/16 06:57:33 INFO cluster.YarnScheduler: Adding task set 0.0 with 201 tasks
19/08/16 06:57:34 INFO spark.ExecutorAllocationManager: Requesting 1 new executor because tasks are backlogged (new desired total will be 1)
19/08/16 06:57:35 INFO spark.ExecutorAllocationManager: Requesting 2 new executors because tasks are backlogged (new desired total will be 3)
19/08/16 06:57:36 INFO spark.ExecutorAllocationManager: Requesting 4 new executors because tasks are backlogged (new desired total will be 7)
19/08/16 06:57:37 INFO spark.ExecutorAllocationManager: Requesting 8 new executors because tasks are backlogged (new desired total will be 15)
19/08/16 06:57:38 INFO spark.ExecutorAllocationManager: Requesting 16 new executors because tasks are backlogged (new desired total will be 31)
19/08/16 06:57:39 INFO spark.ExecutorAllocationManager: Requesting 32 new executors because tasks are backlogged (new desired total will be 63)
19/08/16 06:57:40 INFO spark.ExecutorAllocationManager: Requesting 64 new executors because tasks are backlogged (new desired total will be 127)
19/08/16 06:57:40 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (100.66.128.0:58270) with ID 2
19/08/16 06:57:40 INFO spark.ExecutorAllocationManager: New executor 2 has registered (new total is 1)
19/08/16 06:57:40 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, aup7964s.unix.anz, executor 2, partition 0, PROCESS_LOCAL, 7743 bytes)
19/08/16 06:57:40 INFO storage.BlockManagerMasterEndpoint: Registering block manager aup7964s.unix.anz:33662 with 366.3 MB RAM, BlockManagerId(2, aup7964s.unix.anz, 33662, None)
19/08/16 06:57:40 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (100.66.128.0:58280) with ID 4
19/08/16 06:57:40 INFO spark.ExecutorAllocationManager: New executor 4 has registered (new total is 2)
19/08/16 06:57:40 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, aup7964s.unix.anz, executor 4, partition 1, PROCESS_LOCAL, 7743 bytes)
19/08/16 06:57:40 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (100.66.128.0:58282) with ID 14
19/08/16 06:57:40 INFO spark.ExecutorAllocationManager: New executor 14 has registered (new total is 3)
Once I switch to the command: spark2-submit --master local[2] pysparkling_test.py
the errors are below:
19/08/16 06:59:53 INFO spark.SparkContext: Added JAR /home/cdsw/.local/lib/python2.7/site-packages/sparkling_water/sparkling_water_assembly.jar at spark://10.156.4.64:24583/jars/sparkling_water_assembly.jar with timestamp 1565938793481
19/08/16 06:59:53 WARN internal.InternalH2OBackend: Increasing 'spark.locality.wait' to value 0 (Infinitive) as we need to ensure we run on the nodes with H2O
19/08/16 06:59:53 WARN internal.InternalBackendUtils: Unsupported options spark.dynamicAllocation.enabled detected!
19/08/16 06:59:53 INFO internal.InternalH2OBackend: Starting H2O services: Sparkling Water configuration:
backend cluster mode : internal
workers : None
cloudName : sparkling-water-cdsw_local-1565938791492
clientBasePort : 54321
nodeBasePort : 54321
cloudTimeout : 60000
h2oNodeLog : INFO
h2oClientLog : INFO
nthreads : -1
drddMulFactor : 10
19/08/16 06:59:53 INFO java.NativeLibrary: Loaded library from lib/linux_64/libxgboost4j_gpu.so (/tmp/libxgboost4j_gpu392673508027643109.so)
Sparkling Water version: 3.26.2-2.3
Spark version: 2.3.0.cloudera3
Integrated H2O version: 3.26.0.2
The following Spark configuration is used:
(spark.eventLog.enabled,true)
(spark.app.name,SparklingWaterApp)
(spark.scheduler.minRegisteredResourcesRatio,1)
(spark.ext.h2o.cloud.name,sparkling-water-cdsw_local-1565938791492)
(spark.driver.memory,14976m)
(spark.yarn.jars,local:/app/hadoop/parcels/SPARK2-2.3.0.cloudera3-1.cdh5.13.3.p0.458809/lib/spark2/jars/)
(spark.eventLog.dir,hdfs://nameservice1/user/spark/spark2ApplicationHistory)
(spark.ui.killEnabled,true)
(spark.yarn.appMasterEnv.PYSPARK_PYTHON,/app/hadoop/parcels/Anaconda-4.3.1/bin/python)
(spark.ui.port,20049)
(spark.driver.bindAddress,100.66.128.10)
(spark.dynamicAllocation.executorIdleTimeout,60)
(spark.serializer,org.apache.spark.serializer.KryoSerializer)
(spark.ext.h2o.client.log.dir,/home/cdsw/h2ologs/local-1565938791492)
(spark.io.encryption.enabled,false)
(spark.yarn.am.extraLibraryPath,/app/hadoop/parcels/CDH-5.13.3-1.cdh5.13.3.p3486.3704/lib/hadoop/lib/native:/app/hadoop/parcels/GPLEXTRAS-5.13.3-1.cdh5.13.3.p0.2/lib/hadoop/lib/native)
(spark.authenticate,false)
(spark.sql.hive.metastore.jars,${env:HADOOP_COMMON_HOME}/../hive/lib/:${env:HADOOP_COMMON_HOME}/client/*)
(spark.lineage.log.dir,/var/log/spark2/lineage)
(spark.app.id,local-1565938791492)
(spark.serializer.objectStreamReset,100)
(spark.locality.wait,0)
(spark.submit.deployMode,client)
(spark.sql.autoBroadcastJoinThreshold,-1)
(spark.yarn.historyServer.address,http://aup7727s.unix.anz:18089)
(spark.network.crypto.enabled,false)
(spark.dynamicAllocation,false)
(spark.lineage.enabled,false)
(spark.shuffle.service.enabled,true)
(spark.hadoop.hadoop.treat.subject.external,true)
(spark.executor.id,driver)
(spark.dynamicAllocation.schedulerBacklogTimeout,1)
(spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON,/app/hadoop/parcels/Anaconda-4.3.1/bin/python)
(spark.shuffle.service.port,7337)
(spark.sql.hive.metastore.version,1.1.0)
(spark.ext.h2o.fail.on.unsupported.spark.param,false)
(spark.yarn.rmProxy.enabled,false)
(spark.sql.warehouse.dir,/user/hive/warehouse)
(spark.ext.h2o.client.ip,10.156.4.64)
(spark.sql.catalogImplementation,hive)
(spark.rdd.compress,True)
(spark.executor.extraLibraryPath,/app/hadoop/parcels/CDH-5.13.3-1.cdh5.13.3.p3486.3704/lib/hadoop/lib/native:/app/hadoop/parcels/GPLEXTRAS-5.13.3-1.cdh5.13.3.p0.2/lib/hadoop/lib/native)
(spark.yarn.config.gatewayPath,/app/hadoop/parcels)
(spark.ui.enabled,false)
(spark.dynamicAllocation.minExecutors,0)
(spark.yarn.config.replacementPath,{{HADOOP_COMMON_HOME}}/../../..)
(spark.dynamicAllocation.enabled,true)
(spark.driver.extraLibraryPath,/app/hadoop/parcels/CDH-5.13.3-1.cdh5.13.3.p3486.3704/lib/hadoop/lib/native:/app/hadoop/parcels/GPLEXTRAS-5.13.3-1.cdh5.13.3.p0.2/lib/hadoop/lib/native)
(spark.files,file:/home/cdsw/pysparkling_test.py)
(spark.driver.blockManager.port,21432)
(spark.master,local[2])
(spark.driver.port,24583)
(spark.driver.host,10.156.4.64)
----- H2O started -----
Build git branch: rel-yau
Build git hash: 4854053b2e1773e6df02e04895709f692ebf7088
Build git describe: jenkins-3.26.0.1-71-g4854053
Build project version: 3.26.0.2
Build age: 20 days
Built by: 'jenkins'
Built on: '2019-07-26 23:05:58'
Found H2O Core extensions: [HiveTableImporter, StackTraceCollector, Watchdog, XGBoost]
Processed H2O arguments: [-name, sparkling-water-cdsw_local-1565938791492, -port_offset, 1, -quiet, -log_level, INFO, -log_dir, /home/cdsw/h2ologs/local-1565938791492, -baseport, 54321, -ip, 10.156.4.64, -flatfile, /tmp/1565938793606-0/flatfile.txt]
Java availableProcessors: 64
Java heap totalMemory: 2.46 GB
Java heap maxMemory: 13.00 GB
Java version: Java 1.8.0_111 (from Oracle Corporation)
JVM launch parameters: [-Xmx14976m]
OS version: Linux 3.10.0-862.25.3.el7.x86_64 (amd64)
Machine physical memory: 251.62 GB
Machine locale: en_US
X-h2o-cluster-id: 1565938793549
User name: 'cdsw'
IPv6 stack selected: false
Possible IP Address: eth0 (eth0), 100.66.128.10
Possible IP Address: lo (lo), 127.0.0.1
IP address not found on this machine
19/08/16 06:59:54 INFO spark.SparkContext: Invoking stop() from shutdown hook
19/08/16 06:59:54 INFO server.AbstractConnector: Stopped Spark@2b2add4a{HTTP/1.1,[http/1.1]}{0.0.0.0:20049}
19/08/16 06:59:54 INFO ui.SparkUI: Stopped Spark web UI at http://10.156.4.64:20049
19/08/16 06:59:54 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
19/08/16 06:59:54 INFO memory.MemoryStore: MemoryStore cleared
19/08/16 06:59:54 INFO storage.BlockManager: BlockManager stopped
19/08/16 06:59:54 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
19/08/16 06:59:54 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
19/08/16 06:59:54 INFO spark.SparkContext: Successfully stopped SparkContext
19/08/16 06:59:54 INFO util.ShutdownHookManager: Shutdown hook called
19/08/16 06:59:54 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-ff331aa3-bb6b-474c-80a9-7b887e278c1d
19/08/16 06:59:54 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-ff331aa3-bb6b-474c-80a9-7b887e278c1d/pyspark-4ad91a7c-a673-419e-afa6-d292234f630d
19/08/16 06:59:54 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-83099783-3b75-4c01-a6e9-71db2cd82014
Providing us with the observed and expected behavior definitely helps. Providing the following information also helps:
Sparkling Water/PySparkling/RSparkling version: h2o_pysparkling_2.3
Hadoop Version & Distribution: CDH
Execution mode (YARN-client, YARN-cluster, standalone, local ..): YARN-client
Please also provide us with the full and minimal reproducible code.
from pyspark.sql import SparkSession
from pysparkling import *
import h2o
from h2o.estimators.xgboost import *

spark = SparkSession \
    .builder \
    .appName('SparklingWaterApp') \
    .getOrCreate()

h2oConf = H2OConf(spark) \
    .set('spark.ui.enabled', 'false') \
    .set('spark.ext.h2o.fail.on.unsupported.spark.param', 'false') \
    .set('spark.dynamicAllocation', 'false') \
    .set('spark.scheduler.minRegisteredResourcesRatio', '1') \
    .set('spark.sql.autoBroadcastJoinThreshold', '-1') \
    .set('spark.locality.wait', '0')

hc = H2OContext.getOrCreate(spark, conf=h2oConf)
h2o.cluster().shutdown()
spark.stop()