
H2OContext.getOrCreate() error on CDH #1446

Closed
GlockGao opened this issue Aug 16, 2019 · 25 comments

Comments

@GlockGao

  • When trying to start H2OContext on CDSW (Cloudera Data Science Workbench).
    The command is: spark2-submit pysparkling_test.py
    I got the following error: it repeatedly prints logs such as these:
    19/08/16 06:57:32 INFO spark.SparkContext: Added JAR /home/cdsw/.local/lib/python2.7/site-packages/sparkling_water/sparkling_water_assembly.jar at spark://10.156.4.64:24583/jars/sparkling_water_assembly.jar with timestamp 1565938652817
    19/08/16 06:57:32 WARN internal.InternalH2OBackend: Increasing 'spark.locality.wait' to value 0 (Infinitive) as we need to ensure we run on the nodes with H2O
    19/08/16 06:57:32 WARN internal.InternalBackendUtils: Unsupported options spark.dynamicAllocation.enabled detected!
    19/08/16 06:57:32 INFO internal.InternalH2OBackend: Starting H2O services: Sparkling Water configuration:
    backend cluster mode : internal
    workers : None
    cloudName : sparkling-water-cdsw_application_1563277926760_191096
    clientBasePort : 54321
    nodeBasePort : 54321
    cloudTimeout : 60000
    h2oNodeLog : INFO
    h2oClientLog : INFO
    nthreads : -1
    drddMulFactor : 10
    19/08/16 06:57:33 INFO spark.SparkContext: Starting job: collect at SpreadRDDBuilder.scala:62
    19/08/16 06:57:33 INFO scheduler.DAGScheduler: Registering RDD 2 (distinct at SpreadRDDBuilder.scala:62)
    19/08/16 06:57:33 INFO scheduler.DAGScheduler: Got job 0 (collect at SpreadRDDBuilder.scala:62) with 201 output partitions
    19/08/16 06:57:33 INFO scheduler.DAGScheduler: Final stage: ResultStage 1 (collect at SpreadRDDBuilder.scala:62)
    19/08/16 06:57:33 INFO scheduler.DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
    19/08/16 06:57:33 INFO scheduler.DAGScheduler: Missing parents: List(ShuffleMapStage 0)
    19/08/16 06:57:33 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[2] at distinct at SpreadRDDBuilder.scala:62), which has no missing parents
    19/08/16 06:57:33 INFO memory.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 9.3 KB, free 7.6 GB)
    19/08/16 06:57:33 INFO memory.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 4.9 KB, free 7.6 GB)
    19/08/16 06:57:33 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.156.4.64:21432 (size: 4.9 KB, free: 7.6 GB)
    19/08/16 06:57:33 INFO spark.SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1039
    19/08/16 06:57:33 INFO scheduler.DAGScheduler: Submitting 201 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[2] at distinct at SpreadRDDBuilder.scala:62) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
    19/08/16 06:57:33 INFO cluster.YarnScheduler: Adding task set 0.0 with 201 tasks
    19/08/16 06:57:34 INFO spark.ExecutorAllocationManager: Requesting 1 new executor because tasks are backlogged (new desired total will be 1)
    19/08/16 06:57:35 INFO spark.ExecutorAllocationManager: Requesting 2 new executors because tasks are backlogged (new desired total will be 3)
    19/08/16 06:57:36 INFO spark.ExecutorAllocationManager: Requesting 4 new executors because tasks are backlogged (new desired total will be 7)
    19/08/16 06:57:37 INFO spark.ExecutorAllocationManager: Requesting 8 new executors because tasks are backlogged (new desired total will be 15)
    19/08/16 06:57:38 INFO spark.ExecutorAllocationManager: Requesting 16 new executors because tasks are backlogged (new desired total will be 31)
    19/08/16 06:57:39 INFO spark.ExecutorAllocationManager: Requesting 32 new executors because tasks are backlogged (new desired total will be 63)
    19/08/16 06:57:40 INFO spark.ExecutorAllocationManager: Requesting 64 new executors because tasks are backlogged (new desired total will be 127)
    19/08/16 06:57:40 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (100.66.128.0:58270) with ID 2
    19/08/16 06:57:40 INFO spark.ExecutorAllocationManager: New executor 2 has registered (new total is 1)
    19/08/16 06:57:40 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, aup7964s.unix.anz, executor 2, partition 0, PROCESS_LOCAL, 7743 bytes)
    19/08/16 06:57:40 INFO storage.BlockManagerMasterEndpoint: Registering block manager aup7964s.unix.anz:33662 with 366.3 MB RAM, BlockManagerId(2, aup7964s.unix.anz, 33662, None)
    19/08/16 06:57:40 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (100.66.128.0:58280) with ID 4
    19/08/16 06:57:40 INFO spark.ExecutorAllocationManager: New executor 4 has registered (new total is 2)
    19/08/16 06:57:40 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, aup7964s.unix.anz, executor 4, partition 1, PROCESS_LOCAL, 7743 bytes)
    19/08/16 06:57:40 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (100.66.128.0:58282) with ID 14
    19/08/16 06:57:40 INFO spark.ExecutorAllocationManager: New executor 14 has registered (new total is 3)

  • Once I switch to the command: spark2-submit --master local[2] pysparkling_test.py
    the errors are below:
    19/08/16 06:59:53 INFO spark.SparkContext: Added JAR /home/cdsw/.local/lib/python2.7/site-packages/sparkling_water/sparkling_water_assembly.jar at spark://10.156.4.64:24583/jars/sparkling_water_assembly.jar with timestamp 1565938793481
    19/08/16 06:59:53 WARN internal.InternalH2OBackend: Increasing 'spark.locality.wait' to value 0 (Infinitive) as we need to ensure we run on the nodes with H2O
    19/08/16 06:59:53 WARN internal.InternalBackendUtils: Unsupported options spark.dynamicAllocation.enabled detected!
    19/08/16 06:59:53 INFO internal.InternalH2OBackend: Starting H2O services: Sparkling Water configuration:
    backend cluster mode : internal
    workers : None
    cloudName : sparkling-water-cdsw_local-1565938791492
    clientBasePort : 54321
    nodeBasePort : 54321
    cloudTimeout : 60000
    h2oNodeLog : INFO
    h2oClientLog : INFO
    nthreads : -1
    drddMulFactor : 10
    19/08/16 06:59:53 INFO java.NativeLibrary: Loaded library from lib/linux_64/libxgboost4j_gpu.so (/tmp/libxgboost4j_gpu392673508027643109.so)
    Sparkling Water version: 3.26.2-2.3
    Spark version: 2.3.0.cloudera3
    Integrated H2O version: 3.26.0.2
    The following Spark configuration is used:
    (spark.eventLog.enabled,true)
    (spark.app.name,SparklingWaterApp)
    (spark.scheduler.minRegisteredResourcesRatio,1)
    (spark.ext.h2o.cloud.name,sparkling-water-cdsw_local-1565938791492)
    (spark.driver.memory,14976m)
    (spark.yarn.jars,local:/app/hadoop/parcels/SPARK2-2.3.0.cloudera3-1.cdh5.13.3.p0.458809/lib/spark2/jars/)
    (spark.eventLog.dir,hdfs://nameservice1/user/spark/spark2ApplicationHistory)
    (spark.ui.killEnabled,true)
    (spark.yarn.appMasterEnv.PYSPARK_PYTHON,/app/hadoop/parcels/Anaconda-4.3.1/bin/python)
    (spark.ui.port,20049)
    (spark.driver.bindAddress,100.66.128.10)
    (spark.dynamicAllocation.executorIdleTimeout,60)
    (spark.serializer,org.apache.spark.serializer.KryoSerializer)
    (spark.ext.h2o.client.log.dir,/home/cdsw/h2ologs/local-1565938791492)
    (spark.io.encryption.enabled,false)
    (spark.yarn.am.extraLibraryPath,/app/hadoop/parcels/CDH-5.13.3-1.cdh5.13.3.p3486.3704/lib/hadoop/lib/native:/app/hadoop/parcels/GPLEXTRAS-5.13.3-1.cdh5.13.3.p0.2/lib/hadoop/lib/native)
    (spark.authenticate,false)
    (spark.sql.hive.metastore.jars,${env:HADOOP_COMMON_HOME}/../hive/lib/:${env:HADOOP_COMMON_HOME}/client/*)
    (spark.lineage.log.dir,/var/log/spark2/lineage)
    (spark.app.id,local-1565938791492)
    (spark.serializer.objectStreamReset,100)
    (spark.locality.wait,0)
    (spark.submit.deployMode,client)
    (spark.sql.autoBroadcastJoinThreshold,-1)
    (spark.yarn.historyServer.address,http://aup7727s.unix.anz:18089)
    (spark.network.crypto.enabled,false)
    (spark.dynamicAllocation,false)
    (spark.lineage.enabled,false)
    (spark.shuffle.service.enabled,true)
    (spark.hadoop.hadoop.treat.subject.external,true)
    (spark.executor.id,driver)
    (spark.dynamicAllocation.schedulerBacklogTimeout,1)
    (spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON,/app/hadoop/parcels/Anaconda-4.3.1/bin/python)
    (spark.shuffle.service.port,7337)
    (spark.sql.hive.metastore.version,1.1.0)
    (spark.ext.h2o.fail.on.unsupported.spark.param,false)
    (spark.yarn.rmProxy.enabled,false)
    (spark.sql.warehouse.dir,/user/hive/warehouse)
    (spark.ext.h2o.client.ip,10.156.4.64)
    (spark.sql.catalogImplementation,hive)
    (spark.rdd.compress,True)
    (spark.executor.extraLibraryPath,/app/hadoop/parcels/CDH-5.13.3-1.cdh5.13.3.p3486.3704/lib/hadoop/lib/native:/app/hadoop/parcels/GPLEXTRAS-5.13.3-1.cdh5.13.3.p0.2/lib/hadoop/lib/native)
    (spark.yarn.config.gatewayPath,/app/hadoop/parcels)
    (spark.ui.enabled,false)
    (spark.dynamicAllocation.minExecutors,0)
    (spark.yarn.config.replacementPath,{{HADOOP_COMMON_HOME}}/../../..)
    (spark.dynamicAllocation.enabled,true)
    (spark.driver.extraLibraryPath,/app/hadoop/parcels/CDH-5.13.3-1.cdh5.13.3.p3486.3704/lib/hadoop/lib/native:/app/hadoop/parcels/GPLEXTRAS-5.13.3-1.cdh5.13.3.p0.2/lib/hadoop/lib/native)
    (spark.files,file:/home/cdsw/pysparkling_test.py)
    (spark.driver.blockManager.port,21432)
    (spark.master,local[2])
    (spark.driver.port,24583)
    (spark.driver.host,10.156.4.64)

----- H2O started -----
Build git branch: rel-yau
Build git hash: 4854053b2e1773e6df02e04895709f692ebf7088
Build git describe: jenkins-3.26.0.1-71-g4854053
Build project version: 3.26.0.2
Build age: 20 days
Built by: 'jenkins'
Built on: '2019-07-26 23:05:58'
Found H2O Core extensions: [HiveTableImporter, StackTraceCollector, Watchdog, XGBoost]
Processed H2O arguments: [-name, sparkling-water-cdsw_local-1565938791492, -port_offset, 1, -quiet, -log_level, INFO, -log_dir, /home/cdsw/h2ologs/local-1565938791492, -baseport, 54321, -ip, 10.156.4.64, -flatfile, /tmp/1565938793606-0/flatfile.txt]
Java availableProcessors: 64
Java heap totalMemory: 2.46 GB
Java heap maxMemory: 13.00 GB
Java version: Java 1.8.0_111 (from Oracle Corporation)
JVM launch parameters: [-Xmx14976m]
OS version: Linux 3.10.0-862.25.3.el7.x86_64 (amd64)
Machine physical memory: 251.62 GB
Machine locale: en_US
X-h2o-cluster-id: 1565938793549
User name: 'cdsw'
IPv6 stack selected: false
Possible IP Address: eth0 (eth0), 100.66.128.10
Possible IP Address: lo (lo), 127.0.0.1
IP address not found on this machine
19/08/16 06:59:54 INFO spark.SparkContext: Invoking stop() from shutdown hook
19/08/16 06:59:54 INFO server.AbstractConnector: Stopped Spark@2b2add4a{HTTP/1.1,[http/1.1]}{0.0.0.0:20049}
19/08/16 06:59:54 INFO ui.SparkUI: Stopped Spark web UI at http://10.156.4.64:20049
19/08/16 06:59:54 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
19/08/16 06:59:54 INFO memory.MemoryStore: MemoryStore cleared
19/08/16 06:59:54 INFO storage.BlockManager: BlockManager stopped
19/08/16 06:59:54 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
19/08/16 06:59:54 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
19/08/16 06:59:54 INFO spark.SparkContext: Successfully stopped SparkContext
19/08/16 06:59:54 INFO util.ShutdownHookManager: Shutdown hook called
19/08/16 06:59:54 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-ff331aa3-bb6b-474c-80a9-7b887e278c1d
19/08/16 06:59:54 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-ff331aa3-bb6b-474c-80a9-7b887e278c1d/pyspark-4ad91a7c-a673-419e-afa6-d292234f630d
19/08/16 06:59:54 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-83099783-3b75-4c01-a6e9-71db2cd82014

Providing us with the observed and expected behavior definitely helps. Giving us the following information also helps:

  • Sparkling Water/PySparkling/RSparkling version
    h2o_pysparkling_2.3

  • Hadoop Version & Distribution
    CDH

  • Execution mode YARN-client, YARN-cluster, standalone, local ..
    YARN-client

Please also provide us with the full and minimal reproducible code.
from pysparkling import *
from pyspark.sql import SparkSession
import h2o
from h2o.estimators.xgboost import *

spark = SparkSession \
    .builder \
    .appName('SparklingWaterApp') \
    .getOrCreate()

h2oConf = H2OConf(spark) \
    .set('spark.ui.enabled', 'false') \
    .set('spark.ext.h2o.fail.on.unsupported.spark.param', 'false') \
    .set('spark.dynamicAllocation', 'false') \
    .set('spark.scheduler.minRegisteredResourcesRatio', '1') \
    .set('spark.sql.autoBroadcastJoinThreshold', '-1') \
    .set('spark.locality.wait', '0')
hc = H2OContext.getOrCreate(spark, conf=h2oConf)

h2o.cluster().shutdown()
spark.stop()

@GlockGao GlockGao changed the title H2OContext.getOrCreate() error on Cloudera H2OContext.getOrCreate() error on CDH Aug 16, 2019
@jakubhava
Contributor

@GlockGao What exactly is the issue? There are no signs of errors in the logs.

@GlockGao
Author

Actually, the code pasted here is just a test - start a session and then stop it.
The real purpose is to use 'H2OXGBoost' for model building.

  1. For "spark2-submit --master local[2] pysparkling_test.py", there is an exception, "IP address not found on this machine", that causes the session to shut down.
  2. For "spark2-submit pysparkling_test.py", it keeps printing logs without stopping; H2OContext.getOrCreate(spark, conf=h2oConf) never returns.

19/08/16 07:11:38 INFO spark.ExecutorAllocationManager: New executor 20 has registered (new total is 9)
19/08/16 07:11:38 INFO scheduler.TaskSetManager: Starting task 77.0 in stage 0.0 (TID 77, aup7714s.unix.anz, executor 20, partition 77, PROCESS_LOCAL, 7747 bytes)
19/08/16 07:11:38 INFO scheduler.TaskSetManager: Starting task 78.0 in stage 0.0 (TID 78, aup7965s.unix.anz, executor 5, partition 78, PROCESS_LOCAL, 7747 bytes)
19/08/16 07:11:38 INFO scheduler.TaskSetManager: Finished task 14.0 in stage 0.0 (TID 14) in 974 ms on aup7965s.unix.anz (executor 5) (70/201)
19/08/16 07:11:38 INFO scheduler.TaskSetManager: Starting task 79.0 in stage 0.0 (TID 79, aup7957s.unix.anz, executor 1, partition 79, PROCESS_LOCAL, 7747 bytes)
19/08/16 07:11:38 INFO scheduler.TaskSetManager: Finished task 76.0 in stage 0.0 (TID 76) in 21 ms on aup7957s.unix.anz (executor 1) (71/201)
19/08/16 07:11:38 INFO scheduler.TaskSetManager: Starting task 80.0 in stage 0.0 (TID 80, aup7951s.unix.anz, executor 2, partition 80, PROCESS_LOCAL, 7747 bytes)
19/08/16 07:11:38 INFO scheduler.TaskSetManager: Finished task 73.0 in stage 0.0 (TID 73) in 48 ms on aup7951s.unix.anz (executor 2) (72/201)
19/08/16 07:11:38 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (100.66.128.0:34282) with ID 9
19/08/16 07:11:38 INFO spark.ExecutorAllocationManager: New executor 9 has registered (new total is 10)
19/08/16 07:11:38 INFO scheduler.TaskSetManager: Starting task 81.0 in stage 0.0 (TID 81, aup7714s.unix.anz, executor 9, partition 81, PROCESS_LOCAL, 7747 bytes)
19/08/16 07:11:38 INFO scheduler.TaskSetManager: Starting task 82.0 in stage 0.0 (TID 82, aup7951s.unix.anz, executor 4, partition 82, PROCESS_LOCAL, 7747 bytes)
19/08/16 07:11:38 INFO scheduler.TaskSetManager: Finished task 7.0 in stage 0.0 (TID 7) in 1125 ms on aup7951s.unix.anz (executor 4) (73/201)
19/08/16 07:11:38 INFO scheduler.TaskSetManager: Starting task 83.0 in stage 0.0 (TID 83, aup7957s.unix.anz, executor 1, partition 83, PROCESS_LOCAL, 7747 bytes)
19/08/16 07:11:38 INFO scheduler.TaskSetManager: Finished task 79.0 in stage 0.0 (TID 79) in 19 ms on aup7957s.unix.anz (executor 1) (74/201)
19/08/16 07:11:38 INFO scheduler.TaskSetManager: Starting task 84.0 in stage 0.0 (TID 84, aup7965s.unix.anz, executor 3, partition 84, PROCESS_LOCAL, 7747 bytes)
19/08/16 07:11:38 INFO scheduler.TaskSetManager: Finished task 16.0 in stage 0.0 (TID 16) in 984 ms on aup7965s.unix.anz (executor 3) (75/201)
19/08/16 07:11:38 INFO storage.BlockManagerMasterEndpoint: Registering block manager aup7714s.unix.anz:34805 with 366.3 MB RAM, BlockManagerId(6, aup7714s.unix.anz, 34805, None)
19/08/16 07:11:38 INFO scheduler.TaskSetManager: Starting task 85.0 in stage 0.0 (TID 85, aup7957s.unix.anz, executor 1, partition 85, PROCESS_LOCAL, 7747 bytes)
19/08/16 07:11:38 INFO scheduler.TaskSetManager: Finished task 83.0 in stage 0.0 (TID 83) in 19 ms on aup7957s.unix.anz (executor 1) (76/201)
19/08/16 07:11:38 INFO scheduler.TaskSetManager: Starting task 86.0 in stage 0.0 (TID 86, aup7951s.unix.anz, executor 2, partition 86, PROCESS_LOCAL, 7747 bytes)
19/08/16 07:11:38 INFO scheduler.TaskSetManager: Finished task 80.0 in stage 0.0 (TID 80) in 34 ms on aup7951s.unix.anz (executor 2) (77/201)

@jakubhava
Contributor

jakubhava commented Aug 16, 2019

Please run the code without

h2o.cluster().shutdown()
spark.stop()

and send us the YARN logs

@GlockGao
Author

Thanks.
sparkling.zip

Attached.

The command is: "spark2-submit pysparkling_test.py"

@jakubhava
Contributor

I see in the logs "19/08/16 17:20:18 INFO util.Utils: Successfully started service 'sparkling-water-h2o-start-618' on port 35266." and still no other mention of errors.

What is the full code you are trying to run, please? Can you point me to the exact line in your code and the failure?

@GlockGao
Author

GlockGao commented Aug 22, 2019

Hi Jakubhava,

Thanks very much for your help.
I have done some investigation into the issue. It is very similar to issue: Sparkling Water fails to create h2oContext in simple spark project #466

I did several tests:
(1) Downloaded just the standalone H2O from the following link - http://h2o-release.s3.amazonaws.com/h2o/rel-weierstrass/7/index.html - and started H2O successfully.

(2)
a. Used the command "spark-submit --master local[*] pysparkling_xgboost.py" to submit the job, or
b. exported MASTER=local and then used the command "spark2-submit pysparkling_xgboost.py".
Both give the same error; they raise the exception:
"Possible IP Address: eth0 (eth0), 100.66.32.5
Possible IP Address: lo (lo), 127.0.0.1
IP address not found on this machine"
which leads to the H2O context creation failure. I submitted the job from a container. The Spark deploy mode is 'client', the driver (the container's host machine) IP address is 10.156.4.67, and the container IP address is 100.66.32.5. According to the code (https://github.com/h2oai/h2o-3/blob/master/h2o-core/src/main/java/water/init/HostnameGuesser.java#L227), the H2O initialization process includes an IP address verification step, which causes the problem.
I am new to Spark and Sparkling Water, so I want to know how to solve the problem, or whether Sparkling Water supports such a deployment (later we will use OpenShift / Kubernetes to deploy the model, which also starts a container to run the H2O context)?

BTW, the failing line is "hc = H2OContext.getOrCreate(spark, conf=h2oConf)"

The code is below (copied from the GitHub examples):

from pysparkling import *
from pysparkling.ml import ColumnPruner, H2OGBM, H2ODeepLearning, H2OAutoML, H2OXGBoost
from pyspark.sql import SparkSession

import h2o
from h2o.estimators.xgboost import *

spark = SparkSession \
    .builder \
    .appName('SparklingWaterApp') \
    .getOrCreate()

spark_conf = spark.sparkContext._conf.getAll()

h2oConf = H2OConf(spark) \
    .set('spark.ui.enabled', 'false') \
    .set('spark.ext.h2o.fail.on.unsupported.spark.param', 'false') \
    .set('spark.dynamicAllocation.enabled', 'false') \
    .set('spark.scheduler.minRegisteredResourcesRatio', '1') \
    .set('spark.sql.autoBroadcastJoinThreshold', '-1') \
    .set('spark.locality.wait', '0') \
    .set('spark.executor.cores', '4') \
    .set('spark.executor.instances', '2')
hc = H2OContext.getOrCreate(spark, conf=h2oConf)

frame = h2o.import_file('prostate.csv')
sparkDF = hc.as_spark_frame(frame)
sparkDF = sparkDF.withColumn("CAPSULE", sparkDF.CAPSULE.cast("string"))
[trainingDF, testingDF] = sparkDF.randomSplit([0.8, 0.2])
estimator = H2OXGBoost(labelCol = "CAPSULE")
model = estimator.fit(trainingDF)
model.getModelDetails()
model.transform(testingDF).show(truncate = False)

h2o.cluster().shutdown()
spark.stop()

@GlockGao
Author

Hi Jakubhava,

Another question: if we want to run Sparkling Water (we use Python) on CDH (client or cluster mode), do we need to install the h2o_pysparkling package on all the cluster worker nodes?

Thanks very much!

Cheers

@jakubhava
Contributor

I see, thanks for the explanation!

Regarding the "IP not found" error: we have seen something similar earlier in a similar environment. Can you please pass the following configuration to your Spark job? --conf spark.ext.h2o.client.network.mask=10.10.0.1/24

where instead of 10.10.0.1/24 please specify the network where your job will be running. This will force H2O to use that network interface.
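
For example, a minimal sketch of the submit command (the CIDR and the script name below are placeholders; substitute the network your driver actually runs in):

# Hypothetical example: force the H2O client to bind to an interface on the given network.
# Replace 10.10.0.1/24 with the CIDR of your environment and use your own script name.
spark2-submit \
  --conf spark.ext.h2o.client.network.mask=10.10.0.1/24 \
  pysparkling_test.py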

Regarding the second question:

If you start Spark and add the PySparkling package using the --py-files option, then Spark distributes the dependency automatically and no additional work is required. If you are installing the package manually, then yes, it needs to be installed on all the nodes in the cluster.
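
For example, a minimal sketch (the zip file name is an assumption; use the actual PySparkling zip shipped with your Sparkling Water distribution):

# Hypothetical example: ship the PySparkling dependency together with the job
# so it does not have to be pre-installed on the worker nodes.
spark2-submit \
  --py-files h2o_pysparkling_2.3.zip \
  pysparkling_test.py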

Kuba

@GlockGao
Author

Hi,

Thanks very much for your help.
After I configured 'spark.ext.h2o.client.network.mask', I can run the code successfully in local mode.
But there is still a problem in 'client' mode; it raised the exception "java.lang.RuntimeException: Cloud size 0 under 1".
I used the default backend - internal mode.

BTW, it still fails at the line "hc = H2OContext.getOrCreate(spark, conf=h2oConf)"

Cheers,

@jakubhava
Contributor

jakubhava commented Aug 22, 2019

Good to hear!

Can you please also apply spark.ext.h2o.node.network.mask with the same network settings (if the worker nodes are running in the same network)? That should also help the worker nodes with the selection of the network interface.
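
For example, a minimal sketch combining both settings (placeholder CIDR; use the network shared by your driver and worker nodes):

# Hypothetical example: apply the same mask to the client (driver) and to the H2O worker nodes.
spark2-submit \
  --conf spark.ext.h2o.client.network.mask=10.10.0.1/24 \
  --conf spark.ext.h2o.node.network.mask=10.10.0.1/24 \
  pysparkling_test.py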

@GlockGao
Author

Hi,

Thanks very much!

Sorry, I have no resources to run the Spark job right now.
I will do the test tomorrow and let you know the results.

Cheers,

@GlockGao
Author

Hi,

I'm not sure whether it is configured properly. All the worker nodes' IPs are in 10.156.4.*/24, so I set 'spark.ext.h2o.node.network.mask=10.156.4.0/24', but encountered the same error.

After that, I changed the configuration a little:
'spark.executor.instances': '2' -> '10'
'spark.executor.memory': '24G'
'spark.driver.memory': '8G'

but got a different error. The full trace is below (sorry about the many lines):
19/08/22 09:44:01 INFO scheduler.DAGScheduler: Job 7 finished: collect at SpreadRDDBuilder.scala:62, took 0.502590 s
19/08/22 09:44:01 INFO internal.SpreadRDDBuilder: Detected 46 spark executors for 10 H2O workers!
19/08/22 09:44:01 INFO client.TransportClientFactory: Successfully created connection to aup7714s.unix.anz/10.156.4.37:37748 after 5 ms (0 ms spent in bootstraps)
19/08/22 09:44:01 INFO client.TransportClientFactory: Successfully created connection to aup7956s.unix.anz/10.156.4.76:38076 after 1 ms (0 ms spent in bootstraps)
19/08/22 09:44:01 INFO client.TransportClientFactory: Successfully created connection to aup7954s.unix.anz/10.156.4.74:41600 after 0 ms (0 ms spent in bootstraps)
19/08/22 09:44:02 INFO client.TransportClientFactory: Successfully created connection to aup7950s.unix.anz/10.156.4.70:35642 after 1 ms (0 ms spent in bootstraps)
19/08/22 09:44:02 INFO client.TransportClientFactory: Successfully created connection to aup7952s.unix.anz/10.156.4.72:35515 after 0 ms (0 ms spent in bootstraps)
19/08/22 09:44:02 INFO client.TransportClientFactory: Successfully created connection to aup7954s.unix.anz/10.156.4.74:41158 after 0 ms (0 ms spent in bootstraps)
19/08/22 09:44:02 INFO client.TransportClientFactory: Successfully created connection to aup7722s.unix.anz/10.156.4.45:40437 after 0 ms (0 ms spent in bootstraps)
19/08/22 09:44:02 INFO client.TransportClientFactory: Successfully created connection to aup7964s.unix.anz/10.156.4.84:38968 after 0 ms (0 ms spent in bootstraps)
19/08/22 09:44:02 INFO client.TransportClientFactory: Successfully created connection to aup7964s.unix.anz/10.156.4.84:38933 after 0 ms (0 ms spent in bootstraps)
19/08/22 09:44:02 INFO client.TransportClientFactory: Successfully created connection to aup7961s.unix.anz/10.156.4.81:34187 after 0 ms (0 ms spent in bootstraps)
19/08/22 09:44:02 INFO client.TransportClientFactory: Successfully created connection to aup7958s.unix.anz/10.156.4.78:44556 after 0 ms (0 ms spent in bootstraps)
19/08/22 09:44:02 INFO client.TransportClientFactory: Successfully created connection to aup7964s.unix.anz/10.156.4.84:42221 after 0 ms (0 ms spent in bootstraps)
19/08/22 09:44:02 INFO client.TransportClientFactory: Successfully created connection to aup7960s.unix.anz/10.156.4.80:42960 after 0 ms (0 ms spent in bootstraps)
19/08/22 09:44:02 INFO client.TransportClientFactory: Successfully created connection to aup7960s.unix.anz/10.156.4.80:40858 after 0 ms (0 ms spent in bootstraps)
19/08/22 09:44:02 INFO client.TransportClientFactory: Successfully created connection to aup7950s.unix.anz/10.156.4.70:38078 after 0 ms (0 ms spent in bootstraps)
19/08/22 09:44:02 INFO client.TransportClientFactory: Successfully created connection to aup7723s.unix.anz/10.156.4.46:43313 after 0 ms (0 ms spent in bootstraps)
19/08/22 09:44:02 INFO client.TransportClientFactory: Successfully created connection to aup7952s.unix.anz/10.156.4.72:34555 after 0 ms (0 ms spent in bootstraps)
19/08/22 09:44:02 INFO client.TransportClientFactory: Successfully created connection to aup7715s.unix.anz/10.156.4.38:35782 after 0 ms (0 ms spent in bootstraps)
19/08/22 09:44:02 INFO client.TransportClientFactory: Successfully created connection to aup7959s.unix.anz/10.156.4.79:46262 after 1 ms (0 ms spent in bootstraps)
19/08/22 09:44:02 INFO client.TransportClientFactory: Successfully created connection to aup7952s.unix.anz/10.156.4.72:39393 after 0 ms (0 ms spent in bootstraps)
19/08/22 09:44:02 INFO client.TransportClientFactory: Successfully created connection to aup7715s.unix.anz/10.156.4.38:34596 after 0 ms (0 ms spent in bootstraps)
19/08/22 09:44:02 INFO client.TransportClientFactory: Successfully created connection to aup7961s.unix.anz/10.156.4.81:33677 after 0 ms (0 ms spent in bootstraps)
19/08/22 09:44:02 INFO client.TransportClientFactory: Successfully created connection to aup7715s.unix.anz/10.156.4.38:43199 after 0 ms (0 ms spent in bootstraps)
19/08/22 09:44:02 INFO client.TransportClientFactory: Successfully created connection to aup7950s.unix.anz/10.156.4.70:42558 after 0 ms (0 ms spent in bootstraps)
19/08/22 09:44:02 INFO client.TransportClientFactory: Successfully created connection to aup7954s.unix.anz/10.156.4.74:41583 after 0 ms (0 ms spent in bootstraps)
19/08/22 09:44:02 INFO client.TransportClientFactory: Successfully created connection to aup7951s.unix.anz/10.156.4.71:43468 after 0 ms (0 ms spent in bootstraps)
19/08/22 09:44:02 INFO client.TransportClientFactory: Successfully created connection to aup7962s.unix.anz/10.156.4.82:42723 after 0 ms (0 ms spent in bootstraps)
19/08/22 09:44:02 INFO client.TransportClientFactory: Successfully created connection to aup7724s.unix.anz/10.156.4.47:37238 after 0 ms (0 ms spent in bootstraps)
19/08/22 09:44:02 INFO client.TransportClientFactory: Successfully created connection to aup7717s.unix.anz/10.156.4.40:36304 after 0 ms (0 ms spent in bootstraps)
19/08/22 09:44:02 INFO client.TransportClientFactory: Successfully created connection to aup7958s.unix.anz/10.156.4.78:44098 after 0 ms (0 ms spent in bootstraps)
19/08/22 09:44:02 INFO client.TransportClientFactory: Successfully created connection to aup7717s.unix.anz/10.156.4.40:34773 after 0 ms (0 ms spent in bootstraps)
19/08/22 09:44:02 INFO client.TransportClientFactory: Successfully created connection to aup7950s.unix.anz/10.156.4.70:41306 after 0 ms (0 ms spent in bootstraps)
19/08/22 09:44:02 INFO client.TransportClientFactory: Successfully created connection to aup7715s.unix.anz/10.156.4.38:35753 after 0 ms (0 ms spent in bootstraps)
19/08/22 09:44:02 INFO client.TransportClientFactory: Successfully created connection to aup7722s.unix.anz/10.156.4.45:38717 after 0 ms (0 ms spent in bootstraps)
19/08/22 09:44:02 INFO client.TransportClientFactory: Successfully created connection to aup7960s.unix.anz/10.156.4.80:41082 after 0 ms (0 ms spent in bootstraps)
19/08/22 09:44:02 INFO client.TransportClientFactory: Successfully created connection to aup7713s.unix.anz/10.156.4.36:40216 after 0 ms (0 ms spent in bootstraps)
19/08/22 09:44:02 INFO client.TransportClientFactory: Successfully created connection to aup7713s.unix.anz/10.156.4.36:41940 after 0 ms (0 ms spent in bootstraps)
19/08/22 09:44:02 INFO client.TransportClientFactory: Successfully created connection to aup7725s.unix.anz/10.156.4.48:46468 after 0 ms (0 ms spent in bootstraps)
19/08/22 09:44:02 INFO client.TransportClientFactory: Successfully created connection to aup7725s.unix.anz/10.156.4.48:44074 after 1 ms (0 ms spent in bootstraps)
19/08/22 09:44:02 INFO client.TransportClientFactory: Successfully created connection to aup7724s.unix.anz/10.156.4.47:38024 after 0 ms (0 ms spent in bootstraps)
19/08/22 09:44:02 INFO client.TransportClientFactory: Successfully created connection to aup7713s.unix.anz/10.156.4.36:46653 after 0 ms (0 ms spent in bootstraps)
19/08/22 09:44:02 INFO client.TransportClientFactory: Successfully created connection to aup7954s.unix.anz/10.156.4.74:40870 after 0 ms (0 ms spent in bootstraps)
19/08/22 09:44:02 INFO client.TransportClientFactory: Successfully created connection to aup7958s.unix.anz/10.156.4.78:36526 after 0 ms (0 ms spent in bootstraps)
19/08/22 09:44:02 INFO client.TransportClientFactory: Successfully created connection to aup7959s.unix.anz/10.156.4.79:37139 after 0 ms (0 ms spent in bootstraps)
19/08/22 09:44:02 INFO client.TransportClientFactory: Successfully created connection to aup7715s.unix.anz/10.156.4.38:44519 after 0 ms (0 ms spent in bootstraps)
19/08/22 09:44:02 INFO client.TransportClientFactory: Successfully created connection to aup7952s.unix.anz/10.156.4.72:45935 after 0 ms (0 ms spent in bootstraps)
19/08/22 09:44:37 INFO spark.ContextCleaner: Cleaned accumulator 381
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_14_piece0 on 10.156.4.67:23024 in memory (size: 5.0 KB, free: 7.6 GB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_14_piece0 on aup7950s.unix.anz:45711 in memory (size: 5.0 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_14_piece0 on aup7725s.unix.anz:44285 in memory (size: 5.0 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_14_piece0 on aup7724s.unix.anz:35126 in memory (size: 5.0 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_14_piece0 on aup7952s.unix.anz:38047 in memory (size: 5.0 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_14_piece0 on aup7964s.unix.anz:39005 in memory (size: 5.0 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_14_piece0 on aup7714s.unix.anz:34972 in memory (size: 5.0 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_14_piece0 on aup7717s.unix.anz:36004 in memory (size: 5.0 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_14_piece0 on aup7962s.unix.anz:42217 in memory (size: 5.0 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_14_piece0 on aup7964s.unix.anz:45624 in memory (size: 5.0 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_14_piece0 on aup7717s.unix.anz:43139 in memory (size: 5.0 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_14_piece0 on aup7950s.unix.anz:34987 in memory (size: 5.0 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_14_piece0 on aup7723s.unix.anz:36256 in memory (size: 5.0 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_14_piece0 on aup7950s.unix.anz:44295 in memory (size: 5.0 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_14_piece0 on aup7715s.unix.anz:45526 in memory (size: 5.0 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_14_piece0 on aup7952s.unix.anz:34287 in memory (size: 5.0 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_14_piece0 on aup7950s.unix.anz:41544 in memory (size: 5.0 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_14_piece0 on aup7715s.unix.anz:39989 in memory (size: 5.0 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_14_piece0 on aup7715s.unix.anz:46845 in memory (size: 5.0 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_14_piece0 on aup7713s.unix.anz:36440 in memory (size: 5.0 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_14_piece0 on aup7952s.unix.anz:46560 in memory (size: 5.0 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_14_piece0 on aup7725s.unix.anz:32914 in memory (size: 5.0 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_14_piece0 on aup7713s.unix.anz:32966 in memory (size: 5.0 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_14_piece0 on aup7713s.unix.anz:43508 in memory (size: 5.0 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_14_piece0 on aup7964s.unix.anz:40970 in memory (size: 5.0 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_14_piece0 on aup7715s.unix.anz:33077 in memory (size: 5.0 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_14_piece0 on aup7715s.unix.anz:34142 in memory (size: 5.0 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_14_piece0 on aup7961s.unix.anz:33862 in memory (size: 5.0 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_14_piece0 on aup7954s.unix.anz:38816 in memory (size: 5.0 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_14_piece0 on aup7722s.unix.anz:44051 in memory (size: 5.0 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_14_piece0 on aup7959s.unix.anz:43223 in memory (size: 5.0 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_14_piece0 on aup7724s.unix.anz:37302 in memory (size: 5.0 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_14_piece0 on aup7961s.unix.anz:33491 in memory (size: 5.0 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_14_piece0 on aup7952s.unix.anz:46223 in memory (size: 5.0 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_14_piece0 on aup7959s.unix.anz:38453 in memory (size: 5.0 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_14_piece0 on aup7722s.unix.anz:44039 in memory (size: 5.0 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_14_piece0 on aup7958s.unix.anz:44977 in memory (size: 5.0 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_14_piece0 on aup7956s.unix.anz:39620 in memory (size: 5.0 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_14_piece0 on aup7958s.unix.anz:34087 in memory (size: 5.0 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_14_piece0 on aup7954s.unix.anz:38529 in memory (size: 5.0 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_14_piece0 on aup7954s.unix.anz:37000 in memory (size: 5.0 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_14_piece0 on aup7954s.unix.anz:46510 in memory (size: 5.0 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_14_piece0 on aup7960s.unix.anz:32814 in memory (size: 5.0 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_14_piece0 on aup7958s.unix.anz:38670 in memory (size: 5.0 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_14_piece0 on aup7951s.unix.anz:40289 in memory (size: 5.0 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_14_piece0 on aup7960s.unix.anz:34859 in memory (size: 5.0 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_14_piece0 on aup7960s.unix.anz:41697 in memory (size: 5.0 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO spark.ContextCleaner: Cleaned accumulator 362
19/08/22 09:44:37 INFO spark.ContextCleaner: Cleaned accumulator 359
19/08/22 09:44:37 INFO spark.ContextCleaner: Cleaned accumulator 358
19/08/22 09:44:37 INFO spark.ContextCleaner: Cleaned accumulator 370
19/08/22 09:44:37 INFO spark.ContextCleaner: Cleaned accumulator 377
19/08/22 09:44:37 INFO spark.ContextCleaner: Cleaned accumulator 368
19/08/22 09:44:37 INFO spark.ContextCleaner: Cleaned accumulator 395
19/08/22 09:44:37 INFO spark.ContextCleaner: Cleaned accumulator 392
19/08/22 09:44:37 INFO spark.ContextCleaner: Cleaned accumulator 384
19/08/22 09:44:37 INFO spark.ContextCleaner: Cleaned accumulator 366
19/08/22 09:44:37 INFO spark.ContextCleaner: Cleaned accumulator 371
19/08/22 09:44:37 INFO spark.ContextCleaner: Cleaned accumulator 374
19/08/22 09:44:37 INFO spark.ContextCleaner: Cleaned accumulator 383
19/08/22 09:44:37 INFO spark.ContextCleaner: Cleaned accumulator 380
19/08/22 09:44:37 INFO spark.ContextCleaner: Cleaned accumulator 396
19/08/22 09:44:37 INFO spark.ContextCleaner: Cleaned accumulator 400
19/08/22 09:44:37 INFO spark.ContextCleaner: Cleaned accumulator 360
19/08/22 09:44:37 INFO spark.ContextCleaner: Cleaned accumulator 389
19/08/22 09:44:37 INFO spark.ContextCleaner: Cleaned accumulator 351
19/08/22 09:44:37 INFO spark.ContextCleaner: Cleaned accumulator 365
19/08/22 09:44:37 INFO spark.ContextCleaner: Cleaned accumulator 382
19/08/22 09:44:37 INFO spark.ContextCleaner: Cleaned accumulator 397
19/08/22 09:44:37 INFO spark.ContextCleaner: Cleaned accumulator 385
19/08/22 09:44:37 INFO spark.ContextCleaner: Cleaned accumulator 375
19/08/22 09:44:37 INFO spark.ContextCleaner: Cleaned accumulator 373
19/08/22 09:44:37 INFO spark.ContextCleaner: Cleaned accumulator 390
19/08/22 09:44:37 INFO spark.ContextCleaner: Cleaned accumulator 353
19/08/22 09:44:37 INFO spark.ContextCleaner: Cleaned accumulator 364
19/08/22 09:44:37 INFO spark.ContextCleaner: Cleaned accumulator 355
19/08/22 09:44:37 INFO spark.ContextCleaner: Cleaned accumulator 369
19/08/22 09:44:37 INFO spark.ContextCleaner: Cleaned accumulator 386
19/08/22 09:44:37 INFO spark.ContextCleaner: Cleaned accumulator 372
19/08/22 09:44:37 INFO spark.ContextCleaner: Cleaned accumulator 361
19/08/22 09:44:37 INFO spark.ContextCleaner: Cleaned accumulator 388
19/08/22 09:44:37 INFO spark.ContextCleaner: Cleaned accumulator 398
19/08/22 09:44:37 INFO spark.ContextCleaner: Cleaned accumulator 363
19/08/22 09:44:37 INFO spark.ContextCleaner: Cleaned accumulator 367
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_15_piece0 on 10.156.4.67:23024 in memory (size: 2.3 KB, free: 7.6 GB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_15_piece0 on aup7725s.unix.anz:44285 in memory (size: 2.3 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_15_piece0 on aup7723s.unix.anz:36256 in memory (size: 2.3 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_15_piece0 on aup7714s.unix.anz:34972 in memory (size: 2.3 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_15_piece0 on aup7964s.unix.anz:45624 in memory (size: 2.3 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_15_piece0 on aup7717s.unix.anz:43139 in memory (size: 2.3 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_15_piece0 on aup7964s.unix.anz:39005 in memory (size: 2.3 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_15_piece0 on aup7713s.unix.anz:43508 in memory (size: 2.3 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_15_piece0 on aup7950s.unix.anz:44295 in memory (size: 2.3 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_15_piece0 on aup7713s.unix.anz:36440 in memory (size: 2.3 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_15_piece0 on aup7950s.unix.anz:45711 in memory (size: 2.3 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_15_piece0 on aup7713s.unix.anz:32966 in memory (size: 2.3 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_15_piece0 on aup7950s.unix.anz:41544 in memory (size: 2.3 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_15_piece0 on aup7725s.unix.anz:32914 in memory (size: 2.3 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_15_piece0 on aup7724s.unix.anz:35126 in memory (size: 2.3 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_15_piece0 on aup7964s.unix.anz:40970 in memory (size: 2.3 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_15_piece0 on aup7950s.unix.anz:34987 in memory (size: 2.3 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_15_piece0 on aup7724s.unix.anz:37302 in memory (size: 2.3 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_15_piece0 on aup7952s.unix.anz:38047 in memory (size: 2.3 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_15_piece0 on aup7717s.unix.anz:36004 in memory (size: 2.3 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_15_piece0 on aup7715s.unix.anz:46845 in memory (size: 2.3 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_15_piece0 on aup7952s.unix.anz:34287 in memory (size: 2.3 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_15_piece0 on aup7715s.unix.anz:34142 in memory (size: 2.3 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_15_piece0 on aup7962s.unix.anz:42217 in memory (size: 2.3 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_15_piece0 on aup7715s.unix.anz:33077 in memory (size: 2.3 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_15_piece0 on aup7959s.unix.anz:43223 in memory (size: 2.3 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_15_piece0 on aup7715s.unix.anz:39989 in memory (size: 2.3 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_15_piece0 on aup7952s.unix.anz:46223 in memory (size: 2.3 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_15_piece0 on aup7715s.unix.anz:45526 in memory (size: 2.3 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_15_piece0 on aup7954s.unix.anz:38816 in memory (size: 2.3 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_15_piece0 on aup7961s.unix.anz:33491 in memory (size: 2.3 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_15_piece0 on aup7722s.unix.anz:44051 in memory (size: 2.3 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_15_piece0 on aup7722s.unix.anz:44039 in memory (size: 2.3 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_15_piece0 on aup7954s.unix.anz:46510 in memory (size: 2.3 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_15_piece0 on aup7956s.unix.anz:39620 in memory (size: 2.3 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_15_piece0 on aup7960s.unix.anz:34859 in memory (size: 2.3 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_15_piece0 on aup7960s.unix.anz:41697 in memory (size: 2.3 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_15_piece0 on aup7954s.unix.anz:37000 in memory (size: 2.3 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_15_piece0 on aup7951s.unix.anz:40289 in memory (size: 2.3 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_15_piece0 on aup7961s.unix.anz:33862 in memory (size: 2.3 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_15_piece0 on aup7952s.unix.anz:46560 in memory (size: 2.3 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_15_piece0 on aup7958s.unix.anz:38670 in memory (size: 2.3 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_15_piece0 on aup7959s.unix.anz:38453 in memory (size: 2.3 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_15_piece0 on aup7960s.unix.anz:32814 in memory (size: 2.3 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_15_piece0 on aup7954s.unix.anz:38529 in memory (size: 2.3 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_15_piece0 on aup7958s.unix.anz:44977 in memory (size: 2.3 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO storage.BlockManagerInfo: Removed broadcast_15_piece0 on aup7958s.unix.anz:34087 in memory (size: 2.3 KB, free: 366.3 MB)
19/08/22 09:44:37 INFO spark.ContextCleaner: Cleaned accumulator 379
19/08/22 09:44:37 INFO spark.ContextCleaner: Cleaned accumulator 357
19/08/22 09:44:37 INFO spark.ContextCleaner: Cleaned accumulator 378
19/08/22 09:44:37 INFO spark.ContextCleaner: Cleaned accumulator 394
19/08/22 09:44:37 INFO spark.ContextCleaner: Cleaned accumulator 391
19/08/22 09:44:37 INFO spark.ContextCleaner: Cleaned accumulator 393
19/08/22 09:44:37 INFO spark.ContextCleaner: Cleaned shuffle 7
19/08/22 09:44:37 INFO spark.ContextCleaner: Cleaned accumulator 399
19/08/22 09:44:37 INFO spark.ContextCleaner: Cleaned accumulator 354
19/08/22 09:44:37 INFO spark.ContextCleaner: Cleaned accumulator 376
19/08/22 09:44:37 INFO spark.ContextCleaner: Cleaned accumulator 352
19/08/22 09:44:37 INFO spark.ContextCleaner: Cleaned accumulator 356
19/08/22 09:44:37 INFO spark.ContextCleaner: Cleaned accumulator 387
19/08/22 09:45:01 INFO spark.ExecutorAllocationManager: Request to remove executorIds: 30, 39, 45, 2, 5, 33, 27, 12, 8, 15, 42, 36, 21, 18, 24, 35, 41, 7, 17, 1, 44, 23, 38, 26, 4, 11, 32, 14, 29, 20, 46, 28, 6, 34, 40, 43, 9, 22, 16, 37, 19, 3, 10, 25, 31, 13
19/08/22 09:45:01 INFO cluster.YarnClientSchedulerBackend: Requesting to kill executor(s) 30, 39, 45, 2, 5, 33, 27, 12, 8, 15, 42, 36, 21, 18, 24, 35, 41, 7, 17, 1, 44, 23, 38, 26, 4, 11, 32, 14, 29, 20, 46, 28, 6, 34, 40, 43, 9, 22, 16, 37, 19, 3, 10, 25, 31, 13
19/08/22 09:45:01 INFO cluster.YarnClientSchedulerBackend: Actual list of executor(s) to be killed is 30, 39, 45, 2, 5, 33, 27, 12, 8, 15, 42, 36, 21, 18, 24, 35, 41, 7, 17, 1, 44, 23, 38, 26, 4, 11, 32, 14, 29, 20, 46, 28, 6, 34, 40, 43, 9, 22, 16, 37, 19, 3, 10, 25, 31, 13
19/08/22 09:45:01 INFO spark.ExecutorAllocationManager: Removing executor 30 because it has been idle for 60 seconds (new desired total will be 45)
19/08/22 09:45:01 INFO spark.ExecutorAllocationManager: Removing executor 39 because it has been idle for 60 seconds (new desired total will be 44)
19/08/22 09:45:01 INFO spark.ExecutorAllocationManager: Removing executor 45 because it has been idle for 60 seconds (new desired total will be 43)
19/08/22 09:45:01 INFO spark.ExecutorAllocationManager: Removing executor 2 because it has been idle for 60 seconds (new desired total will be 42)
19/08/22 09:45:01 INFO spark.ExecutorAllocationManager: Removing executor 5 because it has been idle for 60 seconds (new desired total will be 41)
19/08/22 09:45:01 INFO spark.ExecutorAllocationManager: Removing executor 33 because it has been idle for 60 seconds (new desired total will be 40)
19/08/22 09:45:01 INFO spark.ExecutorAllocationManager: Removing executor 27 because it has been idle for 60 seconds (new desired total will be 39)
19/08/22 09:45:01 INFO spark.ExecutorAllocationManager: Removing executor 12 because it has been idle for 60 seconds (new desired total will be 38)
19/08/22 09:45:01 INFO spark.ExecutorAllocationManager: Removing executor 8 because it has been idle for 60 seconds (new desired total will be 37)
19/08/22 09:45:01 INFO spark.ExecutorAllocationManager: Removing executor 15 because it has been idle for 60 seconds (new desired total will be 36)
19/08/22 09:45:01 INFO spark.ExecutorAllocationManager: Removing executor 42 because it has been idle for 60 seconds (new desired total will be 35)
19/08/22 09:45:01 INFO spark.ExecutorAllocationManager: Removing executor 36 because it has been idle for 60 seconds (new desired total will be 34)
19/08/22 09:45:01 INFO spark.ExecutorAllocationManager: Removing executor 21 because it has been idle for 60 seconds (new desired total will be 33)
19/08/22 09:45:01 INFO spark.ExecutorAllocationManager: Removing executor 18 because it has been idle for 60 seconds (new desired total will be 32)
19/08/22 09:45:01 INFO spark.ExecutorAllocationManager: Removing executor 24 because it has been idle for 60 seconds (new desired total will be 31)
19/08/22 09:45:01 INFO spark.ExecutorAllocationManager: Removing executor 35 because it has been idle for 60 seconds (new desired total will be 30)
19/08/22 09:45:01 INFO spark.ExecutorAllocationManager: Removing executor 41 because it has been idle for 60 seconds (new desired total will be 29)
19/08/22 09:45:01 INFO spark.ExecutorAllocationManager: Removing executor 7 because it has been idle for 60 seconds (new desired total will be 28)
19/08/22 09:45:01 INFO spark.ExecutorAllocationManager: Removing executor 17 because it has been idle for 60 seconds (new desired total will be 27)
19/08/22 09:45:01 INFO spark.ExecutorAllocationManager: Removing executor 1 because it has been idle for 60 seconds (new desired total will be 26)
19/08/22 09:45:01 INFO spark.ExecutorAllocationManager: Removing executor 44 because it has been idle for 60 seconds (new desired total will be 25)
19/08/22 09:45:01 INFO spark.ExecutorAllocationManager: Removing executor 23 because it has been idle for 60 seconds (new desired total will be 24)
19/08/22 09:45:01 INFO spark.ExecutorAllocationManager: Removing executor 38 because it has been idle for 60 seconds (new desired total will be 23)
19/08/22 09:45:01 INFO spark.ExecutorAllocationManager: Removing executor 26 because it has been idle for 60 seconds (new desired total will be 22)
19/08/22 09:45:01 INFO spark.ExecutorAllocationManager: Removing executor 4 because it has been idle for 60 seconds (new desired total will be 21)
19/08/22 09:45:01 INFO spark.ExecutorAllocationManager: Removing executor 11 because it has been idle for 60 seconds (new desired total will be 20)
19/08/22 09:45:01 INFO spark.ExecutorAllocationManager: Removing executor 32 because it has been idle for 60 seconds (new desired total will be 19)
19/08/22 09:45:01 INFO spark.ExecutorAllocationManager: Removing executor 14 because it has been idle for 60 seconds (new desired total will be 18)
19/08/22 09:45:01 INFO spark.ExecutorAllocationManager: Removing executor 29 because it has been idle for 60 seconds (new desired total will be 17)
19/08/22 09:45:01 INFO spark.ExecutorAllocationManager: Removing executor 20 because it has been idle for 60 seconds (new desired total will be 16)
19/08/22 09:45:01 INFO spark.ExecutorAllocationManager: Removing executor 46 because it has been idle for 60 seconds (new desired total will be 15)
19/08/22 09:45:01 INFO spark.ExecutorAllocationManager: Removing executor 28 because it has been idle for 60 seconds (new desired total will be 14)
19/08/22 09:45:01 INFO spark.ExecutorAllocationManager: Removing executor 6 because it has been idle for 60 seconds (new desired total will be 13)
19/08/22 09:45:01 INFO spark.ExecutorAllocationManager: Removing executor 34 because it has been idle for 60 seconds (new desired total will be 12)
19/08/22 09:45:01 INFO spark.ExecutorAllocationManager: Removing executor 40 because it has been idle for 60 seconds (new desired total will be 11)
19/08/22 09:45:01 INFO spark.ExecutorAllocationManager: Removing executor 43 because it has been idle for 60 seconds (new desired total will be 10)
19/08/22 09:45:01 INFO spark.ExecutorAllocationManager: Removing executor 9 because it has been idle for 60 seconds (new desired total will be 9)
19/08/22 09:45:01 INFO spark.ExecutorAllocationManager: Removing executor 22 because it has been idle for 60 seconds (new desired total will be 8)
19/08/22 09:45:01 INFO spark.ExecutorAllocationManager: Removing executor 16 because it has been idle for 60 seconds (new desired total will be 7)
19/08/22 09:45:01 INFO spark.ExecutorAllocationManager: Removing executor 37 because it has been idle for 60 seconds (new desired total will be 6)
19/08/22 09:45:01 INFO spark.ExecutorAllocationManager: Removing executor 19 because it has been idle for 60 seconds (new desired total will be 5)
19/08/22 09:45:01 INFO spark.ExecutorAllocationManager: Removing executor 3 because it has been idle for 60 seconds (new desired total will be 4)
19/08/22 09:45:01 INFO spark.ExecutorAllocationManager: Removing executor 10 because it has been idle for 60 seconds (new desired total will be 3)
19/08/22 09:45:01 INFO spark.ExecutorAllocationManager: Removing executor 25 because it has been idle for 60 seconds (new desired total will be 2)
19/08/22 09:45:01 INFO spark.ExecutorAllocationManager: Removing executor 31 because it has been idle for 60 seconds (new desired total will be 1)
19/08/22 09:45:01 INFO spark.ExecutorAllocationManager: Removing executor 13 because it has been idle for 60 seconds (new desired total will be 0)
19/08/22 09:45:04 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 1.
19/08/22 09:45:04 INFO scheduler.DAGScheduler: Executor lost: 1 (epoch 8)
19/08/22 09:45:04 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster.
19/08/22 09:45:04 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(1, aup7723s.unix.anz, 36256, None)
19/08/22 09:45:04 INFO storage.BlockManagerMaster: Removed 1 successfully in removeExecutor
19/08/22 09:45:04 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 23.
19/08/22 09:45:04 INFO scheduler.DAGScheduler: Executor lost: 23 (epoch 8)
19/08/22 09:45:04 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 23 from BlockManagerMaster.
19/08/22 09:45:04 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(23, aup7959s.unix.anz, 43223, None)
19/08/22 09:45:04 INFO storage.BlockManagerMaster: Removed 23 successfully in removeExecutor
19/08/22 09:45:04 INFO cluster.YarnScheduler: Executor 1 on aup7723s.unix.anz killed by driver.
19/08/22 09:45:04 INFO cluster.YarnScheduler: Executor 23 on aup7959s.unix.anz killed by driver.
19/08/22 09:45:04 INFO spark.ExecutorAllocationManager: Existing executor 1 has been removed (new total is 45)
19/08/22 09:45:04 INFO spark.ExecutorAllocationManager: Existing executor 23 has been removed (new total is 44)
19/08/22 09:45:04 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 25.
19/08/22 09:45:04 INFO scheduler.DAGScheduler: Executor lost: 25 (epoch 8)
19/08/22 09:45:04 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 25 from BlockManagerMaster.
19/08/22 09:45:04 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(25, aup7725s.unix.anz, 44285, None)
19/08/22 09:45:04 INFO storage.BlockManagerMaster: Removed 25 successfully in removeExecutor
19/08/22 09:45:04 INFO cluster.YarnScheduler: Executor 25 on aup7725s.unix.anz killed by driver.
19/08/22 09:45:04 INFO spark.ExecutorAllocationManager: Existing executor 25 has been removed (new total is 43)
19/08/22 09:45:04 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 41.
19/08/22 09:45:04 INFO scheduler.DAGScheduler: Executor lost: 41 (epoch 8)
19/08/22 09:45:04 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 41 from BlockManagerMaster.
19/08/22 09:45:04 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(41, aup7725s.unix.anz, 32914, None)
19/08/22 09:45:04 INFO storage.BlockManagerMaster: Removed 41 successfully in removeExecutor
19/08/22 09:45:04 INFO cluster.YarnScheduler: Executor 41 on aup7725s.unix.anz killed by driver.
19/08/22 09:45:04 INFO spark.ExecutorAllocationManager: Existing executor 41 has been removed (new total is 42)
19/08/22 09:45:04 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 39.
19/08/22 09:45:04 INFO scheduler.DAGScheduler: Executor lost: 39 (epoch 8)
19/08/22 09:45:04 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 39 from BlockManagerMaster.
19/08/22 09:45:04 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(39, aup7959s.unix.anz, 38453, None)
19/08/22 09:45:04 INFO storage.BlockManagerMaster: Removed 39 successfully in removeExecutor
19/08/22 09:45:04 INFO cluster.YarnScheduler: Executor 39 on aup7959s.unix.anz killed by driver.
19/08/22 09:45:04 INFO spark.ExecutorAllocationManager: Existing executor 39 has been removed (new total is 41)
19/08/22 09:45:05 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 15.
19/08/22 09:45:05 INFO scheduler.DAGScheduler: Executor lost: 15 (epoch 8)
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 15 from BlockManagerMaster.
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(15, aup7713s.unix.anz, 43508, None)
19/08/22 09:45:05 INFO storage.BlockManagerMaster: Removed 15 successfully in removeExecutor
19/08/22 09:45:05 INFO cluster.YarnScheduler: Executor 15 on aup7713s.unix.anz killed by driver.
19/08/22 09:45:05 INFO spark.ExecutorAllocationManager: Existing executor 15 has been removed (new total is 40)
19/08/22 09:45:05 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 22.
19/08/22 09:45:05 INFO scheduler.DAGScheduler: Executor lost: 22 (epoch 8)
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 22 from BlockManagerMaster.
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(22, aup7713s.unix.anz, 32966, None)
19/08/22 09:45:05 INFO storage.BlockManagerMaster: Removed 22 successfully in removeExecutor
19/08/22 09:45:05 INFO cluster.YarnScheduler: Executor 22 on aup7713s.unix.anz killed by driver.
19/08/22 09:45:05 INFO spark.ExecutorAllocationManager: Existing executor 22 has been removed (new total is 39)
19/08/22 09:45:05 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 38.
19/08/22 09:45:05 INFO scheduler.DAGScheduler: Executor lost: 38 (epoch 8)
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 38 from BlockManagerMaster.
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(38, aup7713s.unix.anz, 36440, None)
19/08/22 09:45:05 INFO storage.BlockManagerMaster: Removed 38 successfully in removeExecutor
19/08/22 09:45:05 INFO cluster.YarnScheduler: Executor 38 on aup7713s.unix.anz killed by driver.
19/08/22 09:45:05 INFO spark.ExecutorAllocationManager: Existing executor 38 has been removed (new total is 38)
19/08/22 09:45:05 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 27.
19/08/22 09:45:05 INFO scheduler.DAGScheduler: Executor lost: 27 (epoch 8)
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 27 from BlockManagerMaster.
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(27, aup7717s.unix.anz, 36004, None)
19/08/22 09:45:05 INFO storage.BlockManagerMaster: Removed 27 successfully in removeExecutor
19/08/22 09:45:05 INFO cluster.YarnScheduler: Executor 27 on aup7717s.unix.anz killed by driver.
19/08/22 09:45:05 INFO spark.ExecutorAllocationManager: Existing executor 27 has been removed (new total is 37)
19/08/22 09:45:05 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 40.
19/08/22 09:45:05 INFO scheduler.DAGScheduler: Executor lost: 40 (epoch 8)
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 40 from BlockManagerMaster.
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(40, aup7724s.unix.anz, 37302, None)
19/08/22 09:45:05 INFO storage.BlockManagerMaster: Removed 40 successfully in removeExecutor
19/08/22 09:45:05 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 24.
19/08/22 09:45:05 INFO scheduler.DAGScheduler: Executor lost: 24 (epoch 8)
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 24 from BlockManagerMaster.
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(24, aup7724s.unix.anz, 35126, None)
19/08/22 09:45:05 INFO storage.BlockManagerMaster: Removed 24 successfully in removeExecutor
19/08/22 09:45:05 INFO cluster.YarnScheduler: Executor 40 on aup7724s.unix.anz killed by driver.
19/08/22 09:45:05 INFO spark.ExecutorAllocationManager: Existing executor 40 has been removed (new total is 36)
19/08/22 09:45:05 INFO cluster.YarnScheduler: Executor 24 on aup7724s.unix.anz killed by driver.
19/08/22 09:45:05 INFO spark.ExecutorAllocationManager: Existing executor 24 has been removed (new total is 35)
19/08/22 09:45:05 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 43.
19/08/22 09:45:05 INFO scheduler.DAGScheduler: Executor lost: 43 (epoch 8)
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 43 from BlockManagerMaster.
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(43, aup7717s.unix.anz, 43139, None)
19/08/22 09:45:05 INFO storage.BlockManagerMaster: Removed 43 successfully in removeExecutor
19/08/22 09:45:05 INFO cluster.YarnScheduler: Executor 43 on aup7717s.unix.anz killed by driver.
19/08/22 09:45:05 INFO spark.ExecutorAllocationManager: Existing executor 43 has been removed (new total is 34)
19/08/22 09:45:05 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 12.
19/08/22 09:45:05 INFO scheduler.DAGScheduler: Executor lost: 12 (epoch 8)
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 12 from BlockManagerMaster.
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(12, aup7950s.unix.anz, 45711, None)
19/08/22 09:45:05 INFO storage.BlockManagerMaster: Removed 12 successfully in removeExecutor
19/08/22 09:45:05 INFO cluster.YarnScheduler: Executor 12 on aup7950s.unix.anz killed by driver.
19/08/22 09:45:05 INFO spark.ExecutorAllocationManager: Existing executor 12 has been removed (new total is 33)
19/08/22 09:45:05 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 7.
19/08/22 09:45:05 INFO scheduler.DAGScheduler: Executor lost: 7 (epoch 8)
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 7 from BlockManagerMaster.
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(7, aup7950s.unix.anz, 44295, None)
19/08/22 09:45:05 INFO storage.BlockManagerMaster: Removed 7 successfully in removeExecutor
19/08/22 09:45:05 INFO cluster.YarnScheduler: Executor 7 on aup7950s.unix.anz killed by driver.
19/08/22 09:45:05 INFO spark.ExecutorAllocationManager: Existing executor 7 has been removed (new total is 32)
19/08/22 09:45:05 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 2.
19/08/22 09:45:05 INFO scheduler.DAGScheduler: Executor lost: 2 (epoch 8)
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 2 from BlockManagerMaster.
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(2, aup7954s.unix.anz, 46510, None)
19/08/22 09:45:05 INFO storage.BlockManagerMaster: Removed 2 successfully in removeExecutor
19/08/22 09:45:05 INFO cluster.YarnScheduler: Executor 2 on aup7954s.unix.anz killed by driver.
19/08/22 09:45:05 INFO spark.ExecutorAllocationManager: Existing executor 2 has been removed (new total is 31)
19/08/22 09:45:05 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 19.
19/08/22 09:45:05 INFO scheduler.DAGScheduler: Executor lost: 19 (epoch 8)
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 19 from BlockManagerMaster.
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(19, aup7950s.unix.anz, 41544, None)
19/08/22 09:45:05 INFO storage.BlockManagerMaster: Removed 19 successfully in removeExecutor
19/08/22 09:45:05 INFO cluster.YarnScheduler: Executor 19 on aup7950s.unix.anz killed by driver.
19/08/22 09:45:05 INFO spark.ExecutorAllocationManager: Existing executor 19 has been removed (new total is 30)
19/08/22 09:45:05 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 35.
19/08/22 09:45:05 INFO scheduler.DAGScheduler: Executor lost: 35 (epoch 8)
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 35 from BlockManagerMaster.
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(35, aup7950s.unix.anz, 34987, None)
19/08/22 09:45:05 INFO storage.BlockManagerMaster: Removed 35 successfully in removeExecutor
19/08/22 09:45:05 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 6.
19/08/22 09:45:05 INFO scheduler.DAGScheduler: Executor lost: 6 (epoch 8)
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 6 from BlockManagerMaster.
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(6, aup7952s.unix.anz, 38047, None)
19/08/22 09:45:05 INFO storage.BlockManagerMaster: Removed 6 successfully in removeExecutor
19/08/22 09:45:05 INFO cluster.YarnScheduler: Executor 35 on aup7950s.unix.anz killed by driver.
19/08/22 09:45:05 INFO spark.ExecutorAllocationManager: Existing executor 35 has been removed (new total is 29)
19/08/22 09:45:05 INFO cluster.YarnScheduler: Executor 6 on aup7952s.unix.anz killed by driver.
19/08/22 09:45:05 INFO spark.ExecutorAllocationManager: Existing executor 6 has been removed (new total is 28)
19/08/22 09:45:05 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 4.
19/08/22 09:45:05 INFO scheduler.DAGScheduler: Executor lost: 4 (epoch 8)
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 4 from BlockManagerMaster.
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(4, aup7954s.unix.anz, 38816, None)
19/08/22 09:45:05 INFO storage.BlockManagerMaster: Removed 4 successfully in removeExecutor
19/08/22 09:45:05 INFO cluster.YarnScheduler: Executor 4 on aup7954s.unix.anz killed by driver.
19/08/22 09:45:05 INFO spark.ExecutorAllocationManager: Existing executor 4 has been removed (new total is 27)
19/08/22 09:45:05 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 26.
19/08/22 09:45:05 INFO scheduler.DAGScheduler: Executor lost: 26 (epoch 8)
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 26 from BlockManagerMaster.
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(26, aup7961s.unix.anz, 33862, None)
19/08/22 09:45:05 INFO storage.BlockManagerMaster: Removed 26 successfully in removeExecutor
19/08/22 09:45:05 INFO cluster.YarnScheduler: Executor 26 on aup7961s.unix.anz killed by driver.
19/08/22 09:45:05 INFO spark.ExecutorAllocationManager: Existing executor 26 has been removed (new total is 26)
19/08/22 09:45:05 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 11.
19/08/22 09:45:05 INFO scheduler.DAGScheduler: Executor lost: 11 (epoch 8)
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 11 from BlockManagerMaster.
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(11, aup7952s.unix.anz, 34287, None)
19/08/22 09:45:05 INFO storage.BlockManagerMaster: Removed 11 successfully in removeExecutor
19/08/22 09:45:05 INFO cluster.YarnScheduler: Executor 11 on aup7952s.unix.anz killed by driver.
19/08/22 09:45:05 INFO spark.ExecutorAllocationManager: Existing executor 11 has been removed (new total is 25)
19/08/22 09:45:05 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 31.
19/08/22 09:45:05 INFO scheduler.DAGScheduler: Executor lost: 31 (epoch 8)
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 31 from BlockManagerMaster.
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(31, aup7954s.unix.anz, 37000, None)
19/08/22 09:45:05 INFO storage.BlockManagerMaster: Removed 31 successfully in removeExecutor
19/08/22 09:45:05 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 8.
19/08/22 09:45:05 INFO scheduler.DAGScheduler: Executor lost: 8 (epoch 8)
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 8 from BlockManagerMaster.
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(8, aup7954s.unix.anz, 38529, None)
19/08/22 09:45:05 INFO storage.BlockManagerMaster: Removed 8 successfully in removeExecutor
19/08/22 09:45:05 INFO cluster.YarnScheduler: Executor 31 on aup7954s.unix.anz killed by driver.
19/08/22 09:45:05 INFO spark.ExecutorAllocationManager: Existing executor 31 has been removed (new total is 24)
19/08/22 09:45:05 INFO cluster.YarnScheduler: Executor 8 on aup7954s.unix.anz killed by driver.
19/08/22 09:45:05 INFO spark.ExecutorAllocationManager: Existing executor 8 has been removed (new total is 23)
19/08/22 09:45:05 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 42.
19/08/22 09:45:05 INFO scheduler.DAGScheduler: Executor lost: 42 (epoch 8)
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 42 from BlockManagerMaster.
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(42, aup7961s.unix.anz, 33491, None)
19/08/22 09:45:05 INFO storage.BlockManagerMaster: Removed 42 successfully in removeExecutor
19/08/22 09:45:05 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 45.
19/08/22 09:45:05 INFO scheduler.DAGScheduler: Executor lost: 45 (epoch 8)
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 45 from BlockManagerMaster.
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(45, aup7714s.unix.anz, 34972, None)
19/08/22 09:45:05 INFO storage.BlockManagerMaster: Removed 45 successfully in removeExecutor
19/08/22 09:45:05 INFO cluster.YarnScheduler: Executor 42 on aup7961s.unix.anz killed by driver.
19/08/22 09:45:05 INFO spark.ExecutorAllocationManager: Existing executor 42 has been removed (new total is 22)
19/08/22 09:45:05 INFO cluster.YarnScheduler: Executor 45 on aup7714s.unix.anz killed by driver.
19/08/22 09:45:05 INFO spark.ExecutorAllocationManager: Existing executor 45 has been removed (new total is 21)
19/08/22 09:45:05 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 18.
19/08/22 09:45:05 INFO scheduler.DAGScheduler: Executor lost: 18 (epoch 8)
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 18 from BlockManagerMaster.
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(18, aup7952s.unix.anz, 46223, None)
19/08/22 09:45:05 INFO storage.BlockManagerMaster: Removed 18 successfully in removeExecutor
19/08/22 09:45:05 INFO cluster.YarnScheduler: Executor 18 on aup7952s.unix.anz killed by driver.
19/08/22 09:45:05 INFO spark.ExecutorAllocationManager: Existing executor 18 has been removed (new total is 20)
19/08/22 09:45:05 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 34.
19/08/22 09:45:05 INFO scheduler.DAGScheduler: Executor lost: 34 (epoch 8)
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 34 from BlockManagerMaster.
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(34, aup7952s.unix.anz, 46560, None)
19/08/22 09:45:05 INFO storage.BlockManagerMaster: Removed 34 successfully in removeExecutor
19/08/22 09:45:05 INFO cluster.YarnScheduler: Executor 34 on aup7952s.unix.anz killed by driver.
19/08/22 09:45:05 INFO spark.ExecutorAllocationManager: Existing executor 34 has been removed (new total is 19)
19/08/22 09:45:05 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 17.
19/08/22 09:45:05 INFO scheduler.DAGScheduler: Executor lost: 17 (epoch 8)
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 17 from BlockManagerMaster.
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(17, aup7958s.unix.anz, 44977, None)
19/08/22 09:45:05 INFO storage.BlockManagerMaster: Removed 17 successfully in removeExecutor
19/08/22 09:45:05 INFO cluster.YarnScheduler: Executor 17 on aup7958s.unix.anz killed by driver.
19/08/22 09:45:05 INFO spark.ExecutorAllocationManager: Existing executor 17 has been removed (new total is 18)
19/08/22 09:45:05 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 13.
19/08/22 09:45:05 INFO scheduler.DAGScheduler: Executor lost: 13 (epoch 8)
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 13 from BlockManagerMaster.
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(13, aup7964s.unix.anz, 39005, None)
19/08/22 09:45:05 INFO storage.BlockManagerMaster: Removed 13 successfully in removeExecutor
19/08/22 09:45:05 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 10.
19/08/22 09:45:05 INFO scheduler.DAGScheduler: Executor lost: 10 (epoch 8)
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 10 from BlockManagerMaster.
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(10, aup7958s.unix.anz, 34087, None)
19/08/22 09:45:05 INFO storage.BlockManagerMaster: Removed 10 successfully in removeExecutor
19/08/22 09:45:05 INFO cluster.YarnScheduler: Executor 13 on aup7964s.unix.anz killed by driver.
19/08/22 09:45:05 INFO cluster.YarnScheduler: Executor 10 on aup7958s.unix.anz killed by driver.
19/08/22 09:45:05 INFO spark.ExecutorAllocationManager: Existing executor 13 has been removed (new total is 17)
19/08/22 09:45:05 INFO spark.ExecutorAllocationManager: Existing executor 10 has been removed (new total is 16)
19/08/22 09:45:05 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 33.
19/08/22 09:45:05 INFO scheduler.DAGScheduler: Executor lost: 33 (epoch 8)
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 33 from BlockManagerMaster.
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(33, aup7958s.unix.anz, 38670, None)
19/08/22 09:45:05 INFO storage.BlockManagerMaster: Removed 33 successfully in removeExecutor
19/08/22 09:45:05 INFO cluster.YarnScheduler: Executor 33 on aup7958s.unix.anz killed by driver.
19/08/22 09:45:05 INFO spark.ExecutorAllocationManager: Existing executor 33 has been removed (new total is 15)
19/08/22 09:45:05 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 20.
19/08/22 09:45:05 INFO scheduler.DAGScheduler: Executor lost: 20 (epoch 8)
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 20 from BlockManagerMaster.
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(20, aup7964s.unix.anz, 40970, None)
19/08/22 09:45:05 INFO storage.BlockManagerMaster: Removed 20 successfully in removeExecutor
19/08/22 09:45:05 INFO cluster.YarnScheduler: Executor 20 on aup7964s.unix.anz killed by driver.
19/08/22 09:45:05 INFO spark.ExecutorAllocationManager: Existing executor 20 has been removed (new total is 14)
19/08/22 09:45:05 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 36.
19/08/22 09:45:05 INFO scheduler.DAGScheduler: Executor lost: 36 (epoch 8)
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 36 from BlockManagerMaster.
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(36, aup7964s.unix.anz, 45624, None)
19/08/22 09:45:05 INFO storage.BlockManagerMaster: Removed 36 successfully in removeExecutor
19/08/22 09:45:05 INFO cluster.YarnScheduler: Executor 36 on aup7964s.unix.anz killed by driver.
19/08/22 09:45:05 INFO spark.ExecutorAllocationManager: Existing executor 36 has been removed (new total is 13)
19/08/22 09:45:05 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 30.
19/08/22 09:45:05 INFO scheduler.DAGScheduler: Executor lost: 30 (epoch 8)
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 30 from BlockManagerMaster.
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(30, aup7956s.unix.anz, 39620, None)
19/08/22 09:45:05 INFO storage.BlockManagerMaster: Removed 30 successfully in removeExecutor
19/08/22 09:45:05 INFO cluster.YarnScheduler: Executor 30 on aup7956s.unix.anz killed by driver.
19/08/22 09:45:05 INFO spark.ExecutorAllocationManager: Existing executor 30 has been removed (new total is 12)
19/08/22 09:45:05 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 3.
19/08/22 09:45:05 INFO scheduler.DAGScheduler: Executor lost: 3 (epoch 8)
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 3 from BlockManagerMaster.
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(3, aup7715s.unix.anz, 39989, None)
19/08/22 09:45:05 INFO storage.BlockManagerMaster: Removed 3 successfully in removeExecutor
19/08/22 09:45:05 INFO cluster.YarnScheduler: Executor 3 on aup7715s.unix.anz killed by driver.
19/08/22 09:45:05 INFO spark.ExecutorAllocationManager: Existing executor 3 has been removed (new total is 11)
19/08/22 09:45:05 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 5.
19/08/22 09:45:05 INFO scheduler.DAGScheduler: Executor lost: 5 (epoch 8)
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 5 from BlockManagerMaster.
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(5, aup7715s.unix.anz, 33077, None)
19/08/22 09:45:05 INFO storage.BlockManagerMaster: Removed 5 successfully in removeExecutor
19/08/22 09:45:05 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 29.
19/08/22 09:45:05 INFO scheduler.DAGScheduler: Executor lost: 29 (epoch 8)
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 29 from BlockManagerMaster.
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(29, aup7951s.unix.anz, 40289, None)
19/08/22 09:45:05 INFO storage.BlockManagerMaster: Removed 29 successfully in removeExecutor
19/08/22 09:45:05 INFO cluster.YarnScheduler: Executor 5 on aup7715s.unix.anz killed by driver.
19/08/22 09:45:05 INFO cluster.YarnScheduler: Executor 29 on aup7951s.unix.anz killed by driver.
19/08/22 09:45:05 INFO spark.ExecutorAllocationManager: Existing executor 5 has been removed (new total is 10)
19/08/22 09:45:05 INFO spark.ExecutorAllocationManager: Existing executor 29 has been removed (new total is 9)
19/08/22 09:45:05 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 46.
19/08/22 09:45:05 INFO scheduler.DAGScheduler: Executor lost: 46 (epoch 8)
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 46 from BlockManagerMaster.
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(46, aup7962s.unix.anz, 42217, None)
19/08/22 09:45:05 INFO storage.BlockManagerMaster: Removed 46 successfully in removeExecutor
19/08/22 09:45:05 INFO cluster.YarnScheduler: Executor 46 on aup7962s.unix.anz killed by driver.
19/08/22 09:45:05 INFO spark.ExecutorAllocationManager: Existing executor 46 has been removed (new total is 8)
19/08/22 09:45:05 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 9.
19/08/22 09:45:05 INFO scheduler.DAGScheduler: Executor lost: 9 (epoch 8)
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 9 from BlockManagerMaster.
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(9, aup7715s.unix.anz, 34142, None)
19/08/22 09:45:05 INFO storage.BlockManagerMaster: Removed 9 successfully in removeExecutor
19/08/22 09:45:05 INFO cluster.YarnScheduler: Executor 9 on aup7715s.unix.anz killed by driver.
19/08/22 09:45:05 INFO spark.ExecutorAllocationManager: Existing executor 9 has been removed (new total is 7)
19/08/22 09:45:05 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 32.
19/08/22 09:45:05 INFO scheduler.DAGScheduler: Executor lost: 32 (epoch 8)
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 32 from BlockManagerMaster.
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(32, aup7715s.unix.anz, 46845, None)
19/08/22 09:45:05 INFO storage.BlockManagerMaster: Removed 32 successfully in removeExecutor
19/08/22 09:45:05 INFO cluster.YarnScheduler: Executor 32 on aup7715s.unix.anz killed by driver.
19/08/22 09:45:05 INFO spark.ExecutorAllocationManager: Existing executor 32 has been removed (new total is 6)
19/08/22 09:45:05 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 16.
19/08/22 09:45:05 INFO scheduler.DAGScheduler: Executor lost: 16 (epoch 8)
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 16 from BlockManagerMaster.
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(16, aup7715s.unix.anz, 45526, None)
19/08/22 09:45:05 INFO storage.BlockManagerMaster: Removed 16 successfully in removeExecutor
19/08/22 09:45:05 INFO cluster.YarnScheduler: Executor 16 on aup7715s.unix.anz killed by driver.
19/08/22 09:45:05 INFO spark.ExecutorAllocationManager: Existing executor 16 has been removed (new total is 5)
19/08/22 09:45:05 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 44.
19/08/22 09:45:05 INFO scheduler.DAGScheduler: Executor lost: 44 (epoch 8)
19/08/22 09:45:05 ERROR client.TransportResponseHandler: Still have 1 requests outstanding when connection from aup7722s.unix.anz/10.156.4.45:38717 is closed
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 44 from BlockManagerMaster.
19/08/22 09:45:05 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(44, aup7722s.unix.anz, 44051, None)
19/08/22 09:45:05 INFO storage.BlockManagerMaster: Removed 44 successfully in removeExecutor
19/08/22 09:45:05 INFO cluster.YarnScheduler: Executor 44 on aup7722s.unix.anz killed by driver.
19/08/22 09:45:05 INFO spark.ExecutorAllocationManager: Existing executor 44 has been removed (new total is 4)
Traceback (most recent call last):
File "/home/cdsw/pysparkling_xgboost.py", line 40, in
hc = H2OContext.getOrCreate(spark, conf=h2oConf)
File "/home/cdsw/.local/lib/python2.7/site-packages/pysparkling/context.py", line 163, in getOrCreate
jhc = jvm.org.apache.spark.h2o.JavaH2OContext.getOrCreate(jspark_session, selected_conf._jconf)
File "/app/hadoop/parcels/SPARK2-2.3.0.cloudera3-1.cdh5.13.3.p0.458809/lib/spark2/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in call
File "/app/hadoop/parcels/SPARK2-2.3.0.cloudera3-1.cdh5.13.3.p0.458809/lib/spark2/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/app/hadoop/parcels/SPARK2-2.3.0.cloudera3-1.cdh5.13.3.p0.458809/lib/spark2/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.h2o.JavaH2OContext.getOrCreate.
: org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.h2o.backends.internal.InternalH2OBackend$$anonfun$startH2OWorkers$1.apply(InternalH2OBackend.scala:159)
at org.apache.spark.h2o.backends.internal.InternalH2OBackend$$anonfun$startH2OWorkers$1.apply(InternalH2OBackend.scala:157)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
at org.apache.spark.h2o.backends.internal.InternalH2OBackend$.startH2OWorkers(InternalH2OBackend.scala:157)
at org.apache.spark.h2o.backends.internal.InternalH2OBackend$.org$apache$spark$h2o$backends$internal$InternalH2OBackend$$startH2OCluster(InternalH2OBackend.scala:95)
at org.apache.spark.h2o.backends.internal.InternalH2OBackend.init(InternalH2OBackend.scala:74)
at org.apache.spark.h2o.H2OContext.init(H2OContext.scala:128)
at org.apache.spark.h2o.H2OContext$.getOrCreate(H2OContext.scala:396)
at org.apache.spark.h2o.H2OContext.getOrCreate(H2OContext.scala)
at org.apache.spark.h2o.JavaH2OContext.getOrCreate(JavaH2OContext.java:255)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Connection from aup7722s.unix.anz/10.156.4.45:38717 closed
at org.apache.spark.network.client.TransportResponseHandler.channelInactive(TransportResponseHandler.java:146)
at org.apache.spark.network.server.TransportChannelHandler.channelInactive(TransportChannelHandler.java:108)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
at io.netty.handler.timeout.IdleStateHandler.channelInactive(IdleStateHandler.java:277)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
at org.apache.spark.network.util.TransportFrameDecoder.channelInactive(TransportFrameDecoder.java:182)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1354)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:917)
at io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:822)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
... 1 more

@GlockGao (Author) commented Aug 22, 2019

Hi Jakubhava,

Thanks.

I have a question about resource allocation.
Below are all my configurations:
h2oConf = H2OConf(spark) \
    .set('spark.ui.enabled', 'false') \
    .set('spark.ext.h2o.fail.on.unsupported.spark.param', 'false') \
    .set('spark.dynamicAllocation.enabled', 'false') \
    .set('spark.ext.h2o.client.network.mask', '100.66.0.7/24') \
    .set('spark.ext.h2o.node.network.mask', '10.156.4.0/24')

The code is what I copied/pasted before; it just imports a simple file, 'prostate.csv'. The job has been running for more than 30 minutes (and has not finished yet), and the resource occupation is huge: more than 1000 vcores and 2 TB of memory.

Is there a configuration in Sparkling Water or native Spark that limits the resources the job can occupy?
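
As a minimal sketch (assuming the standard Spark properties apply on this cluster; the values are illustrative only), the footprint can be capped by disabling dynamic allocation and fixing the executor count and size:

h2oConf = H2OConf(spark) \
    .set('spark.dynamicAllocation.enabled', 'false') \
    .set('spark.executor.instances', '4') \
    .set('spark.executor.cores', '2') \
    .set('spark.executor.memory', '4g')

Since the internal backend runs the H2O nodes inside the Spark executors, the job should then stay within roughly instances times cores vcores and instances times executor memory.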

Also, is my understanding correct that once the job is preempted by another job, the whole process fails? The next time I re-submitted the job, I encountered this exception again:
Traceback (most recent call last):
File "/home/cdsw/pysparkling_xgboost_test.py", line 33, in
hc = H2OContext.getOrCreate(spark, conf=h2oConf)
File "/home/cdsw/.local/lib/python2.7/site-packages/pysparkling/context.py", line 163, in getOrCreate
jhc = jvm.org.apache.spark.h2o.JavaH2OContext.getOrCreate(jspark_session, selected_conf._jconf)
File "/app/hadoop/parcels/SPARK2-2.3.0.cloudera3-1.cdh5.13.3.p0.458809/lib/spark2/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in call
File "/app/hadoop/parcels/SPARK2-2.3.0.cloudera3-1.cdh5.13.3.p0.458809/lib/spark2/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/app/hadoop/parcels/SPARK2-2.3.0.cloudera3-1.cdh5.13.3.p0.458809/lib/spark2/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.h2o.JavaH2OContext.getOrCreate.
: org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.h2o.backends.internal.InternalH2OBackend$$anonfun$startH2OWorkers$1.apply(InternalH2OBackend.scala:159)
at org.apache.spark.h2o.backends.internal.InternalH2OBackend$$anonfun$startH2OWorkers$1.apply(InternalH2OBackend.scala:157)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
at org.apache.spark.h2o.backends.internal.InternalH2OBackend$.startH2OWorkers(InternalH2OBackend.scala:157)
at org.apache.spark.h2o.backends.internal.InternalH2OBackend$.org$apache$spark$h2o$backends$internal$InternalH2OBackend$$startH2OCluster(InternalH2OBackend.scala:95)
at org.apache.spark.h2o.backends.internal.InternalH2OBackend.init(InternalH2OBackend.scala:74)
at org.apache.spark.h2o.H2OContext.init(H2OContext.scala:128)
at org.apache.spark.h2o.H2OContext$.getOrCreate(H2OContext.scala:396)
at org.apache.spark.h2o.H2OContext.getOrCreate(H2OContext.scala)
at org.apache.spark.h2o.JavaH2OContext.getOrCreate(JavaH2OContext.java:255)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Connection from aup7950s.unix.anz/10.156.4.70:40665 closed
at org.apache.spark.network.client.TransportResponseHandler.channelInactive(TransportResponseHandler.java:146)
at org.apache.spark.network.server.TransportChannelHandler.channelInactive(TransportChannelHandler.java:108)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
at io.netty.handler.timeout.IdleStateHandler.channelInactive(IdleStateHandler.java:277)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
at org.apache.spark.network.util.TransportFrameDecoder.channelInactive(TransportFrameDecoder.java:182)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1354)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:917)
at io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:822)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
... 1 more

Cheers,

@jakubhava (Contributor)

@GlockGao thanks, we have some progress! Could you please send us the complete YARN logs? From the last message I can see that the timeout for starting the worker nodes was reached; the YARN logs should show what prevented the worker nodes from starting.
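
For reference, on a standard YARN setup the aggregated application logs can usually be collected with the YARN CLI, where <applicationId> is a placeholder for the id printed in the Spark output:

yarn logs -applicationId <applicationId> > yarn_app.log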

Thank you!

Kuba

@GlockGao (Author)

Hi Kuba,

I checked the log; it seems there is an OOM issue. The exception is below:
19/08/23 01:20:52 INFO scheduler.DAGScheduler: Executor lost: 19 (epoch 3)
19/08/23 01:20:52 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 19 from BlockManagerMaster.
19/08/23 01:20:52 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(19, aup7725s.unix.anz, 36525, None)
19/08/23 01:20:52 INFO storage.BlockManagerMaster: Removed 19 successfully in removeExecutor
19/08/23 01:20:52 INFO cluster.YarnScheduler: Executor 19 on aup7725s.unix.anz killed by driver.
19/08/23 01:20:52 INFO spark.ExecutorAllocationManager: Existing executor 19 has been removed (new total is 1)
19/08/23 01:20:52 ERROR client.TransportResponseHandler: Still have 2 requests outstanding when connection from /100.66.128.0:53288 is closed
19/08/23 01:20:52 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 30.
19/08/23 01:20:52 WARN storage.BlockManagerMasterEndpoint: Error trying to remove broadcast 5 from block manager BlockManagerId(30, aup7725s.unix.anz, 34934, None)
java.io.IOException: Connection from /100.66.128.0:53288 closed
at org.apache.spark.network.client.TransportResponseHandler.channelInactive(TransportResponseHandler.java:146)
at org.apache.spark.network.server.TransportChannelHandler.channelInactive(TransportChannelHandler.java:108)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
at io.netty.handler.timeout.IdleStateHandler.channelInactive(IdleStateHandler.java:277)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
at org.apache.spark.network.util.TransportFrameDecoder.channelInactive(TransportFrameDecoder.java:182)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1354)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:917)
at io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:822)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
at java.lang.Thread.run(Thread.java:745)
19/08/23 01:20:52 INFO scheduler.DAGScheduler: Executor lost: 30 (epoch 3)
19/08/23 01:20:52 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 30 from BlockManagerMaster.
19/08/23 01:20:52 WARN storage.BlockManagerMasterEndpoint: No more replicas available for broadcast_5_piece0 !
19/08/23 01:20:52 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(30, aup7725s.unix.anz, 34934, None)
19/08/23 01:20:52 INFO storage.BlockManagerMaster: Removed 30 successfully in removeExecutor
19/08/23 01:20:52 INFO cluster.YarnScheduler: Executor 30 on aup7725s.unix.anz killed by driver.
19/08/23 01:20:52 INFO spark.ExecutorAllocationManager: Existing executor 30 has been removed (new total is 0)
19/08/23 01:20:52 INFO cluster.YarnClientSchedulerBackend: Interrupting monitor thread
19/08/23 01:20:52 INFO cluster.YarnClientSchedulerBackend: Shutting down all executors
19/08/23 01:20:52 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
19/08/23 01:20:52 INFO cluster.SchedulerExtensionServices: Stopping SchedulerExtensionServices
(serviceOption=None,
services=List(),
started=false)
19/08/23 01:20:52 INFO cluster.YarnClientSchedulerBackend: Stopped
19/08/23 01:20:52 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
19/08/23 01:20:52 INFO memory.MemoryStore: MemoryStore cleared
19/08/23 01:20:52 INFO storage.BlockManager: BlockManager stopped
19/08/23 01:20:52 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
19/08/23 01:20:52 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
19/08/23 01:20:52 INFO spark.SparkContext: Successfully stopped SparkContext
19/08/23 01:20:52 INFO util.ShutdownHookManager: Shutdown hook called

I hope we can run PySparkling with a limited resource occupation :)

Cheers,
pysparkling.log

@GlockGao (Author) commented Aug 23, 2019

Hi Kuba,

Thanks very much in advance!

In the meantime, I have two other questions; both concern a similar dependency-distribution issue.

  1. We plan to submit the Sparkling Water job from the edge node in client or cluster mode, but currently we cannot install the relevant Sparkling Python packages on either the edge node or the cluster data nodes.
    Here is my proposed solution; is it feasible?
    (1) Install the Sparkling Python packages on one engine machine that also belongs to the CDH cluster.
    (2) Zip the relevant Python packages into pysparkling.zip.
    (3) Transfer pysparkling.zip to the edge node and execute the command:
    spark2-submit --deploy-mode client|cluster --py-files pysparkling.zip pysparkling_test.py

If that solution is feasible, I'm not sure how to zip the packages. I have listed the relevant information below; can you help check it?

cdsw@lqzdu94xj8yej318:~/.local/lib/python2.7/site-packages$ pwd
/home/cdsw/.local/lib/python2.7/site-packages

cdsw@lqzdu94xj8yej318:~/.local/lib/python2.7/site-packages$ ls -lrt | grep -i h2o
drwxr-xr-x  2 cdsw cdsw  4096 Aug 15 09:06 h2o_pysparkling_2.3-3.26.2.dist-info
drwxr-xr-x 14 cdsw cdsw  4096 Aug 15 09:38 h2o
drwxr-xr-x  2 cdsw cdsw  4096 Aug 15 09:38 h2o-3.26.0.2.dist-info

cdsw@lqzdu94xj8yej318:~/.local/lib/python2.7/site-packages$ ls -lrt | grep -i sparkling
drwxr-xr-x  2 cdsw cdsw  4096 Aug 15 09:06 sparkling_water
drwxr-xr-x  3 cdsw cdsw  4096 Aug 15 09:06 pysparkling
drwxr-xr-x  3 cdsw cdsw  4096 Aug 15 09:06 py_sparkling
drwxr-xr-x  2 cdsw cdsw  4096 Aug 15 09:06 h2o_pysparkling_2.3-3.26.2.dist-info
  2. Submit the Sparkling job in cluster mode from a machine on which the Sparkling package is already installed. The proposed solution is almost the same as above, with the same questions :)

Cheers,

@jakubhava (Contributor)

Yup, it's an OOM -> 08-23 11:19:55.978 10.156.4.82:54321 35683 #t-loop-7 INFO: Java heap maxMemory: 910.5 MB. A Java heap of less than 1 GB is really small.
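
As a minimal sketch of a fix (assuming the standard Spark 2.3 memory properties; the values below are illustrative, not a recommendation), the executor heap hosting each H2O node can be raised at submit time:

spark2-submit --conf spark.executor.memory=8g --conf spark.yarn.executor.memoryOverhead=2g --py-files h2o_pysparkling_2.3-3.26.2-2.3.zip pysparkling_xgboost_test.py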

Regarding the distribution: if you use --py-files, then you should be fine, as it distributes the dependencies to the whole cluster. I also suggest not trying to manually install and zip the package; instead, download the official PySparkling zip distribution from our release page and pass that to --py-files.

@GlockGao (Author)

Hi Kuba,

Regarding the distribution issue, maybe I didn't take the right approach.
It seems I didn't distribute the dependencies correctly. I submitted the job on the edge node.

  1. client mode:
    spark2-submit --deploy-mode client --py-files sparkling-water-2.2.10.zip pysparkling_xgboost_test.py
    The error is below:
    Traceback (most recent call last):
    File "/var/tmp/gaoy8/pysparkling_xgboost_test.py", line 5, in
    from pysparkling import *
    ImportError: No module named pysparkling
    19/08/23 15:36:33 INFO util.ShutdownHookManager: Shutdown hook called
    19/08/23 15:36:33 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-77f3fdfb-6cbd-417d-b2e2-9862c3efb109

  2. cluster mode: spark2-submit --deploy-mode cluster --py-files sparkling-water-2.2.10.zip pysparkling_xgboost_test.py. The error is the same:
    Log Type: stdout

Log Upload Time: Fri Aug 23 15:29:22 +1000 2019

Log Length: 164

Traceback (most recent call last):
File "pysparkling_xgboost_test.py", line 5, in
from pysparkling import *
ImportError: No module named pysparkling

Cheers,

@jakubhava (Contributor) commented Aug 23, 2019

Please give it a try with the official PySparkling zip file downloaded from here: https://www.h2o.ai/download/#sparkling-water (instructions are on the download page, on the PySparkling tab). I can see that yours does not have the right name, so I assume it was either renamed or built manually, which is always a bit harder to support.

Also, you can mimic what we do in our pysparkling launcher (add the zip file to the Python path):

PYTHONPATH=sparkling-water-2.2.10.zip:$PYTHONPATH spark2-submit --deploy-mode client --py-files sparkling-water-2.2.10.zip pysparkling_xgboost_test.py

We have seen issues on older versions where this Python path configuration was necessary.

@GlockGao (Author)

Hi

I downloaded sparkling-water-3.26.2-2.3.zip from https://www.h2o.ai/download/#sparkling-water.

I got the same errors :(

The error is below:
Traceback (most recent call last):
File "/var/tmp/gaoy8/pysparkling_xgboost_test.py", line 5, in 
from pysparkling import *
ImportError: No module named pysparkling

with the following commands:
PYTHONPATH=sparkling-water-3.26.2-2.3.zip:$PYTHONPATH spark2-submit --deploy-mode client --py-files sparkling-water-3.26.2-2.3.zip pysparkling_xgboost_test.py
or
spark2-submit --deploy-mode client --py-files sparkling-water-3.26.2-2.3.zip pysparkling_xgboost_test.py

@GlockGao (Author)

Hi,

Thanks very much.

I extracted 'h2o_pysparkling_2.3-3.26.2-2.3.zip' from 'sparkling-water-3.26.2-2.3.zip' and executed the following command:
spark2-submit --master local[*] --py-files h2o_pysparkling_2.3-3.26.2-2.3.zip pysparkling_xgboost_test.py

It still fails, but with a different error. Do you have any ideas?
fb2c5a13/sparkling-water-3.26.2-2.3.zip
Traceback (most recent call last):
File "/var/tmp/gaoy8/pysparkling_xgboost_test.py", line 19, in
from pysparkling import *
File "/tmp/spark-3e0904e9-0f35-4a84-9ba8-da154eea0036/userFiles-031ee51a-0475-472b-ab18-0581fb2c5a13/h2o_pysparkling_2.3-3.26.2-2.3.zip/pysparkling/init.py", line 44, in
File "/tmp/spark-3e0904e9-0f35-4a84-9ba8-da154eea0036/userFiles-031ee51a-0475-472b-ab18-0581fb2c5a13/h2o_pysparkling_2.3-3.26.2-2.3.zip/pysparkling/context.py", line 5, in
File "/tmp/spark-3e0904e9-0f35-4a84-9ba8-da154eea0036/userFiles-031ee51a-0475-472b-ab18-0581fb2c5a13/h2o_pysparkling_2.3-3.26.2-2.3.zip/h2o/init.py", line 11, in
File "/tmp/spark-3e0904e9-0f35-4a84-9ba8-da154eea0036/userFiles-031ee51a-0475-472b-ab18-0581fb2c5a13/h2o_pysparkling_2.3-3.26.2-2.3.zip/h2o/h2o.py", line 15, in
File "/tmp/spark-3e0904e9-0f35-4a84-9ba8-da154eea0036/userFiles-031ee51a-0475-472b-ab18-0581fb2c5a13/h2o_pysparkling_2.3-3.26.2-2.3.zip/h2o/backend/init.py", line 42, in
File "/tmp/spark-3e0904e9-0f35-4a84-9ba8-da154eea0036/userFiles-031ee51a-0475-472b-ab18-0581fb2c5a13/h2o_pysparkling_2.3-3.26.2-2.3.zip/h2o/backend/cluster.py", line 10, in
File "/tmp/spark-3e0904e9-0f35-4a84-9ba8-da154eea0036/userFiles-031ee51a-0475-472b-ab18-0581fb2c5a13/h2o_pysparkling_2.3-3.26.2-2.3.zip/h2o/display.py", line 10, in
ImportError: No module named tabulate

Cheers,

@jakubhava (Contributor)

@GlockGao Yes, previously you were putting the wrong file on --py-files; you need to extract the PySparkling zip file from the distribution. The last exception just says that the Python dependencies for PySparkling are not installed. The dependencies have to be installed on each node of the cluster as well, or also passed via --py-files. See the list of dependencies in the PySparkling docs: http://docs.h2o.ai/sparkling-water/2.4/latest-stable/doc/pysparkling.html
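
For illustration, installing the runtime dependencies on every node would look roughly like the line below. The package names are an assumption based on the errors seen in this thread (tabulate missing, 'future' patched); the authoritative list is in the linked docs:

pip install requests tabulate future colorama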

@GlockGao (Author)

Hi Kuba,

Thanks very much for your support!!

Finally, I can run Sparkling Water in local mode on the edge node (though I even had to modify the source code of the 'future' package).

But there are still problems in client and cluster mode.

  1. Sometimes it seems that H2OContext.getOrCreate() is successful. The logs are below:
Sparkling Water Context:
 * Sparkling Water Version: 3.26.2-2.3
 * H2O name: sparkling-water-gaoy8_application_1563277926760_217016
 * cluster size: 14
 * list of used nodes:
  (executorId, host, port)
  ------------------------
  (9,aup7959s.unix.anz,54321)
  (7,aup7714s.unix.anz,54321)
  (10,aup7714s.unix.anz,54323)
  (1,aup7722s.unix.anz,54321)
  (11,aup7964s.unix.anz,54323)
  (14,aup7963s.unix.anz,54321)
  (8,aup7716s.unix.anz,54321)
  (6,aup7959s.unix.anz,54323)
  (5,aup7718s.unix.anz,54321)
  (4,aup7716s.unix.anz,54323)
  (12,aup7963s.unix.anz,54323)
  (3,aup7718s.unix.anz,54323)
  (2,aup7716s.unix.anz,54325)
  (13,aup7715s.unix.anz,54321)
  ------------------------

  Open H2O Flow in browser: http://10.156.4.49:54321 (CMD + click in Mac OSX)

    
 * Yarn App ID of Spark application: application_1563277926760_217016
    
Connecting to H2O server at http://10.156.4.49:54321 ... successful.
--------------------------  ------------------------------------------------------
H2O cluster uptime:         15 secs
H2O cluster timezone:       Australia/Melbourne
H2O data parsing timezone:  UTC
H2O cluster version:        3.26.0.2
H2O cluster version age:    30 days
H2O cluster name:           sparkling-water-gaoy8_application_1563277926760_217016
H2O cluster total nodes:    14
H2O cluster free memory:    12.59 Gb
H2O cluster total cores:    896
H2O cluster allowed cores:  56
H2O cluster status:         accepting new members, healthy
H2O connection url:         http://10.156.4.49:54321
H2O connection proxy:
H2O internal security:      False
H2O API Extensions:         XGBoost, Algos, Amazon S3, AutoML, Core V3, Core V4
Python version:             2.7.13 final
--------------------------  ------------------------------------------------------

(The Sparkling Water Context block above is then printed three more times, and the connection summary once more, all verbatim repeats of the output above.)
  2. But the most common case is that H2OContext.getOrCreate() fails. The full log is below:
Traceback (most recent call last):
  File "/var/tmp/gaoy8/pysparkling_xgboost_test.py", line 54, in <module>
    hc = H2OContext.getOrCreate(spark, conf=h2oConf)
  File "h2o_pysparkling_2.3-3.26.2-2.3.zip/pysparkling/context.py", line 163, in getOrCreate
  File "/app/hadoop/parcels/SPARK2-2.3.0.cloudera3-1.cdh5.13.3.p0.458809/lib/spark2/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/app/hadoop/parcels/SPARK2-2.3.0.cloudera3-1.cdh5.13.3.p0.458809/lib/spark2/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/app/hadoop/parcels/SPARK2-2.3.0.cloudera3-1.cdh5.13.3.p0.458809/lib/spark2/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.h2o.JavaH2OContext.getOrCreate.
: java.lang.RuntimeException: Cloud size 1 under 21
        at water.H2O.waitForCloudSize(H2O.java:1819)
        at org.apache.spark.h2o.backends.internal.InternalH2OBackend$.org$apache$spark$h2o$backends$internal$InternalH2OBackend$$startH2OCluster(InternalH2OBackend.scala:102)
        at org.apache.spark.h2o.backends.internal.InternalH2OBackend.init(InternalH2OBackend.scala:74)
        at org.apache.spark.h2o.H2OContext.init(H2OContext.scala:128)
        at org.apache.spark.h2o.H2OContext$.getOrCreate(H2OContext.scala:396)
        at org.apache.spark.h2o.H2OContext.getOrCreate(H2OContext.scala)
        at org.apache.spark.h2o.JavaH2OContext.getOrCreate(JavaH2OContext.java:255)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:745)

Actually I ran the Spark job on the same cluster, so why is the number of H2O workers (the cluster size) sometimes 14 and sometimes 21?

Currently I am only using a very small dataset; can I configure the number of workers? My configuration code is below. Once I uncomment the last line, it raises the exception h2o.exceptions.H2OServerError: Cluster cannot reach consensus.
h2oConf = H2OConf(spark) \
    .set('spark.ui.enabled', 'false') \
    .set('spark.ext.h2o.fail.on.unsupported.spark.param', 'false') \
    .set('spark.dynamicAllocation.enabled', 'false') \
    .set('spark.scheduler.minRegisteredResourcesRatio', '1') \
    .set('spark.sql.autoBroadcastJoinThreshold', '-1') \
    .set('spark.locality.wait', '0') \
    .set('spark.executor.cores', '4') \
    .set('spark.executor.instances', '2')
    # .set('spark.ext.h2o.cluster.size', '2')
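My guess (please correct me) is that the varying size comes from spark.dynamicAllocation.enabled: the internal backend starts one H2O node per executor, so when Spark adds or removes executors, the cloud size moves with it. That would also explain "Cloud size 1 under 21": H2O waited for as many nodes as there were executors at startup (21), but only 1 joined within the timeout. I also understand that static settings such as spark.executor.instances only take effect if they are set before the SparkSession starts (on spark2-submit or the builder), not on H2OConf afterwards. A minimal sketch of what I mean, assuming the internal backend:

from pyspark.sql import SparkSession
from pysparkling import H2OConf, H2OContext

# Fix the executor count up front so the H2O cloud size is predictable.
spark = SparkSession.builder \
    .appName('SparklingWaterApp') \
    .config('spark.dynamicAllocation.enabled', 'false') \
    .config('spark.executor.instances', '2') \
    .config('spark.executor.cores', '4') \
    .getOrCreate()

h2oConf = H2OConf(spark) \
    .set('spark.ext.h2o.fail.on.unsupported.spark.param', 'false') \
    .set('spark.locality.wait', '0')

# With 2 executors, the internal backend should form a 2-node H2O cloud.
hc = H2OContext.getOrCreate(spark, conf=h2oConf)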

@GlockGao
Author

Hi Kuba,

After applying the following configuration, I can run the Sparkling Water job in client mode:

spark = SparkSession \
  .builder \
  .config('spark.port.maxRetries', '32') \
  .config('spark.executor.instances', '3') \
  .config('spark.executor.memory', '3G') \
  .config('spark.driver.memory', '8G') \
  .appName('SparklingWaterApp') \
  .getOrCreate()
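With three executors, the internal backend should come up as a three-node H2O cloud; a quick sanity check (a sketch, reusing the h2oConf from my earlier comment):

hc = H2OContext.getOrCreate(spark, conf=h2oConf)
print(hc)  # should print the 'Sparkling Water Context' block with the node list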

But during the context close process, it still raises some exceptions:

19/08/26 19:38:08 INFO spark.ContextCleaner: Cleaned accumulator 570
19/08/26 19:38:08 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 1.
19/08/26 19:38:08 INFO storage.BlockManagerInfo: Removed broadcast_23_piece0 on aup7726s.unix.anz:45082 in memory (size: 61.7 KB, free: 366.2 MB)
19/08/26 19:38:08 ERROR client.TransportClient: Failed to send RPC 8333763389007815444 to /10.156.4.34:49064: java.nio.channels.ClosedChannelException
java.nio.channels.ClosedChannelException
        at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
19/08/26 19:38:08 ERROR client.TransportClient: Failed to send RPC 5006121310018069158 to /10.156.4.85:54938: java.nio.channels.ClosedChannelException
java.nio.channels.ClosedChannelException
        at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
19/08/26 19:38:08 WARN storage.BlockManagerMasterEndpoint: Error trying to remove broadcast 23 from block manager BlockManagerId(1, aup7711s.unix.anz, 44482, None)
java.io.IOException: Failed to send RPC 8333763389007815444 to /10.156.4.34:49064: java.nio.channels.ClosedChannelException
        at org.apache.spark.network.client.TransportClient.lambda$sendRpc$2(TransportClient.java:237)
        at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
        at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
        at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
        at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:122)
        at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:987)
        at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:869)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1316)
        at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
        at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730)
        at io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:38)
        at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:1081)
        at io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1128)
        at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1070)
        at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
        at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
        at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.nio.channels.ClosedChannelException
        at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
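One thing I plan to try (just a guess on my side, and assuming this PySparkling version exposes H2OContext.stop()): shutting H2O down explicitly before Spark, so the executors are not torn down underneath the H2O nodes:

# Hypothetical cleanup order: stop H2O first, then Spark.
hc.stop()
spark.stop()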

Cheers,

@GlockGao
Author

Hi Kuba,

Thanks very much for your support.
Eventually I was able to submit the Sparkling Water job successfully.

Cheers
