
Validation failed for GlutenHashAggregateExecTransformer class #729

Closed
obedmr opened this issue Dec 16, 2022 · 4 comments
Labels
bug Something isn't working velox backend works for Velox backend

Comments


obedmr commented Dec 16, 2022

Describe the bug

I've been trying to run the TPC-H benchmark on a Spark cluster with 1 master and 1 worker, using the Gluten+Velox configuration.
When running the benchmark, I'm seeing a couple of issues in the benchmark's logs. Any hint or suggestion is welcome.

1. Validation failed for GlutenHashAggregateExecTransformer class
Below is part of the log, along with a couple of subsequent entries; I'm not sure whether they are generated because of the failed validation.

benchmark-app 22/12/16 22:16:25 DEBUG GlutenHashAggregateExecTransformer: Validation failed for class io.glutenproject.execution.GlutenHashAggregateExecTransformer due to Could not initialize class io.glutenproject.expression.ExpressionMappings$
benchmark-app 22/12/16 22:16:25 DEBUG TransformPreOverrides: Columnar Processing for class org.apache.spark.sql.execution.exchange.ShuffleExchangeExec is currently supported.
benchmark-app 22/12/16 22:16:25 DEBUG TransformPreOverrides: Columnar Processing for class org.apache.spark.sql.execution.aggregate.HashAggregateExec is under row guard.
benchmark-app 22/12/16 22:16:25 DEBUG TransformPreOverrides: Transformation for class org.apache.spark.sql.execution.LocalTableScanExec is currently not supported.

2. Failed stage caused by an exception

benchmark-driver 22/12/07 23:27:53 INFO TaskSetManager: Lost task 0.3 in stage 33.0 (TID 86) on 172.17.0.11, executor 0: java.lang.ArrayIndexOutOfBoundsException (-1) [duplicate 3]
benchmark-driver 22/12/07 23:27:53 ERROR TaskSetManager: Task 0 in stage 33.0 failed 4 times; aborting job
benchmark-driver 22/12/07 23:27:53 INFO TaskSchedulerImpl: Removed TaskSet 33.0, whose tasks have all completed, from pool 
benchmark-driver 22/12/07 23:27:53 INFO TaskSchedulerImpl: Cancelling stage 33
benchmark-driver 22/12/07 23:27:53 INFO TaskSchedulerImpl: Killing all running tasks in stage 33: Stage cancelled
benchmark-driver 22/12/07 23:27:53 INFO DAGScheduler: ResultStage 33 (collect at VeloxSparkPlanExecApi.scala:231) failed in 0.124 s due to Job aborted due to stage failure: Task 0 in stage 33.0 failed 4 times, most recent failure: Lost task 0.3 in stage 33.0 (TID 86) (172.17.0.11 executor 0): java.lang.ArrayIndexOutOfBoundsException: -1
benchmark-driver     at org.apache.spark.util.ExecutorManager$.tryTaskSet(ExecutorManager.scala:49)
benchmark-driver     at io.glutenproject.execution.NativeWholeStageColumnarRDD.compute(NativeWholeStageColumnarRDD.scala:112)
benchmark-driver     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
benchmark-driver     at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
benchmark-driver     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
benchmark-driver     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
benchmark-driver     at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
benchmark-driver     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
benchmark-driver     at org.apache.spark.scheduler.Task.run(Task.scala:136)
benchmark-driver     at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
benchmark-driver     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
benchmark-driver     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
benchmark-driver     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
benchmark-driver     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
benchmark-driver     at java.lang.Thread.run(Thread.java:750)
benchmark-driver 
benchmark-driver Driver stacktrace:
benchmark-driver 22/12/07 23:27:53 DEBUG DAGScheduler: After removal of stage 33, remaining stages = 1
benchmark-driver 22/12/07 23:27:53 INFO DAGScheduler: Job 33 failed: collect at VeloxSparkPlanExecApi.scala:231, took 0.127070 s
benchmark-driver 22/12/07 23:27:53 INFO DAGScheduler: Asked to cancel job group 6bd9d826-a85b-40b3-a932-23f2a92117b6

To Reproduce

Basically, it's a cluster with 1 master and 1 worker, deployed on Kubernetes, not using YARN. Below is the section of my Spark configuration file that applies to Gluten:

    spark.driver.extraClassPath  /opt/gluten-spark3.3.jar:/opt/velox-package-spark33.jar:/opt/arrow-c-data.jar
    spark.executor.extraClassPath  /opt/gluten-spark3.3.jar:/opt/velox-package-spark33.jar:/opt/arrow-c-data.jar
    spark.memory.offHeap.size=30G
    spark.sql.sources.useV1SourceList=avro
    spark.sql.join.preferSortMergeJoin=false
    spark.plugins=io.glutenproject.GlutenPlugin
    spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager
    spark.gluten.sql.columnar.batchscan=true
    spark.gluten.sql.columnar.hashagg=true
    spark.gluten.sql.columnar.projfilter=true
    spark.gluten.sql.columnar.codegen.sort=true
    spark.gluten.sql.columnar.window=true
    spark.gluten.sql.columnar.shuffledhashjoin=true
    spark.gluten.sql.columnar.forceshuffledhashjoin=true
    spark.gluten.sql.columnar.sort=true
    spark.gluten.sql.columnar.sortmergejoin=true
    spark.gluten.sql.columnar.union=true
    spark.gluten.sql.columnar.expand=true
    spark.gluten.sql.columnar.broadcastexchange=true
    spark.gluten.sql.columnar.broadcastJoin=true
    spark.gluten.sql.columnar.wholestagetransform=false
    spark.gluten.sql.columnar.wholestagecodegen.breakdownTime=false
    spark.gluten.sql.columnar.shuffle.customizedCompression.codec="lz4"
    spark.gluten.sql.columnar.numaBinding=true
    spark.gluten.sql.columnar.coreRange="0-17,36-53 |18-35,54-71"
    spark.sql.execution.arrow.maxRecordsPerBatch=10000

    spark.executor.memoryOverhead=1g
    spark.memory.offHeap.enabled=true
    spark.gluten.sql.columnar.shuffleSplitDefaultSize=8192
    spark.gluten.sql.columnar.numaBinding=true
    spark.gluten.sql.columnar.backend.lib=velox
    spark.gluten.soft-affinity.enabled=true
    spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager
    spark.sql.parquet.columnarReaderBatchSize=10240
    spark.sql.inMemoryColumnarStorage.batchSize=10240
    spark.sql.execution.arrow.maxRecordsPerBatch=10240
    spark.sql.files.maxPartitionBytes=2g
    spark.sql.autoBroadcastJoinThreshold=10M
    spark.sql.broadcastTimeout=4800
    spark.driver.maxResultSize=4g
    spark.executorEnv.LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2

Expected behavior
I would expect the Spark cluster to accept the TPC-H workload and generate results.

Additional context
I'm attaching a couple of log files that show the first and the second issues.
issue1.log
issue2.log

obedmr added the bug label Dec 16, 2022
zhouyuan (Contributor) commented:

Hi @obedmr
Could not initialize class io.glutenproject.expression.ExpressionMappings$
It looks like the issue is due to some missing/incompatible class. Can you please try the fat jar, which contains all the dependencies?

Here's one example

spark.driver.extraClassPath /home/sparkuser/nativesql_jars/gluten-spark3.3_2.12-1.0.0-SNAPSHOT-jar-with-dependencies.jar
spark.executor.extraClassPath /home/sparkuser/nativesql_jars/gluten-spark3.3_2.12-1.0.0-SNAPSHOT-jar-with-dependencies.jar
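One quick way to test the missing-class theory is to check whether the class is actually present in the jar on the classpath. A minimal sketch (the jar path is the one from the config in this issue; `has_class` is a hypothetical helper, and `unzip -l` is just one way to list a jar's contents — `jar tf` works equally well):

```shell
# Hypothetical helper: check whether a class file is present inside a jar.
# $1 = path to the jar, $2 = class path fragment to look for.
has_class() {
  unzip -l "$1" | grep -q "$2"
}

# Usage on a live system (jar path taken from this issue's config):
#   has_class /opt/gluten-spark3.3.jar 'io/glutenproject/expression/ExpressionMappings' \
#     && echo "class present" || echo "class missing"
```

If the class is missing from the thin jar but present in the jar-with-dependencies build, that would confirm the classpath is the problem.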

Note that Gluten currently only supports Scala 2.12, so please also check that your Spark build uses the correct Scala version.
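As a sketch of that check (the banner format is assumed from standard Apache Spark builds, e.g. "Using Scala version 2.12.15, ..."; adjust the parsing if your distribution prints it differently):

```shell
# Extract the Scala version from a spark-submit/spark-shell version banner
# read on stdin, e.g. "Using Scala version 2.12.15, OpenJDK 64-Bit Server VM".
scala_version() {
  grep -io 'scala version [0-9.]*' | head -n1 | awk '{print $3}'
}

# Usage on a live system:
#   spark-submit --version 2>&1 | scala_version   # Gluten needs 2.12.x
```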

For the second issue, it's caused by a missing NUMA core range in the Spark conf.
Here's one example from my testing server:

spark.gluten.sql.columnar.numaBinding=true
spark.gluten.sql.columnar.coreRange= 0-35,72-107|36-71,108-143

The core range expects two NUMA nodes split by "|"; you may need to fill in the right numbers for your system:
lscpu | grep "NUMA node" | tail -n2
If your system does not support NUMA, simply disable the NUMA binding feature.
Note that this is an experimental feature; it may hurt performance when running in some cloud environments.
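The lscpu step above can be turned into the conf value mechanically. A minimal sketch, assuming the standard util-linux lscpu output format and exactly two NUMA nodes (`to_core_range` is a hypothetical helper, not part of Gluten):

```shell
# Build the spark.gluten.sql.columnar.coreRange value from lscpu output.
# Reads lscpu-style text on stdin and joins the per-node CPU ranges with "|".
to_core_range() {
  grep 'NUMA node[0-9]' | awk -F: '{gsub(/[[:space:]]/, "", $2); print $2}' | paste -sd'|' -
}

# Usage on a live system:
#   echo "spark.gluten.sql.columnar.coreRange=$(lscpu | to_core_range)"
```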

Thanks, -yuan

@weiting-chen weiting-chen added the velox backend works for Velox backend label Mar 1, 2023

liralon commented Mar 19, 2023

The validation failure in GlutenHashAggregateExecTransformer due to failing to initialize class io.glutenproject.expression.ExpressionMappings$ seems similar to the bug I found and root-caused in #1161. Consider checking whether the workaround described there solves your issue.


jinchengchenghh commented Apr 4, 2023

Can you try the solution? If it solves the problem, I will close this issue. @obedmr
In addition, we should use mvn clean install to build the Java project.


obedmr commented Apr 4, 2023

Hi @jinchengchenghh, I'm sorry I haven't replied before. The issue has been solved: some other packages that were included were causing it, along with the missing clean build you just mentioned. Thanks a lot for all the support.

obedmr closed this as completed Apr 4, 2023