
Validation failed for GlutenHashAggregateExecTransformer class #729

Closed
obedmr opened this issue Dec 16, 2022 · 4 comments
Labels
bug Something isn't working velox backend works for Velox backend

Comments


obedmr commented Dec 16, 2022

Describe the bug

I've been trying to run the TPC-H benchmark on a Spark cluster with 1 master and 1 worker, using the Gluten+Velox configuration.
When running the benchmark, I'm seeing a couple of issues in the benchmark's logs. Any hint or suggestion is welcome.

1. Validation failed for GlutenHashAggregateExecTransformer class
Below is part of the log, along with a couple of subsequent entries; I'm not sure whether they are generated because of the failed validation.

benchmark-app 22/12/16 22:16:25 DEBUG GlutenHashAggregateExecTransformer: Validation failed for class io.glutenproject.execution.GlutenHashAggregateExecTransformer due to Could not initialize class io.glutenproject.expression.ExpressionMappings$
benchmark-app 22/12/16 22:16:25 DEBUG TransformPreOverrides: Columnar Processing for class org.apache.spark.sql.execution.exchange.ShuffleExchangeExec is currently supported.
benchmark-app 22/12/16 22:16:25 DEBUG TransformPreOverrides: Columnar Processing for class org.apache.spark.sql.execution.aggregate.HashAggregateExec is under row guard.
benchmark-app 22/12/16 22:16:25 DEBUG TransformPreOverrides: Transformation for class org.apache.spark.sql.execution.LocalTableScanExec is currently not supported.

2. Failed stage caused by an exception

benchmark-driver 22/12/07 23:27:53 INFO TaskSetManager: Lost task 0.3 in stage 33.0 (TID 86) on 172.17.0.11, executor 0: java.lang.ArrayIndexOutOfBoundsException (-1) [duplicate 3]
benchmark-driver 22/12/07 23:27:53 ERROR TaskSetManager: Task 0 in stage 33.0 failed 4 times; aborting job
benchmark-driver 22/12/07 23:27:53 INFO TaskSchedulerImpl: Removed TaskSet 33.0, whose tasks have all completed, from pool 
benchmark-driver 22/12/07 23:27:53 INFO TaskSchedulerImpl: Cancelling stage 33
benchmark-driver 22/12/07 23:27:53 INFO TaskSchedulerImpl: Killing all running tasks in stage 33: Stage cancelled
benchmark-driver 22/12/07 23:27:53 INFO DAGScheduler: ResultStage 33 (collect at VeloxSparkPlanExecApi.scala:231) failed in 0.124 s due to Job aborted due to stage failure: Task 0 in stage 33.0 failed 4 times, most recent failure: Lost task 0.3 in stage 33.0 (TID 86) (172.17.0.11 executor 0): java.lang.ArrayIndexOutOfBoundsException: -1
benchmark-driver     at org.apache.spark.util.ExecutorManager$.tryTaskSet(ExecutorManager.scala:49)
benchmark-driver     at io.glutenproject.execution.NativeWholeStageColumnarRDD.compute(NativeWholeStageColumnarRDD.scala:112)
benchmark-driver     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
benchmark-driver     at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
benchmark-driver     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
benchmark-driver     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
benchmark-driver     at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
benchmark-driver     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
benchmark-driver     at org.apache.spark.scheduler.Task.run(Task.scala:136)
benchmark-driver     at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
benchmark-driver     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
benchmark-driver     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
benchmark-driver     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
benchmark-driver     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
benchmark-driver     at java.lang.Thread.run(Thread.java:750)
benchmark-driver 
benchmark-driver Driver stacktrace:
benchmark-driver 22/12/07 23:27:53 DEBUG DAGScheduler: After removal of stage 33, remaining stages = 1
benchmark-driver 22/12/07 23:27:53 INFO DAGScheduler: Job 33 failed: collect at VeloxSparkPlanExecApi.scala:231, took 0.127070 s
benchmark-driver 22/12/07 23:27:53 INFO DAGScheduler: Asked to cancel job group 6bd9d826-a85b-40b3-a932-23f2a92117b6

To Reproduce

Basically, it's a cluster with 1 master and 1 worker, deployed on Kubernetes, not using YARN. Below is the section of my Spark configuration file that applies to Gluten:

    spark.driver.extraClassPath  /opt/gluten-spark3.3.jar:/opt/velox-package-spark33.jar:/opt/arrow-c-data.jar
    spark.executor.extraClassPath  /opt/gluten-spark3.3.jar:/opt/velox-package-spark33.jar:/opt/arrow-c-data.jar
    spark.memory.offHeap.size=30G
    spark.sql.sources.useV1SourceList=avro
    spark.sql.join.preferSortMergeJoin=false
    spark.plugins=io.glutenproject.GlutenPlugin
    spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager
    spark.gluten.sql.columnar.batchscan=true
    spark.gluten.sql.columnar.hashagg=true
    spark.gluten.sql.columnar.projfilter=true
    spark.gluten.sql.columnar.codegen.sort=true
    spark.gluten.sql.columnar.window=true
    spark.gluten.sql.columnar.shuffledhashjoin=true
    spark.gluten.sql.columnar.forceshuffledhashjoin=true
    spark.gluten.sql.columnar.sort=true
    spark.gluten.sql.columnar.sortmergejoin=true
    spark.gluten.sql.columnar.union=true
    spark.gluten.sql.columnar.expand=true
    spark.gluten.sql.columnar.broadcastexchange=true
    spark.gluten.sql.columnar.broadcastJoin=true
    spark.gluten.sql.columnar.wholestagetransform=false
    spark.gluten.sql.columnar.wholestagecodegen.breakdownTime=false
    spark.gluten.sql.columnar.shuffle.customizedCompression.codec="lz4"
    spark.gluten.sql.columnar.numaBinding=true
    spark.gluten.sql.columnar.coreRange="0-17,36-53 |18-35,54-71"
    spark.sql.execution.arrow.maxRecordsPerBatch=10000

    spark.executor.memoryOverhead=1g
    spark.memory.offHeap.enabled=true
    spark.gluten.sql.columnar.shuffleSplitDefaultSize=8192
    spark.gluten.sql.columnar.numaBinding=true
    spark.gluten.sql.columnar.backend.lib=velox
    spark.gluten.soft-affinity.enabled=true
    spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager
    spark.sql.parquet.columnarReaderBatchSize=10240
    spark.sql.inMemoryColumnarStorage.batchSize=10240
    spark.sql.execution.arrow.maxRecordsPerBatch=10240
    spark.sql.files.maxPartitionBytes=2g
    spark.sql.autoBroadcastJoinThreshold=10M
    spark.sql.broadcastTimeout=4800
    spark.driver.maxResultSize=4g
    spark.executorEnv.LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2

Expected behavior
I would expect the Spark cluster to accept the TPC-H workload and generate results.

Additional context
I'm attaching a couple of log files that show the first and the second issues.
issue1.log
issue2.log

obedmr added the bug label Dec 16, 2022
zhouyuan (Contributor) commented:

Hi @obedmr
Could not initialize class io.glutenproject.expression.ExpressionMappings$
It looks like the issue is due to some missing/incompatible class. Can you please try the fat jar, which contains all the dependencies?

Here's one example

spark.driver.extraClassPath /home/sparkuser/nativesql_jars/gluten-spark3.3_2.12-1.0.0-SNAPSHOT-jar-with-dependencies.jar
spark.executor.extraClassPath /home/sparkuser/nativesql_jars/gluten-spark3.3_2.12-1.0.0-SNAPSHOT-jar-with-dependencies.jar
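One quick way to test the missing-class theory is to check whether the class is actually present in the jar on the classpath. A minimal sketch (the jar path is the one from the config in this issue; `has_class` is a hypothetical helper, and `unzip -l` is just one way to list a jar's contents — `jar tf` works equally well):

```shell
# Hypothetical helper: check whether a class file is present inside a jar.
# $1 = path to the jar, $2 = class path fragment to look for.
has_class() {
  unzip -l "$1" | grep -q "$2"
}

# Usage on a live system (jar path taken from this issue's config):
#   has_class /opt/gluten-spark3.3.jar 'io/glutenproject/expression/ExpressionMappings' \
#     && echo "class present" || echo "class missing"
```

If the class is missing from the thin jar but present in the jar-with-dependencies build, that would confirm the classpath is the problem.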

Note that Gluten currently only supports Scala 2.12, so please also check that your Spark build uses the correct Scala version.
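As a sketch of that check (the banner format is assumed from standard Apache Spark builds, e.g. "Using Scala version 2.12.15, ..."; adjust the parsing if your distribution prints it differently):

```shell
# Extract the Scala version from a spark-submit/spark-shell version banner
# read on stdin, e.g. "Using Scala version 2.12.15, OpenJDK 64-Bit Server VM".
scala_version() {
  grep -io 'scala version [0-9.]*' | head -n1 | awk '{print $3}'
}

# Usage on a live system:
#   spark-submit --version 2>&1 | scala_version   # Gluten needs 2.12.x
```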

For the second issue, it's caused by a missing NUMA core range in the Spark conf.
Here's one example from my testing server:

spark.gluten.sql.columnar.numaBinding=true
spark.gluten.sql.columnar.coreRange= 0-35,72-107|36-71,108-143

The core range expects two NUMA nodes split by "|"; you may need to fill in the right numbers for your system:
lscpu | grep "NUMA node" | tail -n2
If your system does not support NUMA, simply disable the NUMA binding feature.
Note that this is an experimental feature; it may hurt performance when running in some cloud environments.
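The lscpu step above can be turned into the conf value mechanically. A minimal sketch, assuming the standard util-linux lscpu output format and exactly two NUMA nodes (`to_core_range` is a hypothetical helper, not part of Gluten):

```shell
# Build the spark.gluten.sql.columnar.coreRange value from lscpu output.
# Reads lscpu-style text on stdin and joins the per-node CPU ranges with "|".
to_core_range() {
  grep 'NUMA node[0-9]' | awk -F: '{gsub(/[[:space:]]/, "", $2); print $2}' | paste -sd'|' -
}

# Usage on a live system:
#   echo "spark.gluten.sql.columnar.coreRange=$(lscpu | to_core_range)"
```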

Thanks, -yuan

@weiting-chen weiting-chen added the velox backend works for Velox backend label Mar 1, 2023

liralon commented Mar 19, 2023

The validation failure in GlutenHashAggregateExecTransformer due to failing to initialize class io.glutenproject.expression.ExpressionMappings$ seems similar to the bug I found and root-caused in #1161. Consider checking whether the workaround described there solves your issue.


jinchengchenghh commented Apr 4, 2023

Can you try the solution? If it solves the problem, I will close this issue. @obedmr
In addition, we should use mvn clean install to build the Java project.


obedmr commented Apr 4, 2023

Hi @jinchengchenghh, I'm sorry I haven't replied before. The issue has been solved: some other packages that were included were causing it, along with the missing clean build you just mentioned. Thanks a lot for all the support.

obedmr closed this as completed Apr 4, 2023