Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[TPCH][VL] tpch has some query execution error logs but queries could finish and the result is correct #1090

Closed
deshanxiao opened this issue Mar 8, 2023 · 14 comments
Labels
bug Something isn't working velox backend works for Velox backend

Comments

@deshanxiao
Copy link

Describe the bug
There are some issues when execute the tpch query of "q4 q12 q13 q16 q21". Below is the detailed error log:
error log.txt

image

To Reproduce
Based on the doc : https://github.com/oap-project/gluten/tree/main/backends-velox/workload/tpch
···
bash tpch_parquet.sh
···
Spark Version:3.2.4

Expected behavior
No error log

Additional context

@deshanxiao deshanxiao added the bug Something isn't working label Mar 8, 2023
@deshanxiao deshanxiao changed the title [TPCH][VELOX] tpch has some query execution errors [TPCH][VL] tpch has some query execution errors Mar 8, 2023
@zzl200012
Copy link

zzl200012 commented Mar 8, 2023

I encountered the same problem before, which seems to be caused by constructing MetadataFilter(which is used for RowGroup pruning in Parquet Reader) for remaining filters (which seems to be conditions that contain multiple columns e.g. col_1 > col_2 or some other "complex expressions"), but MetadataFilter could only be constructed for "simple expressions" which contain one column e.g. col_1 > 10, col_2 in (1, 3, 5, 7). So I commented these 2 lines of code (https://github.com/oap-project/velox/blob/main/velox/connectors/hive/HiveConnector.cpp#L173-L174) as a workaround, then it worked.
In theory, we can not do predicate pushdown (late materialization & row group pruning) for the filters with multiple columns (just like l_commitdate < l_receiptdate in TPCH Q4), so I'm wondering if there is something not aligned between gluten and upstream velox, or do I miss anything.

@zhouyuan
Copy link
Contributor

zhouyuan commented Mar 9, 2023

@PHILO-HE

@zhouyuan
Copy link
Contributor

zhouyuan commented Mar 9, 2023

hi @deshanxiao

We have TPCH jobs(spark3.2.2) running nightly and didn't run into these issues. Based on the logs, it seems you are testing with Spark 3.2.4-SNAPSHOT, which is not released yet.
Spark runtime version 3.2.4 is not matched with Gluten's fully tested version 3.2.2
Is this built from https://github.com/apache/spark/tree/branch-3.2 ? Are there any modifications on Spark?

Thanks, -yuan

@deshanxiao
Copy link
Author

Thank you for your reply, yes, I tested it with https://github.com/apache/spark/tree/branch-3.2 and no any modifications on Spark.

Let me re-build a spark3.2.2 to test it again. Thank you @zhouyuan

@zzl200012
Copy link

Hi @zhouyuan, did you run those TPCH jobs with parquet? From the above, this issue should be related to the native parquet reader of velox.

@zhouyuan
Copy link
Contributor

zhouyuan commented Mar 9, 2023

@zzl200012

yes, the issue is related with filter pushdown as you suggested, and the two line code is not touched for some time.

The nightly jobs are running with parquet datasource as described here: https://github.com/oap-project/gluten/tree/main/backends-velox/workload/tpch

The github action CI also checks for running with small TPCH/TPCDS datasets: https://github.com/oap-project/gluten/actions/runs/4362916625/jobs/7628376329
image

So the issue here seems a bit surprise to me, I'm trying to build a new Spark to reproduce the issues(the spark used in our tests are 322 and 321, something may changed in the 3.2 branch)

Thanks, -yuan

@zzl200012
Copy link

zzl200012 commented Mar 9, 2023

Hi @zhouyuan, I click on these tabs and see:
image
Could you take a look? I picked this one: https://github.com/oap-project/gluten/actions/runs/4362916625/jobs/7628376329

@zhouyuan
Copy link
Contributor

zhouyuan commented Mar 9, 2023

@zzl200012

Thanks for pointing out the logs, I get your point now. So the issue is: TPCH query reported exceptions during run, but query is finished and the result is correct. Initially I misunderstood the issue here, I thought you meant for "query will fail due to the exceptions".
In the current gluten code, we had lots of code blocks like:

try {
  do validate in velox
} catch {
  go to fallback or another code path
} 

and in some validation functions, velox will dump out some error logs, but it should be well handled in the following catch block in gluten.
Indeed this looks ugly and makes people nervous :) , let me refine the doc to explain this issue.

Thanks, -yuan

@deshanxiao
Copy link
Author

Got it, so these errors are expected, right? In fact, from the log I uploaded, the query can get the result eventually. Please forgive my inaccurate expression.

@deshanxiao deshanxiao changed the title [TPCH][VL] tpch has some query execution errors [TPCH][VL] tpch has some query execution errors logs but query is finished and the result is correct Mar 9, 2023
@deshanxiao deshanxiao changed the title [TPCH][VL] tpch has some query execution errors logs but query is finished and the result is correct [TPCH][VL] tpch has some query execution error logs but queries could finish and the result is correct Mar 9, 2023
@zhouyuan
Copy link
Contributor

zhouyuan commented Mar 9, 2023

@deshanxiao no worries, yes, the error messages are expected. Thanks for pointing this out usually we only verify the results and just ignored the "error messages" shown in executor logs. I can help to add one explanation in the doc first, a better fix would be improving the try/catch code to give out a more clarity
-yuan

@zzl200012
Copy link

@zhouyuan Thanks for your explanation and sorry for the late response, previously I used an old commit of gluten, and this velox issue would cause a core dump (which confused me for a while). Now I tried the latest commit, and everything works fine. I'll follow closer to the community!

@zzl200012
Copy link

BTW, I've seen that the validation & fallback mechanism would ensure the query to be executed properly, but for this issue, it will still cause an unnecessary fallback of the FileScan operator, am I right?

@PHILO-HE
Copy link
Contributor

BTW, I've seen that the validation & fallback mechanism would ensure the query to be executed properly, but for this issue, it will still cause an unnecessary fallback of the FileScan operator, am I right?

Not really. For this issue, the filter like col_1 < col_2 is NOT supported as pushed down filter for scanning in velox backend. But it is still offloaded to velox and executed after scanning, not goes into fallback path. You can check spark DAG to confirm it. As yuan said, indeed such log message may confuse users and can be refined in the future.

@PHILO-HE
Copy link
Contributor

Closing this issue. Please re-open it if needed.

@weiting-chen weiting-chen added the velox backend works for Velox backend label Apr 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working velox backend works for Velox backend
Projects
None yet
Development

No branches or pull requests

5 participants