Estimate cardinalities of predicates with uncorrelated subquery results #2536
…join_to_predicate
It confuses me. I probably need some whiteboard explanations. :-/
subquery_statistics = estimate_statistics(subquery_expression->lqp);
}

// Case (ii): Between predicate with column BETWEEN min(<subquery>) AND max(<subquery>). Equivalent to a semi-join
I don't understand how a between can be equivalent to a semi-join.
I think the UCC information is still missing here, right?
Guess it's too hidden here?
hyrise/src/lib/statistics/cardinality_estimator.cpp
Lines 544 to 545 in a03c7e3
// We do not have to further check if the subqueries return at most one row. This will be ensured during execution
// by the TableScan operator.
Hmm, I think I would still add "some" UCC information here. It doesn't really help reading the code, when the main assumption here is cleared up many lines later. And stating that the table scan would recognize certain cases doesn't help to understand why we can even do those between-to-join reformulations.
@Bouncner I added more comments, please see if they are sufficient.
// equals or between condition, it acts as a filter comparable to a semi-join with the join key of the subquery
// result (see examples below). We obtain such predicates with subquery results from the JoinToPredicateRewriteRule.
// This rule also checks that all preconditions are met to ensure correct query results. Thus, we do not check them
// here. For more information about this query rewrite, see `join_to_predicate_rewrite_rule.hpp`.
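To make the "between acts like a semi-join" idea from the comment above concrete, here is a minimal sketch. It is not Hyrise code: `semi_join_matches` and `between_matches` are hypothetical helpers, the integer keys are made up, and the sufficient condition used here (the subquery keys cover every column value between their minimum and maximum) is an assumption of this sketch, not a statement of the exact preconditions the `JoinToPredicateRewriteRule` checks.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Semi-join semantics: count the rows of `column` whose value occurs in `subquery_keys`.
std::size_t semi_join_matches(const std::vector<int>& column, const std::vector<int>& subquery_keys) {
  const std::unordered_set<int> key_set(subquery_keys.begin(), subquery_keys.end());
  return static_cast<std::size_t>(
      std::count_if(column.begin(), column.end(), [&](int value) { return key_set.count(value) > 0; }));
}

// Predicate semantics: column BETWEEN min(<subquery>) AND max(<subquery>).
std::size_t between_matches(const std::vector<int>& column, const std::vector<int>& subquery_keys) {
  if (subquery_keys.empty()) {
    return 0;  // min/max of an empty result are NULL, so no row qualifies.
  }
  const auto bounds = std::minmax_element(subquery_keys.begin(), subquery_keys.end());
  const int lower = *bounds.first;
  const int upper = *bounds.second;
  assert(lower <= upper);
  return static_cast<std::size_t>(
      std::count_if(column.begin(), column.end(), [&](int value) { return lower <= value && value <= upper; }));
}
```

With column values `{1, 3, 4, 5, 5, 7, 9}` and subquery keys `{3, 4, 5}`, both helpers select the same four rows. With subquery keys `{3, 5}` the range has a gap, the between predicate additionally selects the row with value 4, and the two are no longer equivalent. That gap is why the rewrite needs preconditions at all.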
I haven't found text that explains that we only run into this branch here for the rewrite and not when somebody manually created "equal" subqueries.
I think that would help understanding this branch a lot.
For the `column = <subquery>` case, we can actually have it triggered by the user with the example below:
SELECT n_name FROM nation WHERE n_regionkey = (SELECT r_regionkey FROM region WHERE r_name = 'ASIA');
However, the `SubqueryToJoinRule` eventually rewrites the subquery scan to a join.
What do you think of pulling the part starting from `We obtain such predicates ...` up and moving it after the first sentence?
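For the `column = <subquery>` case above, a small sketch of why the scan filters like a semi-join as long as the subquery returns at most one row. The helper names `semi_join_count` and `equals_subquery_count` and the integer keys are hypothetical and only illustrate the semantics; this is not how Hyrise implements either operator.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Semi-join semantics: count the rows of `column` that find a partner in `subquery_keys`.
std::size_t semi_join_count(const std::vector<int>& column, const std::vector<int>& subquery_keys) {
  const std::unordered_set<int> key_set(subquery_keys.begin(), subquery_keys.end());
  std::size_t matches = 0;
  for (const int value : column) {
    matches += key_set.count(value);  // count() is 0 or 1 for a set
  }
  return matches;
}

// Scan semantics: column = <subquery>. The subquery may return at most one row.
std::size_t equals_subquery_count(const std::vector<int>& column, const std::vector<int>& subquery_keys) {
  assert(subquery_keys.size() <= 1);
  if (subquery_keys.empty()) {
    return 0;  // Comparing with an empty (NULL) result selects nothing.
  }
  return static_cast<std::size_t>(std::count(column.begin(), column.end(), subquery_keys.front()));
}
```

If `column` plays the role of `n_regionkey` and the subquery yields the single region key for 'ASIA', both functions count exactly the same nation rows, so a semi-join estimate is a reasonable stand-in for the scan.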
Yep, that would help.
Actually, I would even move it up to line 421.
Nothing on the code side, but happy to further annoy you with stupid questions.
It's okay that you contribute with what you're able to :b
@Bouncner happy?
I promise, I'll approve next time.
Maybe ... if I understand what you're doing here. So ... maybe. No.
…ts (#2536) Estimate query plans rewritten by the `JoinToPredicateRewriteRule` as if they were still (semi-)joins to place the rewritten predicates correctly in the query plans.
This PR estimates query plans rewritten by the `JoinToPredicateRewriteRule` as if they were still (semi-)joins. Thus, the rewritten predicates are correctly placed in the query plans.
Imagine the following query plan (simplified example, edges annotated with estimated output cardinality and selectivity):
In the query plan, the semi-join has a selectivity of 0.2 and the like predicate a selectivity of 0.5. Thus, the semi-join is executed before the predicate, i.e., the `PredicatePlacementRule` and the `PredicateReorderingRule` place it below the predicate. When we rewrite the plan with the `JoinToPredicateRewriteRule`, the plan looks like this (without predicate placement and ordering):
Currently, the predicate containing an uncorrelated subquery is not resolved to an OperatorScanPredicate. Thus, we assume the worst case and forward its input statistics (i.e., selectivity = 1). The predicate reordering sorts predicates to execute the predicate with the lowest selectivity first: The like predicate would thus end up below the predicate with the subquery.
However, we can get the desired result/ordering when we estimate the predicate as we would do for the (rewritten) semi-join: With the changes made by this PR, we correctly estimate the predicate to have an output cardinality of 200 rows / selectivity 0.2 and the favorable predicate ordering is achieved.
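The effect of the estimate on the ordering can be sketched in a few lines. This is a simplified model, not the `PredicateReorderingRule` itself: the `Predicate` struct, `order_by_estimate`, and `rows_scanned` are hypothetical names, the input of 1000 rows is derived from the 200 rows / selectivity 0.2 figure above, and the sketch assumes that the actual selectivities match the semi-join and like estimates from the example.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a predicate node in the plan.
struct Predicate {
  std::string name;
  double estimated_selectivity;  // what the cardinality estimator reports
  double actual_selectivity;     // what the predicate really filters (assumed)
};

// Execute the predicate with the lowest estimated selectivity first.
std::vector<Predicate> order_by_estimate(std::vector<Predicate> predicates) {
  for (const auto& predicate : predicates) {
    assert(predicate.estimated_selectivity >= 0.0 && predicate.estimated_selectivity <= 1.0);
  }
  std::stable_sort(predicates.begin(), predicates.end(), [](const Predicate& lhs, const Predicate& rhs) {
    return lhs.estimated_selectivity < rhs.estimated_selectivity;
  });
  return predicates;
}

// Sum of the rows that enter each predicate, as a rough proxy for the scan work.
double rows_scanned(const std::vector<Predicate>& execution_order, double input_rows) {
  double total = 0.0;
  double rows = input_rows;
  for (const auto& predicate : execution_order) {
    total += rows;
    rows *= predicate.actual_selectivity;
  }
  return total;
}
```

With the subquery predicate estimated at selectivity 1.0, the like predicate (0.5) runs first and the two predicates see about 1000 + 500 = 1500 rows. With the estimate corrected to 0.2, the subquery predicate runs first and they see about 1000 + 200 = 1200 rows.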
This is just a simplified example. In experiments (will post when finished), we observed that this can lead to very unfavorable query plans when predicates with subqueries are pushed behind, e.g., expensive predicates that are performed by the ExpressionEvaluator. We can also end up with completely different plans due to semi-join reductions, which will also be placed below our predicate.
The extension of the cardinality estimation also includes the correct estimation of a future order dependency-based join to predicate rewrite. (Can provide details on that if desired.)
Benchmarks will follow.
closes #2508
Benchmark: master vs PR
tl;dr Nothing really changes (as expected) - we don't really have scans with uncorrelated subqueries by default.
System: nemea
hyriseBenchmarkTPCH - single-threaded, SF 10.0
Sum of avg. item runtimes: -1% || Geometric mean of throughput changes: +1%
hyriseBenchmarkTPCH - single-threaded, SF 0.01
Sum of avg. item runtimes: -0% || Geometric mean of throughput changes: -1%
hyriseBenchmarkTPCH - multi-threaded, ordered, 1 client, 28 cores, SF 10.0
Sum of avg. item runtimes: -0% || Geometric mean of throughput changes: +1%
hyriseBenchmarkTPCH - multi-threaded, shuffled, 28 clients, 28 cores, SF 10.0
Sum of avg. item runtimes: +0% || Geometric mean of throughput changes: +0%
hyriseBenchmarkTPCDS - single-threaded
Sum of avg. item runtimes: -2% || Geometric mean of throughput changes: +3%
hyriseBenchmarkTPCDS - multi-threaded, shuffled, 28 clients, 28 cores
Sum of avg. item runtimes: +0% || Geometric mean of throughput changes: +0%
hyriseBenchmarkTPCC - single-threaded
Sum of avg. item runtimes: -1% || Geometric mean of throughput changes: +1%
hyriseBenchmarkTPCC - multi-threaded, shuffled, 28 clients, 28 cores
Sum of avg. item runtimes: +0% || Geometric mean of throughput changes: -0%
hyriseBenchmarkJoinOrder - single-threaded
Sum of avg. item runtimes: -0% || Geometric mean of throughput changes: +1%
hyriseBenchmarkJoinOrder - multi-threaded, shuffled, 28 clients, 28 cores
Sum of avg. item runtimes: +0% || Geometric mean of throughput changes: +0%