Estimate cardinalities of predicates with uncorrelated subquery results #2536

dey4ss · 2023-02-01T16:34:31Z

This PR estimates the cardinalities of predicates created by the JoinToPredicateRewriteRule as if they were still (semi-)joins. Thus, the rewritten predicates are placed correctly in the query plans.
Imagine the following query plan (simplified example, edges annotated with estimated output cardinality and selectivity):

                        100 rows (0.5)
                +---------------+
                | LikePredicate |
                +---------------+
                        |
                        | 200 rows (0.2)
                        |
                +---------------+
                |   SemiJoin    |
                +---------------+
               /                 \
              /                   \ 1 row (0.05)
             |                     |
             |             +---------------+
             |             |   Predicate   |  
             |             +---------------+
             |                     |
             | 1000 rows           | 20 rows
             |                     |
     +---------------+     +---------------+
     |    Table A    |     |    Table B    |
     +---------------+     +---------------+

In the query plan, the semi-join has a selectivity of 0.2 and the like predicate a selectivity of 0.5. Thus, the semi-join is executed first: the PredicatePlacementRule and the PredicateReorderingRule place it below the like predicate. When we rewrite the plan with the JoinToPredicateRewriteRule, the plan looks like this (w/o predicate placement and ordering):

                        500 rows (0.5)
                +---------------+
                | LikePredicate |
                +---------------+
                        |
                        | 1000 rows (1)
                        |
                +---------------+
                |   Predicate   |
                +---------------+
               /                 * 
              /                   *  1 row (1), uncorrelated subquery
             |                     *
             |             +---------------+
             |             |  Projection   |  
             |             +---------------+
             |                     |
             |                     | 1 row (0.05)
             |                     |
             |             +---------------+
             |             |   Predicate   |  
             |             +---------------+
             |                     |
             | 1000 rows           | 20 rows
             |                     |
     +---------------+     +---------------+
     |    Table A    |     |    Table B    |
     +---------------+     +---------------+

Currently, the predicate containing an uncorrelated subquery is not resolved to an OperatorScanPredicate. Thus, we assume the worst case and forward its input statistics (i.e., selectivity = 1). The predicate reordering sorts predicates so that the most selective predicate (the one with the lowest selectivity value) is executed first. The like predicate would thus end up below the predicate with the subquery.
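
The effect on the ordering can be sketched in a few lines of Python. This is a toy model and not Hyrise code: the `Predicate` class and the `order_by_selectivity` helper are hypothetical names, and the selectivities are the ones from the example plan.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Predicate:
    name: str
    selectivity: float  # estimated fraction of input rows that pass


def order_by_selectivity(predicates):
    """Return predicates in execution order: lowest selectivity value first."""
    return sorted(predicates, key=lambda p: p.selectivity)


like = Predicate("LikePredicate", 0.5)
# Without a dedicated estimate, the subquery predicate forwards its input
# statistics, which corresponds to a selectivity of 1.
subquery_unestimated = Predicate("SubqueryPredicate", 1.0)

order = [p.name for p in order_by_selectivity([subquery_unestimated, like])]
print(order)  # the like predicate is scheduled before the subquery predicate
```

In this model, the like predicate always wins against a predicate whose selectivity is assumed to be 1, no matter how selective the subquery predicate actually is.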

However, we can get the desired result/ordering when we estimate the predicate as we would do for the (rewritten) semi-join: With the changes made by this PR, we correctly estimate the predicate to have an output cardinality of 200 rows / selectivity 0.2 and the favorable predicate ordering is achieved.
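
As a rough illustration of the arithmetic, the following sketch propagates the estimated row counts of the example through both orderings. It is a simplified model under the assumption that selectivities simply multiply; the `propagate` helper and the constant names are hypothetical and do not mirror the actual CardinalityEstimator.

```python
import math


def propagate(input_rows, selectivities):
    """Apply selectivities in execution order; return the estimated rows after each step."""
    rows = float(input_rows)
    outputs = []
    for selectivity in selectivities:
        rows *= selectivity
        outputs.append(rows)
    return outputs


TABLE_A_ROWS = 1000
LIKE_SELECTIVITY = 0.5
SEMI_JOIN_SELECTIVITY = 0.2  # estimate of the original semi-join

# Subquery predicate estimated with selectivity 1: the like predicate runs first.
without_estimate = propagate(TABLE_A_ROWS, [LIKE_SELECTIVITY, 1.0])

# Subquery predicate estimated like the semi-join: it runs first.
with_estimate = propagate(TABLE_A_ROWS, [SEMI_JOIN_SELECTIVITY, LIKE_SELECTIVITY])

print(without_estimate, with_estimate)
assert math.isclose(with_estimate[0], 200.0)
```

With the semi-join estimate, the like predicate is estimated to see 200 rows instead of all 1000 rows of Table A, which matches the original plan.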

This is just a simplified example. In experiments (will post when finished), we observed that the missing estimate can lead to very unfavorable query plans: predicates with subqueries are pushed behind, e.g., expensive predicates that are evaluated by the ExpressionEvaluator, or we end up with completely different plans due to semi-join reductions, which are also placed below our predicate.

The extension of the cardinality estimation also covers the correct estimation of a future order-dependency-based join-to-predicate rewrite. (Can provide details on that if desired.)

Benchmarks will follow.

closes #2508

Benchmark `master` vs PR

tl;dr: Nothing really changes (as expected), since we don't really have scans with uncorrelated subqueries by default.
System

| property | value |
| --- | --- |
| Hostname | nemea |
| CPU | Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz |
| Memory | 939GB |
| numactl | nodebind: 2 |
| numactl | membind: 2 |

Commit Info and Build Time

| commit | date | message | build time |
| --- | --- | --- | --- |
| `41512c3` | 01.02.2023 12:54 | Schedule uncorrelated subqueries together with other operators (#2520) | real 334.68 user 2905.01 sys 90.37 |
| `ef5dfaf` | 01.02.2023 17:51 | more tests | real 335.44 user 2854.20 sys 88.13 |

hyriseBenchmarkTPCH - single-threaded, SF 10.0

Sum of avg. item runtimes: -1% || Geometric mean of throughput changes: +1%