From 812600b77a8ec1c42a890b8cd919e39635f71fea Mon Sep 17 00:00:00 2001 From: tiancaiamao Date: Tue, 30 Jun 2020 11:10:37 +0800 Subject: [PATCH 1/6] perf-tuning: add docs for subquery optimizations --- correlated-subquery-optimization.md | 79 ++++++++++++++++++++++++++ subquery-optimization.md | 88 +++++++++++++++++++++++++++++ 2 files changed, 167 insertions(+) create mode 100644 correlated-subquery-optimization.md create mode 100644 subquery-optimization.md diff --git a/correlated-subquery-optimization.md b/correlated-subquery-optimization.md new file mode 100644 index 0000000000000..56b02ec5ccf40 --- /dev/null +++ b/correlated-subquery-optimization.md @@ -0,0 +1,79 @@ +--- +title: Decorrelation of correlated subquery +summary: Understand how to decorrelate correlated subqueries +category: performance +--- + +# Decorrelation of correlated subquery + +[Subquery related optimization] (/subquery-optimization.md) describes how TiDB handles subqueries when there are no correlated columns. Decorrelation of correlated subquery is complex, this article introduces some simple scenarios and the scope that the optimization rule applies to. + +## Introduction + +Take `select * from t1 where t1.a < (select sum(t2.a) from t2 where t2.b = t1.b)` as an example, the subquery `t1.a < (select sum(t2.a) from t2 where t2.b = t1.b)` here refers to the correlated column in the query condition `t2.b=t1.b`, this condition happens to be an equivalent condition, so the query can be rewritten as `select t1.* from t1, (select b, sum(a) sum_a from t2 group by b) t2 where t1.b = t2.b and t1.a < t2.sum_a;`. In this way, a correlated subquery is rewritten into `JOIN`. + +The reason why TiDB needs to do this rewriting is that the correlated subquery is bound to its external query result every time the subquery is executed. In the above example, if `t1.a` has 10 million values, this subquery would repeat 10 million times, because the condition `t2.b=t1.b` varies with the value of `t1.a`. When the correlation is lifted somehow, this subquery would execute only once. + +## Restrictions + +The disadvantage of this rewriting is that when the correlation is not resolved, the optimizer can use the index on the correlated column. That is to say, although this subquery may repeat many times, the index can be used to filter data each time. While after using the rewriting rule, the position of the correlated column usually changes. Although the subquery is only executed once, the single execution time would be longer than that without decorrelation. + +Therefore, when there are few external values, do not do decorrelation may bring better execution performance. At present, this optimization can be turned off by setting `subquery decorrelation` optimization rules in [blocklist of optimization rules and expression pushdown](/blacklist-control-plan.md). + +## Example + +{{< copyable "sql" >}} + +```sql +create table t1(a int, b int); +create table t2(a int, b int, index idx(b)); +explain select * from t1 where t1.a < (select sum(t2.a) from t2 where t2.b = t1.b); +``` + +```sql ++----------------------------------+----------+-----------+---------------+-----------------------------------------------------------------------------------------+ +| id | estRows | task | access object | operator info | ++----------------------------------+----------+-----------+---------------+-----------------------------------------------------------------------------------------+ +| HashJoin_11 | 9990.00 | root | | inner join, equal:[eq(test.t1.b, test.t2.b)], other cond:lt(cast(test.t1.a), Column#7) | +| ├─HashAgg_23(Build) | 7992.00 | root | | group by:test.t2.b, funcs:sum(Column#8)->Column#7, funcs:firstrow(test.t2.b)->test.t2.b | +| │ └─TableReader_24 | 7992.00 | root | | data:HashAgg_16 | +| │ └─HashAgg_16 | 7992.00 | cop[tikv] | | group by:test.t2.b, funcs:sum(test.t2.a)->Column#8 | +| │ └─Selection_22 | 9990.00 | cop[tikv] | | not(isnull(test.t2.b)) | +| │ └─TableFullScan_21 | 10000.00 | cop[tikv] | table:t2 | keep order:false, stats:pseudo | +| └─TableReader_15(Probe) | 9990.00 | root | | data:Selection_14 | +| └─Selection_14 | 9990.00 | cop[tikv] | | not(isnull(test.t1.b)) | +| └─TableFullScan_13 | 10000.00 | cop[tikv] | table:t1 | keep order:false, stats:pseudo | ++----------------------------------+----------+-----------+---------------+-----------------------------------------------------------------------------------------+ + +``` + +The above is an example where the optimization takes effect, `HashJoin_11` is a normal `inner join`. + +Then, turn off the subquery decorrelation rules: + +{{< copyable "sql" >}} + +```sql +insert into mysql.opt_rule_blacklist values("decorrelate"); +admin reload opt_rule_blacklist; +explain select * from t1 where t1.a < (select sum(t2.a) from t2 where t2.b = t1.b); +``` + +```sql ++----------------------------------------+----------+-----------+------------------------+------------------------------------------------------------------------------+ +| id | estRows | task | access object | operator info | ++----------------------------------------+----------+-----------+------------------------+------------------------------------------------------------------------------+ +| Projection_10 | 10000.00 | root | | test.t1.a, test.t1.b | +| └─Apply_12 | 10000.00 | root | | CARTESIAN inner join, other cond:lt(cast(test.t1.a), Column#7) | +| ├─TableReader_14(Build) | 10000.00 | root | | data:TableFullScan_13 | +| │ └─TableFullScan_13 | 10000.00 | cop[tikv] | table:t1 | keep order:false, stats:pseudo | +| └─MaxOneRow_15(Probe) | 1.00 | root | | | +| └─HashAgg_27 | 1.00 | root | | funcs:sum(Column#10)->Column#7 | +| └─IndexLookUp_28 | 1.00 | root | | | +| ├─IndexRangeScan_25(Build) | 10.00 | cop[tikv] | table:t2, index:idx(b) | range: decided by [eq(test.t2.b, test.t1.b)], keep order:false, stats:pseudo | +| └─HashAgg_17(Probe) | 1.00 | cop[tikv] | | funcs:sum(test.t2.a)->Column#10 | +| └─TableRowIDScan_26 | 10.00 | cop[tikv] | table:t2 | keep order:false, stats:pseudo | ++----------------------------------------+----------+-----------+------------------------+------------------------------------------------------------------------------+ +``` + +After disabling the subquery decorrelation rule, you can see `range: decided by [eq(test.t2.b, test.t1.b)]` in `operator info` of `IndexRangeScan_25(Build)`. It means that the decorrelation of correlated subquery is not performed and TiDB uses the index range query. diff --git a/subquery-optimization.md b/subquery-optimization.md new file mode 100644 index 0000000000000..397abb876d082 --- /dev/null +++ b/subquery-optimization.md @@ -0,0 +1,88 @@ +--- +title: Subquery related optimizations +summary: Understand optimizations related to subqueries +category: performance +--- + +# Subquery related optimization + +This article mainly introduces subquery related optimizations. + +Subqueries usually appear in the following situations: + +- `NOT IN (SELECT ... FROM ...)` +- `NOT EXISTS (SELECT ... FROM ...)` +- `IN (SELECT ... FROM ..)` +- `EXISTS (SELECT ... FROM ...)` +- `... >/>=/ ANY (SELECT ... FROM ...)` + +In this case, `ALL` and `ANY` can be replaced by `MAX` and `MIN`. When the table is empty, the result of `MAX(EXPR)` and `MIN(EXPR)` will be NULL, it works when the result of `EXPR` contains `NULL`. Whether the result of `EXPR` contains `NULL` may affect the final result of the expression, so the complete rewrite is given in the following form: + +- `t.id < all (select s.id from s)` will be rewritten as `t.id < min(s.id) and if(sum(s.id is null) != 0, null, true)`. +- `t.id < any (select s.id from s)` will be rewritten as `t.id < max(s.id) or if(sum(s.id is null) != 0, null, false)`. + +## `... != ANY (SELECT ... FROM ...)` + +In this case, if all the values from the subquery are distinct, it enough to compare the query with them. If the number of different values in the subquery is more than one, then there must be inequality. Therefore, such subqueries can be rewritten as follows: + +- `select * from t where t.id != any (select s.id from s)` is rewritten as `select t.* from t, (select s.id, count(distinct s.id) as cnt_distinct from s) where (t.id != s.id or cnt_distinct > 1)` + +## `... = ALL (SELECT ... FROM ...)` + +In this case, when the number of different values in the subquery is more than one, then the result of this expression must be false. Therefore, such subquery is rewritten into the following form in TiDB: + +- `select * from t where t.id = all (select s.id from s)` is rewritten as `select t.* from t, (select s.id, count(distinct s.id) as cnt_distinct from s ) where (t.id = s.id and cnt_distinct <= 1)` + +## `... IN (SELECT ... FROM ...)` + +In this case, the subquery of `IN` is rewritten into `SELECT ... FROM ... GROUP ...`, and then rewritten into the normal form of `JOIN`. +For example, `select * from t1 where t1.a in (select t2.a from t2)` will be rewritten as `select t1.* from t1, (select distinct(a) a from t2) t2 where t1.a = t2. The form of a`. The `DISTINCT` attribute here can be eliminated automatically if `t2.a` has the `UNIQUE` attribute. + +{{< copyable "sql" >}} + +```sql +explain select * from t1 where t1.a in (select t2.a from t2); +``` + +```sql ++------------------------------+---------+-----------+------------------------+----------------------------------------------------------------------------+ +| id | estRows | task | access object | operator info | ++------------------------------+---------+-----------+------------------------+----------------------------------------------------------------------------+ +| IndexJoin_12 | 9990.00 | root | | inner join, inner:TableReader_11, outer key:test.t2.a, inner key:test.t1.a | +| ├─HashAgg_21(Build) | 7992.00 | root | | group by:test.t2.a, funcs:firstrow(test.t2.a)->test.t2.a | +| │ └─IndexReader_28 | 9990.00 | root | | index:IndexFullScan_27 | +| │ └─IndexFullScan_27 | 9990.00 | cop[tikv] | table:t2, index:idx(a) | keep order:false, stats:pseudo | +| └─TableReader_11(Probe) | 1.00 | root | | data:TableRangeScan_10 | +| └─TableRangeScan_10 | 1.00 | cop[tikv] | table:t1 | range: decided by [test.t2.a], keep order:false, stats:pseudo | ++------------------------------+---------+-----------+------------------------+----------------------------------------------------------------------------+ +``` + +This rewrite will get better performance when the `IN` subquery is relatively small and the external query is relatively large, because without rewriting, using `index join` with t2 as the driving table is impossible. However, the disadvantage is that when the aggregation cannot be automatically eliminated during the rewritten and the `t2` table is relatively large, this rewrite will affect the performance of the query. Currently, the variable [tidb\_opt\_insubq\_to\_join\_and\_agg](/tidb-specific-system-variables.md#tidb_opt_insubq_to_join_and_agg) is used to control this optimization. When this optimization is not suitable, you can manually turn off it. + +## `EXISTS` subquery and `... >/>=/}} + +```sql +create table t1(a int); +create table t2(a int); +insert into t2 values(1); +explain select * from t where exists (select * from t2); +``` + +```sql ++------------------------+----------+-----------+---------------+--------------------------------+ +| id | estRows | task | access object | operator info | ++------------------------+----------+-----------+---------------+--------------------------------+ +| TableReader_12 | 10000.00 | root | | data:TableFullScan_11 | +| └─TableFullScan_11 | 10000.00 | cop[tikv] | table:t | keep order:false, stats:pseudo | ++------------------------+----------+-----------+---------------+--------------------------------+ +``` From d67d2aa590c62319089c0e26368bd441f41c7468 Mon Sep 17 00:00:00 2001 From: yikeke Date: Fri, 10 Jul 2020 13:37:42 +0800 Subject: [PATCH 2/6] minor edits to improve format and unify doc styles --- subquery-optimization.md | 24 ++++++++++++------------ 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/subquery-optimization.md b/subquery-optimization.md index 397abb876d082..b231d129d53ef 100644 --- a/subquery-optimization.md +++ b/subquery-optimization.md @@ -1,10 +1,9 @@ --- -title: Subquery related optimizations -summary: Understand optimizations related to subqueries -category: performance +title: Subquery Related Optimizations +summary: Understand optimizations related to subqueries. --- -# Subquery related optimization +# Subquery Related Optimizations This article mainly introduces subquery related optimizations. @@ -16,20 +15,20 @@ Subqueries usually appear in the following situations: - `EXISTS (SELECT ... FROM ...)` - `... >/>=/ ANY (SELECT ... FROM ...)` -In this case, `ALL` and `ANY` can be replaced by `MAX` and `MIN`. When the table is empty, the result of `MAX(EXPR)` and `MIN(EXPR)` will be NULL, it works when the result of `EXPR` contains `NULL`. Whether the result of `EXPR` contains `NULL` may affect the final result of the expression, so the complete rewrite is given in the following form: +In this case, `ALL` and `ANY` can be replaced by `MAX` and `MIN`. When the table is empty, the result of `MAX(EXPR)` and `MIN(EXPR)` is NULL. It works the same when the result of `EXPR` contains `NULL`. Whether the result of `EXPR` contains `NULL` may affect the final result of the expression, so the complete rewrite is given in the following form: -- `t.id < all (select s.id from s)` will be rewritten as `t.id < min(s.id) and if(sum(s.id is null) != 0, null, true)`. -- `t.id < any (select s.id from s)` will be rewritten as `t.id < max(s.id) or if(sum(s.id is null) != 0, null, false)`. +- `t.id < all (select s.id from s)` is rewritten as `t.id < min(s.id) and if(sum(s.id is null) != 0, null, true)` +- `t.id < any (select s.id from s)` is rewritten as `t.id < max(s.id) or if(sum(s.id is null) != 0, null, false)` ## `... != ANY (SELECT ... FROM ...)` -In this case, if all the values from the subquery are distinct, it enough to compare the query with them. If the number of different values in the subquery is more than one, then there must be inequality. Therefore, such subqueries can be rewritten as follows: +In this case, if all the values from the subquery are distinct, it is enough to compare the query with them. If the number of different values in the subquery is more than one, then there must be inequality. Therefore, such subqueries can be rewritten as follows: - `select * from t where t.id != any (select s.id from s)` is rewritten as `select t.* from t, (select s.id, count(distinct s.id) as cnt_distinct from s) where (t.id != s.id or cnt_distinct > 1)` @@ -42,7 +41,8 @@ In this case, when the number of different values in the subquery is more than o ## `... IN (SELECT ... FROM ...)` In this case, the subquery of `IN` is rewritten into `SELECT ... FROM ... GROUP ...`, and then rewritten into the normal form of `JOIN`. -For example, `select * from t1 where t1.a in (select t2.a from t2)` will be rewritten as `select t1.* from t1, (select distinct(a) a from t2) t2 where t1.a = t2. The form of a`. The `DISTINCT` attribute here can be eliminated automatically if `t2.a` has the `UNIQUE` attribute. + +For example, `select * from t1 where t1.a in (select t2.a from t2)` is rewritten as `select t1.* from t1, (select distinct(a) a from t2) t2 where t1.a = t2. The form of a`. The `DISTINCT` attribute here can be eliminated automatically if `t2.a` has the `UNIQUE` attribute. {{< copyable "sql" >}} @@ -63,11 +63,11 @@ explain select * from t1 where t1.a in (select t2.a from t2); +------------------------------+---------+-----------+------------------------+----------------------------------------------------------------------------+ ``` -This rewrite will get better performance when the `IN` subquery is relatively small and the external query is relatively large, because without rewriting, using `index join` with t2 as the driving table is impossible. However, the disadvantage is that when the aggregation cannot be automatically eliminated during the rewritten and the `t2` table is relatively large, this rewrite will affect the performance of the query. Currently, the variable [tidb\_opt\_insubq\_to\_join\_and\_agg](/tidb-specific-system-variables.md#tidb_opt_insubq_to_join_and_agg) is used to control this optimization. When this optimization is not suitable, you can manually turn off it. +This rewrite gets better performance when the `IN` subquery is relatively small and the external query is relatively large, because without rewriting, using `index join` with t2 as the driving table is impossible. However, the disadvantage is that when the aggregation cannot be automatically eliminated during the rewrite and the `t2` table is relatively large, this rewrite affects the performance of the query. Currently, the variable [tidb\_opt\_insubq\_to\_join\_and\_agg](/tidb-specific-system-variables.md#tidb_opt_insubq_to_join_and_agg) is used to control this optimization. When this optimization is not suitable, you can manually disable it. ## `EXISTS` subquery and `... >/>=/}} From 4a81fcdcb344f33a4cda7365de35b7196b1cf854 Mon Sep 17 00:00:00 2001 From: yikeke Date: Fri, 10 Jul 2020 13:48:18 +0800 Subject: [PATCH 3/6] improve wording, format, links, etc. --- correlated-subquery-optimization.md | 17 ++++++++--------- 1 file changed, 8 insertions(+), 9 deletions(-) diff --git a/correlated-subquery-optimization.md b/correlated-subquery-optimization.md index 56b02ec5ccf40..1c46b42a7e39f 100644 --- a/correlated-subquery-optimization.md +++ b/correlated-subquery-optimization.md @@ -1,24 +1,23 @@ --- -title: Decorrelation of correlated subquery -summary: Understand how to decorrelate correlated subqueries -category: performance +title: Decorrelation of Correlated Subquery +summary: Understand how to decorrelate correlated subqueries. --- -# Decorrelation of correlated subquery +# Decorrelation of Correlated Subquery -[Subquery related optimization] (/subquery-optimization.md) describes how TiDB handles subqueries when there are no correlated columns. Decorrelation of correlated subquery is complex, this article introduces some simple scenarios and the scope that the optimization rule applies to. +[Subquery related optimizations](/subquery-optimization.md) describes how TiDB handles subqueries when there are no correlated columns. Because decorrelation of correlated subquery is complex, this article introduces some simple scenarios and the scope to which the optimization rule applies. ## Introduction -Take `select * from t1 where t1.a < (select sum(t2.a) from t2 where t2.b = t1.b)` as an example, the subquery `t1.a < (select sum(t2.a) from t2 where t2.b = t1.b)` here refers to the correlated column in the query condition `t2.b=t1.b`, this condition happens to be an equivalent condition, so the query can be rewritten as `select t1.* from t1, (select b, sum(a) sum_a from t2 group by b) t2 where t1.b = t2.b and t1.a < t2.sum_a;`. In this way, a correlated subquery is rewritten into `JOIN`. +Take `select * from t1 where t1.a < (select sum(t2.a) from t2 where t2.b = t1.b)` as an example. The subquery `t1.a < (select sum(t2.a) from t2 where t2.b = t1.b)` here refers to the correlated column in the query condition `t2.b=t1.b`, this condition happens to be an equivalent condition, so the query can be rewritten as `select t1.* from t1, (select b, sum(a) sum_a from t2 group by b) t2 where t1.b = t2.b and t1.a < t2.sum_a;`. In this way, a correlated subquery is rewritten into `JOIN`. The reason why TiDB needs to do this rewriting is that the correlated subquery is bound to its external query result every time the subquery is executed. In the above example, if `t1.a` has 10 million values, this subquery would repeat 10 million times, because the condition `t2.b=t1.b` varies with the value of `t1.a`. When the correlation is lifted somehow, this subquery would execute only once. ## Restrictions -The disadvantage of this rewriting is that when the correlation is not resolved, the optimizer can use the index on the correlated column. That is to say, although this subquery may repeat many times, the index can be used to filter data each time. While after using the rewriting rule, the position of the correlated column usually changes. Although the subquery is only executed once, the single execution time would be longer than that without decorrelation. +The disadvantage of this rewriting is that when the correlation is not lifted, the optimizer can use the index on the correlated column. That is, although this subquery may repeat many times, the index can be used to filter data each time. After using the rewriting rule, the position of the correlated column usually changes. Although the subquery is only executed once, the single execution time would be longer than that without decorrelation. -Therefore, when there are few external values, do not do decorrelation may bring better execution performance. At present, this optimization can be turned off by setting `subquery decorrelation` optimization rules in [blocklist of optimization rules and expression pushdown](/blacklist-control-plan.md). +Therefore, when there are few external values, do not perform decorrelation, because it may bring better execution performance. At present, this optimization can be disabled by setting `subquery decorrelation` optimization rules in [blocklist of optimization rules and expression pushdown](/blocklist-control-plan.md). ## Example @@ -47,7 +46,7 @@ explain select * from t1 where t1.a < (select sum(t2.a) from t2 where t2.b = t1. ``` -The above is an example where the optimization takes effect, `HashJoin_11` is a normal `inner join`. +The above is an example where the optimization takes effect. `HashJoin_11` is a normal `inner join`. Then, turn off the subquery decorrelation rules: From fae8e747719ade07d9ef844cc915bc2cdf1dcbea Mon Sep 17 00:00:00 2001 From: tiancaiamao Date: Mon, 13 Jul 2020 11:28:37 +0800 Subject: [PATCH 4/6] address comment --- TOC.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/TOC.md b/TOC.md index 1f1d4a3a1e183..33d3a5ac83e31 100644 --- a/TOC.md +++ b/TOC.md @@ -106,6 +106,8 @@ + [SQL Optimization Process](/sql-optimization-concepts.md) + Logic Optimization + [Join Reorder](/join-reorder.md) + + [Subquery Related Optimizations](/subquery-optimization.md) + + [Decorrelation of Correlated Subquery](/correlated-subquery-optimization.md) + Physical Optimization + [Statistics](/statistics.md) + Control Execution Plan From 0a43823973cc957daa1e8c2233d2ded9b9fd69b1 Mon Sep 17 00:00:00 2001 From: Keke Yi <40977455+yikeke@users.noreply.github.com> Date: Mon, 13 Jul 2020 17:22:38 +0800 Subject: [PATCH 5/6] Update TOC.md --- TOC.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/TOC.md b/TOC.md index 33d3a5ac83e31..e86318c0d40e2 100644 --- a/TOC.md +++ b/TOC.md @@ -105,9 +105,9 @@ + SQL Optimization + [SQL Optimization Process](/sql-optimization-concepts.md) + Logic Optimization - + [Join Reorder](/join-reorder.md) + [Subquery Related Optimizations](/subquery-optimization.md) + [Decorrelation of Correlated Subquery](/correlated-subquery-optimization.md) + + [Join Reorder](/join-reorder.md) + Physical Optimization + [Statistics](/statistics.md) + Control Execution Plan From 95e112cb658caea891dcea3a8c37400b5fa7c8ce Mon Sep 17 00:00:00 2001 From: yikeke Date: Wed, 15 Jul 2020 12:08:55 +0800 Subject: [PATCH 6/6] fix a link --- subquery-optimization.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/subquery-optimization.md b/subquery-optimization.md index b231d129d53ef..125b16b245a45 100644 --- a/subquery-optimization.md +++ b/subquery-optimization.md @@ -63,7 +63,7 @@ explain select * from t1 where t1.a in (select t2.a from t2); +------------------------------+---------+-----------+------------------------+----------------------------------------------------------------------------+ ``` -This rewrite gets better performance when the `IN` subquery is relatively small and the external query is relatively large, because without rewriting, using `index join` with t2 as the driving table is impossible. However, the disadvantage is that when the aggregation cannot be automatically eliminated during the rewrite and the `t2` table is relatively large, this rewrite affects the performance of the query. Currently, the variable [tidb\_opt\_insubq\_to\_join\_and\_agg](/tidb-specific-system-variables.md#tidb_opt_insubq_to_join_and_agg) is used to control this optimization. When this optimization is not suitable, you can manually disable it. +This rewrite gets better performance when the `IN` subquery is relatively small and the external query is relatively large, because without rewriting, using `index join` with t2 as the driving table is impossible. However, the disadvantage is that when the aggregation cannot be automatically eliminated during the rewrite and the `t2` table is relatively large, this rewrite affects the performance of the query. Currently, the variable [tidb\_opt\_insubq\_to\_join\_and\_agg](/system-variables.md#tidb_opt_insubq_to_join_and_agg) is used to control this optimization. When this optimization is not suitable, you can manually disable it. ## `EXISTS` subquery and `... >/>=/