executor: implement disk-based hash join #12067

Merged
merged 14 commits into from Sep 24, 2019

@SunRunAway
Member

commented Sep 6, 2019

What problem does this PR solve?

part of #11607

What is changed and how it works?

When the memory usage of a query exceeds mem-quota-query, the hash join spills its data to disk.

I've written some slides that show how spilling to disk is triggered in this PR:
https://docs.google.com/presentation/d/1Sa9xNbDTPnLwnQHLKfpwksdYXWodisXPEqp-WR5Up0U/edit?usp=sharing
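
As a rough illustration of the idea (not the code in this PR), the sketch below buffers rows in memory, tracks their size against a quota, and moves everything to a temporary file once the quota is exceeded. The names `tracker`, `rowContainer`, and `rowSize` are hypothetical, and rows are reduced to plain int64 values to keep it short.

```go
package main

import (
	"encoding/gob"
	"fmt"
	"os"
)

// rowSize is the assumed in-memory size of one row (an int64) in bytes.
const rowSize int64 = 8

// tracker is a hypothetical, single-goroutine memory tracker.
type tracker struct {
	quota, consumed int64
	exceeded        bool // latched the first time consumed > quota
}

// Consume records n bytes (negative n releases memory).
func (t *tracker) Consume(n int64) {
	t.consumed += n
	if t.consumed > t.quota {
		t.exceeded = true
	}
}

// rowContainer is a hypothetical stand-in for the buffer of inner-side rows.
type rowContainer struct {
	tracker *tracker
	inMem   []int64      // rows still held in memory
	file    *os.File     // non-nil once the container has spilled
	enc     *gob.Encoder // writes rows to file
	onDisk  int          // number of rows written to disk
}

// Add stores a row in memory, or on disk once the container has spilled.
func (c *rowContainer) Add(row int64) error {
	if c.file != nil {
		c.onDisk++
		return c.enc.Encode(row)
	}
	c.inMem = append(c.inMem, row)
	c.tracker.Consume(rowSize)
	if !c.tracker.exceeded {
		return nil
	}
	return c.spill()
}

// spill writes all buffered rows to a temporary file and releases their memory.
func (c *rowContainer) spill() error {
	f, err := os.CreateTemp("", "hash-join-spill-*")
	if err != nil {
		return err
	}
	c.file, c.enc = f, gob.NewEncoder(f)
	for _, r := range c.inMem {
		if err := c.enc.Encode(r); err != nil {
			return err
		}
	}
	c.onDisk = len(c.inMem)
	c.tracker.Consume(-rowSize * int64(len(c.inMem)))
	c.inMem = nil
	return nil
}

// Close removes the spill file, if any.
func (c *rowContainer) Close() error {
	if c.file == nil {
		return nil
	}
	name := c.file.Name()
	if err := c.file.Close(); err != nil {
		return err
	}
	return os.Remove(name)
}

func main() {
	c := &rowContainer{tracker: &tracker{quota: 32}}
	defer c.Close()
	for i := int64(0); i < 10; i++ {
		if err := c.Add(i); err != nil {
			fmt.Println("add failed:", err)
			return
		}
	}
	fmt.Printf("spilled=%v in-memory rows=%d on-disk rows=%d tracked=%d bytes\n",
		c.file != nil, len(c.inMem), c.onDisk, c.tracker.consumed)
}
```

In this sketch the tracker only latches a flag and `Add` performs the spill itself, so that I/O errors can be returned to the caller. Whether the real executor structures it this way is not something this example claims.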

Here are the benchmark results:

go test -run=^$ -bench="BenchmarkHashJoinExec|BenchmarkBuildHashTableForList" -test.benchmem -count 5 > /tmp/bench.txt && benchstat /tmp/bench.txt
name                                                                                 time/op
HashJoinExec/(rows:100000,_concurency:4,_joinKeyIdx:_[0_1],_disk:false)-12            743ms ± 1%
HashJoinExec/(rows:100000,_concurency:4,_joinKeyIdx:_[0],_disk:false)-12              175ms ± 1%
HashJoinExec/(rows:100000,_concurency:4,_joinKeyIdx:_[0],_disk:true)-12               753ms ± 4%
HashJoinExec/(rows:1000,_concurency:4,_joinKeyIdx:_[0],_disk:true)-12                20.3ms ± 9%
BuildHashTableForList/(rows:10,_concurency:4,_joinKeyIdx:_[0_1],_disk:false)-12      56.0µs ± 1%
BuildHashTableForList/(rows:10,_concurency:4,_joinKeyIdx:_[0_1],_disk:true)-12        346µs ±13%
BuildHashTableForList/(rows:10,_concurency:4,_joinKeyIdx:_[0],_disk:false)-12        4.49µs ± 4%
BuildHashTableForList/(rows:10,_concurency:4,_joinKeyIdx:_[0],_disk:true)-12          290µs ± 7%
BuildHashTableForList/(rows:100000,_concurency:4,_joinKeyIdx:_[0_1],_disk:false)-12   516ms ± 0%
BuildHashTableForList/(rows:100000,_concurency:4,_joinKeyIdx:_[0_1],_disk:true)-12    1.00s ±10%
BuildHashTableForList/(rows:100000,_concurency:4,_joinKeyIdx:_[0],_disk:false)-12    9.42ms ± 2%
BuildHashTableForList/(rows:100000,_concurency:4,_joinKeyIdx:_[0],_disk:true)-12      481ms ± 6%

name                                                                                 alloc/op
HashJoinExec/(rows:100000,_concurency:4,_joinKeyIdx:_[0_1],_disk:false)-12            304MB ± 0%
HashJoinExec/(rows:100000,_concurency:4,_joinKeyIdx:_[0],_disk:false)-12              304MB ± 0%
HashJoinExec/(rows:100000,_concurency:4,_joinKeyIdx:_[0],_disk:true)-12               889MB ± 0%
HashJoinExec/(rows:1000,_concurency:4,_joinKeyIdx:_[0],_disk:true)-12                63.9MB ± 0%
BuildHashTableForList/(rows:10,_concurency:4,_joinKeyIdx:_[0_1],_disk:false)-12      2.37kB ± 0%
BuildHashTableForList/(rows:10,_concurency:4,_joinKeyIdx:_[0_1],_disk:true)-12       28.7kB ± 3%
BuildHashTableForList/(rows:10,_concurency:4,_joinKeyIdx:_[0],_disk:false)-12        2.37kB ± 0%
BuildHashTableForList/(rows:10,_concurency:4,_joinKeyIdx:_[0],_disk:true)-12         29.5kB ± 2%
BuildHashTableForList/(rows:100000,_concurency:4,_joinKeyIdx:_[0_1],_disk:false)-12  7.13MB ± 0%
BuildHashTableForList/(rows:100000,_concurency:4,_joinKeyIdx:_[0_1],_disk:true)-12   8.09MB ± 0%
BuildHashTableForList/(rows:100000,_concurency:4,_joinKeyIdx:_[0],_disk:false)-12    7.14MB ± 0%
BuildHashTableForList/(rows:100000,_concurency:4,_joinKeyIdx:_[0],_disk:true)-12     8.07MB ± 0%

name                                                                                 allocs/op
HashJoinExec/(rows:100000,_concurency:4,_joinKeyIdx:_[0_1],_disk:false)-12             307k ± 0%
HashJoinExec/(rows:100000,_concurency:4,_joinKeyIdx:_[0],_disk:false)-12               307k ± 0%
HashJoinExec/(rows:100000,_concurency:4,_joinKeyIdx:_[0],_disk:true)-12               1.51M ± 0%
HashJoinExec/(rows:1000,_concurency:4,_joinKeyIdx:_[0],_disk:true)-12                 16.5k ± 0%
BuildHashTableForList/(rows:10,_concurency:4,_joinKeyIdx:_[0_1],_disk:false)-12        28.0 ± 0%
BuildHashTableForList/(rows:10,_concurency:4,_joinKeyIdx:_[0_1],_disk:true)-12         68.0 ± 0%
BuildHashTableForList/(rows:10,_concurency:4,_joinKeyIdx:_[0],_disk:false)-12          28.0 ± 0%
BuildHashTableForList/(rows:10,_concurency:4,_joinKeyIdx:_[0],_disk:true)-12           68.0 ± 0%
BuildHashTableForList/(rows:100000,_concurency:4,_joinKeyIdx:_[0_1],_disk:false)-12   4.80k ± 2%
BuildHashTableForList/(rows:100000,_concurency:4,_joinKeyIdx:_[0_1],_disk:true)-12    5.25k ± 1%
BuildHashTableForList/(rows:100000,_concurency:4,_joinKeyIdx:_[0],_disk:false)-12     4.82k ± 0%
BuildHashTableForList/(rows:100000,_concurency:4,_joinKeyIdx:_[0],_disk:true)-12      5.25k ± 1%

The TPC-H query 13 result is here: #12067 (comment)

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
mysql> set tidb_mem_quota_query=1;
Query OK, 0 rows affected (0.00 sec)

mysql> explain analyze select /*+ TIDB_HJ(t1, t2) */  * from testtable t1, testtable t2 where t1.id=t2.id;
+------------------------+-------+------+----------------------------------------------------------------------+-----------------------------------+-----------------+
| id                     | count | task | operator info                                                        | execution info                    | memory          |
+------------------------+-------+------+----------------------------------------------------------------------+-----------------------------------+-----------------+
| HashLeftJoin_22        | 41.25 | root | inner join, inner:TableReader_27, equal:[eq(test.t1.id, test.t2.id)] | time:1.748537ms, loops:2, rows:33 | 65.6328125 KB   |
| ├─TableReader_25       | 33.00 | root | data:TableScan_24                                                    | time:850.28µs, loops:2, rows:33   | 1.0537109375 KB |
| │ └─TableScan_24       | 33.00 | cop  | table:t1, range:[-inf,+inf], keep order:false, stats:pseudo          | time:1ms, loops:2, rows:33        | N/A             |
| └─TableReader_27       | 33.00 | root | data:TableScan_26                                                    | time:751.669µs, loops:2, rows:33  | 1.0537109375 KB |
|   └─TableScan_26       | 33.00 | cop  | table:t2, range:[-inf,+inf], keep order:false, stats:pseudo          | time:1ms, loops:2, rows:33        | N/A             |
+------------------------+-------+------+----------------------------------------------------------------------+-----------------------------------+-----------------+
5 rows in set (0.01 sec)

Now we can see the following log in tidb-server:

[2019/09/18 20:35:31.576 +08:00] [INFO] [hash_table.go:294] ["memory exceeds quota, spill to disk now."] [memory="\n\"explain analyze select /*+ TIDB_HJ(t1, t2) */  * from testtable t1, testtable t2 where t1.id=t2.id\"{\n  \"quota\": 1 Bytes\n  \"consumed\": 1.056640625 KB\n  \"TableReader_25\"{\n    \"quota\": 32 GB\n    \"consumed\": 0 Bytes\n  }\n  \"TableReader_27\"{\n    \"quota\": 32 GB\n    \"consumed\": 1.056640625 KB\n  }\n  \"HashLeftJoin_22\"{\n    \"quota\": 32 GB\n    \"consumed\": 0 Bytes\n    \"hashJoin.innerResult\"{\n      \"consumed\": 0 Bytes\n    }\n  }\n}\n"]
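
The log above prints a tree of trackers: the query-level tracker has a quota of 1 byte, while each operator below it has its own quota and consumption. A minimal sketch of that kind of hierarchy, assuming consumption is propagated from a child to its ancestors, might look like the following. The `Tracker` type, its fields, and `NewTracker` are hypothetical and written only for this example; they are not the API used in the PR.

```go
package main

import "fmt"

// Tracker is a hypothetical, simplified memory tracker that forms a tree.
type Tracker struct {
	label    string
	quota    int64
	consumed int64
	parent   *Tracker
	onExceed func(t *Tracker) // called whenever consumed is above quota after a Consume
}

// NewTracker creates a tracker attached to parent (nil for the root).
func NewTracker(label string, quota int64, parent *Tracker) *Tracker {
	return &Tracker{label: label, quota: quota, parent: parent}
}

// Consume records n bytes on this tracker and on every ancestor,
// invoking onExceed on each tracker whose quota is exceeded.
func (t *Tracker) Consume(n int64) {
	for cur := t; cur != nil; cur = cur.parent {
		cur.consumed += n
		if cur.consumed > cur.quota && cur.onExceed != nil {
			cur.onExceed(cur)
		}
	}
}

func main() {
	const gb = int64(1) << 30

	// Mirrors the manual test: a 1-byte query quota, 32 GB per operator.
	query := NewTracker("query", 1, nil)
	reader := NewTracker("TableReader_27", 32*gb, query)

	query.onExceed = func(t *Tracker) {
		fmt.Printf("%s exceeds quota: consumed %d bytes, quota %d bytes\n",
			t.label, t.consumed, t.quota)
	}

	// 1082 bytes is the 1.056640625 KB reported for TableReader_27 in the log.
	reader.Consume(1082)
	fmt.Printf("%s consumed %d bytes\n", reader.label, reader.consumed)
}
```

Here the operator stays far below its own 32 GB quota, yet the callback on the query-level tracker fires because the same bytes are counted against the 1-byte query quota.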

Code changes

  • Has exported function/method change
  • Has exported variable/fields change

Side effects

  • Possible performance regression
  • Breaking backward compatibility

Related changes

  • Need to update the documentation

Release note

  • Write release note for bug-fix or new feature.
    1. A new config option, oom-use-tmp-storage, in tidb.toml (true by default) controls whether a hashJoin spills to disk (other executors will support spilling in the future). If a query with a hashJoin exceeds mem-quota-query, the hashJoin spills to the temporary directory, which can affect performance.
    2. To control which temporary directory the hash join uses, set the environment variable TMPDIR when starting tidb-server.
@SunRunAway SunRunAway requested review from zz-jason, fzhedu, qw4990 and XuHuaiyu Sep 6, 2019
@SunRunAway SunRunAway force-pushed the SunRunAway:disk-join-issue11607 branch 2 times, most recently from e82763e to 6911bd3 Sep 6, 2019
@SunRunAway

Member Author

commented Sep 6, 2019

/run-all-tests

@codecov

commented Sep 6, 2019

Codecov Report

Merging #12067 into master will not change coverage.
The diff coverage is n/a.

@@             Coverage Diff             @@
##             master     #12067   +/-   ##
===========================================
  Coverage   80.8753%   80.8753%           
===========================================
  Files           454        454           
  Lines         99547      99547           
===========================================
  Hits          80509      80509           
  Misses        13251      13251           
  Partials       5787       5787
@SunRunAway SunRunAway force-pushed the SunRunAway:disk-join-issue11607 branch from 4bfaeb1 to b550331 Sep 7, 2019
@SunRunAway

Member Author

commented Sep 7, 2019

TPC-H query 13

Ran query 13 on TPC-H 10G on my laptop (MacBook Pro 15-inch, 2019, 2.6GHz):

|         | time   | memory (top) | inuse_space (go profile) |
|---------|--------|--------------|--------------------------|
| default | 15.85s | 3180MB       | 1480MB                   |
| disk    | 34.64s | 1830MB       | 534MB                    |
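
To put the numbers above in relative terms, the following sketch (not part of the PR) computes the ratios between the two runs. The `run` type and the `ratio` helper are illustrative names only; the inputs are copied from the measurements above.

```go
package main

import "fmt"

// run holds one row of the measurements reported above.
// The struct and its field names are illustrative, not from the PR.
type run struct {
	name    string
	seconds float64 // wall-clock time of the query
	topMB   float64 // memory reported by top
	inuseMB float64 // inuse_space from the Go heap profile
}

// ratio returns how large a is relative to b.
func ratio(a, b float64) float64 { return a / b }

func main() {
	def := run{"default", 15.85, 3180, 1480}
	disk := run{"disk", 34.64, 1830, 534}

	fmt.Printf("%s vs %s\n", disk.name, def.name)
	fmt.Printf("time:        %.2fx\n", ratio(disk.seconds, def.seconds))
	fmt.Printf("top memory:  %.1f%% of default\n", 100*ratio(disk.topMB, def.topMB))
	fmt.Printf("inuse_space: %.1f%% of default\n", 100*ratio(disk.inuseMB, def.inuseMB))
}
```

With these inputs the spilling run takes roughly 2.2 times as long, while top reports a bit under 60% of the memory and the heap profile a bit over a third of the in-use space.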

default:

mysql> explain analyze select c_count, count(*) as custdist from ( select c_custkey, count(o_orderkey) as c_count from customer left outer join orders on c_custkey = o_custkey and o_comment not like '%pending%deposits%' group by c_custkey ) c_orders group by c_count order by custdist desc,c_count desc;
+--------------------------------+-------------+------+---------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------+-----------------------+
| id                             | count       | task | operator info                                                                                     | execution info                                                                            | memory                |
+--------------------------------+-------------+------+---------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------+-----------------------+
| Sort_9                         | 150000.00   | root | custdist:desc, c_orders.c_count:desc                                                              | time:15.838720948s, loops:2, rows:65                                                      | 2.7578125 KB          |
| └─Projection_11                | 150000.00   | root | c_count, 6_col_0                                                                                  | time:15.838681959s, loops:2, rows:65                                                      | N/A                   |
|   └─HashAgg_14                 | 150000.00   | root | group by:c_count, funcs:count(1), firstrow(c_count)                                               | time:15.838639251s, loops:2, rows:65                                                      | N/A                   |
|     └─HashAgg_17               | 150000.00   | root | group by:tpch.customer.c_custkey, funcs:count(tpch.orders.o_orderkey)                             | time:15.837279594s, loops:1466, rows:1500000                                              | N/A                   |
|       └─HashLeftJoin_20        | 2252162.08  | root | left outer join, inner:TableReader_25, equal:[eq(tpch.customer.c_custkey, tpch.orders.o_custkey)] | time:14.629918936s, loops:14982, rows:15338185                                            | 1.0599235333502293 GB |
|         ├─TableReader_22       | 150000.00   | root | data:TableScan_21                                                                                 | time:490.729044ms, loops:1466, rows:1500000                                               | 12.883949279785156 MB |
|         │ └─TableScan_21       | 150000.00   | cop  | table:customer, range:[-inf,+inf], keep order:false                                               | proc max:369ms, min:259ms, p80:369ms, p95:369ms, rows:1500000, iters:1482, tasks:4        | N/A                   |
|         └─TableReader_25       | 12000000.00 | root | data:Selection_24                                                                                 | time:8.412349583s, loops:14492, rows:14838130                                             | 236.17509269714355 MB |
|           └─Selection_24       | 12000000.00 | cop  | not(like(tpch.orders.o_comment, "%pending%deposits%", 92))                                        | proc max:2.98s, min:1.359s, p80:2.63s, p95:2.966s, rows:14838130, iters:14788, tasks:31   | N/A                   |
|             └─TableScan_23     | 15000000.00 | cop  | table:orders, range:[-inf,+inf], keep order:false                                                 | proc max:2.878s, min:1.302s, p80:2.536s, p95:2.863s, rows:15000000, iters:14788, tasks:31 | N/A                   |
+--------------------------------+-------------+------+---------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------+-----------------------+
10 rows in set (15.85 sec)

image-20190907002419761

(Lots of memory is used by NewChunkWithCapacity in fetchInnerRows)

disk:

mysql> explain analyze select /*+ MEMORY_QUOTA(10 MB) */ c_count, count(*) as custdist from ( select c_custkey, count(o_orderkey) as c_count from customer left outer join orders on c_custkey = o_custkey and o_comment not like '%pending%deposits%' group by c_custkey ) c_orders group by c_count order by custdist desc,c_count desc;
+--------------------------------+-------------+------+---------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------+-----------------------+
| id                             | count       | task | operator info                                                                                     | execution info                                                                           | memory                |
+--------------------------------+-------------+------+---------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------+-----------------------+
| Sort_9                         | 150000.00   | root | custdist:desc, c_orders.c_count:desc                                                              | time:34.525661988s, loops:2, rows:65                                                     | 2.7578125 KB          |
| └─Projection_11                | 150000.00   | root | c_count, 6_col_0                                                                                  | time:34.524866162s, loops:2, rows:65                                                     | N/A                   |
|   └─HashAgg_14                 | 150000.00   | root | group by:c_count, funcs:count(1), firstrow(c_count)                                               | time:34.524826727s, loops:2, rows:65                                                     | N/A                   |
|     └─HashAgg_17               | 150000.00   | root | group by:tpch.customer.c_custkey, funcs:count(tpch.orders.o_orderkey)                             | time:34.52306466s, loops:1466, rows:1500000                                              | N/A                   |
|       └─HashLeftJoin_20        | 2252162.08  | root | left outer join, inner:TableReader_25, equal:[eq(tpch.customer.c_custkey, tpch.orders.o_custkey)] | time:33.306195067s, loops:14982, rows:15338185                                           | 76.69921875 KB        |
|         ├─TableReader_22       | 150000.00   | root | data:TableScan_21                                                                                 | time:263.787279ms, loops:1466, rows:1500000                                              | 12.883944511413574 MB |
|         │ └─TableScan_21       | 150000.00   | cop  | table:customer, range:[-inf,+inf], keep order:false                                               | proc max:252ms, min:199ms, p80:252ms, p95:252ms, rows:1500000, iters:1482, tasks:4       | N/A                   |
|         └─TableReader_25       | 12000000.00 | root | data:Selection_24                                                                                 | time:5.559257104s, loops:14492, rows:14838130                                            | 638.274112701416 MB   |
|           └─Selection_24       | 12000000.00 | cop  | not(like(tpch.orders.o_comment, "%pending%deposits%", 92))                                        | proc max:2.072s, min:905ms, p80:1.911s, p95:1.988s, rows:14838130, iters:14788, tasks:31 | N/A                   |
|             └─TableScan_23     | 15000000.00 | cop  | table:orders, range:[-inf,+inf], keep order:false                                                 | proc max:1.969s, min:836ms, p80:1.804s, p95:1.87s, rows:15000000, iters:14788, tasks:31  | N/A                   |
+--------------------------------+-------------+------+---------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------+-----------------------+
10 rows in set (34.64 sec)

image-20190907004531303

(We can see that the memory usage in NewChunkWithCapacity disappears.)

@sre-bot

commented Sep 9, 2019

@SunRunAway

Member Author

commented Sep 10, 2019

I've moved the utility changes in this PR out to #12116. Please review that one first.

tools/check/go.mod (outdated, resolved)
@zz-jason zz-jason removed their request for review Sep 10, 2019
@XuHuaiyu XuHuaiyu removed their request for review Sep 11, 2019
@SunRunAway SunRunAway force-pushed the SunRunAway:disk-join-issue11607 branch from d4e0e09 to cbd9e30 Sep 11, 2019
@SunRunAway

Member Author

commented Sep 11, 2019

/run-all-tests

@SunRunAway SunRunAway requested review from zz-jason, qw4990 and XuHuaiyu Sep 11, 2019
@SunRunAway

Member Author

commented Sep 11, 2019

/run-all-tests

@SunRunAway

Member Author

commented Sep 11, 2019

/run-all-tests

@SunRunAway SunRunAway force-pushed the SunRunAway:disk-join-issue11607 branch from 11fa137 to 668bc27 Sep 11, 2019
@SunRunAway

Member Author

commented Sep 11, 2019

/run-all-tests

@sre-bot

commented Sep 20, 2019

@sre-bot

commented Sep 22, 2019

@qw4990

Contributor

commented Sep 23, 2019

/run-unit-test

executor/join.go (resolved)
executor/join.go (outdated, resolved)
executor/hash_table.go (outdated, resolved)
Member

left a comment

LGTM

@zz-jason zz-jason dismissed their stale review Sep 24, 2019

only one LGTM 😂

@SunRunAway SunRunAway added this to the v4.0.0 milestone Sep 24, 2019
@qw4990
qw4990 approved these changes Sep 24, 2019
Contributor

left a comment

LGTM

@sre-bot

commented Sep 24, 2019

/run-all-tests

@sre-bot

commented Sep 24, 2019

@SunRunAway merge failed.

@SunRunAway

Member Author

commented Sep 24, 2019

/run-unit-test

@SunRunAway

Member Author

commented Sep 24, 2019

/run-unit-test

@SunRunAway

Member Author

commented Sep 24, 2019

/run-unit-test

@SunRunAway

Member Author

commented Sep 24, 2019

/run-integration-ddl-test

@SunRunAway SunRunAway merged commit 1f92255 into pingcap:master Sep 24, 2019
13 checks passed
idc-jenkins-ci-tidb/build Jenkins job succeeded.
idc-jenkins-ci-tidb/build_check_race Jenkins job succeeded.
idc-jenkins-ci-tidb/check_dev Jenkins job succeeded.
idc-jenkins-ci-tidb/check_dev_2 Jenkins job succeeded.
idc-jenkins-ci-tidb/common-test job succeeded
idc-jenkins-ci-tidb/integration-common-test Jenkins job succeeded.
idc-jenkins-ci-tidb/integration-compatibility-test Jenkins job succeeded.
idc-jenkins-ci-tidb/integration-ddl-test Jenkins job succeeded.
idc-jenkins-ci-tidb/mybatis-test job succeeded
idc-jenkins-ci-tidb/sqllogic-test-1 Jenkins job succeeded.
idc-jenkins-ci-tidb/sqllogic-test-2 Jenkins job succeeded.
idc-jenkins-ci-tidb/unit-test Jenkins job succeeded.
license/cla Contributor License Agreement is signed.
@SunRunAway SunRunAway deleted the SunRunAway:disk-join-issue11607 branch Sep 24, 2019