Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

executor: exit all goroutines immediately when exceeded mem-quota #37405

Merged
merged 17 commits into from Sep 14, 2022

Conversation

wshwsh12
Copy link
Contributor

@wshwsh12 wshwsh12 commented Aug 26, 2022

What problem does this PR solve?

Issue Number: close #37379

Problem Summary:
Now if one sql use too many memory, it will panic and exit. But some goroutines maybe not exit immediately and continue to alloc memory.
We need exit all related goroutines and check it will not cause other problems, e.g. deadlock.

What is changed and how it works?

  1. Panic for all Consume() caller, if memory usage is more than quota.
  2. Inject some random panic and add some test
  3. Move iw.lookup.workerWg.Add(1) out from the new goroutine, to avoid panic "sync: WaitGroup is reused before previous Wait has returned"
  4. Remove defer closeBaseExecutor(&e.baseExecutor) from AggExec.Open(), to avoid memory leak.
    For example,
desc select job_meta, processing from mysql.tidb_ddl_job where job_id in (select min(job_id) from mysql.tidb_ddl_job group by schema_ids, table_ids) and not reorg  order by processing desc, job_id;
+--------------------------------------+----------+-----------+--------------------+-----------------------------------------------------------------------------------------------------------------------+
| id                                   | estRows  | task      | access object      | operator info                                                                                                         |
+--------------------------------------+----------+-----------+--------------------+-----------------------------------------------------------------------------------------------------------------------+
| Projection_13                        | 10.00    | root      |                    | mysql.tidb_ddl_job.job_meta, mysql.tidb_ddl_job.processing                                                            |
| └─Sort_14                        | 10.00    | root      |                    | mysql.tidb_ddl_job.processing:desc, mysql.tidb_ddl_job.job_id                                                         |
|   └─HashJoin_28                  | 10.00    | root      |                    | inner join, equal:[eq(mysql.tidb_ddl_job.job_id, Column#15)]                                                          |
|     ├─TableReader_41(Build)      | 10.00    | root      |                    | data:Selection_40                                                                                                     |
|     │ └─Selection_40           | 10.00    | cop[tikv] |                    | not(istrue_with_null(mysql.tidb_ddl_job.reorg))                                                                       |
|     │   └─TableFullScan_39     | 10000.00 | cop[tikv] | table:tidb_ddl_job | keep order:false, stats:pseudo                                                                                        |
|     └─HashAgg_29(Probe)          | 6400.00  | root      |                    | group by:Column#15, funcs:firstrow(Column#15)->Column#15                                                              |
|       └─Selection_30             | 6400.00  | root      |                    | not(isnull(Column#15))                                                                                                |
|         └─HashAgg_35             | 8000.00  | root      |                    | group by:mysql.tidb_ddl_job.schema_ids, mysql.tidb_ddl_job.table_ids, funcs:min(Column#18)->Column#15                 |
|           └─TableReader_36       | 8000.00  | root      |                    | data:HashAgg_31                                                                                                       |
|             └─HashAgg_31         | 8000.00  | cop[tikv] |                    | group by:mysql.tidb_ddl_job.schema_ids, mysql.tidb_ddl_job.table_ids, funcs:min(mysql.tidb_ddl_job.job_id)->Column#18 |
|               └─TableFullScan_34 | 10000.00 | cop[tikv] | table:tidb_ddl_job | keep order:false, stats:pseudo                                                                                        |
+--------------------------------------+----------+-----------+--------------------+-----------------------------------------------------------------------------------------------------------------------+
12 rows in set (0.00 sec)

If HashAgg_35 panic in Open(), HashAgg_29 will recover and call Close() function in closeBaseExecutor
Now, we have two ways to do follow-up resource release work

  1. Do nothing. But TableReader_41 is opened, and fetching data. There are some goroutine(or resource) leak.
  2. Try to close all executors from Projection_13. HashAgg_35 will Close() twice and throw panic.

Both of these methods have problems, so I remove this function closeBaseExecutor and Close() Projection_13 directly.

Benchmark

TPCH-50g q18

image

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

@ti-chi-bot
Copy link
Member

ti-chi-bot commented Aug 26, 2022

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • XuHuaiyu
  • tiancaiamao

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-chi-bot ti-chi-bot added release-note-none size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Aug 26, 2022
@sre-bot
Copy link
Contributor

sre-bot commented Aug 26, 2022

@wshwsh12 wshwsh12 force-pushed the random-panic-test branch 2 times, most recently from cc028ea to a0b9096 Compare August 30, 2022 10:02
@wshwsh12 wshwsh12 changed the title executor: add some random panic test executor: exit all goroutines immediately when exceeded mem-quota Aug 31, 2022
"select /*+ HASH_JOIN(t1) */ /*+ USE_INDEX(t1,idx) */ /*+ USE_INDEX(t2,idx) */ * from t t1 join t t2 on t1.a=t2.a", // HashJoin
"select /*+ MERGE_JOIN(t1) */ /*+ USE_INDEX(t1,idx) */ /*+ USE_INDEX(t2,idx) */ * from t t1 join t t2 on t1.a=t2.a", // Shuffle+MergeJoin
"select /*+ INL_JOIN(t2) */ * from t t1 join t t2 on t1.a=t2.a;", // Index Join
"select /*+ INL_HASH_JOIN(t2) */ * from t t1 join t t2 on t1.a=t2.a;", // Index Hash Join
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems we do not inject the random panic for inl_hash_join?

@ti-chi-bot ti-chi-bot added the status/LGT1 Indicates that a PR has LGTM 1. label Sep 14, 2022
@ti-chi-bot ti-chi-bot added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels Sep 14, 2022
@wshwsh12
Copy link
Contributor Author

/merge

@ti-chi-bot
Copy link
Member

This pull request has been accepted and is ready to merge.

Commit hash: 66fa79b

@ti-chi-bot ti-chi-bot added the status/can-merge Indicates a PR has been approved by a committer. label Sep 14, 2022
@ti-chi-bot ti-chi-bot merged commit 5a8e1b2 into pingcap:master Sep 14, 2022
@sre-bot
Copy link
Contributor

sre-bot commented Sep 14, 2022

TiDB MergeCI notify

✅ Well Done! New fixed [1] after this pr merged.

CI Name Result Duration Compare with Parent commit
idc-jenkins-ci-tidb/integration-common-test 🔴 failed 2, success 15, total 17 8 min 4 sec Existing failure
idc-jenkins-ci-tidb/integration-ddl-test ✅ all 6 tests passed 22 min Fixed
idc-jenkins-ci/integration-cdc-test 🟢 all 37 tests passed 27 min Existing passed
idc-jenkins-ci-tidb/common-test 🟢 all 11 tests passed 8 min 25 sec Existing passed
idc-jenkins-ci-tidb/tics-test 🟢 all 1 tests passed 5 min 37 sec Existing passed
idc-jenkins-ci-tidb/sqllogic-test-2 🟢 all 28 tests passed 4 min 38 sec Existing passed
idc-jenkins-ci-tidb/sqllogic-test-1 🟢 all 26 tests passed 4 min 14 sec Existing passed
idc-jenkins-ci-tidb/mybatis-test 🟢 all 1 tests passed 3 min 14 sec Existing passed
idc-jenkins-ci-tidb/integration-compatibility-test 🟢 all 1 tests passed 2 min 54 sec Existing passed
idc-jenkins-ci-tidb/plugin-test 🟢 build success, plugin test success 4min Existing passed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-note-none size/L Denotes a PR that changes 100-499 lines, ignoring generated files. status/can-merge Indicates a PR has been approved by a committer. status/LGT2 Indicates that a PR has LGTM 2. type/enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Immediately exit the sql when OOM
6 participants