
tidb gc too slow after drop database with 60k tables #33069

Closed
fubinzh opened this issue Mar 15, 2022 · 7 comments · Fixed by #33082 or #44331

Comments


fubinzh commented Mar 15, 2022

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

  1. Use br to restore backup with 1 database, which contains 60K tables, 25T data.
  2. Drop data by executing "drop database xxx"
  3. Wait and check gc status

2. What did you expect to see? (Required)

GC should finish correctly and TiKV space should be reclaimed.

3. What did you see instead (Required)

  • drop database was executed at 2022-03-14 17:24:49
  • tikv_gc_last_run_time is 20220314-17:43:15, and GC had not finished after 20 hours. From mysql.gc_delete_range we can see that range deletion is very slow (~1 range per second).

(screenshot: gc_too_slow)

(screenshot: gc_delete_range_status)

  • from pd.log, we can see a huge number of logs related to "placement rule is removed" (over 70 million log lines after ~20 hours of GC):
[root@zhengrong-16 pd-33379]# grep "placement rule is removed" pd-2022-03-14T18-03-16.823.log | head
[2022/03/14 17:45:14.052 +08:00] [INFO] [rule_manager.go:257] ["placement rule is removed"] [group=tiflash] [id=table-100-r]
[2022/03/14 17:45:14.053 +08:00] [INFO] [rule_manager.go:257] ["placement rule is removed"] [group=tiflash] [id=table-1000-r]
[2022/03/14 17:45:14.053 +08:00] [INFO] [rule_manager.go:257] ["placement rule is removed"] [group=tiflash] [id=table-10000-r]
[2022/03/14 17:45:14.054 +08:00] [INFO] [rule_manager.go:257] ["placement rule is removed"] [group=tiflash] [id=table-10001-r]
[2022/03/14 17:45:14.054 +08:00] [INFO] [rule_manager.go:257] ["placement rule is removed"] [group=tiflash] [id=table-10002-r]

grep "placement rule is removed" *.log| wc -l
71076121

4. What is your TiDB version? (Required)

MySQL [(none)]> select tidb_version();
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tidb_version() |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Release Version: v6.0.0-alpha
Edition: Community
Git Commit Hash: 9fcc6b0
Git Branch: heads/refs/tags/v6.0.0-alpha
UTC Build Time: 2022-03-13 14:56:23
GoVersion: go1.16.4
Race Enabled: false
TiKV Min Version: v3.0.0-60965b006877ca7234adaced7890d7b029ed1306
Check Table Before Drop: false |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

@fubinzh fubinzh added type/bug This issue is a bug. severity/critical labels Mar 15, 2022
@ti-chi-bot ti-chi-bot added may-affects-4.0 This bug maybe affects 4.0.x versions. may-affects-5.0 This bug maybe affects 5.0.x versions. may-affects-5.1 This bug maybe affects 5.1.x versions. may-affects-5.2 This bug maybe affects 5.2.x versions. may-affects-5.3 This bug maybe affects 5.3.x versions. may-affects-5.4 This bug maybe affects 5.4.x versions. labels Mar 15, 2022
@tangenta
Contributor

The QPS drops by about 50% when the GC worker is started:

(screenshot: QPS graph)

TiDB log:

[2022/03/12 16:08:34.171 +00:00] [INFO] [gc_worker.go:701] ["[gc worker] start delete ranges"] [uuid=5fdf8fab22c0002] [ranges=64]
[2022/03/12 16:08:34.424 +00:00] [INFO] [gc_worker.go:1920] ["try delete TiFlash pd rule"] [tableID=147] [endKey="t\ufffd\u0000\u0000\u0000\u0000\u0000\u0000\ufffd"]

@tangenta
Contributor

Confirmed that this is introduced by #30842. When there are thousands of tables, thousands of HTTP requests are sent from TiDB in order to delete the placement rules...

for _, id := range physicalTableIDs {
	// Delete pd rule
	logutil.BgLogger().Info("try delete TiFlash pd rule", zap.Int64("tableID", id), zap.String("endKey", string(dr.EndKey)))
	ruleID := fmt.Sprintf("table-%v-r", id)
	if err := infosync.DeleteTiFlashPlacementRule(context.Background(), "tiflash", ruleID); err != nil {
		// If DeletePlacementRule fails here, the rule will be deleted in `HandlePlacementRuleRoutine`.
		logutil.BgLogger().Error("delete TiFlash pd rule failed when gc", zap.Error(err), zap.String("ruleID", ruleID))
	}
}

@tangenta
Contributor

@CalvinNeo PTAL, can we make a batch-delete API instead?

@MyonKeminta MyonKeminta removed may-affects-4.0 This bug maybe affects 4.0.x versions. may-affects-5.1 This bug maybe affects 5.1.x versions. may-affects-5.2 This bug maybe affects 5.2.x versions. may-affects-5.3 This bug maybe affects 5.3.x versions. may-affects-5.4 This bug maybe affects 5.4.x versions. may-affects-5.0 This bug maybe affects 5.0.x versions. labels Mar 15, 2022
@CalvinNeo
Member

CalvinNeo commented Mar 15, 2022

@CalvinNeo PTAL, can we make a batch-delete API instead?

In #33068, I am trying to avoid requesting PD when the table has no TiFlash replica set.
However:

  1. It is a DELETE request; after discussing with the PD team, there is no such batch API.
  2. There is also a problem: the nested loops `for _, r := range ranges { ... for _, id := range physicalTableIDs { ...` will repeatedly iterate over the table IDs if the job is DropSchema. I think this is a critical bug that needs to be fixed. AFAIC, it may also affect previous versions.
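The nested loops described above imply a quadratic blow-up in PD requests. A back-of-the-envelope sketch (my own arithmetic for illustration, not a measurement from this cluster; `worstCaseDeleteRequests` is a made-up name):

```go
package main

import "fmt"

// Rough worst-case request count implied by the nested loops
// (an estimate for illustration, not a measurement): a DropSchema
// over n tables leaves n entries in gc_delete_range, and the
// cleanup for each entry iterates all n physical table IDs again.
func worstCaseDeleteRequests(n int) int {
	return n * n
}

func main() {
	// 60,000 tables -> on the order of 3.6 billion DELETE requests.
	fmt.Println(worstCaseDeleteRequests(60000))
}
```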

@MyonKeminta
Contributor

MyonKeminta commented Mar 15, 2022

The doGCPlacementRules function is called for each range in deleteRanges; it loads the deleted table IDs by the job ID corresponding to the deleted range, then calls DeleteTiFlashPlacementRule for each table.
The problem is, if there are N tables in a database and we drop the database, there will be N ranges pending deletion with the same job ID. We then call doGCPlacementRules for each range, causing DeleteTiFlashPlacementRule to be executed N × N times. It doesn't seem to be something that can be fixed by batching alone.
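One way to collapse the N × N calls back to N would be to remember which job IDs have already been handled while iterating the ranges. A minimal sketch of that idea (illustrative only: `DeleteRange`, `tableIDsByJob`, and `deleteRule` are hypothetical stand-ins, not TiDB's real types, and this is not necessarily how the eventual fix works):

```go
package main

import "fmt"

// DeleteRange is a hypothetical stand-in for an entry in gc_delete_range.
type DeleteRange struct {
	JobID   int64
	TableID int64
}

// gcPlacementRules deduplicates by DDL job ID so placement-rule deletion
// runs once per job rather than once per delete range. It returns the
// number of rule deletions issued.
func gcPlacementRules(ranges []DeleteRange, tableIDsByJob map[int64][]int64, deleteRule func(ruleID string)) int {
	handled := make(map[int64]struct{})
	calls := 0
	for _, r := range ranges {
		if _, done := handled[r.JobID]; done {
			continue // rules for this job were already deleted
		}
		handled[r.JobID] = struct{}{}
		for _, id := range tableIDsByJob[r.JobID] {
			deleteRule(fmt.Sprintf("table-%v-r", id))
			calls++
		}
	}
	return calls
}

func main() {
	// One DropSchema job (ID 1) over 3 tables produces 3 ranges; with
	// dedup we issue 3 rule deletions instead of 9.
	ranges := []DeleteRange{{1, 100}, {1, 101}, {1, 102}}
	tables := map[int64][]int64{1: {100, 101, 102}}
	fmt.Println(gcPlacementRules(ranges, tables, func(string) {}))
}
```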

@bb7133
Member

bb7133 commented Mar 15, 2022

> Use br to restore backup with 1 database, which contains 60K tables, 25T data.
> Drop data by executing "drop database xxx"
> Wait and check gc status

@fubinzh Please add the TiFlash replica information in 'Minimal reproduce step', which is important for reproducing this issue.

@fubinzh
Author

fubinzh commented Mar 15, 2022

> Use br to restore backup with 1 database, which contains 60K tables, 25T data.
> Drop data by executing "drop database xxx"
> Wait and check gc status
>
> @fubinzh Please add the TiFlash replica information in 'Minimal reproduce step', which is important for reproducing this issue.

@bb7133 The data I used doesn't actually have TiFlash replicas: the tables in the BR backup don't have TiFlash replicas, and the TiDB cluster they were restored into doesn't have a TiFlash node deployed either.
