
stale-read query latency 10x spike caused by information schema cache miss #53428

Open
crazycs520 opened this issue May 21, 2024 · 8 comments · Fixed by #53445 · May be fixed by #53620

Comments

@crazycs520
Contributor

crazycs520 commented May 21, 2024

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

High information schema cache miss rate.

Related metrics:

(metrics screenshots)

Analysis

Such a large number of load-snapshot-schema operations must be caused by InfoSchema cache misses, but what causes the cache misses?

Almost all TiDB instances generated a large number of load-snapshot-schema operations at the same time, which is odd. I picked one TiDB instance (tidb-11) with a large number of these operations for log analysis.

grep 'load InfoSchema' tidb-11.log

"[2024/05/09 15:00:46.337 +00:00] [INFO] [domain.go:224] [\"diff load InfoSchema success\"] [currentSchemaVersion=585441] [neededSchemaVersion=585442] [\"start time\"=1.734944ms] [phyTblIDs=\"[]\"] [actionTypes=\"[]\"]",

"[2024/05/09 15:00:46.521 +00:00] [INFO] [domain.go:224] [\"diff load InfoSchema success\"] [currentSchemaVersion=585442] [neededSchemaVersion=585443] [\"start time\"=503.56µs] [phyTblIDs=\"[]\"] [actionTypes=\"[]\"]",

"[2024/05/09 15:01:34.130 +00:00] [INFO] [domain.go:224] [\"diff load InfoSchema success\"] [currentSchemaVersion=585443] [neededSchemaVersion=585445] [\"start time\"=4.158976ms] [phyTblIDs=\"[454637]\"] [actionTypes=\"[8]\"]",

The logs contain schema versions 585442, 585443, and 585445, but 585444 does not exist. This hole causes schema cache misses, so TiDB has to load the snapshot schema from TiKV. The related code is the following:

if h.cache[i-1].infoschema.SchemaMetaVersion() == is.infoschema.SchemaMetaVersion()+1 && uint64(h.cache[i-1].timestamp) > ts {
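
To make the effect of the hole concrete, here is a minimal, self-contained sketch of a consecutive-version cache lookup. It is not the actual TiDB implementation; all names and types are illustrative assumptions.

```go
// Minimal sketch (NOT the real TiDB code) of a stale-read schema cache lookup
// that only trusts a cached entry when the next schema version is consecutive.
package main

import "fmt"

type schemaEntry struct {
	version   int64  // schema meta version
	timestamp uint64 // TS at which this version became visible
}

// getByTimestamp returns a cached schema valid at ts. Entries are sorted by
// version descending. Entry cache[i] is only known to cover ts if the next
// version cache[i-1] is exactly version+1 and became visible after ts. A hole
// in the version sequence (585443 -> 585445) breaks that check, so the lookup
// misses and the caller must load a snapshot schema from TiKV.
func getByTimestamp(cache []schemaEntry, ts uint64) (schemaEntry, bool) {
	for i := 1; i < len(cache); i++ {
		cur, next := cache[i], cache[i-1]
		if cur.timestamp <= ts && next.version == cur.version+1 && next.timestamp > ts {
			return cur, true
		}
	}
	return schemaEntry{}, false
}

func main() {
	cache := []schemaEntry{
		{version: 585445, timestamp: 300},
		{version: 585443, timestamp: 200}, // 585444 missing: the hole
		{version: 585442, timestamp: 100},
	}
	_, hit := getByTimestamp(cache, 250) // stale-read TS falls between 585443 and 585445
	fmt.Println("cache hit:", hit)       // false: forces a snapshot schema load
}
```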

Well, why does schema version 585444 not exist? In the current implementation, GenSchemaVersion uses a separate transaction to generate the schema version, while SetSchemaDiff uses another transaction. So if the GenSchemaVersion transaction succeeds but the SetSchemaDiff transaction fails, there will be a schema version gap, which is what happened in this issue.

Why use two separate transactions? This change was introduced by the PR: support concurrent ddl

tidb/pkg/ddl/ddl.go

Lines 448 to 453 in 5d990c6

err = kv.RunInNewTxn(kv.WithInternalSourceType(context.Background(), kv.InternalTxnDDL), store, true, func(_ context.Context, txn kv.Transaction) error {
var err error
m := meta.NewMeta(txn)
schemaVersion, err = m.GenSchemaVersion()
return err
})

err = t.SetSchemaDiff(diff)

tidb/domain/domain.go

Lines 423 to 424 in 2abf83d

// Empty diff means the txn of generating schema version is committed, but the txn of `runDDLJob` is not or fail.
// It is safe to skip the empty diff because the infoschema is new enough and consistent.
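
For illustration, here is a minimal, self-contained sketch of how allocating the schema version and writing the schema diff in two separate transactions can leave a hole. It is an assumed simplification, not the real TiDB code; metaStore and its methods are hypothetical stand-ins for the two transactions cited above.

```go
// Simplified sketch (NOT the real TiDB code) of the two-transaction schema
// version update and the window in which a version hole is created.
package main

import (
	"errors"
	"fmt"
)

type metaStore struct {
	nextVersion int64
	diffs       map[int64]string // schema version -> schema diff
}

// genSchemaVersion stands in for the first transaction: allocate and commit
// the next schema version.
func (s *metaStore) genSchemaVersion() int64 {
	s.nextVersion++
	return s.nextVersion
}

// setSchemaDiff stands in for the second transaction: write the diff for a
// previously allocated version. It can fail independently of the first step
// (write conflict, cancelled DDL job, crash).
func (s *metaStore) setSchemaDiff(version int64, diff string, fail bool) error {
	if fail {
		return errors.New("SetSchemaDiff txn failed")
	}
	s.diffs[version] = diff
	return nil
}

func main() {
	s := &metaStore{nextVersion: 585441, diffs: map[int64]string{}}

	v := s.genSchemaVersion()             // 585442
	_ = s.setSchemaDiff(v, "diff", false) // committed

	v = s.genSchemaVersion()              // 585443
	_ = s.setSchemaDiff(v, "diff", false) // committed

	v = s.genSchemaVersion()             // 585444: version allocated and committed...
	_ = s.setSchemaDiff(v, "diff", true) // ...but the diff transaction fails

	v = s.genSchemaVersion()              // 585445
	_ = s.setSchemaDiff(v, "diff", false) // committed

	// 585444 was consumed but has no diff: readers observe a hole.
	for ver := int64(585442); ver <= 585445; ver++ {
		_, ok := s.diffs[ver]
		fmt.Printf("schema version %d has diff: %v\n", ver, ok)
	}
}
```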

2. What did you expect to see? (Required)

No information schema cache miss.

3. What did you see instead (Required)

Information schema cache misses, which then cause many load-snapshot-schema operations for stale-read queries.

4. What is your TiDB version? (Required)

nightly: e1a6b1d

@crazycs520 crazycs520 added the type/bug This issue is a bug. label May 21, 2024
@crazycs520
Contributor Author

@lcwangchao PTAL

@lcwangchao lcwangchao added the component/ddl This issue is related to DDL of TiDB. label May 21, 2024
@D3Hunter
Contributor

Run `admin show ddl jobs` to check which DDL was cancelled or failed.

@crazycs520
Contributor Author

Run `admin show ddl jobs` to check which DDL was cancelled or failed.

It doesn't matter which DDL was cancelled or failed; the important thing is that there is a schema version gap that causes schema cache misses, which then cause the 10x stale-read query latency spike.

@crazycs520
Contributor Author

crazycs520 commented May 22, 2024

Another possible cause of this problem is when tryLoadSchemaDiffs loads multiple schema versions at one time, as in the following log:

[2024/05/24 09:00:09.638 +00:00] [INFO] [domain.go:287] ["diff load InfoSchema success"] [currentSchemaVersion=14899] [neededSchemaVersion=14901] [gap=2] ["start time"=3.152629ms] [gotSchemaVersion=14901] [phyTblIDs="[17245,17184]"] [actionTypes="[3,7]"] [diffTypes="[\"create table\",\"add index\"]"]

This is because a create table DDL job and an add index DDL job can be executed concurrently.
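
Roughly speaking, when diffs for several versions are applied in one load, only the final version gets a cache entry, so a stale read whose timestamp maps to the intermediate version (14900 here) can still miss. The sketch below is only an assumed simplification of tryLoadSchemaDiffs, not the actual code; cacheEntry and loadDiffs are hypothetical names.

```go
// Assumed simplification (NOT the actual tryLoadSchemaDiffs code): applying
// diffs for several versions in one pass yields a cache entry only for the
// final version, leaving the intermediate versions uncached.
package main

import "fmt"

type cacheEntry struct {
	version   int64
	timestamp uint64
}

// loadDiffs applies all diffs from current+1 up to needed in one pass and
// returns a single cache entry for the final version only.
func loadDiffs(current, needed int64, ts uint64) []cacheEntry {
	for v := current + 1; v <= needed; v++ {
		// apply the diff for version v (create table, add index, ...)
	}
	return []cacheEntry{{version: needed, timestamp: ts}}
}

func main() {
	entries := loadDiffs(14899, 14901, 400) // gap=2, as in the log above
	fmt.Println(entries)                    // only version 14901 is cached;
	// a stale read whose TS maps to version 14900 still misses the cache.
}
```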

@tiancaiamao
Contributor

I'd like to point out that in the new design of fast create table, the schema diff for create table could be missing.
And we will not have all the schema history eventually.

@D3Hunter
Contributor

I'd like to point out that in the new design of fast create table, the schema diff for create table could be missing. And we will not have all the schema history eventually.

You mean tidb_enable_fast_create_table? It sets the diff too, but it doesn't wait for the diff to be applied.

@ti-chi-bot ti-chi-bot bot closed this as completed in 0ac2ad0 May 27, 2024
@crazycs520 crazycs520 reopened this May 27, 2024
crazycs520 added a commit to crazycs520/tidb that referenced this issue May 27, 2024
…a version gap (pingcap#53445)

close pingcap#53428

Signed-off-by: crazycs520 <crazycs520@gmail.com>
RidRisR pushed a commit to RidRisR/tidb that referenced this issue May 27, 2024
crazycs520 added a commit to ti-chi-bot/tidb that referenced this issue May 27, 2024
crazycs520 added a commit to ti-chi-bot/tidb that referenced this issue May 27, 2024
ti-chi-bot bot pushed a commit that referenced this issue May 27, 2024
crazycs520 added a commit that referenced this issue May 28, 2024
…a version gap (#53445) (#53562)

* infoschema: fix issue of information schema cache miss cause by schema version gap (#53445)

close #53428

Signed-off-by: crazycs520 <crazycs520@gmail.com>
ti-chi-bot bot pushed a commit that referenced this issue May 28, 2024
@crazycs520
Contributor Author

Another possible cause of this problem is when tryLoadSchemaDiffs loads multiple schema versions at one time, as in the following log:

[2024/05/24 09:00:09.638 +00:00] [INFO] [domain.go:287] ["diff load InfoSchema success"] [currentSchemaVersion=14899] [neededSchemaVersion=14901] [gap=2] ["start time"=3.152629ms] [gotSchemaVersion=14901] [phyTblIDs="[17245,17184]"] [actionTypes="[3,7]"] [diffTypes="[\"create table\",\"add index\"]"]

This is because a create table DDL job and an add index DDL job can be executed concurrently.

#53620 should fix this.

mittalrishabh pushed a commit to mittalrishabh/tidb that referenced this issue May 30, 2024
…a version gap (pingcap#53445) (pingcap#53583) (pingcap#97)

close pingcap#53428

Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
@mittalrishabh
Contributor

mittalrishabh commented Jun 4, 2024

I have a few questions about this issue:

  1. When does the set schema diff transaction fail? Is it because of a conflict?
  2. Why does it generate such a large QPS on a TiKV node just to read metadata? Should there be a rate limit on the TiDB side? We have seen it generate 60k QPS on a single TiKV node.
  3. How is the schema cache repopulated after TiDB is rebooted?

ti-chi-bot bot pushed a commit that referenced this issue Jun 5, 2024