Redo log: data was lost or damaged in some test cases, and sometimes changefeed failed: "redo log flush fail" #5486

Tammyxia · 2022-05-20T01:45:35Z

What did you do?

Create a changefeed with redo enabled, config:

[consistent]
level = "eventual"
storage = "s3://nfs/test-infra-redolog/redo-apply-cdc-all-node-restart-sync?access-key=minioadmin&secret-access-key=minioadmin&endpoint=http://minio.pingcap.net:9000&force-path-style=true"
max-log-size = 1
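
The `storage` value packs the bucket path, the MinIO endpoint, and the credentials into a single URI. As a minimal sketch (this is not TiCDC's own parsing code, and the variable names are illustrative), Python's standard library can split it apart to confirm which endpoint and options the changefeed is pointed at:

```python
from urllib.parse import urlsplit, parse_qs

storage = (
    "s3://nfs/test-infra-redolog/redo-apply-cdc-all-node-restart-sync"
    "?access-key=minioadmin&secret-access-key=minioadmin"
    "&endpoint=http://minio.pingcap.net:9000&force-path-style=true"
)

parts = urlsplit(storage)
# parse_qs returns a list per key; keep the first value of each.
params = {key: values[0] for key, values in parse_qs(parts.query).items()}

bucket = parts.netloc            # first component after "s3://"
prefix = parts.path.lstrip("/")  # object key prefix inside the bucket
endpoint = params["endpoint"]
force_path_style = params["force-path-style"] == "true"

print(bucket, prefix, endpoint, force_path_style)
```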

Run sysbench: sysbench oltp_insert prepare --tables=10 --table-size=500 --threads=10 && sysbench oltp_insert run --tables=10 --table-size=500 --threads=10
Run upstream cluster chaos step by step:

All PD Restart
All capture restart
All TiKV restart

What did you expect to see?

Data stays exactly the same between upstream and downstream, and the changefeed status eventually returns to normal.

What did you see instead?

Data was lost or damaged in these test cases:

redo_apply_cdc_allnode_restart_sync (failure ratio > 50%)

redo_apply_cdc_scale_sync (failure in v6.1.0-pre, http://rms.pingcap.net:31714/artifacts/testground/plan-exec-810836/plan-exec-810836-3745080740/main-logs)

Sometimes the changefeed failed. Case log: http://rms.pingcap.net:31714/artifacts/testground/plan-exec-840037/plan-exec-840037-493892576/main-logs
{
  "id": "redo-apply-cdc-all-node-restart-sync",
  "summary": {
    "state": "failed",
    "tso": 433315158137241606,
    "checkpoint": "2022-05-19 13:15:48.900",
    "error": {
      "addr": "upstream-ticdc-1.upstream-ticdc-peer.cdc-testbed-tps-840037-1-931.svc:8301",
      "code": "CDC:ErrProcessorUnknown",
      "message": "[CDC:ErrS3StorageAPI]s3 storage api: RequestCanceled: request context canceled\ncaused by: context deadline exceeded"
    }
  }
}
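
A test harness could detect this failure by inspecting the status JSON. The sketch below assumes the output has the shape shown above, trimmed to the relevant fields; `changefeed_failure` is a hypothetical helper and not part of any TiCDC tooling.

```python
import json

# Sample trimmed from the status output in this report.
status_text = """
{
  "id": "redo-apply-cdc-all-node-restart-sync",
  "summary": {
    "state": "failed",
    "error": {
      "code": "CDC:ErrProcessorUnknown",
      "message": "[CDC:ErrS3StorageAPI]s3 storage api: RequestCanceled: request context canceled\\ncaused by: context deadline exceeded"
    }
  }
}
"""

def changefeed_failure(text):
    """Return (code, message) if the changefeed state is "failed", else None."""
    summary = json.loads(text).get("summary", {})
    if summary.get("state") != "failed":
        return None
    error = summary.get("error") or {}
    return error.get("code"), error.get("message")

failure = changefeed_failure(status_text)
print(failure)
```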

cdc.log ERROR:
[2022/05/19 13:20:33.310 +00:00] [ERROR] [file.go:199] ["redo log flush fail"] [namespace=default] [changefeed=redo-apply-cdc-all-node-restart-sync] [error="[CDC:ErrS3StorageAPI]s3 storage api: RequestCanceled: request context canceled\ncaused by: context deadline exceeded"]
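
To pull the changefeed name and error out of a line like this when scanning cdc.log, a small parser helps. The following is a sketch only: the two regular expressions are assumptions written to fit the line above, not a complete implementation of the log format, and `parse_log_line` is a hypothetical name.

```python
import re

line = (
    r'[2022/05/19 13:20:33.310 +00:00] [ERROR] [file.go:199] '
    r'["redo log flush fail"] [namespace=default] '
    r'[changefeed=redo-apply-cdc-all-node-restart-sync] '
    r'[error="[CDC:ErrS3StorageAPI]s3 storage api: RequestCanceled: '
    r'request context canceled\ncaused by: context deadline exceeded"]'
)

# Leading sections: timestamp, level, source location, quoted message.
HEADER = re.compile(
    r'^\[(?P<time>[^\]]+)\] \[(?P<level>\w+)\] '
    r'\[(?P<source>[^\]]+)\] \["(?P<message>[^"]*)"\]'
)
# Trailing [key=value] sections; a value is either a quoted string
# (which may contain brackets and escapes) or a bare token.
FIELD = re.compile(r'\[(?P<key>[\w-]+)=(?P<value>"(?:[^"\\]|\\.)*"|[^\]\s]+)\]')

def parse_log_line(text):
    """Return a dict of header parts and key=value fields, or None."""
    header = HEADER.match(text)
    if header is None:
        return None
    record = header.groupdict()
    for field in FIELD.finditer(text, header.end()):
        record[field["key"]] = field["value"].strip('"')
    return record

record = parse_log_line(line)
print(record["level"], record["message"], record["changefeed"])
```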

Versions of the cluster

Upstream TiDB cluster version (execute SELECT tidb_version(); in a MySQL client):

(paste TiDB cluster version here)

Upstream TiKV version (execute tikv-server --version):

TiKV
Release Version:   6.1.0-alpha
Edition:           Community
Git Commit Hash:   2accd27c2a3882d3d45b6958565efb6de4d6e33c
Git Commit Branch: heads/refs/tags/v6.1.0-nightly
UTC Build Time:    2022-05-19 11:14:52
Rust Version:      rustc 1.60.0-nightly (1e12aef3f 2022-02-13)
Enable Features:   jemalloc mem-profiling portable sse test-engine-kv-rocksdb test-engine-raft-raft-engine cloud-aws cloud-gcp cloud-azure
Profile:           dist_release

TiCDC version (execute cdc version):

Release Version: v6.1.0-nightly
Git Commit Hash: 5e07c8b6bbefa7b087933ab2c12358cf5c05d76d
Git Branch: heads/refs/tags/v6.1.0-nightly
UTC Build Time: 2022-05-19 11:12:29
Go Version: go version go1.18.2 linux/amd64
Failpoint Build: false

CharlesCheung96 · 2022-05-29T13:52:41Z

Closed by #5621

Tammyxia added the type/bug and area/ticdc labels May 20, 2022

github-actions bot added this to Need Triage in Question and Bug Reports May 20, 2022

Tammyxia added the severity/major label May 20, 2022

ti-chi-bot added may-affects-4.0 may-affects-5.0 may-affects-5.1 may-affects-5.2 may-affects-5.3 may-affects-5.4 may-affects-6.0 labels May 20, 2022

Tammyxia assigned CharlesCheung96 May 20, 2022

VelocityLight added the affects-6.1 label May 20, 2022

Tammyxia changed the title ~~Redo log changefeed failed: "redo log flush fail" [CDC:ErrS3StorageAPI]s3 storage api: RequestCanceled~~ Redo log: data was lost or damaged in some test cases, and sometimes changefeed failed: "redo log flush fail" May 25, 2022

amyangfei removed may-affects-4.0 may-affects-5.1 may-affects-5.0 labels May 25, 2022

amyangfei mentioned this issue May 25, 2022

redo(ticdc): fix resolved moves too fast when part of tables are not maintained redo writer #5587

Merged

amyangfei added the affects-5.3 label May 25, 2022

ti-chi-bot removed the may-affects-5.3 label May 25, 2022

amyangfei added affects-5.4 may-affects-5.3 and removed may-affects-5.4 may-affects-5.2 may-affects-5.3 may-affects-6.0 labels May 25, 2022

amyangfei mentioned this issue May 25, 2022

redo(ticdc): use uuid in s3 log file to avoid name conflict #5595

Merged

ti-chi-bot pushed a commit that referenced this issue May 27, 2022

redo(ticdc): use uuid in s3 log file to avoid name conflict (#5595)

c467834

ref #5486

This was referenced May 27, 2022

redo(ticdc): use uuid in s3 log file to avoid name conflict (#5595) #5613

Merged

redo(ticdc): use uuid in s3 log file to avoid name conflict (#5595) #5614

Merged

ti-chi-bot mentioned this issue May 27, 2022

redo(ticdc): use uuid in s3 log file to avoid name conflict (#5595) #5615

Merged

ti-chi-bot added a commit that referenced this issue May 27, 2022

redo(ticdc): use uuid in s3 log file to avoid name conflict (#5595) (#…

679061c

…5615) ref #5486

ti-chi-bot closed this as completed in #5587 May 27, 2022

Question and Bug Reports automation moved this from Need Triage to Done May 27, 2022

ti-chi-bot pushed a commit that referenced this issue May 27, 2022

redo(ticdc): fix resolved moves too fast when part of tables are not …

80c1532

…maintained redo writer (#5587) close #5486

ti-chi-bot added a commit that referenced this issue May 27, 2022

redo(ticdc): fix resolved moves too fast when part of tables are not …

8394e79

…maintained redo writer (#5587) (#5619) close #5486

amyangfei reopened this May 28, 2022

Question and Bug Reports automation moved this from Done to In Progress May 28, 2022

amyangfei mentioned this issue May 28, 2022

redo(ticdc): fix a bug that flush log executed before writing logs #5621

Merged

ti-chi-bot pushed a commit that referenced this issue May 29, 2022

redo(ticdc): fix a bug that flush log executed before writing logs (#…

9b29eef

…5621) ref #5486

CharlesCheung96 closed this as completed May 29, 2022

Question and Bug Reports automation moved this from In Progress to Done May 29, 2022

ti-chi-bot added a commit that referenced this issue May 30, 2022

redo(ticdc): fix a bug that flush log executed before writing logs (#…

e4d5019

…5621) (#5630) ref #5486

ti-chi-bot added a commit that referenced this issue Jun 15, 2022

redo(ticdc): fix resolved moves too fast when part of tables are not …

8887939

…maintained redo writer (#5587) (#5617) close #5486

ti-chi-bot added a commit that referenced this issue Jun 15, 2022

redo(ticdc): fix a bug that flush log executed before writing logs (#…

8af519b

…5621) (#5628) ref #5486

ti-chi-bot added a commit that referenced this issue Jun 15, 2022

redo(ticdc): use uuid in s3 log file to avoid name conflict (#5595) (#…

97e9cde

…5613) ref #5486

ti-chi-bot added a commit that referenced this issue Jun 21, 2022

redo(ticdc): fix resolved moves too fast when part of tables are not …

24ffdd8

…maintained redo writer (#5587) (#5618) close #5486

ti-chi-bot added a commit that referenced this issue Jun 21, 2022

redo(ticdc): use uuid in s3 log file to avoid name conflict (#5595) (#…

4a0319b

…5614) ref #5486

This was referenced Jun 23, 2022

releases: add tidb 5.3.2 release notes pingcap/docs#9029

Merged

releases: add v5.3.2 release notes pingcap/docs-cn#9914

Merged

ti-chi-bot added a commit that referenced this issue Jun 24, 2022

redo(ticdc): fix a bug that flush log executed before writing logs (#…

f775a76

…5621) (#5629) ref #5486

Tammyxia added the found/automation label Aug 4, 2022
