Redo log: data was lost or damaged in some test cases, and sometimes changefeed failed: "redo log flush fail" #5486

Closed
Tammyxia opened this issue May 20, 2022 · 1 comment · Fixed by #5587
Labels
affects-5.3 affects-5.4 affects-6.1 area/ticdc Issues or PRs related to TiCDC. found/automation Bugs found by automation cases severity/major This is a major bug. type/bug This is a bug.

Tammyxia commented May 20, 2022

What did you do?

  • Create a changefeed with redo enabled, config:

[consistent]
level = "eventual"
storage = "s3://nfs/test-infra-redolog/redo-apply-cdc-all-node-restart-sync?access-key=minioadmin&secret-access-key=minioadmin&endpoint=http://minio.pingcap.net:9000&force-path-style=true"
max-log-size = 1
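
To make the storage setting above easier to read, here is a minimal sketch that splits the URI into its parts with Python's generic URL parser. It is not how TiCDC itself parses the value (that code is not shown in this report); the function name `describe_redo_storage` and the idea that the host part is the bucket are assumptions for illustration only.

```python
from urllib.parse import urlsplit, parse_qs


def describe_redo_storage(uri: str) -> dict:
    """Split a redo-log storage URI into readable parts (hypothetical helper)."""
    parts = urlsplit(uri)
    # parse_qs returns a list per key; each key appears once here, so take the first value.
    query = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return {
        "scheme": parts.scheme,
        "bucket": parts.netloc,            # assumption: host part is the bucket
        "prefix": parts.path.lstrip("/"),
        "endpoint": query.get("endpoint"),
        "force_path_style": query.get("force-path-style") == "true",
    }


uri = (
    "s3://nfs/test-infra-redolog/redo-apply-cdc-all-node-restart-sync"
    "?access-key=minioadmin&secret-access-key=minioadmin"
    "&endpoint=http://minio.pingcap.net:9000&force-path-style=true"
)
info = describe_redo_storage(uri)
for key, value in info.items():
    print(f"{key}: {value}")
```

The point of the breakdown is that the redo log here is written to an S3-compatible endpoint (MinIO) rather than to AWS, which is the storage the flush error below is reported against.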

  • Run sysbench: sysbench oltp_insert prepare --tables=10 --table-size=500 --threads=10 && sysbench oltp_insert run --tables=10 --table-size=500 --threads=10

  • Run chaos operations on the upstream cluster, one step at a time:

Restart all PD nodes
Restart all captures
Restart all TiKV nodes

What did you expect to see?

  • Data stays exactly the same between upstream and downstream, and the changefeed status eventually returns to normal.
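
As an illustration of what "exactly the same" means in this expectation, here is a minimal sketch of an order-independent table comparison. The checker the test case actually uses is not shown in this report; `table_digest` and `tables_match` are hypothetical names, and the rows are made-up sample data.

```python
import hashlib


def table_digest(rows):
    """Return (row count, digest) for a table, ignoring row order."""
    row_hashes = sorted(
        hashlib.sha256(repr(tuple(row)).encode("utf-8")).hexdigest()
        for row in rows
    )
    combined = hashlib.sha256("\n".join(row_hashes).encode("utf-8")).hexdigest()
    return len(row_hashes), combined


def tables_match(upstream_rows, downstream_rows):
    """True only when both sides hold the same multiset of rows."""
    return table_digest(upstream_rows) == table_digest(downstream_rows)


# Made-up sample rows, not taken from the failing test run.
upstream = [(1, "a"), (2, "b"), (3, "c")]
downstream_reordered = [(3, "c"), (1, "a"), (2, "b")]
downstream_lost_row = [(1, "a"), (2, "b")]
downstream_damaged = [(1, "a"), (2, "x"), (3, "c")]

print(tables_match(upstream, downstream_reordered))
print(tables_match(upstream, downstream_lost_row))
print(tables_match(upstream, downstream_damaged))
```

Both failure modes named in this issue, a lost row and a damaged row, make the digests differ, while a different row order alone does not.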


What did you see instead?

  • Data was lost or damaged in these test cases:

redo_apply_cdc_allnode_restart_sync (failure ratio > 50%)

redo_apply_cdc_scale_sync (failure in v6.1.0-pre, http://rms.pingcap.net:31714/artifacts/testground/plan-exec-810836/plan-exec-810836-3745080740/main-logs)

  • Sometimes the changefeed failed. Case log: http://rms.pingcap.net:31714/artifacts/testground/plan-exec-840037/plan-exec-840037-493892576/main-logs

{
  "id": "redo-apply-cdc-all-node-restart-sync",
  "summary": {
    "state": "failed",
    "tso": 433315158137241606,
    "checkpoint": "2022-05-19 13:15:48.900",
    "error": {
      "addr": "upstream-ticdc-1.upstream-ticdc-peer.cdc-testbed-tps-840037-1-931.svc:8301",
      "code": "CDC:ErrProcessorUnknown",
      "message": "[CDC:ErrS3StorageAPI]s3 storage api: RequestCanceled: request context canceled\ncaused by: context deadline exceeded"
    }

cdc.log ERROR:
[2022/05/19 13:20:33.310 +00:00] [ERROR] [file.go:199] ["redo log flush fail"] [namespace=default] [changefeed=redo-apply-cdc-all-node-restart-sync] [error="[CDC:ErrS3StorageAPI]s3 storage api: RequestCanceled: request context canceled\ncaused by: context deadline exceeded"]
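
For anyone scanning cdc.log for the same failure, here is a rough sketch that pulls the fields out of a line shaped like the one above. The regular expressions are written against this single sample line only and are an assumption about the layout, not a full parser for the log format; `parse_log_line` and `unquote` are hypothetical helper names.

```python
import re

# Bracketed header: [time] [LEVEL] [source] ["message"]
HEADER = re.compile(
    r'^\[(?P<time>[^\]]+)\] \[(?P<level>[A-Z]+)\] \[(?P<source>[^\]]+)\] '
    r'\[(?P<message>"(?:[^"\\]|\\.)*"|[^\]]*)\]'
)
# Trailing [key=value] fields; a quoted value may itself contain brackets.
FIELD = re.compile(r'\[(?P<key>[^=\]\s]+)=(?P<value>"(?:[^"\\]|\\.)*"|[^\]]*)\]')


def unquote(text):
    """Drop one pair of surrounding double quotes, leaving escapes untouched."""
    if len(text) >= 2 and text[0] == '"' and text[-1] == '"':
        return text[1:-1]
    return text


def parse_log_line(line):
    header = HEADER.match(line)
    if header is None:
        raise ValueError("unrecognized log line")
    fields = {
        m.group("key"): unquote(m.group("value"))
        for m in FIELD.finditer(line, header.end())
    }
    return {
        "time": header.group("time"),
        "level": header.group("level"),
        "source": header.group("source"),
        "message": unquote(header.group("message")),
        "fields": fields,
    }


line = r'[2022/05/19 13:20:33.310 +00:00] [ERROR] [file.go:199] ["redo log flush fail"] [namespace=default] [changefeed=redo-apply-cdc-all-node-restart-sync] [error="[CDC:ErrS3StorageAPI]s3 storage api: RequestCanceled: request context canceled\ncaused by: context deadline exceeded"]'
parsed = parse_log_line(line)
code = re.search(r"\[(CDC:\w+)\]", parsed["fields"]["error"]).group(1)
print(parsed["level"], parsed["message"], parsed["fields"]["changefeed"], code)
```

Note that the log line carries CDC:ErrS3StorageAPI, while the changefeed status reports the outer code CDC:ErrProcessorUnknown with the S3 error embedded in its message.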

Versions of the cluster

Upstream TiDB cluster version (execute SELECT tidb_version(); in a MySQL client):

(paste TiDB cluster version here)

Upstream TiKV version (execute tikv-server --version):

TiKV
Release Version:   6.1.0-alpha
Edition:           Community
Git Commit Hash:   2accd27c2a3882d3d45b6958565efb6de4d6e33c
Git Commit Branch: heads/refs/tags/v6.1.0-nightly
UTC Build Time:    2022-05-19 11:14:52
Rust Version:      rustc 1.60.0-nightly (1e12aef3f 2022-02-13)
Enable Features:   jemalloc mem-profiling portable sse test-engine-kv-rocksdb test-engine-raft-raft-engine cloud-aws cloud-gcp cloud-azure
Profile:           dist_release

TiCDC version (execute cdc version):

Release Version: v6.1.0-nightly
Git Commit Hash: 5e07c8b6bbefa7b087933ab2c12358cf5c05d76d
Git Branch: heads/refs/tags/v6.1.0-nightly
UTC Build Time: 2022-05-19 11:12:29
Go Version: go version go1.18.2 linux/amd64
Failpoint Build: false
@Tammyxia Tammyxia added type/bug This is a bug. area/ticdc Issues or PRs related to TiCDC. labels May 20, 2022
@github-actions github-actions bot added this to Need Triage in Question and Bug Reports May 20, 2022
@Tammyxia Tammyxia added the severity/major This is a major bug. label May 20, 2022
@Tammyxia Tammyxia changed the title Redo log changefeed failed: "redo log flush fail" [CDC:ErrS3StorageAPI]s3 storage api: RequestCanceled Redo log: data was lost or damaged in some test cases, and sometimes changefeed failed: "redo log flush fail" May 25, 2022
Question and Bug Reports automation moved this from Need Triage to Done May 27, 2022
ti-chi-bot pushed a commit that referenced this issue May 27, 2022
ti-chi-bot added a commit that referenced this issue May 27, 2022
@amyangfei amyangfei reopened this May 28, 2022
Question and Bug Reports automation moved this from Done to In Progress May 28, 2022
@CharlesCheung96
Contributor

Closed by #5621

Question and Bug Reports automation moved this from In Progress to Done May 29, 2022
ti-chi-bot added a commit that referenced this issue Jun 15, 2022
ti-chi-bot added a commit that referenced this issue Jun 21, 2022
@Tammyxia Tammyxia added the found/automation Bugs found by automation cases label Aug 4, 2022
5 participants