ticdc lag reached more than 10min and ticdc crash when inject pdleader data io hang #9054

Closed
Lily2025 opened this issue May 25, 2023 · 13 comments · Fixed by #10881
Labels
affects-7.5 area/ticdc Issues or PRs related to TiCDC. severity/major This is a major bug. type/bug This is a bug.

Comments

@Lily2025

Lily2025 commented May 25, 2023

What did you do?

1. Run TPC-C with 10 threads and 1000 warehouses.
2. After 10 minutes, inject a fault that makes the PD leader's IO hang while its pod stays active.
   Fault start time: 2023-05-15 12:05:38
3. After another 10 minutes, recover from the fault.
   Fault recovery time: 2023-05-15 12:15:38

What did you expect to see?

Changefeed lag stays below 30s.

What did you see instead?

1. Changefeed lag exceeded 10 minutes after the fault was injected.
2. TiCDC crashed.

image

image

image

Versions of the cluster

git hash:1335f98cdbaf77239bbcbc6b61561e4254449ffe

current status of DM cluster (execute query-status <task-name> in dmctl)

No response

@Lily2025 Lily2025 added area/dm Issues or PRs related to DM. type/bug This is a bug. labels May 25, 2023
@github-actions github-actions bot added this to Need Triage in Question and Bug Reports May 25, 2023
@Lily2025
Author

/remove-area dm
/area ticdc

@ti-chi-bot ti-chi-bot bot added area/ticdc Issues or PRs related to TiCDC. and removed area/dm Issues or PRs related to DM. labels May 25, 2023
@Lily2025
Author

/severity major

@asddongmen
Contributor

#9106 alleviates this problem. In testing with this PR merged, there is only a 50% chance of encountering the cdc stuck issue.

@nongfushanquan
Contributor

/assign @fubinzh

@zhangjinpeng87
Contributor

What is the root cause of the lag? TiCDC doesn't depend on the PD leader's IO; it just asynchronously pushes the checkpoint to PD. cc @nongfushanquan
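
To make the premise of this question concrete, here is a minimal sketch of a loop whose local progress never waits on a slow remote write. It is not TiCDC code: `push_checkpoint`, `run_owner`, the in-memory `store`, and the timeout value are all hypothetical stand-ins, and the slow PD leader is simulated with a sleep.

```python
import asyncio


async def push_checkpoint(store, ts, delay):
    """Hypothetical stand-in for writing a checkpoint to PD."""
    await asyncio.sleep(delay)  # simulated IO latency on the PD leader
    store["checkpoint"] = ts


async def run_owner(delays, timeout=0.05):
    """Advance a local checkpoint each tick and push it with a bounded wait.

    `delays` holds the simulated remote latency (seconds) for each tick.
    Returns (local checkpoint, last checkpoint stored remotely, timed-out pushes).
    """
    store, local_ts, timed_out = {}, 0, 0
    for delay in delays:
        local_ts += 1  # local progress does not depend on the remote store
        try:
            await asyncio.wait_for(push_checkpoint(store, local_ts, delay), timeout)
        except asyncio.TimeoutError:
            timed_out += 1  # push abandoned; a later tick pushes a newer value
    return local_ts, store.get("checkpoint"), timed_out


if __name__ == "__main__":
    # Two ticks hit a hung remote store; the local checkpoint still advances.
    print(asyncio.run(run_owner([0, 10, 10, 0])))
```

If checkpoint pushes were the only interaction with PD, a hung PD leader would at worst leave the stored checkpoint stale, as in this sketch. The sketch says nothing about other PD or etcd dependencies a real TiCDC node may have.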

@fubinzh

fubinzh commented Jun 27, 2023

This issue is still seen with v7.2.0.
chaos injection time: 2023-06-21 19:14 - 19:19
7ecce631-bb00-4f54-99b9-eef0496bbdce
4457effb-e6c1-40c3-b5c8-757dc94a2a26

@nongfushanquan
Contributor

nongfushanquan commented Jul 3, 2023

TiCDC nodes have to keep a connection with PD, but in this scenario PD can't read the leader's information from etcd, which may be caused by the following issue:
etcd-io/etcd#12528.

@asddongmen
Contributor

asddongmen commented Jul 19, 2023

It's an etcd issue; we can't fix it now.

@Lily2025
Author

TiCDC crashes when a 1s IO delay is injected on the PD leader:
[2023/10/18 17:36:14.719 +08:00] [INFO] [chaos.go:64] ["Run chaos"] [name=iochaos_io_delay] [selectors="[endless-ha-test-ticdc-tps-3240159-1-638/tc-pd-2]"] [selectorsRetainPolicy(selectors)="[endless-ha-test-ticdc-tps-3240159-1-638/tc-pd-2]"] [targetSelectors="[nil]"] [targetSelectorsRetainPolicy(targetSelectors)="[nil]"] [experimentSpec="IODelaySpec{Duration: "", Scheduler: , Delay: "1s", Path: "/var/lib/pd/data/**/*", Percent: 100}"]
image
image

@nongfushanquan
Contributor

nongfushanquan commented Nov 1, 2023

/remove-label affects-7.5

Contributor

ti-chi-bot bot commented Nov 1, 2023

@nongfushanquan: The label(s) affect-7.5 cannot be applied. These labels are supported: duplicate, bug-from-internal-test, bug-from-user, ok-to-test, needs-ok-to-test, affects-5.3, affects-5.4, affects-6.1, affects-6.5, affects-7.1, affects-7.5, may-affects-5.3, may-affects-5.4, may-affects-6.1, may-affects-6.5, may-affects-7.1, may-affects-7.5, needs-cherry-pick-release-5.3, needs-cherry-pick-release-5.4, needs-cherry-pick-release-6.1, needs-cherry-pick-release-6.5, needs-cherry-pick-release-7.1, needs-cherry-pick-release-7.5, question, release-blocker, wontfix.

In response to this:

/remove-label affect-7.5

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@ti-chi-bot ti-chi-bot bot removed the affects-7.5 label Nov 1, 2023
@Lily2025 Lily2025 changed the title from "ticdc lag reached more than 10min when run ha_pdleader_data_io_hang and ticdc crash" to "ticdc lag reached more than 10min when inject pdleader data io hang and ticdc crash" Nov 8, 2023
@Lily2025 Lily2025 changed the title from "ticdc lag reached more than 10min when inject pdleader data io hang and ticdc crash" to "ticdc lag reached more than 10min and ticdc crash when inject pdleader data io hang" Dec 5, 2023
@asddongmen
Contributor

asddongmen commented Apr 2, 2024

If PD is able to upgrade its etcd version to 3.4.31, it's likely that the issue will be resolved.
Ref: etcd-io/etcd#17465 (comment)
cc @flowbehappy

@asddongmen
Contributor

asddongmen commented Apr 15, 2024

After the merge of #10881, the checkpointTs lag during pd-leader-io-hang cases was reduced to less than 120s, meeting the requirement.
image

cc @Lily2025
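
For readers checking the lag figure themselves, here is a rough sketch of deriving checkpoint lag in seconds from a checkpointTs. It relies on the TSO layout used by PD (physical milliseconds in the high bits, an 18-bit logical counter in the low bits). The function names and the choice of "now" as a plain millisecond timestamp are assumptions for illustration, and the 120s default simply mirrors the figure quoted above.

```python
PHYSICAL_SHIFT_BITS = 18  # low 18 bits of a TSO hold the logical counter


def compose_tso(physical_ms: int, logical: int = 0) -> int:
    """Build a TSO from a physical time in milliseconds and a logical counter."""
    return (physical_ms << PHYSICAL_SHIFT_BITS) | logical


def tso_physical_ms(tso: int) -> int:
    """Extract the physical time (milliseconds since the Unix epoch) from a TSO."""
    return tso >> PHYSICAL_SHIFT_BITS


def checkpoint_lag_seconds(checkpoint_ts: int, now_ms: int) -> float:
    """Lag between the current time and the checkpoint's physical time."""
    return (now_ms - tso_physical_ms(checkpoint_ts)) / 1000.0


def meets_requirement(checkpoint_ts: int, now_ms: int, max_lag_s: float = 120.0) -> bool:
    """True if the checkpoint lag is below the given threshold (hypothetical helper)."""
    return checkpoint_lag_seconds(checkpoint_ts, now_ms) < max_lag_s


if __name__ == "__main__":
    now_ms = 1_700_000_000_000
    ts = compose_tso(now_ms - 90_000, logical=5)  # checkpoint 90s behind "now"
    print(checkpoint_lag_seconds(ts, now_ms), meets_requirement(ts, now_ms))
```

Passing `max_lag_s=30.0` checks the stricter expectation stated in the original report instead.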
