
CDC cloud: [CDC:ErrMetaListDatabases][tikv:9006]GC life time is shorter than transaction duration, transaction starts at 2021-08-05 16:54:32.127 +0800 CST, GC safe point is 2021-08-05 17:25:38.127 +0800 CST"] #2470

Closed
Tammyxia opened this issue Aug 5, 2021 · 9 comments · Fixed by #2512
Assignees
Labels
area/ticdc Issues or PRs related to TiCDC. severity/major This is a major bug. type/bug This is a bug.

Comments

@Tammyxia

Tammyxia commented Aug 5, 2021

Bug Report

Please answer these questions before submitting your issue. Thanks!

  1. What did you do? If possible, provide a recipe for reproducing the error.
     • Create a new changefeed:

     (screenshot)

  2. What did you expect to see?
     • No error
  3. What did you see instead?
     • Too many logs:
       (screenshot)
  4. Versions of the cluster

     • Upstream TiDB cluster version (execute SELECT tidb_version(); in a MySQL client):

       4.0.14

     • TiCDC version (execute cdc version):

       4.0.14 cloud Aug/03
@Tammyxia Tammyxia added type/bug This is a bug. severity/major This is a major bug. labels Aug 5, 2021
@3AceShowHand 3AceShowHand added this to Need Triage in Question and Bug Reports via automation Aug 5, 2021
@amyangfei
Contributor

Could you provide the cdc log and pd log? We can check the service GC safepoint and the GC safepoint from these logs. @Tammyxia

@Tammyxia
Author

Tammyxia commented Aug 6, 2021

I cannot upload logs here; you know the reason. I'll send them to you in a private message...

@Tammyxia
Author

Tammyxia commented Aug 6, 2021

This issue caused the changefeed to stop syncing data; when the gc-ttl timed out, the changefeed status became failed.

@Tammyxia Tammyxia added severity/critical This is a critical bug. and removed severity/major This is a major bug. labels Aug 6, 2021
@amyangfei
Contributor

[2021/08/05 16:15:38.166 +08:00] [INFO] [grpc_service.go:815] ["update service GC safe point"] [service-id=gc_worker] [expire-at=9223372036854775807] [safepoint=426809947095564288]
[2021/08/05 16:25:38.166 +08:00] [INFO] [grpc_service.go:815] ["update service GC safe point"] [service-id=gc_worker] [expire-at=9223372036854775807] [safepoint=426810104381964288]
[2021/08/05 16:35:38.177 +08:00] [INFO] [grpc_service.go:815] ["update service GC safe point"] [service-id=gc_worker] [expire-at=9223372036854775807] [safepoint=426810261668364288]
[2021/08/05 16:45:38.166 +08:00] [INFO] [grpc_service.go:815] ["update service GC safe point"] [service-id=gc_worker] [expire-at=9223372036854775807] [safepoint=426810418954764288]
[2021/08/05 16:54:06.378 +08:00] [INFO] [grpc_service.go:815] ["update service GC safe point"] [service-id=ticdc] [expire-at=1628240046] [safepoint=18446744073709551615]
[2021/08/05 16:54:06.382 +08:00] [INFO] [grpc_service.go:815] ["update service GC safe point"] [service-id=ticdc] [expire-at=1628240046] [safepoint=18446744073709551615]
[2021/08/05 16:54:06.387 +08:00] [INFO] [grpc_service.go:815] ["update service GC safe point"] [service-id=ticdc] [expire-at=1628240046] [safepoint=18446744073709551615]
[2021/08/05 16:54:06.449 +08:00] [INFO] [grpc_service.go:815] ["update service GC safe point"] [service-id=ticdc] [expire-at=1628240046] [safepoint=18446744073709551615]
[2021/08/05 16:54:32.151 +08:00] [INFO] [grpc_service.go:815] ["update service GC safe point"] [service-id=ticdc-creating-cc-task-1] [expire-at=1628154272] [safepoint=426810716226060299]
[2021/08/05 16:54:48.856 +08:00] [INFO] [grpc_service.go:815] ["update service GC safe point"] [service-id=ticdc-creating-cc-task-1] [expire-at=1628154288] [safepoint=426810716226060299]
[2021/08/05 16:55:38.172 +08:00] [INFO] [grpc_service.go:815] ["update service GC safe point"] [service-id=gc_worker] [expire-at=9223372036854775807] [safepoint=426810576241164288]
[2021/08/05 17:05:38.230 +08:00] [INFO] [grpc_service.go:815] ["update service GC safe point"] [service-id=gc_worker] [expire-at=9223372036854775807] [safepoint=426810733527564288]
[2021/08/05 17:15:38.169 +08:00] [INFO] [grpc_service.go:815] ["update service GC safe point"] [service-id=gc_worker] [expire-at=9223372036854775807] [safepoint=426810890813964288]
[2021/08/05 17:25:38.166 +08:00] [INFO] [grpc_service.go:815] ["update service GC safe point"] [service-id=gc_worker] [expire-at=9223372036854775807] [safepoint=426811048100364288]
[2021/08/05 17:35:38.166 +08:00] [INFO] [grpc_service.go:815] ["update service GC safe point"] [service-id=gc_worker] [expire-at=9223372036854775807] [safepoint=426811205386764288]
[2021/08/05 17:45:38.166 +08:00] [INFO] [grpc_service.go:815] ["update service GC safe point"] [service-id=gc_worker] [expire-at=9223372036854775807] [safepoint=426811362673164288]
[2021/08/05 17:55:38.166 +08:00] [INFO] [grpc_service.go:815] ["update service GC safe point"] [service-id=gc_worker] [expire-at=9223372036854775807] [safepoint=426811519959564288]
[2021/08/05 18:05:38.165 +08:00] [INFO] [grpc_service.go:815] ["update service GC safe point"] [service-id=gc_worker] [expire-at=9223372036854775807] [safepoint=426811677245964288]
[2021/08/05 18:15:38.165 +08:00] [INFO] [grpc_service.go:815] ["update service GC safe point"] [service-id=gc_worker] [expire-at=9223372036854775807] [safepoint=426811834532364288]
[2021/08/05 18:25:38.165 +08:00] [INFO] [grpc_service.go:815] ["update service GC safe point"] [service-id=gc_worker] [expire-at=9223372036854775807] [safepoint=426811991818764288]
[2021/08/05 18:35:38.167 +08:00] [INFO] [grpc_service.go:815] ["update service GC safe point"] [service-id=gc_worker] [expire-at=9223372036854775807] [safepoint=426812149105164288]
[2021/08/05 18:45:38.172 +08:00] [INFO] [grpc_service.go:815] ["update service GC safe point"] [service-id=gc_worker] [expire-at=9223372036854775807] [safepoint=426812306391564288]
[2021/08/05 18:46:13.791 +08:00] [INFO] [grpc_service.go:815] ["update service GC safe point"] [service-id=ticdc] [expire-at=1628246773] [safepoint=18446744073709551615]
[2021/08/05 18:47:13.789 +08:00] [INFO] [grpc_service.go:815] ["update service GC safe point"] [service-id=ticdc] [expire-at=1628246833] [safepoint=18446744073709551615]

Notice these logs:

[2021/08/05 16:55:38.172 +08:00] [INFO] [grpc_service.go:815] ["update service GC safe point"] [service-id=gc_worker] [expire-at=9223372036854775807] [safepoint=426810576241164288]
[2021/08/05 17:05:38.230 +08:00] [INFO] [grpc_service.go:815] ["update service GC safe point"] [service-id=gc_worker] [expire-at=9223372036854775807] [safepoint=426810733527564288]

[2021/08/05 16:55:38.172 +08:00] gc_worker tso=426810576241164288, system: 2021-08-05 16:45:38.127 +0800 CST
[2021/08/05 17:05:38.230 +08:00] gc_worker tso=426810733527564288, system: 2021-08-05 16:55:38.127 +0800 CST
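The physical timestamps in the annotation above can be recovered from the TSOs directly: a TiDB TSO packs milliseconds since the Unix epoch in the high bits and an 18-bit logical counter in the low bits. A minimal sketch of the conversion (the function name is illustrative, not a TiCDC API):

```go
package main

import (
	"fmt"
	"time"
)

// physicalMs extracts the physical clock part of a TiDB TSO:
// the high bits are milliseconds since the Unix epoch, the low
// 18 bits are the logical counter.
func physicalMs(tso uint64) int64 {
	return int64(tso >> 18)
}

func main() {
	cst := time.FixedZone("CST", 8*3600)
	// The two gc_worker safepoints highlighted above.
	for _, tso := range []uint64{426810576241164288, 426810733527564288} {
		t := time.UnixMilli(physicalMs(tso)).In(cst)
		fmt.Printf("tso=%d physical=%s\n", tso, t.Format("2006-01-02 15:04:05.000 -0700"))
	}
	// Prints 2021-08-05 16:45:38.127 +0800 and 16:55:38.127 +0800,
	// matching the "system:" times in the comment.
}
```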

@amyangfei
Contributor

amyangfei commented Aug 6, 2021

Some TiCDC owner logs:

The "add a table" logs, first and last; 60010 tables in total:
[2021/08/05 16:54:58.836 +08:00] [INFO] [owner.go:268] ["add a table"] [tableId=167822] [boundaryTs=426810716226060299]
[2021/08/05 16:57:24.124 +08:00] [INFO] [owner.go:268] ["add a table"] [tableId=187326] [boundaryTs=426810716226060299]

Adding the tables in the owner took 146s.

[2021/08/05 16:57:24.129 +08:00] [INFO] [scheduler.go:346] ["Start rebalancing"] [changefeed=cc-task-1] [table-num=60010] [capture-num=9] [target-limit=6668]
[2021/08/05 17:00:33.137 +08:00] [INFO] [etcd_worker.go:571] ["Committing big txn successful"]
[2021/08/05 17:10:21.491 +08:00] [WARN] [gc_manager.go:101] ["update gc safe point failed, the gc safe point is larger than checkpointTs"] [actual=426810733527564288] [checkpointTs=426810716226060299]
[2021/08/05 17:10:21.492 +08:00] [INFO] [feed_state_manager.go:90] ["handle admin job"] [changefeedID=cc-task-1] [job="{\"CfID\":\"cc-task-1\",\"Type\":1,\"Opts\":{\"ForceRemove\":false},\"Error\":null}"]
[2021/08/05 17:10:21.567 +08:00] [ERROR] [changefeed.go:106] ["an error occurred in Owner"] [changefeedID=cc-task-1] [error="[CDC:ErrSnapshotLostByGC]fail to create or maintain changefeed due to snapshot loss caused by GC. checkpoint-ts 426810716226060299 is earlier than GC safepoint at 426810733527564288"] [errorVerbose="[CDC:ErrSnapshotLostByGC]fail to create or maintain changefeed due to snapshot loss caused by GC. checkpoint-ts 426810716226060299 is earlier than GC safepoint at 426810733527564288\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\tgithub.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/normalize.go:156\ngithub.com/pingcap/ticdc/cdc/owner.(*gcManager).CheckStaleCheckpointTs\n\tgithub.com/pingcap/ticdc/cdc/owner/gc_manager.go:136\ngithub.com/pingcap/ticdc/cdc/owner.(*changefeed).tick\n\tgithub.com/pingcap/ticdc/cdc/owner/changefeed.go:128\ngithub.com/pingcap/ticdc/cdc/owner.(*changefeed).Tick\n\tgithub.com/pingcap/ticdc/cdc/owner/changefeed.go:105\ngithub.com/pingcap/ticdc/cdc/owner.(*Owner).Tick\n\tgithub.com/pingcap/ticdc/cdc/owner/owner.go:134\ngithub.com/pingcap/ticdc/pkg/orchestrator.(*EtcdWorker).Run\n\tgithub.com/pingcap/ticdc/pkg/orchestrator/etcd_worker.go:217\ngithub.com/pingcap/ticdc/cdc/capture.(*Capture).runEtcdWorker\n\tgithub.com/pingcap/ticdc/cdc/capture/capture.go:264\ngithub.com/pingcap/ticdc/cdc/capture.(*Capture).campaignOwner\n\tgithub.com/pingcap/ticdc/cdc/capture/capture.go:243\ngithub.com/pingcap/ticdc/cdc/capture.(*Capture).run.func2\n\tgithub.com/pingcap/ticdc/cdc/capture/capture.go:172\nruntime.goexit\n\truntime/asm_amd64.s:1371"]
[2021/08/05 17:10:21.567 +08:00] [INFO] [changefeed.go:239] ["close changefeed"] [changefeed=cc-task-1] [info="{\"sink-uri\":\"***\",\"opts\":{\"_changefeed_id\":\"cc-task-1\"},\"create-time\":\"2021-08-05T16:54:32.157934864+08:00\",\"start-ts\":426810716226060299,\"target-ts\":0,\"admin-job-type\":0,\"sort-engine\":\"unified\",\"sort-dir\":\"\",\"config\":{\"case-sensitive\":true,\"enable-old-value\":false,\"force-replicate\":false,\"check-gc-safe-point\":true,\"filter\":{\"rules\":[\"*.*\"],\"ignore-txn-start-ts\":null},\"mounter\":{\"worker-num\":16},\"sink\":{\"dispatchers\":null,\"protocol\":\"default\"},\"cyclic-replication\":{\"enable\":false,\"replica-id\":0,\"filter-replica-ids\":null,\"id-buckets\":0,\"sync-ddl\":false},\"scheduler\":{\"type\":\"table-number\",\"polling-time\":-1}},\"state\":\"normal\",\"history\":null,\"error\":null,\"sync-point-enabled\":false,\"sync-point-interval\":600000000000,\"creator-version\":\"v4.0.14\"}"]
[2021/08/05 17:10:21.624 +08:00] [INFO] [region_range_lock.go:370] ["unlocked range"] [lockID=33148] [regionID=2037989] [startKey=6d44444c4a6f6241ff64ff644964784c69ff7374ff0000000000ff000000f700000000ff0000006c00000000fb] [endKey=6d44444c4a6f6241ff64ff644964784c69ff7374ff0000000000ff000000f700000000ff0000006d00000000fb] [checkpointTs=426810964974501922]

From the log [2021/08/05 17:10:21.491 +08:00] [WARN] [gc_manager.go:101] ["update gc safe point failed, the gc safe point is larger than checkpointTs"] [actual=426810733527564288] [checkpointTs=426810716226060299], by the time the cdc owner first updated the service GC safepoint, the safepoint had already been moved forward.

It happens when the table count is large (60K in this scenario) and the initialization phase takes longer than cdcChangefeedCreatingServiceGCSafePointTTL, which is 10 minutes by default.

Suggestion: as a workaround, we can increase cdcChangefeedCreatingServiceGCSafePointTTL to a larger value until TiCDC can guarantee the initialization phase finishes within 10 minutes.
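For context, the check that fails here is CheckStaleCheckpointTs in cdc/owner/gc_manager.go (per the stack trace above). A simplified sketch of its core comparison, fed the values from the owner log; the function body is illustrative, not the actual implementation:

```go
package main

import "fmt"

// checkStaleCheckpointTs sketches the stale-checkpoint check: if the
// actual GC safepoint in PD has already passed the changefeed's
// checkpoint-ts, the snapshot the changefeed needs has been garbage
// collected and the changefeed must fail.
func checkStaleCheckpointTs(checkpointTs, gcSafepointTs uint64) error {
	if checkpointTs < gcSafepointTs {
		return fmt.Errorf(
			"[CDC:ErrSnapshotLostByGC] checkpoint-ts %d is earlier than GC safepoint at %d",
			checkpointTs, gcSafepointTs)
	}
	return nil
}

func main() {
	// Values from the 17:10:21 owner log: the changefeed never advanced
	// past its start-ts, while gc_worker had already pushed the safepoint.
	err := checkStaleCheckpointTs(426810716226060299, 426810733527564288)
	fmt.Println(err)
}
```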

@amyangfei
Contributor

Another problem is still unclear: what happened between [2021/08/05 17:00:33.137 +08:00] and [2021/08/05 17:10:21.491 +08:00]? It seems the TiCDC owner was blocked somewhere, but we can't find the root cause from the logs alone.

@overvenus
Member

Suggestion, we can increase the cdcChangefeedCreatingServiceGCSafePointTTL to a reasonable value as a workaround, before TiCDC can guarantee the initialization phase is less than 10 minutes.

As a workaround, let's increase the TTL to 1 hour to mitigate the issue. (We have tested this in a cluster capturing 60k tables, and it works.)

@overvenus
Member

Set severity to major, as this only occurs when PD is extremely slow (which means it's hard to reproduce in production).

@overvenus overvenus added severity/major This is a major bug. and removed severity/critical This is a critical bug. labels Sep 1, 2021
@overvenus
Member

FYI, I am working on #2512 this week.
