Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

changefeed stucks for open protocol changefeed with claim-check on #9775

Closed
Tracked by #9153
fubinzh opened this issue Sep 20, 2023 · 1 comment · Fixed by #9778
Closed
Tracked by #9153

changefeed stucks for open protocol changefeed with claim-check on #9775

fubinzh opened this issue Sep 20, 2023 · 1 comment · Fixed by #9778
Labels
area/ticdc Issues or PRs related to TiCDC. severity/major This is a major bug. type/bug This is a bug.

Comments

@fubinzh
Copy link

fubinzh commented Sep 20, 2023

What did you do?

  1. Create Kafka changefeed
[root@upstream-ticdc-0 /]# cat /tmp/changefeed.toml
[sink.kafka-config.large-message-handle]
large-message-handle-option = "claim-check"
claim-check-storage-uri = "s3://tmp/claim-check-0920?access_key=xxxx&secret_access_key=xxxx&endpoint=http://xxx:9000"
large-message-handle-compression = "none"

/cdc  cli  changefeed  create "--server=127.0.0.1:8301" "--sink-uri=kafka://downstream-kafka.testbed-fb-h9dr9:9092/kafka-large-msg-claim-check-open-protocol?protocol=open-protocol&replication-factor=1" "--changefeed-id=kafka-large-msg-claim-check-open-protocol" "--config=/tmp/changefeed.toml"

  1. Run workload
/workload --tps=2000 --row-count=4000  --table-count=1 --thread=10 --workload-type=large_row --database-host=upstream-tidb.testbed-fb-h9dr9 --database-port=4000 --database-user=root --database-db-name=workload --large-row-size=1048576 --large-ratio=0.100000
  1. Check changefeed checkpoint status

What did you expect to see?

CDC checkpint should be advance.

What did you see instead?

  1. CDC checkpoint not advance
  2. Sink manager restarted frequent, and everytime it is restarted, large kafka message sent to external storage
[span={table_id:124,start_key:7480000000000000ff7c5f720000000000fa,end_key:7480000000000000ff7c5f730000000000fa}] [checkpointTs="{\"Mode\":0,\"Ts\":444384705225162752,\"BatchID\":18446744073709551615}"] [stuckCheck=150] [factoryVersion=8]
[2023/09/20 07:22:10.860 +00:00] [WARN] [manager.go:993] ["Table checkpoint is stuck too long, will restart the sink backend"] [namespace=default] [changefeed=kafka-large-msg-claim-check-open-protocol] [span={table_id:124,start_key:7480000000000000ff7c5f720000000000fa,end_key:7480000000000000ff7c5f730000000000fa}] [checkpointTs="{\"Mode\":0,\"Ts\":444384705225162752,\"BatchID\":18446744073709551615}"] [stuckCheck=150] [factoryVersion=8]
[2023/09/20 07:22:10.860 +00:00] [WARN] [manager.go:264] ["Sink manager backend sink fails"] [namespace=default] [changefeed=kafka-large-msg-claim-check-open-protocol] [factoryVersion=8] [error="table sink stuck"] [errorVerbose="table sink stuck\ngithub.com/pingcap/tiflow/cdc/processor/sinkmanager.(*SinkManager).GetTableStats\n\tgithub.com/pingcap/tiflow/cdc/processor/sinkmanager/manager.go:992\ngithub.com/pingcap/tiflow/cdc/processor.(*processor).GetTableSpanStatus\n\tgithub.com/pingcap/tiflow/cdc/processor/processor.go:333\ngithub.com/pingcap/tiflow/cdc/scheduler/internal/v3/agent.(*tableSpan).getTableSpanStatus\n\tgithub.com/pingcap/tiflow/cdc/scheduler/internal/v3/agent/table.go:73\ngithub.com/pingcap/tiflow/cdc/scheduler/internal/v3/agent.(*agent).handleMessageHeartbeat.func1\n\tgithub.com/pingcap/tiflow/cdc/scheduler/internal/v3/agent/agent.go:283\ngithub.com/pingcap/tiflow/cdc/scheduler/internal/v3/agent.(*agent).handleMessageHeartbeat.(*BtreeMap[...]).Ascend.func2\n\tgithub.com/pingcap/tiflow/pkg/spanz/btree_map.go:110\ngithub.com/google/btree.(*node[...]).iterate\n\tgithub.com/google/btree@v1.1.2/btree_generic.go:522\ngithub.com/google/btree.(*BTreeG[...]).Ascend\n\tgithub.com/google/btree@v1.1.2/btree_generic.go:779\ngithub.com/pingcap/tiflow/pkg/spanz.(*BtreeMap[...]).Ascend\n\tgithub.com/pingcap/tiflow/pkg/spanz/btree_map.go:109\ngithub.com/pingcap/tiflow/cdc/scheduler/internal/v3/agent.(*agent).handleMessageHeartbeat\n\tgithub.com/pingcap/tiflow/cdc/scheduler/internal/v3/agent/agent.go:282\ngithub.com/pingcap/tiflow/cdc/scheduler/internal/v3/agent.(*agent).handleMessage\n\tgithub.com/pingcap/tiflow/cdc/scheduler/internal/v3/agent/agent.go:257\ngithub.com/pingcap/tiflow/cdc/scheduler/internal/v3/agent.(*agent).Tick\n\tgithub.com/pingcap/tiflow/cdc/scheduler/internal/v3/agent/agent.go:209\ngithub.com/pingcap/tiflow/cdc/processor.(*processor).tick\n\tgithub.com/pingcap/tiflow/cdc/processor/processor.go:586\ngithub.com/pingcap/tiflow/cdc/processor.(*processor).Tick\n\tgithub.com/pingcap/tiflow/cdc/processor/processor.go:474\ngithub.com/pingcap/tiflow/cdc/processor.(*managerImpl).Tick\n\tgithub.com/pingcap/tiflow/cdc/processor/manager.go:141\ngithub.com/pingcap/tiflow/pkg/orchestrator.(*EtcdWorker).Run\n\tgithub.com/pingcap/tiflow/pkg/orchestrator/etcd_worker.go:290\ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).runEtcdWorker\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:580\ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).run.func4\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:391\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.3.0/errgroup/errgroup.go:75\nruntime.goexit\n\truntime/asm_amd64.s:1650"]
[2023/09/20 07:22:10.860 +00:00] [INFO] [manager.go:362] ["Sink manager closing sink factory"] [namespace=default] [changefeed=kafka-large-msg-claim-check-open-protocol] [factoryVersion=8]
[2023/09/20 07:22:10.861 +00:00] [INFO] [encoder_group.go:87] ["encoder group exited"] [namespace=default] [changefeed=kafka-large-msg-claim-check-open-protocol]
[2023/09/20 07:22:10.861 +00:00] [INFO] [worker.go:114] ["MQ sink worker exited"] [error="context canceled"] [errorVerbose="context canceled\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/errors@v0.11.5-0.20221009092201-b66cddb77c32/errors.go:174\ngithub.com/pingcap/errors.Trace\n\tgithub.com/pingcap/errors@v0.11.5-0.20221009092201-b66cddb77c32/juju_adaptor.go:15\ngithub.com/pingcap/tiflow/cdc/sink/dmlsink/mq.(*worker).batchEncodeRun\n\tgithub.com/pingcap/tiflow/cdc/sink/dmlsink/mq/worker.go:188\ngithub.com/pingcap/tiflow/cdc/sink/dmlsink/mq.(*worker).run.func3\n\tgithub.com/pingcap/tiflow/cdc/sink/dmlsink/mq/worker.go:127\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.3.0/errgroup/errgroup.go:75\nruntime.goexit\n\truntime/asm_amd64.s:1650"] [namespace=default] [changefeed=kafka-large-msg-claim-check-open-protocol] [protocol=open-protocol]
[2023/09/20 07:22:10.861 +00:00] [INFO] [metrics_collector.go:96] ["Kafka metrics collector stopped"] [namespace=default] [changefeed=kafka-large-msg-claim-check-open-protocol]
[2023/09/20 07:22:10.861 +00:00] [INFO] [manager.go:368] ["Sink manager has closed sink factory"] [namespace=default] [changefeed=kafka-large-msg-claim-check-open-protocol] [factoryVersion=8]
[2023/09/20 07:22:10.861 +00:00] [INFO] [factory.go:229] ["async producer exit since context is done"] [namespace=default] [changefeed=kafka-large-msg-claim-check-open-protocol]
[2023/09/20 07:22:10.861 +00:00] [INFO] [kafka_manager.go:99] ["Background refresh Kafka metadata goroutine exit."] [namespace=default] [changefeed=kafka-large-msg-claim-check-open-protocol]
[2023/09/20 07:22:10.861 +00:00] [INFO] [manager.go:273] ["Sink manager is closing all table sinks"] [namespace=default] [changefeed=kafka-large-msg-claim-check-open-protocol]
[2023/09/20 07:22:10.861 +00:00] [INFO] [factory.go:201] ["Close kafka async producer client success"] [namespace=default] [changefeed=kafka-large-msg-claim-check-open-protocol] [duration=17.62µs]
[2023/09/20 07:22:10.861 +00:00] [INFO] [table_sink_impl.go:194] ["Table sink stopped"] [namespace=default] [changefeed=kafka-large-msg-claim-check-open-protocol] [span={table_id:124,start_key:7480000000000000ff7c5f720000000000fa,end_key:7480000000000000ff7c5f730000000000fa}] [checkpointTs=0]
[2023/09/20 07:22:10.861 +00:00] [INFO] [factory.go:215] ["Close kafka async producer success"] [namespace=default] [changefeed=kafka-large-msg-claim-check-open-protocol] [duration=51.992µs]
[2023/09/20 07:22:10.861 +00:00] [INFO] [table_sink_impl.go:176] ["Stopping table sink"] [namespace=default] [changefeed=kafka-large-msg-claim-check-open-protocol] [span={table_id:124,start_key:7480000000000000ff7c5f720000000000fa,end_key:7480000000000000ff7c5f730000000000fa}] [checkpointTs=0]
[2023/09/20 07:22:10.861 +00:00] [INFO] [manager.go:281] ["Sink manager has closed all table sinks"] [namespace=default] [changefeed=kafka-large-msg-claim-check-open-protocol] [cost=230.016µs]
[2023/09/20 07:22:10.911 +00:00] [WARN] [feed_state_manager.go:465] ["processor reports a warning"] [namespace=default] [changefeed=kafka-large-msg-claim-check-open-protocol] [captureID=5fd26744-aec5-487d-b418-cab4d96bd1c6] [warning="{\"time\":\"2023-09-20T07:22:10.910582055Z\",\"addr\":\"upstream-ticdc-0.upstream-ticdc-peer.testbed-fb-h9dr9.svc:8301\",\"code\":\"CDC:ErrProcessorUnknown\",\"message\":\"table sink stuck\"}"]
[2023/09/20 07:22:11.518 +00:00] [INFO] [table_sink_wrapper.go:362] ["Sink is restarted"] [namespace=default] [changefeed=kafka-large-msg-claim-check-open-protocol] [span={table_id:124,start_key:7480000000000000ff7c5f720000000000fa,end_key:7480000000000000ff7c5f730000000000fa}] [replicateTs=444385075267895298]
[2023/09/20 07:22:11.518 +00:00] [INFO] [table_sink_worker.go:183] ["table sink has been restarted"] [namespace=default] [changefeed=kafka-large-msg-claim-check-open-protocol] [span={table_id:124,start_key:7480000000000000ff7c5f720000000000fa,end_key:7480000000000000ff7c5f730000000000fa}] [lastWrittenPos="{\"StartTs\":444384705225162751,\"CommitTs\":444384705225162752}"] [sinkError="dead dmlSink"]
[2023/09/20 07:22:15.875 +00:00] [INFO] [admin.go:100] ["Kafka config item found"] [namespace=default] [changefeed=kafka-large-msg-claim-check-open-protocol] [configName=min.insync.replicas] [configValue=1]
[2023/09/20 07:22:15.879 +00:00] [INFO] [admin.go:100] ["Kafka config item found"] [namespace=default] [changefeed=kafka-large-msg-claim-check-open-protocol] [configName=max.message.bytes] [configValue=1048588]
[2023/09/20 07:22:15.879 +00:00] [WARN] [options.go:600] ["topic's `max.message.bytes` less than the `max-message-bytes`,use topic's `max.message.bytes` to initialize the Kafka producer"] [max.message.bytes=1048588] [max-message-bytes=10485760]
[2023/09/20 07:22:15.879 +00:00] [WARN] [options.go:610] ["topic already exist, TiCDC will not create the topic"] [topic=kafka-large-msg-claim-check-open-protocol] [detail="{\"Name\":\"kafka-large-msg-claim-check-open-protocol\",\"NumPartitions\":3,\"ReplicationFactor\":0}"]
[2023/09/20 07:22:15.879 +00:00] [INFO] [options.go:190] ["partitionNum is not set, set by topic's partition-num"] [partitionNum=3]
[2023/09/20 07:22:15.880 +00:00] [INFO] [kafka_manager.go:132] ["store topic partition number"] [namespace=default] [changefeed=kafka-large-msg-claim-check-open-protocol] [topic=kafka-large-msg-claim-check-open-protocol] [partitionNumber=3]
[2023/09/20 07:22:15.880 +00:00] [INFO] [claim_check.go:51] ["claim check enabled, start create the external storage"] [namespace=default] [changefeed=kafka-large-msg-claim-check-open-protocol] [storageURI="s3://tmp/claim-check-0920?access_key=minioadmin&endpoint=http%3A%2F%2F172.16.6.5%3A9000&secret_access_key=minioadmin"]
[2023/09/20 07:22:15.887 +00:00] [INFO] [s3.go:403] ["succeed to get bucket region from s3"] ["bucket region"=us-east-1]
[2023/09/20 07:22:15.891 +00:00] [INFO] [claim_check.go:68] ["claim-check create the external storage success"] [namespace=default] [changefeed=kafka-large-msg-claim-check-open-protocol] [storageURI="s3://tmp/claim-check-0920?access_key=minioadmin&endpoint=http%3A%2F%2F172.16.6.5%3A9000&secret_access_key=minioadmin"] [duration=10.257239ms]
[2023/09/20 07:22:15.894 +00:00] [INFO] [kafka_dml_producer.go:62] ["Starting kafka DML producer ..."] [namespace=default] [changefeed=kafka-large-msg-claim-check-open-protocol]
[2023/09/20 07:22:15.894 +00:00] [INFO] [kafka_dml_sink.go:122] ["DML sink producer created"] [namespace=default] [changefeedID=kafka-large-msg-claim-check-open-protocol]
[2023/09/20 07:22:15.894 +00:00] [INFO] [manager.go:351] ["Sink manager inits sink factory success"] [namespace=default] [changefeed=kafka-large-msg-claim-check-open-protocol] [factoryVersion=9]
[2023/09/20 07:22:15.894 +00:00] [INFO] [worker.go:177] ["MQ sink batch worker started"] [namespace=default] [changefeed=kafka-large-msg-claim-check-open-protocol] [protocol=open-protocol]
[2023/09/20 07:22:28.713 +00:00] [INFO] [replication_manager.go:663] ["schedulerv3: slow table"] [namespace=default] [changefeed=kafka-large-msg-claim-check-open-protocol] [tableID=124] [tableStatus=] [checkpointTs=444384705225162752] [resolvedTs=444385079698128901] [checkpointLag=23m48.759771511s]
[2023/09/20 07:23:28.762 +00:00] [INFO] [replication_manager.go:663] ["schedulerv3: slow table"] [namespace=default] [changefeed=kafka-large-msg-claim-check-open-protocol] [tableID=124] [tableStatus=] [checkpointTs=444384705225162752] [resolvedTs=444385095478935558] [checkpointLag=24m48.808800445s]

img_v2_93407a9a-ad2e-4a54-b39a-03e62d86e29g

image

image

Versions of the cluster

[root@upstream-ticdc-0 /]# /cdc version
Release Version: v7.4.0-alpha
Git Commit Hash: bd83f8d
Git Branch: heads/refs/tags/v7.4.0-alpha
UTC Build Time: 2023-09-18 11:36:23
Go Version: go version go1.21.1 linux/amd64
Failpoint Build: false

@fubinzh fubinzh added area/ticdc Issues or PRs related to TiCDC. type/bug This is a bug. labels Sep 20, 2023
@github-actions github-actions bot added this to Need Triage in Question and Bug Reports Sep 20, 2023
@fubinzh
Copy link
Author

fubinzh commented Sep 20, 2023

/severity major

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/ticdc Issues or PRs related to TiCDC. severity/major This is a major bug. type/bug This is a bug.
2 participants