Summary
Under workloads with syncpoint enabled and many DDLs, the maintainer can spend excessive time handling BlockStatusRequest messages and repeatedly log "maintainer is too slow". The current suspicion is that stale block-status resend tasks from obsolete dispatchers are never terminated after topology changes, so they keep resending WAITING barrier statuses for hours.
Evidence
- Slow maintainer handling:
[2026/04/30 09:45:38.278 +08:00] [INFO] [maintainer.go:318] ["maintainer is too slow"] [changefeedID=default/cdc-primary-to-secondary] [eventType=1] [duration=1h50m39.986192884s] [from=34dce5a8-64b8-4b20-b4ab-e57a4c0419e9] [to=] [type=BlockStatusRequest] [topic=]
- Extremely long-lived resend task for a syncpoint WAITING status:
[2026/04/30 10:16:20.980 +08:00] [INFO] [helper.go:293] ["resend task periodic resend"] [dispatcherID=127019155305713984904401939242792550840] [message="ID:<...> state:<IsBlocked:true BlockTs:465954228142080000 BlockTables:<> IsSyncPoint:true stage:WAITING > "] [executeCount=8160]
With a 5-second resend interval, executeCount=8160 means the task has been retrying for about 11 hours (8160 × 5 s = 40,800 s, roughly 11 h 20 min).
Suspected root cause
After dispatcher replacement, split/merge, or DDL-driven topology changes, old dispatchers may continue resending stale WAITING block statuses. The maintainer currently ignores some of these requests because they come from non-replicating or nonexistent dispatchers, but ignoring them does not terminate the resend loop. Over time, these stale retries accumulate and create sustained BlockStatusRequest pressure on the maintainer event loop.
Impact
- excessive maintainer event handling latency
- noisy slow logs
- risk of barrier backlog growth under syncpoint + heavy DDL workloads