
eventcollector: decouple dispatcher session from dispatcher stat #5007

Open
lidezhu wants to merge 4 commits into pingcap:master from lidezhu:ldz/refactor-event-collector002

Conversation

@lidezhu
Collaborator

@lidezhu lidezhu commented May 7, 2026

What problem does this PR solve?

Issue Number: close #xxx

What is changed and how it works?

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Questions

Will it cause performance regression or break compatibility?
Do you need to update user documentation, design documentation or monitoring documentation?

Release note

Please refer to [Release Notes Language Style Guide](https://pingcap.github.io/tidb-dev-guide/contribute-to-tidb/release-notes-style-guide.html) to write a quality release note.

If you don't think this PR needs a release note then fill it with `None`.

Summary by CodeRabbit

  • Refactor

    • Improved internal architecture by decoupling session management from collector dependencies, enabling better modularity and testability.
    • Streamlined dispatcher request handling through dependency injection of key operations.
  • Tests

    • Enhanced test utilities to better support isolated component testing scenarios.

@ti-chi-bot ti-chi-bot Bot added the do-not-merge/needs-linked-issue and release-note (denotes a PR that will be considered when it comes time to generate release notes) labels May 7, 2026
@ti-chi-bot

ti-chi-bot Bot commented May 7, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign tenfyzhong for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai
Contributor

coderabbitai Bot commented May 7, 2026

Review Change Stack

Warning

Rate limit exceeded

@lidezhu has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 12 minutes and 17 seconds before requesting another review.

To continue reviewing without waiting, purchase usage credits in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 243441ff-99f9-4596-a2a4-674be56838fe

📥 Commits

Reviewing files that changed from the base of the PR and between 298ba37 and b54d22c.

📒 Files selected for processing (1)
  • downstreamadapter/eventcollector/dispatcher_session.go
📝 Walkthrough


Dispatcher session is refactored to eliminate its dependency on dispatcherStat by receiving injected dispatcher metadata, message callbacks, and epoch management instead. Dispatcher stat now provides these dependencies at session creation time and delegates request construction to the session. Tests gain a specialized helper for state-only validation.

Changes

Dispatcher Session Dependency Injection

Layer / File(s) and summary:

  • Session Constructor & Fields (downstreamadapter/eventcollector/dispatcher_session.go): Constructor accepts dispatcher target, localServerID, sendMessage, nextResetEpoch, and readyCallback directly instead of holding an owner reference. Session struct now manages connection state internally.
  • Session Methods Using Injected Dependencies (downstreamadapter/eventcollector/dispatcher_session.go): clear(), registerTo(), commitReady(), reset(), remove(), and event handling (handleSignalEvent, handleReadyEvent, handleLocalReadyEvent, handleRemoteReadyEvent, handleNotReusableEvent) now use injected callbacks and local fields instead of owner references.
  • Dispatcher Stat Dependency Provision (downstreamadapter/eventcollector/dispatcher_stat.go): newDispatcherStat wrapper validates event-collector and delegates to newDispatcherStatInternal, which provides explicit dependencies to session creation. New nextResetEpoch() method atomically advances epoch state. Request helpers delegate to session methods.
  • Test Updates (downstreamadapter/eventcollector/dispatcher_stat_test.go): New newDispatcherStatForStateTest() helper constructs stat instances for state validation without event-collector wiring. Multiple tests switch from newDispatcherStat(..., nil, nil) to use the new helper.
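
The injection pattern summarized above can be sketched in isolation. This is a minimal illustration, not the PR's code: `epochState`, `stat`, and `session` are simplified, hypothetical stand-ins for `dispatcherEpochState`, `dispatcherStat`, and `dispatcherSession`, and messages are plain strings rather than `*messaging.TargetMessage`. It assumes the epoch state lives behind an `atomic.Pointer` and is advanced with a compare-and-swap loop, similar in shape to the loop that the diff later in this thread removes from `doReset`.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// epochState is a hypothetical, simplified stand-in for dispatcherEpochState.
type epochState struct {
	epoch   uint64
	resetTs uint64
}

// stat is a hypothetical stand-in for dispatcherStat: it owns the epoch state.
type stat struct {
	current atomic.Pointer[epochState]
}

func newStat() *stat {
	s := &stat{}
	s.current.Store(&epochState{})
	return s
}

// nextResetEpoch installs a new state with epoch+1 via compare-and-swap
// and returns the new epoch. It retries if another goroutine wins the race.
func (s *stat) nextResetEpoch(resetTs uint64) uint64 {
	for {
		cur := s.current.Load()
		next := &epochState{epoch: cur.epoch + 1, resetTs: resetTs}
		if s.current.CompareAndSwap(cur, next) {
			return next.epoch
		}
	}
}

// session is a hypothetical stand-in for dispatcherSession. It holds only
// injected callbacks and has no reference back to its owner.
type session struct {
	sendMessage    func(msg string)
	nextResetEpoch func(resetTs uint64) uint64
}

// doReset advances the epoch through the injected callback and sends a
// reset message through the injected sender.
func (s *session) doReset(serverID string, resetTs uint64) {
	epoch := s.nextResetEpoch(resetTs)
	s.sendMessage(fmt.Sprintf("reset server=%s ts=%d epoch=%d", serverID, resetTs, epoch))
}

func main() {
	st := newStat()
	var sent []string
	sess := &session{
		// A test can capture messages instead of wiring a real collector.
		sendMessage:    func(msg string) { sent = append(sent, msg) },
		nextResetEpoch: st.nextResetEpoch,
	}
	sess.doReset("node-1", 100)
	sess.doReset("node-1", 120)
	fmt.Println(sent)
}
```

Because the session sees only function values, a test can substitute a slice-appending sender, which is roughly the isolation the walkthrough attributes to the new test helper.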

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • pingcap/ticdc#4991: Concurrent refactor moving dispatcher connection and lifecycle state from dispatcherStat into dispatcherSession with signature and call-site updates.

Suggested labels

lgtm, approved

Suggested reviewers

  • hongyunyan
  • flowbehappy
  • wk989898

Poem

🐰 A session breaks free from its master's grip,
With injected deps and a cleaner ship,
Stat provides the fuel, the epoch, the way,
Tests simplified for a brighter day! 🌟

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

  • Description check: ⚠️ Warning. Explanation: The PR description is incomplete and lacks critical information. The 'Issue Number', 'What is changed and how it works?', test type selection, and release note are either placeholders or unanswered. Resolution: Complete the description by: (1) providing a linked issue number, (2) explaining the changes and rationale, (3) selecting applicable test types and confirming tests exist, (4) answering compatibility/documentation questions, and (5) providing an actual release note or 'None'.
  • Docstring Coverage: ⚠️ Warning. Explanation: Docstring coverage is 0.00%, which is insufficient. The required threshold is 80.00%. Resolution: Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (3 passed)

  • Title check: ✅ Passed. The title clearly and specifically describes the main refactoring change: decoupling dispatcher session from dispatcher stat in the eventcollector component.
  • Linked Issues check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: ✅ Passed. Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@ti-chi-bot ti-chi-bot Bot added the size/XXL label (denotes a PR that changes 1000+ lines, ignoring generated files) May 7, 2026

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request refactors the dispatcher management and heartbeat mechanism by introducing a dispatcherSession to handle event service interactions and implementing epoch-based progress tracking. Key changes include the addition of a background heartbeat sender in the EventCollector and updates to the wire format to include epoch information, ensuring stale progress updates are ignored. Review feedback identifies a critical missing length validation in the Unmarshal method for DispatcherProgressLegacy that could lead to panics, and notes a design concern regarding stale heartbeat responses in the session management logic.

func (dp *DispatcherProgressLegacy) Unmarshal(data []byte) error {
	buf := bytes.NewBuffer(data)
	dp.DispatcherID.Unmarshal(buf.Next(dp.DispatcherID.GetSize()))
	dp.CheckpointTs = binary.BigEndian.Uint64(buf.Next(8))

high

The Unmarshal method does not validate the length of the input data before calling buf.Next(). If data is shorter than expected, buf.Next() will return a shorter slice, and binary.BigEndian.Uint64 will panic. This is a high-severity issue as it can lead to service crashes with malformed input.

func (dp *DispatcherProgressLegacy) Unmarshal(data []byte) error {
	if len(data) < dp.DispatcherID.GetSize()+8 {
		return fmt.Errorf("data too short")
	}
	dp.DispatcherID.Unmarshal(data[:dp.DispatcherID.GetSize()])
	dp.CheckpointTs = binary.BigEndian.Uint64(data[dp.DispatcherID.GetSize():dp.DispatcherID.GetSize()+8])
	return nil
}
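
The failure mode behind this comment can be reproduced outside the codebase. The sketch below is an illustration under stated assumptions: `progress` is a hypothetical stand-in for `DispatcherProgressLegacy`, `idSize` is an assumed fixed width standing in for `DispatcherID.GetSize()`, and `encode` is a helper invented for the demo. It relies on two standard-library behaviors: `bytes.Buffer.Next` returns fewer bytes when fewer remain, and `binary.BigEndian.Uint64` panics on a slice shorter than 8 bytes, so checking the length up front turns a crash into an error.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// idSize is an assumed fixed ID width for this sketch; the real value
// would come from DispatcherID.GetSize().
const idSize = 16

// progress is a hypothetical stand-in for DispatcherProgressLegacy.
type progress struct {
	id           [idSize]byte
	checkpointTs uint64
}

var errShort = errors.New("data too short")

// unmarshal validates the payload length before slicing, so malformed
// input yields an error instead of an index-out-of-range panic.
func (p *progress) unmarshal(data []byte) error {
	if len(data) < idSize+8 {
		return fmt.Errorf("%w: got %d bytes, need %d", errShort, len(data), idSize+8)
	}
	copy(p.id[:], data[:idSize])
	p.checkpointTs = binary.BigEndian.Uint64(data[idSize : idSize+8])
	return nil
}

// encode builds a well-formed payload with a zero ID, for demonstration only.
func encode(checkpointTs uint64) []byte {
	buf := make([]byte, idSize+8)
	binary.BigEndian.PutUint64(buf[idSize:], checkpointTs)
	return buf
}

func main() {
	var p progress
	// Truncated input is rejected with an error.
	fmt.Println(p.unmarshal(make([]byte, 10)))
	// Well-formed input decodes normally.
	fmt.Println(p.unmarshal(encode(42)), p.checkpointTs)
}
```

Wrapping a sentinel error with `%w` lets callers distinguish truncated payloads with `errors.Is`; whether the real code wants a typed error here is a design choice this thread does not settle.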

Comment on lines +140 to +142
// TODO: this design is bad because we may receive stale heartbeat response,
// which make us call clear and register again. But the register may be ignore,
// so we will not receive any ready event.

medium

The clear() method contains a TODO indicating a known design flaw regarding stale heartbeat responses. This should be addressed to ensure the dispatcher state is managed correctly and to avoid potential issues with registration.

@lidezhu lidezhu force-pushed the ldz/refactor-event-collector002 branch from 2d1ecd2 to fe27b6e May 7, 2026 09:36
@ti-chi-bot ti-chi-bot Bot added the size/XL label (denotes a PR that changes 500-999 lines, ignoring generated files) and removed the size/XXL label (denotes a PR that changes 1000+ lines, ignoring generated files) May 7, 2026
@lidezhu lidezhu force-pushed the ldz/refactor-event-collector002 branch from fe27b6e to 298ba37 May 7, 2026 13:37
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@downstreamadapter/eventcollector/dispatcher_session.go`:
- Around line 292-299: When handleNotReusableEvent in dispatcherSession observes
no replacement candidate (connState.getNextRemoteCandidate() returns empty),
clear the failed remote binding so the session no longer appears attached;
specifically, if candidate == "" then reset the remote binding (e.g., clear
connState.eventServiceID or call the connState clear/unbind helper) before
returning so trySetRemoteCandidates can accept new candidates later; keep the
existing registerTo(candidate) behavior when a candidate exists and optionally log
the clear action for debugging.
- Around line 160-179: commitReady and reset currently call doReset with
s.target.GetCheckpointTs(), which can advance reset epochs past
collector-observed progress and violate dispatcherEpochState.maxEventTs; change
both commitReady and reset to compute resetTs via the safe, capped progress (use
the dispatcher state helper, e.g. s.state.getSafeResetTs() or the equivalent
getSafeResetTs method wired from dispatcherStat used for EventService
heartbeats) and pass that value into doReset(serverID, resetTs), leaving doReset
unchanged; update imports/struct wiring if needed to expose getSafeResetTs on
s.state so commitReady/reset use the capped progress instead of
s.target.GetCheckpointTs().

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f5703def-593e-412f-ab4e-7da1b9f8898c

📥 Commits

Reviewing files that changed from the base of the PR and between 5bf13f7 and 298ba37.

📒 Files selected for processing (3)
  • downstreamadapter/eventcollector/dispatcher_session.go
  • downstreamadapter/eventcollector/dispatcher_stat.go
  • downstreamadapter/eventcollector/dispatcher_stat_test.go

Comment on lines 160 to +179
// commitReady is used to notify the event service to start sending events.
func (s *dispatcherSession) commitReady(serverID node.ID) {
-   s.doReset(serverID, s.owner.getResetTs())
+   s.doReset(serverID, s.target.GetCheckpointTs())
}

// reset is used to reset the dispatcher to the specified commitTs,
// it will remove the dispatcher from the dynamic stream and add it back.
func (s *dispatcherSession) reset(serverID node.ID) {
-   s.doReset(serverID, s.owner.getResetTs())
+   s.doReset(serverID, s.target.GetCheckpointTs())
}

func (s *dispatcherSession) doReset(serverID node.ID, resetTs uint64) {
-   var epoch uint64
-   for {
-       currentState := s.owner.loadCurrentEpochState()
-       nextState := newDispatcherEpochState(currentState.epoch+1, 0, resetTs)
-       if s.owner.currentEpoch.CompareAndSwap(currentState, nextState) {
-           epoch = nextState.epoch
-           break
-       }
-   }
-   resetRequest := s.owner.newDispatcherResetRequest(
-       s.owner.eventCollector.getLocalServerID().String(),
+   epoch := s.nextResetEpoch(resetTs)
+   resetRequest := s.newDispatcherResetRequest(
+       s.localServerID.String(),
        resetTs,
        epoch,
    )
    msg := messaging.NewSingleTargetMessage(serverID, messaging.EventServiceTopic, resetRequest)
-   s.owner.eventCollector.enqueueMessageForSend(msg)
+   s.sendMessage(msg)

⚠️ Potential issue | 🔴 Critical | 🏗️ Heavy lift

Clamp reset ts to collector-observed progress, not raw sink checkpoint.

Lines 162 and 168 use s.target.GetCheckpointTs() directly. That breaks the safety invariant documented in dispatcherEpochState.maxEventTs: old in-flight events can advance the sink checkpoint after a reset, even when the collector has not accepted that progress in the new epoch yet. If another commitReady/reset happens in that window, this code will start the next epoch too far ahead and can permanently skip data.

Possible fix shape
type dispatcherSession struct {
    target dispatcher.DispatcherService
    localServerID node.ID
    sendMessage func(*messaging.TargetMessage)
+   getSafeResetTs func() uint64
    nextResetEpoch func(resetTs uint64) uint64
    readyCallback func()
}

-func newDispatcherSession(
+func newDispatcherSession(
    target dispatcher.DispatcherService,
    localServerID node.ID,
    sendMessage func(*messaging.TargetMessage),
+   getSafeResetTs func() uint64,
    nextResetEpoch func(resetTs uint64) uint64,
    readyCallback func(),
) *dispatcherSession {
    return &dispatcherSession{
        target:         target,
        localServerID:  localServerID,
        sendMessage:    sendMessage,
+       getSafeResetTs: getSafeResetTs,
        nextResetEpoch: nextResetEpoch,
        readyCallback:  readyCallback,
    }
}

func (s *dispatcherSession) commitReady(serverID node.ID) {
-   s.doReset(serverID, s.target.GetCheckpointTs())
+   s.doReset(serverID, s.getSafeResetTs())
}

func (s *dispatcherSession) reset(serverID node.ID) {
-   s.doReset(serverID, s.target.GetCheckpointTs())
+   s.doReset(serverID, s.getSafeResetTs())
}

Wire getSafeResetTs from dispatcherStat using the same capped progress used for EventService heartbeats.
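
The clamping idea can be illustrated with a small standalone sketch. The real capped-progress helper is not shown in this thread, so everything here is an assumption: `epochProgress` and `safeResetTs` are hypothetical names, and the rule "never reset beyond the larger of the epoch's starting ts and the highest event ts the collector accepted in that epoch" is one plausible reading of the invariant described above, not the project's confirmed logic.

```go
package main

import "fmt"

// epochProgress is a hypothetical view of what the collector has accepted
// in the current epoch.
type epochProgress struct {
	resetTs    uint64 // ts the current epoch started from
	maxEventTs uint64 // highest event ts accepted in this epoch; 0 if none yet
}

// safeResetTs caps the sink checkpoint at the progress the collector has
// observed in the current epoch (assumed semantics for this sketch).
func safeResetTs(sinkCheckpointTs uint64, p epochProgress) uint64 {
	observed := p.resetTs
	if p.maxEventTs > observed {
		observed = p.maxEventTs
	}
	if sinkCheckpointTs > observed {
		return observed
	}
	return sinkCheckpointTs
}

func main() {
	// The epoch started at ts 100 and no events have been accepted yet, but
	// stale in-flight events from the old epoch pushed the sink checkpoint
	// to 150. Resetting from 150 could skip the range (100, 150].
	fmt.Println(safeResetTs(150, epochProgress{resetTs: 100}))

	// Once the collector has accepted events up to 130, a checkpoint of 120
	// is within observed progress and is used unchanged.
	fmt.Println(safeResetTs(120, epochProgress{resetTs: 100, maxEventTs: 130}))
}
```

Under these assumptions the first call yields the epoch's own starting ts rather than the inflated checkpoint, which is the behavior the review comment asks `commitReady` and `reset` to have.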


Comment on lines +292 to +299
func (s *dispatcherSession) handleNotReusableEvent(event dispatcher.DispatcherEvent) {
    if *event.From == s.localServerID {
        log.Panic("should not happen: local event service should not send not reusable event")
    }
    candidate := s.connState.getNextRemoteCandidate()
    if candidate != "" {
        s.registerTo(candidate)
    }

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Clear the failed remote binding when no replacement candidate exists.

If getNextRemoteCandidate() returns empty here, connState.eventServiceID still points at the rejected remote. After that, trySetRemoteCandidates() will refuse any later candidate list because the session still looks attached, so this dispatcher can get stuck until some unrelated clear path runs.

Minimal fix
func (s *dispatcherSession) handleNotReusableEvent(event dispatcher.DispatcherEvent) {
    if *event.From == s.localServerID {
        log.Panic("should not happen: local event service should not send not reusable event")
    }
    candidate := s.connState.getNextRemoteCandidate()
-   if candidate != "" {
-       s.registerTo(candidate)
-   }
+   if candidate == "" {
+       s.connState.setEventServiceID("")
+       return
+   }
+   s.registerTo(candidate)
}
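
To make the stuck state concrete, here is a minimal sketch of the scenario. The `connState` type below is a hypothetical simplification, and the behavior of `trySetRemoteCandidates` (rejecting a new list while a binding is set) is taken from the description in this comment rather than from the actual source. Assigning `eventServiceID` directly stands in for `registerTo(candidate)`.

```go
package main

import "fmt"

// connState is a hypothetical, simplified stand-in for the session's
// connection state.
type connState struct {
	eventServiceID string
	candidates     []string
}

// trySetRemoteCandidates accepts a candidate list only while the session
// is unattached (assumed behavior, per the review comment).
func (c *connState) trySetRemoteCandidates(ids []string) bool {
	if c.eventServiceID != "" {
		return false
	}
	c.candidates = append([]string(nil), ids...)
	return true
}

// getNextRemoteCandidate pops the next candidate, or returns "" if none remain.
func (c *connState) getNextRemoteCandidate() string {
	if len(c.candidates) == 0 {
		return ""
	}
	next := c.candidates[0]
	c.candidates = c.candidates[1:]
	return next
}

// handleNotReusable mirrors the shape of the suggested fix: with no
// replacement candidate, drop the rejected binding so later candidate
// lists can be accepted.
func (c *connState) handleNotReusable() {
	candidate := c.getNextRemoteCandidate()
	if candidate == "" {
		c.eventServiceID = ""
		return
	}
	c.eventServiceID = candidate // stands in for registerTo(candidate)
}

func main() {
	c := &connState{eventServiceID: "remote-a"}
	// While still bound to the rejected remote, new candidates are refused.
	fmt.Println(c.trySetRemoteCandidates([]string{"remote-b"}))
	// After the rejection is handled with no candidates left, the binding
	// is cleared and a new list is accepted.
	c.handleNotReusable()
	fmt.Println(c.trySetRemoteCandidates([]string{"remote-b"}))
}
```

Without the `candidate == ""` branch, the second `trySetRemoteCandidates` call in this model would also be refused, which matches the stuck-dispatcher symptom described above.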

@ti-chi-bot

ti-chi-bot Bot commented May 7, 2026

[FORMAT CHECKER NOTIFICATION]

Notice: To remove the do-not-merge/needs-linked-issue label, please provide the linked issue number on one line in the PR body, for example: Issue Number: close #123 or Issue Number: ref #456.

📖 For more info, you can check the "Contribute Code" section in the development guide.

@ti-chi-bot

ti-chi-bot Bot commented May 7, 2026

@lidezhu: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

  • Test name: pull-error-log-review
  • Commit: b54d22c
  • Details: link
  • Required: true
  • Rerun command: /test pull-error-log-review

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


Labels

  • do-not-merge/needs-linked-issue
  • release-note (denotes a PR that will be considered when it comes time to generate release notes)
  • size/XL (denotes a PR that changes 500-999 lines, ignoring generated files)


1 participant