
coordinator,maintainer: add session epoch handoff #4816

Closed
hongyunyan wants to merge 9 commits into pingcap:master from hongyunyan:feat/maintainer-dispatcher-session-epoch

@hongyunyan
Collaborator

@hongyunyan hongyunyan commented Apr 14, 2026

Background

Maintainer and dispatcher manager currently rely on resend and node placement to converge during maintainer restart or move. Without an explicit runtime session boundary, late control messages and late runtime reports can still be processed by the wrong maintainer session during handoff.

Motivation

This change adds a minimal runtime-only session epoch so stale maintainer and dispatcher-manager messages stop mutating the active state machine, while preserving rolling upgrade compatibility and avoiding any new etcd persistent fields.

Summary of Changes

  • add session_epoch to the maintainer/dispatcher-manager runtime protobuf messages used in add/remove, bootstrap, post-bootstrap, close, heartbeat, block-status, redo progress, schedule, and merge flows
  • generate strict session epochs from PD TSO only and refuse to create a new maintainer session when PD TSO is unavailable
  • track the active maintainer session epoch in coordinator runtime state and wire it through add, move, and stop operators, including move fallback handling
  • tag maintainer and dispatcher-manager runtime messages with session epoch and filter stale or unexpected messages on both sides
  • keep bootstrap as the only dispatcher-manager path that can install or advance a non-zero session, while preserving the legacy sessionEpoch == 0 path for mixed-version rolling upgrade compatibility
  • add focused unit tests for session epoch generation, coordinator operators, maintainer filtering, dispatcher-manager filtering, and the coordinator legacy remove fallback path

Testing

  • make fmt
  • go test ./pkg/pdutil ./coordinator/changefeed ./coordinator/operator ./maintainer ./downstreamadapter/dispatchermanager ./downstreamadapter/dispatcherorchestrator
  • go test ./coordinator -count=1 -run 'Test(OnPeriodTaskAdvanceLiveness|HandleNonExistentChangefeedUsesLegacyRemoveWhenLocalMetadataMissing|ResumeChangefeed|ResumeChangefeedNormalState|ResumeChangefeedOverwriteUpdatesLastSavedCheckpointTs|ResumeChangefeedIgnoresStaleMaintainerErrorAndSchedules|PauseChangefeed|UpdateChangefeed|GetChangefeed|RemoveChangefeed|ListChangefeed|CreateChangefeed|CreateChangefeedDoesNotUpdateGCSafepoint|UpdateGCSafepointCallsGCManagerUpdate|UpdateGCSafepointDeletesServiceSafepointWhenNoChangefeed|RemoveLastChangefeedDeletesServiceSafepointImmediately|ConcurrentDeleteLastChangefeedAndCreateNewOneKeepsExpectedGCSafepoint)$'
  • go test ./coordinator -count=1 -run 'TestCoordinatorScheduling$'

Summary by CodeRabbit

  • New Features

    • Runtime session-epoch tracking for maintainers; session epochs are propagated in heartbeat and dispatcher messages and used by controller/operator flows.
    • Controller/operator flows now allocate PD-derived strict session epochs and include them in add/move/stop operations.
  • Bug Fixes

    • Session-aware gating prevents stale, legacy or future maintainer/dispatcher actions from being applied.
  • Documentation

    • Messaging protocol extended with session_epoch fields.
  • Chores

    • Added metric to count session-rejection events.
  • Tests

    • Extensive new tests covering session-epoch handling, gating, and operator behaviors.

@ti-chi-bot

ti-chi-bot Bot commented Apr 14, 2026

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected. Please follow our release note process to remove it.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ti-chi-bot ti-chi-bot Bot added the do-not-merge/release-note-label-needed label (Indicates that a PR should not merge because it's missing one of the release note labels.) Apr 14, 2026
@ti-chi-bot

ti-chi-bot Bot commented Apr 14, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign asddongmen for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai
Contributor

coderabbitai Bot commented Apr 14, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds a PD-backed monotonic maintainer session epoch and propagates it through operator allocation, controller flows, heartbeat protobufs, maintainer runtime, dispatcher managers/orchestrator, scheduler, and tests; introduces session-gating validation for inbound messages and epoch-aware operator/stop semantics.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Coordinator Changefeed: coordinator/changefeed/changefeed.go, coordinator/changefeed/changefeed_test.go | Add runtime fields currentMaintainerSessionEpoch and acceptZeroMaintainerSession with nil-safe APIs; NewAdd/RemoveMaintainerMessage signatures now accept sessionEpoch and set protobuf SessionEpoch. |
| Coordinator Controller & Tests: coordinator/controller.go, coordinator/controller_test.go, coordinator/coordinator_test.go, coordinator/create_changefeed_gc_test.go | Controller gains pdClient, validates maintainer session epochs on status updates, restores/passes epochs during bootstrap/remove flows; many tests and constructor call sites updated for new params and semantics. |
| Operator Layer & Tests: coordinator/operator/operator_*.go, coordinator/operator/*_test.go | Add/Move/Stop operators carry sessionEpoch (constructors/signatures changed); some operator factories now return (op, error) after PD epoch allocation; Start/Schedule/PostFinish publish/enforce sessionEpoch; tests updated/added. |
| Operator Controller Changes: coordinator/operator/operator_controller.go, coordinator/operator/operator_controller_test.go | Controller allocates strict session epochs via PD (GenerateStrictSessionEpoch) with timeout, passes epochs into operator constructors, StopChangefeed forwards sessionEpoch/allowZeroEpoch into stop operators; tests validate deadline usage. |
| Scheduler: coordinator/scheduler/basic.go, coordinator/scheduler/balance.go | Scheduler now uses operatorController NewAdd/NewMove which can return errors; logs and skips when epoch allocation fails; added structured warnings. |
| Dispatcher Manager & Helpers: downstreamadapter/dispatchermanager/dispatcher_manager.go, .../dispatcher_manager_info.go, .../helper.go, .../dispatcher_manager_redo.go, .../helper_test.go | Manager stores maintainerSessionEpoch, exposes Get/Set/AcceptBootstrap/AcceptMaintainerSession; handlers gate/drop messages by SessionEpoch and log reasons; outgoing dispatcher heartbeats/redo messages include SessionEpoch; tests added for gating and legacy behavior. |
| Dispatcher Orchestrator: downstreamadapter/dispatcherorchestrator/dispatcher_orchestrator.go | Bootstrap/post-bootstrap/close paths perform session gating via manager.Accept*; responses/errors include manager session epoch; legacy bootstrap cannot steal ownership once session-aware. |
| Maintainer Runtime & Manager: maintainer/maintainer.go, maintainer/maintainer_manager.go, maintainer/*_test.go | Maintainer holds sessionEpoch, gates inbound dispatcher messages via shouldAcceptDispatcherMessage, attaches SessionEpoch to outbound messages; manager handles duplicate/stale/higher-session add requests and increments a new metric on higher-session rejects. |
| Protocol & PD Utilities: heartbeatpb/heartbeat.proto, pkg/pdutil/utils.go, pkg/pdutil/utils_test.go, pkg/metrics/maintainer.go | Add session_epoch field to many heartbeat/dispatcher messages; add GenerateStrictSessionEpoch to derive monotonic epochs from PD with tests; add MaintainerSessionRejectCounter metric. |
| Downstream Dispatcher ctor & constructor sites: downstreamadapter/dispatchermanager/dispatcher_manager.go, various call sites | Constructor accepts initial maintainerSessionEpoch and initializes manager meta; many downstream call sites updated to pass/propagate SessionEpoch. |

Sequence Diagram(s)

sequenceDiagram
    participant Controller
    participant OperatorController
    participant PD as PDClient
    participant Changefeed
    participant Maintainer
    participant DispatcherManager

    Controller->>OperatorController: request NewAdd/NewMove/NewStop operator
    OperatorController->>PD: GenerateStrictSessionEpoch(ctx, lastIssued)
    PD-->>OperatorController: sessionEpoch
    OperatorController->>Changefeed: construct operator with sessionEpoch
    OperatorController->>OperatorController: AddOperator(op)
    OperatorController->>Maintainer: send Add/RemoveMaintainerRequest(SessionEpoch)
    Maintainer->>Maintainer: shouldAcceptDispatcherMessage(incomingSessionEpoch)
    alt accepted
        Maintainer->>DispatcherManager: bootstrap/post/heartbeat (SessionEpoch)
        DispatcherManager->>DispatcherManager: AcceptBootstrapSession/AcceptMaintainerSession
        alt accepted
            DispatcherManager-->DispatcherManager: apply update
        else rejected
            DispatcherManager-->>Maintainer: drop/reject (log reason)
        end
    else rejected
        Maintainer-->>OperatorController: ignore/drop (log)
    end
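
The maintainer-side branch of the diagram (accept, then apply; otherwise drop) might look roughly like the sketch below. The `heartbeat` and `maintainer` types are hypothetical and heavily trimmed, and the rule that an incoming epoch of 0 is accepted as legacy traffic is an assumption drawn from the PR description, not from the actual `shouldAcceptDispatcherMessage` code.

```go
package main

import "fmt"

// heartbeat is a hypothetical, trimmed-down runtime report; the real protobuf has more fields.
type heartbeat struct {
	SessionEpoch uint64
	CheckpointTs uint64
}

// maintainer models only the state the gate needs.
type maintainer struct {
	sessionEpoch uint64
	checkpointTs uint64
}

// shouldAccept is the assumed gate: epoch 0 marks a legacy peer and is let
// through; any non-zero epoch must match the maintainer's own session.
func (m *maintainer) shouldAccept(incoming uint64) bool {
	return incoming == 0 || incoming == m.sessionEpoch
}

// onHeartbeat applies a report only if it passes the gate and reports whether it did.
func (m *maintainer) onHeartbeat(hb heartbeat) bool {
	if !m.shouldAccept(hb.SessionEpoch) {
		return false // stale or unexpected session: drop without mutating state
	}
	if hb.CheckpointTs > m.checkpointTs {
		m.checkpointTs = hb.CheckpointTs
	}
	return true
}

func main() {
	m := &maintainer{sessionEpoch: 20}
	fmt.Println(m.onHeartbeat(heartbeat{SessionEpoch: 10, CheckpointTs: 999}))
	fmt.Println(m.onHeartbeat(heartbeat{SessionEpoch: 20, CheckpointTs: 5}))
	fmt.Println(m.checkpointTs)
}
```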

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

lgtm, approved, release-note

Suggested reviewers

  • 3AceShowHand
  • wk989898
  • flowbehappy

Poem

🐇 I hop with a tidy little streak,
I stamp epochs upon each heartbeat,
PD whispers numbers, controllers align,
Gates drop the stale, keep every session fine,
Hooray — logs and heartbeats now all in sync!

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 12.20% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Description check | ❓ Inconclusive | The description follows the template with Issue Number, Background/Motivation, Summary of Changes, and Testing sections. However, the Issue Number line uses a placeholder format and the Release Notes section is missing or incomplete. | Add the actual issue number (close #XXXX) and complete the Release Notes section according to the template guidelines. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly summarizes the main change: adding session epoch handoff between coordinator and maintainer, which is the core objective of this PR. |


@ti-chi-bot ti-chi-bot Bot added the size/XXL label (Denotes a PR that changes 1000+ lines, ignoring generated files.) Apr 14, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request implements a session-aware communication protocol between TiCDC components to ensure system consistency by filtering stale control messages. It introduces a monotonically increasing session epoch, generated via PD TSO, which is carried by heartbeat and bootstrap messages and validated by receiving handlers. The review feedback identifies several improvement opportunities: extending session validation to status reports sent from the maintainer to the coordinator, refining bootstrap logic to avoid suppressing maintainer change logs, enhancing nil safety and efficiency in changefeed lookups, resolving package name shadowing, and ensuring network-bound TSO requests use contexts with appropriate timeouts to prevent scheduler blocking.

Comment thread coordinator/changefeed/changefeed.go Outdated
logCoordinatorResolvedTs *atomic.Uint64
// currentMaintainerSessionEpoch is runtime-only coordinator state.
// It tracks the maintainer session that has already become active.
currentMaintainerSessionEpoch *atomic.Uint64

high

While currentMaintainerSessionEpoch is added to track the active session in the coordinator, it appears that heartbeatpb.MaintainerStatus (the message sent from Maintainer to Coordinator) has not been updated to include the session epoch. This means the Coordinator cannot verify if incoming status reports belong to the current active session, which could allow stale reports from a previous session to mutate the Coordinator's state machine or cause operators to finish prematurely. Consider adding session_epoch to MaintainerStatus and validating it in the Coordinator's heartbeat handler.
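
To illustrate the concern, here is a minimal sketch of an operator that ignores status reports from another session. `maintainerStatus` and `addOperator` are hypothetical, simplified types; the `SessionEpoch` field on the status is the addition this comment proposes, not something confirmed to exist in `heartbeatpb.MaintainerStatus`.

```go
package main

import "fmt"

// maintainerStatus is a hypothetical, simplified status report.
// SessionEpoch is the field the review comment proposes adding.
type maintainerStatus struct {
	State        string
	SessionEpoch uint64
}

// addOperator waits for the maintainer it started to report "working".
type addOperator struct {
	sessionEpoch uint64
	finished     bool
}

// check finishes the operator only for reports tagged with the operator's
// own session epoch; reports from any other session are ignored.
func (op *addOperator) check(st maintainerStatus) {
	if st.SessionEpoch != op.sessionEpoch {
		return // late report from a previous session must not finish this operator
	}
	if st.State == "working" {
		op.finished = true
	}
}

func main() {
	op := &addOperator{sessionEpoch: 42}
	op.check(maintainerStatus{State: "working", SessionEpoch: 41})
	fmt.Println(op.finished)
	op.check(maintainerStatus{State: "working", SessionEpoch: 42})
	fmt.Println(op.finished)
}
```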

Comment on lines +102 to +104
if e.meta.maintainerID != maintainerID {
	e.meta.maintainerID = maintainerID
}

high

Updating e.meta.maintainerID inside AcceptBootstrapSession when the session epoch matches the current one will cause the subsequent check in DispatcherOrchestrator.handleBootstrapRequest (manager.GetMaintainerID() != from) to always be false. This suppresses the "maintainer changed" log message and any associated monitoring logic in the orchestrator for session-aware bootstraps. It is better to let the orchestrator handle the ID update and logging consistently, or have this method return a flag indicating if the ID was changed.
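
The "return a flag" option could be sketched as below. `managerMeta` and `acceptBootstrap` are hypothetical names, node IDs are modeled as plain strings, and the legacy zero-epoch path is deliberately left out to keep the focus on the extra return value.

```go
package main

import (
	"fmt"
	"sync"
)

// managerMeta is a hypothetical, trimmed-down version of the dispatcher manager's metadata.
type managerMeta struct {
	mu           sync.Mutex
	maintainerID string
	sessionEpoch uint64
}

// acceptBootstrap installs or advances the session and additionally reports
// whether the maintainer ID changed, so the caller can still log the change.
// Legacy (epoch 0) handling is omitted from this sketch.
func (m *managerMeta) acceptBootstrap(maintainerID string, epoch uint64) (accepted bool, idChanged bool) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if epoch < m.sessionEpoch {
		return false, false // stale session: leave ID and epoch untouched
	}
	idChanged = m.maintainerID != maintainerID
	m.maintainerID = maintainerID
	m.sessionEpoch = epoch
	return true, idChanged
}

func main() {
	m := &managerMeta{maintainerID: "node-a", sessionEpoch: 5}
	accepted, changed := m.acceptBootstrap("node-b", 5)
	if accepted && changed {
		fmt.Println("maintainer changed")
	}
}
```

With this shape the orchestrator no longer needs to compare `GetMaintainerID()` against the sender after the call, which is the comparison the comment notes would always be false.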

Comment thread coordinator/controller.go Outdated
zap.String("status", common.FormatMaintainerStatus(status)))

keyspaceID := c.getChangefeed(cfID).GetKeyspaceID()
sessionEpoch := c.getChangefeed(cfID).GetCurrentMaintainerSessionEpoch()

medium

This line performs a redundant lookup of the changefeed from the database. The changefeed was already fetched in the previous line (line 498). Additionally, if the changefeed is missing from the database (which is expected in handleNonExistentChangefeed), the previous line might cause a nil pointer dereference if GetKeyspaceID() is not nil-safe. Consider fetching the changefeed once and checking for nil before accessing its fields.
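
A minimal sketch of the "fetch once, stay nil-safe" pattern follows. The `changefeed` type and `removeArgs` helper are hypothetical; the map stands in for the changefeed database, and the zero defaults for a missing changefeed are an assumption about what the remove message should carry.

```go
package main

import "fmt"

// changefeed is a hypothetical, trimmed-down model of the coordinator's changefeed object.
type changefeed struct {
	keyspaceID   uint32
	sessionEpoch uint64
}

// GetKeyspaceID is nil-safe: a nil receiver yields the zero keyspace.
func (c *changefeed) GetKeyspaceID() uint32 {
	if c == nil {
		return 0
	}
	return c.keyspaceID
}

// GetCurrentMaintainerSessionEpoch is nil-safe: a nil receiver yields the legacy epoch 0.
func (c *changefeed) GetCurrentMaintainerSessionEpoch() uint64 {
	if c == nil {
		return 0
	}
	return c.sessionEpoch
}

// removeArgs looks the changefeed up once and derives both values from that
// single result, so a missing entry cannot cause a nil dereference.
func removeArgs(db map[string]*changefeed, id string) (keyspaceID uint32, sessionEpoch uint64) {
	cf := db[id] // nil when the changefeed is not in the database
	return cf.GetKeyspaceID(), cf.GetCurrentMaintainerSessionEpoch()
}

func main() {
	db := map[string]*changefeed{"cf-1": {keyspaceID: 7, sessionEpoch: 100}}
	fmt.Println(removeArgs(db, "cf-1"))
	fmt.Println(removeArgs(db, "missing"))
}
```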

Comment on lines +135 to +139
changefeed := oc.changefeedDB.GetByID(cfID)
keyspaceID := uint32(0)
if changefeed != nil {
	keyspaceID = changefeed.GetKeyspaceID()
}

medium

The variable name changefeed shadows the package name changefeed imported in this file. This is generally discouraged in Go as it can lead to confusion. Additionally, the changefeed is looked up again in pushStopChangefeedOperator, which is redundant since the lookup is already performed here. Consider using a different name like cf and potentially passing the object to avoid redundant lookups.

Suggested change
-changefeed := oc.changefeedDB.GetByID(cfID)
-keyspaceID := uint32(0)
-if changefeed != nil {
-	keyspaceID = changefeed.GetKeyspaceID()
-}
+cf := oc.changefeedDB.GetByID(cfID)
+keyspaceID := uint32(0)
+if cf != nil {
+	keyspaceID = cf.GetKeyspaceID()
+}

-func (oc *Controller) NewAddMaintainerOperator(cf *changefeed.Changefeed, dest node.ID) operator.Operator[common.ChangeFeedID, *heartbeatpb.MaintainerStatus] {
-	return NewAddMaintainerOperator(oc.changefeedDB, cf, dest)
+func (oc *Controller) NewAddMaintainerOperator(cf *changefeed.Changefeed, dest node.ID) (operator.Operator[common.ChangeFeedID, *heartbeatpb.MaintainerStatus], error) {
+	sessionEpoch, err := pdutil.GenerateStrictSessionEpoch(context.Background(), oc.pdClient, cf.GetCurrentMaintainerSessionEpoch())

medium

Using context.Background() for a network request (PD TSO) can block the scheduler loop indefinitely if the PD client hangs or experiences high latency. It is recommended to use a context with a reasonable timeout or one that is tied to the lifecycle of the coordinator to ensure the scheduler remains responsive.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
coordinator/operator/operator_controller.go (1)

131-160: ⚠️ Potential issue | 🟠 Major

Capture the session epoch before StopByChangefeedID can remove the changefeed.

The comment on StopChangefeed says removed == true deletes the changefeed from changefeedDB. In that path, the second GetByID in pushStopChangefeedOperator returns nil, so the stop/remove operator is built with sessionEpoch == 0 instead of the active epoch. That drops the new session fence on the exact remove flow this PR is hardening.

🛠️ Suggested fix
 func (oc *Controller) StopChangefeed(_ context.Context, cfID common.ChangeFeedID, removed bool) operator.Operator[common.ChangeFeedID, *heartbeatpb.MaintainerStatus] {
 	oc.mu.Lock()
 	defer oc.mu.Unlock()
 
 	changefeed := oc.changefeedDB.GetByID(cfID)
 	keyspaceID := uint32(0)
+	sessionEpoch := uint64(0)
 	if changefeed != nil {
 		keyspaceID = changefeed.GetKeyspaceID()
+		sessionEpoch = changefeed.GetCurrentMaintainerSessionEpoch()
 	}
 	scheduledNode := oc.changefeedDB.StopByChangefeedID(cfID, removed)
@@
-	return oc.pushStopChangefeedOperator(keyspaceID, cfID, scheduledNode, removed)
+	return oc.pushStopChangefeedOperator(keyspaceID, cfID, scheduledNode, sessionEpoch, removed)
 }
@@
-func (oc *Controller) pushStopChangefeedOperator(keyspaceID uint32, cfID common.ChangeFeedID, nodeID node.ID, remove bool) operator.Operator[common.ChangeFeedID, *heartbeatpb.MaintainerStatus] {
-	sessionEpoch := uint64(0)
-	if cf := oc.changefeedDB.GetByID(cfID); cf != nil {
-		sessionEpoch = cf.GetCurrentMaintainerSessionEpoch()
-	}
+func (oc *Controller) pushStopChangefeedOperator(keyspaceID uint32, cfID common.ChangeFeedID, nodeID node.ID, sessionEpoch uint64, remove bool) operator.Operator[common.ChangeFeedID, *heartbeatpb.MaintainerStatus] {
 	op := NewStopChangefeedOperator(keyspaceID, cfID, nodeID, sessionEpoch, oc.selfNode.ID, oc.backend, remove)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@coordinator/operator/operator_controller.go` around lines 131 - 160,
StopChangefeed can call StopByChangefeedID which may remove the changefeed
before pushStopChangefeedOperator reads it, causing sessionEpoch to be zero;
capture the current maintainer session epoch from changefeed (via GetByID and
GetCurrentMaintainerSessionEpoch) in StopChangefeed before calling
StopByChangefeedID and pass that epoch into pushStopChangefeedOperator (update
pushStopChangefeedOperator signature to accept sessionEpoch uint64 and use it
when calling NewStopChangefeedOperator instead of re-reading the changefeed).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@coordinator/controller.go`:
- Around line 630-632: In finishBootstrap, avoid calling c.getChangefeed(id) for
entries in runningChangefeeds because the changefeed may not exist and
c.getChangefeed(id) can return nil; instead obtain keyspaceID and sessionEpoch
directly from the remoteMaintainer (rm) or its status (e.g., rm.KeyspaceID or
rm.GetKeyspaceID and rm.CurrentMaintainerSessionEpoch or
rm.Status.GetCurrentMaintainerSessionEpoch) and use those values when
constructing changefeed.RemoveMaintainerMessage and calling
c.messageCenter.SendCommand, keeping the loop over runningChangefeeds and the rm
variable intact.
- Around line 498-502: handleNonExistentChangefeed must not call
c.getChangefeed(cfID) because cf is nil; instead obtain keyspaceID and
sessionEpoch from the incoming status/message (or from the function parameter
that contains the remote changefeed status) and pass those into
changefeed.RemoveMaintainerMessage; locate the call site in
handleNonExistentChangefeed and replace keyspaceID :=
c.getChangefeed(cfID).GetKeyspaceID() and sessionEpoch :=
c.getChangefeed(cfID).GetCurrentMaintainerSessionEpoch() with values extracted
from the provided status object (or use safe defaults if the status lacks them),
then send the RemoveMaintainerMessage using those variables to avoid
dereferencing a nil changefeed.

In `@downstreamadapter/dispatchermanager/dispatcher_manager_info.go`:
- Around line 83-129: Both AcceptBootstrapSession and AcceptMaintainerSession
currently allow sessionEpoch==0 forever; change them to reject legacy (0) once a
non-zero session is installed: in AcceptBootstrapSession, after the initial zero
check, inspect e.meta.maintainerSessionEpoch (under e.meta.Lock) and if it's
non-zero then return false with a descriptive reason (e.g., "stale:legacy" or
similar) instead of accepting; in AcceptMaintainerSession, replace the
unconditional return true for sessionEpoch==0 with a conditional that returns
true only if GetMaintainerSessionEpoch() == 0, otherwise return false with a
descriptive reason so zero-epoch messages are blocked after a session has been
installed.

---

Outside diff comments:
In `@coordinator/operator/operator_controller.go`:
- Around line 131-160: StopChangefeed can call StopByChangefeedID which may
remove the changefeed before pushStopChangefeedOperator reads it, causing
sessionEpoch to be zero; capture the current maintainer session epoch from
changefeed (via GetByID and GetCurrentMaintainerSessionEpoch) in StopChangefeed
before calling StopByChangefeedID and pass that epoch into
pushStopChangefeedOperator (update pushStopChangefeedOperator signature to
accept sessionEpoch uint64 and use it when calling NewStopChangefeedOperator
instead of re-reading the changefeed).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8a4a7712-20f8-4d40-bb33-4110fcec5828

📥 Commits

Reviewing files that changed from the base of the PR and between 0a418b4 and d6ab767.

⛔ Files ignored due to path filters (1)
  • heartbeatpb/heartbeat.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (30)
  • coordinator/changefeed/changefeed.go
  • coordinator/changefeed/changefeed_test.go
  • coordinator/controller.go
  • coordinator/controller_test.go
  • coordinator/coordinator_test.go
  • coordinator/create_changefeed_gc_test.go
  • coordinator/operator/operator_add.go
  • coordinator/operator/operator_add_test.go
  • coordinator/operator/operator_controller.go
  • coordinator/operator/operator_controller_test.go
  • coordinator/operator/operator_move.go
  • coordinator/operator/operator_move_test.go
  • coordinator/operator/operator_stop.go
  • coordinator/operator/operator_stop_test.go
  • coordinator/scheduler/balance.go
  • coordinator/scheduler/basic.go
  • downstreamadapter/dispatchermanager/dispatcher_manager.go
  • downstreamadapter/dispatchermanager/dispatcher_manager_info.go
  • downstreamadapter/dispatchermanager/dispatcher_manager_redo.go
  • downstreamadapter/dispatchermanager/helper.go
  • downstreamadapter/dispatchermanager/helper_test.go
  • downstreamadapter/dispatcherorchestrator/dispatcher_orchestrator.go
  • heartbeatpb/heartbeat.proto
  • maintainer/maintainer.go
  • maintainer/maintainer_manager.go
  • maintainer/maintainer_manager_test.go
  • maintainer/maintainer_test.go
  • pkg/metrics/maintainer.go
  • pkg/pdutil/utils.go
  • pkg/pdutil/utils_test.go

Comment thread coordinator/controller.go Outdated
Comment on lines +498 to +502
 keyspaceID := c.getChangefeed(cfID).GetKeyspaceID()
+sessionEpoch := c.getChangefeed(cfID).GetCurrentMaintainerSessionEpoch()

 // Remove working changefeed from maintainer if it's not in changefeedDB
-_ = c.messageCenter.SendCommand(changefeed.RemoveMaintainerMessage(keyspaceID, cfID, from, true, true))
+_ = c.messageCenter.SendCommand(changefeed.RemoveMaintainerMessage(keyspaceID, cfID, from, true, true, sessionEpoch))
Contributor

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for getChangefeed implementation to confirm it queries changefeedDB
ast-grep --pattern $'func (c *Controller) getChangefeed($_) $_ {
  $$$
}'

Repository: pingcap/ticdc

Length of output: 267


🏁 Script executed:

# Read the context around lines 460-510 to understand when handleNonExistentChangefeed is called
# and the condition that leads to it
sed -n '460,510p' coordinator/controller.go

Repository: pingcap/ticdc

Length of output: 1602


Nil pointer dereference: getChangefeed will return nil here.

The handleNonExistentChangefeed function is called when cf == nil (line 469), meaning the changefeed is not in changefeedDB. However, lines 498-499 call c.getChangefeed(cfID) which queries the same changefeedDB and will return nil, causing a panic when accessing GetKeyspaceID() or GetCurrentMaintainerSessionEpoch().

The keyspaceID and sessionEpoch need to be obtained from the incoming status or another source, not from the non-existent changefeed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@coordinator/controller.go` around lines 498 - 502,
handleNonExistentChangefeed must not call c.getChangefeed(cfID) because cf is
nil; instead obtain keyspaceID and sessionEpoch from the incoming status/message
(or from the function parameter that contains the remote changefeed status) and
pass those into changefeed.RemoveMaintainerMessage; locate the call site in
handleNonExistentChangefeed and replace keyspaceID :=
c.getChangefeed(cfID).GetKeyspaceID() and sessionEpoch :=
c.getChangefeed(cfID).GetCurrentMaintainerSessionEpoch() with values extracted
from the provided status object (or use safe defaults if the status lacks them),
then send the RemoveMaintainerMessage using those variables to avoid
dereferencing a nil changefeed.

Comment thread coordinator/controller.go Outdated
Comment on lines +630 to +632
 keyspaceID := c.getChangefeed(id).GetKeyspaceID()
-_ = c.messageCenter.SendCommand(changefeed.RemoveMaintainerMessage(keyspaceID, id, rm.nodeID, true, true))
+sessionEpoch := c.getChangefeed(id).GetCurrentMaintainerSessionEpoch()
+_ = c.messageCenter.SendCommand(changefeed.RemoveMaintainerMessage(keyspaceID, id, rm.nodeID, true, true, sessionEpoch))
Contributor

⚠️ Potential issue | 🔴 Critical

Same nil pointer dereference risk in finishBootstrap.

This code path handles changefeeds that are NOT in allChangefeeds (line 625: for id, rm := range runningChangefeeds—these are changefeeds remaining after deletion at line 612). Calling c.getChangefeed(id) will return nil since the changefeed was never added to changefeedDB in this bootstrap flow.

The keyspaceID and sessionEpoch should be obtained from the remoteMaintainer struct or the status information, not from the changefeedDB.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@coordinator/controller.go` around lines 630 - 632, In finishBootstrap, avoid
calling c.getChangefeed(id) for entries in runningChangefeeds because the
changefeed may not exist and c.getChangefeed(id) can return nil; instead obtain
keyspaceID and sessionEpoch directly from the remoteMaintainer (rm) or its
status (e.g., rm.KeyspaceID or rm.GetKeyspaceID and
rm.CurrentMaintainerSessionEpoch or rm.Status.GetCurrentMaintainerSessionEpoch)
and use those values when constructing changefeed.RemoveMaintainerMessage and
calling c.messageCenter.SendCommand, keeping the loop over runningChangefeeds
and the rm variable intact.

Comment on lines +83 to +129
func (e *DispatcherManager) AcceptBootstrapSession(maintainerID node.ID, sessionEpoch uint64) (bool, string) {
	if sessionEpoch == 0 {
		// Zero means the peer is still running the legacy protocol. We must keep
		// accepting that path during rolling upgrade, otherwise a new dispatcher
		// manager can no longer interoperate with an old maintainer.
		return true, "legacy"
	}

	e.meta.Lock()
	defer e.meta.Unlock()

	switch {
	case e.meta.maintainerSessionEpoch == 0:
		e.meta.maintainerID = maintainerID
		e.meta.maintainerSessionEpoch = sessionEpoch
		return true, "install"
	case sessionEpoch < e.meta.maintainerSessionEpoch:
		return false, fmt.Sprintf("stale:%d<%d", sessionEpoch, e.meta.maintainerSessionEpoch)
	case sessionEpoch == e.meta.maintainerSessionEpoch:
		if e.meta.maintainerID != maintainerID {
			e.meta.maintainerID = maintainerID
		}
		return true, "current"
	default:
		e.meta.maintainerID = maintainerID
		e.meta.maintainerSessionEpoch = sessionEpoch
		return true, "advance"
	}
}

func (e *DispatcherManager) AcceptMaintainerSession(sessionEpoch uint64) (bool, string) {
	if sessionEpoch == 0 {
		// Zero-session control messages intentionally stay on the legacy path for
		// mixed-version upgrade compatibility. Strict stale-session filtering only
		// applies after both ends speak the session-aware protocol.
		return true, "legacy"
	}

	current := e.GetMaintainerSessionEpoch()
	switch {
	case sessionEpoch < current:
		return false, fmt.Sprintf("stale:%d<%d", sessionEpoch, current)
	case sessionEpoch == current:
		return true, "current"
	default:
		return false, fmt.Sprintf("future:%d>%d", sessionEpoch, current)
	}
Contributor

⚠️ Potential issue | 🟠 Major

Reject legacy epoch 0 after a non-zero session is installed.

AcceptBootstrapSession and AcceptMaintainerSession both treat sessionEpoch == 0 as valid forever. Once the dispatcher manager has already advanced maintainerSessionEpoch, that still lets stale legacy traffic through the bootstrap/post-bootstrap/close gates; handleCloseRequest will then reach TryClose for a zero-epoch request. The legacy path should only stay open while the current session is still 0.

🛠️ Suggested fix
 func (e *DispatcherManager) AcceptBootstrapSession(maintainerID node.ID, sessionEpoch uint64) (bool, string) {
-	if sessionEpoch == 0 {
-		// Zero means the peer is still running the legacy protocol. We must keep
-		// accepting that path during rolling upgrade, otherwise a new dispatcher
-		// manager can no longer interoperate with an old maintainer.
-		return true, "legacy"
-	}
-
 	e.meta.Lock()
 	defer e.meta.Unlock()
+
+	if sessionEpoch == 0 {
+		if e.meta.maintainerSessionEpoch == 0 {
+			return true, "legacy"
+		}
+		return false, fmt.Sprintf("stale:%d<%d", sessionEpoch, e.meta.maintainerSessionEpoch)
+	}
@@
 func (e *DispatcherManager) AcceptMaintainerSession(sessionEpoch uint64) (bool, string) {
-	if sessionEpoch == 0 {
-		// Zero-session control messages intentionally stay on the legacy path for
-		// mixed-version upgrade compatibility. Strict stale-session filtering only
-		// applies after both ends speak the session-aware protocol.
-		return true, "legacy"
-	}
-
 	current := e.GetMaintainerSessionEpoch()
+	if sessionEpoch == 0 {
+		if current == 0 {
+			return true, "legacy"
+		}
+		return false, fmt.Sprintf("stale:%d<%d", sessionEpoch, current)
+	}
+
 	switch {
 	case sessionEpoch < current:
 		return false, fmt.Sprintf("stale:%d<%d", sessionEpoch, current)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@downstreamadapter/dispatchermanager/dispatcher_manager_info.go` around lines
83 - 129, Both AcceptBootstrapSession and AcceptMaintainerSession currently
allow sessionEpoch==0 forever; change them to reject legacy (0) once a non-zero
session is installed: in AcceptBootstrapSession, after the initial zero check,
inspect e.meta.maintainerSessionEpoch (under e.meta.Lock) and if it's non-zero
then return false with a descriptive reason (e.g., "stale:legacy" or similar)
instead of accepting; in AcceptMaintainerSession, replace the unconditional
return true for sessionEpoch==0 with a conditional that returns true only if
GetMaintainerSessionEpoch() == 0, otherwise return false with a descriptive
reason so zero-epoch messages are blocked after a session has been installed.
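
To make the rule in this comment concrete, here is a minimal, self-contained sketch of a gate that keeps the legacy path open only while no session is installed. `sessionGate`, `AcceptBootstrap`, and `AcceptControl` are hypothetical names for illustration, not the types in this PR, and the real dispatcher manager may need different locking.

```go
package main

import (
	"fmt"
	"sync"
)

// sessionGate is a hypothetical stand-in for the dispatcher manager's
// session metadata. It only models the accept/reject decision.
type sessionGate struct {
	mu           sync.Mutex
	maintainerID string
	epoch        uint64
}

// AcceptBootstrap may install or advance a non-zero session. Epoch 0 (legacy
// peer) is accepted only while no session has been installed.
func (g *sessionGate) AcceptBootstrap(maintainerID string, epoch uint64) (bool, string) {
	g.mu.Lock()
	defer g.mu.Unlock()

	switch {
	case epoch == 0 && g.epoch == 0:
		return true, "legacy"
	case epoch < g.epoch: // also rejects epoch 0 once a session exists
		return false, fmt.Sprintf("stale:%d<%d", epoch, g.epoch)
	case epoch == g.epoch:
		g.maintainerID = maintainerID
		return true, "current"
	case g.epoch == 0:
		g.maintainerID, g.epoch = maintainerID, epoch
		return true, "install"
	default:
		g.maintainerID, g.epoch = maintainerID, epoch
		return true, "advance"
	}
}

// AcceptControl never changes the session. It accepts the current epoch, and
// the legacy epoch 0 only while no session has been installed.
func (g *sessionGate) AcceptControl(epoch uint64) (bool, string) {
	g.mu.Lock()
	defer g.mu.Unlock()

	switch {
	case epoch == 0 && g.epoch == 0:
		return true, "legacy"
	case epoch < g.epoch:
		return false, fmt.Sprintf("stale:%d<%d", epoch, g.epoch)
	case epoch == g.epoch:
		return true, "current"
	default:
		return false, fmt.Sprintf("future:%d>%d", epoch, g.epoch)
	}
}

func main() {
	g := &sessionGate{}
	fmt.Println(g.AcceptControl(0))         // legacy peer, nothing installed yet
	fmt.Println(g.AcceptBootstrap("m1", 10)) // first non-zero bootstrap installs
	fmt.Println(g.AcceptControl(0))         // legacy traffic is now stale
	fmt.Println(g.AcceptControl(11))        // only bootstrap may advance
}
```

Doing the zero check under the same lock as the comparison keeps the decision and the installed epoch consistent, which is why the sketch moves it inside the critical section.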

@coderabbitai coderabbitai Bot left a comment

♻️ Duplicate comments (2)
coordinator/controller.go (2)

498-502: ⚠️ Potential issue | 🔴 Critical

Nil pointer dereference: c.getChangefeed(cfID) returns nil here.

This function is called when the changefeed doesn't exist in changefeedDB (lines 468-470 check cf == nil). Calling c.getChangefeed(cfID) again returns nil, causing a panic when accessing GetKeyspaceID() or GetCurrentMaintainerSessionEpoch().

The sessionEpoch should be obtained from status.SessionEpoch. For keyspaceID, check if MaintainerStatus includes the field, or use a default value of 0 for legacy compatibility.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@coordinator/controller.go` around lines 498 - 502, The call to
c.getChangefeed(cfID) can return nil and cause a panic when invoking
GetKeyspaceID()/GetCurrentMaintainerSessionEpoch(); update the removal call so
it does not call c.getChangefeed(cfID): obtain sessionEpoch from
status.SessionEpoch and determine keyspaceID from status.MaintainerStatus if
present (otherwise use 0 for legacy compatibility), then call
c.messageCenter.SendCommand(changefeed.RemoveMaintainerMessage(keyspaceID, cfID,
from, true, true, sessionEpoch)); ensure you remove the c.getChangefeed(cfID)
usage and rely solely on the already-available status and its MaintainerStatus
field.

633-635: ⚠️ Potential issue | 🔴 Critical

Nil pointer dereference: c.getChangefeed(id) returns nil for stale maintainers.

At this point in finishBootstrap, runningChangefeeds contains only entries that were NOT in allChangefeeds (deleted at line 615). These changefeeds were never added to changefeedDB, so c.getChangefeed(id) returns nil.

The fix should obtain values from rm (the remoteMaintainer):

  • sessionEpoch from rm.status.SessionEpoch
  • keyspaceID from rm.status if available, otherwise use a default
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@coordinator/controller.go` around lines 633 - 635, finishBootstrap
dereferences c.getChangefeed(id), which can be nil for stale maintainers. Use
values from the remoteMaintainer (rm) in the loop instead: replace
c.getChangefeed(id).GetCurrentMaintainerSessionEpoch() and .GetKeyspaceID() with
rm.status.SessionEpoch and rm.status.KeyspaceID (falling back to a sensible
default if rm.status is nil or KeyspaceID is unset), then call
c.messageCenter.SendCommand(changefeed.RemoveMaintainerMessage(keyspaceID, id,
rm.nodeID, true, true, sessionEpoch)). Dereference the result of
c.getChangefeed only when it is non-nil.
🧹 Nitpick comments (2)
downstreamadapter/dispatchermanager/helper_test.go (2)

269-273: Add an immutability assertion for stale bootstrap requests.

The stale case validates epoch immutability, but it should also assert maintainerID is not changed by a stale bootstrap attempt.

Proposed test assertion
 	accepted, reason := dm.AcceptBootstrapSession(node.ID("new"), 9)
 	require.False(t, accepted)
 	require.Contains(t, reason, "stale")
 	require.Equal(t, uint64(10), dm.GetMaintainerSessionEpoch())
+	require.Equal(t, node.ID("old"), dm.GetMaintainerID())
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@downstreamadapter/dispatchermanager/helper_test.go` around lines 269 - 273,
Before calling dm.AcceptBootstrapSession for the stale case, capture the current
maintainer ID (e.g., via dm.GetMaintainerID() or the existing maintainerID
variable) and after the call assert it was not changed; specifically, add a
require.Equal comparing the pre-call maintainer ID to the post-call maintainer
ID (alongside the existing epoch immutability checks) to ensure a stale
bootstrap attempt does not modify maintainerID.

405-432: Expand redo-forward filter coverage to include legacy and future epochs.

This test verifies stale and current epochs, but it should also explicitly assert that SessionEpoch == 0 and SessionEpoch > current are rejected without mutating redoGlobalTs.

Proposed additional cases
 	blocking := handler.Handle(dm, NewRedoResolvedTsForwardMessage(&heartbeatpb.RedoResolvedTsForwardMessage{
 		ChangefeedID: changefeedID.ToPB(),
 		ResolvedTs:   150,
 		SessionEpoch: 9,
 	}))
 	require.False(t, blocking)
 	require.Equal(t, uint64(100), dm.redoGlobalTs.Load())
+
+	blocking = handler.Handle(dm, NewRedoResolvedTsForwardMessage(&heartbeatpb.RedoResolvedTsForwardMessage{
+		ChangefeedID: changefeedID.ToPB(),
+		ResolvedTs:   150,
+		SessionEpoch: 0,
+	}))
+	require.False(t, blocking)
+	require.Equal(t, uint64(100), dm.redoGlobalTs.Load())
+
+	blocking = handler.Handle(dm, NewRedoResolvedTsForwardMessage(&heartbeatpb.RedoResolvedTsForwardMessage{
+		ChangefeedID: changefeedID.ToPB(),
+		ResolvedTs:   150,
+		SessionEpoch: 11,
+	}))
+	require.False(t, blocking)
+	require.Equal(t, uint64(100), dm.redoGlobalTs.Load())
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@downstreamadapter/dispatchermanager/helper_test.go` around lines 405 - 432,
Update TestRedoResolvedTsForwardMessageHandlerFiltersSessionEpoch to add two
more assertions: send a RedoResolvedTsForwardMessage with SessionEpoch == 0 and
one with SessionEpoch > dm.meta.maintainerSessionEpoch (e.g., 11) using
NewRedoResolvedTsForwardMessage and the same changefeedID/ResolvedTs, call
handler.Handle for each, assert it returns false, and assert
dm.redoGlobalTs.Load() remains unchanged (still 100) after each call; this
ensures RedoResolvedTsForwardMessage handling in
RedoResolvedTsForwardMessageHandler correctly rejects legacy (0) and future
epochs without mutating redoGlobalTs.
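
As a rough illustration of the four cases this nitpick asks the test to cover, the sketch below models a strict filter in isolation. `redoForwarder` and `handle` are hypothetical. The return value here means "applied", which is not the same as the `blocking` value returned by the real handler.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// redoForwarder is a hypothetical model of the state touched by the
// redo-forward handler: one active session epoch and one global timestamp.
type redoForwarder struct {
	sessionEpoch uint64
	redoGlobalTs atomic.Uint64
}

// handle advances redoGlobalTs only for messages tagged with the current
// epoch, and only forward. It reports whether the timestamp was applied.
func (r *redoForwarder) handle(epoch, resolvedTs uint64) bool {
	if epoch != r.sessionEpoch {
		return false // stale, legacy (0), or future epoch: no mutation
	}
	for {
		cur := r.redoGlobalTs.Load()
		if resolvedTs <= cur {
			return false
		}
		if r.redoGlobalTs.CompareAndSwap(cur, resolvedTs) {
			return true
		}
	}
}

func main() {
	r := &redoForwarder{sessionEpoch: 10}
	r.redoGlobalTs.Store(100)

	cases := []struct {
		name  string
		epoch uint64
	}{
		{"stale", 9},
		{"legacy", 0},
		{"future", 11},
		{"current", 10},
	}
	for _, tc := range cases {
		applied := r.handle(tc.epoch, 150)
		fmt.Printf("%s: applied=%v redoGlobalTs=%d\n", tc.name, applied, r.redoGlobalTs.Load())
	}
}
```

Listing the rejected epochs before the accepted one keeps each assertion independent of the others, since only the last case mutates the timestamp.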

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e2d71c9d-83f3-4da7-8356-4aabecefc623

📥 Commits

Reviewing files that changed from the base of the PR and between d6ab767 and 1c50eaf.

⛔ Files ignored due to path filters (1)
  • heartbeatpb/heartbeat.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (9)
  • coordinator/controller.go
  • coordinator/controller_test.go
  • coordinator/operator/operator_controller_test.go
  • downstreamadapter/dispatchermanager/dispatcher_manager_info.go
  • downstreamadapter/dispatchermanager/helper.go
  • downstreamadapter/dispatchermanager/helper_test.go
  • heartbeatpb/heartbeat.proto
  • maintainer/maintainer.go
  • maintainer/maintainer_test.go
✅ Files skipped from review due to trivial changes (1)
  • downstreamadapter/dispatchermanager/helper.go
🚧 Files skipped from review as they are similar to previous changes (6)
  • coordinator/operator/operator_controller_test.go
  • coordinator/controller_test.go
  • maintainer/maintainer_test.go
  • heartbeatpb/heartbeat.proto
  • downstreamadapter/dispatchermanager/dispatcher_manager_info.go
  • maintainer/maintainer.go

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
coordinator/operator/operator_controller.go (1)

324-339: Add caller-side timeout for PD epoch generation.

GenerateStrictSessionEpoch in pkg/pdutil/utils.go passes the context directly to pdClient.GetTS(ctx) without creating a timeout-bounded child context. Currently, both callers at lines 325 and 334 use context.Background(), which has no timeout. Under PD degradation, these operations could block indefinitely.

Proposed caller-side bounded context
 func (oc *Controller) NewAddMaintainerOperator(cf *changefeed.Changefeed, dest node.ID) (operator.Operator[common.ChangeFeedID, *heartbeatpb.MaintainerStatus], error) {
-	sessionEpoch, err := pdutil.GenerateStrictSessionEpoch(context.Background(), oc.pdClient, cf.GetCurrentMaintainerSessionEpoch())
+	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
+	defer cancel()
+	sessionEpoch, err := pdutil.GenerateStrictSessionEpoch(ctx, oc.pdClient, cf.GetCurrentMaintainerSessionEpoch())
 	if err != nil {
 		return nil, err
 	}
 	return NewAddMaintainerOperator(oc.changefeedDB, cf, dest, sessionEpoch), nil
 }
 
 func (oc *Controller) NewMoveMaintainerOperator(cf *changefeed.Changefeed, origin, dest node.ID) (operator.Operator[common.ChangeFeedID, *heartbeatpb.MaintainerStatus], error) {
 	activeSessionEpoch := cf.GetCurrentMaintainerSessionEpoch()
-	destSessionEpoch, err := pdutil.GenerateStrictSessionEpoch(context.Background(), oc.pdClient, activeSessionEpoch)
+	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
+	defer cancel()
+	destSessionEpoch, err := pdutil.GenerateStrictSessionEpoch(ctx, oc.pdClient, activeSessionEpoch)
 	if err != nil {
 		return nil, err
 	}
 	return NewMoveMaintainerOperator(oc.changefeedDB, cf, origin, dest, activeSessionEpoch, destSessionEpoch), nil
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@coordinator/operator/operator_controller.go` around lines 324 - 339, The
calls in Controller.NewAddMaintainerOperator and
Controller.NewMoveMaintainerOperator currently pass context.Background() into
pdutil.GenerateStrictSessionEpoch (which delegates to oc.pdClient.GetTS) and can
block; change both to create a timeout-bounded context (e.g., ctx, cancel :=
context.WithTimeout(context.Background(), <reasonable timeout>) with defer
cancel()) and pass that ctx to pdutil.GenerateStrictSessionEpoch; ensure you
choose a shared constant timeout name (e.g., pdEpochTimeout) visible in the
package and always defer cancel() after creating the child context.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 50f06f97-a8d2-4b5b-8fc4-bc7ae78137cf

📥 Commits

Reviewing files that changed from the base of the PR and between 1c50eaf and 10e70b2.

📒 Files selected for processing (7)
  • coordinator/changefeed/changefeed.go
  • coordinator/operator/operator_add.go
  • coordinator/operator/operator_add_test.go
  • coordinator/operator/operator_controller.go
  • coordinator/operator/operator_controller_test.go
  • coordinator/operator/operator_move.go
  • coordinator/operator/operator_move_test.go
🚧 Files skipped from review as they are similar to previous changes (4)
  • coordinator/operator/operator_add.go
  • coordinator/operator/operator_add_test.go
  • coordinator/operator/operator_move.go
  • coordinator/operator/operator_controller_test.go

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 77ecc061-1646-40d5-ac5a-bc8585f8428a

📥 Commits

Reviewing files that changed from the base of the PR and between 10e70b2 and 0683497.

📒 Files selected for processing (11)
  • coordinator/controller.go
  • coordinator/controller_test.go
  • coordinator/coordinator_test.go
  • coordinator/operator/operator_add.go
  • coordinator/operator/operator_add_test.go
  • coordinator/operator/operator_controller.go
  • coordinator/operator/operator_controller_test.go
  • coordinator/operator/operator_move.go
  • coordinator/operator/operator_move_test.go
  • coordinator/operator/operator_stop.go
  • coordinator/operator/operator_stop_test.go
✅ Files skipped from review due to trivial changes (1)
  • coordinator/operator/operator_add_test.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • coordinator/operator/operator_stop_test.go
  • coordinator/operator/operator_add.go
  • coordinator/operator/operator_move.go

Comment thread coordinator/controller.go
Comment on lines +506 to +512
// removeReportedMaintainer removes a working maintainer that is only visible via
// remote runtime state. Once local metadata is gone, the reporter's session is
// the only fenced owner token we can still trust for cleanup.
func (c *Controller) removeReportedMaintainer(cfID common.ChangeFeedID, from node.ID, sessionEpoch uint64) {
	keyspaceID := c.getChangefeed(cfID).GetKeyspaceID()
	_ = c.messageCenter.SendCommand(changefeed.RemoveMaintainerMessage(keyspaceID, cfID, from, true, true, sessionEpoch))
}

⚠️ Potential issue | 🔴 Critical

Nil pointer dereference: getChangefeed returns nil when local metadata is missing.

removeReportedMaintainer is called from handleNonExistentChangefeed (line 502) and finishBootstrap (line 680) when the changefeed doesn't exist locally. However, line 510 calls c.getChangefeed(cfID).GetKeyspaceID() which will panic because getChangefeed returns nil.

For the handleNonExistentChangefeed path, keyspaceID should default to 0 since local metadata is unavailable. For finishBootstrap, the keyspace info could potentially be extracted from the remote status if available.

🐛 Proposed fix
 func (c *Controller) removeReportedMaintainer(cfID common.ChangeFeedID, from node.ID, sessionEpoch uint64) {
-	keyspaceID := c.getChangefeed(cfID).GetKeyspaceID()
+	keyspaceID := uint32(0)
+	if cf := c.getChangefeed(cfID); cf != nil {
+		keyspaceID = cf.GetKeyspaceID()
+	}
 	_ = c.messageCenter.SendCommand(changefeed.RemoveMaintainerMessage(keyspaceID, cfID, from, true, true, sessionEpoch))
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@coordinator/controller.go` around lines 506 - 512, removeReportedMaintainer
currently dereferences c.getChangefeed(cfID) which can be nil; update
removeReportedMaintainer to first grab cf := c.getChangefeed(cfID) and if cf ==
nil set keyspaceID := uint32(0) (default for the handleNonExistentChangefeed
path) else keyspaceID = cf.GetKeyspaceID(); then call
c.messageCenter.SendCommand(changefeed.RemoveMaintainerMessage(keyspaceID, cfID,
from, true, true, sessionEpoch)). Additionally, update the finishBootstrap
caller to try to extract keyspaceID from the remote status and pass it into
removeReportedMaintainer (or ensure the code path populates the changefeed so
GetKeyspaceID is non-nil) so finishBootstrap doesn't rely on a nil getChangefeed
result.
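
The nil-safe shape of this fix can be tried outside the coordinator. In the sketch below, `changefeedInfo`, `reportedStatus`, `removeRequest`, and `buildRemoveRequest` are all hypothetical stand-ins. Only the idea comes from the comment above: take the session epoch from the reporter, and fall back to keyspace 0 when local metadata is missing.

```go
package main

import "fmt"

// changefeedInfo is a hypothetical stand-in for locally stored metadata.
type changefeedInfo struct{ keyspaceID uint32 }

// reportedStatus is a hypothetical stand-in for the status a remote
// maintainer reports. Only the field used here is modelled.
type reportedStatus struct{ SessionEpoch uint64 }

// removeRequest is a hypothetical stand-in for the remove command payload.
type removeRequest struct {
	KeyspaceID   uint32
	ChangefeedID string
	Target       string
	SessionEpoch uint64
}

// buildRemoveRequest never dereferences missing local metadata. The keyspace
// defaults to 0 when the changefeed is unknown locally, and the epoch always
// comes from the reporter.
func buildRemoveRequest(local map[string]*changefeedInfo, cfID, from string, status reportedStatus) removeRequest {
	keyspaceID := uint32(0)
	if cf := local[cfID]; cf != nil {
		keyspaceID = cf.keyspaceID
	}
	return removeRequest{
		KeyspaceID:   keyspaceID,
		ChangefeedID: cfID,
		Target:       from,
		SessionEpoch: status.SessionEpoch,
	}
}

func main() {
	local := map[string]*changefeedInfo{"known": {keyspaceID: 7}}

	fmt.Printf("%+v\n", buildRemoveRequest(local, "known", "node-1", reportedStatus{SessionEpoch: 10}))
	fmt.Printf("%+v\n", buildRemoveRequest(local, "gone", "node-2", reportedStatus{SessionEpoch: 12}))
}
```

Whether keyspace 0 is an acceptable default outside the legacy case is a question for the PR author. The sketch only shows that the cleanup message can still be built when the lookup returns nil.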

@coderabbitai coderabbitai Bot left a comment

♻️ Duplicate comments (1)
coordinator/controller.go (1)

508-510: ⚠️ Potential issue | 🔴 Critical

Nil dereference in stale-maintainer cleanup.

Line 509 dereferences c.getChangefeed(cfID) even though this helper is only used after we've already concluded local metadata is missing or stale. That panics before the remove request is sent, so late maintainer reports can still break the coordinator during the exact cleanup flow this PR is adding.

🐛 Minimal fix
 func (c *Controller) removeReportedMaintainer(cfID common.ChangeFeedID, from node.ID, sessionEpoch uint64) {
-	keyspaceID := c.getChangefeed(cfID).GetKeyspaceID()
+	keyspaceID := uint32(0)
+	if cf := c.getChangefeed(cfID); cf != nil {
+		keyspaceID = cf.GetKeyspaceID()
+	}
 	_ = c.messageCenter.SendCommand(changefeed.RemoveMaintainerMessage(keyspaceID, cfID, from, true, true, sessionEpoch))
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@coordinator/controller.go` around lines 508 - 510, The
removeReportedMaintainer function dereferences c.getChangefeed(cfID)
unconditionally; change it to call c.getChangefeed(cfID) once into a local
variable, check for nil, and only call GetKeyspaceID when the result is
non-nil. Otherwise use a zero keyspaceID, and still invoke
c.messageCenter.SendCommand with changefeed.RemoveMaintainerMessage(keyspaceID,
cfID, from, true, true, sessionEpoch) (referencing removeReportedMaintainer,
c.getChangefeed, GetKeyspaceID, messageCenter.SendCommand, and
changefeed.RemoveMaintainerMessage).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 6f0d1c55-cb0a-4513-bcaf-af2876da7084

📥 Commits

Reviewing files that changed from the base of the PR and between 0683497 and 356c2e3.

📒 Files selected for processing (9)
  • coordinator/changefeed/changefeed.go
  • coordinator/controller.go
  • coordinator/controller_test.go
  • coordinator/operator/operator_add.go
  • coordinator/operator/operator_controller.go
  • coordinator/operator/operator_controller_test.go
  • coordinator/operator/operator_move.go
  • coordinator/operator/operator_stop.go
  • coordinator/operator/operator_stop_test.go
🚧 Files skipped from review as they are similar to previous changes (5)
  • coordinator/operator/operator_add.go
  • coordinator/operator/operator_move.go
  • coordinator/operator/operator_controller_test.go
  • coordinator/operator/operator_stop.go
  • coordinator/changefeed/changefeed.go

@ti-chi-bot

ti-chi-bot Bot commented Apr 15, 2026

[FORMAT CHECKER NOTIFICATION]

Notice: To remove the do-not-merge/needs-linked-issue label, please provide the linked issue number on one line in the PR body, for example: Issue Number: close #123 or Issue Number: ref #456.

📖 For more info, you can check the "Contribute Code" section in the development guide.

@hongyunyan hongyunyan closed this Apr 15, 2026

Labels

do-not-merge/needs-linked-issue do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
