coordinator,maintainer: add session epoch handoff #4816
hongyunyan wants to merge 9 commits into pingcap:master
Conversation
Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected; please follow our release note process to remove it. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED. Needs approval from an approver in each of these files; approvers can indicate their approval by writing an approval comment. The full list of commands accepted by this bot can be found here.
Note: Reviews paused. This branch appears to be under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
📝 Walkthrough
Adds a PD-backed monotonic maintainer session epoch and propagates it through operator allocation, controller flows, heartbeat protobufs, maintainer runtime, dispatcher managers/orchestrator, scheduler, and tests; introduces session-gating validation for inbound messages and epoch-aware operator/stop semantics.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Controller
    participant OperatorController
    participant PD as PDClient
    participant Changefeed
    participant Maintainer
    participant DispatcherManager
    Controller->>OperatorController: request NewAdd/NewMove/NewStop operator
    OperatorController->>PD: GenerateStrictSessionEpoch(ctx, lastIssued)
    PD-->>OperatorController: sessionEpoch
    OperatorController->>Changefeed: construct operator with sessionEpoch
    OperatorController->>OperatorController: AddOperator(op)
    OperatorController->>Maintainer: send Add/RemoveMaintainerRequest(SessionEpoch)
    Maintainer->>Maintainer: shouldAcceptDispatcherMessage(incomingSessionEpoch)
    alt accepted
        Maintainer->>DispatcherManager: bootstrap/post/heartbeat (SessionEpoch)
        DispatcherManager->>DispatcherManager: AcceptBootstrapSession/AcceptMaintainerSession
        alt accepted
            DispatcherManager-->DispatcherManager: apply update
        else rejected
            DispatcherManager-->>Maintainer: drop/reject (log reason)
        end
    else rejected
        Maintainer-->>OperatorController: ignore/drop (log)
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 1 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
Code Review
This pull request implements a session-aware communication protocol between TiCDC components to ensure system consistency by filtering stale control messages. It introduces a monotonically increasing session epoch, generated via PD TSO, which is carried by heartbeat and bootstrap messages and validated by receiving handlers. The review feedback identifies several improvement opportunities: extending session validation to status reports sent from the maintainer to the coordinator, refining bootstrap logic to avoid suppressing maintainer change logs, enhancing nil safety and efficiency in changefeed lookups, resolving package name shadowing, and ensuring network-bound TSO requests use contexts with appropriate timeouts to prevent scheduler blocking.
```go
logCoordinatorResolvedTs *atomic.Uint64
// currentMaintainerSessionEpoch is runtime-only coordinator state.
// It tracks the maintainer session that has already become active.
currentMaintainerSessionEpoch *atomic.Uint64
```
While currentMaintainerSessionEpoch is added to track the active session in the coordinator, it appears that heartbeatpb.MaintainerStatus (the message sent from Maintainer to Coordinator) has not been updated to include the session epoch. This means the Coordinator cannot verify if incoming status reports belong to the current active session, which could allow stale reports from a previous session to mutate the Coordinator's state machine or cause operators to finish prematurely. Consider adding session_epoch to MaintainerStatus and validating it in the Coordinator's heartbeat handler.
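The gate this comment asks for could look like the sketch below. Only the field name `currentMaintainerSessionEpoch` comes from the diff; the `coordinator` stand-in, `shouldAcceptStatus`, and the legacy-zero rule (mirroring the dispatcher-side gates elsewhere in this PR) are assumptions, not the PR's code:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// coordinator is a stripped-down stand-in for the real Coordinator; only the
// currentMaintainerSessionEpoch field mirrors the diff above.
type coordinator struct {
	currentMaintainerSessionEpoch atomic.Uint64
}

// shouldAcceptStatus gates an incoming MaintainerStatus by its (proposed)
// session_epoch field. Epoch 0 is let through for legacy senders; anything
// older than the active session is dropped.
func (c *coordinator) shouldAcceptStatus(statusEpoch uint64) (bool, string) {
	current := c.currentMaintainerSessionEpoch.Load()
	switch {
	case statusEpoch == 0:
		return true, "legacy"
	case statusEpoch < current:
		return false, fmt.Sprintf("stale:%d<%d", statusEpoch, current)
	default:
		return true, "current"
	}
}
```

With this in the heartbeat handler, a status report from a superseded maintainer session would be logged and dropped instead of mutating the coordinator's state machine.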
```go
if e.meta.maintainerID != maintainerID {
	e.meta.maintainerID = maintainerID
}
```
Updating e.meta.maintainerID inside AcceptBootstrapSession when the session epoch matches the current one will cause the subsequent check in DispatcherOrchestrator.handleBootstrapRequest (manager.GetMaintainerID() != from) to always be false. This suppresses the "maintainer changed" log message and any associated monitoring logic in the orchestrator for session-aware bootstraps. It is better to let the orchestrator handle the ID update and logging consistently, or have this method return a flag indicating if the ID was changed.
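One way to realize "return a flag indicating if the ID was changed" is sketched below, so the orchestrator keeps ownership of the "maintainer changed" log. The types and method here are illustrative stand-ins, not the real DispatcherManager API:

```go
package main

// manager is a minimal stand-in for DispatcherManager; only the
// accepted/idChanged return contract is the point of this sketch.
type nodeID string

type manager struct {
	maintainerID    nodeID
	maintainerEpoch uint64
}

// acceptBootstrapSession returns (accepted, idChanged). The caller (the
// orchestrator) decides whether to log the maintainer change, instead of this
// method silently absorbing the ID update.
func (m *manager) acceptBootstrapSession(id nodeID, epoch uint64) (accepted bool, idChanged bool) {
	if epoch < m.maintainerEpoch {
		return false, false // stale session: leave state untouched
	}
	idChanged = m.maintainerID != id
	m.maintainerID = id
	m.maintainerEpoch = epoch
	return true, idChanged
}
```

The orchestrator can then log `maintainer changed` whenever `idChanged` is true, preserving the monitoring signal for session-aware bootstraps.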
```go
	zap.String("status", common.FormatMaintainerStatus(status)))

keyspaceID := c.getChangefeed(cfID).GetKeyspaceID()
sessionEpoch := c.getChangefeed(cfID).GetCurrentMaintainerSessionEpoch()
```
This line performs a redundant lookup of the changefeed from the database. The changefeed was already fetched in the previous line (line 498). Additionally, if the changefeed is missing from the database (which is expected in handleNonExistentChangefeed), the previous line might cause a nil pointer dereference if GetKeyspaceID() is not nil-safe. Consider fetching the changefeed once and checking for nil before accessing its fields.
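A minimal sketch of the fetch-once, nil-checked pattern the comment suggests; the types are stand-ins for the real changefeed objects, and zero defaults are assumed for the absent case:

```go
package main

// changefeedInfo stands in for the real *changefeed.Changefeed.
type changefeedInfo struct {
	keyspaceID   uint32
	sessionEpoch uint64
}

func (c *changefeedInfo) GetKeyspaceID() uint32                      { return c.keyspaceID }
func (c *changefeedInfo) GetCurrentMaintainerSessionEpoch() uint64   { return c.sessionEpoch }

// resolveRemoveArgs reads both fields from a single lookup result and falls
// back to safe defaults when the changefeed is absent, which is the expected
// state in handleNonExistentChangefeed.
func resolveRemoveArgs(cf *changefeedInfo) (keyspaceID uint32, sessionEpoch uint64) {
	if cf == nil {
		return 0, 0
	}
	return cf.GetKeyspaceID(), cf.GetCurrentMaintainerSessionEpoch()
}
```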
```go
changefeed := oc.changefeedDB.GetByID(cfID)
keyspaceID := uint32(0)
if changefeed != nil {
	keyspaceID = changefeed.GetKeyspaceID()
}
```
The variable name changefeed shadows the package name changefeed imported in this file. This is generally discouraged in Go as it can lead to confusion. Additionally, the changefeed is looked up again in pushStopChangefeedOperator, which is redundant since the lookup is already performed here. Consider using a different name like cf and potentially passing the object to avoid redundant lookups.
```diff
-changefeed := oc.changefeedDB.GetByID(cfID)
-keyspaceID := uint32(0)
-if changefeed != nil {
-	keyspaceID = changefeed.GetKeyspaceID()
-}
+cf := oc.changefeedDB.GetByID(cfID)
+keyspaceID := uint32(0)
+if cf != nil {
+	keyspaceID = cf.GetKeyspaceID()
+}
```
```diff
-func (oc *Controller) NewAddMaintainerOperator(cf *changefeed.Changefeed, dest node.ID) operator.Operator[common.ChangeFeedID, *heartbeatpb.MaintainerStatus] {
-	return NewAddMaintainerOperator(oc.changefeedDB, cf, dest)
+func (oc *Controller) NewAddMaintainerOperator(cf *changefeed.Changefeed, dest node.ID) (operator.Operator[common.ChangeFeedID, *heartbeatpb.MaintainerStatus], error) {
+	sessionEpoch, err := pdutil.GenerateStrictSessionEpoch(context.Background(), oc.pdClient, cf.GetCurrentMaintainerSessionEpoch())
```
Using context.Background() for a network request (PD TSO) can block the scheduler loop indefinitely if the PD client hangs or experiences high latency. It is recommended to use a context with a reasonable timeout or one that is tied to the lifecycle of the coordinator to ensure the scheduler remains responsive.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
coordinator/operator/operator_controller.go (1)
131-160: ⚠️ Potential issue | 🟠 Major — Capture the session epoch before `StopByChangefeedID` can remove the changefeed.
The comment on `StopChangefeed` says `removed == true` deletes the changefeed from `changefeedDB`. In that path, the second `GetByID` in `pushStopChangefeedOperator` returns `nil`, so the stop/remove operator is built with `sessionEpoch == 0` instead of the active epoch. That drops the new session fence on the exact remove flow this PR is hardening.
🛠️ Suggested fix
```diff
 func (oc *Controller) StopChangefeed(_ context.Context, cfID common.ChangeFeedID, removed bool) operator.Operator[common.ChangeFeedID, *heartbeatpb.MaintainerStatus] {
 	oc.mu.Lock()
 	defer oc.mu.Unlock()
 	changefeed := oc.changefeedDB.GetByID(cfID)
 	keyspaceID := uint32(0)
+	sessionEpoch := uint64(0)
 	if changefeed != nil {
 		keyspaceID = changefeed.GetKeyspaceID()
+		sessionEpoch = changefeed.GetCurrentMaintainerSessionEpoch()
 	}
 	scheduledNode := oc.changefeedDB.StopByChangefeedID(cfID, removed)
@@
-	return oc.pushStopChangefeedOperator(keyspaceID, cfID, scheduledNode, removed)
+	return oc.pushStopChangefeedOperator(keyspaceID, cfID, scheduledNode, sessionEpoch, removed)
 }
@@
-func (oc *Controller) pushStopChangefeedOperator(keyspaceID uint32, cfID common.ChangeFeedID, nodeID node.ID, remove bool) operator.Operator[common.ChangeFeedID, *heartbeatpb.MaintainerStatus] {
-	sessionEpoch := uint64(0)
-	if cf := oc.changefeedDB.GetByID(cfID); cf != nil {
-		sessionEpoch = cf.GetCurrentMaintainerSessionEpoch()
-	}
+func (oc *Controller) pushStopChangefeedOperator(keyspaceID uint32, cfID common.ChangeFeedID, nodeID node.ID, sessionEpoch uint64, remove bool) operator.Operator[common.ChangeFeedID, *heartbeatpb.MaintainerStatus] {
 	op := NewStopChangefeedOperator(keyspaceID, cfID, nodeID, sessionEpoch, oc.selfNode.ID, oc.backend, remove)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@coordinator/operator/operator_controller.go` around lines 131 - 160, StopChangefeed can call StopByChangefeedID which may remove the changefeed before pushStopChangefeedOperator reads it, causing sessionEpoch to be zero; capture the current maintainer session epoch from changefeed (via GetByID and GetCurrentMaintainerSessionEpoch) in StopChangefeed before calling StopByChangefeedID and pass that epoch into pushStopChangefeedOperator (update pushStopChangefeedOperator signature to accept sessionEpoch uint64 and use it when calling NewStopChangefeedOperator instead of re-reading the changefeed).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@coordinator/controller.go`:
- Around line 630-632: In finishBootstrap, avoid calling c.getChangefeed(id) for
entries in runningChangefeeds because the changefeed may not exist and
c.getChangefeed(id) can return nil; instead obtain keyspaceID and sessionEpoch
directly from the remoteMaintainer (rm) or its status (e.g., rm.KeyspaceID or
rm.GetKeyspaceID and rm.CurrentMaintainerSessionEpoch or
rm.Status.GetCurrentMaintainerSessionEpoch) and use those values when
constructing changefeed.RemoveMaintainerMessage and calling
c.messageCenter.SendCommand, keeping the loop over runningChangefeeds and the rm
variable intact.
- Around line 498-502: handleNonExistentChangefeed must not call
c.getChangefeed(cfID) because cf is nil; instead obtain keyspaceID and
sessionEpoch from the incoming status/message (or from the function parameter
that contains the remote changefeed status) and pass those into
changefeed.RemoveMaintainerMessage; locate the call site in
handleNonExistentChangefeed and replace keyspaceID :=
c.getChangefeed(cfID).GetKeyspaceID() and sessionEpoch :=
c.getChangefeed(cfID).GetCurrentMaintainerSessionEpoch() with values extracted
from the provided status object (or use safe defaults if the status lacks them),
then send the RemoveMaintainerMessage using those variables to avoid
dereferencing a nil changefeed.
In `@downstreamadapter/dispatchermanager/dispatcher_manager_info.go`:
- Around line 83-129: Both AcceptBootstrapSession and AcceptMaintainerSession
currently allow sessionEpoch==0 forever; change them to reject legacy (0) once a
non-zero session is installed: in AcceptBootstrapSession, after the initial zero
check, inspect e.meta.maintainerSessionEpoch (under e.meta.Lock) and if it's
non-zero then return false with a descriptive reason (e.g., "stale:legacy" or
similar) instead of accepting; in AcceptMaintainerSession, replace the
unconditional return true for sessionEpoch==0 with a conditional that returns
true only if GetMaintainerSessionEpoch() == 0, otherwise return false with a
descriptive reason so zero-epoch messages are blocked after a session has been
installed.
---
Outside diff comments:
In `@coordinator/operator/operator_controller.go`:
- Around line 131-160: StopChangefeed can call StopByChangefeedID which may
remove the changefeed before pushStopChangefeedOperator reads it, causing
sessionEpoch to be zero; capture the current maintainer session epoch from
changefeed (via GetByID and GetCurrentMaintainerSessionEpoch) in StopChangefeed
before calling StopByChangefeedID and pass that epoch into
pushStopChangefeedOperator (update pushStopChangefeedOperator signature to
accept sessionEpoch uint64 and use it when calling NewStopChangefeedOperator
instead of re-reading the changefeed).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 8a4a7712-20f8-4d40-bb33-4110fcec5828
⛔ Files ignored due to path filters (1)
`heartbeatpb/heartbeat.pb.go` is excluded by `!**/*.pb.go`
📒 Files selected for processing (30)
- coordinator/changefeed/changefeed.go
- coordinator/changefeed/changefeed_test.go
- coordinator/controller.go
- coordinator/controller_test.go
- coordinator/coordinator_test.go
- coordinator/create_changefeed_gc_test.go
- coordinator/operator/operator_add.go
- coordinator/operator/operator_add_test.go
- coordinator/operator/operator_controller.go
- coordinator/operator/operator_controller_test.go
- coordinator/operator/operator_move.go
- coordinator/operator/operator_move_test.go
- coordinator/operator/operator_stop.go
- coordinator/operator/operator_stop_test.go
- coordinator/scheduler/balance.go
- coordinator/scheduler/basic.go
- downstreamadapter/dispatchermanager/dispatcher_manager.go
- downstreamadapter/dispatchermanager/dispatcher_manager_info.go
- downstreamadapter/dispatchermanager/dispatcher_manager_redo.go
- downstreamadapter/dispatchermanager/helper.go
- downstreamadapter/dispatchermanager/helper_test.go
- downstreamadapter/dispatcherorchestrator/dispatcher_orchestrator.go
- heartbeatpb/heartbeat.proto
- maintainer/maintainer.go
- maintainer/maintainer_manager.go
- maintainer/maintainer_manager_test.go
- maintainer/maintainer_test.go
- pkg/metrics/maintainer.go
- pkg/pdutil/utils.go
- pkg/pdutil/utils_test.go
```diff
 keyspaceID := c.getChangefeed(cfID).GetKeyspaceID()
+sessionEpoch := c.getChangefeed(cfID).GetCurrentMaintainerSessionEpoch()

 // Remove working changefeed from maintainer if it's not in changefeedDB
-_ = c.messageCenter.SendCommand(changefeed.RemoveMaintainerMessage(keyspaceID, cfID, from, true, true))
+_ = c.messageCenter.SendCommand(changefeed.RemoveMaintainerMessage(keyspaceID, cfID, from, true, true, sessionEpoch))
```
🧩 Analysis chain
🏁 Script executed:
```bash
#!/bin/bash
# Search for getChangefeed implementation to confirm it queries changefeedDB
ast-grep --pattern $'func (c *Controller) getChangefeed($_) $_ {
  $$$
}'
```

Repository: pingcap/ticdc
Length of output: 267
🏁 Script executed:
```bash
# Read the context around lines 460-510 to understand when handleNonExistentChangefeed is called
# and the condition that leads to it
sed -n '460,510p' coordinator/controller.go
```

Repository: pingcap/ticdc
Length of output: 1602
Nil pointer dereference: getChangefeed will return nil here.
The handleNonExistentChangefeed function is called when cf == nil (line 469), meaning the changefeed is not in changefeedDB. However, lines 498-499 call c.getChangefeed(cfID) which queries the same changefeedDB and will return nil, causing a panic when accessing GetKeyspaceID() or GetCurrentMaintainerSessionEpoch().
The keyspaceID and sessionEpoch need to be obtained from the incoming status or another source, not from the non-existent changefeed.
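A small sketch of the status-derived fallback described above; `maintainerStatus` mirrors only the `SessionEpoch` field and is a hypothetical stand-in for the real `heartbeatpb.MaintainerStatus`:

```go
package main

// maintainerStatus stands in for heartbeatpb.MaintainerStatus; only the
// SessionEpoch field matters for this sketch.
type maintainerStatus struct {
	SessionEpoch uint64
}

// sessionEpochFromStatus derives the epoch from the incoming message instead
// of dereferencing a lookup that is known to be nil on this path. Zero is the
// legacy default, which the receiver already treats as the legacy protocol.
func sessionEpochFromStatus(status *maintainerStatus) uint64 {
	if status == nil {
		return 0
	}
	return status.SessionEpoch
}
```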
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@coordinator/controller.go` around lines 498 - 502,
handleNonExistentChangefeed must not call c.getChangefeed(cfID) because cf is
nil; instead obtain keyspaceID and sessionEpoch from the incoming status/message
(or from the function parameter that contains the remote changefeed status) and
pass those into changefeed.RemoveMaintainerMessage; locate the call site in
handleNonExistentChangefeed and replace keyspaceID :=
c.getChangefeed(cfID).GetKeyspaceID() and sessionEpoch :=
c.getChangefeed(cfID).GetCurrentMaintainerSessionEpoch() with values extracted
from the provided status object (or use safe defaults if the status lacks them),
then send the RemoveMaintainerMessage using those variables to avoid
dereferencing a nil changefeed.
```diff
 keyspaceID := c.getChangefeed(id).GetKeyspaceID()
-_ = c.messageCenter.SendCommand(changefeed.RemoveMaintainerMessage(keyspaceID, id, rm.nodeID, true, true))
+sessionEpoch := c.getChangefeed(id).GetCurrentMaintainerSessionEpoch()
+_ = c.messageCenter.SendCommand(changefeed.RemoveMaintainerMessage(keyspaceID, id, rm.nodeID, true, true, sessionEpoch))
```
Same nil pointer dereference risk in finishBootstrap.
This code path handles changefeeds that are NOT in allChangefeeds (line 625: for id, rm := range runningChangefeeds—these are changefeeds remaining after deletion at line 612). Calling c.getChangefeed(id) will return nil since the changefeed was never added to changefeedDB in this bootstrap flow.
The keyspaceID and sessionEpoch should be obtained from the remoteMaintainer struct or the status information, not from the changefeedDB.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@coordinator/controller.go` around lines 630 - 632, In finishBootstrap, avoid
calling c.getChangefeed(id) for entries in runningChangefeeds because the
changefeed may not exist and c.getChangefeed(id) can return nil; instead obtain
keyspaceID and sessionEpoch directly from the remoteMaintainer (rm) or its
status (e.g., rm.KeyspaceID or rm.GetKeyspaceID and
rm.CurrentMaintainerSessionEpoch or rm.Status.GetCurrentMaintainerSessionEpoch)
and use those values when constructing changefeed.RemoveMaintainerMessage and
calling c.messageCenter.SendCommand, keeping the loop over runningChangefeeds
and the rm variable intact.
```go
func (e *DispatcherManager) AcceptBootstrapSession(maintainerID node.ID, sessionEpoch uint64) (bool, string) {
	if sessionEpoch == 0 {
		// Zero means the peer is still running the legacy protocol. We must keep
		// accepting that path during rolling upgrade, otherwise a new dispatcher
		// manager can no longer interoperate with an old maintainer.
		return true, "legacy"
	}

	e.meta.Lock()
	defer e.meta.Unlock()

	switch {
	case e.meta.maintainerSessionEpoch == 0:
		e.meta.maintainerID = maintainerID
		e.meta.maintainerSessionEpoch = sessionEpoch
		return true, "install"
	case sessionEpoch < e.meta.maintainerSessionEpoch:
		return false, fmt.Sprintf("stale:%d<%d", sessionEpoch, e.meta.maintainerSessionEpoch)
	case sessionEpoch == e.meta.maintainerSessionEpoch:
		if e.meta.maintainerID != maintainerID {
			e.meta.maintainerID = maintainerID
		}
		return true, "current"
	default:
		e.meta.maintainerID = maintainerID
		e.meta.maintainerSessionEpoch = sessionEpoch
		return true, "advance"
	}
}

func (e *DispatcherManager) AcceptMaintainerSession(sessionEpoch uint64) (bool, string) {
	if sessionEpoch == 0 {
		// Zero-session control messages intentionally stay on the legacy path for
		// mixed-version upgrade compatibility. Strict stale-session filtering only
		// applies after both ends speak the session-aware protocol.
		return true, "legacy"
	}

	current := e.GetMaintainerSessionEpoch()
	switch {
	case sessionEpoch < current:
		return false, fmt.Sprintf("stale:%d<%d", sessionEpoch, current)
	case sessionEpoch == current:
		return true, "current"
	default:
		return false, fmt.Sprintf("future:%d>%d", sessionEpoch, current)
	}
}
```
Reject legacy epoch 0 after a non-zero session is installed.
AcceptBootstrapSession and AcceptMaintainerSession both treat sessionEpoch == 0 as valid forever. Once the dispatcher manager has already advanced maintainerSessionEpoch, that still lets stale legacy traffic through the bootstrap/post-bootstrap/close gates; handleCloseRequest will then reach TryClose for a zero-epoch request. The legacy path should only stay open while the current session is still 0.
🛠️ Suggested fix

```diff
 func (e *DispatcherManager) AcceptBootstrapSession(maintainerID node.ID, sessionEpoch uint64) (bool, string) {
-	if sessionEpoch == 0 {
-		// Zero means the peer is still running the legacy protocol. We must keep
-		// accepting that path during rolling upgrade, otherwise a new dispatcher
-		// manager can no longer interoperate with an old maintainer.
-		return true, "legacy"
-	}
-
 	e.meta.Lock()
 	defer e.meta.Unlock()
+
+	if sessionEpoch == 0 {
+		if e.meta.maintainerSessionEpoch == 0 {
+			return true, "legacy"
+		}
+		return false, fmt.Sprintf("stale:%d<%d", sessionEpoch, e.meta.maintainerSessionEpoch)
+	}
@@
 func (e *DispatcherManager) AcceptMaintainerSession(sessionEpoch uint64) (bool, string) {
-	if sessionEpoch == 0 {
-		// Zero-session control messages intentionally stay on the legacy path for
-		// mixed-version upgrade compatibility. Strict stale-session filtering only
-		// applies after both ends speak the session-aware protocol.
-		return true, "legacy"
-	}
-
 	current := e.GetMaintainerSessionEpoch()
+	if sessionEpoch == 0 {
+		if current == 0 {
+			return true, "legacy"
+		}
+		return false, fmt.Sprintf("stale:%d<%d", sessionEpoch, current)
+	}
+
 	switch {
 	case sessionEpoch < current:
 		return false, fmt.Sprintf("stale:%d<%d", sessionEpoch, current)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@downstreamadapter/dispatchermanager/dispatcher_manager_info.go` around lines
83 - 129, Both AcceptBootstrapSession and AcceptMaintainerSession currently
allow sessionEpoch==0 forever; change them to reject legacy (0) once a non-zero
session is installed: in AcceptBootstrapSession, after the initial zero check,
inspect e.meta.maintainerSessionEpoch (under e.meta.Lock) and if it's non-zero
then return false with a descriptive reason (e.g., "stale:legacy" or similar)
instead of accepting; in AcceptMaintainerSession, replace the unconditional
return true for sessionEpoch==0 with a conditional that returns true only if
GetMaintainerSessionEpoch() == 0, otherwise return false with a descriptive
reason so zero-epoch messages are blocked after a session has been installed.
♻️ Duplicate comments (2)
coordinator/controller.go (2)
498-502: ⚠️ Potential issue | 🔴 Critical — Nil pointer dereference: `c.getChangefeed(cfID)` returns nil here.
This function is called when the changefeed doesn't exist in `changefeedDB` (lines 468-470 check `cf == nil`). Calling `c.getChangefeed(cfID)` again will return `nil`, causing a panic when accessing `GetKeyspaceID()` or `GetCurrentMaintainerSessionEpoch()`.
The `sessionEpoch` should be obtained from `status.SessionEpoch`. For `keyspaceID`, check if `MaintainerStatus` includes the field, or use a default value of `0` for legacy compatibility.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@coordinator/controller.go` around lines 498 - 502, The call to c.getChangefeed(cfID) can return nil and cause a panic when invoking GetKeyspaceID()/GetCurrentMaintainerSessionEpoch(); update the removal call so it does not call c.getChangefeed(cfID): obtain sessionEpoch from status.SessionEpoch and determine keyspaceID from status.MaintainerStatus if present (otherwise use 0 for legacy compatibility), then call c.messageCenter.SendCommand(changefeed.RemoveMaintainerMessage(keyspaceID, cfID, from, true, true, sessionEpoch)); ensure you remove the c.getChangefeed(cfID) usage and rely solely on the already-available status and its MaintainerStatus field.
633-635: ⚠️ Potential issue | 🔴 Critical — Nil pointer dereference: `c.getChangefeed(id)` returns nil for stale maintainers.
At this point in `finishBootstrap`, `runningChangefeeds` contains only entries that were NOT in `allChangefeeds` (deleted at line 615). These changefeeds were never added to `changefeedDB`, so `c.getChangefeed(id)` returns `nil`.
The fix should obtain values from `rm` (the remoteMaintainer):
- `sessionEpoch` from `rm.status.SessionEpoch`
- `keyspaceID` from `rm.status` if available, otherwise use a default
🤖 Prompt for AI Agents
🧹 Nitpick comments (2)
downstreamadapter/dispatchermanager/helper_test.go (2)
269-273: Add an immutability assertion for stale bootstrap requests.
The stale case validates epoch immutability, but it should also assert `maintainerID` is not changed by a stale bootstrap attempt.
Proposed test assertion

```diff
 accepted, reason := dm.AcceptBootstrapSession(node.ID("new"), 9)
 require.False(t, accepted)
 require.Contains(t, reason, "stale")
 require.Equal(t, uint64(10), dm.GetMaintainerSessionEpoch())
+require.Equal(t, node.ID("old"), dm.GetMaintainerID())
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@downstreamadapter/dispatchermanager/helper_test.go` around lines 269 - 273, Before calling dm.AcceptBootstrapSession for the stale case, capture the current maintainer ID (e.g., via dm.GetMaintainerID() or the existing maintainerID variable) and after the call assert it was not changed; specifically, add a require.Equal comparing the pre-call maintainer ID to the post-call maintainer ID (alongside the existing epoch immutability checks) to ensure a stale bootstrap attempt does not modify maintainerID.
405-432: Expand redo-forward filter coverage to include legacy and future epochs.
This test verifies stale and current epochs, but it should also explicitly assert that `SessionEpoch == 0` and `SessionEpoch > current` are rejected without mutating `redoGlobalTs`.
Proposed additional cases

```diff
 blocking := handler.Handle(dm, NewRedoResolvedTsForwardMessage(&heartbeatpb.RedoResolvedTsForwardMessage{
 	ChangefeedID: changefeedID.ToPB(),
 	ResolvedTs:   150,
 	SessionEpoch: 9,
 }))
 require.False(t, blocking)
 require.Equal(t, uint64(100), dm.redoGlobalTs.Load())
+
+blocking = handler.Handle(dm, NewRedoResolvedTsForwardMessage(&heartbeatpb.RedoResolvedTsForwardMessage{
+	ChangefeedID: changefeedID.ToPB(),
+	ResolvedTs:   150,
+	SessionEpoch: 0,
+}))
+require.False(t, blocking)
+require.Equal(t, uint64(100), dm.redoGlobalTs.Load())
+
+blocking = handler.Handle(dm, NewRedoResolvedTsForwardMessage(&heartbeatpb.RedoResolvedTsForwardMessage{
+	ChangefeedID: changefeedID.ToPB(),
+	ResolvedTs:   150,
+	SessionEpoch: 11,
+}))
+require.False(t, blocking)
+require.Equal(t, uint64(100), dm.redoGlobalTs.Load())
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@downstreamadapter/dispatchermanager/helper_test.go` around lines 405 - 432, Update TestRedoResolvedTsForwardMessageHandlerFiltersSessionEpoch to add two more assertions: send a RedoResolvedTsForwardMessage with SessionEpoch == 0 and one with SessionEpoch > dm.meta.maintainerSessionEpoch (e.g., 11) using NewRedoResolvedTsForwardMessage and the same changefeedID/ResolvedTs, call handler.Handle for each, assert it returns false, and assert dm.redoGlobalTs.Load() remains unchanged (still 100) after each call; this ensures RedoResolvedTsForwardMessage handling in RedoResolvedTsForwardMessageHandler correctly rejects legacy (0) and future epochs without mutating redoGlobalTs.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@coordinator/controller.go`:
- Around line 498-502: The call to c.getChangefeed(cfID) can return nil and
cause a panic when invoking GetKeyspaceID()/GetCurrentMaintainerSessionEpoch();
update the removal call so it does not call c.getChangefeed(cfID): obtain
sessionEpoch from status.SessionEpoch and determine keyspaceID from
status.MaintainerStatus if present (otherwise use 0 for legacy compatibility),
then call
c.messageCenter.SendCommand(changefeed.RemoveMaintainerMessage(keyspaceID, cfID,
from, true, true, sessionEpoch)); ensure you remove the c.getChangefeed(cfID)
usage and rely solely on the already-available status and its MaintainerStatus
field.
- Around line 633-635: finishBootstrap dereferences c.getChangefeed(id) which
can be nil for stale maintainers; instead use values from the remoteMaintainer
(rm) passed into the loop: replace uses of
c.getChangefeed(id).GetCurrentMaintainerSessionEpoch() and .GetKeyspaceID() with
rm.status.SessionEpoch and rm.status.KeyspaceID (falling back to a sensible
default if rm.status or KeyspaceID is nil/empty), and then call
c.messageCenter.SendCommand(changefeed.RemoveMaintainerMessage(keyspaceID, id,
rm.nodeID, true, true, sessionEpoch)); ensure you reference finishBootstrap,
runningChangefeeds, rm (remoteMaintainer), and GetChangefeed only when non-nil.
---
Nitpick comments:
In `@downstreamadapter/dispatchermanager/helper_test.go`:
- Around line 269-273: Before calling dm.AcceptBootstrapSession for the stale
case, capture the current maintainer ID (e.g., via dm.GetMaintainerID() or the
existing maintainerID variable) and after the call assert it was not changed;
specifically, add a require.Equal comparing the pre-call maintainer ID to the
post-call maintainer ID (alongside the existing epoch immutability checks) to
ensure a stale bootstrap attempt does not modify maintainerID.
- Around line 405-432: Update
TestRedoResolvedTsForwardMessageHandlerFiltersSessionEpoch to add two more
assertions: send a RedoResolvedTsForwardMessage with SessionEpoch == 0 and one
with SessionEpoch > dm.meta.maintainerSessionEpoch (e.g., 11) using
NewRedoResolvedTsForwardMessage and the same changefeedID/ResolvedTs, call
handler.Handle for each, assert it returns false, and assert
dm.redoGlobalTs.Load() remains unchanged (still 100) after each call; this
ensures RedoResolvedTsForwardMessage handling in
RedoResolvedTsForwardMessageHandler correctly rejects legacy (0) and future
epochs without mutating redoGlobalTs.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: e2d71c9d-83f3-4da7-8356-4aabecefc623
⛔ Files ignored due to path filters (1)
- heartbeatpb/heartbeat.pb.go is excluded by `!**/*.pb.go`
📒 Files selected for processing (9)
- coordinator/controller.go
- coordinator/controller_test.go
- coordinator/operator/operator_controller_test.go
- downstreamadapter/dispatchermanager/dispatcher_manager_info.go
- downstreamadapter/dispatchermanager/helper.go
- downstreamadapter/dispatchermanager/helper_test.go
- heartbeatpb/heartbeat.proto
- maintainer/maintainer.go
- maintainer/maintainer_test.go
✅ Files skipped from review due to trivial changes (1)
- downstreamadapter/dispatchermanager/helper.go
🚧 Files skipped from review as they are similar to previous changes (6)
- coordinator/operator/operator_controller_test.go
- coordinator/controller_test.go
- maintainer/maintainer_test.go
- heartbeatpb/heartbeat.proto
- downstreamadapter/dispatchermanager/dispatcher_manager_info.go
- maintainer/maintainer.go
🧹 Nitpick comments (1)
coordinator/operator/operator_controller.go (1)
324-339: Add caller-side timeout for PD epoch generation.
`GenerateStrictSessionEpoch` in `pkg/pdutil/utils.go` passes the context directly to `pdClient.GetTS(ctx)` without creating a timeout-bounded child context. Currently, both callers at lines 325 and 334 use `context.Background()`, which has no timeout. Under PD degradation, these operations could block indefinitely.
Proposed caller-side bounded context
```diff
 func (oc *Controller) NewAddMaintainerOperator(cf *changefeed.Changefeed, dest node.ID) (operator.Operator[common.ChangeFeedID, *heartbeatpb.MaintainerStatus], error) {
-	sessionEpoch, err := pdutil.GenerateStrictSessionEpoch(context.Background(), oc.pdClient, cf.GetCurrentMaintainerSessionEpoch())
+	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
+	defer cancel()
+	sessionEpoch, err := pdutil.GenerateStrictSessionEpoch(ctx, oc.pdClient, cf.GetCurrentMaintainerSessionEpoch())
 	if err != nil {
 		return nil, err
 	}
 	return NewAddMaintainerOperator(oc.changefeedDB, cf, dest, sessionEpoch), nil
 }

 func (oc *Controller) NewMoveMaintainerOperator(cf *changefeed.Changefeed, origin, dest node.ID) (operator.Operator[common.ChangeFeedID, *heartbeatpb.MaintainerStatus], error) {
 	activeSessionEpoch := cf.GetCurrentMaintainerSessionEpoch()
-	destSessionEpoch, err := pdutil.GenerateStrictSessionEpoch(context.Background(), oc.pdClient, activeSessionEpoch)
+	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
+	defer cancel()
+	destSessionEpoch, err := pdutil.GenerateStrictSessionEpoch(ctx, oc.pdClient, activeSessionEpoch)
 	if err != nil {
 		return nil, err
 	}
 	return NewMoveMaintainerOperator(oc.changefeedDB, cf, origin, dest, activeSessionEpoch, destSessionEpoch), nil
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@coordinator/operator/operator_controller.go` around lines 324 - 339: the calls in Controller.NewAddMaintainerOperator and Controller.NewMoveMaintainerOperator currently pass context.Background() into pdutil.GenerateStrictSessionEpoch (which delegates to oc.pdClient.GetTS) and can block; change both to create a timeout-bounded context (e.g., ctx, cancel := context.WithTimeout(context.Background(), <reasonable timeout>) with defer cancel()) and pass that ctx to pdutil.GenerateStrictSessionEpoch; ensure you choose a shared constant timeout name (e.g., pdEpochTimeout) visible in the package and always defer cancel() after creating the child context.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@coordinator/operator/operator_controller.go`:
- Around line 324-339: The calls in Controller.NewAddMaintainerOperator and
Controller.NewMoveMaintainerOperator currently pass context.Background() into
pdutil.GenerateStrictSessionEpoch (which delegates to oc.pdClient.GetTS) and can
block; change both to create a timeout-bounded context (e.g., ctx, cancel :=
context.WithTimeout(context.Background(), <reasonable timeout>) with defer
cancel()) and pass that ctx to pdutil.GenerateStrictSessionEpoch; ensure you
choose a shared constant timeout name (e.g., pdEpochTimeout) visible in the
package and always defer cancel() after creating the child context.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 50f06f97-a8d2-4b5b-8fc4-bc7ae78137cf
📒 Files selected for processing (7)
- coordinator/changefeed/changefeed.go
- coordinator/operator/operator_add.go
- coordinator/operator/operator_add_test.go
- coordinator/operator/operator_controller.go
- coordinator/operator/operator_controller_test.go
- coordinator/operator/operator_move.go
- coordinator/operator/operator_move_test.go
🚧 Files skipped from review as they are similar to previous changes (4)
- coordinator/operator/operator_add.go
- coordinator/operator/operator_add_test.go
- coordinator/operator/operator_move.go
- coordinator/operator/operator_controller_test.go
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@coordinator/controller.go`:
- Around line 506-512: removeReportedMaintainer currently dereferences
c.getChangefeed(cfID) which can be nil; update removeReportedMaintainer to first
grab cf := c.getChangefeed(cfID) and if cf == nil set keyspaceID := uint32(0)
(default for the handleNonExistentChangefeed path) else keyspaceID =
cf.GetKeyspaceID(); then call
c.messageCenter.SendCommand(changefeed.RemoveMaintainerMessage(keyspaceID, cfID,
from, true, true, sessionEpoch)). Additionally, update the finishBootstrap
caller to try to extract keyspaceID from the remote status and pass it into
removeReportedMaintainer (or ensure the code path populates the changefeed so
GetKeyspaceID is non-nil) so finishBootstrap doesn't rely on a nil getChangefeed
result.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 77ecc061-1646-40d5-ac5a-bc8585f8428a
📒 Files selected for processing (11)
- coordinator/controller.go
- coordinator/controller_test.go
- coordinator/coordinator_test.go
- coordinator/operator/operator_add.go
- coordinator/operator/operator_add_test.go
- coordinator/operator/operator_controller.go
- coordinator/operator/operator_controller_test.go
- coordinator/operator/operator_move.go
- coordinator/operator/operator_move_test.go
- coordinator/operator/operator_stop.go
- coordinator/operator/operator_stop_test.go
✅ Files skipped from review due to trivial changes (1)
- coordinator/operator/operator_add_test.go
🚧 Files skipped from review as they are similar to previous changes (3)
- coordinator/operator/operator_stop_test.go
- coordinator/operator/operator_add.go
- coordinator/operator/operator_move.go
```go
// removeReportedMaintainer removes a working maintainer that is only visible via
// remote runtime state. Once local metadata is gone, the reporter's session is
// the only fenced owner token we can still trust for cleanup.
func (c *Controller) removeReportedMaintainer(cfID common.ChangeFeedID, from node.ID, sessionEpoch uint64) {
	keyspaceID := c.getChangefeed(cfID).GetKeyspaceID()
	_ = c.messageCenter.SendCommand(changefeed.RemoveMaintainerMessage(keyspaceID, cfID, from, true, true, sessionEpoch))
}
```
Nil pointer dereference: getChangefeed returns nil when local metadata is missing.
removeReportedMaintainer is called from handleNonExistentChangefeed (line 502) and finishBootstrap (line 680) when the changefeed doesn't exist locally. However, line 510 calls c.getChangefeed(cfID).GetKeyspaceID() which will panic because getChangefeed returns nil.
For the handleNonExistentChangefeed path, keyspaceID should default to 0 since local metadata is unavailable. For finishBootstrap, the keyspace info could potentially be extracted from the remote status if available.
🐛 Proposed fix
```diff
 func (c *Controller) removeReportedMaintainer(cfID common.ChangeFeedID, from node.ID, sessionEpoch uint64) {
-	keyspaceID := c.getChangefeed(cfID).GetKeyspaceID()
+	keyspaceID := uint32(0)
+	if cf := c.getChangefeed(cfID); cf != nil {
+		keyspaceID = cf.GetKeyspaceID()
+	}
 	_ = c.messageCenter.SendCommand(changefeed.RemoveMaintainerMessage(keyspaceID, cfID, from, true, true, sessionEpoch))
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@coordinator/controller.go` around lines 506 - 512, removeReportedMaintainer
currently dereferences c.getChangefeed(cfID) which can be nil; update
removeReportedMaintainer to first grab cf := c.getChangefeed(cfID) and if cf ==
nil set keyspaceID := uint32(0) (default for the handleNonExistentChangefeed
path) else keyspaceID = cf.GetKeyspaceID(); then call
c.messageCenter.SendCommand(changefeed.RemoveMaintainerMessage(keyspaceID, cfID,
from, true, true, sessionEpoch)). Additionally, update the finishBootstrap
caller to try to extract keyspaceID from the remote status and pass it into
removeReportedMaintainer (or ensure the code path populates the changefeed so
GetKeyspaceID is non-nil) so finishBootstrap doesn't rely on a nil getChangefeed
result.
♻️ Duplicate comments (1)
coordinator/controller.go (1)
508-510:⚠️ Potential issue | 🔴 CriticalNil dereference in stale-maintainer cleanup.
Line 509 dereferences
c.getChangefeed(cfID)even though this helper is only used after we've already concluded local metadata is missing or stale. That panics before the remove request is sent, so late maintainer reports can still break the coordinator during the exact cleanup flow this PR is adding.🐛 Minimal fix
func (c *Controller) removeReportedMaintainer(cfID common.ChangeFeedID, from node.ID, sessionEpoch uint64) { - keyspaceID := c.getChangefeed(cfID).GetKeyspaceID() + keyspaceID := uint32(0) + if cf := c.getChangefeed(cfID); cf != nil { + keyspaceID = cf.GetKeyspaceID() + } _ = c.messageCenter.SendCommand(changefeed.RemoveMaintainerMessage(keyspaceID, cfID, from, true, true, sessionEpoch)) }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@coordinator/controller.go` around lines 508 - 510: the removeReportedMaintainer function dereferences c.getChangefeed(cfID) unconditionally; change it to call c.getChangefeed(cfID) once into a local variable, check for nil, and only call GetKeyspaceID when the result is non-nil—otherwise use a sensible zero/empty keyspaceID (or omit it if RemoveMaintainerMessage accepts that) and still invoke c.messageCenter.SendCommand with changefeed.RemoveMaintainerMessage(cfID, from, true, true, sessionEpoch) (referencing removeReportedMaintainer, c.getChangefeed, GetKeyspaceID, messageCenter.SendCommand, and changefeed.RemoveMaintainerMessage).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@coordinator/controller.go`:
- Around line 508-510: The removeReportedMaintainer function dereferences
c.getChangefeed(cfID) unconditionally; change it to call c.getChangefeed(cfID)
once into a local variable, check for nil, and only call GetKeyspaceID when the
result is non-nil—otherwise use a sensible zero/empty keyspaceID (or omit it if
RemoveMaintainerMessage accepts that) and still invoke
c.messageCenter.SendCommand with changefeed.RemoveMaintainerMessage(cfID, from,
true, true, sessionEpoch) (referencing removeReportedMaintainer,
c.getChangefeed, GetKeyspaceID, messageCenter.SendCommand, and
changefeed.RemoveMaintainerMessage).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 6f0d1c55-cb0a-4513-bcaf-af2876da7084
📒 Files selected for processing (9)
- coordinator/changefeed/changefeed.go
- coordinator/controller.go
- coordinator/controller_test.go
- coordinator/operator/operator_add.go
- coordinator/operator/operator_controller.go
- coordinator/operator/operator_controller_test.go
- coordinator/operator/operator_move.go
- coordinator/operator/operator_stop.go
- coordinator/operator/operator_stop_test.go
🚧 Files skipped from review as they are similar to previous changes (5)
- coordinator/operator/operator_add.go
- coordinator/operator/operator_move.go
- coordinator/operator/operator_controller_test.go
- coordinator/operator/operator_stop.go
- coordinator/changefeed/changefeed.go
[FORMAT CHECKER NOTIFICATION] Notice: To remove the label:
📖 For more info, you can check the "Contribute Code" section in the development guide.
Background
Maintainer and dispatcher manager currently rely on resend and node placement to converge during maintainer restart or move. Without an explicit runtime session boundary, late control messages and late runtime reports can still be processed by the wrong maintainer session during handoff.
Motivation
This change adds a minimal runtime-only session epoch so stale maintainer and dispatcher-manager messages stop mutating the active state machine, while preserving rolling upgrade compatibility and avoiding any new etcd persistent fields.
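A minimal sketch of such an epoch gate, assuming equality-based acceptance with an opt-in legacy path (the real per-message checks in this PR may differ):

```go
package main

import "fmt"

// acceptEpoch sketches the session-gating rule described above; the exact
// production checks vary per message type, so treat this as assumed
// semantics: only the active session's epoch is accepted, and epoch 0
// (a legacy, pre-upgrade sender) is allowed only where the handler opts in.
func acceptEpoch(current, incoming uint64, allowLegacy bool) bool {
	if incoming == 0 {
		return allowLegacy // mixed-version rolling upgrade path
	}
	return incoming == current
}

func main() {
	const current = 7
	fmt.Println(acceptEpoch(current, 0, true))  // legacy peer, compat enabled
	fmt.Println(acceptEpoch(current, 0, false)) // legacy peer, strict handler
	fmt.Println(acceptEpoch(current, 6, true))  // stale session
	fmt.Println(acceptEpoch(current, 7, true))  // active session
	fmt.Println(acceptEpoch(current, 8, true))  // future session
}
```

Because the epoch is issued by PD and strictly monotonic per changefeed, a stale maintainer can never hold the active epoch, so its late messages fail this check instead of mutating the new session's state machine.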
Summary of Changes
- Add `session_epoch` to the maintainer/dispatcher-manager runtime protobuf messages used in add/remove, bootstrap, post-bootstrap, close, heartbeat, block-status, redo progress, schedule, and merge flows
- Preserve the `sessionEpoch == 0` path for mixed-version rolling upgrade compatibility

Testing
- `make fmt`
- `go test ./pkg/pdutil ./coordinator/changefeed ./coordinator/operator ./maintainer ./downstreamadapter/dispatchermanager ./downstreamadapter/dispatcherorchestrator`
- `go test ./coordinator -count=1 -run 'Test(OnPeriodTaskAdvanceLiveness|HandleNonExistentChangefeedUsesLegacyRemoveWhenLocalMetadataMissing|ResumeChangefeed|ResumeChangefeedNormalState|ResumeChangefeedOverwriteUpdatesLastSavedCheckpointTs|ResumeChangefeedIgnoresStaleMaintainerErrorAndSchedules|PauseChangefeed|UpdateChangefeed|GetChangefeed|RemoveChangefeed|ListChangefeed|CreateChangefeed|CreateChangefeedDoesNotUpdateGCSafepoint|UpdateGCSafepointCallsGCManagerUpdate|UpdateGCSafepointDeletesServiceSafepointWhenNoChangefeed|RemoveLastChangefeedDeletesServiceSafepointImmediately|ConcurrentDeleteLastChangefeedAndCreateNewOneKeepsExpectedGCSafepoint)$'`
- `go test ./coordinator -count=1 -run 'TestCoordinatorScheduling$'`

Summary by CodeRabbit
New Features
Bug Fixes
Documentation
Chores
Tests