maintainer: delay finishing merge occupy operators by hongyunyan · Pull Request #4543 · pingcap/ticdc

hongyunyan · 2026-03-20T20:34:58Z

What problem does this PR solve?

Issue Number: close #4542

A merge can finish its occupy operators before the old replica sets are removed or restored. If a late Stopped / Removed status from an old dispatcher arrives in that window, maintainer fallback may observe GetOperator(dispatcherID) == nil while the old task still exists in spanController, mark that old replica absent again, and leave a ghost scheduler entry that stalls maintainer checkpoint progress.

What is changed and how it works?

This PR delays setOccupyOperatorsFinished() until the end of MergeDispatcherOperator.PostFinish().

That keeps old dispatcher IDs protected by their occupy operators until merge post-finish has fully removed or restored the old replica sets, so late terminal statuses from old dispatchers cannot be rescheduled by the generic maintainer fallback.

This PR also adds a focused unit test that exercises the post-finish ordering window and verifies occupy operators are not released before the old replicas disappear from spanController.

Check List

Tests

Unit test

Questions

Will it cause performance regression or break compatibility?

No. This only changes the in-memory ordering inside merge post-finish and keeps occupy operators alive slightly longer during the same post-finish call.

Do you need to update user documentation, design documentation or monitoring documentation?

No.

Release note

Fix a merge ordering issue in maintainer that could reschedule removed replicas and stall changefeed checkpoint progress.

Summary by CodeRabbit

Bug Fixes
- Adjusted the timing of releasing occupying dispatchers during merge completion so cleanup observes prior operator states, preventing premature state changes and ensuring proper cleanup order.
Tests
- Added a large-scale test that verifies merge completion releases occupy dispatchers only after old replicas are removed and confirms all occupy operators finish.

coderabbitai · 2026-03-20T20:35:13Z

Warning

Rate limit exceeded

@hongyunyan has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 16 minutes and 41 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 75342e18-390e-4f86-858a-025a08ee4585

📥 Commits

Reviewing files that changed from the base of the PR and between 670e77e and 7e0a869.

📒 Files selected for processing (1)

maintainer/operator/operator_merge_test.go

📝 Walkthrough

Walkthrough

Modified MergeDispatcherOperator.PostFinish() to defer setOccupyOperatorsFinished() until after old replica cleanup, ensuring occupy operators remain active during replica removal and restoration. Added test infrastructure and a regression test to verify the fix prevents late dispatcher statuses from resurrecting removed replicas.

Changes

Cohort / File(s)	Summary
Core Control Flow Fix `maintainer/operator/operator_merge.go`	Moved `setOccupyOperatorsFinished(m.occupyOperators)` from the start of `PostFinish()` to the end so occupy operators remain held while old replica sets are removed or restored.
Test Coverage & Helpers `maintainer/operator/operator_merge_test.go`	Added `setupLargeMergeTestEnvironment(t, spanCount)` helper, updated imports, and added `TestMergeOperator_PostFinishReleasesOccupyAfterRemovingOldReplicas` which builds a large merge scenario (4096 spans) and asserts occupy operators are only released after old replicas are removed.

Sequence Diagram(s)

(omitted — change is a bugfix and does not require a sequence diagram per guidelines)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

lgtm, approved, size/L

Suggested reviewers

3AceShowHand
lidezhu

Poem

🐰
Hop, defer, and hold the line,
Old replicas wait till cleanup's fine.
I guard the gates, then let them go,
Clean merge, no ghosts — my whiskers know.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and specifically summarizes the main change: delaying the finishing of merge occupy operators.
Description check	✅ Passed	The PR description is well-structured, includes all required sections (issue number, what changed, checks, questions, release note), and comprehensively explains the problem, solution, and testing approach.
Linked Issues check	✅ Passed	The PR implementation directly addresses issue `#4542` by deferring setOccupyOperatorsFinished() to the end of PostFinish(), protecting old dispatcher IDs until old replicas are removed, and includes a focused unit test validating the fix.
Out of Scope Changes check	✅ Passed	All changes are directly related to fixing the merge post-finish ordering issue: code change in operator_merge.go and comprehensive test additions in operator_merge_test.go are both in-scope.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch 0320

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

gemini-code-assist · 2026-03-20T20:35:16Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a merge ordering issue within the maintainer component that could lead to incorrect rescheduling of removed replicas and stalled changefeed checkpoint progress. By delaying the release of occupy operators until the very end of the merge post-finish process, the change prevents a race condition where late terminal statuses from old dispatchers could be mishandled, thereby eliminating ghost scheduler entries and improving the robustness of merge operations.

Highlights

Delayed Occupy Operator Release: Modified the MergeDispatcherOperator.PostFinish() method to defer the setOccupyOperatorsFinished() call, ensuring that occupy operators remain active until all old replica sets have been fully removed or restored.
New Unit Test for Ordering: Added a new unit test, TestMergeOperator_PostFinishReleasesOccupyAfterRemovingOldReplicas, to specifically verify that occupy operators are not released before old replicas disappear from the spanController during the merge post-finish phase.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request addresses a race condition in the merge operator's post-finish logic by delaying when occupy operators are marked as finished. The change correctly uses a defer statement to ensure old replica sets are fully removed before the operators are released, preventing a potential stall in changefeed checkpoint progress. A new, focused unit test is added to verify this fix, using concurrency to simulate the race condition. My feedback focuses on improving the robustness of this new test by replacing a busy-wait loop with a more standard polling mechanism.

maintainer/operator/operator_merge_test.go

coderabbitai

🧹 Nitpick comments (2)

maintainer/operator/operator_merge_test.go (2)

261-261: Rename test function to camelCase to match Go naming guideline.

The new test name uses underscores; switch to camelCase for consistency with repository rules.

Suggested rename

-func TestMergeOperator_PostFinishReleasesOccupyAfterRemovingOldReplicas(t *testing.T) {
+func TestMergeOperatorPostFinishReleasesOccupyAfterRemovingOldReplicas(t *testing.T) {

As per coding guidelines, "**/*.go: Use camelCase for function names and do not include underscores (e.g., getPartitionNum, not get_partition_num) in Go".

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@maintainer/operator/operator_merge_test.go` at line 261, The test function
name TestMergeOperator_PostFinishReleasesOccupyAfterRemovingOldReplicas uses
underscores and should be renamed to camelCase; change it to
TestMergeOperatorPostFinishReleasesOccupyAfterRemovingOldReplicas (or another
camelCase variant) and update any references/calls in the file (or other tests)
to the new name so tests compile and run.

283-295: Use a bounded receive to avoid long hangs when observer signaling fails.

<-windowObservedCh can block until global test timeout if the goroutine never sends. A timeout makes failures fail fast and keeps the test more deterministic.

Suggested fail-fast wait

 import (
 	"fmt"
 	"runtime"
 	"testing"
+	"time"
@@
-	require.False(t, <-windowObservedCh)
+	select {
+	case observed := <-windowObservedCh:
+		require.False(t, observed)
+	case <-time.After(2 * time.Second):
+		t.Fatal("timed out waiting for occupy-release observer")
+	}

As per coding guidelines, "**/*_test.go: Use unit test files named *_test.go in Go; favor deterministic tests and use testify/require".

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@maintainer/operator/operator_merge_test.go` around lines 283 - 295, The
receive from windowObservedCh can hang indefinitely if the goroutine never
sends; change the test to use a bounded receive (e.g., select with time.After or
testify's require.Eventually/require.Eventuallyf) when waiting for
windowObservedCh instead of a naked `<-windowObservedCh`, so the test fails
fast; update the assertions that currently use `require.False(t,
<-windowObservedCh)` and `require.Nil(t,
spanController.GetTaskByID(lastReplicaID))` to first attempt a timed receive
from `windowObservedCh` (or assert within a timeout) and then assert the
received value is false and that `spanController.GetTaskByID(lastReplicaID)` is
nil after `op.PostFinish()` completes, referencing the existing symbols
`observerReadyCh`, `windowObservedCh`, `occupyOperators`, `spanController`,
`lastReplicaID`, and `op.PostFinish()` when making the change.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@maintainer/operator/operator_merge_test.go`:
- Line 261: The test function name
TestMergeOperator_PostFinishReleasesOccupyAfterRemovingOldReplicas uses
underscores and should be renamed to camelCase; change it to
TestMergeOperatorPostFinishReleasesOccupyAfterRemovingOldReplicas (or another
camelCase variant) and update any references/calls in the file (or other tests)
to the new name so tests compile and run.
- Around line 283-295: The receive from windowObservedCh can hang indefinitely
if the goroutine never sends; change the test to use a bounded receive (e.g.,
select with time.After or testify's require.Eventually/require.Eventuallyf) when
waiting for windowObservedCh instead of a naked `<-windowObservedCh`, so the
test fails fast; update the assertions that currently use `require.False(t,
<-windowObservedCh)` and `require.Nil(t,
spanController.GetTaskByID(lastReplicaID))` to first attempt a timed receive
from `windowObservedCh` (or assert within a timeout) and then assert the
received value is false and that `spanController.GetTaskByID(lastReplicaID)` is
nil after `op.PostFinish()` completes, referencing the existing symbols
`observerReadyCh`, `windowObservedCh`, `occupyOperators`, `spanController`,
`lastReplicaID`, and `op.PostFinish()` when making the change.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 5ca868b6-ff98-4546-98df-0caaa8176937

📥 Commits

Reviewing files that changed from the base of the PR and between e8de3a6 and 6bbfb24.

📒 Files selected for processing (2)

maintainer/operator/operator_merge.go
maintainer/operator/operator_merge_test.go

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@maintainer/operator/operator_merge_test.go`:
- Around line 284-290: The goroutine uses require.Eventually which can call
t.FailNow() and hang; change it to use assert.Eventually so failures don't exit
the goroutine and ensure we always send a value to windowObservedCh.
Specifically, in the anonymous goroutine that closes observerReadyCh and checks
occupyOperators[0].IsFinished(), replace require.Eventually(t, ...) with ok :=
assert.Eventually(t, func() bool { return occupyOperators[0].IsFinished() },
5*time.Second, 10*time.Millisecond) (or equivalent) and then send the resulting
boolean (e.g., ok && spanController.GetTaskByID(lastReplicaID) != nil or send
two separate booleans) to windowObservedCh so the channel send always happens
even when the assertion fails; keep the rest of the logic
(close(observerReadyCh), spanController.GetTaskByID(lastReplicaID),
lastReplicaID, windowObservedCh) intact.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 36343a7a-9bcd-4f1a-ab47-c5b697ee5d86

📥 Commits

Reviewing files that changed from the base of the PR and between 6bbfb24 and 670e77e.

📒 Files selected for processing (1)

maintainer/operator/operator_merge_test.go

maintainer/operator/operator_merge_test.go

ti-chi-bot · 2026-03-23T08:03:13Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: 3AceShowHand, wk989898

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [3AceShowHand,wk989898]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ti-chi-bot · 2026-03-23T08:03:14Z

[LGTM Timeline notifier]

Timeline:

2026-03-23 06:16:07.007135653 +0000 UTC m=+162563.043205913: ☑️ agreed by wk989898.
2026-03-23 08:03:13.92567345 +0000 UTC m=+168989.961743711: ☑️ agreed by 3AceShowHand.

hongyunyan · 2026-03-24T00:16:18Z

/cherry-pick release-8.5

ti-chi-bot · 2026-03-24T00:17:03Z

@hongyunyan: new pull request created to branch release-8.5: #4585.

Details

In response to this:

/cherry-pick release-8.5

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

ti-chi-bot · 2026-03-24T02:13:57Z

In response to a cherrypick label: new pull request could not be created: failed to create pull request against pingcap/ticdc#release-8.5 from head ti-chi-bot:cherry-pick-4543-to-release-8.5: status code 422 not one of [201], body: {"message":"Validation Failed","errors":[{"resource":"PullRequest","code":"custom","message":"A pull request already exists for ti-chi-bot:cherry-pick-4543-to-release-8.5."}],"documentation_url":"https://docs.github.com/rest/pulls/pulls#create-a-pull-request","status":"422"}

close #4542

maintainer: delay finishing merge occupy operators

6bbfb24

ti-chi-bot bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/needs-triage-completed labels Mar 20, 2026

ti-chi-bot bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Mar 20, 2026

gemini-code-assist bot reviewed Mar 20, 2026

View reviewed changes

maintainer/operator/operator_merge_test.go Show resolved Hide resolved

maintainer/operator/operator_merge_test.go Show resolved Hide resolved

ti-chi-bot bot removed the do-not-merge/needs-triage-completed label Mar 20, 2026

coderabbitai bot reviewed Mar 20, 2026

View reviewed changes

hongyunyan and others added 3 commits March 21, 2026 06:16

Apply suggestion from @gemini-code-assist[bot]

0bfc471

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Apply suggestion from @gemini-code-assist[bot]

a32db87

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

maintainer: format merge operator test imports

670e77e

coderabbitai bot reviewed Mar 20, 2026

View reviewed changes

maintainer/operator/operator_merge_test.go Show resolved Hide resolved

maintainer: stabilize merge operator post finish test

7e0a869

ti-chi-bot bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 20, 2026

wk989898 approved these changes Mar 23, 2026

View reviewed changes

ti-chi-bot bot added needs-1-more-lgtm Indicates a PR needs 1 more LGTM. approved labels Mar 23, 2026

3AceShowHand approved these changes Mar 23, 2026

View reviewed changes

ti-chi-bot bot added the lgtm label Mar 23, 2026

ti-chi-bot bot removed the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Mar 23, 2026

ti-chi-bot bot merged commit d2189d5 into master Mar 23, 2026
26 checks passed

ti-chi-bot bot deleted the 0320 branch March 23, 2026 09:07

ti-chi-bot mentioned this pull request Mar 24, 2026

maintainer: delay finishing merge occupy operators (#4543) #4585

Merged

ti-chi-bot bot added the needs-cherry-pick-release-8.5 Should cherry pick this PR to release-8.5 branch. label Mar 24, 2026

ti-chi-bot bot pushed a commit that referenced this pull request Mar 24, 2026

maintainer: delay finishing merge occupy operators (#4543) (#4585)

55aca47

close #4542

tenfyzhong pushed a commit that referenced this pull request Mar 25, 2026

maintainer: delay finishing merge occupy operators (#4543)

09563c3

close #4542

coderabbitai bot mentioned this pull request Apr 8, 2026

fix(*): merge operator inconsistent after maintainer move #3769

Open

lidezhu pushed a commit that referenced this pull request Apr 10, 2026

maintainer: delay finishing merge occupy operators (#4543) (#4585)

d74f6e4

close #4542

Conversation

hongyunyan commented Mar 20, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What problem does this PR solve?

What is changed and how it works?

Check List

Tests

Questions

Will it cause performance regression or break compatibility?

Do you need to update user documentation, design documentation or monitoring documentation?

Release note

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Mar 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rate limit exceeded

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Suggested labels

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

gemini-code-assist bot commented Mar 20, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

ti-chi-bot bot commented Mar 23, 2026

Uh oh!

ti-chi-bot bot commented Mar 23, 2026

[LGTM Timeline notifier]

Uh oh!

Uh oh!

hongyunyan commented Mar 24, 2026

Uh oh!

ti-chi-bot commented Mar 24, 2026

Uh oh!

ti-chi-bot commented Mar 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

hongyunyan commented Mar 20, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Mar 20, 2026 •

edited

Loading