Skip to content

maintainer: delay finishing merge occupy operators#4543

Merged
ti-chi-bot[bot] merged 5 commits intomasterfrom
0320
Mar 23, 2026
Merged

maintainer: delay finishing merge occupy operators#4543
ti-chi-bot[bot] merged 5 commits intomasterfrom
0320

Conversation

@hongyunyan
Copy link
Copy Markdown
Collaborator

@hongyunyan hongyunyan commented Mar 20, 2026

What problem does this PR solve?

Issue Number: close #4542

A merge can finish its occupy operators before the old replica sets are removed or restored. If a late Stopped / Removed status from an old dispatcher arrives in that window, maintainer fallback may observe GetOperator(dispatcherID) == nil while the old task still exists in spanController, mark that old replica absent again, and leave a ghost scheduler entry that stalls maintainer checkpoint progress.

What is changed and how it works?

This PR delays setOccupyOperatorsFinished() until the end of MergeDispatcherOperator.PostFinish().

That keeps old dispatcher IDs protected by their occupy operators until merge post-finish has fully removed or restored the old replica sets, so late terminal statuses from old dispatchers cannot be rescheduled by the generic maintainer fallback.

This PR also adds a focused unit test that exercises the post-finish ordering window and verifies occupy operators are not released before the old replicas disappear from spanController.

Check List

Tests

  • Unit test

Questions

Will it cause performance regression or break compatibility?

No. This only changes the in-memory ordering inside merge post-finish and keeps occupy operators alive slightly longer during the same post-finish call.

Do you need to update user documentation, design documentation or monitoring documentation?

No.

Release note

Fix a merge ordering issue in maintainer that could reschedule removed replicas and stall changefeed checkpoint progress.

Summary by CodeRabbit

  • Bug Fixes

    • Adjusted the timing of releasing occupying dispatchers during merge completion so cleanup observes prior operator states, preventing premature state changes and ensuring proper cleanup order.
  • Tests

    • Added a large-scale test that verifies merge completion releases occupy dispatchers only after old replicas are removed and confirms all occupy operators finish.

@ti-chi-bot ti-chi-bot bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/needs-triage-completed labels Mar 20, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Mar 20, 2026

Warning

Rate limit exceeded

@hongyunyan has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 16 minutes and 41 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 75342e18-390e-4f86-858a-025a08ee4585

📥 Commits

Reviewing files that changed from the base of the PR and between 670e77e and 7e0a869.

📒 Files selected for processing (1)
  • maintainer/operator/operator_merge_test.go
📝 Walkthrough

Walkthrough

Modified MergeDispatcherOperator.PostFinish() to defer setOccupyOperatorsFinished() until after old replica cleanup, ensuring occupy operators remain active during replica removal and restoration. Added test infrastructure and a regression test to verify the fix prevents late dispatcher statuses from resurrecting removed replicas.

Changes

Cohort / File(s) Summary
Core Control Flow Fix
maintainer/operator/operator_merge.go
Moved setOccupyOperatorsFinished(m.occupyOperators) from the start of PostFinish() to the end so occupy operators remain held while old replica sets are removed or restored.
Test Coverage & Helpers
maintainer/operator/operator_merge_test.go
Added setupLargeMergeTestEnvironment(t, spanCount) helper, updated imports, and added TestMergeOperator_PostFinishReleasesOccupyAfterRemovingOldReplicas which builds a large merge scenario (4096 spans) and asserts occupy operators are only released after old replicas are removed.

Sequence Diagram(s)

(omitted — change is a bugfix and does not require a sequence diagram per guidelines)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

lgtm, approved, size/L

Suggested reviewers

  • 3AceShowHand
  • lidezhu

Poem

🐰
Hop, defer, and hold the line,
Old replicas wait till cleanup's fine.
I guard the gates, then let them go,
Clean merge, no ghosts — my whiskers know.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically summarizes the main change: delaying the finishing of merge occupy operators.
Description check ✅ Passed The PR description is well-structured, includes all required sections (issue number, what changed, checks, questions, release note), and comprehensively explains the problem, solution, and testing approach.
Linked Issues check ✅ Passed The PR implementation directly addresses issue #4542 by deferring setOccupyOperatorsFinished() to the end of PostFinish(), protecting old dispatcher IDs until old replicas are removed, and includes a focused unit test validating the fix.
Out of Scope Changes check ✅ Passed All changes are directly related to fixing the merge post-finish ordering issue: code change in operator_merge.go and comprehensive test additions in operator_merge_test.go are both in-scope.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch 0320

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@ti-chi-bot ti-chi-bot bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Mar 20, 2026
@gemini-code-assist
Copy link
Copy Markdown

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a merge ordering issue within the maintainer component that could lead to incorrect rescheduling of removed replicas and stalled changefeed checkpoint progress. By delaying the release of occupy operators until the very end of the merge post-finish process, the change prevents a race condition where late terminal statuses from old dispatchers could be mishandled, thereby eliminating ghost scheduler entries and improving the robustness of merge operations.

Highlights

  • Delayed Occupy Operator Release: Modified the MergeDispatcherOperator.PostFinish() method to defer the setOccupyOperatorsFinished() call, ensuring that occupy operators remain active until all old replica sets have been fully removed or restored.
  • New Unit Test for Ordering: Added a new unit test, TestMergeOperator_PostFinishReleasesOccupyAfterRemovingOldReplicas, to specifically verify that occupy operators are not released before old replicas disappear from the spanController during the merge post-finish phase.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses a race condition in the merge operator's post-finish logic by delaying when occupy operators are marked as finished. The change correctly uses a defer statement to ensure old replica sets are fully removed before the operators are released, preventing a potential stall in changefeed checkpoint progress. A new, focused unit test is added to verify this fix, using concurrency to simulate the race condition. My feedback focuses on improving the robustness of this new test by replacing a busy-wait loop with a more standard polling mechanism.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
maintainer/operator/operator_merge_test.go (2)

261-261: Rename test function to camelCase to match Go naming guideline.

The new test name uses underscores; switch to camelCase for consistency with repository rules.

Suggested rename
-func TestMergeOperator_PostFinishReleasesOccupyAfterRemovingOldReplicas(t *testing.T) {
+func TestMergeOperatorPostFinishReleasesOccupyAfterRemovingOldReplicas(t *testing.T) {

As per coding guidelines, "**/*.go: Use camelCase for function names and do not include underscores (e.g., getPartitionNum, not get_partition_num) in Go".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@maintainer/operator/operator_merge_test.go` at line 261, The test function
name TestMergeOperator_PostFinishReleasesOccupyAfterRemovingOldReplicas uses
underscores and should be renamed to camelCase; change it to
TestMergeOperatorPostFinishReleasesOccupyAfterRemovingOldReplicas (or another
camelCase variant) and update any references/calls in the file (or other tests)
to the new name so tests compile and run.

283-295: Use a bounded receive to avoid long hangs when observer signaling fails.

<-windowObservedCh can block until global test timeout if the goroutine never sends. A timeout makes failures fail fast and keeps the test more deterministic.

Suggested fail-fast wait
 import (
 	"fmt"
 	"runtime"
 	"testing"
+	"time"
@@
-	require.False(t, <-windowObservedCh)
+	select {
+	case observed := <-windowObservedCh:
+		require.False(t, observed)
+	case <-time.After(2 * time.Second):
+		t.Fatal("timed out waiting for occupy-release observer")
+	}

As per coding guidelines, "**/*_test.go: Use unit test files named *_test.go in Go; favor deterministic tests and use testify/require".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@maintainer/operator/operator_merge_test.go` around lines 283 - 295, The
receive from windowObservedCh can hang indefinitely if the goroutine never
sends; change the test to use a bounded receive (e.g., select with time.After or
testify's require.Eventually/require.Eventuallyf) when waiting for
windowObservedCh instead of a naked `<-windowObservedCh`, so the test fails
fast; update the assertions that currently use `require.False(t,
<-windowObservedCh)` and `require.Nil(t,
spanController.GetTaskByID(lastReplicaID))` to first attempt a timed receive
from `windowObservedCh` (or assert within a timeout) and then assert the
received value is false and that `spanController.GetTaskByID(lastReplicaID)` is
nil after `op.PostFinish()` completes, referencing the existing symbols
`observerReadyCh`, `windowObservedCh`, `occupyOperators`, `spanController`,
`lastReplicaID`, and `op.PostFinish()` when making the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@maintainer/operator/operator_merge_test.go`:
- Line 261: The test function name
TestMergeOperator_PostFinishReleasesOccupyAfterRemovingOldReplicas uses
underscores and should be renamed to camelCase; change it to
TestMergeOperatorPostFinishReleasesOccupyAfterRemovingOldReplicas (or another
camelCase variant) and update any references/calls in the file (or other tests)
to the new name so tests compile and run.
- Around line 283-295: The receive from windowObservedCh can hang indefinitely
if the goroutine never sends; change the test to use a bounded receive (e.g.,
select with time.After or testify's require.Eventually/require.Eventuallyf) when
waiting for windowObservedCh instead of a naked `<-windowObservedCh`, so the
test fails fast; update the assertions that currently use `require.False(t,
<-windowObservedCh)` and `require.Nil(t,
spanController.GetTaskByID(lastReplicaID))` to first attempt a timed receive
from `windowObservedCh` (or assert within a timeout) and then assert the
received value is false and that `spanController.GetTaskByID(lastReplicaID)` is
nil after `op.PostFinish()` completes, referencing the existing symbols
`observerReadyCh`, `windowObservedCh`, `occupyOperators`, `spanController`,
`lastReplicaID`, and `op.PostFinish()` when making the change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 5ca868b6-ff98-4546-98df-0caaa8176937

📥 Commits

Reviewing files that changed from the base of the PR and between e8de3a6 and 6bbfb24.

📒 Files selected for processing (2)
  • maintainer/operator/operator_merge.go
  • maintainer/operator/operator_merge_test.go

hongyunyan and others added 3 commits March 21, 2026 06:16
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@maintainer/operator/operator_merge_test.go`:
- Around line 284-290: The goroutine uses require.Eventually which can call
t.FailNow() and hang; change it to use assert.Eventually so failures don't exit
the goroutine and ensure we always send a value to windowObservedCh.
Specifically, in the anonymous goroutine that closes observerReadyCh and checks
occupyOperators[0].IsFinished(), replace require.Eventually(t, ...) with ok :=
assert.Eventually(t, func() bool { return occupyOperators[0].IsFinished() },
5*time.Second, 10*time.Millisecond) (or equivalent) and then send the resulting
boolean (e.g., ok && spanController.GetTaskByID(lastReplicaID) != nil or send
two separate booleans) to windowObservedCh so the channel send always happens
even when the assertion fails; keep the rest of the logic
(close(observerReadyCh), spanController.GetTaskByID(lastReplicaID),
lastReplicaID, windowObservedCh) intact.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 36343a7a-9bcd-4f1a-ab47-c5b697ee5d86

📥 Commits

Reviewing files that changed from the base of the PR and between 6bbfb24 and 670e77e.

📒 Files selected for processing (1)
  • maintainer/operator/operator_merge_test.go

@ti-chi-bot ti-chi-bot bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 20, 2026
@ti-chi-bot ti-chi-bot bot added needs-1-more-lgtm Indicates a PR needs 1 more LGTM. approved labels Mar 23, 2026
@ti-chi-bot ti-chi-bot bot added the lgtm label Mar 23, 2026
@ti-chi-bot
Copy link
Copy Markdown

ti-chi-bot bot commented Mar 23, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: 3AceShowHand, wk989898

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [3AceShowHand,wk989898]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot removed the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Mar 23, 2026
@ti-chi-bot
Copy link
Copy Markdown

ti-chi-bot bot commented Mar 23, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-03-23 06:16:07.007135653 +0000 UTC m=+162563.043205913: ☑️ agreed by wk989898.
  • 2026-03-23 08:03:13.92567345 +0000 UTC m=+168989.961743711: ☑️ agreed by 3AceShowHand.

@ti-chi-bot ti-chi-bot bot merged commit d2189d5 into master Mar 23, 2026
26 checks passed
@ti-chi-bot ti-chi-bot bot deleted the 0320 branch March 23, 2026 09:07
@hongyunyan
Copy link
Copy Markdown
Collaborator Author

/cherry-pick release-8.5

@ti-chi-bot
Copy link
Copy Markdown
Member

@hongyunyan: new pull request created to branch release-8.5: #4585.

Details

In response to this:

/cherry-pick release-8.5

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@ti-chi-bot ti-chi-bot bot added the needs-cherry-pick-release-8.5 Should cherry pick this PR to release-8.5 branch. label Mar 24, 2026
@ti-chi-bot
Copy link
Copy Markdown
Member

In response to a cherrypick label: new pull request could not be created: failed to create pull request against pingcap/ticdc#release-8.5 from head ti-chi-bot:cherry-pick-4543-to-release-8.5: status code 422 not one of [201], body: {"message":"Validation Failed","errors":[{"resource":"PullRequest","code":"custom","message":"A pull request already exists for ti-chi-bot:cherry-pick-4543-to-release-8.5."}],"documentation_url":"https://docs.github.com/rest/pulls/pulls#create-a-pull-request","status":"422"}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved lgtm needs-cherry-pick-release-8.5 Should cherry pick this PR to release-8.5 branch. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

maintainer: merge post-finish can resurrect removed replicas

4 participants