maintainer: delay finishing merge occupy operators #4543
ti-chi-bot[bot] merged 5 commits into master
Summary of Changes

This pull request resolves a merge ordering issue within the maintainer component that could lead to incorrect rescheduling of removed replicas and stalled changefeed checkpoint progress. By delaying the release of occupy operators until the very end of the merge post-finish process, the change prevents a race condition where late terminal statuses from old dispatchers could be mishandled, thereby eliminating ghost scheduler entries and improving the robustness of merge operations.
Code Review
This pull request addresses a race condition in the merge operator's post-finish logic by delaying when occupy operators are marked as finished. The change correctly uses a defer statement to ensure old replica sets are fully removed before the operators are released, preventing a potential stall in changefeed checkpoint progress. A new, focused unit test is added to verify this fix, using concurrency to simulate the race condition. My feedback focuses on improving the robustness of this new test by replacing a busy-wait loop with a more standard polling mechanism.
🧹 Nitpick comments (2)
maintainer/operator/operator_merge_test.go (2)
261-261: Rename test function to camelCase to match Go naming guideline.

The new test name uses underscores; switch to camelCase for consistency with repository rules.

Suggested rename:

    -func TestMergeOperator_PostFinishReleasesOccupyAfterRemovingOldReplicas(t *testing.T) {
    +func TestMergeOperatorPostFinishReleasesOccupyAfterRemovingOldReplicas(t *testing.T) {

As per coding guidelines, "`**/*.go`: Use camelCase for function names and do not include underscores (e.g., `getPartitionNum`, not `get_partition_num`) in Go".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@maintainer/operator/operator_merge_test.go` at line 261, The test function name TestMergeOperator_PostFinishReleasesOccupyAfterRemovingOldReplicas uses underscores and should be renamed to camelCase; change it to TestMergeOperatorPostFinishReleasesOccupyAfterRemovingOldReplicas (or another camelCase variant) and update any references/calls in the file (or other tests) to the new name so tests compile and run.
283-295: Use a bounded receive to avoid long hangs when observer signaling fails.

`<-windowObservedCh` can block until the global test timeout if the goroutine never sends. A timeout makes failures fail fast and keeps the test more deterministic.

Suggested fail-fast wait:

     import (
         "fmt"
         "runtime"
         "testing"
    +    "time"
    @@
    -    require.False(t, <-windowObservedCh)
    +    select {
    +    case observed := <-windowObservedCh:
    +        require.False(t, observed)
    +    case <-time.After(2 * time.Second):
    +        t.Fatal("timed out waiting for occupy-release observer")
    +    }

As per coding guidelines, "`**/*_test.go`: Use unit test files named `*_test.go` in Go; favor deterministic tests and use `testify/require`".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@maintainer/operator/operator_merge_test.go` around lines 283 - 295, The receive from windowObservedCh can hang indefinitely if the goroutine never sends; change the test to use a bounded receive (e.g., select with time.After or testify's require.Eventually/require.Eventuallyf) when waiting for windowObservedCh instead of a naked `<-windowObservedCh`, so the test fails fast; update the assertions that currently use `require.False(t, <-windowObservedCh)` and `require.Nil(t, spanController.GetTaskByID(lastReplicaID))` to first attempt a timed receive from `windowObservedCh` (or assert within a timeout) and then assert the received value is false and that `spanController.GetTaskByID(lastReplicaID)` is nil after `op.PostFinish()` completes, referencing the existing symbols `observerReadyCh`, `windowObservedCh`, `occupyOperators`, `spanController`, `lastReplicaID`, and `op.PostFinish()` when making the change.
📒 Files selected for processing (2): `maintainer/operator/operator_merge.go`, `maintainer/operator/operator_merge_test.go`
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@maintainer/operator/operator_merge_test.go`:
- Around line 284-290: The goroutine uses require.Eventually which can call
t.FailNow() and hang; change it to use assert.Eventually so failures don't exit
the goroutine and ensure we always send a value to windowObservedCh.
Specifically, in the anonymous goroutine that closes observerReadyCh and checks
occupyOperators[0].IsFinished(), replace require.Eventually(t, ...) with ok :=
assert.Eventually(t, func() bool { return occupyOperators[0].IsFinished() },
5*time.Second, 10*time.Millisecond) (or equivalent) and then send the resulting
boolean (e.g., ok && spanController.GetTaskByID(lastReplicaID) != nil or send
two separate booleans) to windowObservedCh so the channel send always happens
even when the assertion fails; keep the rest of the logic
(close(observerReadyCh), spanController.GetTaskByID(lastReplicaID),
lastReplicaID, windowObservedCh) intact.
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: 3AceShowHand, wk989898
/cherry-pick release-8.5

@hongyunyan: new pull request created to branch
In response to a cherrypick label: new pull request could not be created: failed to create pull request against pingcap/ticdc#release-8.5 from head ti-chi-bot:cherry-pick-4543-to-release-8.5: status code 422 not one of [201], body: `{"message":"Validation Failed","errors":[{"resource":"PullRequest","code":"custom","message":"A pull request already exists for ti-chi-bot:cherry-pick-4543-to-release-8.5."}],"documentation_url":"https://docs.github.com/rest/pulls/pulls#create-a-pull-request","status":"422"}`
What problem does this PR solve?
Issue Number: close #4542
A merge can finish its occupy operators before the old replica sets are removed or restored. If a late `Stopped/Removed` status from an old dispatcher arrives in that window, the maintainer fallback may observe `GetOperator(dispatcherID) == nil` while the old task still exists in `spanController`, mark that old replica absent again, and leave a ghost scheduler entry that stalls maintainer checkpoint progress.

What is changed and how it works?

This PR delays `setOccupyOperatorsFinished()` until the end of `MergeDispatcherOperator.PostFinish()`.

That keeps old dispatcher IDs protected by their occupy operators until merge post-finish has fully removed or restored the old replica sets, so late terminal statuses from old dispatchers cannot be rescheduled by the generic maintainer fallback.
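The ordering described above can be illustrated with a small, self-contained sketch. It is a simplified model, not the code in `maintainer/operator/operator_merge.go`: the types `occupyOperator` and `mergeOperator`, the `replicas` map (standing in for `spanController`), and the `events` log are all hypothetical. Only the use of `defer` to run `setOccupyOperatorsFinished()` last reflects what the review comments describe.

```go
package main

import "fmt"

// occupyOperator is a hypothetical stand-in for the PR's occupy operators.
type occupyOperator struct{ finished bool }

// mergeOperator is a hypothetical, simplified model of MergeDispatcherOperator.
type mergeOperator struct {
	occupy   []*occupyOperator
	replicas map[string]struct{} // old replica sets; models spanController
	events   []string            // records the order of the steps
}

// setOccupyOperatorsFinished releases the protection on old dispatcher IDs.
func (m *mergeOperator) setOccupyOperatorsFinished() {
	for _, op := range m.occupy {
		op.finished = true
	}
	m.events = append(m.events, "release occupy operators")
}

// PostFinish removes the old replicas first. The deferred call runs when the
// function returns, so the occupy operators are released last on every path.
func (m *mergeOperator) PostFinish() {
	defer m.setOccupyOperatorsFinished()

	for id := range m.replicas {
		delete(m.replicas, id)
	}
	m.events = append(m.events, "remove old replicas")
}

func main() {
	m := &mergeOperator{
		occupy:   []*occupyOperator{{}, {}},
		replicas: map[string]struct{}{"old-1": {}, "old-2": {}},
	}
	m.PostFinish()
	fmt.Println(m.events)
}
```

In this model there is no point at which an occupy operator is finished while an old replica is still tracked, which is the window the PR closes.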
This PR also adds a focused unit test that exercises the post-finish ordering window and verifies occupy operators are not released before the old replicas disappear from `spanController`.

Check List
Tests
Questions
Will it cause performance regression or break compatibility?
No. This only changes the in-memory ordering inside merge post-finish and keeps occupy operators alive slightly longer during the same post-finish call.
Do you need to update user documentation, design documentation or monitoring documentation?
No.
Release note