coordinator,maintainer: Fixed bootstrap might fail to succeed in frequent maintainer scheduling #4114

Merged
ti-chi-bot[bot] merged 5 commits into pingcap:master from hongyunyan:0130-maintainer
Feb 2, 2026

Collaborator

@hongyunyan hongyunyan commented Feb 2, 2026

What problem does this PR solve?

Issue Number: close #4115

What is changed and how it works?

This pull request fixes changefeed lifecycle handling when maintainers are moved between nodes or restarted. It introduces a BootstrapDone flag so that a move is considered complete only after the changefeed is fully initialized on the destination node, and it makes the maintainer ignore non-cascade remove requests that arrive before bootstrap completes, which prevents premature removal.

Highlights

  • Enhanced Changefeed Move Logic: The MoveMaintainerOperator now explicitly waits for a BootstrapDone flag from the destination node's MaintainerStatus before considering a changefeed move operation complete. This ensures the changefeed is fully initialized on the new node.
  • Robust Maintainer Removal: The Maintainer component has been updated to ignore non-cascade remove requests if they arrive before the maintainer has completed its bootstrap process. This prevents premature termination and potential issues where bootstrap responses might be dropped, leading to a stuck state.
  • New Atomic Flag for Logging: A new ignoredNonCascadeRemoveBeforeBootstrap atomic boolean has been introduced in the Maintainer to prevent log spam when multiple non-cascade remove requests are received before bootstrap.
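
The removal guard and its log-once flag can be sketched roughly as follows. This is a minimal model, not the code from maintainer/maintainer.go: only the names onRemoveMaintainer and ignoredNonCascadeRemoveBeforeBootstrap come from this PR, while the `maintainer` struct, its `initialized` and `removing` fields, the `logs` slice, and the boolean return value are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// maintainer is a hypothetical, reduced model of the real component.
type maintainer struct {
	initialized atomic.Bool // assumed: set once bootstrap has completed
	removing    atomic.Bool // assumed: set once a remove request is accepted

	// Flag named in the PR; used here to record the warning only once.
	ignoredNonCascadeRemoveBeforeBootstrap atomic.Bool

	logs []string // stand-in for the real logger
}

// onRemoveMaintainer reports whether the remove request was accepted.
func (m *maintainer) onRemoveMaintainer(cascade bool) bool {
	if !cascade && !m.initialized.Load() {
		// CompareAndSwap succeeds only for the first ignored request,
		// so repeated requests do not produce repeated log lines.
		if m.ignoredNonCascadeRemoveBeforeBootstrap.CompareAndSwap(false, true) {
			m.logs = append(m.logs, "ignore non-cascade remove before bootstrap")
		}
		return false
	}
	m.removing.Store(true)
	return true
}

func main() {
	m := &maintainer{}
	first := m.onRemoveMaintainer(false)
	second := m.onRemoveMaintainer(false)
	fmt.Println(first, second, len(m.logs)) // false false 1

	m.initialized.Store(true) // bootstrap finished
	fmt.Println(m.onRemoveMaintainer(false)) // true
}
```

In this sketch a cascade remove is still accepted before bootstrap; only the non-cascade path is gated.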

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Questions

Will it cause performance regression or break compatibility?
Do you need to update user documentation, design documentation or monitoring documentation?

Release note

Please refer to [Release Notes Language Style Guide](https://pingcap.github.io/tidb-dev-guide/contribute-to-tidb/release-notes-style-guide.html) to write a quality release note.

If you don't think this PR needs a release note then fill it with `None`.

Summary by CodeRabbit

  • Bug Fixes

    • Changefeed transitions now wait for bootstrap completion before finishing moves.
    • Prevented improper maintainer removal during early bootstrap, avoiding spurious state changes.
  • Tests

    • Added tests verifying that move completion requires the destination's bootstrap to be done, and covering related transition behavior.

@ti-chi-bot ti-chi-bot bot added do-not-merge/needs-linked-issue release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 2, 2026
@coderabbitai
Contributor

coderabbitai bot commented Feb 2, 2026

Walkthrough

Adds bootstrap-completion gating to maintainer move logic and a safety guard for premature maintainer removal; updates tests to cover the new bootstrap requirement and adds mocks in coordinator tests to mark changefeeds as BootstrapDone.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Operator: move logic (coordinator/operator/operator_move.go) | Adds nil-status guard and requires status.BootstrapDone for the transition finish condition in MoveMaintainerOperator.Check, delaying finish until destination bootstrap is complete. |
| Tests: operator & coordinator (coordinator/operator/operator_move_test.go, coordinator/coordinator_test.go) | Adds TestMoveMaintainerOperator_CheckRequiresDestBootstrapDone; updates coordinator test to register a test PD clock and mock SchemaStore, and mark changefeeds as BootstrapDone. |
| Maintainer: removal safety (maintainer/maintainer.go) | Adds guard in onRemoveMaintainer to ignore non-cascade removals while the maintainer is not yet initialized, preventing premature removal during bootstrap. |

Sequence Diagram(s)

sequenceDiagram
    participant Coordinator as Coordinator
    participant Operator as MoveMaintainerOperator
    participant Origin as OriginNode
    participant Dest as DestNode
    rect rgba(200,200,255,0.5)
    Coordinator->>Operator: schedule move (origin -> dest)
    Operator->>Origin: observe status
    Origin-->>Operator: Stopped
    Operator->>Dest: observe status
    Dest-->>Operator: Working, BootstrapDone=false
    note right of Operator: do not finish yet
    Dest-->>Operator: Working, BootstrapDone=true
    Operator->>Coordinator: mark finished / apply schedule
    end
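
The sequence above can be replayed against a small model of the operator's check. This is a sketch under stated assumptions: `ComponentState`, `MaintainerStatus`, and `moveOperator` are simplified stand-ins defined here, not the project's real types, and the rule that the origin must be seen as stopped first is taken from the diagram rather than from the actual code in coordinator/operator/operator_move.go.

```go
package main

import "fmt"

// ComponentState is a hypothetical stand-in for the reported node state.
type ComponentState int

const (
	StateStopped ComponentState = iota
	StateWorking
)

// MaintainerStatus is a reduced, hypothetical status report.
type MaintainerStatus struct {
	State         ComponentState
	BootstrapDone bool
}

// moveOperator is a reduced, hypothetical model of MoveMaintainerOperator.
type moveOperator struct {
	origin, dest  string
	originStopped bool
	finished      bool
}

// Check consumes one status report sent by node `from`.
func (m *moveOperator) Check(from string, status *MaintainerStatus) {
	if status == nil || m.finished {
		return // nil-status guard; nothing to do once finished
	}
	switch {
	case from == m.origin && status.State != StateWorking:
		m.originStopped = true
	case from == m.dest && m.originStopped &&
		status.State == StateWorking && status.BootstrapDone:
		// Working alone is not enough: bootstrap must be done as well.
		m.finished = true
	}
}

func main() {
	op := &moveOperator{origin: "node-a", dest: "node-b"}

	op.Check("node-a", &MaintainerStatus{State: StateStopped})
	op.Check("node-b", &MaintainerStatus{State: StateWorking})
	fmt.Println(op.finished) // false

	op.Check("node-b", &MaintainerStatus{State: StateWorking, BootstrapDone: true})
	fmt.Println(op.finished) // true
}
```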

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 A little hop from node to node I keep,
Waiting for bootstrap before I leap,
Guards in place so nothing goes amiss,
Tests ensure the timing's right—pure bliss,
Puff of carrots, onward we sweep! 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly summarizes the main change: fixing bootstrap failure during frequent maintainer scheduling by introducing BootstrapDone flag and improved removal handling. |
| Description check | ✅ Passed | The PR description provides context on the problem, highlights key changes, includes the linked issue number, and covers test checklist items despite incomplete sections. |
| Linked Issues check | ✅ Passed | The code changes implement the core fixes for issue #4115: BootstrapDone flag in MoveMaintainerOperator ensures changefeeds fully initialize on destination nodes [operator_move.go], and Maintainer ignores premature removals before bootstrap completes [maintainer.go], preventing stuck maintainers and checkpoint lag. |
| Out of Scope Changes check | ✅ Passed | All code changes directly address the bootstrap completion and maintainer removal issues outlined in issue #4115; test additions validate the new behavior and are within scope. |

@gemini-code-assist

Summary of Changes

Hello @hongyunyan, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly improves the stability and correctness of changefeed lifecycle management, particularly during node movements and restarts. By introducing a BootstrapDone flag and refining the handling of remove requests, it ensures that changefeeds are fully initialized before operations are considered complete and prevents erroneous premature removals, thereby enhancing the overall reliability of the system.

Highlights

  • Enhanced Changefeed Move Logic: The MoveMaintainerOperator now explicitly waits for a BootstrapDone flag from the destination node's MaintainerStatus before considering a changefeed move operation complete. This ensures the changefeed is fully initialized on the new node.
  • Robust Maintainer Removal: The Maintainer component has been updated to ignore non-cascade remove requests if they arrive before the maintainer has completed its bootstrap process. This prevents premature termination and potential issues where bootstrap responses might be dropped, leading to a stuck state.
  • New Atomic Flag for Logging: A new ignoredNonCascadeRemoveBeforeBootstrap atomic boolean has been introduced in the Maintainer to prevent log spam when multiple non-cascade remove requests are received before bootstrap.
  • Comprehensive Unit Tests: New unit tests have been added to validate the updated MoveMaintainerOperator's dependency on BootstrapDone and the Maintainer's behavior of ignoring non-cascade remove requests before bootstrap, ensuring correct and robust operation.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces important correctness fixes for changefeed move and restart operations. The core issue addressed is a race condition where a changefeed maintainer could get stuck in a bootstrapping state if a move/remove operation was initiated concurrently.

The changes include:

  1. The MoveMaintainerOperator now correctly waits for the destination maintainer to be fully bootstrapped by checking a new BootstrapDone flag in its status. This ensures a move operation is only considered complete when the changefeed is fully operational on the new node.
  2. The Maintainer is now more robust against premature remove requests. It will ignore non-cascade remove requests that arrive before it has finished its own bootstrap process. This prevents the maintainer from entering a removing state that would block the bootstrap from ever completing, thus avoiding a potential deadlock.
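
The ordering problem described in these two points can be illustrated with a toy model. It is a sketch built on one assumption taken from the review text, namely that a maintainer already in the removing state drops bootstrap responses; the `node` type, its fields, and the `guard` switch are hypothetical and exist only to compare behavior with and without the fix.

```go
package main

import "fmt"

// node is a hypothetical, single-threaded model of a maintainer.
type node struct {
	guard       bool // whether the pre-bootstrap remove guard is enabled
	initialized bool // true once a bootstrap response has been applied
	removing    bool
}

// onRemove handles a remove request.
func (n *node) onRemove(cascade bool) {
	if n.guard && !cascade && !n.initialized {
		return // ignore: bootstrap has not completed yet
	}
	n.removing = true
}

// onBootstrapResponse applies a bootstrap response unless the node is
// already removing (assumed behavior, as described in the review).
func (n *node) onBootstrapResponse() {
	if n.removing {
		return // response dropped
	}
	n.initialized = true
}

// run delivers a non-cascade remove before the bootstrap response and
// reports whether bootstrap still completes.
func run(guard bool) bool {
	n := &node{guard: guard}
	n.onRemove(false)
	n.onBootstrapResponse()
	return n.initialized
}

func main() {
	fmt.Println(run(false)) // false: without the guard the node stays uninitialized
	fmt.Println(run(true))  // true: with the guard bootstrap completes
}
```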

The changes are well-implemented and accompanied by thorough unit tests that cover the new logic and edge cases. The overall code quality is high. This is a solid improvement to the system's robustness.

@ti-chi-bot ti-chi-bot bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 2, 2026
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@maintainer/maintainer_remove_before_bootstrap_test.go`:
- Line 1: Add the standard copyright header at the very top of this file before
the "package maintainer" declaration; ensure it matches the project's canonical
header text and formatting (including year and owner) used across the repo so
the CI header check passes, and do not alter the existing "package maintainer"
line or any other code in maintainer_remove_before_bootstrap_test.go.

@hongyunyan hongyunyan changed the title from "wip" to "coordinator,maintainer: Fixed that bootstrap might fail to succeed in frequent maintainer scheduling" Feb 2, 2026
@hongyunyan hongyunyan changed the title from "coordinator,maintainer: Fixed that bootstrap might fail to succeed in frequent maintainer scheduling" to "coordinator,maintainer: Fixed bootstrap might fail to succeed in frequent maintainer scheduling" Feb 2, 2026
@ti-chi-bot ti-chi-bot bot added needs-1-more-lgtm Indicates a PR needs 1 more LGTM. approved labels Feb 2, 2026
@ti-chi-bot ti-chi-bot bot added the lgtm label Feb 2, 2026
@ti-chi-bot

ti-chi-bot bot commented Feb 2, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: asddongmen, wk989898

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [asddongmen,wk989898]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot removed the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Feb 2, 2026
@ti-chi-bot

ti-chi-bot bot commented Feb 2, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-02-02 03:32:25.874946002 +0000 UTC m=+65016.976344712: ☑️ agreed by asddongmen.
  • 2026-02-02 05:32:51.277825467 +0000 UTC m=+72242.379224186: ☑️ agreed by wk989898.

@ti-chi-bot ti-chi-bot bot merged commit e6ede44 into pingcap:master Feb 2, 2026
27 checks passed
@ti-chi-bot

ti-chi-bot bot commented Feb 2, 2026

@hongyunyan: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| pull-cdc-mysql-integration-light-next-gen | 0f71543 | link | unknown | /test pull-cdc-mysql-integration-light-next-gen |
| pull-cdc-pulsar-integration-heavy-next-gen | 0f71543 | link | unknown | /test pull-cdc-pulsar-integration-heavy-next-gen |

a-cong pushed a commit to a-cong/ticdc that referenced this pull request Feb 2, 2026
lidezhu pushed a commit that referenced this pull request Feb 27, 2026
tenfyzhong pushed a commit that referenced this pull request Mar 18, 2026
…uent maintainer scheduling (#4114)

close #4115

(cherry picked from commit e6ede44)
Signed-off-by: tenfyzhong <tenfy@tenfy.cn>
tenfyzhong pushed a commit that referenced this pull request Mar 18, 2026
tenfyzhong pushed a commit that referenced this pull request Mar 25, 2026

Labels

approved lgtm release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

Changefeed checkpoint lag keeps growing after rolling restart due to maintainer bootstrap not completing

3 participants