Conversation

@hongyunyan
Collaborator

@hongyunyan hongyunyan commented Dec 25, 2025

What problem does this PR solve?

Issue Number: close #3799

What is changed and how it works?

This pull request addresses a critical race condition in the DDL event processing pipeline within TiCDC. It introduces a robust buffering and deferral mechanism to ensure that global block events, such as syncpoints or database-wide DDLs, are not acted upon by the maintainer until all preceding DDLs that involve adding or dropping tables have been fully scheduled and acknowledged. This enhancement prevents the maintainer from operating on an outdated or incomplete view of the system's schema, thereby improving data consistency and the reliability of crash recovery.

Highlights

  • DDL Event Buffering: Introduced a mechanism in BasicDispatcher to temporarily hold database-level or all-table-level block events, ensuring they are processed only after all preceding DDL tasks that modify tables have been acknowledged.
  • Race Condition Prevention: Implemented pendingACKCount and holdingBlockEvent to prevent a race condition where the maintainer might create an incomplete spanController task snapshot if global block events are processed before table-modifying DDLs are fully scheduled.
  • Maintainer Logic Adjustment: Modified the Barrier in the maintainer to defer acknowledging DB/All block events from the DDL dispatcher if there are still pending schedule-required events, ensuring an accurate range checker is built.
  • New Test Coverage: Added new unit and integration tests (TestHoldBlockEventUntilNoResendTasks, TestDeferAllDBBlockEventFromDDLDispatcherWhilePendingSchedule) to validate the new deferral and buffering logic for DDL block events.
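
The hold-and-release mechanism described in the highlights can be sketched roughly as follows. This is a simplified, hypothetical model: pendingACKCount and holdingBlockEvent mirror names mentioned in the PR, but every type and method below is invented for illustration and is not the actual TiCDC code.

```go
package main

import (
	"fmt"
	"sync"
)

// blockEvent stands in for commonEvent.BlockEvent in this toy model.
type blockEvent struct{ name string }

type dispatcher struct {
	mu         sync.Mutex
	pendingACK int         // DDLs reported to the maintainer but not yet ACKed
	holding    *blockEvent // at most one deferred DB/All-scope block event
	reported   []string    // stand-in for "report to maintainer"
}

// reportDDL registers a table-modifying DDL that requires a maintainer ACK.
func (d *dispatcher) reportDDL() {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.pendingACK++
}

// submitGlobalEvent either reports a DB/All block event immediately or holds
// it until every earlier table-modifying DDL has been ACKed.
func (d *dispatcher) submitGlobalEvent(e *blockEvent) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.pendingACK > 0 {
		d.holding = e // defer: the maintainer's task snapshot may be incomplete
		return
	}
	d.reported = append(d.reported, e.name)
}

// ack is called when the maintainer finishes scheduling one DDL. Once the
// count reaches zero, any held global event is released inside the same
// critical section, so no new DDL can slip in between the check and the
// dispatch.
func (d *dispatcher) ack() {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.pendingACK--
	if d.pendingACK == 0 && d.holding != nil {
		d.reported = append(d.reported, d.holding.name)
		d.holding = nil
	}
}

func main() {
	d := &dispatcher{}
	d.reportDDL()                                 // e.g. CREATE TABLE, awaiting ACK
	d.submitGlobalEvent(&blockEvent{name: "syncpoint"}) // arrives first: held
	fmt.Println(len(d.reported))                  // the syncpoint is deferred
	d.ack()                                       // maintainer finishes scheduling
	fmt.Println(d.reported)                       // the syncpoint is now released
}
```

Note that the real implementation uses an atomic counter plus a separate mutex; the single-mutex version above is only meant to convey the ordering invariant the PR enforces.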

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Questions

Will it cause performance regression or break compatibility?
Do you need to update user documentation, design documentation or monitoring documentation?

Release note

Please refer to [Release Notes Language Style Guide](https://pingcap.github.io/tidb-dev-guide/contribute-to-tidb/release-notes-style-guide.html) to write a quality release note.

If you don't think this PR needs a release note then fill it with `None`.

@ti-chi-bot ti-chi-bot bot added do-not-merge/needs-linked-issue do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Dec 25, 2025
@hongyunyan
Collaborator Author

/test help

@ti-chi-bot

ti-chi-bot bot commented Dec 25, 2025

@hongyunyan: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

/test pull-build
/test pull-cdc-kafka-integration-heavy
/test pull-cdc-kafka-integration-light
/test pull-cdc-mysql-integration-heavy
/test pull-cdc-mysql-integration-light
/test pull-cdc-storage-integration-heavy
/test pull-cdc-storage-integration-light
/test pull-check
/test pull-error-log-review
/test pull-unit-test

The following commands are available to trigger optional jobs:

/test pull-build-next-gen
/test pull-cdc-kafka-integration-heavy-next-gen
/test pull-cdc-kafka-integration-light-next-gen
/test pull-cdc-mysql-integration-heavy-next-gen
/test pull-cdc-mysql-integration-light-next-gen
/test pull-cdc-pulsar-integration-light
/test pull-cdc-pulsar-integration-light-next-gen
/test pull-cdc-storage-integration-heavy-next-gen
/test pull-cdc-storage-integration-light-next-gen
/test pull-unit-test-next-gen

Use /test all to run the following jobs that were automatically triggered:

pull-build
pull-build-next-gen
pull-check
pull-error-log-review
pull-unit-test
pull-unit-test-next-gen

In response to this:

/test help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ti-chi-bot ti-chi-bot bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Dec 25, 2025
@gemini-code-assist

Summary of Changes

Hello @hongyunyan, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a potential race condition in the DDL event dispatching mechanism by introducing a buffering system within the BasicDispatcher. It temporarily holds database-level or all-table-level block events, ensuring they are processed only after all preceding DDL tasks, particularly those related to schema changes, have been fully acknowledged. This enhancement prevents scenarios where an incomplete snapshot of tasks could lead to incorrect state management or recovery issues, thereby improving the robustness and reliability of DDL processing.

Highlights

  • DDL Event Handling Refinement: Introduced a mechanism within the BasicDispatcher to defer critical DDL block events (specifically those with DB/All scope) when the dispatcher is configured as a table trigger event dispatcher.
  • Race Condition Mitigation: Implemented a buffering system to hold non-normal block events when there are unacknowledged resend tasks. This prevents a race condition where the maintainer might create an incomplete spanController task snapshot, ensuring correct startTs selection during downstream crash recovery.
  • New Test Case: Added a comprehensive unit test, TestHoldBlockEventUntilNoResendTasks, to validate the new deferral logic for block events, ensuring that DB/All events are correctly held and processed only after preceding DDL acknowledgments.

@hongyunyan
Collaborator Author

/test pull-cdc-mysql-integration-heavy

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces logic to hold certain DDL events to prevent a race condition, along with a new unit test to verify this behavior. The core logic change seems correct in its intent. However, I've identified a critical race condition in the implementation that could lead to events being stuck indefinitely. I've provided suggestions to fix this race. Additionally, there are some temporary changes in an integration test script that should be reverted before merging.

Comment on lines 774 to 781
if d.IsTableTriggerEventDispatcher() &&
	d.resendTaskMap.Len() > 0 &&
	blockedTables != nil &&
	blockedTables.InfluenceType != commonEvent.InfluenceTypeNormal {
	d.blockEventStatus.setBlockEvent(event, heartbeatpb.BlockStage_WAITING)
	d.HoldBlockEvent(event)
	return
}

critical

Following my suggestion to change HoldBlockEvent to fix a race condition, this part needs to be updated to use the new return value.

Suggested change
-if d.IsTableTriggerEventDispatcher() &&
-	d.resendTaskMap.Len() > 0 &&
-	blockedTables != nil &&
-	blockedTables.InfluenceType != commonEvent.InfluenceTypeNormal {
-	d.blockEventStatus.setBlockEvent(event, heartbeatpb.BlockStage_WAITING)
-	d.HoldBlockEvent(event)
-	return
-}
+if d.IsTableTriggerEventDispatcher() &&
+	blockedTables != nil &&
+	blockedTables.InfluenceType != commonEvent.InfluenceTypeNormal {
+	if d.HoldBlockEvent(event) {
+		d.blockEventStatus.setBlockEvent(event, heartbeatpb.BlockStage_WAITING)
+		return
+	}
+}

Comment on lines 833 to 844
func (d *BasicDispatcher) HoldBlockEvent(event commonEvent.BlockEvent) {
d.holdingBlockEventMu.Lock()
defer d.holdingBlockEventMu.Unlock()
// The event stream is blocked by this block event, so at most one such event can be pending here.
if d.holdingBlockEvent != nil {
d.HandleError(errors.ErrDispatcherFailed.GenWithStackByArgs(
"hold non-normal block event failed: holdingBlockEvent is already occupied",
))
return
}
d.holdingBlockEvent = event
}

critical

This function is part of a sequence that has a critical race condition. The check d.resendTaskMap.Len() > 0 in DealWithBlockEvent and this function call are not atomic. An ACK could be processed between the check and the hold, leading to a stuck event.

To fix this, the check should be moved inside this function to make the check-and-hold atomic. This function should also return a boolean to indicate whether the event was held. I will also add a comment in DealWithBlockEvent to adapt to this change.

func (d *BasicDispatcher) HoldBlockEvent(event commonEvent.BlockEvent) bool {
	d.holdingBlockEventMu.Lock()
	defer d.holdingBlockEventMu.Unlock()

	if d.resendTaskMap.Len() == 0 {
		return false
	}

	// The event stream is blocked by this block event, so at most one such event can be pending here.
	if d.holdingBlockEvent != nil {
		d.HandleError(errors.ErrDispatcherFailed.GenWithStackByArgs(
			"hold non-normal block event failed: holdingBlockEvent is already occupied",
		))
	} else {
		d.holdingBlockEvent = event
	}
	return true
}

Comment on lines 33 to 67
'syncpoint syncpoint syncpoint syncpoint'
# G01
'ddl_for_split_tables_with_random_move_table'
'syncpoint syncpoint syncpoint syncpoint'
# G02
'ddl_for_split_tables_with_failover'
'syncpoint syncpoint syncpoint syncpoint'
# G03
'cdc move_table checkpoint_race_ddl_crash'
'syncpoint syncpoint syncpoint syncpoint'
# G04
'complex_transaction syncpoint syncpoint_check_ts random_drop_message'
'syncpoint syncpoint syncpoint syncpoint'
# G05
'ddl_for_split_tables_with_merge_and_split'
'syncpoint syncpoint syncpoint syncpoint'
# G06
'ddl_for_split_tables_with_random_merge_and_split'
# G07
'consistent_compatibility consistent_partition_table consistent_replicate_gbk consistent_replicate_ddl consistent_replicate_basic'
# G08
'default_value http_proxies bank ddl_for_split_tables_random_schedule'
# G09
'availability resolve_lock merge_table drop_many_tables ddl_for_split_tables'
# G10
'consistent_replicate_nfs consistent_replicate_storage_file consistent_replicate_storage_file_large_value consistent_replicate_storage_s3'
# G11
'multi_changefeeds ddl_wait ddl_reentrant force_replicate_table multi_source'
'syncpoint syncpoint syncpoint syncpoint'
# G01
'syncpoint syncpoint syncpoint syncpoint'
# G02
'syncpoint syncpoint syncpoint syncpoint'
# G03
'syncpoint syncpoint syncpoint syncpoint'
# G04
'syncpoint syncpoint syncpoint syncpoint'
# G05
'syncpoint syncpoint syncpoint syncpoint'
# G12
'ddl_with_random_move_table'
# G13
'tidb_mysql_test fail_over region_merge fail_over_ddl_mix_random_delay'
# G14
'fail_over_ddl_mix'
# G15
'fail_over_ddl_mix_with_syncpoint'
'syncpoint syncpoint syncpoint syncpoint'
# G01
'syncpoint syncpoint syncpoint syncpoint'
# G02
'syncpoint syncpoint syncpoint syncpoint'
# G03
'syncpoint syncpoint syncpoint syncpoint'
# G04
'syncpoint syncpoint syncpoint syncpoint'
# G05
'syncpoint syncpoint syncpoint syncpoint'

high

These changes to the mysql_groups array appear to be temporary for debugging purposes, as they replace all integration test groups with syncpoint and contain duplicated group comments (e.g., # G01, # G02). Please revert these changes before merging to ensure all integration tests are run in CI.

@ti-chi-bot ti-chi-bot bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Dec 26, 2025
@hongyunyan
Collaborator Author

/test pull-cdc-mysql-integration-light

@hongyunyan
Collaborator Author

/gemini review

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a mechanism to prevent race conditions during DDL processing, specifically between DDLs that alter table schemas (like adding or dropping tables) and subsequent database-wide or global events (like DROP DATABASE or syncpoints). The solution involves two main parts: the DDL dispatcher now holds these global events if it has un-acknowledged DDLs, and the maintainer will discard such events if it has pending scheduling tasks. This ensures that range checkers for global events are built with a complete and up-to-date view of all tables. The changes are well-reasoned and accompanied by thorough unit tests. My review has identified a minor inaccuracy in a comment and a significant but likely temporary change in the CI test configuration that should be reverted before merging.

Comment on lines 31 to 68
mysql_groups=(
# G00
'api_v2 generate_column many_pk_or_uk multi_source large_txn'
'syncpoint syncpoint syncpoint syncpoint'
# G01
'ddl_for_split_tables_with_random_move_table'
'syncpoint syncpoint syncpoint syncpoint'
# G02
'ddl_for_split_tables_with_failover'
'syncpoint syncpoint syncpoint syncpoint'
# G03
'cdc move_table checkpoint_race_ddl_crash'
'syncpoint syncpoint syncpoint syncpoint'
# G04
'complex_transaction syncpoint syncpoint_check_ts random_drop_message'
'syncpoint syncpoint syncpoint syncpoint'
# G05
'ddl_for_split_tables_with_merge_and_split'
'syncpoint syncpoint syncpoint syncpoint'
# G06
'ddl_for_split_tables_with_random_merge_and_split'
# G07
'consistent_compatibility consistent_partition_table consistent_replicate_gbk consistent_replicate_ddl consistent_replicate_basic'
# G08
'default_value http_proxies bank ddl_for_split_tables_random_schedule'
# G09
'availability resolve_lock merge_table drop_many_tables ddl_for_split_tables'
# G10
'consistent_replicate_nfs consistent_replicate_storage_file consistent_replicate_storage_file_large_value consistent_replicate_storage_s3'
# G11
'multi_changefeeds ddl_wait ddl_reentrant force_replicate_table multi_source'
'syncpoint syncpoint syncpoint syncpoint'
# G01
'syncpoint syncpoint syncpoint syncpoint'
# G02
'syncpoint syncpoint syncpoint syncpoint'
# G03
'syncpoint syncpoint syncpoint syncpoint'
# G04
'syncpoint syncpoint syncpoint syncpoint'
# G05
'syncpoint syncpoint syncpoint syncpoint'
# G12
'ddl_with_random_move_table'
# G13
'tidb_mysql_test fail_over region_merge fail_over_ddl_mix_random_delay'
# G14
'fail_over_ddl_mix'
# G15
'fail_over_ddl_mix_with_syncpoint'
'syncpoint syncpoint syncpoint syncpoint'
# G01
'syncpoint syncpoint syncpoint syncpoint'
# G02
'syncpoint syncpoint syncpoint syncpoint'
# G03
'syncpoint syncpoint syncpoint syncpoint'
# G04
'syncpoint syncpoint syncpoint syncpoint'
# G05
'syncpoint syncpoint syncpoint syncpoint'
)

high

The mysql_groups array has been significantly modified, with most of the original integration tests replaced by repeated syncpoint tests. This appears to be a temporary change for debugging purposes. Please ensure this is reverted before the final merge to avoid disabling a large number of important integration tests.

hongyunyan and others added 3 commits December 26, 2025 23:29
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@hongyunyan
Collaborator Author

/gemini review

@hongyunyan
Collaborator Author

/gemini summary

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a mechanism to prevent a race condition between DDLs that modify table schedules (like CREATE, DROP, TRUNCATE) and global DDLs or syncpoints. This is achieved by holding DB/All level events in the DDL dispatcher until all pending scheduling tasks are acknowledged by the maintainer, ensuring correct serialization. A safeguard is also added on the maintainer side to discard such events if scheduling tasks are still pending. The core logic appears sound and is supported by new unit tests. However, there is a critical issue in a test script that needs to be addressed before merging.

Comment on lines 31 to 68
mysql_groups=(
# G00
'api_v2 generate_column many_pk_or_uk multi_source large_txn'
'syncpoint syncpoint syncpoint syncpoint'
# G01
'ddl_for_split_tables_with_random_move_table'
'syncpoint syncpoint syncpoint syncpoint'
# G02
'ddl_for_split_tables_with_failover'
'syncpoint syncpoint syncpoint syncpoint'
# G03
'cdc move_table checkpoint_race_ddl_crash'
'syncpoint syncpoint syncpoint syncpoint'
# G04
'complex_transaction syncpoint syncpoint_check_ts random_drop_message'
'syncpoint syncpoint syncpoint syncpoint'
# G05
'ddl_for_split_tables_with_merge_and_split'
'syncpoint syncpoint syncpoint syncpoint'
# G06
'ddl_for_split_tables_with_random_merge_and_split'
# G07
'consistent_compatibility consistent_partition_table consistent_replicate_gbk consistent_replicate_ddl consistent_replicate_basic'
# G08
'default_value http_proxies bank ddl_for_split_tables_random_schedule'
# G09
'availability resolve_lock merge_table drop_many_tables ddl_for_split_tables'
# G10
'consistent_replicate_nfs consistent_replicate_storage_file consistent_replicate_storage_file_large_value consistent_replicate_storage_s3'
# G11
'multi_changefeeds ddl_wait ddl_reentrant force_replicate_table multi_source'
'syncpoint syncpoint syncpoint syncpoint'
# G01
'syncpoint syncpoint syncpoint syncpoint'
# G02
'syncpoint syncpoint syncpoint syncpoint'
# G03
'syncpoint syncpoint syncpoint syncpoint'
# G04
'syncpoint syncpoint syncpoint syncpoint'
# G05
'syncpoint syncpoint syncpoint syncpoint'
# G12
'ddl_with_random_move_table'
# G13
'tidb_mysql_test fail_over region_merge fail_over_ddl_mix_random_delay'
# G14
'fail_over_ddl_mix'
# G15
'fail_over_ddl_mix_with_syncpoint'
'syncpoint syncpoint syncpoint syncpoint'
# G01
'syncpoint syncpoint syncpoint syncpoint'
# G02
'syncpoint syncpoint syncpoint syncpoint'
# G03
'syncpoint syncpoint syncpoint syncpoint'
# G04
'syncpoint syncpoint syncpoint syncpoint'
# G05
'syncpoint syncpoint syncpoint syncpoint'
)

critical

The mysql_groups test suite has been modified to repeatedly run syncpoint tests. This appears to be a temporary change for debugging or focused testing. This must be reverted to the original list of tests before merging to ensure CI test coverage is not compromised.

Comment on lines 175 to 179
// pendingACKCount is only used by the table trigger dispatcher.
//
// It tracks the number of DDLs that need to report to the maintainer (i.e. DDLs that add/drop tables),
// but have not been ACKed by the maintainer (i.e. the maintainer has not finished scheduling add/drop tasks).
pendingACKCount atomic.Int64

medium

The comment for pendingACKCount states that it tracks DDLs that add/drop tables. However, the implementation has been expanded to track all events reported to the maintainer by the DDL dispatcher that require an ACK, including syncpoints. To improve clarity and maintainability, please update the comment to reflect this broader scope.

For example:

// It tracks the number of events reported to the maintainer by the table trigger dispatcher that are awaiting an ACK.

@gemini-code-assist

Summary of Changes

This pull request addresses a critical race condition in the DDL event processing pipeline within TiCDC. It introduces a robust buffering and deferral mechanism to ensure that global block events, such as syncpoints or database-wide DDLs, are not acted upon by the maintainer until all preceding DDLs that involve adding or dropping tables have been fully scheduled and acknowledged. This enhancement prevents the maintainer from operating on an outdated or incomplete view of the system's schema, thereby improving data consistency and the reliability of crash recovery.

Highlights

  • DDL Event Buffering: Introduced a mechanism in BasicDispatcher to temporarily hold database-level or all-table-level block events, ensuring they are processed only after all preceding DDL tasks that modify tables have been acknowledged.
  • Race Condition Prevention: Implemented pendingACKCount and holdingBlockEvent to prevent a race condition where the maintainer might create an incomplete spanController task snapshot if global block events are processed before table-modifying DDLs are fully scheduled.
  • Maintainer Logic Adjustment: Modified the Barrier in the maintainer to defer acknowledging DB/All block events from the DDL dispatcher if there are still pending schedule-required events, ensuring an accurate range checker is built.
  • New Test Coverage: Added new unit and integration tests (TestHoldBlockEventUntilNoResendTasks, TestDeferAllDBBlockEventFromDDLDispatcherWhilePendingSchedule) to validate the new deferral and buffering logic for DDL block events.

Changelog
  • downstreamadapter/dispatcher/basic_dispatcher.go
    • Added pendingACKCount (atomic.Int64) to track the number of un-ACKed DDLs that add or drop tables.
    • Introduced holdingBlockEventMu (sync.Mutex) and holdingBlockEvent (commonEvent.BlockEvent) to buffer non-normal (DB/All) block events.
    • Modified DealWithBlockEvent to increment pendingACKCount for table-modifying DDLs and to hold DB/All block events if pendingACKCount is greater than zero.
    • Refactored block event reporting into a new function reportBlockedEventToMaintainer.
    • Updated cancelResendTask to decrement pendingACKCount and release any held block events if the count reaches zero.
    • Added helper functions holdBlockEvent and popHoldingBlockEvent for managing the buffered block event.
  • downstreamadapter/dispatcher/event_dispatcher_test.go
    • Added TestHoldBlockEventUntilNoResendTasks to verify the correct deferral and processing of DB/All block events.
  • maintainer/barrier.go
    • Implemented logic in handleBlockState to temporarily discard DB/All block events from the DDL dispatcher if there are pending schedule-required events, ensuring the spanController has an up-to-date task snapshot.
  • maintainer/barrier_helper.go
    • Added a Len() method to pendingScheduleEventMap to retrieve the current number of pending schedule events.
  • maintainer/barrier_test.go
    • Added TestDeferAllDBBlockEventFromDDLDispatcherWhilePendingSchedule to test the barrier's behavior when deferring global block events.
  • tests/integration_tests/run_heavy_it_in_ci.sh
    • Temporarily modified the mysql_groups array to replace most integration tests with syncpoint for focused testing or debugging purposes. This change is expected to be reverted before merging.
Activity
  • The pull request was opened by hongyunyan and is currently a work-in-progress (wip-ddl).
  • The ti-chi-bot indicated that flowbehappy is required for approval.
  • The author hongyunyan requested /test help and subsequently ran /test pull-cdc-mysql-integration-heavy and /test pull-cdc-mysql-integration-light.
  • The pull-cdc-mysql-integration-light test failed.
  • The author hongyunyan requested /gemini review twice and /gemini summary once.
  • The gemini-code-assist bot provided an initial summary and several review comments, highlighting a critical race condition in HoldBlockEvent and the temporary changes to the integration test configuration.
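
The maintainer-side safeguard summarized in the changelog above (the Barrier discarding DB/All block events from the DDL dispatcher while schedule-required events are pending) might look roughly like this simplified sketch. All types, fields, and method names here are invented for illustration; only pendingScheduleEventMap.Len() echoes an identifier from the PR.

```go
package main

import "fmt"

// barrier is a toy stand-in for the maintainer's Barrier.
type barrier struct {
	pendingScheduleEvents int // stand-in for pendingScheduleEventMap.Len()
}

// blockState is a toy stand-in for a reported block event's state.
type blockState struct {
	fromDDLDispatcher bool
	global            bool // InfluenceType is DB or All rather than Normal
}

// handleBlockState returns false when the event is deferred (discarded for
// now, to be resent by the dispatcher later), because acting on it would
// snapshot an incomplete set of spanController tasks.
func (b *barrier) handleBlockState(s blockState) bool {
	if s.fromDDLDispatcher && s.global && b.pendingScheduleEvents > 0 {
		return false // defer: the range checker would miss tables still being scheduled
	}
	return true // safe to build the range checker from the current task snapshot
}

func main() {
	b := &barrier{pendingScheduleEvents: 1}
	fmt.Println(b.handleBlockState(blockState{fromDDLDispatcher: true, global: true})) // deferred
	b.pendingScheduleEvents = 0
	fmt.Println(b.handleBlockState(blockState{fromDDLDispatcher: true, global: true})) // accepted
}
```

Together with the dispatcher-side hold, this gives the two-sided protection the review describes: the dispatcher tries not to send premature global events, and the maintainer refuses to act on any that arrive anyway.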

@hongyunyan hongyunyan changed the title from wip-ddl to ddl: Fix a race condition in DDL logic leading to inconsistency checked by syncpoint on Dec 29, 2025
@ti-chi-bot ti-chi-bot bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 29, 2025
@hongyunyan
Collaborator Author

/test all

@hongyunyan
Collaborator Author

/test all

@hongyunyan
Collaborator Author

/retest

@hongyunyan
Collaborator Author

/test pull-cdc-kafka-integration-heavy

@hongyunyan
Collaborator Author

/gemini review

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a buffering mechanism to fix a race condition in DDL processing, which is a solid approach. The changes on both the dispatcher and maintainer sides aim to ensure correctness. However, I've identified a critical race condition in the new dispatcher logic that could undermine the entire fix. The logic for deciding when to dispatch a held event is not atomic, which could lead to out-of-order processing under concurrent DDL operations, re-introducing the original bug. My review focuses on this critical issue.

Comment on lines 833 to 844
if d.IsTableTriggerEventDispatcher() {
	d.pendingACKCount.Add(-1)
	// If there is a held DB/All block event, report it as soon as all resend tasks are ACKed.
	// For schedule-related non-blocking DDLs, the maintainer only ACKs after scheduling is done.
	// For schedule-related blocking DDLs, the maintainer will only begin to deal with them after there are no pending scheduling tasks.
	// Thus, we ensure DB/All block events can generate correct range checkers.
	if d.pendingACKCount.Load() == 0 {
		if holding := d.popHoldingBlockEvent(); holding != nil {
			d.reportBlockedEventToMaintainer(holding)
		}
	}
}

critical

There's a critical race condition here that could undermine the fix this PR introduces. The logic to check pendingACKCount and dispatch the held event is not atomic.

  1. Stale Read: The pendingACKCount is decremented, and then its value is loaded in a separate step (d.pendingACKCount.Load() == 0). Another goroutine could increment pendingACKCount between the Add(-1) and Load() calls. This would cause the Load() to see a non-zero value and fail to dispatch the held event, leaving it stuck.

  2. Check-Then-Act Race: Even if you fix the stale read by using the result of Add(-1) (i.e., if d.pendingACKCount.Add(-1) == 0), a more subtle race exists. Another DDL event could be processed concurrently and increment pendingACKCount after the check evaluates to true but before the held event is dispatched by reportBlockedEventToMaintainer. This would violate the core invariant that global block events should only be dispatched when no other DDLs are pending.

To fix this robustly, the check for pendingACKCount == 0 and the subsequent pop-and-dispatch of the holding event must be performed atomically. I recommend using a sync.Mutex to protect these critical sections across all related methods (cancelResendTask, holdBlockEvent, and modifications to pendingACKCount).
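
A minimal sketch of the mutex-based fix this comment suggests (all names here are illustrative, not the PR's actual code): protecting the decrement, the zero check, and the dispatch with a single lock removes both the stale read and the check-then-act window, because no concurrent increment can interleave between them.

```go
package main

import (
	"fmt"
	"sync"
)

// holder models the pending-ACK count plus the held block event under one mutex.
type holder struct {
	mu      sync.Mutex
	pending int
	held    *string
	out     []string // stand-in for "report to maintainer"
}

func (h *holder) inc() {
	h.mu.Lock()
	h.pending++
	h.mu.Unlock()
}

// hold defers the event only if some ACK is still outstanding; the check and
// the store happen inside the same critical section.
func (h *holder) hold(e string) bool {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.pending == 0 {
		return false // nothing outstanding: caller should report immediately
	}
	h.held = &e
	return true
}

// decAndMaybeDispatch decrements the count and, in the same critical section,
// releases the held event iff the count reached zero. The guard also keeps the
// count from going negative.
func (h *holder) decAndMaybeDispatch() {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.pending > 0 {
		h.pending--
	}
	if h.pending == 0 && h.held != nil {
		h.out = append(h.out, *h.held)
		h.held = nil
	}
}

func main() {
	h := &holder{}
	h.inc()
	h.inc()
	h.hold("drop-database") // held: two ACKs outstanding
	h.decAndMaybeDispatch()
	fmt.Println(len(h.out)) // still one ACK outstanding
	h.decAndMaybeDispatch()
	fmt.Println(h.out) // the held event is dispatched exactly once
}
```

The merged PR instead kept the atomic counter with a separate mutex and a double check; the sketch above only shows the fully serialized alternative the reviewer describes.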

Comment on lines 862 to 868
// double check here to avoid pendingACKCount becoming zero before we hold the event
if d.pendingACKCount.Load() == 0 {
	holding := d.popHoldingBlockEvent()
	if holding != nil {
		d.reportBlockedEventToMaintainer(holding)
	}
}

critical

This double-check has the same critical race condition as the one in cancelResendTask. The pendingACKCount can change between the Load() and the popHoldingBlockEvent() call. This could lead to the held event being dispatched while another event is still pending acknowledgement, violating the ordering guarantees this PR aims to enforce.

This critical section should also be protected by the same mutex proposed for cancelResendTask to ensure the decision to dispatch and the dispatch action itself are atomic with respect to other pending operations.

@hongyunyan
Collaborator Author

/test all

Comment on lines 839 to 843
if d.pendingACKCount.Load() == 0 {
	if holding := d.popHoldingBlockEvent(); holding != nil {
		d.reportBlockedEventToMaintainer(holding)
	}
}
Collaborator

Is it possible that pendingACKCount becomes less than zero? It's better to add a check to cover that case.

Comment on lines 863 to 867
if d.pendingACKCount.Load() == 0 {
	holding := d.popHoldingBlockEvent()
	if holding != nil {
		d.reportBlockedEventToMaintainer(holding)
	}
Collaborator

Ditto

@ti-chi-bot ti-chi-bot bot added needs-1-more-lgtm Indicates a PR needs 1 more LGTM. approved labels Dec 30, 2025
@hongyunyan
Collaborator Author

/retest

@ti-chi-bot ti-chi-bot bot added the lgtm label Dec 31, 2025
@ti-chi-bot

ti-chi-bot bot commented Dec 31, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lidezhu, wk989898

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot removed the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Dec 31, 2025
@ti-chi-bot

ti-chi-bot bot commented Dec 31, 2025

[LGTM Timeline notifier]

Timeline:

  • 2025-12-30 08:23:36.603367316 +0000 UTC m=+85772.421675749: ☑️ agreed by wk989898.
  • 2025-12-31 14:11:49.348141592 +0000 UTC m=+193065.166450024: ☑️ agreed by lidezhu.

@hongyunyan
Collaborator Author

/retest

@hongyunyan
Collaborator Author

/retest

1 similar comment
@wk989898
Collaborator

wk989898 commented Jan 6, 2026

/retest

@ti-chi-bot ti-chi-bot bot merged commit ad529e0 into pingcap:master Jan 7, 2026
19 checks passed

Development

Successfully merging this pull request may close these issues.

unstable test syncpoint
