ddl: Fix a race condition in DDL logic leading to inconsistency checked by syncpoint #3802
Conversation
/test help
Summary of Changes

Hello @hongyunyan, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses a potential race condition in the DDL event dispatching mechanism by introducing a buffering system within the DDL dispatcher.
/test pull-cdc-mysql-integration-heavy
Code Review
This pull request introduces logic to hold certain DDL events to prevent a race condition, along with a new unit test to verify this behavior. The core logic change seems correct in its intent. However, I've identified a critical race condition in the implementation that could lead to events being stuck indefinitely. I've provided suggestions to fix this race. Additionally, there are some temporary changes in an integration test script that should be reverted before merging.
if d.IsTableTriggerEventDispatcher() &&
	d.resendTaskMap.Len() > 0 &&
	blockedTables != nil &&
	blockedTables.InfluenceType != commonEvent.InfluenceTypeNormal {
	d.blockEventStatus.setBlockEvent(event, heartbeatpb.BlockStage_WAITING)
	d.HoldBlockEvent(event)
	return
}
Following my suggestion to change HoldBlockEvent to fix a race condition, this part needs to be updated to use the new return value.
Current:

if d.IsTableTriggerEventDispatcher() &&
	d.resendTaskMap.Len() > 0 &&
	blockedTables != nil &&
	blockedTables.InfluenceType != commonEvent.InfluenceTypeNormal {
	d.blockEventStatus.setBlockEvent(event, heartbeatpb.BlockStage_WAITING)
	d.HoldBlockEvent(event)
	return
}

Suggested:

if d.IsTableTriggerEventDispatcher() &&
	blockedTables != nil &&
	blockedTables.InfluenceType != commonEvent.InfluenceTypeNormal {
	if d.HoldBlockEvent(event) {
		d.blockEventStatus.setBlockEvent(event, heartbeatpb.BlockStage_WAITING)
		return
	}
}
func (d *BasicDispatcher) HoldBlockEvent(event commonEvent.BlockEvent) {
	d.holdingBlockEventMu.Lock()
	defer d.holdingBlockEventMu.Unlock()
	// The event stream is blocked by this block event, so at most one such event can be pending here.
	if d.holdingBlockEvent != nil {
		d.HandleError(errors.ErrDispatcherFailed.GenWithStackByArgs(
			"hold non-normal block event failed: holdingBlockEvent is already occupied",
		))
		return
	}
	d.holdingBlockEvent = event
}
This function is part of a sequence that has a critical race condition. The check d.resendTaskMap.Len() > 0 in DealWithBlockEvent and this function call are not atomic. An ACK could be processed between the check and the hold, leading to a stuck event.
To fix this, the check should be moved inside this function to make the check-and-hold atomic. This function should also return a boolean to indicate whether the event was held. I will also add a comment in DealWithBlockEvent to adapt to this change.
func (d *BasicDispatcher) HoldBlockEvent(event commonEvent.BlockEvent) bool {
	d.holdingBlockEventMu.Lock()
	defer d.holdingBlockEventMu.Unlock()
	if d.resendTaskMap.Len() == 0 {
		return false
	}
	// The event stream is blocked by this block event, so at most one such event can be pending here.
	if d.holdingBlockEvent != nil {
		d.HandleError(errors.ErrDispatcherFailed.GenWithStackByArgs(
			"hold non-normal block event failed: holdingBlockEvent is already occupied",
		))
	} else {
		d.holdingBlockEvent = event
	}
	return true
}
/test pull-cdc-mysql-integration-light

/gemini review
Code Review
This pull request introduces a mechanism to prevent race conditions during DDL processing, specifically between DDLs that alter table schemas (like adding or dropping tables) and subsequent database-wide or global events (like DROP DATABASE or syncpoints). The solution involves two main parts: the DDL dispatcher now holds these global events if it has un-acknowledged DDLs, and the maintainer will discard such events if it has pending scheduling tasks. This ensures that range checkers for global events are built with a complete and up-to-date view of all tables. The changes are well-reasoned and accompanied by thorough unit tests. My review has identified a minor inaccuracy in a comment and a significant but likely temporary change in the CI test configuration that should be reverted before merging.
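As a rough illustration of the maintainer-side safeguard described in this summary, the sketch below defers (does not ACK) a DB/All-scope block event while schedule-required tasks are still outstanding. All type, field, and method names here (barrier, pendingScheduleTasks, handleBlockEvent) are illustrative stand-ins, not the repository's actual API.

package main

import "fmt"

// Illustrative stand-ins for the real maintainer-side types.
type InfluenceType int

const (
	InfluenceTypeNormal InfluenceType = iota
	InfluenceTypeDB
	InfluenceTypeAll
)

type BlockEvent struct {
	CommitTs      uint64
	InfluenceType InfluenceType
}

type barrier struct {
	// Number of add/drop-table scheduling tasks not yet finished (assumed field).
	pendingScheduleTasks int
}

// handleBlockEvent defers a DB/All-scope block event while scheduling work is
// still pending, so the range checker is only built from a complete table set.
// Returning false means "no ACK"; the DDL dispatcher keeps resending the event.
func (b *barrier) handleBlockEvent(e BlockEvent) (acked bool) {
	if e.InfluenceType != InfluenceTypeNormal && b.pendingScheduleTasks > 0 {
		return false // discard for now; the event will arrive again via resend
	}
	// ... build the range checker and ACK as usual ...
	return true
}

func main() {
	b := &barrier{pendingScheduleTasks: 1}
	syncpoint := BlockEvent{CommitTs: 100, InfluenceType: InfluenceTypeAll}
	fmt.Println(b.handleBlockEvent(syncpoint)) // false: deferred until scheduling finishes
	b.pendingScheduleTasks = 0
	fmt.Println(b.handleBlockEvent(syncpoint)) // true: safe to process now
}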
mysql_groups=(
    # G00
    'api_v2 generate_column many_pk_or_uk multi_source large_txn'
    'syncpoint syncpoint syncpoint syncpoint'
    # G01
    'ddl_for_split_tables_with_random_move_table'
    'syncpoint syncpoint syncpoint syncpoint'
    # G02
    'ddl_for_split_tables_with_failover'
    'syncpoint syncpoint syncpoint syncpoint'
    # G03
    'cdc move_table checkpoint_race_ddl_crash'
    'syncpoint syncpoint syncpoint syncpoint'
    # G04
    'complex_transaction syncpoint syncpoint_check_ts random_drop_message'
    'syncpoint syncpoint syncpoint syncpoint'
    # G05
    'ddl_for_split_tables_with_merge_and_split'
    'syncpoint syncpoint syncpoint syncpoint'
    # G06
    'ddl_for_split_tables_with_random_merge_and_split'
    # G07
    'consistent_compatibility consistent_partition_table consistent_replicate_gbk consistent_replicate_ddl consistent_replicate_basic'
    # G08
    'default_value http_proxies bank ddl_for_split_tables_random_schedule'
    # G09
    'availability resolve_lock merge_table drop_many_tables ddl_for_split_tables'
    # G10
    'consistent_replicate_nfs consistent_replicate_storage_file consistent_replicate_storage_file_large_value consistent_replicate_storage_s3'
    # G11
    'multi_changefeeds ddl_wait ddl_reentrant force_replicate_table multi_source'
    'syncpoint syncpoint syncpoint syncpoint'
    # G01
    'syncpoint syncpoint syncpoint syncpoint'
    # G02
    'syncpoint syncpoint syncpoint syncpoint'
    # G03
    'syncpoint syncpoint syncpoint syncpoint'
    # G04
    'syncpoint syncpoint syncpoint syncpoint'
    # G05
    'syncpoint syncpoint syncpoint syncpoint'
    # G12
    'ddl_with_random_move_table'
    # G13
    'tidb_mysql_test fail_over region_merge fail_over_ddl_mix_random_delay'
    # G14
    'fail_over_ddl_mix'
    # G15
    'fail_over_ddl_mix_with_syncpoint'
    'syncpoint syncpoint syncpoint syncpoint'
    # G01
    'syncpoint syncpoint syncpoint syncpoint'
    # G02
    'syncpoint syncpoint syncpoint syncpoint'
    # G03
    'syncpoint syncpoint syncpoint syncpoint'
    # G04
    'syncpoint syncpoint syncpoint syncpoint'
    # G05
    'syncpoint syncpoint syncpoint syncpoint'
)
The mysql_groups array has been significantly modified, with most of the original integration tests replaced by repeated syncpoint tests. This appears to be a temporary change for debugging purposes. Please ensure this is reverted before the final merge to avoid disabling a large number of important integration tests.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
/gemini review

/gemini summary
Code Review
This pull request introduces a mechanism to prevent a race condition between DDLs that modify table schedules (like CREATE, DROP, TRUNCATE) and global DDLs or syncpoints. This is achieved by holding DB/All level events in the DDL dispatcher until all pending scheduling tasks are acknowledged by the maintainer, ensuring correct serialization. A safeguard is also added on the maintainer side to discard such events if scheduling tasks are still pending. The core logic appears sound and is supported by new unit tests. However, there is a critical issue in a test script that needs to be addressed before merging.
// pendingACKCount is only used by the table trigger dispatcher.
//
// It tracks the number of DDLs that need to report to maintainer (i.e. DDLs that add/drop tables),
// but have not been ACKed by maintainer (i.e. maintainer has not finished scheduling add/drop tasks).
pendingACKCount atomic.Int64
The comment for pendingACKCount states that it tracks DDLs that add/drop tables. However, the implementation has been expanded to track all events reported to the maintainer by the DDL dispatcher that require an ACK, including syncpoints. To improve clarity and maintainability, please update the comment to reflect this broader scope.
For example:
// It tracks the number of events reported to the maintainer by the table trigger dispatcher that are awaiting an ACK.
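Applied to the field quoted above, the broader comment might read as follows (a suggested rendering only, not the PR's final wording):

// pendingACKCount is only used by the table trigger dispatcher.
//
// It tracks the number of events reported to the maintainer by the table
// trigger dispatcher that are awaiting an ACK, covering both schedule-related
// DDLs and DB/All block events such as syncpoints.
pendingACKCount atomic.Int64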
Summary of Changes

This pull request addresses a critical race condition in the DDL event processing pipeline within TiCDC. It introduces a robust buffering and deferral mechanism to ensure that global block events, such as syncpoints or database-wide DDLs, are not acted upon by the maintainer until all preceding DDLs that involve adding or dropping tables have been fully scheduled and acknowledged. This enhancement prevents the maintainer from operating on an outdated or incomplete view of the system's schema, thereby improving data consistency and the reliability of crash recovery.
/test all

/test all

/retest

/test pull-cdc-kafka-integration-heavy

/gemini review
Code Review
This pull request introduces a buffering mechanism to fix a race condition in DDL processing, which is a solid approach. The changes on both the dispatcher and maintainer sides aim to ensure correctness. However, I've identified a critical race condition in the new dispatcher logic that could undermine the entire fix. The logic for deciding when to dispatch a held event is not atomic, which could lead to out-of-order processing under concurrent DDL operations, re-introducing the original bug. My review focuses on this critical issue.
if d.IsTableTriggerEventDispatcher() {
	d.pendingACKCount.Add(-1)
	// If there is a held DB/All block event, report it as soon as all resend tasks are ACKed.
	// For schedule-related non-blocking DDLs, the maintainer only ACKs after scheduling is done.
	// For schedule-related blocking DDLs, the maintainer only begins handling them once there are no pending scheduling tasks.
	// Thus, we ensure DB/All block events can generate correct range checkers.
	if d.pendingACKCount.Load() == 0 {
		if holding := d.popHoldingBlockEvent(); holding != nil {
			d.reportBlockedEventToMaintainer(holding)
		}
	}
}
There's a critical race condition here that could undermine the fix this PR introduces. The logic to check pendingACKCount and dispatch the held event is not atomic.
- Stale Read: The pendingACKCount is decremented, and then its value is loaded in a separate step (d.pendingACKCount.Load() == 0). Another goroutine could increment pendingACKCount between the Add(-1) and Load() calls. This would cause the Load() to see a non-zero value and fail to dispatch the held event, leaving it stuck.
- Check-Then-Act Race: Even if you fix the stale read by using the result of Add(-1) (i.e., if d.pendingACKCount.Add(-1) == 0), a more subtle race exists. Another DDL event could be processed concurrently and increment pendingACKCount after the check evaluates to true but before the held event is dispatched by reportBlockedEventToMaintainer. This would violate the core invariant that global block events should only be dispatched when no other DDLs are pending.
To fix this robustly, the check for pendingACKCount == 0 and the subsequent pop-and-dispatch of the holding event must be performed atomically. I recommend using a sync.Mutex to protect these critical sections across all related methods (cancelResendTask, holdBlockEvent, and modifications to pendingACKCount).
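A minimal sketch of that mutex approach, using simplified stand-in types (BlockEvent, dispatcher) rather than the actual ticdc structures; the point is that the count update, the zero check, and the pop of the held event all happen under one lock:

package main

import (
	"fmt"
	"sync"
)

type BlockEvent struct{ CommitTs uint64 }

type dispatcher struct {
	mu                sync.Mutex // guards pendingACKCount and holdingBlockEvent together
	pendingACKCount   int64
	holdingBlockEvent *BlockEvent
}

// holdBlockEvent atomically checks whether any reported event is still awaiting
// an ACK and, only if so, parks the DB/All block event. It returns whether the
// event was held; if not, the caller can report it to the maintainer directly.
func (d *dispatcher) holdBlockEvent(e *BlockEvent) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.pendingACKCount == 0 {
		return false
	}
	d.holdingBlockEvent = e
	return true
}

// onACK decrements the pending count and, under the same lock, pops the held
// event when the count reaches zero, so no concurrent holdBlockEvent can
// interleave between the check and the dispatch.
func (d *dispatcher) onACK(report func(*BlockEvent)) {
	d.mu.Lock()
	d.pendingACKCount--
	var held *BlockEvent
	if d.pendingACKCount == 0 && d.holdingBlockEvent != nil {
		held = d.holdingBlockEvent
		d.holdingBlockEvent = nil
	}
	d.mu.Unlock()
	if held != nil {
		report(held)
	}
}

func main() {
	d := &dispatcher{pendingACKCount: 1}
	fmt.Println(d.holdBlockEvent(&BlockEvent{CommitTs: 42})) // true: held until the ACK arrives
	d.onACK(func(e *BlockEvent) { fmt.Println("report held event at commitTs", e.CommitTs) })
}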
// Double-check here to avoid pendingACKCount becoming zero before we hold the event.
if d.pendingACKCount.Load() == 0 {
	holding := d.popHoldingBlockEvent()
	if holding != nil {
		d.reportBlockedEventToMaintainer(holding)
	}
}
This double-check has the same critical race condition as the one in cancelResendTask. The pendingACKCount can change between the Load() and the popHoldingBlockEvent() call. This could lead to the held event being dispatched while another event is still pending acknowledgement, violating the ordering guarantees this PR aims to enforce.
This critical section should also be protected by the same mutex proposed for cancelResendTask to ensure the decision to dispatch and the dispatch action itself are atomic with respect to other pending operations.
/test all
if d.pendingACKCount.Load() == 0 {
	if holding := d.popHoldingBlockEvent(); holding != nil {
		d.reportBlockedEventToMaintainer(holding)
	}
}
Is it possible for pendingACKCount to become less than zero? It would be better to add a check to cover that case.
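A hypothetical guard for this concern might look like the sketch below (using a plain panic for illustration; the real code would presumably use the project's logger):

package main

import (
	"fmt"
	"sync/atomic"
)

type dispatcher struct {
	pendingACKCount atomic.Int64
}

// ackOne decrements the pending count and fails loudly if it ever goes
// negative, which would mean more ACKs were processed than events reported.
func (d *dispatcher) ackOne() {
	if cnt := d.pendingACKCount.Add(-1); cnt < 0 {
		panic(fmt.Sprintf("pendingACKCount dropped below zero: %d", cnt))
	}
}

func main() {
	d := &dispatcher{}
	d.pendingACKCount.Store(1)
	d.ackOne() // fine: count reaches zero
	fmt.Println("remaining:", d.pendingACKCount.Load())
}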
if d.pendingACKCount.Load() == 0 {
	holding := d.popHoldingBlockEvent()
	if holding != nil {
		d.reportBlockedEventToMaintainer(holding)
	}
}
Ditto
/retest
[APPROVALNOTIFIER] This PR is APPROVED

This pull request has been approved by: lidezhu, wk989898

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
/retest

/retest

1 similar comment

/retest
What problem does this PR solve?
Issue Number: close #3799
What is changed and how it works?
This pull request addresses a critical race condition in the DDL event processing pipeline within TiCDC. It introduces a robust buffering and deferral mechanism to ensure that global block events, such as syncpoints or database-wide DDLs, are not acted upon by the maintainer until all preceding DDLs that involve adding or dropping tables have been fully scheduled and acknowledged. This enhancement prevents the maintainer from operating on an outdated or incomplete view of the system's schema, thereby improving data consistency and the reliability of crash recovery.
Highlights
- Introduced a mechanism in the BasicDispatcher to temporarily hold database-level or all-table-level block events, ensuring they are processed only after all preceding DDL tasks that modify tables have been acknowledged.
- Added pendingACKCount and holdingBlockEvent to prevent a race condition where the maintainer might create an incomplete spanController task snapshot if global block events are processed before table-modifying DDLs are fully scheduled.
- Updated the Barrier in the maintainer to defer acknowledging DB/All block events from the DDL dispatcher if there are still pending schedule-required events, ensuring an accurate range checker is built.
- Added new unit tests (TestHoldBlockEventUntilNoResendTasks, TestDeferAllDBBlockEventFromDDLDispatcherWhilePendingSchedule) to validate the new deferral and buffering logic for DDL block events.

Check List
Tests
Questions
Will it cause performance regression or break compatibility?
Do you need to update user documentation, design documentation or monitoring documentation?
Release note