
sink(cloudstorage): add use-table-id-as-path option (#4356) #4594

Merged
ti-chi-bot[bot] merged 14 commits into pingcap:release-8.5 from ti-chi-bot:cherry-pick-4356-to-release-8.5
Mar 25, 2026
Conversation

@ti-chi-bot
Member

@ti-chi-bot ti-chi-bot commented Mar 25, 2026

This is an automated cherry-pick of #4356

What problem does this PR solve?

Issue Number: close #4357

What is changed and how it works?

The 'use-table-id-as-path' configuration option ONLY applies to TICI.

  • Adds config use-table-id-as-path.
  • Adds use_table_id_as_path to API conversion and sink config parsing.
  • Removes partition_id from the path to prevent duplicates.

When use-table-id-as-path is enabled, cloud storage path generation omits the schema and DB schema writes are skipped, that is:

  • The 'database' prefix is removed from the path.
  • Database-related events are filtered out automatically.

For example:

The use-table-id-as-path option switches the path to use table_id instead of table_name when it is set to true.
With use-table-id-as-path=true in the sink URI, for example --sink-uri="s3://cdc?use-table-id-as-path=true", the CDC path changes from

test_db/table_name/5/2024-01-01/CDC_xxx.json

to

12345/5/2024-01-01/CDC_xxx.json
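
The path change above can be sketched roughly as follows. This is a minimal illustration, not the code in pkg/sink/cloudstorage/path.go: `dataDirPath` and its parameters are hypothetical names, and it assumes the `5` segment in the example is the table version.

```go
package main

import (
	"fmt"
	"path"
	"strconv"
)

// dataDirPath is a hypothetical helper that builds the directory part of a
// data file path. It is an illustration only and does not reproduce the
// real generateDataDirPath implementation.
func dataDirPath(
	useTableIDAsPath bool,
	schema, table string,
	tableID int64,
	tableVersion uint64, // assumed meaning of the "5" segment in the example
	date string,
) (string, error) {
	version := strconv.FormatUint(tableVersion, 10)
	if useTableIDAsPath {
		// ID-based layout: <table_id>/<version>/<date>
		if tableID <= 0 {
			return "", fmt.Errorf("invalid table id %d", tableID)
		}
		return path.Join(strconv.FormatInt(tableID, 10), version, date), nil
	}
	// Name-based layout: <schema>/<table>/<version>/<date>
	if schema == "" || table == "" {
		return "", fmt.Errorf("schema and table name must not be empty")
	}
	return path.Join(schema, table, version, date), nil
}

func main() {
	for _, useID := range []bool{false, true} {
		dir, err := dataDirPath(useID, "test_db", "table_name", 12345, 5, "2024-01-01")
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(path.Join(dir, "CDC_xxx.json"))
	}
}
```

Returning an error instead of a bare string matches the direction described in the walkthrough, where the path helpers gained error returns.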

The reason for this design is:

  • To prevent residual data from being affected when a table is dropped and a new table is created with the same name.
  • Database events are not processed by TICI.
  • Retaining database events would create a database directory and schema files, adding work for subsequent S3 directory GC.

Check List

Tests

  • Unit test

Questions

Will it cause performance regression or break compatibility?
Do you need to update user documentation, design documentation or monitoring documentation?

Release note

Please refer to [Release Notes Language Style Guide](https://pingcap.github.io/tidb-dev-guide/contribute-to-tidb/release-notes-style-guide.html) to write a quality release note.

If you don't think this PR needs a release note then fill it with `None`.

Summary by CodeRabbit

  • New Features

    • Added optional use-table-id-as-path configuration parameter for cloud storage sink, enabling file organization by numeric table IDs instead of table names.
  • Bug Fixes

    • Enhanced validation for exchange partition operations to properly detect and handle missing source table information.
  • Tests

    • Added test coverage for table ID-based path generation and exchange partition error handling.

Summary by CodeRabbit

  • New Features

    • Added use-table-id-as-path configuration option for cloud storage sinks to organize schema files by table ID instead of schema and table names.
  • Bug Fixes

    • Improved validation for table exchange DDL events to prevent processing invalid events.
    • Enhanced error handling in schema file path generation operations.

@ti-chi-bot ti-chi-bot added first-time-contributor Indicates that the PR was contributed by an external member and is a first-time contributor. lgtm ok-to-test Indicates a PR is ready to be tested. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. type/cherry-pick-for-release-8.5 This PR is cherry-picked to release-8.5 from a source PR. labels Mar 25, 2026
@coderabbitai
Contributor

coderabbitai bot commented Mar 25, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: de74555c-38f6-4549-bdd9-d0071840f437

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Walkthrough

This pull request adds a new use-table-id-as-path configuration option for cloud storage sinks, enabling table ID-based directory structures instead of table name-based paths. Changes include API/config model extensions, path generation refactoring with error handling, DDL event validation, and parameter threading through the sink write pipeline.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| API Models: api/v2/model.go, api/v2/model_test.go | Added optional UseTableIDAsPath field to CloudStorageConfig and wired it through bidirectional conversion paths (toInternalReplicaConfigWithOriginConfig and ToAPIReplicaConfig). Tests validate round-trip conversion. |
| Sink Configuration: pkg/config/sink.go, pkg/sink/cloudstorage/config.go, pkg/sink/cloudstorage/config_test.go | Added UseTableIDAsPath constant and field to sink config models. Extended SinkConfig.CheckCompatibilityWithSinkURI to parse use-table-id-as-path from URI, validate compatibility, and detect configuration changes. Integrated parameter into config apply/merge logic. |
| Path Generation: pkg/sink/cloudstorage/path.go, pkg/sink/cloudstorage/path_test.go, pkg/sink/cloudstorage/table_definition.go, pkg/sink/cloudstorage/table_definition_test.go | Refactored path generation to accept useTableIDAsPath and tableID parameters. Made GenerateSchemaFilePath, GenerateIndexFilePath, and generateDataDirPath return error pairs. When enabled, paths use <table_id>/meta/ structure instead of <schema>/<table>/. Added validation for schema name, table version, table ID, and column definitions. |
| DDL Event Processing: downstreamadapter/sink/cloudstorage/sink.go, downstreamadapter/sink/cloudstorage/sink_test.go | Added validation for ActionExchangeTablePartition events to verify MultipleTableInfos has ≥2 elements. Updated writeFile to skip database-level DDL schemas when UseTableIDAsPath is enabled and table name is empty. Threaded useTableIDAsPath into schema path generation calls. Tests cover table ID-based paths, database schema skipping, and exchange partition validation. |
| Index File Error Handling: downstreamadapter/sink/cloudstorage/writer.go | Updated flushMessages to capture and handle errors from GenerateIndexFilePath, adding structured logging (workerID, keyspace, changefeed) before returning traced error. |
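
The two DDL-side rules summarized above (reject exchange-partition events with fewer than two table infos, and skip database-level schema writes in table-ID mode) can be sketched as below. All types and names here (`ddlEvent`, `validate`, `shouldWriteSchema`) are hypothetical stand-ins, not the real event types used in downstreamadapter/sink/cloudstorage/sink.go.

```go
package main

import (
	"errors"
	"fmt"
)

// ddlAction and ddlEvent are simplified, hypothetical stand-ins for the
// real DDL event types.
type ddlAction int

const (
	actionCreateSchema ddlAction = iota
	actionCreateTable
	actionExchangeTablePartition
)

type ddlEvent struct {
	Action             ddlAction
	TableName          string   // empty for database-level DDL
	MultipleTableInfos []string // placeholder for per-table metadata
}

var errInvalidExchange = errors.New(
	"exchange partition event must carry at least two table infos")

// validate rejects exchange-partition events that lack the second
// (source) table info.
func validate(e ddlEvent) error {
	if e.Action == actionExchangeTablePartition && len(e.MultipleTableInfos) < 2 {
		return errInvalidExchange
	}
	return nil
}

// shouldWriteSchema reports whether a schema file should be written.
// In table-ID mode, database-level events (empty table name) are skipped.
func shouldWriteSchema(e ddlEvent, useTableIDAsPath bool) bool {
	return !(useTableIDAsPath && e.TableName == "")
}

func main() {
	bad := ddlEvent{
		Action:             actionExchangeTablePartition,
		TableName:          "t",
		MultipleTableInfos: []string{"target"},
	}
	fmt.Println("validate:", validate(bad))

	createDB := ddlEvent{Action: actionCreateSchema}
	fmt.Println("write schema in id mode:", shouldWriteSchema(createDB, true))
	fmt.Println("write schema in name mode:", shouldWriteSchema(createDB, false))
}
```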

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

approved, needs-cherry-pick-release-8.5

Suggested reviewers

  • wk989898
  • tenfyzhong
  • flowbehappy

Poem

🐰 Hops through cloud paths with glee,
Table IDs dancing where names used to be!
With validation strong and error-aware,
This feature proves we truly care.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 11.54% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'sink(cloudstorage): add use-table-id-as-path option (#4356)' clearly and concisely describes the main feature addition to the cloud storage sink configuration. |
| Description check | ✅ Passed | The pull request description adequately covers the problem statement (issue #4357), explains what changed and how it works with concrete examples, includes test confirmation, and provides a release note placeholder. |
| Linked Issues check | ✅ Passed | Code changes fully implement the requirements from issue #4357: adds use-table-id-as-path config option, implements API/sink config parsing, adjusts path generation to omit schema, validates DDL events, and skips database schema writes. |
| Out of Scope Changes check | ✅ Passed | All code changes are directly scoped to implementing the use-table-id-as-path feature: configuration parsing, API conversion, path generation logic, validation, and related test coverage. No unrelated modifications detected. |


@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant enhancement to the cloud storage sink by allowing users to configure data organization using numeric table IDs instead of table names. This change aims to improve data integrity by preventing issues with residual data when tables are dropped and recreated with the same name, and to optimize storage by skipping unnecessary database-level schema writes. Additionally, it strengthens the system's robustness by adding validation for DDL events and improving error handling in file path generation.

Highlights

  • New Configuration Option: Introduced a new configuration option, use-table-id-as-path, for cloud storage sinks to organize data by numeric table IDs instead of table names.
  • Path Generation Logic: Modified cloud storage path generation to use table IDs when the new option is enabled, omitting schema names and filtering database-related events to prevent residual data issues and optimize storage.
  • DDL Event Validation: Enhanced validation for exchange partition DDL events to ensure source table information is present, improving system robustness.
  • Error Handling: Improved error handling for index file path generation within the cloud storage writer and other path-related functions.
  • API and Configuration Updates: Updated API models, sink configurations, and path generation logic across various components to fully support the new use-table-id-as-path functionality.


Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new configuration option, UseTableIDAsPath, for cloud storage sinks. When enabled, it modifies the file path generation logic to use table IDs instead of schema and table names for organizing schema and data files, and skips writing database-level schema files. Additionally, it adds validation for ActionExchangeTablePartition DDL events. A minor logging inconsistency was noted where shardID was used instead of workerID.

lidezhu and others added 2 commits March 25, 2026 11:45
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@lidezhu
Collaborator

lidezhu commented Mar 25, 2026

@coderabbitai review

@coderabbitai
Contributor

coderabbitai bot commented Mar 25, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@ti-chi-bot ti-chi-bot bot added cherry-pick-approved Cherry pick PR approved by release team. and removed do-not-merge/cherry-pick-not-approved cherry-pick-approved Cherry pick PR approved by release team. labels Mar 25, 2026
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
pkg/sink/cloudstorage/config_test.go (1)

81-85: Add one malformed-value test for use-table-id-as-path.

Consider adding a case like ?use-table-id-as-path=not-bool and asserting an error, so parser behavior is locked down for invalid inputs.
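
A minimal sketch of the behavior such a test would lock down, using only the Go standard library. `parseUseTableIDAsPath` and the `s3://bucket/prefix` URIs are hypothetical; the real parsing lives in the sink config code and may differ.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

const useTableIDAsPathKey = "use-table-id-as-path"

// parseUseTableIDAsPath is a hypothetical helper. It returns nil when the
// query parameter is absent, a pointer to the parsed value when it is
// present, and an error when the value is not a valid boolean.
func parseUseTableIDAsPath(sinkURI string) (*bool, error) {
	u, err := url.Parse(sinkURI)
	if err != nil {
		return nil, err
	}
	raw := u.Query().Get(useTableIDAsPathKey)
	if raw == "" {
		return nil, nil
	}
	enabled, err := strconv.ParseBool(raw)
	if err != nil {
		return nil, fmt.Errorf("invalid boolean for %s: %w", useTableIDAsPathKey, err)
	}
	return &enabled, nil
}

func main() {
	uris := []string{
		"s3://bucket/prefix?use-table-id-as-path=true",
		"s3://bucket/prefix",
		"s3://bucket/prefix?use-table-id-as-path=not-bool",
	}
	for _, uri := range uris {
		v, err := parseUseTableIDAsPath(uri)
		switch {
		case err != nil:
			fmt.Println("error:", err)
		case v == nil:
			fmt.Println("unset")
		default:
			fmt.Println("enabled:", *v)
		}
	}
}
```

Returning a `*bool` rather than a `bool` keeps "absent" distinguishable from "false", which matters for the compatibility check discussed later in this review.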

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/sink/cloudstorage/config_test.go` around lines 81 - 85, Add a negative
test case to the existing test table (next to the "sink uri with
use-table-id-as-path" case) that uses the query param
`use-table-id-as-path=not-bool` and sets expectedErr to a non-empty string;
ensure the test invokes the same parser used by the file (the table-driven test
that calls the config parsing function) and asserts that parsing returns an
error (and optionally that the error message contains "use-table-id-as-path" or
"invalid boolean") so malformed boolean values are rejected.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@downstreamadapter/sink/cloudstorage/sink_test.go`:
- Around line 231-243: The test uses undefined ast.NewCIStr; replace all uses of
ast.NewCIStr with the existing parser_model.NewCIStr so the code
compiles—specifically update occurrences inside the tableInfo construction
(timodel.TableInfo, its Columns entries) and any other spots listed (around the
tableInfo variable and the later assertions at lines noted) to call
parser_model.NewCIStr instead of ast.NewCIStr.

In `@pkg/config/sink.go`:
- Around line 718-737: The compatibility bug: when applyParameterBySinkURI
parses a sink URI with the query param use-table-id-as-path it does not persist
that value into the SinkConfig (CloudStorageConfig.UseTableIDAsPath), so later
CheckUseTableIDAsPathCompatibility sees nil and allows an unintended flip; fix
applyParameterBySinkURI (and the analogous block around lines 1005-1056) to set
oldSinkConfig.CloudStorageConfig.UseTableIDAsPath = pointer(boolValue) (or
initialize CloudStorageConfig if nil) whenever the URI contains
use-table-id-as-path so the parsed boolean is stored in the SinkConfig and
CheckUseTableIDAsPathCompatibility will compare actual values instead of nil.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 7a7a0d7f-3acb-4efc-ac70-b0914cd8682a

📥 Commits

Reviewing files that changed from the base of the PR and between 55aca47 and 0d78442.

📒 Files selected for processing (12)
  • api/v2/model.go
  • api/v2/model_test.go
  • downstreamadapter/sink/cloudstorage/sink.go
  • downstreamadapter/sink/cloudstorage/sink_test.go
  • downstreamadapter/sink/cloudstorage/writer.go
  • pkg/config/sink.go
  • pkg/sink/cloudstorage/config.go
  • pkg/sink/cloudstorage/config_test.go
  • pkg/sink/cloudstorage/path.go
  • pkg/sink/cloudstorage/path_test.go
  • pkg/sink/cloudstorage/table_definition.go
  • pkg/sink/cloudstorage/table_definition_test.go

Comment on lines +718 to +737
```go
// CheckUseTableIDAsPathCompatibility checks the compatibility between sink config and sink URI.
func CheckUseTableIDAsPathCompatibility(
	sinkConfig *SinkConfig,
	useTableIDAsPathFromURI *bool,
) error {
	if sinkConfig == nil ||
		sinkConfig.CloudStorageConfig == nil ||
		sinkConfig.CloudStorageConfig.UseTableIDAsPath == nil ||
		useTableIDAsPathFromURI == nil {
		return nil
	}
	useTableIDAsPathFromConfig := sinkConfig.CloudStorageConfig.UseTableIDAsPath
	if util.GetOrZero(useTableIDAsPathFromConfig) == util.GetOrZero(useTableIDAsPathFromURI) {
		return nil
	}
	return cerror.ErrIncompatibleSinkConfig.GenWithStackByArgs(
		fmt.Sprintf("%s=%t", UseTableIDAsPathKey, util.GetOrZero(useTableIDAsPathFromURI)),
		fmt.Sprintf("%s=%t", UseTableIDAsPathKey, util.GetOrZero(useTableIDAsPathFromConfig)),
	)
}
```
Contributor


⚠️ Potential issue | 🟠 Major

Persist use-table-id-as-path in applyParameterBySinkURI too.

This compatibility path only sees the new URI. A changefeed created with ...?use-table-id-as-path=true still leaves oldSinkConfig.CloudStorageConfig.UseTableIDAsPath == nil, so a later update that drops the query param looks compatible here and silently flips the sink back to name-based paths.

💡 Suggested fix
 func (s *SinkConfig) applyParameterBySinkURI(sinkURI *url.URL) error {
 	if sinkURI == nil {
 		return nil
 	}
 
 	cfgInSinkURI := map[string]string{}
 	cfgInFile := map[string]string{}
 	params := sinkURI.Query()
@@
 	protocolFromURI := params.Get(ProtocolKey)
 	if protocolFromURI != "" {
 		if s.Protocol != nil && util.GetOrZero(s.Protocol) != protocolFromURI {
 			cfgInSinkURI[ProtocolKey] = protocolFromURI
 			cfgInFile[ProtocolKey] = util.GetOrZero(s.Protocol)
 		}
 		s.Protocol = util.AddressOf(protocolFromURI)
 	}
+
+	if IsStorageScheme(sinkURI.Scheme) {
+		useTableIDAsPathFromURI := params.Get(UseTableIDAsPathKey)
+		if useTableIDAsPathFromURI != "" {
+			enabled, err := strconv.ParseBool(useTableIDAsPathFromURI)
+			if err != nil {
+				return cerror.WrapError(cerror.ErrSinkURIInvalid, err)
+			}
+			if s.CloudStorageConfig == nil {
+				s.CloudStorageConfig = &CloudStorageConfig{}
+			}
+			if s.CloudStorageConfig.UseTableIDAsPath != nil &&
+				util.GetOrZero(s.CloudStorageConfig.UseTableIDAsPath) != enabled {
+				cfgInSinkURI[UseTableIDAsPathKey] = strconv.FormatBool(enabled)
+				cfgInFile[UseTableIDAsPathKey] = strconv.FormatBool(util.GetOrZero(s.CloudStorageConfig.UseTableIDAsPath))
+			}
+			s.CloudStorageConfig.UseTableIDAsPath = util.AddressOf(enabled)
+		}
+	}

Also applies to: 1005-1056

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/config/sink.go` around lines 718 - 737, The compatibility bug: when
applyParameterBySinkURI parses a sink URI with the query param
use-table-id-as-path it does not persist that value into the SinkConfig
(CloudStorageConfig.UseTableIDAsPath), so later
CheckUseTableIDAsPathCompatibility sees nil and allows an unintended flip; fix
applyParameterBySinkURI (and the analogous block around lines 1005-1056) to set
oldSinkConfig.CloudStorageConfig.UseTableIDAsPath = pointer(boolValue) (or
initialize CloudStorageConfig if nil) whenever the URI contains
use-table-id-as-path so the parsed boolean is stored in the SinkConfig and
CheckUseTableIDAsPathCompatibility will compare actual values instead of nil.
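
The concern in this comment can be reproduced with a simplified model. The sketch below uses a hypothetical `sinkConfig` struct and `check`/`persist` helpers rather than the real SinkConfig and cerror types; it only illustrates why a nil-tolerant comparison cannot detect a change unless the URI value was stored earlier.

```go
package main

import (
	"errors"
	"fmt"
)

// sinkConfig is a hypothetical, reduced stand-in for the real sink config.
type sinkConfig struct {
	UseTableIDAsPath *bool
}

var errIncompatible = errors.New("incompatible use-table-id-as-path")

// check mirrors the nil-tolerant comparison: when either side is unset,
// the values are treated as compatible.
func check(cfg *sinkConfig, fromURI *bool) error {
	if cfg == nil || cfg.UseTableIDAsPath == nil || fromURI == nil {
		return nil
	}
	if *cfg.UseTableIDAsPath != *fromURI {
		return errIncompatible
	}
	return nil
}

// persist stores the URI value in the config, which is the step the
// review comment asks for.
func persist(cfg *sinkConfig, fromURI *bool) {
	if fromURI != nil {
		v := *fromURI
		cfg.UseTableIDAsPath = &v
	}
}

func main() {
	on, off := true, false

	// Created with ...=true, but the value was never stored in the config.
	notPersisted := &sinkConfig{}
	fmt.Println("not persisted, update to false:", check(notPersisted, &off))

	// Same creation, with the value stored.
	persisted := &sinkConfig{}
	persist(persisted, &on)
	fmt.Println("persisted, update to false:", check(persisted, &off))
	fmt.Println("persisted, param dropped:", check(persisted, nil),
		"stored value:", *persisted.UseTableIDAsPath)
}
```

In this model the unpersisted config accepts a conflicting update silently, while the persisted one rejects it and keeps its stored value when the parameter is simply omitted.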

@ti-chi-bot

ti-chi-bot bot commented Mar 25, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: flowbehappy, lidezhu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added the approved label Mar 25, 2026
@ti-chi-bot ti-chi-bot bot merged commit b0d936b into pingcap:release-8.5 Mar 25, 2026
20 checks passed

4 participants