Implement end to end checksum for TiDB and TiCDC #42747

Open
13 of 17 tasks
cfzjywxk opened this issue Apr 3, 2023 · 0 comments

cfzjywxk commented Apr 3, 2023

Background

TiCDC is an important component for TiDB to synchronize data to various downstream systems. When synchronizing data to downstream systems, data integrity is especially important. However, TiCDC does not support end-to-end data integrity verification yet.

Spec

Provide the following cluster-level boolean option on the TiDB side.

tidb_enable_row_level_checksum = [true|false]  # the default value is false.
SET GLOBAL tidb_enable_row_level_checksum = true;  

After the customer enables this option, every write to a row in a non-system database appends an invisible field that stores a checksum computed from the row's content. This field exists only for data correctness checking and is transparent to the customer.

TiCDC and end users would use this checksum value to verify the data integrity.

Development tracking for the TiDB part

  • Add a checksum function tidb_row_checksum to return the checksum value of a row. *: add tidb_row_checksum() as a builtin function #43479
  • Add checksum-related utilities in TiDB. CRC32 is used in this case; the calculation method is covered by the following changes. util: extend row format with checksum #42859 util: reimplement row level checksum utilities #43141
  • Let TiDB be aware of the origin state (none or public) of a column if its current state is not public. Note: two checksums are always appended if any column is in a non-public state, so there is no need to know the direction of the state transition.
  • Support writing rows with checksum values *: support writing rows with checksum values #43163
    • Add a global system variable tidb_enable_row_level_checksum to enable or disable the checksum calculation when inserting new rows. When it's enabled, multi-schema change will be blocked.
    • Make it work with the DDL add column schema change, and generate two checksum values if necessary.
    • Make it work with the DDL drop column schema change, and generate two checksum values if necessary.
    • Make it work with the DDL modify column schema change, and generate two checksum values if necessary.
    • Calculate the row checksum in the tablecodec package when the EncodeRow function is used. The CRC32 result is updated for each column while encodeRowCols executes, and the final checksum is returned at the end.
    • Append the checksum header and checksum result information to the encoded row according to the extended row format protocol.
  • Keep read request processing compatible.
  • Add telemetry for the new feature.
  • Compatibility tests: the checksum extension should not break backward compatibility, and downgrade must be supported when the checksum row format is used.
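
As a rough illustration of the per-column CRC32 calculation and the "append header plus checksum" step tracked above, here is a minimal Go sketch. It is not the code from the linked PRs: the `column` type, the `rowChecksum` and `appendChecksum` helpers, the `checksumHeader` value, the column ordering by ID, and the little-endian trailer layout are all assumptions made for illustration. Only the use of CRC32 comes from this issue.

```go
package main

import (
	"encoding/binary"
	"hash/crc32"
	"sort"
)

// column is a hypothetical stand-in for one already-encoded column value.
type column struct {
	ID   int64
	Data []byte
}

// rowChecksum folds the bytes of every column into one CRC32 (IEEE) value.
// Columns are visited in ascending ID order (an assumption) so the result
// does not depend on the order in which the caller supplies them.
func rowChecksum(cols []column) uint32 {
	sorted := make([]column, len(cols))
	copy(sorted, cols)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].ID < sorted[j].ID })

	var sum uint32
	for _, c := range sorted {
		// crc32.Update continues a running checksum with more bytes.
		sum = crc32.Update(sum, crc32.IEEETable, c.Data)
	}
	return sum
}

// checksumHeader is a hypothetical one-byte marker; the real extended row
// format defines its own header layout.
const checksumHeader byte = 0x01

// appendChecksum returns a copy of the encoded row followed by the header
// byte and the checksum as 4 little-endian bytes (layout is an assumption).
func appendChecksum(row []byte, sum uint32) []byte {
	var buf [4]byte
	binary.LittleEndian.PutUint32(buf[:], sum)

	out := make([]byte, 0, len(row)+1+len(buf))
	out = append(out, row...)
	out = append(out, checksumHeader)
	return append(out, buf[:]...)
}
```

A reader such as TiCDC could, under the same assumptions, recompute `rowChecksum` from the decoded columns and compare it with the trailing four bytes. During an add, drop, or modify column change, the same helper would be called twice, once per schema version, to produce the two checksum values mentioned above.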

Development tracking for the TiKV part

Development tracking for the TiCDC part

ti-chi-bot pushed a commit to ti-chi-bot/tidb that referenced this issue May 5, 2023