Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature request: Support CHECKSUM TABLE command #1895

Open
sjmudd opened this issue Oct 28, 2016 · 10 comments
Open

Feature request: Support CHECKSUM TABLE command #1895

sjmudd opened this issue Oct 28, 2016 · 10 comments
Labels
feature/discussing This feature request is discussing among product managers help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. type/compatibility type/enhancement

Comments

@sjmudd
Copy link

sjmudd commented Oct 28, 2016

When importing data into TiDB we want to be sure the data is correct. A convenient command in MySQL is CHECKSUM TABLE and this command is not currently supported by TiDB.

Adding it would make it easy to confirm if the data loaded in from an external source matches or not.

Example I did something like:

$ ssh mysqlhost sudo mysqldump test SomeTable | mysql -h tidbhost -P 4000 -u root -D test

MySQL 5.6:

root@mysqlhost [test]> checksum table SomeTable;
+-----------+------------+
| Table     | Checksum   |
+-----------+------------+
| SomeTable | 2867272629 |
+-----------+------------+
1 row in set (0.01 sec)

TiDB:

root@tidbhost [test]> checksum table SomeTable;
ERROR 1105 (HY000): line 0 column 8 near " table SomeTable"
root@tidbhost [test]> 

So providing some sort of CHECKSUM table command would be great.

@coocood
Copy link
Member

coocood commented Oct 28, 2016

@sjmudd
Thank you for your suggestion.
This is a very useful feature.
We will let you know when starts to support this feature.

But we can't use the same algorithm to calculate the checksum value because we store data in a different format than MySQL, so the checksum value for the table imported from MySQL to TiDB will be different.

@ngaut ngaut added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Oct 23, 2018
@morgo
Copy link
Contributor

morgo commented Mar 17, 2019

I stumbled across ADMIN CHECKSUM TABLE by accident today. It looks like this is used by lightning. The output format is not CHECKSUM TABLE compatible though:

mysql> ADMIN CHECKSUM TABLE trips;
+-----------+------------+----------------------+-----------+-------------+
| Db_name   | Table_name | Checksum_crc64_xor   | Total_kvs | Total_bytes |
+-----------+------------+----------------------+-----------+-------------+
| bikeshare | trips      | 11248831942383972982 |  19117643 |  2597048988 |
+-----------+------------+----------------------+-----------+-------------+
1 row in set (5.24 sec)

The current workaround to compare data between TiDB and MySQL data sources is to use md5sum on a result ordered by primary key. For example:

morgo@ryzen:/mnt/evo970/data-sets/bikeshare-data$ ~/sandboxes/msb_8_0_15/use -BNe "SELECT * FROM bikeshare.trips ORDER BY trip_id" | md5sum
9d893b153e6476f148f3564d801e6ed6  -
morgo@ryzen:/mnt/evo970/data-sets/bikeshare-data$ mysql -BNe "SELECT * FROM bikeshare.trips ORDER BY trip_id" | md5sum
9d893b153e6476f148f3564d801e6ed6  -

I will leave this request open as a request for a MySQL-compatible CHECKSUM TABLE command.

@bb7133
Copy link
Member

bb7133 commented Jun 11, 2020

Close this issue since the feature is supported.

@bb7133 bb7133 closed this as completed Jun 11, 2020
@bb7133 bb7133 reopened this Jun 12, 2020
@ghost ghost added this to Need Triage in Feature Request Kanban via automation Jul 23, 2020
@zz-jason
Copy link
Member

zz-jason commented Jul 27, 2020

I will leave this request open as a request for a MySQL-compatible CHECKSUM TABLE command.

There are two challenges to implement a MySQL-Compatible CHECKSUM TABLE command:

  1. use the same checksum algorithm as MySQL.
  2. tidb-lighting must be adapted to use the same checksum algorithm if it still uses the ADMIN CHECKSUM TABLE <table_name> statement provided in TiDB.

/cc @kennytm, @IANTHEREAL

@zz-jason
Copy link
Member

From MySQL document about innodb_checksum_algorithm, seems the checksum default algorithm is crc32:

Specifies how to generate and verify the checksum stored in the disk blocks of InnoDB tablespaces. crc32 is the default value as of MySQL 5.7.7.

The value innodb is backward-compatible with earlier versions of MySQL. The value crc32 uses an algorithm that is faster to compute the checksum for every modified block, and to check the checksums for each disk read. It scans blocks 32 bits at a time, which is faster than the innodb checksum algorithm, which scans blocks 8 bits at a time. The value none writes a constant value in the checksum field rather than computing a value based on the block data. The blocks in a tablespace can use a mix of old, new, and no checksum values, being updated gradually as the data is modified; once blocks in a tablespace are modified to use the crc32 algorithm, the associated tables cannot be read by earlier versions of MySQL.

@zz-jason zz-jason moved this from Need Triage to Need Discussion in Feature Request Kanban Jul 27, 2020
@kennytm
Copy link
Contributor

kennytm commented Jul 27, 2020

I don't see how Lightning or ADMIN CHECKSUM TABLE interferes with this feature.

ADMIN CHECKSUM TABLE is calculated based on the encoded KV pairs, and includes contributions from the indices. CHECKSUM TABLE is based on the SQL values and does not count indices. Therefore, Lightning will not switch to use CHECKSUM TABLE even if it is available.

@zz-jason zz-jason added feature/reviewing This feature request is reviewing by product managers feature/discussing This feature request is discussing among product managers and removed feature/reviewing This feature request is reviewing by product managers labels Aug 6, 2020
@scsldb
Copy link

scsldb commented Aug 28, 2020

We need to investigate the user scenario in depth, and then make a decision after the research is completed.

@ghost
Copy link

ghost commented Aug 28, 2020

As @kennytm notes, the use-case is not for lightning, but could be used for DM.

Because of SQL modes, timezones and character sets logical restores could corrupt data. This is true even if TiDB is ostensibly behaving correctly (i.e. it could be misconfiguration). CHECKSUM TABLE provides a way of logically comparing two systems as having the same data.

MySQL users often use it during upgrades (MySQL <-> MySQL), for example logically dumping a 5.7 system and then restoring on 8.0 and comparing the data. I have found problematic behavior changes in MySQL this way. I can't find the bug now, but I think it was in 5.0 it started storing negative infinity values differently.

The feature request here is to implement the same algorithm as MySQL so it can be used for a MySQL <-> TiDB upgrade. That is, it could either be used by DM or a manual inspection by the DBA, who is concerned about such logical corruptions.

@ghost
Copy link

ghost commented Aug 28, 2020

From MySQL document about innodb_checksum_algorithm, seems the checksum default algorithm is crc32:

Specifies how to generate and verify the checksum stored in the disk blocks of InnoDB tablespaces. crc32 is the default value as of MySQL 5.7.7.

The value innodb is backward-compatible with earlier versions of MySQL. The value crc32 uses an algorithm that is faster to compute the checksum for every modified block, and to check the checksums for each disk read. It scans blocks 32 bits at a time, which is faster than the innodb checksum algorithm, which scans blocks 8 bits at a time. The value none writes a constant value in the checksum field rather than computing a value based on the block data. The blocks in a tablespace can use a mix of old, new, and no checksum values, being updated gradually as the data is modified; once blocks in a tablespace are modified to use the crc32 algorithm, the associated tables cannot be read by earlier versions of MySQL.

This is a different type of checksum (page checksums on the physical format). It is not related to CHECKSUM TABLE. The manual does not explicitly state the format.

@kennytm
Copy link
Contributor

kennytm commented Mar 24, 2021

In Sync-diff-inspector, we have a custom checksum command based on:

SELECT bit_xor(
    CAST(crc32(
        concat_ws(',',
            col1, col2, col3, …, colN,
            concat(isnull(col1), isnull(col2), …, isnull(colN))
        )
    ) AS UNSIGNED)
)
FROM t;

(for sure this is not a general solution because the concat_ws result cannot exceed @@max_allowed_packet, and also it relies on the stability of data → string conversion.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature/discussing This feature request is discussing among product managers help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. type/compatibility type/enhancement
Projects
Feature Request Kanban
  
Need Discussion
Development

No branches or pull requests

8 participants