Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ddl,util: modify the collation of the column of multi-valued index to "binary" #46993

Merged
merged 2 commits into from Sep 20, 2023

Conversation

YangKeao
Copy link
Member

@YangKeao YangKeao commented Sep 14, 2023

What problem does this PR solve?

Issue Number: close #46717

What is changed and how it works?

This PR changes the collation of the column of multi-valued index to "binary". I think it's a better way to solve #46717. It doesn't need to modify tipb/tikv and the influence is much more limited.

Background

  1. The design doc of multi-valued index tells that the collation of multi-valued index string should be "binary" (MySQL uses "utf8mb4_0900_as_cs", which is not supported by us for now).
  2. The TiDB actually uses the "binary" collation to query the index (see planner: set binary collation for mv column #40644), and the tailing whitespaces in multi-valued index string is not trimmed, so we didn't find any correctness issue about this.
  3. The "restored data" is a part of the value of an index K-V. Because the generated key of a string datum can be different from the original value according to the collation, so we need the help of the "restored data" to reconstruct the original value. For example, as the utf8mb4_bin will trim the tailing whitespace, the "restored data" of the utf8mb4_bin string is an integer, which represents the count of tailing whitespaces.
  4. The column corresponding to multi-valued index is written as utf8mb4_bin collation, but the restored data is not written and the tailing whitespaces are not trimmed. When the TiKV is about to restore the original value from the index, it'll fail to read the "restored data" and report an error.

This issue will only happen in compound index which contains at least one string column (which will actually write restored data), because when the index value is too short, both TiDB and TiKV will just regard it as an "old format" and skip the restore logic, then nothing is reported.

Q&A

  1. Is it fine to modify the collation directly? Yes, the multi-valued index is written without restored data, and the whitespace in the key is also kept. For the index lookup, the planner always uses the "binary" collation (see planner: set binary collation for mv column #40644). It's fine to read the value as binary collations.
  2. How does it solve the problem in Array field should be passed to the columnInfo #46717? By setting the collation to "binary", TiKV will not try to read the restored value.

Another question is, why it's a better solution than pingcap/tipb#317 tikv/tikv#15537 #46718 ?

  1. executor, expression: add support for array type tikv/tikv#15537 gives the wrong type for array columns. Though for now array columns are only used in multi-valued index, the original design of TiDB marks it as a JSON type. If it's read for index, it'll use the original type. Theoretically, TiKV doesn't need to know whether this K-V is an array or not, as it'll only handle the original value. executor, expression: add support for array type tikv/tikv#15537 lets the TiKV regard it as a JSON, but it's actually a string (or integer, or something else), which can cause bugs in the future.
  2. The influence and logic of this PR is much simpler: use "binary" collation for multi-valued index. We don't need to consider the interaction between tidb/tikv, and how different query engines handle the types.

When I was drafting pingcap/tipb#317 tikv/tikv#15537 #46718, I didn't fully understand the design of original multi-valued index type. Now I'm pretty confident about this PR 👍 .

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

You can test whether #46717 is solved by using the following SQL:

create table t (pk varchar(4) primary key clustered, j json, str varchar(255), value int, key idx((cast(j as char(100) array)), str));
insert into t values ("1", '["a"]', 'b', 1);
select * from t use index(idx) where "a" member of (j);

I'll add them to tidb-test later, as I want to test across tidb and tikv 🤔 . (I'm not sure whether tests/integrationtest will run with TiKV after merged. I'll confirm later. If so, tests/integrationtest looks better).

Release note

Change the collation of the column of multi-valued index to "binary"

@ti-chi-bot ti-chi-bot bot added release-note do-not-merge/needs-tests-checked size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 14, 2023
@codecov
Copy link

codecov bot commented Sep 14, 2023

Codecov Report

Merging #46993 (8bcc10a) into master (3f3b07e) will decrease coverage by 0.1845%.
Report is 6 commits behind head on master.
The diff coverage is 100.0000%.

Additional details and impacted files
@@               Coverage Diff                @@
##             master     #46993        +/-   ##
================================================
- Coverage   72.9609%   72.7765%   -0.1845%     
================================================
  Files          1338       1359        +21     
  Lines        399186     406172      +6986     
================================================
+ Hits         291250     295598      +4348     
- Misses        89120      91855      +2735     
+ Partials      18816      18719        -97     
Flag Coverage Δ
integration 30.5134% <0.0000%> (?)
unit 73.1138% <100.0000%> (+0.1528%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
dumpling 53.9913% <ø> (ø)
parser 84.9728% <ø> (+0.0107%) ⬆️
br 48.9936% <ø> (-3.9485%) ⬇️

@YangKeao
Copy link
Member Author

/retest

@YangKeao
Copy link
Member Author

/retest

Signed-off-by: Yang Keao <yangkeao@chunibyo.icu>
@YangKeao
Copy link
Member Author

/retest

1 similar comment
@YangKeao
Copy link
Member Author

/retest

util/misc.go Show resolved Hide resolved
Copy link
Contributor

@zimulala zimulala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@YangKeao
Copy link
Member Author

YangKeao commented Sep 20, 2023

/hold

This PR will make import into import no rows (if tidb_enable_dist_task is enabled). Still investigating.

@ti-chi-bot ti-chi-bot bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 20, 2023
Copy link
Member

@bb7133 bb7133 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ti-chi-bot
Copy link

ti-chi-bot bot commented Sep 20, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bb7133, zimulala

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot
Copy link

ti-chi-bot bot commented Sep 20, 2023

[LGTM Timeline notifier]

Timeline:

  • 2023-09-20 03:01:22.340200273 +0000 UTC m=+659248.307788323: ☑️ agreed by zimulala.
  • 2023-09-20 06:22:45.400166666 +0000 UTC m=+671331.367754701: ☑️ agreed by bb7133.

@YangKeao
Copy link
Member Author

/unhold

I think it's a bug of dist task framework. It occurs less frequently in master, but I think it's still possible. I'll confirm it with the developer of dist task framework.

@ti-chi-bot ti-chi-bot bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 20, 2023
@YangKeao
Copy link
Member Author

YangKeao commented Sep 20, 2023

/unhold

I think it's a bug of dist task framework. It occurs less frequently in master, but I think it's still possible. I'll confirm it with the developer of dist task framework.

The reported bug (importing succeed with 0 row) has been fixed by #46957 (though, maybe not intentional).

@YangKeao
Copy link
Member Author

/retest

2 similar comments
@YangKeao
Copy link
Member Author

/retest

@YangKeao
Copy link
Member Author

/retest

@ti-chi-bot ti-chi-bot bot merged commit 9626556 into pingcap:master Sep 20, 2023
13 of 16 checks passed
yibin87 pushed a commit to yibin87/tidb that referenced this pull request Oct 31, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved lgtm release-note size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Array field should be passed to the columnInfo
3 participants