collation: add pinyin collation for chinese charset support #20790

Closed
wants to merge 8 commits into from

@xiongjiwei xiongjiwei commented Nov 3, 2020

What problem does this PR solve?

Issue Number: This PR provides an alternative for issues #19747 and #10192

Problem Summary:

What is changed and how it works?

Many users in China would like to sort Chinese characters in pinyin order.
Proposal: #19984

How it Works:

We use a large lookup table to store each character's weight. To compare two strings, we look up and compare the weights of their characters one by one.
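
As a rough illustration of the comparison described above, here is a minimal sketch in Go. It is not the code from this PR: `compareByWeight`, `toyWeight`, and the three-entry table are hypothetical names and values made up for the example, and details such as trailing-space handling are ignored.

```go
package main

import "fmt"

// toyTable is a hypothetical stand-in for the real weight table.
// The ranks follow pinyin order: 阿 (a) < 波 (bo) < 次 (ci).
var toyTable = map[rune]uint32{'阿': 1, '波': 2, '次': 3}

// toyWeight returns the weight of r: rank + 0xFFA00000 for characters in
// the toy table, and the rune value itself otherwise (an assumption made
// only to keep the example small).
func toyWeight(r rune) uint32 {
	if rank, ok := toyTable[r]; ok {
		return 0xFFA00000 + rank
	}
	return uint32(r)
}

// compareByWeight compares a and b character by character using weight.
// It returns -1, 0, or 1. If one string is a prefix of the other, the
// shorter string sorts first.
func compareByWeight(a, b string, weight func(rune) uint32) int {
	ra, rb := []rune(a), []rune(b)
	for i := 0; i < len(ra) && i < len(rb); i++ {
		wa, wb := weight(ra[i]), weight(rb[i])
		if wa < wb {
			return -1
		}
		if wa > wb {
			return 1
		}
	}
	switch {
	case len(ra) < len(rb):
		return -1
	case len(ra) > len(rb):
		return 1
	}
	return 0
}

func main() {
	// 波 has a smaller weight than 次, so the first string sorts first.
	fmt.Println(compareByWeight("阿波", "阿次", toyWeight))
}
```

Passing the weight function as a parameter keeps the comparison loop independent of how the table is built.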

All Chinese characters are sorted according to the zh.xml file in CLDR 24 and assigned ranks from 1 to n. The weight of each character is its rank plus 0xFFA00000.

Non-Chinese characters are converted to their GB18030 codepoints, and:

  1. If the codepoint is not greater than 0xFFFF, the codepoint itself is used as the weight.
  2. If the codepoint is greater than 0xFFFF, the weight is 0xFF000000 + codepoint + 0x1E248.
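
The weight rules above can be sketched as the following Go arithmetic. The function and constant names are hypothetical and not taken from the PR. The sketch takes the pinyin rank or the GB18030 codepoint as an already-computed number, since the description does not say how either is derived, and it assumes the codepoint is small enough that the sum fits in 32 bits.

```go
package main

import "fmt"

const (
	hanBase    uint32 = 0xFFA00000 // added to the pinyin rank of a Chinese character
	wideBase   uint32 = 0xFF000000 // added to codepoints greater than 0xFFFF
	wideOffset uint32 = 0x1E248    // extra offset for codepoints greater than 0xFFFF
)

// hanWeight returns the weight of a Chinese character whose 1-based rank
// in pinyin order is rank.
func hanWeight(rank uint32) uint32 {
	return hanBase + rank
}

// nonHanWeight returns the weight of a non-Chinese character from its
// GB18030 codepoint. Assumption: code is small enough that the sum does
// not overflow uint32 (Go would wrap around silently).
func nonHanWeight(code uint32) uint32 {
	if code <= 0xFFFF {
		return code
	}
	return wideBase + code + wideOffset
}

func main() {
	fmt.Printf("%#X\n", hanWeight(1))          // first Chinese character in pinyin order
	fmt.Printf("%#X\n", nonHanWeight(0x41))    // codepoint not greater than 0xFFFF
	fmt.Printf("%#X\n", nonHanWeight(0x10000)) // codepoint greater than 0xFFFF
}
```

Under these rules, every character whose codepoint is at most 0xFFFF gets a weight below 0xFF000000, so it sorts before all Chinese characters.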

Benchmark

goos: linux
goarch: amd64
pkg: github.com/pingcap/tidb/util/collate
BenchmarkUtf8mb4Bin_CompareLong-8                   2907            417144 ns/op
BenchmarkUtf8mb4GeneralCI_CompareLong-8               58          20108234 ns/op
BenchmarkUtf8mb4UnicodeCI_CompareLong-8               55          22296453 ns/op
BenchmarkUtf8mb4Pinyin_CompareLong-8                  68          17060721 ns/op
BenchmarkUtf8mb4Bin_KeyLong-8                       1724            692919 ns/op
BenchmarkUtf8mb4GeneralCI_KeyLong-8                   88          12352256 ns/op
BenchmarkUtf8mb4UnicodeCI_KeyLong-8                   82          14197344 ns/op
BenchmarkUtf8mb4Pinyin_KeyLong-8                      90          12048163 ns/op

The test string length is 2<<20 (2,097,152).

Check List

Tests

  • Unit test
  • Integration test

Release note

  • Add pinyin order support

@xiongjiwei xiongjiwei changed the title from "*: Pinyin order imp" to "collation: add pinyin collation for chinese charset support" Nov 4, 2020
@xiongjiwei xiongjiwei marked this pull request as ready for review November 4, 2020 08:42
@xiongjiwei xiongjiwei requested a review from a team as a code owner November 4, 2020 08:42
@xiongjiwei xiongjiwei requested review from XuHuaiyu and removed request for a team November 4, 2020 08:42

@bb7133 bb7133 left a comment

LGTM

@ti-srebot ti-srebot added the status/LGT1 Indicates that a PR has LGTM 1. label Nov 5, 2020
@guo-shaoge guo-shaoge removed the request for review from XuHuaiyu June 10, 2021 01:52
@ti-chi-bot

@xiongjiwei: PR needs rebase.

@ti-chi-bot ti-chi-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 31, 2021
@xiongjiwei xiongjiwei closed this Aug 1, 2021
@xiongjiwei xiongjiwei deleted the pinyin-order-imp branch September 23, 2022 16:09
Labels
component/expression needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. status/LGT1 Indicates that a PR has LGTM 1.
4 participants