support `utf8mb4_zh_0900_as_cs` for `utf8mb4` character set #19747

zz-jason · 2020-09-03T04:23:34Z

Feature Request

Is your feature request related to a problem? Please describe:

It's unable to order by a column based on it's pinyin order. For example:

create table t(
	a varchar(100)
)
charset = 'utf8mb4'
collate = 'utf8mb4_zh_0900_as_cs';

insert some data:

insert into t values ("中文"), ("啊中文");

a query requires to order by column a in its pinyin order:

select * from t order by a;

Describe the feature you'd like:

Support the utf8mb4_zh_0900_as_cs collate like MySQL, the output should be:

MySQL(root@127.0.0.1:test) > select * from t order by a;
+-----------+
| a         |
+-----------+
| 啊中文    |
| 中文      |
+-----------+
2 rows in set (0.00 sec)

Describe alternatives you've considered:

N/A

Teachability, Documentation, Adoption, Migration Strategy:

N/A

The text was updated successfully, but these errors were encountered:

zz-jason · 2020-09-03T04:23:47Z

@bb7133 PTAL

zz-jason · 2020-09-03T04:27:22Z

BTW, MySQL supports lots of collations for utf8mb4 charset, in order to support these collations in TiDB in the future, we need a flexible execution engine that allows us to easily add a new collation or even add a new character set. @SunRunAway PTAL

wjhuang2016 · 2020-09-03T04:53:43Z

Adding a new collation is easy, but adding a new charset is hard now.

xiongjiwei · 2020-09-10T11:24:01Z

MySQL has a framework to support language specific collation, it based on UCA v900 and accept different tailoring rules.
we can do same with mysql, add a framework for language specific collation, it will be easier to adding other language specific collation in future, or we implement collation separately, this way has better performance, easier to optimize.

zz-jason · 2020-09-10T11:34:13Z

or we implement collation separately, this way has better performance, easier to optimize.

@xiongjiwei How much performance would we gain compared with the unified collation framework?

xiongjiwei · 2020-09-10T12:00:42Z

or we implement collation separately, this way has better performance, easier to optimize.

@xiongjiwei How much performance would we gain compared with the unified collation framework?

@zz-jason I have no idea, I am not deep in this, but I instinctively feel that if we implement framework very carefully, we may only loss performance when startup, it is hard but still has a way.

zz-jason · 2020-09-10T12:23:46Z

we may only loss performance when startup

Do you mean when the tidb-server instance is started up?

xiongjiwei · 2020-09-10T12:24:46Z

Do you mean when the tidb-server instance is started up?

right

zz-jason · 2020-09-10T12:33:31Z

OK. I think it's affordable. Would you like to write a proposal for this framework or other algorithms?

xiongjiwei · 2020-09-10T12:44:11Z

OK. I think it's affordable. Would you like to write a proposal for this framework or other algorithms?

sure

bb7133 · 2020-09-10T15:39:08Z

This feature request is related to #10192 because I believe they actually aim to the same goal - the 'pinyin' order of Chinese character.

Basically we've two alternatives:

Implement utf8mb4_zh_0900_as_cs: we need to implement the standard UCA algorithm and the tailoring rules of utf8mb4_zh_0900_as_cs in MySQL 8, which seems non-trivial. The implementation of MySQL looks complicated with weight reorders, magic numbers, and some sort of thicks: https://github.com/mysql/mysql-server/blob/8.0/strings/ctype-uca.cc#L697. A lot of effort should be paid to know the exact meaning of its implementation.
Before MySQL 8 and utf8mb4_zh_0900_as_cs, almost all MySQL users archive 'pinyin' order by order by convert(... using gbk) asc(code point of GBK is ordered by pinyin). It is possible(and easy) for us to implement an additional collation that sort the Chinese character exactly the same with GBK, named like tidb_utf8_general_zh_cs. This satisfies the requirements of most users but it is not compatible with MySQL. For those users need the replication from TiDB to MySQL, we need to comment the additional this collation to ensure the compatibility will not be broken.

ilovesoup · 2020-09-10T15:54:10Z

This feature request is related to #10192 because I believe they actually aim to the same goal - the 'pinyin' order of Chinese character.

Basically we've two alternatives:

Implement utf8mb4_zh_0900_as_cs: we need to implement the standard UCA algorithm and the tailoring rules of utf8mb4_zh_0900_as_cs in MySQL 8, which seems non-trivial. The implementation of MySQL looks complicated with weight reorders, magic numbers, and some sort of thicks: https://github.com/mysql/mysql-server/blob/8.0/strings/ctype-uca.cc#L697. A lot of effort should be paid to know the exact meaning of its implementation.

Before MySQL 8 and utf8mb4_zh_0900_as_cs, almost all MySQL users archive 'pinyin' order by order by convert(... using gbk) asc(code point of GBK is ordered by pinyin). It is possible(and easy) for us to implement an additional collation that sort the Chinese character exactly the same with GBK, named like tidb_utf8_general_zh_cs. This satisfies the requirements of most users but it is not compatible with MySQL. For those users need the replication from TiDB to MySQL, we need to comment the additional this collation to ensure the compatibility will not be broken.

I feel like option 2 is better for now considering the cost. It does not prevent us go for optional 1 in the future if needed.

xiongjiwei · 2020-09-14T01:49:54Z

hi, @zz-jason, consider opinion of bb7133 , it is much easier to

implement an additional collation that sort the Chinese character exactly the same with GBK

if we just need pinyin order.

if we need full compatibility with MySQL, we may have a lot to do. we should discuss which way to achieve pinyin order. I'd like to write a proposal if we choose full compatibility with MySQL

zz-jason · 2020-09-14T06:19:47Z

It is possible(and easy) for us to implement an additional collation that sort the Chinese character exactly the same with GBK, named like tidb_utf8_general_zh_cs.

@bb7133 @ilovesoup It's indeed easy to be implemented, but introducing a new collation tidb_utf8_general_zh_cs brings additional maintenance effects. For example:

what's the difference between tidb_utf8_general_zh_cs and utf8mb4_zh_0900_as_cs, tidb_utf8_general_zh_cs and order by convert(... using gbk) asc, what collation should the users choose if TiDB provides both tidb_utf8_general_zh_cs and utf8mb4_zh_0900_as_cs?
would the behavior of tidb_utf8_general_zh_cs be changed for newer TiDB versions?
how to migrate the applications using tidb_utf8_general_zh_cs to other databases?

bb7133 · 2020-09-29T01:30:05Z

what's the difference between tidb_utf8_general_zh_cs and utf8mb4_zh_0900_as_cs, tidb_utf8_general_zh_cs and order by convert(... using gbk) asc, what collation should the users choose if TiDB provides both tidb_utf8_general_zh_cs and utf8mb4_zh_0900_as_cs?

We aim to make the Chinese characters of tidb_utf8_general_zh_ci ordered by pinyin and that's all. It may be different from GBK(actually, gbk_chinese_ci collation)/utf8mb4_zh_0900_as_cs for non-Chinese characters. This is to make the implementation quick and easy.

would the behavior of tidb_utf8_general_zh_cs be changed for newer TiDB versions?

It will not change, no collation will be changed since it is released.

how to migrate the applications using tidb_utf8_general_zh_ci to other databases?

We may not support it and this is not something new, considering there are some features in TiDB that doesn't exists in MySQL(Sequences/Placement Rules...). However, if the tidb_utf8_general_zh_ci becomes a trouble for migration, we can comment it and fallback to the 'default collation' when replicating the data to other databases.

kennytm · 2020-10-20T21:04:50Z

The algorithm of utf8mb4_zh_0900_as_cs is described in https://dev.mysql.com/worklog/task/?id=11825. The UCA is implemented in https://pkg.go.dev/golang.org/x/text@v0.3.3/collate, but the table version is 6.2.0 (CLDR v23), rather than 9.0.0 (CLDR v28). Maybe we can make it work from a static 9.0.0 table or adjust the reorder parameters for the golang 6.2.0 range.

Example program:

package main

import (
	"fmt"
	"golang.org/x/text/collate"
	"golang.org/x/text/language"
)

func main() {
	testCases := []string{"一", "二", "三", "四", "五", "1", "one", "yi", "YI", "yī", "🤔"}
	collator := collate.New(language.Chinese)
	var buffer collate.Buffer
	for _, tc := range testCases {
		fmt.Printf("%s\t%x\n", tc, collator.KeyFromString(&buffer, tc))
		buffer.Reset()
	}
}

Output:

一      9689db00000020000002
二      961bda00000020000002
三      9664f900000020000002
四      966c3a00000020000002
五      967a0a00000020000002
1       14e700000020000002
one     1771174f164c00000020002000200000020202
yi      188016cd00000020002000000202
YI      188016cd00000020002000000808
yī      188016cd00000020001c00000202
🤔      86f91400000020000002

on MySQL 8.0:

select column_0, weight_string(column_0 collate utf8mb4_zh_0900_as_cs) 
from (values 
    row('一'), row('二'), row('三'), row('四'), row('五'), row('1'), row('one'), row('yi'), row('YI'), row('yī'), row('🤔')
) a;

output:

+----------+--------------------------------------------------------------------------------------------------------------+
| column_0 | weight_string(column_0 collate utf8mb4_zh_0900_as_cs)                                                        |
+----------+--------------------------------------------------------------------------------------------------------------+
| 一       | 0xA63E0000002000000002                                                                                       |
| 二       | 0x38290000002000000002                                                                                       |
| 三       | 0x815D0000002000000002                                                                                       |
| 四       | 0x88A40000002000000002                                                                                       |
| 五       | 0x966E0000002000000002                                                                                       |
| 1        | 0x1C3E0000002000000002                                                                                       |
| one      | 0xBF5ABF36BE2700000020002000200000000200020002                                                               |
| yi       | 0xC088BEAF000000200020000000020002                                                                           |
| YI       | 0xC088BEAF000000200020000000080008                                                                           |
| yī       | 0xC088BEAF00000020001F0116000000020002                                                                       |
| 🤔         | 0x15AD0000002000000002                                                                                       |
+----------+--------------------------------------------------------------------------------------------------------------+

xiongjiwei · 2020-10-21T02:35:20Z

Maybe we can make it work from a static 9.0.0 table or adjust the reorder parameters for the golang 6.2.0 range.

The performance is not as good as implement it by hand, only about 30% under the default table.
The collation should be also implemented in TiKV, and the results must same, it means, we must implement it by ourselves in TiKV
As a quick implementation of pinyin order, convert to gbk and use gbk_chinese_ci is enough, we will support utf8mb4_zh_0900_as_cs in some day

@kennytm

kennytm · 2020-10-21T03:46:07Z

As a quick implementation of pinyin order, convert to gbk and use gbk_chinese_ci is enough, we will support utf8mb4_zh_0900_as_cs in some day

well this issue is about utf8mb4_zh_0900_as_cs, the two don't have to interfere each other.

The collation should be also implemented in TiKV, and the results must same, it means, we must implement it by ourselves in TiKV

that's unfortunate, but there's still no need to reimplement UCA by hand in Rust, reusing ICU's ucol_getSortKey should be fine (and so the Go side should also use an ICU Cgo binding).

wjhuang2016 · 2020-10-21T04:06:58Z

The Go's sortKey is not the same as ICU's sortKey. The sortKey can be implement-defined. Besides, collate in Go is still unreleased, and it tends to buggy:golang/go#38726.

dveeden · 2023-08-01T11:08:13Z

Looks related to #37566 / #45650, maybe this can be done once utf8mb4_0900_ai_ci support is merged.

zz-jason added the type/feature-request This is a feature requests on the product label Sep 3, 2020

github-actions bot added this to Need Triage in Feature Request Kanban Sep 3, 2020

zz-jason added the feature/accepted This feature request is accepted by product managers label Sep 3, 2020

zz-jason moved this from Need Triage to Accepted in Feature Request Kanban Sep 3, 2020

bb7133 added this to Issue Backlog: Need Triage in SIG Infra Kanban via automation Sep 4, 2020

pingcap deleted a comment from ti-srebot Sep 4, 2020

xiongjiwei mentioned this issue Sep 14, 2020

doc/design: add pinyin order collation for utf8mb4 charset #19984

Merged

bb7133 added the sig/sql-infra SIG: SQL Infra label Sep 30, 2020

kennytm mentioned this issue Oct 20, 2020

collation: add new pinyin collation utf8mb4_zh_pinyin_tidb_as_cs pingcap/parser#1058

Merged

bb7133 mentioned this issue Nov 5, 2020

collation: add pinyin collation for chinese charset support #20790

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

support `utf8mb4_zh_0900_as_cs` for `utf8mb4` character set #19747

support `utf8mb4_zh_0900_as_cs` for `utf8mb4` character set #19747

zz-jason commented Sep 3, 2020

zz-jason commented Sep 3, 2020

zz-jason commented Sep 3, 2020

wjhuang2016 commented Sep 3, 2020

xiongjiwei commented Sep 10, 2020

zz-jason commented Sep 10, 2020

xiongjiwei commented Sep 10, 2020 •

edited

zz-jason commented Sep 10, 2020 •

edited

xiongjiwei commented Sep 10, 2020 •

edited

zz-jason commented Sep 10, 2020

xiongjiwei commented Sep 10, 2020 •

edited

bb7133 commented Sep 10, 2020 •

edited

ilovesoup commented Sep 10, 2020

xiongjiwei commented Sep 14, 2020

zz-jason commented Sep 14, 2020 •

edited

bb7133 commented Sep 29, 2020

kennytm commented Oct 20, 2020

xiongjiwei commented Oct 21, 2020 •

edited

kennytm commented Oct 21, 2020

wjhuang2016 commented Oct 21, 2020

dveeden commented Aug 1, 2023

support utf8mb4_zh_0900_as_cs for utf8mb4 character set #19747

support utf8mb4_zh_0900_as_cs for utf8mb4 character set #19747

Comments

zz-jason commented Sep 3, 2020

Feature Request

zz-jason commented Sep 3, 2020

zz-jason commented Sep 3, 2020

wjhuang2016 commented Sep 3, 2020

xiongjiwei commented Sep 10, 2020

zz-jason commented Sep 10, 2020

xiongjiwei commented Sep 10, 2020 • edited

zz-jason commented Sep 10, 2020 • edited

xiongjiwei commented Sep 10, 2020 • edited

zz-jason commented Sep 10, 2020

xiongjiwei commented Sep 10, 2020 • edited

bb7133 commented Sep 10, 2020 • edited

ilovesoup commented Sep 10, 2020

xiongjiwei commented Sep 14, 2020

zz-jason commented Sep 14, 2020 • edited

bb7133 commented Sep 29, 2020

kennytm commented Oct 20, 2020

xiongjiwei commented Oct 21, 2020 • edited

kennytm commented Oct 21, 2020

wjhuang2016 commented Oct 21, 2020

dveeden commented Aug 1, 2023

support `utf8mb4_zh_0900_as_cs` for `utf8mb4` character set #19747

support `utf8mb4_zh_0900_as_cs` for `utf8mb4` character set #19747

xiongjiwei commented Sep 10, 2020 •

edited

zz-jason commented Sep 10, 2020 •

edited

xiongjiwei commented Sep 10, 2020 •

edited

xiongjiwei commented Sep 10, 2020 •

edited

bb7133 commented Sep 10, 2020 •

edited

zz-jason commented Sep 14, 2020 •

edited

xiongjiwei commented Oct 21, 2020 •

edited