Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs/design: add a proposal for full charsets support #27325

Merged
merged 9 commits into from Sep 9, 2021
270 changes: 270 additions & 0 deletions docs/design/2021-08-18-charsets.md
@@ -0,0 +1,270 @@
# Proposal: Support Charset Framework in TiDB

- Author(s): [zimuxia](https://github.com/zimulala) (Xia Li)
- Last updated: 2021-08-18
- Tracking Issue: https://github.com/pingcap/tidb/issues/26812
- Related Document: https://docs.google.com/document/d/1SC0spRoNScR8-GwUNJ9ZHAV8zmlkq99Q2PReu88FARs/edit#
- Performance Test(simple implementation): https://docs.google.com/document/d/1c6AOPpPHOTwVKFJp7n0k72Jy8KdP3oWpTkTjo3Bk2PE/edit#

## Table of Contents

* [Introduction](#introduction)
* [Motivation or Background](#motivation-or-background)
* [Detailed Design](#detailed-design)
* [Test Design](#test-design)
* [Functional Tests](#functional-tests)
* [Scenario Tests](#scenario-tests)
* [Compatibility Tests](#compatibility-tests)
* [Benchmark Tests](#benchmark-tests)
* [Investigation & Alternatives](#investigation--alternatives)
* [Development Plan](#development-plan)

# Introduction

Currently, TiDB only supports ascii, binary, latin1, utf8, and utf8mb4 character sets. This proposal aims to add a framework for character sets to TiDB to facilitate the support of other characters. This design takes the gbk character set as an example.

# Motivation or Background

Character: Various letters and symbols are collectively referred to as characters. A character can be a Chinese character, an English letter, an Arabic numeral, a punctuation mark, a graphic symbol or a control symbol, etc.

Charset: A collection of multiple characters is called a charset. Common character set names: ASCII character set, Unicode character set, GBK character set, etc.

Character Encoding: Character encoding can be considered as a mapping rule. According to this mapping rule, a character can be mapped into other forms of data for storage and transmission in the computer. Each character set has corresponding character encoding rules, and the commonly used character set encoding rules include UTF-8 encoding, GBK encoding and so on.
zimulala marked this conversation as resolved.
Show resolved Hide resolved

MySQL supports many character sets, such as `utf8mb4`, `gbk`, `gb18030`, etc. However, currently TiDB only supports 5 character sets, and there are still some problems with related character sets ([charset: incorrect encoding for `latin1` character set](https://github.com/pingcap/tidb/issues/18955),[extended latin character set not support](https://github.com/pingcap/tidb/issues/3888)). Now it is difficult for TiDB to add a new character set. This proposal describes the implementation of the character set framework. In many traditional industries in China, `gbk`/`gb18030` is used, so this proposal uses gbk character set support as an example.

## Functional requirements

Support gbk character set related functions:
- Supports all code points defined by the GBK 1.0 standard.
- Support conversion with other character sets.
- Support implicit conversion according to [collation coercibility rules](https://dev.mysql.com/doc/refman/8.0/en/charset-collation-coercibility.html).
- Supports the use of `CAST`, `CONVERT` and other functions to convert gbk characters.
- Support SQL statements, such as: `SET CHARACTER SET GBK`, `SET NAMES GBK`, `SHOW CHARSET`, etc.
- Supports the comparison between strings in the gbk character set.
- If the characters passed in are illegal, an error or warning is returned. If an illegal character is used in `CONVERT`, an error is returned. Otherwise, a warning will be returned.
- Compatible with other components, including data import, migration, and so on.

### Supported Collations

- gbk_bin: A binary collation.
- gbk_chinese_ci: The default collation, which supports Pinyin.
- gbk_utf8mb4_bin: A binary collation encoded by utf8mb4.

# Detailed Design
zimulala marked this conversation as resolved.
Show resolved Hide resolved

After receiving the non-utf-8 character set request, this solution will convert it to the utf-8 character set, and then it will be the utf-8 character set when calculating or storing in the TiDB runtime layer and the storage layer, and finally return the result of the conversion to the non-utf-8 character set.

### Encoder/Decoder

- Encapsulate read and write functions, using the Transform functions in golang.org/x/text/encoding (The version of this package used between different components is preferably the same).

### Parser

- Add parser configuration information and record the encoding of the current parser. The value of the encoding is determined by the system variable character_set_client. When invoking Restore, you can output restore SQL according to encoding.
- All ParseOneStmt/Parse usage needs to be checked
- For SQL strings that need to be temporarily saved, you need to bring character set information. For example, BindSQL, View Select Stmt, etc.
- For internally executed SQL statements, since they are already utf-8, they do not need to be processed. For example, the table creation statement in the perfschema package.

### Runtime
- Add a repertoire field to collationInfo to facilitate automatic character set conversion in expressions, so that many errors like "illegal mix of collations" can be avoided.
- The corresponding types of the Repertoire attribute are as follows:

```go
type Repertoire int

const (
// RepertoireASCII is pure ASCII and it’s Unicode range: U+0000..U+007F
RepertoireASCII Repertoire = 1
// RepertoireExtended is Extended characters and it’s Unicode range: U+0080..U+FFFF
RepertoireExtended Repertoire = 1 << 1
// RepertoireUnicode consists ASCII and EXTENDED, and it’s Unicode range: U+0000..U+FFFF
RepertoireUnicode Repertoire = ASCII | EXTENDED
)
```

- Some of the built-in functions related to string computation require special processing after conversion from utf-8.
- Reference document: https://dev.mysql.com/doc/refman/8.0/en/string-functions.html.
- Handle the related functions in the Coprocessor.
- Process some string-related internal functions, check whether the corresponding character set needs to be processed, specific functions such as DatumsToString, strToInt.

### Optimizer

- The statistics module may need special processing functions based on charset: AvgColSizeListInDisk, GetFixedLen, BucketToString and DecodedString, etc.
- Ranger module
- The processing of prefix len needs to consider the charset.
- Functions such as BuildFromPatternLike and checkLikeFunc may need to consider charset.

### Collation

Add gbk_chinese_ci and gbk_bin collations. In addition, considering the performance, we can add the collation of utf8mb4 (gbk_utf8mb4_bin).
- Implement the Collator and WildcardPattern interface functions for each collation.
- gbk_chinese_ci and gbk_bin need to convert utf-8 to gbk encoding and then generate a sort key. gbk_utf8mb4_bin does not need to be converted to gbk code for processing.
- Implement the corresponding functions in the Coprocessor.

### DDL

Support for character sets of databases, tables and columns when creating databases, tables or adding columns, and support for changes through alter statements.

### Others

Other behaviors that need to be dealt with:
- TiKVMinVersion information may need to be modified.
- Set character set operations are supported, such as `set character set gbk`, `set names gbk` and `set character_set_client = gbk`, etc.
- Support the correct display of character sets, such as show charset/collation.
- load data: LoadDataInfo.getLine.

### Compatibility

#### Compatibility between TiDB versions

- Upgrade compatibility:
- Upgrades from versions below 4.0 do not support gbk or any character sets other than the original five (binary, ascii, latin1, utf8, utf8mb4).
- Upgrade from version 4.0 or higher
- There may be compatibility issues when performing non-utf-8-related operations during the rolling upgrade.
- The new version of the cluster is expected to have no compatibility issues when reading old data.
- Downgrade compatibility:
- Downgrade is not compatible. The index key uses the table of gbk_bin/gbk_chinese_ci. The lower version of TiDB will have problems when decoding, and it needs to be transcoded before downgrading.

#### Compatibility with MySQL

Illegal character related issue:

```sql
create table t3(a char(10) charset gbk);
insert into t3 values ('a');

// 0xcee5 is a valid gbk hex literal but invalid utf8mb4 hex literal.
select hex(concat(a, 0xcee5)) from t3;
-- mysql 61cee5

// 0xe4b880 is an invalid gbk hex literal but valid utf8mb4 hex literal.
select hex(concat(a, 0xe4b880)) from t3;
-- mysql 61e4b880 (test on mysql 5.7 and 8.0.22)
-- mysql returns "Cannot convert string '\x80' from binary to gbk" (test on mysql 8.0.25 and 8.0.26). TiDB will be compatible with this behavior.

// 0x80 is a hex literal that invalid for neither gbk nor utf8mb4.
select hex(concat(a, 0x80)) from t3;
-- mysql 6180 (test on mysql 5.7 and 8.0.22)
-- mysql returns "Cannot convert string '\x80' from binary to gbk" (test on mysql 8.0.25 and 8.0.26). TiDB will be compatible with this behavior.

set @@sql_mode = '';
insert into t3 values (0x80);
-- mysql gets a warning and insert null values (warning: "Incorrect string value: '\x80' for column 'a' at row 1")

set @@sql_mode = 'STRICT_TRANS_TABLES';
insert into t3 values (0x80);
-- mysql returns "Incorrect string value: '\x80' for column 'a' at row 1"
```

#### Compatibility with other components

After using this version of TiDB, and when gbk-encoded data is already stored in the database, you can only use the components corresponding to this version or greater.
zimulala marked this conversation as resolved.
Show resolved Hide resolved
- TiKV
- Coprocessor related builtin functions need to be processed, and collation related functions need to be implemented.
- TiCDC
- It may need to be converted to gbk character set encoding when outputting in accordance with the [TiCDC Open Protocol](https://docs.pingcap.com/zh/tidb/dev/ticdc-open-protocol).
- TiFlash
- Related builtin functions need to be processed, and collation related functions need to be implemented.
- BR
- BR realizes backup and recovery by directly manipulating SST files. At present, BR only rewrites the table ID and index ID in the key when restoring, and needs to decode KV. The decoder reuses the codec interface of TiDB, and only needs to update the TiDB dependency. There are no compatibility issues.
- Lightning
- tidb-lightning is responsible for parsing SQL to become a KV pair. It implements its own parse, and special processing may be required here.
- DM
- Currently, DM will specify utf8mb4 when connecting to TiDB, which needs to be processed here.
- When DDL is synchronized, TiDB-parser will be used to parse the DDL statement. Currently, empty charset and collation are used. Use the charset in binlog to pass in the Parser function (it is expected to be available in the [header of the DDL event](https://dev.mysql.com/doc/internals/en/query-event.html#q-charset-database-code)):
- If TiDB-parser supports (ascii, binary, latin1, utf8, utf8mb4, gbk), handle it directly.
- If it is not supported, fallback to an empty charset/collation to keep the old behavior.
- Dumpling
- Dumpling obtains the table structure and data of upstream TiDB/MySQL by executing SQL statements, so there is no compatibility problem.
- TiDB-binlog
- TiDB to binlog is still encoded in utf8. That is, no special treatment is required.

# Test Design

## Functional Tests

Unit and integration tests to consider
- Covers the supported built-in string functions with non-utf-8 character set, including tests for conversion of different character sets.
- Copr-pushdown test of TiDB/TiKV/TiFlash.
- Related read and write tests covering different ranges of character sets and collations.
- Set or display the test of gbk character set related sentences.
Port the gbk character set and related collation tests from mysql-tests.

## Scenario Tests

Test the feasibility of using different character sets for mixed sorting.

## Compatibility Tests

### Compatibility with other features

Test the compatibility of some related features, such as SQL binding, SQL hints, clustered index, view, expression index, statements_summary, slow_query, explain analyze and other features.

### Compatibility with other external components

- Confirm that the corresponding versions of the libraries that call utf-8 and gbk transcoded by different components are consistent.
- Using different components, gbk related operations can be executed normally.

### Upgrade compatibility

- Upgrades from versions below 4.0 do not support gbk.
- Version 4.0 and above test upgrade:
- During the rolling upgrade process, gbk operation is not supported.
- After the upgrade, normal gbk related operations are supported.

### Downgrade compatibility

There will be incompatibility issues when downgrading tables that use gbk encoding.

## Benchmark Tests

- Test the read and write performance test of utf-8 and gbk under the same amount of data.
- Test the corresponding collation related performance test under gbk.

# Investigation & Alternatives

- MySQL's implementation of gbk can be considered as a solution that both runtime and storage use gbk. The character set used may be different between the server and the client of MySQL, between the connection and the result set, and the character set conversion is required. Different character sets in MySQL will implement character set specific interfaces to ensure the realization of character set related functions. Among them, the specific implementation reference of gbk character set: https://github.com/mysql/mysql-server/blob/8.0/strings/ctype-gbk.cc.
- CockroachDB currently does not support additional character sets.

## Rejected Alternatives

After this solution receives the gbk character set request, it will be converted to gbk encoding for processing, that is, it is the gbk character set when calculating or storing at the TiDB runtime layer and the storage layer.
The reason for the final rejection of this plan is that the issues that need to be dealt with are more complex and more uncontrollable, such as:
- Some String methods such as Datum require special handling, such as:
- ConvertToString
- String()
- Most of the built-in functions related to TiDB string and some string processing methods need to be processed.
- Mainly a large number of function signatures and related calling methods need to be modified
- stringutil package
- TiDB uses the methods of Go's own strings library (if you modify the related methods in this item, some built-in functions of TiDB do not need to be modified):
- strings.go
- ToUpper, ToLower
- HasPrefix
- ...
- strconv package
- AppendQuote
- ParseInt, ParseFloat
- ...
- ToGBKString

# Development Plan

The first stage (mainly realize the development of TiDB side)

- Support normal read and write operations using gbk.
- Supports setting charset to gbk or related sentences that display gbk information.
- Support string-related commonly used functions to process characters with gbk encoding.
- Support character set conversion, that is, support to use cast, convert and other functions to convert other character sets into gbk or gbk into other character sets.
- Support gbk_bin collation.

The second stage

- Realize TiKV and TiFlash related operations.
- Compatible with other components, including TiCDC, BR, Lightning, DM, Dumpling and TiDB-binlog.
- Support gbk_chinese_ci collation.

The third stage

- Basically support all string-related functions already supported by TiDB.
- Support the use of alter statement, support to modify the charset of the column.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this include modifying columns that are part of an index?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It needs to be included.