From 6fceb1a6e38a887d3e5296961199f05235d16b61 Mon Sep 17 00:00:00 2001 From: houfaxin Date: Tue, 11 Nov 2025 11:01:21 +0800 Subject: [PATCH 01/19] initial trans --- TOC.md | 1 + br/backup-and-restore-overview.md | 1 + character-set-and-collation.md | 35 +++--- character-set-gb18030.md | 111 ++++++++++++++++++ character-set-gbk.md | 37 +++++- dm/dm-overview.md | 4 +- .../information-schema-character-sets.md | 3 +- migrate-from-mariadb.md | 6 +- mysql-compatibility.md | 4 +- .../sql-statement-show-collation.md | 59 ++++++++-- 10 files changed, 222 insertions(+), 39 deletions(-) create mode 100644 character-set-gb18030.md diff --git a/TOC.md b/TOC.md index 2f0a3ea4f7f1d..5006f35c8ec1f 100644 --- a/TOC.md +++ b/TOC.md @@ -1004,6 +1004,7 @@ - Character Set and Collation - [Overview](/character-set-and-collation.md) - [GBK](/character-set-gbk.md) + - [GB18030](/character-set-gb18030.md) - [Placement Rules in SQL](/placement-rules-in-sql.md) - System Tables - `mysql` Schema diff --git a/br/backup-and-restore-overview.md b/br/backup-and-restore-overview.md index 061d08f8ec8a5..8c5ddbaa3bfad 100644 --- a/br/backup-and-restore-overview.md +++ b/br/backup-and-restore-overview.md @@ -113,6 +113,7 @@ Backup and restore might go wrong when some TiDB features are enabled or disable | Feature | Issue | Solution | | ---- | ---- | ----- | |GBK charset|| BR of versions earlier than v5.4.0 does not support restoring `charset=GBK` tables. No version of BR supports recovering `charset=GBK` tables to TiDB clusters earlier than v5.4.0. | +|GB18030 charset|| Before v9.0.0, BR does not support restoring tables with `charset=GB18030`. In addition, no version of BR supports restoring tables with `charset=GB18030` to TiDB clusters earlier than v9.0.0.| | Clustered index | [#565](https://github.com/pingcap/br/issues/565) | Make sure that the value of the `tidb_enable_clustered_index` global variable during restore is consistent with that during backup. 
Otherwise, data inconsistency might occur, such as `default not found` error and inconsistent data index. | | New collation | [#352](https://github.com/pingcap/br/issues/352) | Make sure that the value of the `new_collation_enabled` variable in the `mysql.tidb` table during restore is consistent with that during backup. Otherwise, inconsistent data index might occur and checksum might fail to pass. For more information, see [FAQ - Why does BR report `new_collations_enabled_on_first_bootstrap` mismatch?](/faq/backup-and-restore-faq.md#why-is-new_collation_enabled-mismatch-reported-during-restore). | | Global temporary tables | | Make sure that you are using v5.3.0 or a later version of BR to back up and restore data. Otherwise, an error occurs in the definition of the backed global temporary tables. | diff --git a/character-set-and-collation.md b/character-set-and-collation.md index 5b5b3d845826e..5dfdc5843545d 100644 --- a/character-set-and-collation.md +++ b/character-set-and-collation.md @@ -1,6 +1,6 @@ --- title: Character Set and Collation -summary: Learn about the supported character sets and collations in TiDB. +summary: TiDB supports the following character sets: ascii, binary, gbk, gb18030, latin1, utf8, and utf8mb4. The supported collations include: ascii_bin, binary, gbk_bin, gbk_chinese_ci, gb18030_bin, gb18030_chinese_ci, latin1_bin, utf8_bin, utf8_general_ci, utf8_unicode_ci, utf8mb4_0900_ai_ci, utf8mb4_0900_bin, utf8mb4_bin, utf8mb4_general_ci, and utf8mb4_unicode_ci. TiDB strongly recommends using the utf8mb4 character set because it supports a wider range of characters. In TiDB, the default collation is affected by the client’s connection collation setting. If the client uses utf8mb4_0900_ai_ci as the connection collation, TiDB follows the client configuration. TiDB also supports a new collation framework that provides semantic-level support for different collations. 
aliases: ['/docs/dev/character-set-and-collation/','/docs/dev/reference/sql/characterset-and-collation/','/docs/dev/reference/sql/character-set/'] --- @@ -99,17 +99,18 @@ SHOW CHARACTER SET; ``` ```sql -+---------+-------------------------------------+-------------------+--------+ -| Charset | Description | Default collation | Maxlen | -+---------+-------------------------------------+-------------------+--------+ -| ascii | US ASCII | ascii_bin | 1 | -| binary | binary | binary | 1 | -| gbk | Chinese Internal Code Specification | gbk_chinese_ci | 2 | -| latin1 | Latin1 | latin1_bin | 1 | -| utf8 | UTF-8 Unicode | utf8_bin | 3 | -| utf8mb4 | UTF-8 Unicode | utf8mb4_bin | 4 | -+---------+-------------------------------------+-------------------+--------+ -6 rows in set (0.00 sec) ++---------+-------------------------------------+--------------------+--------+ +| Charset | Description | Default collation | Maxlen | ++---------+-------------------------------------+--------------------+--------+ +| ascii | US ASCII | ascii_bin | 1 | +| binary | binary | binary | 1 | +| gb18030 | China National Standard GB18030 | gb18030_chinese_ci | 4 | +| gbk | Chinese Internal Code Specification | gbk_chinese_ci | 2 | +| latin1 | Latin1 | latin1_bin | 1 | +| utf8 | UTF-8 Unicode | utf8_bin | 3 | +| utf8mb4 | UTF-8 Unicode | utf8mb4_bin | 4 | ++---------+-------------------------------------+--------------------+--------+ +7 rows in set (0.000 sec) ``` TiDB supports the following collations: @@ -124,6 +125,8 @@ SHOW COLLATION; +--------------------+---------+-----+---------+----------+---------+---------------+ | ascii_bin | ascii | 65 | Yes | Yes | 1 | PAD SPACE | | binary | binary | 63 | Yes | Yes | 1 | NO PAD | +| gb18030_bin | gb18030 | 249 | | Yes | 1 | PAD SPACE | +| gb18030_chinese_ci | gb18030 | 248 | Yes | Yes | 1 | PAD SPACE | | gbk_bin | gbk | 87 | | Yes | 1 | PAD SPACE | | gbk_chinese_ci | gbk | 28 | Yes | Yes | 1 | PAD SPACE | | latin1_bin | latin1 | 47 | Yes | Yes | 1 | 
PAD SPACE | @@ -136,7 +139,7 @@ SHOW COLLATION; | utf8mb4_general_ci | utf8mb4 | 45 | | Yes | 1 | PAD SPACE | | utf8mb4_unicode_ci | utf8mb4 | 224 | | Yes | 8 | PAD SPACE | +--------------------+---------+-----+---------+----------+---------+---------------+ -13 rows in set (0.00 sec) +15 rows in set (0.000 sec) ``` > **Warning:** @@ -171,7 +174,7 @@ SHOW COLLATION WHERE Charset = 'utf8mb4'; 5 rows in set (0.001 sec) ``` -For details about the TiDB support of the GBK character set, see [GBK](/character-set-gbk.md). +For details about the GBK character set, see [GBK](/character-set-gbk.md). For details about the the GB18030 character set, see [GB18030](/character-set-gb18030.md). ## `utf8` and `utf8mb4` in TiDB @@ -535,9 +538,9 @@ This new framework supports semantically parsing collations. TiDB enables the ne -Under the new framework, TiDB supports the `utf8_general_ci`, `utf8mb4_general_ci`, `utf8_unicode_ci`, `utf8mb4_unicode_ci`, `utf8mb4_0900_bin`, `utf8mb4_0900_ai_ci`, `gbk_chinese_ci`, and `gbk_bin` collations, which is compatible with MySQL. +Under the new framework, TiDB supports the `utf8_general_ci`, `utf8mb4_general_ci`, `utf8_unicode_ci`, `utf8mb4_unicode_ci`, `utf8mb4_0900_bin`, `utf8mb4_0900_ai_ci`, `gbk_chinese_ci`, `gbk_bin`, `gb18030_chinese_ci`, and `gb18030_bin` collations, which are compatible with MySQL. -When one of `utf8_general_ci`, `utf8mb4_general_ci`, `utf8_unicode_ci`, `utf8mb4_unicode_ci`, `utf8mb4_0900_ai_ci` and `gbk_chinese_ci` is used, the string comparison is case-insensitive and accent-insensitive. At the same time, TiDB also corrects the collation's `PADDING` behavior: +When one of `utf8_general_ci`, `utf8mb4_general_ci`, `utf8_unicode_ci`, `utf8mb4_unicode_ci`, `utf8mb4_0900_ai_ci`, `gbk_chinese_ci` and `gb18030_chinese_ci` is used, the string comparison is case-insensitive and accent-insensitive.
At the same time, TiDB also corrects the collation's `PADDING` behavior: ```sql CREATE TABLE t(a varchar(20) charset utf8mb4 collate utf8mb4_general_ci PRIMARY KEY); diff --git a/character-set-gb18030.md b/character-set-gb18030.md new file mode 100644 index 0000000000000..49a16492e8fb8 --- /dev/null +++ b/character-set-gb18030.md @@ -0,0 +1,111 @@ +--- +title: GB18030 +summary: This document provides details about the TiDB support of the GB18030 character set. +--- + +# GB18030 + +Starting from v9.0.0, TiDB supports the GB18030-2022 character set. This document describes TiDB's support for and compatibility with the GB18030 character set. + +```sql +SHOW CHARACTER SET WHERE CHARSET = 'gb18030'; +``` + +``` ++---------+---------------------------------+--------------------+--------+ +| Charset | Description | Default collation | Maxlen | ++---------+---------------------------------+--------------------+--------+ +| gb18030 | China National Standard GB18030 | gb18030_chinese_ci | 4 | ++---------+---------------------------------+--------------------+--------+ +1 row in set (0.01 sec) +``` + +```sql +SHOW COLLATION WHERE CHARSET = 'gb18030'; +``` + +``` ++--------------------+---------+-----+---------+----------+---------+---------------+ +| Collation | Charset | Id | Default | Compiled | Sortlen | Pad_attribute | ++--------------------+---------+-----+---------+----------+---------+---------------+ +| gb18030_bin | gb18030 | 249 | | Yes | 1 | PAD SPACE | +| gb18030_chinese_ci | gb18030 | 248 | Yes | Yes | 1 | PAD SPACE | ++--------------------+---------+-----+---------+----------+---------+---------------+ +2 rows in set (0.001 sec) +``` + +## MySQL compatibility + +This section describes the compatibility of the GB18030 character set in TiDB with MySQL. + +### Collation compatibility + +In MySQL, the default collation for the GB18030 character set is `gb18030_chinese_ci`. 
In TiDB, the default collation for GB18030 depends on the configuration parameter [`new_collations_enabled_on_first_bootstrap`](/tidb-configuration-file.md#new_collations_enabled_on_first_bootstrap): + +- By default, `new_collations_enabled_on_first_bootstrap` is set to `true`, which means enabling the [new collation framework](/character-set-and-collation.md#new-framework-for-collations). In this case, the default collation for GB18030 is `gb18030_chinese_ci`. +- If `new_collations_enabled_on_first_bootstrap` is set to `false`, the new framework for collations is disabled, and the default collation for GB18030 is `gb18030_bin`. + +Additionally, the `gb18030_bin` supported by TiDB differs from MySQL's `gb18030_bin`. TiDB converts GB18030 to `utf8mb4` and then performs binary sorting. + +After enabling the new framework for collations, checking the collations for the GB18030 character set shows that TiDB's default collation for GB18030 is switched to `gb18030_chinese_ci`. + +```sql +SHOW CHARACTER SET WHERE CHARSET = 'gb18030'; +``` + +``` ++---------+---------------------------------+--------------------+--------+ +| Charset | Description | Default collation | Maxlen | ++---------+---------------------------------+--------------------+--------+ +| gb18030 | China National Standard GB18030 | gb18030_chinese_ci | 4 | ++---------+---------------------------------+--------------------+--------+ +1 row in set (0.01 sec) +``` + +```sql +SHOW COLLATION WHERE CHARSET = 'gb18030'; +``` + +``` ++--------------------+---------+-----+---------+----------+---------+---------------+ +| Collation | Charset | Id | Default | Compiled | Sortlen | Pad_attribute | ++--------------------+---------+-----+---------+----------+---------+---------------+ +| gb18030_bin | gb18030 | 249 | | Yes | 1 | PAD SPACE | +| gb18030_chinese_ci | gb18030 | 248 | Yes | Yes | 1 | PAD SPACE | ++--------------------+---------+-----+---------+----------+---------+---------------+ +2 rows in set (0.00 sec) 
+``` + +### Character compatibility + +- TiDB supports GB18030-2022 characters, while MySQL supports GB18030-2005 characters. As a result, the encoding and decoding of some characters differ. + +- For invalid GB18030 characters, such as `0xFE39FE39`, MySQL allows writing them to the database in hexadecimal form and stores them as `?`. In TiDB, reading or writing invalid GB18030 characters in strict mode will return an error, while in non-strict mode, it will generate a warning. + +### Others + +- Currently, TiDB does not support changing other character sets to gb18030 or converting from `gb18030` to another character set using the `ALTER TABLE` statement. + +- TiDB does not support using collations with the `_gb18030` suffix, for example: + + ```sql + CREATE TABLE t(a CHAR(10) CHARSET BINARY); + Query OK, 0 rows affected (0.00 sec) + INSERT INTO t VALUES (_gb18030'啊'); + ERROR 1115 (42000): Unsupported character introducer: 'gb18030' + ``` + +* For binary characters in `ENUM` and `SET` types, TiDB currently treats them as using the `utf8mb4` character set. + +## Component compatibility + +- TiFlash, TiDB Data Migration (DM), and TiCDC currently do not support the GB18030 character set. + +- Before v9.0.0, Dumpling does not support exporting tables with `charset=GB18030`, and TiDB Lightning does not support importing tables with `charset=GB18030`. + +- Before v9.0.0, TiDB Backup & Restore (BR) does not support backing up or restoring tables with `charset=GB18030`. In addition, no version of BR supports restoring tables with `charset=GB18030` to TiDB clusters earlier than v9.0.0. 
+ +## See also + +* [`SHOW CHARACTER SET`](/sql-statements/sql-statement-show-character-set.md) +* [Character Set and Collation](/character-set-and-collation.md) diff --git a/character-set-gbk.md b/character-set-gbk.md index a767af5fcdbba..e65353d3d3f96 100644 --- a/character-set-gbk.md +++ b/character-set-gbk.md @@ -7,8 +7,6 @@ summary: This document provides details about the TiDB support of the GBK charac Starting from v5.4.0, TiDB supports the GBK character set. This document provides the TiDB support and compatibility information of the GBK character set. -Starting from v6.0.0, TiDB enables the [new framework for collations](/character-set-and-collation.md#new-framework-for-collations) by default. The default collation for TiDB GBK character set is `gbk_chinese_ci`, which is consistent with MySQL. - ```sql SHOW CHARACTER SET WHERE CHARSET = 'gbk'; ``` @@ -17,7 +15,7 @@ SHOW CHARACTER SET WHERE CHARSET = 'gbk'; +---------+-------------------------------------+-------------------+--------+ | Charset | Description | Default collation | Maxlen | +---------+-------------------------------------+-------------------+--------+ -| gbk | Chinese Internal Code Specification | gbk_chinese_ci | 2 | +| gbk | Chinese Internal Code Specification | gbk_bin | 2 | +---------+-------------------------------------+-------------------+--------+ 1 row in set (0.00 sec) ``` @@ -33,7 +31,7 @@ SHOW COLLATION WHERE CHARSET = 'gbk'; | gbk_bin | gbk | 87 | | Yes | 1 | PAD SPACE | | gbk_chinese_ci | gbk | 28 | Yes | Yes | 1 | PAD SPACE | +----------------+---------+----+---------+----------+---------+---------------+ -2 rows in set (0.00 sec) +2 rows in set (0.001 sec) ``` ## MySQL compatibility @@ -59,6 +57,37 @@ By default, TiDB Cloud enables the [new framework for collations](/character-set Additionally, because TiDB converts GBK to `utf8mb4` and then uses a binary collation, the `gbk_bin` collation in TiDB is not the same as the `gbk_bin` collation in MySQL. 
+After the new framework for collations is enabled, if you check the collations corresponding to the GBK character set, you can see that the default collation for GBK in TiDB has been switched to `gbk_chinese_ci`. + +Starting from TiDB v6.0.0, [the new framework for collations](/character-set-and-collation.md#new-framework-for-collations) is enabled by default, which sets `gbk_chinese_ci` as the default collation for the GBK character set in TiDB, consistent with MySQL. + +```sql +SHOW CHARACTER SET WHERE CHARSET = 'gbk'; +``` + +``` ++---------+-------------------------------------+-------------------+--------+ +| Charset | Description | Default collation | Maxlen | ++---------+-------------------------------------+-------------------+--------+ +| gbk | Chinese Internal Code Specification | gbk_chinese_ci | 2 | ++---------+-------------------------------------+-------------------+--------+ +1 row in set (0.00 sec) +``` + +```sql +SHOW COLLATION WHERE CHARSET = 'gbk'; +``` + +``` ++----------------+---------+----+---------+----------+---------+---------------+ +| Collation | Charset | Id | Default | Compiled | Sortlen | Pad_attribute | ++----------------+---------+----+---------+----------+---------+---------------+ +| gbk_bin | gbk | 87 | | Yes | 1 | PAD SPACE | +| gbk_chinese_ci | gbk | 28 | Yes | Yes | 1 | PAD SPACE | ++----------------+---------+----+---------+----------+---------+---------------+ +2 rows in set (0.001 sec) +``` + ### Illegal character compatibility * If the system variables [`character_set_client`](/system-variables.md#character_set_client) and [`character_set_connection`](/system-variables.md#character_set_connection) are not set to `gbk` at the same time, TiDB handles illegal characters in the same way as MySQL. 
diff --git a/dm/dm-overview.md b/dm/dm-overview.md index 8ecd035a17193..eeaccfbf2b4e0 100644 --- a/dm/dm-overview.md +++ b/dm/dm-overview.md @@ -59,9 +59,9 @@ Before using the DM tool, note the following restrictions: - DM does not replicate view-related DDL statements and DML statements to the downstream TiDB cluster. It is recommended that you create the view in the downstream TiDB cluster manually. -+ GBK character set compatibility ++ GBK and GB18030 character sets compatibility - - DM does not support migrating `charset=GBK` tables to TiDB clusters earlier than v5.4.0. + - DM does not support migrating `charset=GBK` tables to TiDB clusters earlier than v5.4.0. Currently, DM does not support migrating tables with `charset=GB18030` to TiDB. + Binlog compatibility diff --git a/information-schema/information-schema-character-sets.md b/information-schema/information-schema-character-sets.md index b94170bb5f8ff..197074ede79e1 100644 --- a/information-schema/information-schema-character-sets.md +++ b/information-schema/information-schema-character-sets.md @@ -40,12 +40,13 @@ The output is as follows: +--------------------+----------------------+-------------------------------------+--------+ | ascii | ascii_bin | US ASCII | 1 | | binary | binary | binary | 1 | +| gb18030 | gb18030_chinese_ci | China National Standard GB18030 | 4 | | gbk | gbk_chinese_ci | Chinese Internal Code Specification | 2 | | latin1 | latin1_bin | Latin1 | 1 | | utf8 | utf8_bin | UTF-8 Unicode | 3 | | utf8mb4 | utf8mb4_bin | UTF-8 Unicode | 4 | +--------------------+----------------------+-------------------------------------+--------+ -6 rows in set (0.00 sec) +7 rows in set (0.00 sec) ``` The description of columns in the `CHARACTER_SETS` table is as follows: diff --git a/migrate-from-mariadb.md b/migrate-from-mariadb.md index 4a670c08cceec..bf70a1b950d5f 100644 --- a/migrate-from-mariadb.md +++ b/migrate-from-mariadb.md @@ -192,12 +192,14 @@ To see what collations TiDB supports, execute this 
statement on TiDB: SHOW COLLATION; ``` -```sql +``` +--------------------+---------+-----+---------+----------+---------+---------------+ | Collation | Charset | Id | Default | Compiled | Sortlen | Pad_attribute | +--------------------+---------+-----+---------+----------+---------+---------------+ | ascii_bin | ascii | 65 | Yes | Yes | 1 | PAD SPACE | | binary | binary | 63 | Yes | Yes | 1 | NO PAD | +| gb18030_bin | gb18030 | 249 | | Yes | 1 | PAD SPACE | +| gb18030_chinese_ci | gb18030 | 248 | Yes | Yes | 1 | PAD SPACE | | gbk_bin | gbk | 87 | | Yes | 1 | PAD SPACE | | gbk_chinese_ci | gbk | 28 | Yes | Yes | 1 | PAD SPACE | | latin1_bin | latin1 | 47 | Yes | Yes | 1 | PAD SPACE | @@ -210,7 +212,7 @@ SHOW COLLATION; | utf8mb4_general_ci | utf8mb4 | 45 | | Yes | 1 | PAD SPACE | | utf8mb4_unicode_ci | utf8mb4 | 224 | | Yes | 8 | PAD SPACE | +--------------------+---------+-----+---------+----------+---------+---------------+ -13 rows in set (0.00 sec) +15 rows in set (0.000 sec) ``` To check what collations the columns of your current tables are using, you can use this statement: diff --git a/mysql-compatibility.md b/mysql-compatibility.md index b8f9629b97070..deb85071aa1bb 100644 --- a/mysql-compatibility.md +++ b/mysql-compatibility.md @@ -52,7 +52,7 @@ You can try out TiDB features on [TiDB Playground](https://play.tidbcloud.com/?u > Currently, only {{{ .starter }}} and {{{ .essential }}} clusters in certain AWS regions support [`FULLTEXT` syntax and indexes](https://docs.pingcap.com/tidbcloud/vector-search-full-text-search-sql). TiDB Self-Managed and TiDB Cloud Dedicated support parsing the `FULLTEXT` syntax but do not support using the `FULLTEXT` indexes. + `SPATIAL` (also known as `GIS`/`GEOMETRY`) functions, data types and indexes [#6347](https://github.com/pingcap/tidb/issues/6347) -+ Character sets other than `ascii`, `latin1`, `binary`, `utf8`, `utf8mb4`, and `gbk`. 
++ Character sets other than `ascii`, `latin1`, `binary`, `utf8`, `utf8mb4`, `gbk`, and `gb18030`. + Optimizer trace + XML Functions + X-Protocol [#1109](https://github.com/pingcap/tidb/issues/1109) @@ -210,6 +210,8 @@ For more information, see [Compatibility between TiDB local temporary tables and * For information on the MySQL compatibility of the GBK character set, refer to [GBK compatibility](/character-set-gbk.md#mysql-compatibility) . +* For information on the MySQL compatibility of the GB18030 character set, refer to [GB18030 compatibility](/character-set-gb18030.md#mysql-compatibility). + * TiDB inherits the character set used in the table as the national character set. ### Storage engines diff --git a/sql-statements/sql-statement-show-collation.md b/sql-statements/sql-statement-show-collation.md index a6c840aaa225f..8d844059adafb 100644 --- a/sql-statements/sql-statement-show-collation.md +++ b/sql-statements/sql-statement-show-collation.md @@ -27,7 +27,7 @@ ShowLikeOrWhere ::= -When [the new collation framework](/tidb-configuration-file.md#new_collations_enabled_on_first_bootstrap) is enabled (the default), the example output is as follows: +If [the new collation framework](/tidb-configuration-file.md#new_collations_enabled_on_first_bootstrap) is not enabled, only binary collations are displayed: @@ -58,24 +58,57 @@ SHOW COLLATION; -When the new collation framework is disabled, only binary collations are listed. 
+If the new framework for collations is enabled, in addition to the binary collations, the following collations are also supported: + +- Seven case- and accent-insensitive collations, ending with `_ci` +- `utf8mb4_0900_bin` + +```sql +SHOW COLLATION; +``` + +``` ++--------------------+---------+-----+---------+----------+---------+---------------+ +| Collation | Charset | Id | Default | Compiled | Sortlen | Pad_attribute | ++--------------------+---------+-----+---------+----------+---------+---------------+ +| ascii_bin | ascii | 65 | Yes | Yes | 1 | PAD SPACE | +| binary | binary | 63 | Yes | Yes | 1 | NO PAD | +| gb18030_bin | gb18030 | 249 | | Yes | 1 | PAD SPACE | +| gb18030_chinese_ci | gb18030 | 248 | Yes | Yes | 1 | PAD SPACE | +| gbk_bin | gbk | 87 | | Yes | 1 | PAD SPACE | +| gbk_chinese_ci | gbk | 28 | Yes | Yes | 1 | PAD SPACE | +| latin1_bin | latin1 | 47 | Yes | Yes | 1 | PAD SPACE | +| utf8_bin | utf8 | 83 | Yes | Yes | 1 | PAD SPACE | +| utf8_general_ci | utf8 | 33 | | Yes | 1 | PAD SPACE | +| utf8_unicode_ci | utf8 | 192 | | Yes | 8 | PAD SPACE | +| utf8mb4_0900_ai_ci | utf8mb4 | 255 | | Yes | 0 | NO PAD | +| utf8mb4_0900_bin | utf8mb4 | 309 | | Yes | 1 | NO PAD | +| utf8mb4_bin | utf8mb4 | 46 | Yes | Yes | 1 | PAD SPACE | +| utf8mb4_general_ci | utf8mb4 | 45 | | Yes | 1 | PAD SPACE | +| utf8mb4_unicode_ci | utf8mb4 | 224 | | Yes | 8 | PAD SPACE | ++--------------------+---------+-----+---------+----------+---------+---------------+ +15 rows in set (0.000 sec) +``` + +If the new framework for collations is disabled, only binary collations are listed.
```sql SHOW COLLATION; ``` ``` -+-------------+---------+----+---------+----------+---------+---------------+ -| Collation | Charset | Id | Default | Compiled | Sortlen | Pad_attribute | -+-------------+---------+----+---------+----------+---------+---------------+ -| utf8mb4_bin | utf8mb4 | 46 | Yes | Yes | 1 | PAD SPACE | -| latin1_bin | latin1 | 47 | Yes | Yes | 1 | PAD SPACE | -| binary | binary | 63 | Yes | Yes | 1 | NO PAD | -| ascii_bin | ascii | 65 | Yes | Yes | 1 | PAD SPACE | -| utf8_bin | utf8 | 83 | Yes | Yes | 1 | PAD SPACE | -| gbk_bin | gbk | 87 | Yes | Yes | 1 | PAD SPACE | -+-------------+---------+----+---------+----------+---------+---------------+ -6 rows in set (0.00 sec) ++-------------+---------+-----+---------+----------+---------+---------------+ +| Collation | Charset | Id | Default | Compiled | Sortlen | Pad_attribute | ++-------------+---------+-----+---------+----------+---------+---------------+ +| utf8mb4_bin | utf8mb4 | 46 | Yes | Yes | 1 | PAD SPACE | +| latin1_bin | latin1 | 47 | Yes | Yes | 1 | PAD SPACE | +| binary | binary | 63 | Yes | Yes | 1 | NO PAD | +| ascii_bin | ascii | 65 | Yes | Yes | 1 | PAD SPACE | +| utf8_bin | utf8 | 83 | Yes | Yes | 1 | PAD SPACE | +| gbk_bin | gbk | 87 | Yes | Yes | 1 | PAD SPACE | +| gb18030_bin | gb18030 | 249 | Yes | Yes | 1 | PAD SPACE | ++-------------+---------+-----+---------+----------+---------+---------------+ +7 rows in set (0.00 sec) ``` From 6aa55c41414a94e8fe3c2510c8eb6a6e1e663c03 Mon Sep 17 00:00:00 2001 From: houfaxin Date: Tue, 11 Nov 2025 11:11:16 +0800 Subject: [PATCH 02/19] Update character-set-gb18030.md --- character-set-gb18030.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/character-set-gb18030.md b/character-set-gb18030.md index 49a16492e8fb8..3d5bf1ba39c18 100644 --- a/character-set-gb18030.md +++ b/character-set-gb18030.md @@ -88,12 +88,12 @@ SHOW COLLATION WHERE CHARSET = 'gb18030'; - TiDB does not support using collations with the 
`_gb18030` suffix, for example: - ```sql - CREATE TABLE t(a CHAR(10) CHARSET BINARY); - Query OK, 0 rows affected (0.00 sec) - INSERT INTO t VALUES (_gb18030'啊'); - ERROR 1115 (42000): Unsupported character introducer: 'gb18030' - ``` + ```sql + CREATE TABLE t(a CHAR(10) CHARSET BINARY); + Query OK, 0 rows affected (0.00 sec) + INSERT INTO t VALUES (_gb18030'啊'); + ERROR 1115 (42000): Unsupported character introducer: 'gb18030' + ``` * For binary characters in `ENUM` and `SET` types, TiDB currently treats them as using the `utf8mb4` character set. From 435c9a5aacc64548b3511f70dde12efcea55746a Mon Sep 17 00:00:00 2001 From: xixirangrang Date: Tue, 11 Nov 2025 14:33:16 +0800 Subject: [PATCH 03/19] Apply suggestions from code review Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> --- character-set-and-collation.md | 2 +- character-set-gb18030.md | 6 +++--- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/character-set-and-collation.md b/character-set-and-collation.md index 5dfdc5843545d..2b43840688a8e 100644 --- a/character-set-and-collation.md +++ b/character-set-and-collation.md @@ -174,7 +174,7 @@ SHOW COLLATION WHERE Charset = 'utf8mb4'; 5 rows in set (0.001 sec) ``` -For details about the GBK character set, see [GBK](/character-set-gbk.md). For details about the the GB18030 character set, see [GB18030](/character-set-gb18030.md). +For details about the GBK character set, see [GBK](/character-set-gbk.md). For details about the GB18030 character set, see [GB18030](/character-set-gb18030.md). ## `utf8` and `utf8mb4` in TiDB diff --git a/character-set-gb18030.md b/character-set-gb18030.md index 3d5bf1ba39c18..535e2f1425c53 100644 --- a/character-set-gb18030.md +++ b/character-set-gb18030.md @@ -1,6 +1,6 @@ --- title: GB18030 -summary: This document provides details about the TiDB support of the GB18030 character set. 
+summary: This document provides details about the TiDB support for the GB18030 character set. --- # GB18030 @@ -86,7 +86,7 @@ SHOW COLLATION WHERE CHARSET = 'gb18030'; - Currently, TiDB does not support changing other character sets to gb18030 or converting from `gb18030` to another character set using the `ALTER TABLE` statement. -- TiDB does not support using collations with the `_gb18030` suffix, for example: +- TiDB does not support using the `_gb18030` character set introducer, for example: ```sql CREATE TABLE t(a CHAR(10) CHARSET BINARY); @@ -95,7 +95,7 @@ SHOW COLLATION WHERE CHARSET = 'gb18030'; ERROR 1115 (42000): Unsupported character introducer: 'gb18030' ``` -* For binary characters in `ENUM` and `SET` types, TiDB currently treats them as using the `utf8mb4` character set. +- For binary characters in `ENUM` and `SET` types, TiDB currently treats them as using the `utf8mb4` character set. ## Component compatibility From da97767e464a9e51cab8e0d3f5fecfff4eedac49 Mon Sep 17 00:00:00 2001 From: xixirangrang Date: Tue, 11 Nov 2025 14:45:51 +0800 Subject: [PATCH 04/19] Update dm/dm-overview.md --- dm/dm-overview.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/dm/dm-overview.md b/dm/dm-overview.md index eeaccfbf2b4e0..c9eb5c7ce1889 100644 --- a/dm/dm-overview.md +++ b/dm/dm-overview.md @@ -61,7 +61,7 @@ Before using the DM tool, note the following restrictions: + GBK and GB18030 character sets compatibility - - DM does not support migrating `charset=GBK` tables to TiDB clusters earlier than v5.4.0. Currently, DM does not support migrating tables with `charset=GB18030` to TiDB. + - Before v5.4.0, DM does not support migrating `charset=GBK` tables to TiDB clusters. Before v9.0.0, DM does not support migrating tables with `charset=GB18030` to TiDB clusters. 
+ Binlog compatibility From 0f94df5c52c02ae8d7ada92a9ce1261fa12b695a Mon Sep 17 00:00:00 2001 From: houfaxin Date: Tue, 11 Nov 2025 15:13:52 +0800 Subject: [PATCH 05/19] Update character-set-and-collation.md --- character-set-and-collation.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/character-set-and-collation.md b/character-set-and-collation.md index 2b43840688a8e..23bef128490bc 100644 --- a/character-set-and-collation.md +++ b/character-set-and-collation.md @@ -1,6 +1,6 @@ --- title: Character Set and Collation -summary: TiDB supports the following character sets: ascii, binary, gbk, gb18030, latin1, utf8, and utf8mb4. The supported collations include: ascii_bin, binary, gbk_bin, gbk_chinese_ci, gb18030_bin, gb18030_chinese_ci, latin1_bin, utf8_bin, utf8_general_ci, utf8_unicode_ci, utf8mb4_0900_ai_ci, utf8mb4_0900_bin, utf8mb4_bin, utf8mb4_general_ci, and utf8mb4_unicode_ci. TiDB strongly recommends using the utf8mb4 character set because it supports a wider range of characters. In TiDB, the default collation is affected by the client’s connection collation setting. If the client uses utf8mb4_0900_ai_ci as the connection collation, TiDB follows the client configuration. TiDB also supports a new collation framework that provides semantic-level support for different collations. +summary: Learn about the character sets and collations supported by TiDB.
aliases: ['/docs/dev/character-set-and-collation/','/docs/dev/reference/sql/characterset-and-collation/','/docs/dev/reference/sql/character-set/'] --- From 2295e9ec29a5cacec49c0ddc09390a07204c2b8c Mon Sep 17 00:00:00 2001 From: houfaxin Date: Tue, 11 Nov 2025 17:41:39 +0800 Subject: [PATCH 06/19] Update character-set-and-collation.md --- character-set-and-collation.md | 32 ++++++++++++++++---------------- 1 file changed, 16 insertions(+), 16 deletions(-) diff --git a/character-set-and-collation.md b/character-set-and-collation.md index 23bef128490bc..568e06082e3d4 100644 --- a/character-set-and-collation.md +++ b/character-set-and-collation.md @@ -38,7 +38,7 @@ SELECT 'A' = 'a'; SET NAMES utf8mb4 COLLATE utf8mb4_general_ci; ``` -```sql +``` Query OK, 0 rows affected (0.00 sec) ``` @@ -46,7 +46,7 @@ Query OK, 0 rows affected (0.00 sec) SELECT 'A' = 'a'; ``` -```sql +``` +-----------+ | 'A' = 'a' | +-----------+ @@ -98,7 +98,7 @@ Currently, TiDB supports the following character sets: SHOW CHARACTER SET; ``` -```sql +``` +---------+-------------------------------------+--------------------+--------+ | Charset | Description | Default collation | Maxlen | +---------+-------------------------------------+--------------------+--------+ @@ -119,7 +119,7 @@ TiDB supports the following collations: SHOW COLLATION; ``` -```sql +``` +--------------------+---------+-----+---------+----------+---------+---------------+ | Collation | Charset | Id | Default | Compiled | Sortlen | Pad_attribute | +--------------------+---------+-----+---------+----------+---------+---------------+ @@ -161,7 +161,7 @@ You can use the following statement to view the collations (under the [new frame SHOW COLLATION WHERE Charset = 'utf8mb4'; ``` -```sql +``` +--------------------+---------+-----+---------+----------+---------+---------------+ | Collation | Charset | Id | Default | Compiled | Sortlen | Pad_attribute | +--------------------+---------+-----+---------+----------+---------+---------------+ 
@@ -285,7 +285,7 @@ Database changed
 SELECT @@character_set_database, @@collation_database;
 ```

-```sql
+```
 +--------------------------|----------------------+
 | @@character_set_database | @@collation_database |
 +--------------------------|----------------------+
@@ -298,7 +298,7 @@ SELECT @@character_set_database, @@collation_database;
 CREATE SCHEMA test2 CHARACTER SET latin1 COLLATE latin1_bin;
 ```

-```sql
+```
 Query OK, 0 rows affected (0.09 sec)
 ```

@@ -314,7 +314,7 @@ Database changed
 SELECT @@character_set_database, @@collation_database;
 ```

-```sql
+```
 +--------------------------|----------------------+
 | @@character_set_database | @@collation_database |
 +--------------------------|----------------------+
@@ -350,7 +350,7 @@ For example:
 CREATE TABLE t1(a int) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;
 ```

-```sql
+```
 Query OK, 0 rows affected (0.08 sec)
 ```

@@ -382,7 +382,7 @@ Each string corresponds to a character set and a collation. When you use a strin

 Example:

-```sql
+```
 SELECT 'string';
 SELECT _utf8mb4'string';
 SELECT _utf8mb4'string' COLLATE utf8mb4_general_ci;
@@ -521,7 +521,7 @@ For a TiDB cluster that is already initialized, you can check whether the new co
 SELECT VARIABLE_VALUE FROM mysql.tidb WHERE VARIABLE_NAME='new_collation_enabled';
 ```

-```sql
+```
 +----------------+
 | VARIABLE_VALUE |
 +----------------+
@@ -546,7 +546,7 @@ When one of `utf8_general_ci`, `utf8mb4_general_ci`, `utf8_unicode_ci`, `utf8mb4
 CREATE TABLE t(a varchar(20) charset utf8mb4 collate utf8mb4_general_ci PRIMARY KEY);
 ```

-```sql
+```
 Query OK, 0 rows affected (0.00 sec)
 ```

@@ -554,7 +554,7 @@ Query OK, 0 rows affected (0.00 sec)
 INSERT INTO t VALUES ('A');
 ```

-```sql
+```
 Query OK, 1 row affected (0.00 sec)
 ```

@@ -562,7 +562,7 @@ Query OK, 1 row affected (0.00 sec)
 INSERT INTO t VALUES ('a');
 ```

-```sql
+```
 ERROR 1062 (23000): Duplicate entry 'a' for key 't.PRIMARY' -- TiDB is compatible with the case-insensitive collation of MySQL.
 ```

@@ -570,7 +570,7 @@ ERROR 1062 (23000): Duplicate entry 'a' for key 't.PRIMARY' -- TiDB is compatibl
 INSERT INTO t VALUES ('a ');
 ```

-```sql
+```
 ERROR 1062 (23000): Duplicate entry 'a ' for key 't.PRIMARY' -- TiDB modifies the `PADDING` behavior to be compatible with MySQL.
 ```

@@ -607,7 +607,7 @@ TiDB supports using the `COLLATE` clause to specify the collation of an expressi
 SELECT 'a' = _utf8mb4 'A' collate utf8mb4_general_ci;
 ```

-```sql
+```
 +-----------------------------------------------+
 | 'a' = _utf8mb4 'A' collate utf8mb4_general_ci |
 +-----------------------------------------------+

From 7271835a363f139f9fba093faec60db03bf600ce Mon Sep 17 00:00:00 2001
From: houfaxin
Date: Wed, 12 Nov 2025 09:56:14 +0800
Subject: [PATCH 07/19] Update sql-statement-show-collation.md

---
 .../sql-statement-show-collation.md | 31 -------------------
 1 file changed, 31 deletions(-)

diff --git a/sql-statements/sql-statement-show-collation.md b/sql-statements/sql-statement-show-collation.md
index 8d844059adafb..976bbc1989112 100644
--- a/sql-statements/sql-statement-show-collation.md
+++ b/sql-statements/sql-statement-show-collation.md
@@ -27,37 +27,6 @@ ShowLikeOrWhere ::=
-If [the new collation framework](/tidb-configuration-file.md#new_collations_enabled_on_first_bootstrap) is not enabled, only binary collations are displayed:
-
-
-```sql
-SHOW COLLATION;
-```
-
-```
-+--------------------+---------+-----+---------+----------+---------+---------------+
-| Collation          | Charset | Id  | Default | Compiled | Sortlen | Pad_attribute |
-+--------------------+---------+-----+---------+----------+---------+---------------+
-| ascii_bin          | ascii   |  65 | Yes     | Yes      |       1 | PAD SPACE     |
-| binary             | binary  |  63 | Yes     | Yes      |       1 | NO PAD        |
-| gbk_bin            | gbk     |  87 |         | Yes      |       1 | PAD SPACE     |
-| gbk_chinese_ci     | gbk     |  28 | Yes     | Yes      |       1 | PAD SPACE     |
-| latin1_bin         | latin1  |  47 | Yes     | Yes      |       1 | PAD SPACE     |
-| utf8_bin           | utf8    |  83 | Yes     | Yes      |       1 | PAD SPACE     |
-| utf8_general_ci    | utf8    |  33 |         | Yes      |       1 | PAD SPACE     |
-| utf8_unicode_ci    | utf8    | 192 |         | Yes      |       8 | PAD SPACE     |
-| utf8mb4_0900_ai_ci | utf8mb4 | 255 |         | Yes      |       0 | NO PAD        |
-| utf8mb4_0900_bin   | utf8mb4 | 309 |         | Yes      |       1 | NO PAD        |
-| utf8mb4_bin        | utf8mb4 |  46 | Yes     | Yes      |       1 | PAD SPACE     |
-| utf8mb4_general_ci | utf8mb4 |  45 |         | Yes      |       1 | PAD SPACE     |
-| utf8mb4_unicode_ci | utf8mb4 | 224 |         | Yes      |       8 | PAD SPACE     |
-+--------------------+---------+-----+---------+----------+---------+---------------+
-13 rows in set (0.00 sec)
-```
-
-
-
 If the new framework for collations is enabled, in addition to the binary collations, the following collations are also supported:
 - Seven case- and accent-insensitive collations, ending `with _ci`
 - `utf8mb4_0900_bin`

From a3d401e64ad6b3f582fb3f8433d8b79479c41faa Mon Sep 17 00:00:00 2001
From: houfaxin
Date: Wed, 12 Nov 2025 09:59:47 +0800
Subject: [PATCH 08/19] Update sql-statement-show-collation.md

---
 sql-statements/sql-statement-show-collation.md | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/sql-statements/sql-statement-show-collation.md b/sql-statements/sql-statement-show-collation.md
index 976bbc1989112..dd89792793c37 100644
--- a/sql-statements/sql-statement-show-collation.md
+++ b/sql-statements/sql-statement-show-collation.md
@@ -25,8 +25,6 @@ ShowLikeOrWhere ::=
 ## Examples
-
-
 If the new framework for collations is enabled, in addition to the binary collations, the following collations are also supported:
 - Seven case- and accent-insensitive collations, ending `with _ci`
 - `utf8mb4_0900_bin`
@@ -80,8 +78,6 @@ SHOW COLLATION;
 7 rows in set (0.00 sec)
 ```

-
-
 To filter on the character set, you can add a `WHERE` clause.
 ```sql

From f8030e41753cf1e8680a857970ede2cb8258ba45 Mon Sep 17 00:00:00 2001
From: xixirangrang
Date: Wed, 12 Nov 2025 10:09:00 +0800
Subject: [PATCH 09/19] Update character-set-gbk.md

---
 character-set-gbk.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/character-set-gbk.md b/character-set-gbk.md
index e65353d3d3f96..a5d3cccded1e9 100644
--- a/character-set-gbk.md
+++ b/character-set-gbk.md
@@ -57,9 +57,9 @@ By default, TiDB Cloud enables the [new framework for collations](/character-set

 Additionally, because TiDB converts GBK to `utf8mb4` and then uses a binary collation, the `gbk_bin` collation in TiDB is not the same as the `gbk_bin` collation in MySQL.

-After the new framework for collations is enabled, if you check the collations corresponding to the GBK character set, you can see that the default collation for GBK in TiDB has been switched to `gbk_chinese_ci`.
+After [the new framework for collations](/character-set-and-collation.md#new-framework-for-collations) is enabled, if you check the collations corresponding to the GBK character set, you can see that the default collation for GBK in TiDB has been switched to `gbk_chinese_ci`.

-Starting from TiDB v6.0.0, [the new framework for collations](/character-set-and-collation.md#new-framework-for-collations) is enabled by default, which sets `gbk_chinese_ci` as the default collation for the GBK character set in TiDB, consistent with MySQL.
+Starting from TiDB v6.0.0, the new framework for collations is enabled by default, which sets `gbk_chinese_ci` as the default collation for the GBK character set in TiDB, consistent with MySQL.
 ```sql
 SHOW CHARACTER SET WHERE CHARSET = 'gbk';

From dc13cc75a79034c3139ec4159fd04f5549b60cb5 Mon Sep 17 00:00:00 2001
From: houfaxin
Date: Wed, 12 Nov 2025 10:15:28 +0800
Subject: [PATCH 10/19] Update sql-statement-show-collation.md

---
 sql-statements/sql-statement-show-collation.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sql-statements/sql-statement-show-collation.md b/sql-statements/sql-statement-show-collation.md
index dd89792793c37..a8112506a6540 100644
--- a/sql-statements/sql-statement-show-collation.md
+++ b/sql-statements/sql-statement-show-collation.md
@@ -27,7 +27,7 @@ ShowLikeOrWhere ::=

 If the new framework for collations is enabled, in addition to the binary collations, the following collations are also supported:

-- Seven case- and accent-insensitive collations, ending `with _ci`
+- Seven case- and accent-insensitive collations, ending with `_ci`
 - `utf8mb4_0900_bin`

 ```sql

From 59bcbb566e4bebfbe11f275e602e8d6ae7e9eb5b Mon Sep 17 00:00:00 2001
From: houfaxin
Date: Wed, 12 Nov 2025 10:16:22 +0800
Subject: [PATCH 11/19] Update sql-statement-show-collation.md

---
 sql-statements/sql-statement-show-collation.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sql-statements/sql-statement-show-collation.md b/sql-statements/sql-statement-show-collation.md
index a8112506a6540..bd5a90194def9 100644
--- a/sql-statements/sql-statement-show-collation.md
+++ b/sql-statements/sql-statement-show-collation.md
@@ -25,7 +25,7 @@ ShowLikeOrWhere ::=

 ## Examples

-If the new framework for collations is enabled, in addition to the binary collations, the following collations are also supported:
+If [the new collation framework](/tidb-configuration-file.md#new_collations_enabled_on_first_bootstrap) is enabled, in addition to the binary collations, the following collations are also supported:

 - Seven case- and accent-insensitive collations, ending with `_ci`
 - `utf8mb4_0900_bin`

From 2eab8e2724ca13f951036653445e875cce11b644 Mon Sep 17 00:00:00 2001
From: lilin90
Date: Mon, 17 Nov 2025 15:51:29 +0800
Subject: [PATCH 12/19] Update title wording

---
 character-set-gb18030.md | 4 ++--
 character-set-gbk.md | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/character-set-gb18030.md b/character-set-gb18030.md
index 535e2f1425c53..eafa60ac58fbb 100644
--- a/character-set-gb18030.md
+++ b/character-set-gb18030.md
@@ -1,9 +1,9 @@
 ---
-title: GB18030
+title: The GB18030 Character Set
 summary: This document provides details about the TiDB support for the GB18030 character set.
 ---

-# GB18030
+# The GB18030 Character Set

 Starting from v9.0.0, TiDB supports the GB18030-2022 character set. This document describes TiDB's support for and compatibility with the GB18030 character set.

diff --git a/character-set-gbk.md b/character-set-gbk.md
index a5d3cccded1e9..d4c326ac819f2 100644
--- a/character-set-gbk.md
+++ b/character-set-gbk.md
@@ -1,9 +1,9 @@
 ---
-title: GBK
+title: The GBK Character Set
 summary: This document provides details about the TiDB support of the GBK character set.
 ---

-# GBK
+# The GBK Character Set

 Starting from v5.4.0, TiDB supports the GBK character set. This document provides the TiDB support and compatibility information of the GBK character set.
From ced476744eeb95a60bc540f2a2a7fd719ca13828 Mon Sep 17 00:00:00 2001
From: xixirangrang
Date: Tue, 18 Nov 2025 15:41:45 +0800
Subject: [PATCH 13/19] Apply suggestions from code review

Co-authored-by: Lilian Lee
---
 br/backup-and-restore-overview.md | 2 +-
 character-set-and-collation.md | 2 +-
 character-set-gb18030.md | 14 +++++++-------
 character-set-gbk.md | 4 ++--
 sql-statements/sql-statement-show-collation.md | 4 ++--
 5 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/br/backup-and-restore-overview.md b/br/backup-and-restore-overview.md
index 8c5ddbaa3bfad..4aed24e5c1542 100644
--- a/br/backup-and-restore-overview.md
+++ b/br/backup-and-restore-overview.md
@@ -112,7 +112,7 @@ Backup and restore might go wrong when some TiDB features are enabled or disable

 | Feature | Issue | Solution |
 | ---- | ---- | ----- |
-|GBK charset|| BR of versions earlier than v5.4.0 does not support restoring `charset=GBK` tables. No version of BR supports recovering `charset=GBK` tables to TiDB clusters earlier than v5.4.0. |
+|GBK charset|| Before v5.4.0, BR does not support restoring tables with `charset=GBK`. In addition, no version of BR supports restoring tables with `charset=GBK` to TiDB clusters earlier than v5.4.0. |
 |GB18030 charset|| Before v9.0.0, BR does not support restoring tables with `charset=GB18030`. In addition, no version of BR supports restoring tables with `charset=GB18030` to TiDB clusters earlier than v9.0.0.|
 | Clustered index | [#565](https://github.com/pingcap/br/issues/565) | Make sure that the value of the `tidb_enable_clustered_index` global variable during restore is consistent with that during backup. Otherwise, data inconsistency might occur, such as `default not found` error and inconsistent data index. |
 | New collation | [#352](https://github.com/pingcap/br/issues/352) | Make sure that the value of the `new_collation_enabled` variable in the `mysql.tidb` table during restore is consistent with that during backup. Otherwise, inconsistent data index might occur and checksum might fail to pass. For more information, see [FAQ - Why does BR report `new_collations_enabled_on_first_bootstrap` mismatch?](/faq/backup-and-restore-faq.md#why-is-new_collation_enabled-mismatch-reported-during-restore). |
diff --git a/character-set-and-collation.md b/character-set-and-collation.md
index 568e06082e3d4..303b14ffcf0e3 100644
--- a/character-set-and-collation.md
+++ b/character-set-and-collation.md
@@ -174,7 +174,7 @@ SHOW COLLATION WHERE Charset = 'utf8mb4';
 5 rows in set (0.001 sec)
 ```

-For details about the GBK character set, see [GBK](/character-set-gbk.md). For details about the GB18030 character set, see [GB18030](/character-set-gb18030.md).
+For details about the GBK character set, see [The GBK Character Set](/character-set-gbk.md). For details about the GB18030 character set, see [The GB18030 Character Set](/character-set-gb18030.md).

 ## `utf8` and `utf8mb4` in TiDB

diff --git a/character-set-gb18030.md b/character-set-gb18030.md
index eafa60ac58fbb..d3e793804dfb1 100644
--- a/character-set-gb18030.md
+++ b/character-set-gb18030.md
@@ -1,6 +1,6 @@
 ---
 title: The GB18030 Character Set
-summary: This document provides details about the TiDB support for the GB18030 character set.
+summary: Learn the details of TiDB's support for the GB18030 character set.
 ---

 # The GB18030 Character Set
@@ -45,9 +45,9 @@ In MySQL, the default collation for the GB18030 character set is `gb18030_chines
 - By default, `new_collations_enabled_on_first_bootstrap` is set to `true`, which means enabling the [new collation framework](/character-set-and-collation.md#new-framework-for-collations). In this case, the default collation for GB18030 is `gb18030_chinese_ci`.
 - If `new_collations_enabled_on_first_bootstrap` is set to `false`, the new framework for collations is disabled, and the default collation for GB18030 is `gb18030_bin`.

-Additionally, the `gb18030_bin` supported by TiDB differs from MySQL's `gb18030_bin`. TiDB converts GB18030 to `utf8mb4` and then performs binary sorting.
+Additionally, the `gb18030_bin` supported by TiDB differs from MySQL's `gb18030_bin` collation. TiDB converts GB18030 to `utf8mb4` and then performs binary sorting.

-After enabling the new framework for collations, checking the collations for the GB18030 character set shows that TiDB's default collation for GB18030 is switched to `gb18030_chinese_ci`.
+After enabling the new framework for collations, if you check the collations for the GB18030 character set, you can see that TiDB's default collation for GB18030 is switched to `gb18030_chinese_ci`:

 ```sql
 SHOW CHARACTER SET WHERE CHARSET = 'gb18030';
@@ -78,15 +78,15 @@ SHOW COLLATION WHERE CHARSET = 'gb18030';

 ### Character compatibility

-- TiDB supports GB18030-2022 characters, while MySQL supports GB18030-2005 characters. As a result, the encoding and decoding of some characters differ.
+- TiDB supports GB18030-2022 characters, while MySQL supports GB18030-2005 characters. As a result, the encoding and decoding results for certain characters differ between the two systems.

-- For invalid GB18030 characters, such as `0xFE39FE39`, MySQL allows writing them to the database in hexadecimal form and stores them as `?`. In TiDB, reading or writing invalid GB18030 characters in strict mode will return an error, while in non-strict mode, it will generate a warning.
+- For invalid GB18030 characters, such as `0xFE39FE39`, MySQL allows writing them to the database in hexadecimal form and stores them as `?`. In TiDB, reading or writing invalid GB18030 characters in strict mode returns an error; in non-strict mode, TiDB allows reading or writing invalid GB18030 characters but returns a warning.

 ### Others

-- Currently, TiDB does not support changing other character sets to gb18030 or converting from `gb18030` to another character set using the `ALTER TABLE` statement.
+- Currently, TiDB does not support using the `ALTER TABLE` statement to convert other character sets to `gb18030`, or to convert from `gb18030` to another character set.

-- TiDB does not support using the `_gb18030` character set introducer, for example:
+- TiDB does not support using the `_gb18030` character set introducer. For example:

 ```sql
 CREATE TABLE t(a CHAR(10) CHARSET BINARY);
diff --git a/character-set-gbk.md b/character-set-gbk.md
index d4c326ac819f2..3f812cd7e9d0c 100644
--- a/character-set-gbk.md
+++ b/character-set-gbk.md
@@ -55,9 +55,9 @@ By default, TiDB Cloud enables the [new framework for collations](/character-set

-Additionally, because TiDB converts GBK to `utf8mb4` and then uses a binary collation, the `gbk_bin` collation in TiDB is not the same as the `gbk_bin` collation in MySQL.
+Additionally, the `gbk_bin` supported by TiDB differs from MySQL's `gbk_bin` collation. TiDB converts GBK to `utf8mb4` and then performs binary sorting.

-After [the new framework for collations](/character-set-and-collation.md#new-framework-for-collations) is enabled, if you check the collations corresponding to the GBK character set, you can see that the default collation for GBK in TiDB has been switched to `gbk_chinese_ci`.
+After [the new framework for collations](/character-set-and-collation.md#new-framework-for-collations) is enabled, if you check the collations for the GBK character set, you can see that TiDB's default collation for GBK is switched to `gbk_chinese_ci`.

 Starting from TiDB v6.0.0, the new framework for collations is enabled by default, which sets `gbk_chinese_ci` as the default collation for the GBK character set in TiDB, consistent with MySQL.

diff --git a/sql-statements/sql-statement-show-collation.md b/sql-statements/sql-statement-show-collation.md
index bd5a90194def9..6219bd5a47454 100644
--- a/sql-statements/sql-statement-show-collation.md
+++ b/sql-statements/sql-statement-show-collation.md
@@ -25,7 +25,7 @@ ShowLikeOrWhere ::=

 ## Examples

-If [the new collation framework](/tidb-configuration-file.md#new_collations_enabled_on_first_bootstrap) is enabled, in addition to the binary collations, the following collations are also supported:
+If [the new collation framework](/tidb-configuration-file.md#new_collations_enabled_on_first_bootstrap) is enabled, in addition to the binary collations, TiDB also supports the following collations:

 - Seven case- and accent-insensitive collations, ending with `_ci`
 - `utf8mb4_0900_bin`
@@ -57,7 +57,7 @@ SHOW COLLATION;
 15 rows in set (0.000 sec)
 ```

-If the new framework for collations is disabled, only binary collations are listed.
+If [the new collation framework](/tidb-configuration-file.md#new_collations_enabled_on_first_bootstrap) is disabled, TiDB supports only binary collations.

 ```sql
 SHOW COLLATION;

From 175e524793acb3bc4a3e3e449a2a2b04780e5a68 Mon Sep 17 00:00:00 2001
From: xixirangrang
Date: Tue, 18 Nov 2025 16:03:30 +0800
Subject: [PATCH 14/19] Update character-set-and-collation.md

Co-authored-by: Lilian Lee
---
 character-set-and-collation.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/character-set-and-collation.md b/character-set-and-collation.md
index 303b14ffcf0e3..0257b46a7368d 100644
--- a/character-set-and-collation.md
+++ b/character-set-and-collation.md
@@ -174,7 +174,7 @@ SHOW COLLATION WHERE Charset = 'utf8mb4';
 5 rows in set (0.001 sec)
 ```

-For details about the GBK character set, see [The GBK Character Set](/character-set-gbk.md). For details about the GB18030 character set, see [The GB18030 Character Set](/character-set-gb18030.md).
+For details about the GBK character set, see [The GBK Character Set](/character-set-gbk.md).

 ## `utf8` and `utf8mb4` in TiDB

From 66ae462ba082fcbb4e7d5a60128b6c8a77cf0487 Mon Sep 17 00:00:00 2001
From: xixirangrang
Date: Tue, 18 Nov 2025 16:13:19 +0800
Subject: [PATCH 15/19] Update mysql-compatibility.md

Co-authored-by: Lilian Lee
---
 mysql-compatibility.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mysql-compatibility.md b/mysql-compatibility.md
index deb85071aa1bb..d707e8a092620 100644
--- a/mysql-compatibility.md
+++ b/mysql-compatibility.md
@@ -210,7 +210,7 @@ For more information, see [Compatibility between TiDB local temporary tables and

 * For information on the MySQL compatibility of the GBK character set, refer to [GBK compatibility](/character-set-gbk.md#mysql-compatibility) .

-* For information on the MySQL compatibility of the GB18030 character set, refer to [GB18030 compatibility](/character-set-gb18030.md#mysql-compatibility).
+

 * TiDB inherits the character set used in the table as the national character set.

From ff7b6d001545696fb3dc3a49e80a8cda0f0766c6 Mon Sep 17 00:00:00 2001
From: lilin90
Date: Tue, 18 Nov 2025 17:08:03 +0800
Subject: [PATCH 16/19] Update cloud TOC and add version label

Linked the GB18030 character set documentation in the table of contents, character set overview, and MySQL compatibility pages. Updated the GB18030 documentation to indicate its introduction in v9.0.0.
---
 TOC-tidb-cloud.md | 1 +
 character-set-and-collation.md | 2 +-
 character-set-gb18030.md | 2 +-
 mysql-compatibility.md | 2 +-
 4 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/TOC-tidb-cloud.md b/TOC-tidb-cloud.md
index 37c66a6a0125b..4d76b05b0be09 100644
--- a/TOC-tidb-cloud.md
+++ b/TOC-tidb-cloud.md
@@ -644,6 +644,7 @@
 - Character Set and Collation
 - [Overview](/character-set-and-collation.md)
 - [GBK](/character-set-gbk.md)
+ - [GB18030](/character-set-gb18030.md)
 - Read Historical Data
 - Use Stale Read (Recommended)
 - [Usage Scenarios of Stale Read](/stale-read.md)
diff --git a/character-set-and-collation.md b/character-set-and-collation.md
index 0257b46a7368d..303b14ffcf0e3 100644
--- a/character-set-and-collation.md
+++ b/character-set-and-collation.md
@@ -174,7 +174,7 @@ SHOW COLLATION WHERE Charset = 'utf8mb4';
 5 rows in set (0.001 sec)
 ```

-For details about the GBK character set, see [The GBK Character Set](/character-set-gbk.md).
+For details about the GBK character set, see [The GBK Character Set](/character-set-gbk.md). For details about the GB18030 character set, see [The GB18030 Character Set](/character-set-gb18030.md).

 ## `utf8` and `utf8mb4` in TiDB

diff --git a/character-set-gb18030.md b/character-set-gb18030.md
index d3e793804dfb1..bc61a762a65dd 100644
--- a/character-set-gb18030.md
+++ b/character-set-gb18030.md
@@ -3,7 +3,7 @@ title: The GB18030 Character Set
 summary: Learn the details of TiDB's support for the GB18030 character set.
 ---

-# The GB18030 Character Set
+# The GB18030 Character Set New in v9.0.0

 Starting from v9.0.0, TiDB supports the GB18030-2022 character set. This document describes TiDB's support for and compatibility with the GB18030 character set.
diff --git a/mysql-compatibility.md b/mysql-compatibility.md
index d707e8a092620..deb85071aa1bb 100644
--- a/mysql-compatibility.md
+++ b/mysql-compatibility.md
@@ -210,7 +210,7 @@ For more information, see [Compatibility between TiDB local temporary tables and

 * For information on the MySQL compatibility of the GBK character set, refer to [GBK compatibility](/character-set-gbk.md#mysql-compatibility) .

-
+* For information on the MySQL compatibility of the GB18030 character set, refer to [GB18030 compatibility](/character-set-gb18030.md#mysql-compatibility).

 * TiDB inherits the character set used in the table as the national character set.

From d367a08543b092c516e50b115c05b04691ff97b3 Mon Sep 17 00:00:00 2001
From: lilin90
Date: Tue, 18 Nov 2025 17:14:10 +0800
Subject: [PATCH 17/19] Update link for configuration parameter reference

Changed the link for `new_collations_enabled_on_first_bootstrap` from a relative path to the official TiDB documentation URL for improved clarity and accessibility.
---
 character-set-gb18030.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/character-set-gb18030.md b/character-set-gb18030.md
index bc61a762a65dd..d13f5886c7474 100644
--- a/character-set-gb18030.md
+++ b/character-set-gb18030.md
@@ -40,7 +40,7 @@ This section describes the compatibility of the GB18030 character set in TiDB wi

 ### Collation compatibility

-In MySQL, the default collation for the GB18030 character set is `gb18030_chinese_ci`. In TiDB, the default collation for GB18030 depends on the configuration parameter [`new_collations_enabled_on_first_bootstrap`](/tidb-configuration-file.md#new_collations_enabled_on_first_bootstrap):
+In MySQL, the default collation for the GB18030 character set is `gb18030_chinese_ci`. In TiDB, the default collation for GB18030 depends on the configuration parameter [`new_collations_enabled_on_first_bootstrap`](https://docs.pingcap.com/tidb/stable/tidb-configuration-file/#new_collations_enabled_on_first_bootstrap):

 - By default, `new_collations_enabled_on_first_bootstrap` is set to `true`, which means enabling the [new collation framework](/character-set-and-collation.md#new-framework-for-collations). In this case, the default collation for GB18030 is `gb18030_chinese_ci`.
 - If `new_collations_enabled_on_first_bootstrap` is set to `false`, the new framework for collations is disabled, and the default collation for GB18030 is `gb18030_bin`.

From cb7d2d92d7713bd4f6e4bb8f9672edff04047f01 Mon Sep 17 00:00:00 2001
From: lilin90
Date: Tue, 18 Nov 2025 17:22:07 +0800
Subject: [PATCH 18/19] Update sql-statement-show-collation.md

---
 sql-statements/sql-statement-show-collation.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sql-statements/sql-statement-show-collation.md b/sql-statements/sql-statement-show-collation.md
index 6219bd5a47454..8daff4dee5712 100644
--- a/sql-statements/sql-statement-show-collation.md
+++ b/sql-statements/sql-statement-show-collation.md
@@ -25,7 +25,7 @@ ShowLikeOrWhere ::=

 ## Examples

-If [the new collation framework](/tidb-configuration-file.md#new_collations_enabled_on_first_bootstrap) is enabled, in addition to the binary collations, TiDB also supports the following collations:
+When the [new collation framework](https://docs.pingcap.com/tidb/stable/tidb-configuration-file/#new_collations_enabled_on_first_bootstrap) is enabled, in addition to the binary collations, TiDB also supports the following collations:

 - Seven case- and accent-insensitive collations, ending with `_ci`
 - `utf8mb4_0900_bin`

From 0198e6b64ccd4a6eb5010e6361e97ef33de1b052 Mon Sep 17 00:00:00 2001
From: lilin90
Date: Tue, 18 Nov 2025 17:22:30 +0800
Subject: [PATCH 19/19] Update a link

---
 sql-statements/sql-statement-show-collation.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sql-statements/sql-statement-show-collation.md b/sql-statements/sql-statement-show-collation.md
index 8daff4dee5712..16bdcb00df4fc 100644
--- a/sql-statements/sql-statement-show-collation.md
+++ b/sql-statements/sql-statement-show-collation.md
@@ -57,7 +57,7 @@ SHOW COLLATION;
 15 rows in set (0.000 sec)
 ```

-If [the new collation framework](/tidb-configuration-file.md#new_collations_enabled_on_first_bootstrap) is disabled, TiDB supports only binary collations.
+If [the new collation framework](https://docs.pingcap.com/tidb/stable/tidb-configuration-file/#new_collations_enabled_on_first_bootstrap) is disabled, TiDB supports only binary collations.

 ```sql
 SHOW COLLATION;