1 change: 1 addition & 0 deletions TOC-tidb-cloud.md
@@ -644,6 +644,7 @@
- Character Set and Collation
- [Overview](/character-set-and-collation.md)
- [GBK](/character-set-gbk.md)
+- [GB18030](/character-set-gb18030.md)
- Read Historical Data
- Use Stale Read (Recommended)
- [Usage Scenarios of Stale Read](/stale-read.md)
1 change: 1 addition & 0 deletions TOC.md
@@ -1004,6 +1004,7 @@
- Character Set and Collation
- [Overview](/character-set-and-collation.md)
- [GBK](/character-set-gbk.md)
+- [GB18030](/character-set-gb18030.md)
- [Placement Rules in SQL](/placement-rules-in-sql.md)
- System Tables
- `mysql` Schema
3 changes: 2 additions & 1 deletion br/backup-and-restore-overview.md
@@ -112,7 +112,8 @@ Backup and restore might go wrong when some TiDB features are enabled or disable

| Feature | Issue | Solution |
| ---- | ---- | ----- |
-|GBK charset|| BR of versions earlier than v5.4.0 does not support restoring `charset=GBK` tables. No version of BR supports recovering `charset=GBK` tables to TiDB clusters earlier than v5.4.0. |
+|GBK charset|| Before v5.4.0, BR does not support restoring tables with `charset=GBK`. In addition, no version of BR supports restoring tables with `charset=GBK` to TiDB clusters earlier than v5.4.0. |
+|GB18030 charset|| Before v9.0.0, BR does not support restoring tables with `charset=GB18030`. In addition, no version of BR supports restoring tables with `charset=GB18030` to TiDB clusters earlier than v9.0.0. |
| Clustered index | [#565](https://github.com/pingcap/br/issues/565) | Make sure that the value of the `tidb_enable_clustered_index` global variable during restore is consistent with that during backup. Otherwise, data inconsistency might occur, such as `default not found` error and inconsistent data index. |
| New collation | [#352](https://github.com/pingcap/br/issues/352) | Make sure that the value of the `new_collation_enabled` variable in the `mysql.tidb` table during restore is consistent with that during backup. Otherwise, inconsistent data index might occur and checksum might fail to pass. For more information, see [FAQ - Why does BR report `new_collations_enabled_on_first_bootstrap` mismatch?](/faq/backup-and-restore-faq.md#why-is-new_collation_enabled-mismatch-reported-during-restore). |
| Global temporary tables | | Make sure that you are using v5.3.0 or a later version of BR to back up and restore data. Otherwise, an error occurs in the definition of the backed global temporary tables. |
67 changes: 35 additions & 32 deletions character-set-and-collation.md
@@ -1,6 +1,6 @@
---
title: Character Set and Collation
-summary: Learn about the supported character sets and collations in TiDB.
+summary: Learn character sets and collations supported by TiDB.
aliases: ['/docs/dev/character-set-and-collation/','/docs/dev/reference/sql/characterset-and-collation/','/docs/dev/reference/sql/character-set/']
---

@@ -38,15 +38,15 @@ SELECT 'A' = 'a';
SET NAMES utf8mb4 COLLATE utf8mb4_general_ci;
```

-```sql
+```
Query OK, 0 rows affected (0.00 sec)
```

```sql
SELECT 'A' = 'a';
```

-```sql
+```
+-----------+
| 'A' = 'a' |
+-----------+
@@ -98,18 +98,19 @@ Currently, TiDB supports the following character sets:
SHOW CHARACTER SET;
```

-```sql
-+---------+-------------------------------------+-------------------+--------+
-| Charset | Description | Default collation | Maxlen |
-+---------+-------------------------------------+-------------------+--------+
-| ascii | US ASCII | ascii_bin | 1 |
-| binary | binary | binary | 1 |
-| gbk | Chinese Internal Code Specification | gbk_chinese_ci | 2 |
-| latin1 | Latin1 | latin1_bin | 1 |
-| utf8 | UTF-8 Unicode | utf8_bin | 3 |
-| utf8mb4 | UTF-8 Unicode | utf8mb4_bin | 4 |
-+---------+-------------------------------------+-------------------+--------+
-6 rows in set (0.00 sec)
+```
++---------+-------------------------------------+--------------------+--------+
+| Charset | Description | Default collation | Maxlen |
++---------+-------------------------------------+--------------------+--------+
+| ascii | US ASCII | ascii_bin | 1 |
+| binary | binary | binary | 1 |
+| gb18030 | China National Standard GB18030 | gb18030_chinese_ci | 4 |
+| gbk | Chinese Internal Code Specification | gbk_chinese_ci | 2 |
+| latin1 | Latin1 | latin1_bin | 1 |
+| utf8 | UTF-8 Unicode | utf8_bin | 3 |
+| utf8mb4 | UTF-8 Unicode | utf8mb4_bin | 4 |
++---------+-------------------------------------+--------------------+--------+
+7 rows in set (0.000 sec)
```

TiDB supports the following collations:
@@ -118,12 +119,14 @@ TiDB supports the following collations:
SHOW COLLATION;
```

-```sql
+```
+--------------------+---------+-----+---------+----------+---------+---------------+
| Collation | Charset | Id | Default | Compiled | Sortlen | Pad_attribute |
+--------------------+---------+-----+---------+----------+---------+---------------+
| ascii_bin | ascii | 65 | Yes | Yes | 1 | PAD SPACE |
| binary | binary | 63 | Yes | Yes | 1 | NO PAD |
+| gb18030_bin | gb18030 | 249 | | Yes | 1 | PAD SPACE |
+| gb18030_chinese_ci | gb18030 | 248 | Yes | Yes | 1 | PAD SPACE |
| gbk_bin | gbk | 87 | | Yes | 1 | PAD SPACE |
| gbk_chinese_ci | gbk | 28 | Yes | Yes | 1 | PAD SPACE |
| latin1_bin | latin1 | 47 | Yes | Yes | 1 | PAD SPACE |
@@ -136,7 +139,7 @@ SHOW COLLATION;
| utf8mb4_general_ci | utf8mb4 | 45 | | Yes | 1 | PAD SPACE |
| utf8mb4_unicode_ci | utf8mb4 | 224 | | Yes | 8 | PAD SPACE |
+--------------------+---------+-----+---------+----------+---------+---------------+
-13 rows in set (0.00 sec)
+15 rows in set (0.000 sec)
```

> **Warning:**
@@ -158,7 +161,7 @@ You can use the following statement to view the collations (under the [new frame
SHOW COLLATION WHERE Charset = 'utf8mb4';
```

-```sql
+```
+--------------------+---------+-----+---------+----------+---------+---------------+
| Collation | Charset | Id | Default | Compiled | Sortlen | Pad_attribute |
+--------------------+---------+-----+---------+----------+---------+---------------+
@@ -171,7 +174,7 @@ SHOW COLLATION WHERE Charset = 'utf8mb4';
5 rows in set (0.001 sec)
```

-For details about the TiDB support of the GBK character set, see [GBK](/character-set-gbk.md).
+For details about the GBK character set, see [The GBK Character Set](/character-set-gbk.md). For details about the GB18030 character set, see [The GB18030 Character Set](/character-set-gb18030.md).

## `utf8` and `utf8mb4` in TiDB

@@ -282,7 +285,7 @@ Database changed
SELECT @@character_set_database, @@collation_database;
```

-```sql
+```
+--------------------------+----------------------+
| @@character_set_database | @@collation_database |
+--------------------------+----------------------+
@@ -295,7 +298,7 @@ SELECT @@character_set_database, @@collation_database;
CREATE SCHEMA test2 CHARACTER SET latin1 COLLATE latin1_bin;
```

-```sql
+```
Query OK, 0 rows affected (0.09 sec)
```

@@ -311,7 +314,7 @@ Database changed
SELECT @@character_set_database, @@collation_database;
```

-```sql
+```
+--------------------------+----------------------+
| @@character_set_database | @@collation_database |
+--------------------------+----------------------+
@@ -347,7 +350,7 @@ For example:
CREATE TABLE t1(a int) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;
```

-```sql
+```
Query OK, 0 rows affected (0.08 sec)
```

@@ -379,7 +382,7 @@ Each string corresponds to a character set and a collation. When you use a strin

Example:

-```sql
+```
SELECT 'string';
SELECT _utf8mb4'string';
SELECT _utf8mb4'string' COLLATE utf8mb4_general_ci;
@@ -518,7 +521,7 @@ For a TiDB cluster that is already initialized, you can check whether the new co
SELECT VARIABLE_VALUE FROM mysql.tidb WHERE VARIABLE_NAME='new_collation_enabled';
```

-```sql
+```
+----------------+
| VARIABLE_VALUE |
+----------------+
@@ -535,39 +538,39 @@ This new framework supports semantically parsing collations. TiDB enables the ne

</CustomContent>

-Under the new framework, TiDB supports the `utf8_general_ci`, `utf8mb4_general_ci`, `utf8_unicode_ci`, `utf8mb4_unicode_ci`, `utf8mb4_0900_bin`, `utf8mb4_0900_ai_ci`, `gbk_chinese_ci`, and `gbk_bin` collations, which is compatible with MySQL.
+Under the new framework, TiDB supports the `utf8_general_ci`, `utf8mb4_general_ci`, `utf8_unicode_ci`, `utf8mb4_unicode_ci`, `utf8mb4_0900_bin`, `utf8mb4_0900_ai_ci`, `gbk_chinese_ci`, `gbk_bin`, `gb18030_chinese_ci`, and `gb18030_bin` collations, which are compatible with MySQL.

-When one of `utf8_general_ci`, `utf8mb4_general_ci`, `utf8_unicode_ci`, `utf8mb4_unicode_ci`, `utf8mb4_0900_ai_ci` and `gbk_chinese_ci` is used, the string comparison is case-insensitive and accent-insensitive. At the same time, TiDB also corrects the collation's `PADDING` behavior:
+When one of `utf8_general_ci`, `utf8mb4_general_ci`, `utf8_unicode_ci`, `utf8mb4_unicode_ci`, `utf8mb4_0900_ai_ci`, `gbk_chinese_ci` and `gb18030_chinese_ci` is used, the string comparison is case-insensitive and accent-insensitive. At the same time, TiDB also corrects the collation's `PADDING` behavior:

```sql
CREATE TABLE t(a varchar(20) charset utf8mb4 collate utf8mb4_general_ci PRIMARY KEY);
```

-```sql
+```
Query OK, 0 rows affected (0.00 sec)
```

```sql
INSERT INTO t VALUES ('A');
```

-```sql
+```
Query OK, 1 row affected (0.00 sec)
```

```sql
INSERT INTO t VALUES ('a');
```

-```sql
+```
ERROR 1062 (23000): Duplicate entry 'a' for key 't.PRIMARY' -- TiDB is compatible with the case-insensitive collation of MySQL.
```

```sql
INSERT INTO t VALUES ('a ');
```

-```sql
+```
ERROR 1062 (23000): Duplicate entry 'a ' for key 't.PRIMARY' -- TiDB modifies the `PADDING` behavior to be compatible with MySQL.
```

@@ -604,7 +607,7 @@ TiDB supports using the `COLLATE` clause to specify the collation of an expressi
SELECT 'a' = _utf8mb4 'A' collate utf8mb4_general_ci;
```

-```sql
+```
+-----------------------------------------------+
| 'a' = _utf8mb4 'A' collate utf8mb4_general_ci |
+-----------------------------------------------+
111 changes: 111 additions & 0 deletions character-set-gb18030.md
@@ -0,0 +1,111 @@
---
title: The GB18030 Character Set
summary: Learn the details of TiDB's support for the GB18030 character set.
---

# The GB18030 Character Set <span class="version-mark">New in v9.0.0</span>

Starting from v9.0.0, TiDB supports the GB18030-2022 character set. This document describes TiDB's support for and compatibility with the GB18030 character set.

```sql
SHOW CHARACTER SET WHERE CHARSET = 'gb18030';
```

```
+---------+---------------------------------+--------------------+--------+
| Charset | Description | Default collation | Maxlen |
+---------+---------------------------------+--------------------+--------+
| gb18030 | China National Standard GB18030 | gb18030_chinese_ci | 4 |
+---------+---------------------------------+--------------------+--------+
1 row in set (0.01 sec)
```

```sql
SHOW COLLATION WHERE CHARSET = 'gb18030';
```

```
+--------------------+---------+-----+---------+----------+---------+---------------+
| Collation | Charset | Id | Default | Compiled | Sortlen | Pad_attribute |
+--------------------+---------+-----+---------+----------+---------+---------------+
| gb18030_bin | gb18030 | 249 | | Yes | 1 | PAD SPACE |
| gb18030_chinese_ci | gb18030 | 248 | Yes | Yes | 1 | PAD SPACE |
+--------------------+---------+-----+---------+----------+---------+---------------+
2 rows in set (0.001 sec)
```
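
As a minimal sketch of how the character set might be used, the following statements create a table whose string column is stored in GB18030. The table name `t_gb18030` and its columns are hypothetical and chosen only for illustration, and the statements assume a TiDB cluster of v9.0.0 or later.

```sql
-- Hypothetical table for illustration; assumes TiDB v9.0.0 or later.
CREATE TABLE t_gb18030 (
    id INT PRIMARY KEY,
    name VARCHAR(20)
) CHARACTER SET gb18030 COLLATE gb18030_chinese_ci;

-- Inspect the character set and collation recorded for the table.
SHOW CREATE TABLE t_gb18030;
```

Specifying the collation explicitly, rather than relying on the default, keeps the table definition independent of how the cluster was bootstrapped.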

## MySQL compatibility

This section describes the compatibility of the GB18030 character set in TiDB with MySQL.

### Collation compatibility

In MySQL, the default collation for the GB18030 character set is `gb18030_chinese_ci`. In TiDB, the default collation for GB18030 depends on the configuration parameter [`new_collations_enabled_on_first_bootstrap`](https://docs.pingcap.com/tidb/stable/tidb-configuration-file/#new_collations_enabled_on_first_bootstrap):

- By default, `new_collations_enabled_on_first_bootstrap` is set to `true`, which means enabling the [new collation framework](/character-set-and-collation.md#new-framework-for-collations). In this case, the default collation for GB18030 is `gb18030_chinese_ci`.
- If `new_collations_enabled_on_first_bootstrap` is set to `false`, the new framework for collations is disabled, and the default collation for GB18030 is `gb18030_bin`.
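
To see which of the two cases applies to a cluster that is already initialized, you can query the `new_collation_enabled` value in the `mysql.tidb` table, which is the same check described in [Character Set and Collation](/character-set-and-collation.md). The following is a sketch only and does not show output, because the result depends on how the cluster was bootstrapped.

```sql
-- Check whether the new framework for collations is enabled on this cluster.
SELECT VARIABLE_VALUE FROM mysql.tidb WHERE VARIABLE_NAME='new_collation_enabled';

-- Then confirm the resulting default collation for GB18030.
SHOW CHARACTER SET WHERE CHARSET = 'gb18030';
```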

Additionally, the `gb18030_bin` collation in TiDB differs from the `gb18030_bin` collation in MySQL: TiDB converts GB18030 to `utf8mb4` and then performs binary sorting.

With the new framework for collations enabled, checking the character set and collations for GB18030 shows that TiDB's default collation for GB18030 is `gb18030_chinese_ci`:

```sql
SHOW CHARACTER SET WHERE CHARSET = 'gb18030';
```

```
+---------+---------------------------------+--------------------+--------+
| Charset | Description | Default collation | Maxlen |
+---------+---------------------------------+--------------------+--------+
| gb18030 | China National Standard GB18030 | gb18030_chinese_ci | 4 |
+---------+---------------------------------+--------------------+--------+
1 row in set (0.01 sec)
```

```sql
SHOW COLLATION WHERE CHARSET = 'gb18030';
```

```
+--------------------+---------+-----+---------+----------+---------+---------------+
| Collation | Charset | Id | Default | Compiled | Sortlen | Pad_attribute |
+--------------------+---------+-----+---------+----------+---------+---------------+
| gb18030_bin | gb18030 | 249 | | Yes | 1 | PAD SPACE |
| gb18030_chinese_ci | gb18030 | 248 | Yes | Yes | 1 | PAD SPACE |
+--------------------+---------+-----+---------+----------+---------+---------------+
2 rows in set (0.00 sec)
```

### Character compatibility

- TiDB supports GB18030-2022 characters, while MySQL supports GB18030-2005 characters. As a result, the encoding and decoding results for certain characters differ between the two systems.

- For invalid GB18030 characters, such as `0xFE39FE39`, MySQL allows writing them to the database in hexadecimal form and stores them as `?`. In TiDB, reading or writing invalid GB18030 characters in strict mode returns an error; in non-strict mode, TiDB allows reading or writing invalid GB18030 characters but returns a warning.
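
The strict and non-strict behavior described above could be exercised with a sketch like the following. The table `t_invalid` is hypothetical, and the comments restate the behavior described in this section rather than captured output, so treat them as expectations to verify on your own cluster.

```sql
-- Hypothetical table for illustration; assumes TiDB v9.0.0 or later.
CREATE TABLE t_invalid (a VARCHAR(10)) CHARACTER SET gb18030;

-- Strict mode: per the description above, the write is expected to return an error.
SET SESSION sql_mode = 'STRICT_TRANS_TABLES';
INSERT INTO t_invalid VALUES (0xFE39FE39);

-- Non-strict mode: the write is expected to succeed with a warning.
SET SESSION sql_mode = '';
INSERT INTO t_invalid VALUES (0xFE39FE39);
SHOW WARNINGS;
```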

### Others

- Currently, TiDB does not support using the `ALTER TABLE` statement to convert other character sets to `gb18030`, or to convert from `gb18030` to another character set.

- TiDB does not support using the `_gb18030` character set introducer. For example:

```sql
CREATE TABLE t(a CHAR(10) CHARSET BINARY);
Query OK, 0 rows affected (0.00 sec)
INSERT INTO t VALUES (_gb18030'啊');
ERROR 1115 (42000): Unsupported character introducer: 'gb18030'
```

- For binary characters in `ENUM` and `SET` types, TiDB currently treats them as using the `utf8mb4` character set.

## Component compatibility

- TiFlash, TiDB Data Migration (DM), and TiCDC currently do not support the GB18030 character set.

- Before v9.0.0, Dumpling does not support exporting tables with `charset=GB18030`, and TiDB Lightning does not support importing tables with `charset=GB18030`.

- Before v9.0.0, TiDB Backup & Restore (BR) does not support backing up or restoring tables with `charset=GB18030`. In addition, no version of BR supports restoring tables with `charset=GB18030` to TiDB clusters earlier than v9.0.0.

## See also

* [`SHOW CHARACTER SET`](/sql-statements/sql-statement-show-character-set.md)
* [Character Set and Collation](/character-set-and-collation.md)