From f71e2073ecc7548bbb757a0d3dc363c842e1014d Mon Sep 17 00:00:00 2001 From: ireneontheway Date: Thu, 23 Jul 2020 14:44:36 +0800 Subject: [PATCH 01/11] Update character-set-and-collation.md --- character-set-and-collation.md | 49 +++++++++++++++++++++++++--------- 1 file changed, 37 insertions(+), 12 deletions(-) diff --git a/character-set-and-collation.md b/character-set-and-collation.md index 59e1a82787283..9dd8ca56b0e2b 100644 --- a/character-set-and-collation.md +++ b/character-set-and-collation.md @@ -6,7 +6,15 @@ aliases: ['/docs/dev/character-set-and-collation/','/docs/dev/reference/sql/char # Character Set and Collation -A character set is a set of symbols and encodings. A collation is a set of rules for comparing characters in a character set. +This document introduces the character set and collation supported by TiDB. + +## Concepts + +A character set is a set of symbols and encodings. + +A collation is a set of rules for comparing characters in a character set. + +## TiDB character set and collation Currently, TiDB supports the following character sets: @@ -29,6 +37,22 @@ SHOW CHARACTER SET; 5 rows in set (0.00 sec) ``` +TiDB supports the following collation: + +```sql +mysql> show collation; ++-------------+---------+------+---------+----------+---------+ +| Collation | Charset | Id | Default | Compiled | Sortlen | ++-------------+---------+------+---------+----------+---------+ +| utf8mb4_bin | utf8mb4 | 46 | Yes | Yes | 1 | +| latin1_bin | latin1 | 47 | Yes | Yes | 1 | +| binary | binary | 63 | Yes | Yes | 1 | +| ascii_bin | ascii | 65 | Yes | Yes | 1 | +| utf8_bin | utf8 | 83 | Yes | Yes | 1 | ++-------------+---------+------+---------+----------+---------+ +5 rows in set (0.01 sec) +``` + > **Note:** > > The default collations in TiDB (binary collations, with the suffix `_bin`) are different than [the default collations in MySQL](https://dev.mysql.com/doc/refman/8.0/en/charset-charsets.html) (typically general collations, with the suffix `_general_ci`). This can cause incompatible behavior when specifying an explicit character set but relying on the implicit default collation to be chosen. @@ -51,11 +75,11 @@ SHOW COLLATION WHERE Charset = 'utf8mb4'; 2 rows in set (0.00 sec) ``` -## Cluster character set and collation +## Character set and collation in different layers -Not supported yet. +The character set and collation can be set at different layers. -## Database character set and collation +### Database character set and collation Each database has a character set and a collation. You can use the following statements to specify the database character set and collation: @@ -152,7 +176,7 @@ SELECT DEFAULT_CHARACTER_SET_NAME, DEFAULT_COLLATION_NAME FROM INFORMATION_SCHEMA.SCHEMATA WHERE SCHEMA_NAME = 'db_name'; ``` -## Table character set and collation +### Table character set and collation You can use the following statement to specify the character set and collation for tables: @@ -180,7 +204,7 @@ Query OK, 0 rows affected (0.08 sec) If the table character set and collation are not specified, the database character set and collation are used as their default values. -## Column character set and collation +### Column character set and collation You can use the following statement to specify the character set and collation for columns: @@ -196,7 +220,7 @@ col_name {ENUM | SET} (val_list) If the column character set and collation are not specified, the table character set and collation are used as their default values. -## String character sets and collation +### String character sets and collation Each string corresponds to a character set and a collation. When you use a string, this option is available: @@ -222,7 +246,7 @@ Rules: + Rule 2: If you specify `CHARACTER SET charset_name` but do not specify `COLLATE collation_name`, the `charset_name` character set and the default collation of `charset_name` are used. + Rule 3: If you specify neither `CHARACTER SET charset_name` nor `COLLATE collation_name`, the character set and collation given by the system variables `character_set_connection` and `collation_connection` are used. -## Client connection character set and collation +### Client connection character set and collation + The server character set and collation are the values of the `character_set_server` and `collation_server` system variables. @@ -246,7 +270,7 @@ You can use the following statement to set the character set and collation that SET character_set_connection = charset_name; ``` - `COLLATE` is optional, if absent, the default collation of the `charset_name` is used. + `COLLATE` is optional, if absent, the default collation of the `charset_name` is used to set the `collation_connection`. + `SET CHARACTER SET 'charset_name'` @@ -255,12 +279,13 @@ You can use the following statement to set the character set and collation that ```sql SET character_set_client = charset_name; SET character_set_results = charset_name; + SET charset_connection = @@charset_database; SET collation_connection = @@collation_database; ``` ## Optimization levels of character sets and collations -String > Column > Table > Database > Server > Cluster +String > Column > Table > Database > Server ## General rules on selecting character sets and collation @@ -345,13 +370,13 @@ If an expression involves multiple clauses of different collations, you need to + The coercibility value of the explicit `COLLATE` clause is `0`. + If the collations of two strings are incompatible, the coercibility value of the concatenation of two strings with different collations is `1`. Currently, all implemented collations are compatible with each other. -+ The column's collation has a coercibility value of `2`. ++ The collation of the column,`CAST()`, `CONVERT()` or `BINARY()` has a coercibility value of `2`. + The system constant (the string returned by `USER ()` or `VERSION ()`) has a coercibility value of `3`. + The coercibility value of constants is `4`. + The coercibility value of numbers or intermediate variables is `5`. + `NULL` or expressions derived from `NULL` has a coercibility value of `6`. -When inferring collations, TiDB prefers using the collation of expressions with lower coercibility values (the same as MySQL). If the coercibility values of two clauses are the same, the collation is determined according to the following priority: +When inferring collations, TiDB prefers using the collation of expressions with lower coercibility values. If the coercibility values of two clauses are the same, the collation is determined according to the following priority: binary > utf8mb4_bin > utf8mb4_general_ci > utf8_bin > utf8_general_ci > latin1_bin > ascii_bin From b02d9abbfae1af2a5d0d0e0b75f1ff37a8222a46 Mon Sep 17 00:00:00 2001 From: ireneontheway <48651140+ireneontheway@users.noreply.github.com> Date: Fri, 24 Jul 2020 10:38:12 +0800 Subject: [PATCH 02/11] Update character-set-and-collation.md Co-authored-by: Null not nil <67764674+nullnotnil@users.noreply.github.com> --- character-set-and-collation.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/character-set-and-collation.md b/character-set-and-collation.md index 9dd8ca56b0e2b..d086b4ec758ae 100644 --- a/character-set-and-collation.md +++ b/character-set-and-collation.md @@ -37,7 +37,7 @@ SHOW CHARACTER SET; 5 rows in set (0.00 sec) ``` -TiDB supports the following collation: +TiDB supports the following collations: ```sql mysql> show collation; From 59b5dc0cdeb2eaafc8b7c68ef1568cb288d845cf Mon Sep 17 00:00:00 2001 From: ireneontheway <48651140+ireneontheway@users.noreply.github.com> Date: Mon, 27 Jul 2020 11:58:13 +0800 Subject: [PATCH 03/11] Apply suggestions from code review Co-authored-by: Keke Yi <40977455+yikeke@users.noreply.github.com> --- character-set-and-collation.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/character-set-and-collation.md b/character-set-and-collation.md index d086b4ec758ae..e64068debfd46 100644 --- a/character-set-and-collation.md +++ b/character-set-and-collation.md @@ -283,7 +283,7 @@ You can use the following statement to set the character set and collation that SET collation_connection = @@collation_database; ``` -## Optimization levels of character sets and collations +## Selection priorities of character sets and collations String > Column > Table > Database > Server @@ -370,7 +370,7 @@ If an expression involves multiple clauses of different collations, you need to + The coercibility value of the explicit `COLLATE` clause is `0`. + If the collations of two strings are incompatible, the coercibility value of the concatenation of two strings with different collations is `1`. Currently, all implemented collations are compatible with each other. -+ The collation of the column,`CAST()`, `CONVERT()` or `BINARY()` has a coercibility value of `2`. ++ The collation of the column, `CAST()`, `CONVERT()`, or `BINARY()` has a coercibility value of `2`. + The system constant (the string returned by `USER ()` or `VERSION ()`) has a coercibility value of `3`. + The coercibility value of constants is `4`. + The coercibility value of numbers or intermediate variables is `5`. From 93a11dfa735bc8c4ed829f7ea3a42625fe26dbb8 Mon Sep 17 00:00:00 2001 From: ireneontheway <48651140+ireneontheway@users.noreply.github.com> Date: Tue, 28 Jul 2020 12:31:28 +0800 Subject: [PATCH 04/11] Update character-set-and-collation.md --- character-set-and-collation.md | 36 ++++++++++++++++++++++++++++++++-- 1 file changed, 34 insertions(+), 2 deletions(-) diff --git a/character-set-and-collation.md b/character-set-and-collation.md index e64068debfd46..7e3863a6ba662 100644 --- a/character-set-and-collation.md +++ b/character-set-and-collation.md @@ -10,9 +10,41 @@ This document introduces the character set and collation supported by TiDB. ## Concepts -A character set is a set of symbols and encodings. +A character set is a set of symbols and encodings. The default character set in TiDB is utf8mb4, which matches the default in MySQL 8.0 and above. In TiDB `utf8` and `utf8mb4` behave identically, and `utf8` is not restricted to a maximum of 3 bytes as in MySQL. UTF-8 encoding [accounts for between 83% - 100% of webpages](https://en.wikipedia.org/wiki/UTF-8), depending on the language and country. -A collation is a set of rules for comparing characters in a character set. +A collation is a set of rules for comparing characters in a character set, and the sorting order of characters. For example in a binary collation `A` and `a` do not compare as equal: + +{{< copyable "sql" >}} + +```sql +SET NAMES utf8mb4 COLLATE utf8mb4_bin; +SELECT 'A' = 'a'; +SET NAMES utf8mb4 COLLATE utf8mb4_general_ci; +SELECT 'A' = 'a'; +``` + +```sql +mysql> SELECT 'A' = 'a'; ++-----------+ +| 'A' = 'a' | ++-----------+ +| 0 | ++-----------+ +1 row in set (0.00 sec) + +mysql> SET NAMES utf8mb4 COLLATE utf8mb4_general_ci; +Query OK, 0 rows affected (0.00 sec) + +mysql> SELECT 'A' = 'a'; ++-----------+ +| 'A' = 'a' | ++-----------+ +| 1 | ++-----------+ +1 row in set (0.00 sec) +``` + +TiDB defaults to using a binary collation. This differs from MySQL, which uses a case-insensitive collation by default. ## TiDB character set and collation From 39fbf5feb3e8af308bcce81eb6daf69feea3ccc2 Mon Sep 17 00:00:00 2001 From: ireneontheway <48651140+ireneontheway@users.noreply.github.com> Date: Thu, 30 Jul 2020 17:18:55 +0800 Subject: [PATCH 05/11] Update character-set-and-collation.md Co-authored-by: Keke Yi <40977455+yikeke@users.noreply.github.com> --- character-set-and-collation.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/character-set-and-collation.md b/character-set-and-collation.md index 7e3863a6ba662..86bde6555e502 100644 --- a/character-set-and-collation.md +++ b/character-set-and-collation.md @@ -10,7 +10,7 @@ This document introduces the character set and collation supported by TiDB. ## Concepts -A character set is a set of symbols and encodings. The default character set in TiDB is utf8mb4, which matches the default in MySQL 8.0 and above. In TiDB `utf8` and `utf8mb4` behave identically, and `utf8` is not restricted to a maximum of 3 bytes as in MySQL. UTF-8 encoding [accounts for between 83% - 100% of webpages](https://en.wikipedia.org/wiki/UTF-8), depending on the language and country. +A character set is a set of symbols and encodings. The default character set in TiDB is utf8mb4, which matches the default in MySQL 8.0 and above. A collation is a set of rules for comparing characters in a character set, and the sorting order of characters. For example in a binary collation `A` and `a` do not compare as equal: From 8d40e350833ccc9ce38c1ef57acfd05d23824054 Mon Sep 17 00:00:00 2001 From: ireneontheway <48651140+ireneontheway@users.noreply.github.com> Date: Fri, 31 Jul 2020 10:43:09 +0800 Subject: [PATCH 06/11] Update character-set-and-collation.md --- character-set-and-collation.md | 35 ++++++++++++++++++++++++++++++++++ 1 file changed, 35 insertions(+) diff --git a/character-set-and-collation.md b/character-set-and-collation.md index e7f837222a203..23516e2cf7d96 100644 --- a/character-set-and-collation.md +++ b/character-set-and-collation.md @@ -106,6 +106,41 @@ SHOW COLLATION WHERE Charset = 'utf8mb4'; +--------------------+---------+------+---------+----------+---------+ 2 rows in set (0.00 sec) ``` +## `utf8` and `ut8mb4` in TiDB + +In MySQL, the character set `utf8` is limited to a maximum of three bytes. This is sufficient to store characters in the Basic Multilingual Plane (BMP), but not enough to store characters such as emojis. For this, it is recommended to use the character set `utf8mb4` instead. + +By default, TiDB provides the same 3-byte limit on `utf8` to ensure that data created in TiDB can still safely be restored in MySQL. This can be disabled by changing the value of `check-mb4-value-in-utf8` to `FALSE` in your TiDB configuration file. + +The following demonstrates the default behavior when inserting a 4-byte emoji character into a table. The `INSERT` statement fails for the `utf8` character set, but succeeds for `ut8mb4`: + +```sql +mysql> CREATE TABLE utf8_test ( + -> c char(1) NOT NULL + -> ) CHARACTER SET utf8; +Query OK, 0 rows affected (0.09 sec) + +mysql> CREATE TABLE utf8m4_test ( + -> c char(1) NOT NULL + -> ) CHARACTER SET utf8mb4; +Query OK, 0 rows affected (0.09 sec) + +mysql> INSERT INTO utf8_test VALUES ('😉'); +ERROR 1366 (HY000): incorrect utf8 value f09f9889(😉) for column c +mysql> INSERT INTO utf8m4_test VALUES ('😉'); +Query OK, 1 row affected (0.02 sec) + +mysql> SELECT char_length(c), length(c), c FROM utf8_test; +Empty set (0.01 sec) + +mysql> SELECT char_length(c), length(c), c FROM utf8m4_test; ++----------------+-----------+------+ +| char_length(c) | length(c) | c | ++----------------+-----------+------+ +| 1 | 4 | 😉 | ++----------------+-----------+------+ +1 row in set (0.00 sec) +``` ## Character set and collation in different layers From 2462e45b993e4bf95c4628ae1f5b766d132c4e24 Mon Sep 17 00:00:00 2001 From: ireneontheway <48651140+ireneontheway@users.noreply.github.com> Date: Fri, 31 Jul 2020 10:43:52 +0800 Subject: [PATCH 07/11] Update character-set-and-collation.md --- character-set-and-collation.md | 1 + 1 file changed, 1 insertion(+) diff --git a/character-set-and-collation.md b/character-set-and-collation.md index 23516e2cf7d96..4a54c37a0bcfa 100644 --- a/character-set-and-collation.md +++ b/character-set-and-collation.md @@ -106,6 +106,7 @@ SHOW COLLATION WHERE Charset = 'utf8mb4'; +--------------------+---------+------+---------+----------+---------+ 2 rows in set (0.00 sec) ``` + ## `utf8` and `ut8mb4` in TiDB In MySQL, the character set `utf8` is limited to a maximum of three bytes. This is sufficient to store characters in the Basic Multilingual Plane (BMP), but not enough to store characters such as emojis. For this, it is recommended to use the character set `utf8mb4` instead. From c793f34b1888a6c0d25453b6bb82e381f58556f6 Mon Sep 17 00:00:00 2001 From: ireneontheway <48651140+ireneontheway@users.noreply.github.com> Date: Fri, 31 Jul 2020 11:47:17 +0800 Subject: [PATCH 08/11] Apply suggestions from code review Co-authored-by: Keke Yi <40977455+yikeke@users.noreply.github.com> --- character-set-and-collation.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/character-set-and-collation.md b/character-set-and-collation.md index 4a54c37a0bcfa..eeddaa7b9be66 100644 --- a/character-set-and-collation.md +++ b/character-set-and-collation.md @@ -6,7 +6,7 @@ aliases: ['/docs/dev/character-set-and-collation/','/docs/dev/reference/sql/char # Character Set and Collation -This document introduces the character set and collation supported by TiDB. +This document introduces the character sets and collations supported by TiDB. ## Concepts @@ -46,7 +46,7 @@ mysql> SELECT 'A' = 'a'; TiDB defaults to using a binary collation. This differs from MySQL, which uses a case-insensitive collation by default. -## TiDB character set and collation +## Character sets and collations supported by TiDB Currently, TiDB supports the following character sets: From 60171ced647e6ce2b2578d479263608257f80586 Mon Sep 17 00:00:00 2001 From: ireneontheway <48651140+ireneontheway@users.noreply.github.com> Date: Mon, 3 Aug 2020 16:02:44 +0800 Subject: [PATCH 09/11] Update character-set-and-collation.md --- character-set-and-collation.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/character-set-and-collation.md b/character-set-and-collation.md index eeddaa7b9be66..2e09f956bbd0c 100644 --- a/character-set-and-collation.md +++ b/character-set-and-collation.md @@ -23,7 +23,7 @@ SET NAMES utf8mb4 COLLATE utf8mb4_general_ci; SELECT 'A' = 'a'; ``` -```sql +``` mysql> SELECT 'A' = 'a'; +-----------+ | 'A' = 'a' | From bbdd7a42cdef53c744f8717effd1053fdbce73fa Mon Sep 17 00:00:00 2001 From: ireneontheway <48651140+ireneontheway@users.noreply.github.com> Date: Mon, 3 Aug 2020 16:04:59 +0800 Subject: [PATCH 10/11] Update character-set-and-collation.md --- character-set-and-collation.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/character-set-and-collation.md b/character-set-and-collation.md index 2e09f956bbd0c..16052ff87b279 100644 --- a/character-set-and-collation.md +++ b/character-set-and-collation.md @@ -23,8 +23,8 @@ SET NAMES utf8mb4 COLLATE utf8mb4_general_ci; SELECT 'A' = 'a'; ``` -``` -mysql> SELECT 'A' = 'a'; +```sql +SELECT 'A' = 'a'; +-----------+ | 'A' = 'a' | +-----------+ From b2708d94bce94f96211ee98bf72ef33d33636f52 Mon Sep 17 00:00:00 2001 From: ireneontheway <48651140+ireneontheway@users.noreply.github.com> Date: Mon, 3 Aug 2020 16:25:38 +0800 Subject: [PATCH 11/11] Update character-set-and-collation.md --- character-set-and-collation.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/character-set-and-collation.md b/character-set-and-collation.md index 16052ff87b279..eeddaa7b9be66 100644 --- a/character-set-and-collation.md +++ b/character-set-and-collation.md @@ -24,7 +24,7 @@ SELECT 'A' = 'a'; ``` ```sql -SELECT 'A' = 'a'; +mysql> SELECT 'A' = 'a'; +-----------+ | 'A' = 'a' | +-----------+