Update character-set-and-collation.md #3402
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -6,7 +6,47 @@ aliases: ['/docs/dev/character-set-and-collation/','/docs/dev/reference/sql/char | |
|
|
||
| # Character Set and Collation | ||
|
|
||
| A character set is a set of symbols and encodings. A collation is a set of rules for comparing characters in a character set. | ||
| This document introduces the character sets and collations supported by TiDB. | ||
|
|
||
| ## Concepts | ||
|
|
||
| A character set is a set of symbols and encodings. The default character set in TiDB is utf8mb4, which matches the default in MySQL 8.0 and above. | ||
|
|
||
| A collation is a set of rules for comparing the characters in a character set, and it also determines their sort order. For example, in a binary collation, `A` and `a` do not compare as equal: | ||
|
|
||
| {{< copyable "sql" >}} | ||
|
|
||
| ```sql | ||
| SET NAMES utf8mb4 COLLATE utf8mb4_bin; | ||
| SELECT 'A' = 'a'; | ||
| SET NAMES utf8mb4 COLLATE utf8mb4_general_ci; | ||
| SELECT 'A' = 'a'; | ||
| ``` | ||
|
|
||
| ```sql | ||
| mysql> SELECT 'A' = 'a'; | ||
| +-----------+ | ||
| | 'A' = 'a' | | ||
| +-----------+ | ||
| | 0 | | ||
| +-----------+ | ||
| 1 row in set (0.00 sec) | ||
|
|
||
| mysql> SET NAMES utf8mb4 COLLATE utf8mb4_general_ci; | ||
| Query OK, 0 rows affected (0.00 sec) | ||
|
|
||
| mysql> SELECT 'A' = 'a'; | ||
| +-----------+ | ||
| | 'A' = 'a' | | ||
| +-----------+ | ||
| | 1 | | ||
| +-----------+ | ||
| 1 row in set (0.00 sec) | ||
| ``` | ||
|
|
||
| TiDB defaults to using a binary collation. This differs from MySQL, which uses a case-insensitive collation by default. | ||
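The difference between the two results above can be sketched outside the database. This is a rough illustration only: `equal_binary` and `equal_case_insensitive` are hypothetical helper names, and Python's `casefold()` is merely an approximation of a `_ci` collation (real collations such as `utf8mb4_general_ci` follow their own comparison tables).

```python
def equal_binary(a: str, b: str) -> bool:
    """Compare the way a *_bin collation does: by the encoded bytes."""
    return a.encode("utf-8") == b.encode("utf-8")


def equal_case_insensitive(a: str, b: str) -> bool:
    """Approximate a *_ci collation by folding case before comparing."""
    return a.casefold() == b.casefold()


print(equal_binary("A", "a"))            # False, like the result 0 above
print(equal_case_insensitive("A", "a"))  # True, like the result 1 above
```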
|
|
||
| ## Character sets and collations supported by TiDB | ||
|
|
||
| Currently, TiDB supports the following character sets: | ||
|
|
||
|
|
@@ -29,6 +69,22 @@ SHOW CHARACTER SET; | |
| 5 rows in set (0.00 sec) | ||
| ``` | ||
|
|
||
| TiDB supports the following collations: | ||
|
|
||
| ```sql | ||
| mysql> show collation; | ||
| +-------------+---------+------+---------+----------+---------+ | ||
| | Collation | Charset | Id | Default | Compiled | Sortlen | | ||
| +-------------+---------+------+---------+----------+---------+ | ||
| | utf8mb4_bin | utf8mb4 | 46 | Yes | Yes | 1 | | ||
| | latin1_bin | latin1 | 47 | Yes | Yes | 1 | | ||
| | binary | binary | 63 | Yes | Yes | 1 | | ||
| | ascii_bin | ascii | 65 | Yes | Yes | 1 | | ||
| | utf8_bin | utf8 | 83 | Yes | Yes | 1 | | ||
| +-------------+---------+------+---------+----------+---------+ | ||
| 5 rows in set (0.01 sec) | ||
| ``` | ||
|
|
||
| > **Note:** | ||
| > | ||
| > The default collations in TiDB (binary collations, with the suffix `_bin`) are different from [the default collations in MySQL](https://dev.mysql.com/doc/refman/8.0/en/charset-charsets.html) (typically general collations, with the suffix `_general_ci`). This can cause incompatible behavior when you specify an explicit character set but rely on the implicit default collation. | ||
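The fallback that the note describes can be sketched as a lookup. The mapping below is copied from the `SHOW COLLATION` output in this diff; the function name `resolve_collation` is hypothetical, and this is an illustration of the rule, not TiDB's implementation.

```python
from typing import Optional

# Default collation per character set, taken from the SHOW COLLATION output above.
TIDB_DEFAULT_COLLATION = {
    "utf8mb4": "utf8mb4_bin",
    "latin1": "latin1_bin",
    "binary": "binary",
    "ascii": "ascii_bin",
    "utf8": "utf8_bin",
}


def resolve_collation(charset: str, collation: Optional[str] = None) -> str:
    """Return the explicit collation, or the character set's default if none is given."""
    if collation is not None:
        return collation
    return TIDB_DEFAULT_COLLATION[charset]


print(resolve_collation("utf8mb4"))                        # utf8mb4_bin
print(resolve_collation("utf8mb4", "utf8mb4_general_ci"))  # utf8mb4_general_ci
```

Naming the collation explicitly, as in the second call, avoids depending on which default a given server picks.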
|
|
@@ -51,11 +107,47 @@ SHOW COLLATION WHERE Charset = 'utf8mb4'; | |
| 2 rows in set (0.00 sec) | ||
| ``` | ||
|
|
||
|
Contributor
May I suggest changing the header to

Works for me

Contributor
Please apply the suggestion~ @ireneontheway

Contributor
Author
Done. |
||
| ## Cluster character set and collation | ||
| ## `utf8` and `utf8mb4` in TiDB | ||
|
|
||
| In MySQL, the character set `utf8` is limited to a maximum of three bytes. This is sufficient to store characters in the Basic Multilingual Plane (BMP), but not enough to store characters such as emojis. For this, it is recommended to use the character set `utf8mb4` instead. | ||
|
|
||
| By default, TiDB provides the same 3-byte limit on `utf8` to ensure that data created in TiDB can still safely be restored in MySQL. This can be disabled by changing the value of `check-mb4-value-in-utf8` to `FALSE` in your TiDB configuration file. | ||
|
|
||
| The following demonstrates the default behavior when inserting a 4-byte emoji character into a table. The `INSERT` statement fails for the `utf8` character set, but succeeds for `utf8mb4`: | ||
|
|
||
| ```sql | ||
| mysql> CREATE TABLE utf8_test ( | ||
| -> c char(1) NOT NULL | ||
| -> ) CHARACTER SET utf8; | ||
| Query OK, 0 rows affected (0.09 sec) | ||
|
|
||
| mysql> CREATE TABLE utf8m4_test ( | ||
| -> c char(1) NOT NULL | ||
| -> ) CHARACTER SET utf8mb4; | ||
| Query OK, 0 rows affected (0.09 sec) | ||
|
|
||
| mysql> INSERT INTO utf8_test VALUES ('😉'); | ||
| ERROR 1366 (HY000): incorrect utf8 value f09f9889(😉) for column c | ||
| mysql> INSERT INTO utf8m4_test VALUES ('😉'); | ||
| Query OK, 1 row affected (0.02 sec) | ||
|
|
||
| mysql> SELECT char_length(c), length(c), c FROM utf8_test; | ||
| Empty set (0.01 sec) | ||
|
|
||
| mysql> SELECT char_length(c), length(c), c FROM utf8m4_test; | ||
| +----------------+-----------+------+ | ||
| | char_length(c) | length(c) | c | | ||
| +----------------+-----------+------+ | ||
| | 1 | 4 | 😉 | | ||
| +----------------+-----------+------+ | ||
| 1 row in set (0.00 sec) | ||
| ``` | ||
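The byte counts behind this behavior can be checked independently of the database. The sketch below uses only Python's standard UTF-8 encoder; `fits_in_three_bytes` is a hypothetical helper name, and the check models only the 3-byte limit, not the server's actual validation logic.

```python
def fits_in_three_bytes(text: str) -> bool:
    """True if every character encodes to at most 3 bytes in UTF-8 (that is, lies in the BMP)."""
    return all(len(ch.encode("utf-8")) <= 3 for ch in text)


emoji = "\U0001F609"  # the winking face used in the example above

print(emoji.encode("utf-8").hex())             # f09f9889, the value in the error message
print(len(emoji), len(emoji.encode("utf-8")))  # 1 4, matching char_length(c) and length(c)
print(fits_in_three_bytes("é"), fits_in_three_bytes(emoji))  # True False
```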
|
|
||
| ## Character set and collation in different layers | ||
|
|
||
| Not supported yet. | ||
| The character set and collation can be set at different layers. | ||
|
|
||
| ## Database character set and collation | ||
| ### Database character set and collation | ||
|
|
||
| Each database has a character set and a collation. You can use the following statements to specify the database character set and collation: | ||
|
|
||
|
|
@@ -152,7 +244,7 @@ SELECT DEFAULT_CHARACTER_SET_NAME, DEFAULT_COLLATION_NAME | |
| FROM INFORMATION_SCHEMA.SCHEMATA WHERE SCHEMA_NAME = 'db_name'; | ||
| ``` | ||
|
|
||
| ## Table character set and collation | ||
| ### Table character set and collation | ||
|
|
||
| You can use the following statement to specify the character set and collation for tables: | ||
|
|
||
|
|
@@ -180,7 +272,7 @@ Query OK, 0 rows affected (0.08 sec) | |
|
|
||
| If the table character set and collation are not specified, the database character set and collation are used as their default values. | ||
|
|
||
| ## Column character set and collation | ||
| ### Column character set and collation | ||
|
|
||
| You can use the following statement to specify the character set and collation for columns: | ||
|
|
||
|
|
@@ -196,7 +288,7 @@ col_name {ENUM | SET} (val_list) | |
|
|
||
| If the column character set and collation are not specified, the table character set and collation are used as their default values. | ||
|
|
||
| ## String character sets and collation | ||
| ### String character sets and collation | ||
|
|
||
| Each string corresponds to a character set and a collation. When you use a string, this option is available: | ||
|
|
||
|
|
@@ -222,7 +314,7 @@ Rules: | |
| + Rule 2: If you specify `CHARACTER SET charset_name` but do not specify `COLLATE collation_name`, the `charset_name` character set and the default collation of `charset_name` are used. | ||
| + Rule 3: If you specify neither `CHARACTER SET charset_name` nor `COLLATE collation_name`, the character set and collation given by the system variables `character_set_connection` and `collation_connection` are used. | ||
|
|
||
| ## Client connection character set and collation | ||
| ### Client connection character set and collation | ||
|
|
||
| + The server character set and collation are the values of the `character_set_server` and `collation_server` system variables. | ||
|
|
||
|
|
@@ -246,7 +338,7 @@ You can use the following statement to set the character set and collation that | |
| SET character_set_connection = charset_name; | ||
| ``` | ||
|
|
||
| `COLLATE` is optional, if absent, the default collation of the `charset_name` is used. | ||
| `COLLATE` is optional. If it is absent, the default collation of `charset_name` is used to set `collation_connection`. | ||
|
|
||
| + `SET CHARACTER SET 'charset_name'` | ||
|
|
||
|
|
@@ -255,12 +347,13 @@ You can use the following statement to set the character set and collation that | |
| ```sql | ||
| SET character_set_client = charset_name; | ||
| SET character_set_results = charset_name; | ||
| SET charset_connection = @@charset_database; | ||
| SET collation_connection = @@collation_database; | ||
| ``` | ||
|
|
||
| ## Optimization levels of character sets and collations | ||
| ## Selection priorities of character sets and collations | ||
|
|
||
| String > Column > Table > Database > Server > Cluster | ||
| String > Column > Table > Database > Server | ||
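One way to read this priority list is as a fall-through lookup: the most specific layer that sets a value wins. The sketch below assumes that reading; the layer keys and the function name `effective_charset` are hypothetical and serve only to illustrate the ordering.

```python
from typing import Dict, Optional

# Most specific layer first, mirroring "String > Column > Table > Database > Server".
LAYERS = ("string", "column", "table", "database", "server")


def effective_charset(settings: Dict[str, Optional[str]]) -> Optional[str]:
    """Return the value from the most specific layer that defines one."""
    for layer in LAYERS:
        value = settings.get(layer)
        if value is not None:
            return value
    return None


# The table-level setting overrides the database and server defaults.
print(effective_charset({"server": "utf8mb4", "database": "utf8", "table": "latin1"}))  # latin1
```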
|
|
||
| ## General rules on selecting character sets and collation | ||
|
|
||
|
|
@@ -345,13 +438,13 @@ If an expression involves multiple clauses of different collations, you need to | |
|
|
||
| + The coercibility value of the explicit `COLLATE` clause is `0`. | ||
| + If the collations of two strings are incompatible, the coercibility value of the concatenation of two strings with different collations is `1`. Currently, all implemented collations are compatible with each other. | ||
| + The column's collation has a coercibility value of `2`. | ||
| + The collation of the column, `CAST()`, `CONVERT()`, or `BINARY()` has a coercibility value of `2`. | ||
| + The system constant (the string returned by `USER()` or `VERSION()`) has a coercibility value of `3`. | ||
| + The coercibility value of constants is `4`. | ||
| + The coercibility value of numbers or intermediate variables is `5`. | ||
| + `NULL` and expressions derived from `NULL` have a coercibility value of `6`. | ||
|
|
||
| When inferring collations, TiDB prefers using the collation of expressions with lower coercibility values (the same as MySQL). If the coercibility values of two clauses are the same, the collation is determined according to the following priority: | ||
| When inferring collations, TiDB prefers using the collation of expressions with lower coercibility values. If the coercibility values of two clauses are the same, the collation is determined according to the following priority: | ||
|
|
||
| binary > utf8mb4_bin > utf8mb4_general_ci > utf8_bin > utf8_general_ci > latin1_bin > ascii_bin | ||
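The two-step choice described above (lower coercibility first, then the priority list as a tie-breaker) can be sketched as follows. This is a simplified model: `infer_collation` is a hypothetical name, each operand is assumed to be a `(coercibility, collation)` pair, and the sketch does not model the case where conflicting explicit `COLLATE` clauses need to be resolved by the user.

```python
from typing import Tuple

# Tie-break order from the line above, highest priority first.
PRIORITY = [
    "binary",
    "utf8mb4_bin",
    "utf8mb4_general_ci",
    "utf8_bin",
    "utf8_general_ci",
    "latin1_bin",
    "ascii_bin",
]

Operand = Tuple[int, str]  # (coercibility value, collation name)


def infer_collation(left: Operand, right: Operand) -> str:
    """Pick the collation of the operand with lower coercibility; break ties by PRIORITY."""
    left_coercibility, left_collation = left
    right_coercibility, right_collation = right
    if left_coercibility != right_coercibility:
        return left_collation if left_coercibility < right_coercibility else right_collation
    return min(left_collation, right_collation, key=PRIORITY.index)


# A column (2) beats a string constant (4), whatever the constant's collation is.
print(infer_collation((2, "utf8mb4_general_ci"), (4, "utf8mb4_bin")))  # utf8mb4_general_ci
# Two columns tie on coercibility, so the priority list decides.
print(infer_collation((2, "utf8_bin"), (2, "utf8mb4_general_ci")))     # utf8mb4_general_ci
```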
|
|
||
|
|
||