Update character-set-and-collation.md #3402
/label size/smal,special-week,translation/from-docs-cn,needs-cherry-pick-4.0
These labels are not found
/cc TomShawn, yikeke
character-set-and-collation.md
Outdated
This document introduces the character set and collation supported by TiDB.

## Concepts

A character set is a set of symbols and encodings.

A collation is a set of rules for comparing characters in a character set.
May I suggest the following as an intro:
A character set is a set of symbols and encodings. The default character set in TiDB is utf8mb4, which matches the default in MySQL 8.0 and above. UTF-8 encoding accounts for between 83% and 100% of webpages, depending on the language and country.
A collation is a set of rules for comparing characters in a character set and for determining their sort order. For example, in a binary collation, `A` and `a` do not compare as equal:
{{< copyable "sql" >}}
SET NAMES utf8mb4 COLLATE utf8mb4_bin;
SELECT 'A' = 'a';
SET NAMES utf8mb4 COLLATE utf8mb4_general_ci;
SELECT 'A' = 'a';

mysql> SELECT 'A' = 'a';
+-----------+
| 'A' = 'a' |
+-----------+
| 0 |
+-----------+
1 row in set (0.00 sec)
mysql> SET NAMES utf8mb4 COLLATE utf8mb4_general_ci;
Query OK, 0 rows affected (0.00 sec)
mysql> SELECT 'A' = 'a';
+-----------+
| 'A' = 'a' |
+-----------+
| 1 |
+-----------+
1 row in set (0.00 sec)

TiDB defaults to using a binary collation. This differs from MySQL, which uses a case-insensitive collation by default.
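The difference between a binary and a case-insensitive comparison can also be sketched outside the database. The following is a minimal Python illustration, not TiDB's implementation. It assumes that comparing the encoded UTF-8 bytes approximates a `_bin` collation and that `str.casefold()` approximates a `_ci` collation. Real collations such as utf8mb4_general_ci use their own weight tables, so this is only an analogy, and the function names are hypothetical.

```python
def binary_equal(a: str, b: str) -> bool:
    """Compare the UTF-8 encodings byte by byte (rough stand-in for a _bin collation)."""
    return a.encode("utf-8") == b.encode("utf-8")


def case_insensitive_equal(a: str, b: str) -> bool:
    """Compare after case folding (rough stand-in for a _ci collation)."""
    return a.casefold() == b.casefold()


words = ["b", "A", "a", "B"]

# Binary ordering sorts by encoded bytes, so all uppercase ASCII letters come first.
binary_sorted = sorted(words, key=lambda s: s.encode("utf-8"))

# Case-insensitive ordering treats "A" and "a" as ties; ties keep their input order.
ci_sorted = sorted(words, key=str.casefold)

print(int(binary_equal("A", "a")))            # 0
print(int(case_insensitive_equal("A", "a")))  # 1
print(binary_sorted)                          # ['A', 'B', 'a', 'b']
print(ci_sorted)                              # ['A', 'a', 'b', 'B']
```

The same collation therefore affects both equality checks and the result order of a sort, which is why the choice of default matters when migrating between MySQL and TiDB.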
@wjhuang2016 PTAL
In TiDB utf8 and utf8mb4 behave identically, and utf8 is not restricted to a maximum of 3 bytes as in MySQL.
In fact, TiDB checks whether a character is longer than 3 bytes unless check-mb4-value-in-utf8 is disabled.
@nullnotnil Do you need to modify your suggested intro according to #3402 (comment)?
I've updated it.
@nullnotnil Do you need to modify your suggested intro according to #3402 (comment)?
Yes, I verified and my earlier comment was incorrect. I have updated it to describe the check-mb4 option. Edit: I will add a section on utf8mb4 vs utf8 which will make this clearer. It's useful to describe, but doesn't have to be in the intro.
Co-authored-by: Null not nil <67764674+nullnotnil@users.noreply.github.com>
yikeke
left a comment
Rest LGTM
Co-authored-by: Keke Yi <40977455+yikeke@users.noreply.github.com>
PTAL again @nullnotnil @wjhuang2016
| +--------------------+---------+------+---------+----------+---------+ | ||
| 2 rows in set (0.00 sec) | ||
| ``` | ||
|
|
utf8 and utf8mb4 in MySQL
In MySQL, the character set utf8 is limited to a maximum of three bytes. This is sufficient to store characters in the Basic Multilingual Plane (BMP), but not enough to store characters such as emojis. For this, it is recommended to use the character set utf8mb4 instead.
By default, TiDB enforces the same 3-byte limit on utf8 to ensure that data created in TiDB can still safely be restored in MySQL. This check can be disabled by setting check-mb4-value-in-utf8 to FALSE in your TiDB configuration file.
The following demonstrates the default behavior when inserting a 4-byte emoji character into a table. The INSERT statement fails for the utf8 character set, but succeeds for utf8mb4:
mysql> CREATE TABLE utf8_test (
-> c char(1) NOT NULL
-> ) CHARACTER SET utf8;
Query OK, 0 rows affected (0.09 sec)
mysql> CREATE TABLE utf8m4_test (
-> c char(1) NOT NULL
-> ) CHARACTER SET utf8mb4;
Query OK, 0 rows affected (0.09 sec)
mysql> INSERT INTO utf8_test VALUES ('😉');
ERROR 1366 (HY000): incorrect utf8 value f09f9889(😉) for column c
mysql> INSERT INTO utf8m4_test VALUES ('😉');
Query OK, 1 row affected (0.02 sec)
mysql> SELECT char_length(c), length(c), c FROM utf8_test;
Empty set (0.01 sec)
mysql> SELECT char_length(c), length(c), c FROM utf8m4_test;
+----------------+-----------+------+
| char_length(c) | length(c) | c |
+----------------+-----------+------+
| 1 | 4 | 😉 |
+----------------+-----------+------+
1 row in set (0.00 sec)
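To see why the emoji is rejected under a 3-byte limit, it helps to look at how many bytes each character needs in UTF-8. The following Python sketch is only an illustration of the encoding widths, not of TiDB's actual check. The helper names `max_utf8_bytes` and `fits_three_byte_utf8` are hypothetical.

```python
def max_utf8_bytes(text: str) -> int:
    """Return the byte width of the widest character in text when encoded as UTF-8."""
    return max((len(ch.encode("utf-8")) for ch in text), default=0)


def fits_three_byte_utf8(text: str) -> bool:
    """True if every character needs at most 3 bytes, that is, lies within the BMP."""
    return max_utf8_bytes(text) <= 3


# Print each sample with its encoded bytes (hex), byte count, and the 3-byte check.
for sample in ("a", "é", "中", "😉"):
    encoded = sample.encode("utf-8")
    print(sample, encoded.hex(), len(encoded), fits_three_byte_utf8(sample))
```

For the emoji, the encoded bytes are `f09f9889`, the same value reported in the error message above, and its length of 4 bytes matches the `length(c)` column in the query result.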
May I suggest changing the header to `utf8` and `utf8mb4` in TiDB to focus our topic on TiDB? @nullnotnil
Works for me
Please apply the suggestion~ @ireneontheway
Done.
Co-authored-by: Keke Yi <40977455+yikeke@users.noreply.github.com>
PTAL again @nullnotnil @wjhuang2016
yikeke
left a comment
Rest LGTM
Co-authored-by: Keke Yi <40977455+yikeke@users.noreply.github.com>
LGTM
@nullnotnil, Thanks for your review. However, LGTM is restricted to Reviewers or higher roles. See the corresponding SIG page for more information. Related SIGs: docs(slack).
wjhuang2016
left a comment
LGTM
Signed-off-by: ti-srebot <ti-srebot@pingcap.com>
cherry pick to release-4.0 in PR #3540
Signed-off-by: ti-srebot <ti-srebot@pingcap.com>
Co-authored-by: ireneontheway <48651140+ireneontheway@users.noreply.github.com>