Conversation

@ireneontheway
Contributor

@ireneontheway ireneontheway commented Jul 23, 2020

What is changed, added or deleted? (Required)

Which TiDB version(s) do your changes apply to? (Required)

  • master (the latest development version)
  • v4.0 (TiDB 4.0 versions)
  • v3.1 (TiDB 3.1 versions)
  • v3.0 (TiDB 3.0 versions)
  • v2.1 (TiDB 2.1 versions)

What is the related PR or file link(s)?

Do your changes match any of the following descriptions?

  • Delete files
  • Change aliases
  • Have version specific changes
  • Might cause conflicts

@ireneontheway
Contributor Author

/label size/smal,special-week,translation/from-docs-cn,needs-cherry-pick-4.0

@ti-srebot
Contributor

These labels are not found size/smal.

@ti-srebot ti-srebot added special-week PR from Document Special Week. translation/from-docs-cn This PR is translated from a PR in pingcap/docs-cn. labels Jul 23, 2020
@ireneontheway
Contributor Author

/cc TomShawn, yikeke

@ti-srebot ti-srebot requested review from TomShawn and yikeke July 23, 2020 06:49
Comment on lines 9 to 15
This document introduces the character set and collation supported by TiDB.

## Concepts

A character set is a set of symbols and encodings.

A collation is a set of rules for comparing characters in a character set.
@ghost ghost Jul 23, 2020

May I suggest the following as an intro:

A character set is a set of symbols and encodings. The default character set in TiDB is utf8mb4, which matches the default in MySQL 8.0 and above. UTF-8 encoding accounts for between 83% and 100% of webpages, depending on the language and country.

A collation is a set of rules for comparing characters in a character set, and it also determines the sorting order of characters. For example, in a binary collation, A and a do not compare as equal:

{{< copyable "sql" >}}

SET NAMES utf8mb4 COLLATE utf8mb4_bin;
SELECT 'A' = 'a';
SET NAMES utf8mb4 COLLATE utf8mb4_general_ci;
SELECT 'A' = 'a';

mysql> SELECT 'A' = 'a';
+-----------+
| 'A' = 'a' |
+-----------+
|         0 |
+-----------+
1 row in set (0.00 sec)

mysql> SET NAMES utf8mb4 COLLATE utf8mb4_general_ci;
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT 'A' = 'a';
+-----------+
| 'A' = 'a' |
+-----------+
|         1 |
+-----------+
1 row in set (0.00 sec)

TiDB defaults to using a binary collation. This differs from MySQL, which uses a case-insensitive collation by default.
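
The difference between the two results above can be sketched in Python as a rough analogy. This is only an illustration under stated assumptions, not how TiDB or MySQL implement collations: the helper names `equal_binary` and `equal_case_insensitive` are hypothetical, and case folding is only an approximation of a `_ci` collation such as utf8mb4_general_ci.

```python
def equal_binary(a: str, b: str) -> bool:
    """Binary-style comparison: the encoded bytes must match exactly."""
    return a.encode("utf-8") == b.encode("utf-8")


def equal_case_insensitive(a: str, b: str) -> bool:
    """Approximation of a case-insensitive collation: fold case, then compare."""
    return a.casefold() == b.casefold()


if __name__ == "__main__":
    # Mirrors SELECT 'A' = 'a' under a binary collation (0) and a _ci collation (1).
    print(int(equal_binary("A", "a")))
    print(int(equal_case_insensitive("A", "a")))
```

The point of the sketch is that the stored data is identical in both cases; only the comparison rule changes.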

Member

> In TiDB utf8 and utf8mb4 behave identically, and utf8 is not restricted to a maximum of 3 bytes as in MySQL.

In fact, TiDB checks whether a character in utf8 is longer than 3 bytes, unless check-mb4-value-in-utf8 is disabled.

Contributor

@nullnotnil Do you need to modify your suggested intro according to #3402 (comment)?

Contributor Author

I've updated it.

@ghost ghost Jul 29, 2020

> @nullnotnil Do you need to modify your suggested intro according to #3402 (comment)?

Yes, I verified and my earlier comment was incorrect. I have updated it to describe the check-mb4 option. Edit: I will add a section on utf8mb4 vs utf8 which will make this clearer. It's useful to describe, but doesn't have to be in the intro.

Co-authored-by: Null not nil <67764674+nullnotnil@users.noreply.github.com>
Contributor

@yikeke yikeke left a comment

Rest LGTM

Co-authored-by: Keke Yi <40977455+yikeke@users.noreply.github.com>
@yikeke yikeke added the requires-followup This PR requires a follow-up task after being merged. label Jul 27, 2020
@yikeke yikeke removed the request for review from TomShawn July 27, 2020 09:33
@yikeke
Contributor

yikeke commented Jul 28, 2020

PTAL again @nullnotnil @wjhuang2016

+--------------------+---------+------+---------+----------+---------+
2 rows in set (0.00 sec)
```

@ghost ghost Jul 29, 2020

utf8 and utf8mb4 in MySQL

In MySQL, the character set utf8 is limited to a maximum of three bytes. This is sufficient to store characters in the Basic Multilingual Plane (BMP), but not enough to store characters such as emojis. For this, it is recommended to use the character set utf8mb4 instead.

By default, TiDB provides the same 3-byte limit on utf8 to ensure that data created in TiDB can still safely be restored in MySQL. This can be disabled by changing the value of check-mb4-value-in-utf8 to FALSE in your TiDB configuration file.

The following demonstrates the default behavior when inserting a 4-byte emoji character into a table. The INSERT statement fails for the utf8 character set, but succeeds for utf8mb4:

mysql> CREATE TABLE utf8_test (
    ->  c char(1) NOT NULL
    -> ) CHARACTER SET utf8;
Query OK, 0 rows affected (0.09 sec)

mysql> CREATE TABLE utf8m4_test (
    ->  c char(1) NOT NULL
    -> ) CHARACTER SET utf8mb4;
Query OK, 0 rows affected (0.09 sec)

mysql> INSERT INTO utf8_test VALUES ('😉');
ERROR 1366 (HY000): incorrect utf8 value f09f9889(😉) for column c
mysql> INSERT INTO utf8m4_test VALUES ('😉');
Query OK, 1 row affected (0.02 sec)

mysql> SELECT char_length(c), length(c), c FROM utf8_test;
Empty set (0.01 sec)

mysql> SELECT char_length(c), length(c), c FROM utf8m4_test;
+----------------+-----------+------+
| char_length(c) | length(c) | c    |
+----------------+-----------+------+
|              1 |         4 | 😉     |
+----------------+-----------+------+
1 row in set (0.00 sec)

Contributor

May I suggest changing the header to `utf8` and `utf8mb4` in TiDB to focus our topic on TiDB? @nullnotnil

Works for me

Contributor

Please apply the suggestion~ @ireneontheway

Contributor Author

Done.

@ireneontheway
Contributor Author

PTAL again @nullnotnil @wjhuang2016

Contributor

@yikeke yikeke left a comment

Rest LGTM

@ti-srebot ti-srebot added the status/LGT1 Indicates that a PR has LGTM 1. label Jul 31, 2020
Co-authored-by: Keke Yi <40977455+yikeke@users.noreply.github.com>
@ghost

ghost commented Aug 3, 2020

LGTM

@ti-srebot
Contributor

@nullnotnil, Thanks for your review. However, LGTM is restricted to Reviewers or higher roles. See the corresponding SIG page for more information. Related SIGs: docs(slack).

Member

@wjhuang2016 wjhuang2016 left a comment

LGTM

@ireneontheway ireneontheway merged commit 52763cb into pingcap:master Aug 4, 2020
@ireneontheway ireneontheway deleted the character-set-and-collation branch August 4, 2020 09:08
ti-srebot pushed a commit to ti-srebot/docs that referenced this pull request Aug 4, 2020
Signed-off-by: ti-srebot <ti-srebot@pingcap.com>
@ti-srebot
Contributor

cherry pick to release-4.0 in PR #3540

ireneontheway added a commit that referenced this pull request Aug 4, 2020
Signed-off-by: ti-srebot <ti-srebot@pingcap.com>

Co-authored-by: ireneontheway <48651140+ireneontheway@users.noreply.github.com>
@yikeke yikeke added the translation/doing This PR's assignee is translating this PR. label Aug 4, 2020
@ireneontheway ireneontheway added translation/done This PR has been translated from English into Chinese and updated to pingcap/docs-cn in a PR. and removed translation/doing This PR's assignee is translating this PR. labels Aug 4, 2020
Labels

requires-followup This PR requires a follow-up task after being merged. special-week PR from Document Special Week. status/LGT1 Indicates that a PR has LGTM 1. translation/done This PR has been translated from English into Chinese and updated to pingcap/docs-cn in a PR. translation/from-docs-cn This PR is translated from a PR in pingcap/docs-cn.

5 participants