pingcap · ireneontheway · Aug 4, 2020 · Jul 23, 2020 · Jul 24, 2020 · Jul 27, 2020
diff --git a/character-set-and-collation.md b/character-set-and-collation.md
@@ -6,7 +6,47 @@ aliases: ['/docs/dev/character-set-and-collation/','/docs/dev/reference/sql/char
 
 # Character Set and Collation
 
-A character set is a set of symbols and encodings. A collation is a set of rules for comparing characters in a character set.
+This document introduces the character sets and collations supported by TiDB.
+
+## Concepts
+
+A character set is a set of symbols and encodings. The default character set in TiDB is utf8mb4, which matches the default in MySQL 8.0 and above.
+
+A collation is a set of rules for comparing characters in a character set, and the sorting order of characters. For example in a binary collation `A` and `a` do not compare as equal:
+
+{{< copyable "sql" >}}
+
+```sql
+SET NAMES utf8mb4 COLLATE utf8mb4_bin;
+SELECT 'A' = 'a';
+SET NAMES utf8mb4 COLLATE utf8mb4_general_ci;
+SELECT 'A' = 'a';
+```
+
+```sql
+mysql> SELECT 'A' = 'a';
++-----------+
+| 'A' = 'a' |
++-----------+
+|         0 |
++-----------+
+1 row in set (0.00 sec)
+
+mysql> SET NAMES utf8mb4 COLLATE utf8mb4_general_ci;
+Query OK, 0 rows affected (0.00 sec)
+
+mysql> SELECT 'A' = 'a';
++-----------+
+| 'A' = 'a' |
++-----------+
+|         1 |
++-----------+
+1 row in set (0.00 sec)
+```
+
+TiDB defaults to using a binary collation. This differs from MySQL, which uses a case-insensitive collation by default.
+
+## Character sets and collations supported by TiDB
 
 Currently, TiDB supports the following character sets:
 
@@ -29,6 +69,22 @@ SHOW CHARACTER SET;
 5 rows in set (0.00 sec)
 ```
 
+TiDB supports the following collations:
+
+```sql
+mysql> show collation;
++-------------+---------+------+---------+----------+---------+
+| Collation   | Charset | Id   | Default | Compiled | Sortlen |
++-------------+---------+------+---------+----------+---------+
+| utf8mb4_bin | utf8mb4 |   46 | Yes     | Yes      |       1 |
+| latin1_bin  | latin1  |   47 | Yes     | Yes      |       1 |
+| binary      | binary  |   63 | Yes     | Yes      |       1 |
+| ascii_bin   | ascii   |   65 | Yes     | Yes      |       1 |
+| utf8_bin    | utf8    |   83 | Yes     | Yes      |       1 |
++-------------+---------+------+---------+----------+---------+
+5 rows in set (0.01 sec)
+```
+
 > **Note:**
 >
 > The default collations in TiDB (binary collations, with the suffix `_bin`) are different than [the default collations in MySQL](https://dev.mysql.com/doc/refman/8.0/en/charset-charsets.html) (typically general collations, with the suffix `_general_ci`). This can cause incompatible behavior when specifying an explicit character set but relying on the implicit default collation to be chosen.
@@ -51,11 +107,47 @@ SHOW COLLATION WHERE Charset = 'utf8mb4';
 2 rows in set (0.00 sec)
 ```
 
-## Cluster character set and collation
+## `utf8` and `ut8mb4` in TiDB
+
+In MySQL, the character set `utf8` is limited to a maximum of three bytes. This is sufficient to store characters in the Basic Multilingual Plane (BMP), but not enough to store characters such as emojis. For this, it is recommended to use the character set `utf8mb4` instead.
+
+By default, TiDB provides the same 3-byte limit on `utf8` to ensure that data created in TiDB can still safely be restored in MySQL. This can be disabled by changing the value of `check-mb4-value-in-utf8` to `FALSE` in your TiDB configuration file.
+
+The following demonstrates the default behavior when inserting a 4-byte emoji character into a table. The `INSERT` statement fails for the `utf8` character set, but succeeds for `ut8mb4`:
+
+```sql
+mysql> CREATE TABLE utf8_test (
+    ->  c char(1) NOT NULL
+    -> ) CHARACTER SET utf8;
+Query OK, 0 rows affected (0.09 sec)
+
+mysql> CREATE TABLE utf8m4_test (
+    ->  c char(1) NOT NULL
+    -> ) CHARACTER SET utf8mb4;
+Query OK, 0 rows affected (0.09 sec)
+
+mysql> INSERT INTO utf8_test VALUES ('😉');
+ERROR 1366 (HY000): incorrect utf8 value f09f9889(😉) for column c
+mysql> INSERT INTO utf8m4_test VALUES ('😉');
+Query OK, 1 row affected (0.02 sec)
+
+mysql> SELECT char_length(c), length(c), c FROM utf8_test;
+Empty set (0.01 sec)
+
+mysql> SELECT char_length(c), length(c), c FROM utf8m4_test;
++----------------+-----------+------+
+| char_length(c) | length(c) | c    |
++----------------+-----------+------+
+|              1 |         4 | 😉     |
++----------------+-----------+------+
+1 row in set (0.00 sec)
+```
+
+## Character set and collation in different layers
 
-Not supported yet.
+The character set and collation can be set at different layers.
 
-## Database character set and collation
+### Database character set and collation
 
 Each database has a character set and a collation. You can use the following statements to specify the database character set and collation:
 
@@ -152,7 +244,7 @@ SELECT DEFAULT_CHARACTER_SET_NAME, DEFAULT_COLLATION_NAME
 FROM INFORMATION_SCHEMA.SCHEMATA WHERE SCHEMA_NAME = 'db_name';
 ```
 
-## Table character set and collation
+### Table character set and collation
 
 You can use the following statement to specify the character set and collation for tables:
 
@@ -180,7 +272,7 @@ Query OK, 0 rows affected (0.08 sec)
 
 If the table character set and collation are not specified, the database character set and collation are used as their default values.
 
-## Column character set and collation
+### Column character set and collation
 
 You can use the following statement to specify the character set and collation for columns:
 
@@ -196,7 +288,7 @@ col_name {ENUM | SET} (val_list)
 
 If the column character set and collation are not specified, the table character set and collation are used as their default values.
 
-## String character sets and collation
+### String character sets and collation
 
 Each string corresponds to a character set and a collation. When you use a string, this option is available:
 
@@ -222,7 +314,7 @@ Rules:
 + Rule 2: If you specify `CHARACTER SET charset_name` but do not specify `COLLATE collation_name`, the `charset_name` character set and the default collation of `charset_name` are used.
 + Rule 3: If you specify neither `CHARACTER SET charset_name` nor `COLLATE collation_name`, the character set and collation given by the system variables `character_set_connection` and `collation_connection` are used.
 
-## Client connection character set and collation
+### Client connection character set and collation
 
 + The server character set and collation are the values of the `character_set_server` and `collation_server` system variables.
 
@@ -246,7 +338,7 @@ You can use the following statement to set the character set and collation that
     SET character_set_connection = charset_name;
     ```
 
-    `COLLATE` is optional, if absent, the default collation of the `charset_name` is used.
+    `COLLATE` is optional, if absent, the default collation of the `charset_name` is used to set the `collation_connection`.
 
 + `SET CHARACTER SET 'charset_name'`
 
@@ -255,12 +347,13 @@ You can use the following statement to set the character set and collation that
     ```sql
     SET character_set_client = charset_name;
     SET character_set_results = charset_name;
+    SET charset_connection = @@charset_database;
     SET collation_connection = @@collation_database;
     ```
 
-## Optimization levels of character sets and collations
+## Selection priorities of character sets and collations
 
-String > Column > Table > Database > Server > Cluster
+String > Column > Table > Database > Server
 
 ## General rules on selecting character sets and collation
 
@@ -345,13 +438,13 @@ If an expression involves multiple clauses of different collations, you need to
 
 + The coercibility value of the explicit `COLLATE` clause is `0`.
 + If the collations of two strings are incompatible, the coercibility value of the concatenation of two strings with different collations is `1`. Currently, all implemented collations are compatible with each other.
-+ The column's collation has a coercibility value of `2`.
++ The collation of the column, `CAST()`, `CONVERT()`, or `BINARY()` has a coercibility value of `2`.
 + The system constant (the string returned by `USER ()` or `VERSION ()`) has a coercibility value of `3`.
 + The coercibility value of constants is `4`.
 + The coercibility value of numbers or intermediate variables is `5`.
 + `NULL` or expressions derived from `NULL` has a coercibility value of `6`.
 
-When inferring collations, TiDB prefers using the collation of expressions with lower coercibility values (the same as MySQL). If the coercibility values of two clauses are the same, the collation is determined according to the following priority:
+When inferring collations, TiDB prefers using the collation of expressions with lower coercibility values. If the coercibility values of two clauses are the same, the collation is determined according to the following priority:
 
 binary > utf8mb4_bin > utf8mb4_general_ci > utf8_bin > utf8_general_ci > latin1_bin > ascii_bin