Skip to content
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
119 changes: 106 additions & 13 deletions character-set-and-collation.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,47 @@ aliases: ['/docs/stable/character-set-and-collation/','/docs/v4.0/character-set-

# Character Set and Collation

A character set is a set of symbols and encodings. A collation is a set of rules for comparing characters in a character set.
This document introduces the character sets and collations supported by TiDB.

## Concepts

A character set is a set of symbols and encodings. The default character set in TiDB is utf8mb4, which matches the default in MySQL 8.0 and above.

A collation is a set of rules for comparing characters in a character set, and the sorting order of characters. For example in a binary collation `A` and `a` do not compare as equal:

{{< copyable "sql" >}}

```sql
SET NAMES utf8mb4 COLLATE utf8mb4_bin;
SELECT 'A' = 'a';
SET NAMES utf8mb4 COLLATE utf8mb4_general_ci;
SELECT 'A' = 'a';
```

```sql
mysql> SELECT 'A' = 'a';
+-----------+
| 'A' = 'a' |
+-----------+
| 0 |
+-----------+
1 row in set (0.00 sec)

mysql> SET NAMES utf8mb4 COLLATE utf8mb4_general_ci;
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT 'A' = 'a';
+-----------+
| 'A' = 'a' |
+-----------+
| 1 |
+-----------+
1 row in set (0.00 sec)
```

TiDB defaults to using a binary collation. This differs from MySQL, which uses a case-insensitive collation by default.

## Character sets and collations supported by TiDB

Currently, TiDB supports the following character sets:

Expand All @@ -29,6 +69,22 @@ SHOW CHARACTER SET;
5 rows in set (0.00 sec)
```

TiDB supports the following collations:

```sql
mysql> show collation;
+-------------+---------+------+---------+----------+---------+
| Collation | Charset | Id | Default | Compiled | Sortlen |
+-------------+---------+------+---------+----------+---------+
| utf8mb4_bin | utf8mb4 | 46 | Yes | Yes | 1 |
| latin1_bin | latin1 | 47 | Yes | Yes | 1 |
| binary | binary | 63 | Yes | Yes | 1 |
| ascii_bin | ascii | 65 | Yes | Yes | 1 |
| utf8_bin | utf8 | 83 | Yes | Yes | 1 |
+-------------+---------+------+---------+----------+---------+
5 rows in set (0.01 sec)
```

> **Note:**
>
> The default collations in TiDB (binary collations, with the suffix `_bin`) are different than [the default collations in MySQL](https://dev.mysql.com/doc/refman/8.0/en/charset-charsets.html) (typically general collations, with the suffix `_general_ci`). This can cause incompatible behavior when specifying an explicit character set but relying on the implicit default collation to be chosen.
Expand All @@ -51,11 +107,47 @@ SHOW COLLATION WHERE Charset = 'utf8mb4';
2 rows in set (0.00 sec)
```

## Cluster character set and collation
## `utf8` and `ut8mb4` in TiDB

In MySQL, the character set `utf8` is limited to a maximum of three bytes. This is sufficient to store characters in the Basic Multilingual Plane (BMP), but not enough to store characters such as emojis. For this, it is recommended to use the character set `utf8mb4` instead.

By default, TiDB provides the same 3-byte limit on `utf8` to ensure that data created in TiDB can still safely be restored in MySQL. This can be disabled by changing the value of `check-mb4-value-in-utf8` to `FALSE` in your TiDB configuration file.

The following demonstrates the default behavior when inserting a 4-byte emoji character into a table. The `INSERT` statement fails for the `utf8` character set, but succeeds for `ut8mb4`:

```sql
mysql> CREATE TABLE utf8_test (
-> c char(1) NOT NULL
-> ) CHARACTER SET utf8;
Query OK, 0 rows affected (0.09 sec)

mysql> CREATE TABLE utf8m4_test (
-> c char(1) NOT NULL
-> ) CHARACTER SET utf8mb4;
Query OK, 0 rows affected (0.09 sec)

mysql> INSERT INTO utf8_test VALUES ('😉');
ERROR 1366 (HY000): incorrect utf8 value f09f9889(😉) for column c
mysql> INSERT INTO utf8m4_test VALUES ('😉');
Query OK, 1 row affected (0.02 sec)

mysql> SELECT char_length(c), length(c), c FROM utf8_test;
Empty set (0.01 sec)

mysql> SELECT char_length(c), length(c), c FROM utf8m4_test;
+----------------+-----------+------+
| char_length(c) | length(c) | c |
+----------------+-----------+------+
| 1 | 4 | 😉 |
+----------------+-----------+------+
1 row in set (0.00 sec)
```

## Character set and collation in different layers

Not supported yet.
The character set and collation can be set at different layers.

## Database character set and collation
### Database character set and collation

Each database has a character set and a collation. You can use the following statements to specify the database character set and collation:

Expand Down Expand Up @@ -152,7 +244,7 @@ SELECT DEFAULT_CHARACTER_SET_NAME, DEFAULT_COLLATION_NAME
FROM INFORMATION_SCHEMA.SCHEMATA WHERE SCHEMA_NAME = 'db_name';
```

## Table character set and collation
### Table character set and collation

You can use the following statement to specify the character set and collation for tables:

Expand Down Expand Up @@ -180,7 +272,7 @@ Query OK, 0 rows affected (0.08 sec)

If the table character set and collation are not specified, the database character set and collation are used as their default values.

## Column character set and collation
### Column character set and collation

You can use the following statement to specify the character set and collation for columns:

Expand All @@ -196,7 +288,7 @@ col_name {ENUM | SET} (val_list)

If the column character set and collation are not specified, the table character set and collation are used as their default values.

## String character sets and collation
### String character sets and collation

Each string corresponds to a character set and a collation. When you use a string, this option is available:

Expand All @@ -222,7 +314,7 @@ Rules:
+ Rule 2: If you specify `CHARACTER SET charset_name` but do not specify `COLLATE collation_name`, the `charset_name` character set and the default collation of `charset_name` are used.
+ Rule 3: If you specify neither `CHARACTER SET charset_name` nor `COLLATE collation_name`, the character set and collation given by the system variables `character_set_connection` and `collation_connection` are used.

## Client connection character set and collation
### Client connection character set and collation

+ The server character set and collation are the values of the `character_set_server` and `collation_server` system variables.

Expand All @@ -246,7 +338,7 @@ You can use the following statement to set the character set and collation that
SET character_set_connection = charset_name;
```

`COLLATE` is optional, if absent, the default collation of the `charset_name` is used.
`COLLATE` is optional, if absent, the default collation of the `charset_name` is used to set the `collation_connection`.

+ `SET CHARACTER SET 'charset_name'`

Expand All @@ -255,12 +347,13 @@ You can use the following statement to set the character set and collation that
```sql
SET character_set_client = charset_name;
SET character_set_results = charset_name;
SET charset_connection = @@charset_database;
SET collation_connection = @@collation_database;
```

## Optimization levels of character sets and collations
## Selection priorities of character sets and collations

String > Column > Table > Database > Server > Cluster
String > Column > Table > Database > Server

## General rules on selecting character sets and collation

Expand Down Expand Up @@ -345,13 +438,13 @@ If an expression involves multiple clauses of different collations, you need to

+ The coercibility value of the explicit `COLLATE` clause is `0`.
+ If the collations of two strings are incompatible, the coercibility value of the concatenation of two strings with different collations is `1`. Currently, all implemented collations are compatible with each other.
+ The column's collation has a coercibility value of `2`.
+ The collation of the column, `CAST()`, `CONVERT()`, or `BINARY()` has a coercibility value of `2`.
+ The system constant (the string returned by `USER ()` or `VERSION ()`) has a coercibility value of `3`.
+ The coercibility value of constants is `4`.
+ The coercibility value of numbers or intermediate variables is `5`.
+ `NULL` or expressions derived from `NULL` has a coercibility value of `6`.

When inferring collations, TiDB prefers using the collation of expressions with lower coercibility values (the same as MySQL). If the coercibility values of two clauses are the same, the collation is determined according to the following priority:
When inferring collations, TiDB prefers using the collation of expressions with lower coercibility values. If the coercibility values of two clauses are the same, the collation is determined according to the following priority:

binary > utf8mb4_bin > utf8mb4_general_ci > utf8_bin > utf8_general_ci > latin1_bin > ascii_bin

Expand Down