From f2d29cfd52e9c279d1557f002e27b54ad487a645 Mon Sep 17 00:00:00 2001 From: xixirangrang <35301108+hfxsd@users.noreply.github.com> Date: Mon, 26 Sep 2022 21:43:58 +0800 Subject: [PATCH 01/25] Create data-migration-best-practices.md --- dm/data-migration-best-practices.md | 31 +++++++++++++++++++++++++++++ 1 file changed, 31 insertions(+) create mode 100644 dm/data-migration-best-practices.md diff --git a/dm/data-migration-best-practices.md b/dm/data-migration-best-practices.md new file mode 100644 index 000000000000..a8f86688aa03 --- /dev/null +++ b/dm/data-migration-best-practices.md @@ -0,0 +1,31 @@ +--- +title: Data Migration (DM) Best Practices +summary: Learn about best practices when you use TiDB Data Migration (DM) to migrate data. +--- + +# Data Migration (DM) Best Practices + +TiDB Data Migration (DM) is an data migration tool developed by PingCAP. It supports the full data migration and the incremental data replication from MySQL-compatible databases such as MySQL, Percona MySQL, MariaDB, AWS MySQL RDS, AWS Aurora into TiDB. + +You can use DM in the following scenarios: + +- Perform full and incremental data migration from a single MySQL-compatible database instance to TiDB +- Migrate and merge MySQL shards of small datasets to TiDB +- In the DataHUB scenario, such as the middle platform of business data, and real-time aggregation of business data, use DM as the middleware for data migration + +This document introduces how to use DM in an elegant and efficient way, and how to avoid the common mistakes when using DM. + +## Performance limitations + +| Performance item | Limitation | +| ----------------- | :--------: | +| Max Work Nodes | 1000 | +| Max Task number | 600 | +| Max QPS | 30k QPS/worker | +| Max Binlog throughput | 20 MB/s/worker | +| Table number limit per task | Unlimited | + +- DM 支持同时管理 1000 个同步节点(Work Node),最大同步任务数量为 600 个。为了保证同步节点的高可用,应预留一部分 Work Node 节点作为备用节点,保证数据同步的高可用。预留已开启同步任务 Work Node 数量的 20% ~ 50%。 +- 单机部署 Work Node 数量。在服务器配置较好情况下,要保证每个 Work Node 至少有 2 核 CPU 加 4G 内存的可用工作资源,并且应为主机预留 10% ~ 20% 的系统资源。 +- 单个同步节点(Work Node),理论最大同步 QPS 在 30K QPS/worker(不同 Schema 和 workload 会有所差异),处理上游 Binlog 的能力最高为 20 MB/s/worker。 +- 如果将 DM 作为需要长期使用的数据同步中间件,需要注意 DM 组件的部署架构。请参见 [Master 与 Woker 部署实践](#master-与-woker-部署实践)。 From 91ceb80496e40e0d1a3b6c0661a2c891b792e04f Mon Sep 17 00:00:00 2001 From: xixirangrang <35301108+hfxsd@users.noreply.github.com> Date: Tue, 27 Sep 2022 09:53:55 +0800 Subject: [PATCH 02/25] Update data-migration-best-practices.md --- dm/data-migration-best-practices.md | 12 +++++++++++- 1 file changed, 11 insertions(+), 1 deletion(-) diff --git a/dm/data-migration-best-practices.md b/dm/data-migration-best-practices.md index a8f86688aa03..40b403cb881e 100644 --- a/dm/data-migration-best-practices.md +++ b/dm/data-migration-best-practices.md @@ -27,5 +27,15 @@ This document introduces how to use DM in an elegant and efficient way, and how - DM 支持同时管理 1000 个同步节点(Work Node),最大同步任务数量为 600 个。为了保证同步节点的高可用,应预留一部分 Work Node 节点作为备用节点,保证数据同步的高可用。预留已开启同步任务 Work Node 数量的 20% ~ 50%。 - 单机部署 Work Node 数量。在服务器配置较好情况下,要保证每个 Work Node 至少有 2 核 CPU 加 4G 内存的可用工作资源,并且应为主机预留 10% ~ 20% 的系统资源。 -- 单个同步节点(Work Node),理论最大同步 QPS 在 30K QPS/worker(不同 Schema 和 workload 会有所差异),处理上游 Binlog 的能力最高为 20 MB/s/worker。 +- A single Work Node 理论最大同步 QPS 是 30K QPS/worker(不同 Schema 和 workload 会有所差异),处理上游 Binlog 的能力最高为 20 MB/s/worker。 - 如果将 DM 作为需要长期使用的数据同步中间件,需要注意 DM 组件的部署架构。请参见 [Master 与 Woker 部署实践](#master-与-woker-部署实践)。 + +- DM supports managing 1000 work nodes simultaneously, and the maximum number 
of tasks is 600. To ensure the high availability of work nodes, you should reserve some work nodes as standby nodes. Reserve 20% to 50% of the number of the work nodes that have run migration task. +- A single work node has a theoretical maximum replication QPS of 30K QPS/worker. It varies for different schemas and workloads. The ability to handle upstream binlog is up to 20 MB/s/worker. +- If you use DM as a data replication middleware that will be used for a long time, you need to carefully design the deployment architecture of DM components. For more information, see [Deploy DM-Master and DM-Woker](#deploy-dm-master-and-dm-woker) + + + +### Best practices for deployment + +#### Deploy DM-Master and DM-Woker \ No newline at end of file From deaaa9ae52efe74714d333bc5af0ba3ce32f9faa Mon Sep 17 00:00:00 2001 From: xixirangrang <35301108+hfxsd@users.noreply.github.com> Date: Wed, 28 Sep 2022 18:36:39 +0800 Subject: [PATCH 03/25] Update data-migration-best-practices.md --- dm/data-migration-best-practices.md | 25 +++++++++++++++++-------- 1 file changed, 17 insertions(+), 8 deletions(-) diff --git a/dm/data-migration-best-practices.md b/dm/data-migration-best-practices.md index 40b403cb881e..b981ee84ad31 100644 --- a/dm/data-migration-best-practices.md +++ b/dm/data-migration-best-practices.md @@ -25,17 +25,26 @@ This document introduces how to use DM in an elegant and efficient way, and how | Max Binlog throughput | 20 MB/s/worker | | Table number limit per task | Unlimited | -- DM 支持同时管理 1000 个同步节点(Work Node),最大同步任务数量为 600 个。为了保证同步节点的高可用,应预留一部分 Work Node 节点作为备用节点,保证数据同步的高可用。预留已开启同步任务 Work Node 数量的 20% ~ 50%。 -- 单机部署 Work Node 数量。在服务器配置较好情况下,要保证每个 Work Node 至少有 2 核 CPU 加 4G 内存的可用工作资源,并且应为主机预留 10% ~ 20% 的系统资源。 -- A single Work Node 理论最大同步 QPS 是 30K QPS/worker(不同 Schema 和 workload 会有所差异),处理上游 Binlog 的能力最高为 20 MB/s/worker。 -- 如果将 DM 作为需要长期使用的数据同步中间件,需要注意 DM 组件的部署架构。请参见 [Master 与 Woker 部署实践](#master-与-woker-部署实践)。 - - DM supports managing 1000 work nodes simultaneously, and the maximum number of tasks is 600. To ensure the high availability of work nodes, you should reserve some work nodes as standby nodes. Reserve 20% to 50% of the number of the work nodes that have run migration task. -- A single work node has a theoretical maximum replication QPS of 30K QPS/worker. It varies for different schemas and workloads. The ability to handle upstream binlog is up to 20 MB/s/worker. -- If you use DM as a data replication middleware that will be used for a long time, you need to carefully design the deployment architecture of DM components. For more information, see [Deploy DM-Master and DM-Woker](#deploy-dm-master-and-dm-woker) +- A single work node can theoretically support replication QPS of up to 30K QPS/worker. It varies for different schemas and workloads. The ability to handle upstream binlog is up to 20 MB/s/worker. +- If you use DM as a data replication middleware that will be used for a long time, you need to carefully design the deployment architecture of DM components. For more information, see [Deploy DM-master and DM-woker](#deploy-dm-master-and-dm-woker) + +## Before data migration + +Before data migration, the design of the overall solution is critical. Especially the design of the scheme before migration is the most important part of the whole scheme. The following sections describe best practices and scenarios from the business side and the implementation side. 
+ +### Key points in the business side + +To distribute workloads evenly on multiple nodes, the schema design for the distributed database is very different from traditonal databases. It is disgned for both low migration cost and logic correctness after migration. The following sections describe best practices before data migration. + +#### Business impact of AUTO_INCREMENT in Schema design + +TiDB 的 `AUTO_INCREMENT` 与 MySQL 的 `AUTO_INCREMENT` 整体上看是相互兼容的。但因为 TiDB 作为分布式数据库,一般会有多个计算节点(client 端入口),应用数据写入时会将负载均分开,这就导致在有 `AUTO_INCREMENT` 列的表上,可能出现不连续的自增 ID。详细原理参考 [`AUTO_INCREMENT`](/auto-increment.md#实现原理)。 +如果业务对自增 ID 有强依赖,可以考虑使用 [SEQUENCE 函数](/sql-statements/sql-statement-create-sequence.md#sequence-函数)。 +In general, `AUTO_INCREMENT` in TiDB is compatible ### Best practices for deployment -#### Deploy DM-Master and DM-Woker \ No newline at end of file +#### Deploy DM-master and DM-woker \ No newline at end of file From 084907ffdb6efcf22189df4d399e0dc9bf8aee87 Mon Sep 17 00:00:00 2001 From: xixirangrang <35301108+hfxsd@users.noreply.github.com> Date: Wed, 28 Sep 2022 23:38:29 +0800 Subject: [PATCH 04/25] Update data-migration-best-practices.md --- dm/data-migration-best-practices.md | 188 +++++++++++++++++++++++++++- 1 file changed, 183 insertions(+), 5 deletions(-) diff --git a/dm/data-migration-best-practices.md b/dm/data-migration-best-practices.md index b981ee84ad31..7292bba4a4c7 100644 --- a/dm/data-migration-best-practices.md +++ b/dm/data-migration-best-practices.md @@ -33,18 +33,196 @@ This document introduces how to use DM in an elegant and efficient way, and how Before data migration, the design of the overall solution is critical. Especially the design of the scheme before migration is the most important part of the whole scheme. The following sections describe best practices and scenarios from the business side and the implementation side. -### Key points in the business side +### Best practices for the business side To distribute workloads evenly on multiple nodes, the schema design for the distributed database is very different from traditonal databases. It is disgned for both low migration cost and logic correctness after migration. The following sections describe best practices before data migration. #### Business impact of AUTO_INCREMENT in Schema design -TiDB 的 `AUTO_INCREMENT` 与 MySQL 的 `AUTO_INCREMENT` 整体上看是相互兼容的。但因为 TiDB 作为分布式数据库,一般会有多个计算节点(client 端入口),应用数据写入时会将负载均分开,这就导致在有 `AUTO_INCREMENT` 列的表上,可能出现不连续的自增 ID。详细原理参考 [`AUTO_INCREMENT`](/auto-increment.md#实现原理)。 +`AUTO_INCREMENT` in TiDB is compatible with `AUTO_INCREMENT` in MySQL. However, as a distributed database, TiDB usually has multiple computing nodes (entry on the client end). When the application data is written, the workload is evenly distributed. This leads to the result that when there is an `AUTO_INCREMENT` column in the table, the auto-increment ID may not be consecutive. For more details, see [AUTO_INCREMENT](/auto-increment.md#implementation-principles). -如果业务对自增 ID 有强依赖,可以考虑使用 [SEQUENCE 函数](/sql-statements/sql-statement-create-sequence.md#sequence-函数)。 +If your business has a strong dependence on the auto-increment ID, consider using the [SEQUENCE function](/sql-statements/sql-statement-create-sequence.md#sequence-function). -In general, `AUTO_INCREMENT` in TiDB is compatible +#### Usage of Clustered indexes + +When you create a table, you can state that the primary key is either a clustered index or a non-clustered index. The following sections describe the pros and cons of each solution. 
+ +- Clustered indexes + + [Clustered indexes](/clustered-indexes.md) use the primary key as the handle ID (row ID) for data storage. Querying using the primary key can avoid table lookup, which effectively improves the query performance. However, if the table is write-intensive and the primary key uses [`AUTO_INCREMENT`](/auto-increment.md), it is very likely to cause the [write hotspot problem](/best-practices/high-concurrency-best-practices.md#highly-concurrent-write-intensive-scenario) of data storage, resulting in a mediocre performance of the cluster and the performance bottleneck of a single storage node. + +- Non-clustered indexes + `shard row id bit` + + Using non-clustered indexes and `shard row id bit`, you can avoid the write hotspot problem when using `AUTO_INCREMENT`. However, table lookup in this scenario can affect the query performance when querying using the primary key. + +- Clustered indexes + external distributed ID generators + + If you want to use clustered indexes and keep the IDs consecutive, consider using external distributed ID generators, such as Snowflake and Leaf. The application program generates sequence IDs, which can guarantee the IDs consecutive to a certain extent. It also retains the benefits of using clustered indexes. But you need to customize the related applications. + +- Clustered indexes + `AUTO_RANDOM` + + This solution can retain the benefits of using clustered indexes and avoid the write hotspot problem. It requires less effort for customization. You can modify the schema attribute when you switch to use TiDB as the write database. In the subsequtial queries, you can sort using the ID column. You can use the [`AUTO_RANDOM`](/auto-random.md) ID column to left shift 5 bits to ensure the order of the query data. For example: + + ```sql + + ```sql + CREATE TABLE t (a bigint PRIMARY KEY AUTO_RANDOM, b varchar(255)); + Select a, a<<5 ,b from t order by a <<5 desc + ``` + +The following table summarizes the pros and cons of each solution. + +| Scenario | Recommended solution | Pros | Cons | +| :--- | :--- | :--- | :--- | +| TIDB will act as the primary and write-intensive database. The business logic is strongly dependent on the continuity of the primary key IDs. | Create the table with non-clustered indexes and set `SHARD_ROW_ID_BIT`. Use `SEQUENCE` as the primary key column. | It can avoid data writing hotspots and ensure the continuity and monotonic increment of business data. | The throughput capacity of data write is reduced (to ensure data write continuity). The performance of primary key queries is reduced. | +| TIDB will act as the primary and write-intensive database. The business logic is strongly dependent on the increment of the primary key IDs. | Create the table with non-clustered indexes and set `SHARD_ROW_ID_BIT`. Use the application ID generator to define the primary key IDs. | It can avoid data writing hotspots, guarantees the performance of data wirte, guarantees the increment of business data, but cannot guarantees continuity. | Need code customization. External ID generators strongly depend on the clock accurancy and might introduce failure risks | +| **Recommended solution for data migration**
TiDB will act as the primary and write-intensive database. The business logic does not depend on the continuity of the primary key IDs. | Create the table with clustered indexes and set `AUTO_RANDOM` for the primary key column. | It can avoid data writing hotspots. Limited write throughput ability. Excellent query performance of the primary keys. You can smoothly switch `AUTO_INCREMENT` to `AUTO_RANDOM`. | The primary key ID is random. It is recommended to sort the business data by inserting a time column. If you have to use the primary key ID to sort data, you can left shift the ID by 5 bits in the query, which can guarantee the increment of the data. |
| **Recommended solution for middle platforms**
TiDB will act as the read-only database. | Create the table with non-clustered indexes and set `SHARD_ROW_ID_BIT`. Keep the primary key column consistent with the data source. | It can avoid data writing hotspots. Low customization cost. | The query performance of the primary keys is impacted. |

### Key points for MySQL shards

#### Splitting and merging

It is recommended that you use DM to [migrate and merge MySQL shards of small datasets to TiDB](/migrate-small-mysql-shards-to-tidb.md). The benefits are not described here.

This section only introduces the scenario of data archiving. Data is constantly being written. As time goes by, large amounts of data gradually change from hot data to warm or even cold data. Fortunately, in TiDB, you can use [placement rules](/configure-placement-rules.md) to set different placement rules for data. The minimum granularity is [partitioned tables](/partitioned-table.md).

Therefore, it is recommended that for write-intensive scenarios, you need to evaluate from the beginning whether you need to archive data and store hot and cold data on different media separately. If you need to archive data, you need to set the partition rules before migration (TiDB does not support Table Rebuild operations yet). This can save you from the need to create tables and import data in future.

#### The pessimistic mode and the optimistic mode

DM uses the pessimistic mode by default. In scenarios of migrating and merging MySQL shards, changes in upstream shard schemas can block DML writing to downstream databases. You need to wait for all the schemas to change and have the same structure, and then continue migration from the break point.

- If the upstream schema changes take a long time, it can cause the upstream Binlog to be cleaned up. You can enable the Relay log to avoid this problem.

- If you do not want to block data write due to upstream schema changes, consider using the optimistic mode. In this case, DM will not block the data migration even when it spots changes in the upstream shard schemas, but will continue to migrate the data. However, if incompatible formats are spotted in upstream and downstream, the migration task will stop. You need to resolve this issue manually.

The following table summarizes the pros and cons of the pessimistic mode and the optimistic mode.

| Scenario | Pros | Cons |
| :--- | :--- | :--- |
| Pessimistic mode (default) | It can ensure that the data migrated to the downstream will not go wrong. | If there are a large number of shards, the migration task will be blocked for a long time, or even stopped if the upstream Binlogs have been cleaned up. You can enable the Relay log to avoid this problem. For more limitations, see [Use Relay log](#use-relay-log). |
| Optimistic mode | Data migration is not blocked in this mode. | In this mode, ensure that schema changes are compatible (whether the incremental column has a default value). It is possible that inconsistent data can be overlooked. For more limitations, see [Merge and Migrate Data from Sharded Tables in Optimistic Mode](/dm/feature-shard-merge-optimistic.md#restrictions).|

### Other restrictions and impact

#### Data types in upstream and downstream

TiDB supports most MySQL data types. However, some special types are not supported yet (such as `SPATIAL`). For the compatibility of data types, see [Data Types](/data-type-overview.md).
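Before configuring a migration task, it can be worth checking the upstream instance for columns of unsupported types. The following query is only a quick sanity check against `information_schema` (available in MySQL 5.7 and later), not a complete compatibility audit; adjust the excluded schemas to your environment.

```sql
-- Run against the upstream MySQL instance to list spatial columns,
-- which TiDB cannot accept during migration.
SELECT table_schema, table_name, column_name, data_type
FROM information_schema.columns
WHERE data_type IN ('geometry', 'point', 'linestring', 'polygon',
                    'multipoint', 'multilinestring', 'multipolygon', 'geometrycollection')
  AND table_schema NOT IN ('mysql', 'information_schema', 'performance_schema', 'sys');
```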
+ +#### Character sets and collations + +If you want TiDB to support utf8_general_ci, utf8mb4_general_ci, utf8_unicode_ci, utf8mb4_unicode_ci, gbk_chinese_ci and gbk_bin, you need explicitly state it when creating the cluster by setting the value of `new_collations_enabled_on_first_bootstrap` to `true`. For more information, see [New framework for collations](character-set-and-collation.md#new-framework-for-collations) + +The default character set in TiDB is utf8mb4. It is recommended that you use utf8mb4 for the upstream and downstream databases and applications. If the upstream database has explicitly specified a character set or collation, you need to check whether TiDB supports it. Since TiDB v6.0.0, GBK is supported. For more information, see the following documents: + +- [Character Set and Collation](/character-set-and-collation.md) +- [GBK compatibility](/character-set-gbk.md#mysql-compatibility) ### Best practices for deployment -#### Deploy DM-master and DM-woker \ No newline at end of file +#### Deploy DM-master and DM-woker + +DM consists of DM-master and DM-worker. + +- DM-master manages the metadata of migration tasks and is the central scheduling of DM-worker. It is the core of the whole DM platform. Therefore, DM-master can be deployed as clusters to ensure the availability of the DM platform. +- DM-worker executes upstream and downstream migration tasks. It is a stateless node. You can deploy at most 1000 DM-worker nodes. When using DM, you can reserve some idle DM-workers to ensure high availability. + +#### Plan the migration tasks + +Splitting the migration task can only guarantee the final consistency of data. Real-time consistency may deviate significantly due to various reasons. + +- When migrating and merging MysQL shards, you can split the migration task according to the types of shards in the upstream. For example, if `usertable_1~50` and `Logtable_1~50` are two types of shards, you can create two migration tasks. It can simplify the migration task template and effectively control the impact of interruption in data migration. + +- For large-scale data migration, you can refer to the following to split the migration task: + - If you need to migrate multiple databases in the upstream, you can split the migration task in terms of databases. + - Split the task according to the write pressure in the upstream. That is, split the tables with frequent DML operations in the upstream to a separate migration task. Use another migration task to migrate the tables without frequent DML operations. This method can speed up the progress of the migration task to some extent. Especially when there are a large number of logs written to a table in the upstream. But if this table does not matter, this method can effectively solve such problems. + +The following table gives the recommended deployment plans for DM-master and DM-worker in different data migration scenarios. + +| Scenario | DM-master deployment | DM-worker deployment | +| :--- | :--- | :--- | +| Small dataset (less than 1 TB) and one-time data migration | Deploy 1 DM-master node | Deploy 1~N DM-worker nodes according to the number of upstream data sources. Generally, 1 DM-worker node is recommended. | +| Large dataset (more than 1 TB) and MySQL shards, one-time data migration | It is recommended to deploy 3 DM-master nodes to ensure the availability of the DM cluster during long-time data migration. | Deploy DM-worker nodes according to the number of data sources or migration tasks. 
It is recommended to deploy 1~3 idle DM-worker nodes. | +| Long-term data migration | It is necessary to deploy 3 DM-master nodes. If you deploy DM-master nodes on the cloud, try to deploy them in different availability zones (AZ). | Deploy DM-worker nodes according to the number of data sources or migration tasks. It is necessary to deploy 1.5~2 times the number of DM-worker nodes that are actually needed. | + +#### 上游数据源选择与设置 + +DM supports full data migration, but when doing it, it backs up the full data of the entire database. DM uses the parallel logical backup method. During the backup, it uses a relatively heavy lock [`FLUSH TABLES WITH READ LOCK`](https://dev.mysql.com/doc/refman/8.0/en/flush.html#flush-tables-with-read-lock). At this time, DML and DDL operations of the upstream database will be blocked for a short time. Therefore, it is strongly recommended to use a backup database to perform the full data backup, and enable the GTID function of the data source at the same time (`enable-gtid: true`). This way, you can not only avoid the impact of the upstream during migration, but also switch to the master node in the upstream to reduce the delay during the incremental migration. For the method of switching the upstream MySQL data source, see [Switch DM-worker Connection between Upstream MySQL Instances](/dm/usage-scenario-master-slave-switch.md#switch-dm-worker-connection-via-virtual-ip). + +Note the following: + +- You can only perform full data backup on the master node of the upstream database. + + In this scenario, you can set the consistency parameter to `none` in the configuration file, `mydumpers.global.extra-args: "--consistency none"`, to avoid adding a heavy lock to the master node. But this may damage the data consistency of the full backup, which may lead to inconsistent data between the upstream and downstream. + +- Use backup snapshots to perform full data migration (only applicable to the migration of MySQL RDS and Aurora RDS on AWS) + + If the database to be migrated is AWS MySQL RDS or Aurora RDS, you can use RDS snapshots to directly migrate the backup data in Amazon S3 to TiDB to ensure data consistency. For more information, see [Migrate Data from Amazon Aurora to TiDB](/migrate-aurora-to-tidb.md). + +### Details of configurations + +#### Capitalization + +TiDB is case-insensitive to Schema name by default, that is, `lower_case_table_names:2`. But most upstream MySQL use Linux systems that are case-sensitive by default. In this case, you need to set `case-sensitive` to `true` to ensure that the schema can be correctly migrated from the upstream. + +In a special case, for example, if there is a database in the upstream that has both uppercase tables such as `Table` and lowercase tables such as `table`, then an error occurs when creating the schema: + +`ERROR 1050 (42S01): Table '{tablename}' already exists` + +#### Filter rules + +This section does not introduce the filter rules in detail. It is reminded that you configure the filter rules as soon as you configure the data source. The benefits of configuring the filter rules are: + +- Reduce the number of Binlog events that the downstream needs to process, thereby improving migration efficiency. +- Reduce unnecessary Relay log storage, saving disk space. + +> **Note:** +> +> When you migrate and merge MySQL shards, if you have configured filter rules in the data source, you must make sure that the rules match between the data source and the migration task. 
If they do not match, it may cause the issue that the migration task can not receive incremental data for a long time. + +### Use Relay log + +In the MySQL master/secondary mechanism, the secondary node saves a copy of the Relay logs to ensure the reliability and efficiency of asynchronous replication. DM also supports saving a copy of Relay logs on DM-worker. You can configure information such as the storage location and expiration time. This feature applies to the following scenarios: + +- During full and incremental data migration, because the amount of full data is large, the entire process takes more time than the time for the upstream Binlog to be archived. It causes the incremental replication task fail to start normally. If you enable Relay log, DM-worker will start receiving Relay log when the full migration is started. This avoids the failure of the incremental task. +- When you use DM to perform long-time data migration, sometimes the migration task is blocked for a long time due to various reasons. If you enable Relay log, you can effectively deal with the problem of upstream Binlog being recycled due to the blocking of the migration task. + +There are some restrictions using Relay log. DM supports high availability. When a DM-worker fails, it will try to promote an idle DM-worker instance to a working instance. If the upstream Binlog does not contain the necessary migration logs, it may cause interruption. You need to intervene manually to copy the Relay log to the new DM-worker node as soon as possible, and modify the corresponding Relay meta file. For details, see [Troubleshooting](/dm/dm-error-handling.md#the-relay-unit-throws-error-event-from--in--diff-from-passed-in-event--or-a-migration-task-is-interrupted-with-failing-to-get-or-parse-binlog-errors-like-get-binlog-error-error-1236-hy000-and-binlog-checksum-mismatch-data-may-be-corrupted-returned). + +#### Use PT-osc/GH-ost in upstream + +In daily MySQL operation and maintenance, usually you use tools such as PT-osc/GH-ost to change the schema online to minimize impact on the business. However, the whole process will be logged to MySQL Binlog. Migrating such data to TiDB downstream will result in a lot of write operations, which is neither efficient nor economical. DM supports third-party data tools such as PT-osc or GH-ost when you configure the migration task. After the configuration, DM will not migrate redundant data and ensure data consistency. For details, see [Migrate from Databases that Use GH-ost/PT-osc](/dm/feature-online-ddl.md). + +## Best practices during migration + +This section introduce how to troubleshoot porblems you encounter during migration. + +### Inconsistent schemas in upstream and downstream + +Common errors include: + +- `messages: Column count doesn't match value count: 3 (columns) vs 2 (values)` +- `Schema/Column doesn't match` + +Usually such issues are caused by changed or added indexes in the downstream TiDB, or tere are more columns in the downstream. When such errors occur, check whether the upstream and downstream schemas are inconsistent. + +To resolve such issues, update the schema information cached in DM to be consistent with the downstream TiDB Schema. For details, see [Manage Table Schemas of Tables to be Migrated](/dm/dm-manage-schema.md). + +If the downstream has more columns, see [Migrate Data to a Downstream TiDB Table with More Columns](/migrate-with-more-columns-downstream.md). 
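When one of these errors is reported, a quick way to confirm the mismatch is to compare the column list of the affected table on both sides. The query below is only an illustration; `user_db` and `user_info` are placeholders for the schema and table named in the error message.

```sql
-- Run on the upstream MySQL instance and on the downstream TiDB cluster,
-- then diff the two result sets to see which columns or positions differ.
SELECT column_name, ordinal_position, column_type
FROM information_schema.columns
WHERE table_schema = 'user_db'
  AND table_name = 'user_info'
ORDER BY ordinal_position;
```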
+ +### Interrupted migration task due to failed DDL + +DM supports skipping or replacing DDL statements that cause the migration task to interrupt. For details, see [Handle Failed DDL Statements](/dm/handle-failed-ddl-statements.md#usage-examples). + +## Data validation after data migration + +Usually you need to validate the consistency of data after data migration. TiDB provides [sync-diff-inspector](/sync-diff-inspector/sync-diff-inspector-overview.md) to help you complete the data validation. + +Now sync-diff-inspector can automatically manage the table list to be checked for data consistency through DM tasks. Compared with the previous manual configuration, it is more efficient. For details, see [Data Check in the DM Replication Scenario](/sync-diff-inspector/dm-diff.md). + +Since DM v6.2, data validation is also supported for incremental replication. For details, see [Continuous Data Validation in DM](/dm/dm-continuous-data-validation.md). + +## Long-term data replication + +If you use DM as a long-term data replication platform, it is necessary to back up the metadata. On the one hand, it ensures the ability to rebuild the migration cluster. On the other hand, it can implement the version control of the migration task through the version control capability. For details, see [Export and Import Data Sources and Task Configuration of Clusters](/dm/dm-export-import-config.md). From 892ce055c093ffe1cfaf16170e9feb5c6acf2eee Mon Sep 17 00:00:00 2001 From: xixirangrang <35301108+hfxsd@users.noreply.github.com> Date: Wed, 28 Sep 2022 23:49:02 +0800 Subject: [PATCH 05/25] Update data-migration-best-practices.md --- dm/data-migration-best-practices.md | 22 +++++++++++----------- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/dm/data-migration-best-practices.md b/dm/data-migration-best-practices.md index 7292bba4a4c7..1fa73ab56501 100644 --- a/dm/data-migration-best-practices.md +++ b/dm/data-migration-best-practices.md @@ -11,7 +11,7 @@ You can use DM in the following scenarios: - Perform full and incremental data migration from a single MySQL-compatible database instance to TiDB - Migrate and merge MySQL shards of small datasets to TiDB -- In the DataHUB scenario, such as the middle platform of business data, and real-time aggregation of business data, use DM as the middleware for data migration +- In the Data HUB scenario, such as the middle platform of business data, and real-time aggregation of business data, use DM as the middleware for data migration This document introduces how to use DM in an elegant and efficient way, and how to avoid the common mistakes when using DM. @@ -27,7 +27,7 @@ This document introduces how to use DM in an elegant and efficient way, and how - DM supports managing 1000 work nodes simultaneously, and the maximum number of tasks is 600. To ensure the high availability of work nodes, you should reserve some work nodes as standby nodes. Reserve 20% to 50% of the number of the work nodes that have run migration task. - A single work node can theoretically support replication QPS of up to 30K QPS/worker. It varies for different schemas and workloads. The ability to handle upstream binlog is up to 20 MB/s/worker. -- If you use DM as a data replication middleware that will be used for a long time, you need to carefully design the deployment architecture of DM components. 
For more information, see [Deploy DM-master and DM-woker](#deploy-dm-master-and-dm-woker) +- If you use DM as a data replication middleware that will be used for a long time, you need to carefully design the deployment architecture of DM components. For more information, see [Deploy DM-master and DM-worker](#deploy-dm-master-and-dm-worker) ## Before data migration @@ -35,7 +35,7 @@ Before data migration, the design of the overall solution is critical. Especiall ### Best practices for the business side -To distribute workloads evenly on multiple nodes, the schema design for the distributed database is very different from traditonal databases. It is disgned for both low migration cost and logic correctness after migration. The following sections describe best practices before data migration. +To distribute workloads evenly on multiple nodes, the schema design for the distributed database is very different from traditional databases. It is designed for both low migration cost and logic correctness after migration. The following sections describe best practices before data migration. #### Business impact of AUTO_INCREMENT in Schema design @@ -61,7 +61,7 @@ When you create a table, you can state that the primary key is either a clustere - Clustered indexes + `AUTO_RANDOM` - This solution can retain the benefits of using clustered indexes and avoid the write hotspot problem. It requires less effort for customization. You can modify the schema attribute when you switch to use TiDB as the write database. In the subsequtial queries, you can sort using the ID column. You can use the [`AUTO_RANDOM`](/auto-random.md) ID column to left shift 5 bits to ensure the order of the query data. For example: + This solution can retain the benefits of using clustered indexes and avoid the write hotspot problem. It requires less effort for customization. You can modify the schema attribute when you switch to use TiDB as the write database. In the subsequent queries, you can sort using the ID column. You can use the [`AUTO_RANDOM`](/auto-random.md) ID column to left shift 5 bits to ensure the order of the query data. For example: ```sql @@ -75,7 +75,7 @@ The following table summarizes the pros and cons of each solution. | Scenario | Recommended solution | Pros | Cons | | :--- | :--- | :--- | :--- | | TIDB will act as the primary and write-intensive database. The business logic is strongly dependent on the continuity of the primary key IDs. | Create the table with non-clustered indexes and set `SHARD_ROW_ID_BIT`. Use `SEQUENCE` as the primary key column. | It can avoid data writing hotspots and ensure the continuity and monotonic increment of business data. | The throughput capacity of data write is reduced (to ensure data write continuity). The performance of primary key queries is reduced. | -| TIDB will act as the primary and write-intensive database. The business logic is strongly dependent on the increment of the primary key IDs. | Create the table with non-clustered indexes and set `SHARD_ROW_ID_BIT`. Use the application ID generator to define the primary key IDs. | It can avoid data writing hotspots, guarantees the performance of data wirte, guarantees the increment of business data, but cannot guarantees continuity. | Need code customization. External ID generators strongly depend on the clock accurancy and might introduce failure risks | +| TIDB will act as the primary and write-intensive database. The business logic is strongly dependent on the increment of the primary key IDs. 
| Create the table with non-clustered indexes and set `SHARD_ROW_ID_BIT`. Use the application ID generator to define the primary key IDs. | It can avoid data writing hotspots, guarantee the performance of data write, and guarantee the increment of business data, but it cannot guarantee continuity. | Needs code customization. External ID generators strongly depend on the clock accuracy and might introduce failure risks. |
| **Recommended solution for data migration**
TIDB will act as the primary and write-intensive database. The business logic does not depend on the continuity of the primary key IDs. | Create the table with clustered indexes and set `AUTO_RANDOM` for the primary key column. | It can avoid data writing hotspots. Limited write throughput ability. Excellent query performance of the primary keys. You can smoothly switch `AUTO_INCREMENT` to `AUTO_RANDOM`. | The primary key ID is random. It is recommended to sort the business data by using inserting the time column. If you have to use the primary key ID to sort data, you can query by using leftshift 5 bits, which can guarantee the increment of the data. | | **Recommended solution for middle platforms**
TIDB will act as the read-only database. | Create the table with non-clustered indexes and set `SHARD_ROW_ID_BIT`. Keep the primary key column consistent with the data source. | It can avoid data writing hotspots. Low customization cost. | The query performance of the primary keys is impacted. | @@ -102,7 +102,7 @@ The following table summarizes the pros and cons of optimistic mode and pessimis | Scenario | Pros | Cons | | :--- | :--- | :--- | | Optimistic mode (Default) | It can ensure that the data migrated to the downstream will not go wrong. | If there are a large number of shards, the migration task will be blocked for a long time, or even stopped if the upstream Binlogs have been cleaned up. You can enable the Relay log to avoid this problem. For more limitations, see [Use Relay log](#use-relay-log). | -| Pessimistic mode| Data can not be blocked during migration. | In this mode, ensure that schame changes are compatible (whether the incremental column has a default value). It is possible that inconsistent data can be overlooked. For more limitations, see [Merge and Migrate Data from Sharded Tables in Optimistic Mode](/dm/feature-shard-merge-optimistic.md#restrictions).| +| Pessimistic mode| Data can not be blocked during migration. | In this mode, ensure that schema changes are compatible (whether the incremental column has a default value). It is possible that inconsistent data can be overlooked. For more limitations, see [Merge and Migrate Data from Sharded Tables in Optimistic Mode](/dm/feature-shard-merge-optimistic.md#restrictions).| ### Other restrictions and impact @@ -121,7 +121,7 @@ The default character set in TiDB is utf8mb4. It is recommended that you use utf ### Best practices for deployment -#### Deploy DM-master and DM-woker +#### Deploy DM-master and DM-worker DM consists of DM-master and DM-worker. @@ -132,7 +132,7 @@ DM consists of DM-master and DM-worker. Splitting the migration task can only guarantee the final consistency of data. Real-time consistency may deviate significantly due to various reasons. -- When migrating and merging MysQL shards, you can split the migration task according to the types of shards in the upstream. For example, if `usertable_1~50` and `Logtable_1~50` are two types of shards, you can create two migration tasks. It can simplify the migration task template and effectively control the impact of interruption in data migration. +- When migrating and merging MySQL shards, you can split the migration task according to the types of shards in the upstream. For example, if `usertable_1~50` and `Logtable_1~50` are two types of shards, you can create two migration tasks. It can simplify the migration task template and effectively control the impact of interruption in data migration. - For large-scale data migration, you can refer to the following to split the migration task: - If you need to migrate multiple databases in the upstream, you can split the migration task in terms of databases. @@ -146,7 +146,7 @@ The following table gives the recommended deployment plans for DM-master and DM- | Large dataset (more than 1 TB) and MySQL shards, one-time data migration | It is recommended to deploy 3 DM-master nodes to ensure the availability of the DM cluster during long-time data migration. | Deploy DM-worker nodes according to the number of data sources or migration tasks. It is recommended to deploy 1~3 idle DM-worker nodes. | | Long-term data migration | It is necessary to deploy 3 DM-master nodes. 
If you deploy DM-master nodes on the cloud, try to deploy them in different availability zones (AZ). | Deploy DM-worker nodes according to the number of data sources or migration tasks. It is necessary to deploy 1.5~2 times the number of DM-worker nodes that are actually needed. | -#### 上游数据源选择与设置 +#### Choose and configure the upstream data source DM supports full data migration, but when doing it, it backs up the full data of the entire database. DM uses the parallel logical backup method. During the backup, it uses a relatively heavy lock [`FLUSH TABLES WITH READ LOCK`](https://dev.mysql.com/doc/refman/8.0/en/flush.html#flush-tables-with-read-lock). At this time, DML and DDL operations of the upstream database will be blocked for a short time. Therefore, it is strongly recommended to use a backup database to perform the full data backup, and enable the GTID function of the data source at the same time (`enable-gtid: true`). This way, you can not only avoid the impact of the upstream during migration, but also switch to the master node in the upstream to reduce the delay during the incremental migration. For the method of switching the upstream MySQL data source, see [Switch DM-worker Connection between Upstream MySQL Instances](/dm/usage-scenario-master-slave-switch.md#switch-dm-worker-connection-via-virtual-ip). @@ -196,7 +196,7 @@ In daily MySQL operation and maintenance, usually you use tools such as PT-osc/G ## Best practices during migration -This section introduce how to troubleshoot porblems you encounter during migration. +This section introduce how to troubleshoot problems you encounter during migration. ### Inconsistent schemas in upstream and downstream @@ -205,7 +205,7 @@ Common errors include: - `messages: Column count doesn't match value count: 3 (columns) vs 2 (values)` - `Schema/Column doesn't match` -Usually such issues are caused by changed or added indexes in the downstream TiDB, or tere are more columns in the downstream. When such errors occur, check whether the upstream and downstream schemas are inconsistent. +Usually such issues are caused by changed or added indexes in the downstream TiDB, or there are more columns in the downstream. When such errors occur, check whether the upstream and downstream schemas are inconsistent. To resolve such issues, update the schema information cached in DM to be consistent with the downstream TiDB Schema. For details, see [Manage Table Schemas of Tables to be Migrated](/dm/dm-manage-schema.md). From 80529999b16519498370b401c49fe1b5b338d016 Mon Sep 17 00:00:00 2001 From: xixirangrang <35301108+hfxsd@users.noreply.github.com> Date: Wed, 28 Sep 2022 23:50:44 +0800 Subject: [PATCH 06/25] Update data-migration-best-practices.md --- dm/data-migration-best-practices.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/dm/data-migration-best-practices.md b/dm/data-migration-best-practices.md index 1fa73ab56501..4bc7bbd9dbf8 100644 --- a/dm/data-migration-best-practices.md +++ b/dm/data-migration-best-practices.md @@ -112,7 +112,7 @@ TiDB supports most MySQL data types. However, some special types are not support #### Character sets and collations -If you want TiDB to support utf8_general_ci, utf8mb4_general_ci, utf8_unicode_ci, utf8mb4_unicode_ci, gbk_chinese_ci and gbk_bin, you need explicitly state it when creating the cluster by setting the value of `new_collations_enabled_on_first_bootstrap` to `true`. 
For more information, see [New framework for collations](character-set-and-collation.md#new-framework-for-collations) +If you want TiDB to support utf8_general_ci, utf8mb4_general_ci, utf8_unicode_ci, utf8mb4_unicode_ci, gbk_chinese_ci and gbk_bin, you need explicitly state it when creating the cluster by setting the value of `new_collations_enabled_on_first_bootstrap` to `true`. For more information, see [New framework for collations](/character-set-and-collation.md#new-framework-for-collations) The default character set in TiDB is utf8mb4. It is recommended that you use utf8mb4 for the upstream and downstream databases and applications. If the upstream database has explicitly specified a character set or collation, you need to check whether TiDB supports it. Since TiDB v6.0.0, GBK is supported. For more information, see the following documents: From 0b933a13f3179071b1faf1914933cb7b1788fefd Mon Sep 17 00:00:00 2001 From: xixirangrang <35301108+hfxsd@users.noreply.github.com> Date: Wed, 28 Sep 2022 23:55:39 +0800 Subject: [PATCH 07/25] Update data-migration-best-practices.md --- dm/data-migration-best-practices.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/dm/data-migration-best-practices.md b/dm/data-migration-best-practices.md index 4bc7bbd9dbf8..14ff69af84f7 100644 --- a/dm/data-migration-best-practices.md +++ b/dm/data-migration-best-practices.md @@ -76,7 +76,7 @@ The following table summarizes the pros and cons of each solution. | :--- | :--- | :--- | :--- | | TIDB will act as the primary and write-intensive database. The business logic is strongly dependent on the continuity of the primary key IDs. | Create the table with non-clustered indexes and set `SHARD_ROW_ID_BIT`. Use `SEQUENCE` as the primary key column. | It can avoid data writing hotspots and ensure the continuity and monotonic increment of business data. | The throughput capacity of data write is reduced (to ensure data write continuity). The performance of primary key queries is reduced. | | TIDB will act as the primary and write-intensive database. The business logic is strongly dependent on the increment of the primary key IDs. | Create the table with non-clustered indexes and set `SHARD_ROW_ID_BIT`. Use the application ID generator to define the primary key IDs. | It can avoid data writing hotspots, guarantees the performance of data write, guarantees the increment of business data, but cannot guarantees continuity. | Need code customization. External ID generators strongly depend on the clock accuracy and might introduce failure risks | -| **Recommended solution for data migration**
TIDB will act as the primary and write-intensive database. The business logic does not depend on the continuity of the primary key IDs. | Create the table with clustered indexes and set `AUTO_RANDOM` for the primary key column. | It can avoid data writing hotspots. Limited write throughput ability. Excellent query performance of the primary keys. You can smoothly switch `AUTO_INCREMENT` to `AUTO_RANDOM`. | The primary key ID is random. It is recommended to sort the business data by using inserting the time column. If you have to use the primary key ID to sort data, you can query by using leftshift 5 bits, which can guarantee the increment of the data. | +| **Recommended solution for data migration**
TiDB will act as the primary and write-intensive database. The business logic does not depend on the continuity of the primary key IDs. | Create the table with clustered indexes and set `AUTO_RANDOM` for the primary key column. | It can avoid data writing hotspots. Limited write throughput ability. Excellent query performance of the primary keys. You can smoothly switch `AUTO_INCREMENT` to `AUTO_RANDOM`. | The primary key ID is random. It is recommended to sort the business data by inserting a time column. If you have to use the primary key ID to sort data, you can left shift 5 bits to query, which can guarantee the increment of the data. |
| **Recommended solution for middle platforms**
TIDB will act as the read-only database. | Create the table with non-clustered indexes and set `SHARD_ROW_ID_BIT`. Keep the primary key column consistent with the data source. | It can avoid data writing hotspots. Low customization cost. | The query performance of the primary keys is impacted. | ### Key points for MySQL shards From 2a6f46fd3c24679a4523e82994520cdb6fe0c286 Mon Sep 17 00:00:00 2001 From: xixirangrang <35301108+hfxsd@users.noreply.github.com> Date: Thu, 29 Sep 2022 00:10:36 +0800 Subject: [PATCH 08/25] Update data-migration-best-practices.md --- dm/data-migration-best-practices.md | 2 -- 1 file changed, 2 deletions(-) diff --git a/dm/data-migration-best-practices.md b/dm/data-migration-best-practices.md index 14ff69af84f7..bf911f9d0d6f 100644 --- a/dm/data-migration-best-practices.md +++ b/dm/data-migration-best-practices.md @@ -63,8 +63,6 @@ When you create a table, you can state that the primary key is either a clustere This solution can retain the benefits of using clustered indexes and avoid the write hotspot problem. It requires less effort for customization. You can modify the schema attribute when you switch to use TiDB as the write database. In the subsequent queries, you can sort using the ID column. You can use the [`AUTO_RANDOM`](/auto-random.md) ID column to left shift 5 bits to ensure the order of the query data. For example: - ```sql - ```sql CREATE TABLE t (a bigint PRIMARY KEY AUTO_RANDOM, b varchar(255)); Select a, a<<5 ,b from t order by a <<5 desc From 689a80a87c99a357ca57a67d339a610b27207c8e Mon Sep 17 00:00:00 2001 From: xixirangrang <35301108+hfxsd@users.noreply.github.com> Date: Thu, 29 Sep 2022 00:11:22 +0800 Subject: [PATCH 09/25] Update data-migration-best-practices.md --- dm/data-migration-best-practices.md | 1 + 1 file changed, 1 insertion(+) diff --git a/dm/data-migration-best-practices.md b/dm/data-migration-best-practices.md index bf911f9d0d6f..fe1c04f48cf0 100644 --- a/dm/data-migration-best-practices.md +++ b/dm/data-migration-best-practices.md @@ -184,6 +184,7 @@ This section does not introduce the filter rules in detail. It is reminded that In the MySQL master/secondary mechanism, the secondary node saves a copy of the Relay logs to ensure the reliability and efficiency of asynchronous replication. DM also supports saving a copy of Relay logs on DM-worker. You can configure information such as the storage location and expiration time. This feature applies to the following scenarios: - During full and incremental data migration, because the amount of full data is large, the entire process takes more time than the time for the upstream Binlog to be archived. It causes the incremental replication task fail to start normally. If you enable Relay log, DM-worker will start receiving Relay log when the full migration is started. This avoids the failure of the incremental task. + - When you use DM to perform long-time data migration, sometimes the migration task is blocked for a long time due to various reasons. If you enable Relay log, you can effectively deal with the problem of upstream Binlog being recycled due to the blocking of the migration task. There are some restrictions using Relay log. DM supports high availability. When a DM-worker fails, it will try to promote an idle DM-worker instance to a working instance. If the upstream Binlog does not contain the necessary migration logs, it may cause interruption. 
You need to intervene manually to copy the Relay log to the new DM-worker node as soon as possible, and modify the corresponding Relay meta file. For details, see [Troubleshooting](/dm/dm-error-handling.md#the-relay-unit-throws-error-event-from--in--diff-from-passed-in-event--or-a-migration-task-is-interrupted-with-failing-to-get-or-parse-binlog-errors-like-get-binlog-error-error-1236-hy000-and-binlog-checksum-mismatch-data-may-be-corrupted-returned). From 4be3fe6ff73b17627379297265cb78edb17ceafd Mon Sep 17 00:00:00 2001 From: xixirangrang <35301108+hfxsd@users.noreply.github.com> Date: Thu, 29 Sep 2022 07:20:45 +0800 Subject: [PATCH 10/25] change the file name --- TOC.md | 1 + ...best-practices.md => dm-best-practices.md} | 22 +++++++++---------- 2 files changed, 12 insertions(+), 11 deletions(-) rename dm/{data-migration-best-practices.md => dm-best-practices.md} (87%) diff --git a/TOC.md b/TOC.md index a9e7724aa1f5..2c04682af9ef 100644 --- a/TOC.md +++ b/TOC.md @@ -388,6 +388,7 @@ - [About TiDB Data Migration](/dm/dm-overview.md) - [Architecture](/dm/dm-arch.md) - [Quick Start](/dm/quick-start-with-dm.md) + - [Best Practices](/dm/dm-best-practices.md) - Deploy a DM cluster - [Hardware and Software Requirements](/dm/dm-hardware-and-software-requirements.md) - [Use TiUP (Recommended)](/dm/deploy-a-dm-cluster-using-tiup.md) diff --git a/dm/data-migration-best-practices.md b/dm/dm-best-practices.md similarity index 87% rename from dm/data-migration-best-practices.md rename to dm/dm-best-practices.md index fe1c04f48cf0..cdd958ed7e66 100644 --- a/dm/data-migration-best-practices.md +++ b/dm/dm-best-practices.md @@ -5,12 +5,12 @@ summary: Learn about best practices when you use TiDB Data Migration (DM) to mig # Data Migration (DM) Best Practices -TiDB Data Migration (DM) is an data migration tool developed by PingCAP. It supports the full data migration and the incremental data replication from MySQL-compatible databases such as MySQL, Percona MySQL, MariaDB, AWS MySQL RDS, AWS Aurora into TiDB. +[TiDB Data Migration (DM)](https://github.com/pingcap/tiflow/tree/master/dm) is an data migration tool developed by PingCAP. It supports the full data migration and the incremental data replication from MySQL-compatible databases such as MySQL, Percona MySQL, MariaDB, AWS MySQL RDS, AWS Aurora into TiDB. You can use DM in the following scenarios: - Perform full and incremental data migration from a single MySQL-compatible database instance to TiDB -- Migrate and merge MySQL shards of small datasets to TiDB +- Migrate and merge MySQL shards of small datasets (less than 1 TB) to TiDB - In the Data HUB scenario, such as the middle platform of business data, and real-time aggregation of business data, use DM as the middleware for data migration This document introduces how to use DM in an elegant and efficient way, and how to avoid the common mistakes when using DM. @@ -61,7 +61,7 @@ When you create a table, you can state that the primary key is either a clustere - Clustered indexes + `AUTO_RANDOM` - This solution can retain the benefits of using clustered indexes and avoid the write hotspot problem. It requires less effort for customization. You can modify the schema attribute when you switch to use TiDB as the write database. In the subsequent queries, you can sort using the ID column. You can use the [`AUTO_RANDOM`](/auto-random.md) ID column to left shift 5 bits to ensure the order of the query data. 
For example: + This solution can retain the benefits of using clustered indexes and avoid the write hotspot problem. It requires less effort for customization. You can modify the schema attribute when you switch to use TiDB as the write database. In the subsequent queries, you can sort using the ID column. If you have to use the ID column to sort data, you can use the [`AUTO_RANDOM`](/auto-random.md) ID column and left shift 5 bits to ensure the order of the query data. For example: ```sql CREATE TABLE t (a bigint PRIMARY KEY AUTO_RANDOM, b varchar(255)); @@ -72,10 +72,10 @@ The following table summarizes the pros and cons of each solution. | Scenario | Recommended solution | Pros | Cons | | :--- | :--- | :--- | :--- | -| TIDB will act as the primary and write-intensive database. The business logic is strongly dependent on the continuity of the primary key IDs. | Create the table with non-clustered indexes and set `SHARD_ROW_ID_BIT`. Use `SEQUENCE` as the primary key column. | It can avoid data writing hotspots and ensure the continuity and monotonic increment of business data. | The throughput capacity of data write is reduced (to ensure data write continuity). The performance of primary key queries is reduced. | -| TIDB will act as the primary and write-intensive database. The business logic is strongly dependent on the increment of the primary key IDs. | Create the table with non-clustered indexes and set `SHARD_ROW_ID_BIT`. Use the application ID generator to define the primary key IDs. | It can avoid data writing hotspots, guarantees the performance of data write, guarantees the increment of business data, but cannot guarantees continuity. | Need code customization. External ID generators strongly depend on the clock accuracy and might introduce failure risks | -| **Recommended solution for data migration**
TIDB will act as the primary and write-intensive database. The business logic does not depend on the continuity of the primary key IDs. | Create the table with clustered indexes and set `AUTO_RANDOM` for the primary key column. | It can avoid data writing hotspots. Limited write throughput ability. Excellent query performance of the primary keys. You can smoothly switch `AUTO_INCREMENT` to `AUTO_RANDOM`. | The primary key ID is random. It is recommended to sort the business data by using inserting the time column. If you have to use the primary key ID to sort data, you can left shift 5 bits to query, which can guarantee the increment of the data. | -| **Recommended solution for middle platforms**
TIDB will act as the read-only database. | Create the table with non-clustered indexes and set `SHARD_ROW_ID_BIT`. Keep the primary key column consistent with the data source. | It can avoid data writing hotspots. Low customization cost. | The query performance of the primary keys is impacted. | +| TiDB will act as the primary and write-intensive database. The business logic is strongly dependent on the continuity of the primary key IDs. | Create the table with non-clustered indexes and set `SHARD_ROW_ID_BIT`. Use `SEQUENCE` as the primary key column. | It can avoid data writing hotspots and ensure the continuity and monotonic increment of business data. | The throughput capacity of data write is reduced (to ensure data write continuity). The performance of primary key queries is reduced. | +| TiDB will act as the primary and write-intensive database. The business logic is strongly dependent on the increment of the primary key IDs. | Create the table with non-clustered indexes and set `SHARD_ROW_ID_BIT`. Use the application ID generator to define the primary key IDs. | It can avoid data writing hotspots, guarantees the performance of data write, guarantees the increment of business data, but cannot guarantees continuity. | Need code customization. External ID generators strongly depend on the clock accuracy and might introduce failure risks | +| TiDB will act as the primary and write-intensive database. The business logic does not depend on the continuity of the primary key IDs. | Create the table with clustered indexes and set `AUTO_RANDOM` for the primary key column. | It can avoid data writing hotspots. Limited write throughput ability. Excellent query performance of the primary keys. You can smoothly switch `AUTO_INCREMENT` to `AUTO_RANDOM`. | The primary key ID is random. It is recommended to sort the business data by using inserting the time column. If you have to use the primary key ID to sort data, you can left shift 5 bits to query, which can guarantee the increment of the data. | +| TiDB will act as the read-only database. | Create the table with non-clustered indexes and set `SHARD_ROW_ID_BIT`. Keep the primary key column consistent with the data source. | It can avoid data writing hotspots. Low customization cost. | The query performance of the primary keys is impacted. | ### Key points for MySQL shards @@ -83,7 +83,7 @@ The following table summarizes the pros and cons of each solution. It is recommended that you use DM to [migrate and merge MySQL shards of small datasets to TiDB](/migrate-small-mysql-shards-to-tidb.md). The benefits are not described here. -This section only introduces the scenario of data archiving. Data is constantly being written. As time goes by, large amounts of data gradually change from hot data to warm or even cold data. Fortunately, in TiDB, you can use [placement rules](/configure-placement-rules.md) to set different placement rules for data. The minimum granularity is [partitioned tables](/partitioned-table.md). +Besides data merging, another typical scenario is data archiving. Data is constantly being written. As time goes by, large amounts of data gradually change from hot data to warm or even cold data. Fortunately, in TiDB, you can use [placement rules](/configure-placement-rules.md) to set different placement rules for data. The minimum granularity is [partitioned tables](/partitioned-table.md). 
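As an illustration only, the following sketch shows one way to express such a rule with placement policies in SQL. It assumes a TiDB version that supports `CREATE PLACEMENT POLICY`, TiKV stores labeled with `disk=hdd`, and an `orders` table partitioned by year; none of these names or labels are required by DM.

```sql
-- Hypothetical example: keep old (cold) partitions on cheaper storage.
CREATE PLACEMENT POLICY cold_storage CONSTRAINTS="[+disk=hdd]";

CREATE TABLE orders (
    id BIGINT NOT NULL,
    created_at DATETIME NOT NULL,
    PRIMARY KEY (id, created_at)
)
PARTITION BY RANGE (YEAR(created_at)) (
    PARTITION p2021 VALUES LESS THAN (2022),
    PARTITION p2022 VALUES LESS THAN (2023),
    PARTITION pmax VALUES LESS THAN (MAXVALUE)
);

ALTER TABLE orders PARTITION p2021 PLACEMENT POLICY=cold_storage;
```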
Therefore, it is recommended that for write-intensive scenarios, you need to evaluate from the beginning whether you need to archive data and store hot and cold data on different media separately. If you need to archive data, you need to set the partition rules before migration (TiDB does not support Table Rebuild operations yet). This can save you from the need to create tables and import data in future. @@ -91,9 +91,9 @@ Therefore, it is recommended that for write-intensive scenarios, you need to eva DM uses the pessimistic mode by default. In scenarios of migrating and merging MySQL shards, changes in upstream shard schemas can block DML writing to downstream databases. You need to wait for all the schemas to change and have the same structure, and then continue migration from the break point. -- If the upstream schema changes take a long time, it can cause the upstream Binlog to be cleaned up. You can enable the Relay log to avoid this problem. +- If the upstream schema changes take a long time, it might cause the upstream Binlog to be cleaned up. You can enable the Relay log to avoid this problem. -- If you do not want to block data write due to upstream schema changes, consider using the optimistic mode. In this case, DM will not block the data migration even when it spots changes in the upstream shard schemas, but will continue to migrate the data. However, if incompatible formats are spotted in upstream and downstream, the migration task will stop. You need to resolve this issue manually. +- If you do not want to block data write due to upstream schema changes, consider using the optimistic mode. In this case, DM will not block the data migration even when it spots changes in the upstream shard schemas, but will continue to migrate the data. However, if DM spots incompatible formats in upstream and downstream, the migration task will stop. You need to resolve this issue manually. The following table summarizes the pros and cons of optimistic mode and pessimistic mode. @@ -110,7 +110,7 @@ TiDB supports most MySQL data types. However, some special types are not support #### Character sets and collations -If you want TiDB to support utf8_general_ci, utf8mb4_general_ci, utf8_unicode_ci, utf8mb4_unicode_ci, gbk_chinese_ci and gbk_bin, you need explicitly state it when creating the cluster by setting the value of `new_collations_enabled_on_first_bootstrap` to `true`. For more information, see [New framework for collations](/character-set-and-collation.md#new-framework-for-collations) +Since TiDB v6.0.0, the new framework for collations are used by default. If you want TiDB to support utf8_general_ci, utf8mb4_general_ci, utf8_unicode_ci, utf8mb4_unicode_ci, gbk_chinese_ci and gbk_bin, you need explicitly state it when creating the cluster by setting the value of `new_collations_enabled_on_first_bootstrap` to `true`. For more information, see [New framework for collations](/character-set-and-collation.md#new-framework-for-collations) The default character set in TiDB is utf8mb4. It is recommended that you use utf8mb4 for the upstream and downstream databases and applications. If the upstream database has explicitly specified a character set or collation, you need to check whether TiDB supports it. Since TiDB v6.0.0, GBK is supported. 
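
If you do need these legacy collations, one way to state the option is in the TiUP topology file when the cluster is first deployed, because the value only takes effect at the first bootstrap. The following fragment is a sketch, not a complete topology:

```yaml
# Sketch of a TiUP topology fragment; only the relevant item is shown.
server_configs:
  tidb:
    new_collations_enabled_on_first_bootstrap: true
```
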
For more information, see the following documents: From 1ea61de7a0f7ba15afb9bbbac8308d37ed405b0f Mon Sep 17 00:00:00 2001 From: xixirangrang <35301108+hfxsd@users.noreply.github.com> Date: Thu, 29 Sep 2022 10:52:40 +0800 Subject: [PATCH 11/25] Update dm-best-practices.md --- dm/dm-best-practices.md | 108 +++++++++++++++++++++------------------- 1 file changed, 57 insertions(+), 51 deletions(-) diff --git a/dm/dm-best-practices.md b/dm/dm-best-practices.md index cdd958ed7e66..e51f9dbde3df 100644 --- a/dm/dm-best-practices.md +++ b/dm/dm-best-practices.md @@ -3,9 +3,9 @@ title: Data Migration (DM) Best Practices summary: Learn about best practices when you use TiDB Data Migration (DM) to migrate data. --- -# Data Migration (DM) Best Practices +# TiDB DM Best Practices -[TiDB Data Migration (DM)](https://github.com/pingcap/tiflow/tree/master/dm) is an data migration tool developed by PingCAP. It supports the full data migration and the incremental data replication from MySQL-compatible databases such as MySQL, Percona MySQL, MariaDB, AWS MySQL RDS, AWS Aurora into TiDB. +[TiDB Data Migration (DM)](https://github.com/pingcap/tiflow/tree/master/dm) is an data migration tool developed by PingCAP. It supports full and incremental data migration from MySQL-compatible databases such as MySQL, Percona MySQL, MariaDB, Amazon RDS for MySQL and Amazon Aurora into TiDB. You can use DM in the following scenarios: @@ -19,49 +19,49 @@ This document introduces how to use DM in an elegant and efficient way, and how | Performance item | Limitation | | ----------------- | :--------: | -| Max Work Nodes | 1000 | -| Max Task number | 600 | +| Max work nodes | 1000 | +| Max task number | 600 | | Max QPS | 30k QPS/worker | | Max Binlog throughput | 20 MB/s/worker | -| Table number limit per task | Unlimited | +| Table number limit per task | Unlimited | -- DM supports managing 1000 work nodes simultaneously, and the maximum number of tasks is 600. To ensure the high availability of work nodes, you should reserve some work nodes as standby nodes. Reserve 20% to 50% of the number of the work nodes that have run migration task. +- DM supports managing 1000 work nodes simultaneously, and the maximum number of tasks is 600. To ensure the high availability of work nodes, you should reserve some work nodes as standby nodes. Reserve 20% to 50% of the number of the work nodes that have running migration tasks. - A single work node can theoretically support replication QPS of up to 30K QPS/worker. It varies for different schemas and workloads. The ability to handle upstream binlog is up to 20 MB/s/worker. -- If you use DM as a data replication middleware that will be used for a long time, you need to carefully design the deployment architecture of DM components. For more information, see [Deploy DM-master and DM-worker](#deploy-dm-master-and-dm-worker) +- If you use DM as a data replication middleware that will run for a long time, you need to carefully design the deployment architecture of DM components. For more information, see [Deploy DM-master and DM-worker](#deploy-dm-master-and-dm-worker) ## Before data migration -Before data migration, the design of the overall solution is critical. Especially the design of the scheme before migration is the most important part of the whole scheme. The following sections describe best practices and scenarios from the business side and the implementation side. +Before data migration, the design of the overall solution is critical. 
The following sections describe best practices and scenarios from the business side and the implementation side. ### Best practices for the business side -To distribute workloads evenly on multiple nodes, the schema design for the distributed database is very different from traditional databases. It is designed for both low migration cost and logic correctness after migration. The following sections describe best practices before data migration. +To distribute the workload evenly on multiple nodes, the design for the distributed database is different from traditional databases. It is designed for both low migration cost and logic correctness after migration. The following sections describe best practices before data migration. -#### Business impact of AUTO_INCREMENT in Schema design +#### Business impact of AUTO_INCREMENT in schema design `AUTO_INCREMENT` in TiDB is compatible with `AUTO_INCREMENT` in MySQL. However, as a distributed database, TiDB usually has multiple computing nodes (entry on the client end). When the application data is written, the workload is evenly distributed. This leads to the result that when there is an `AUTO_INCREMENT` column in the table, the auto-increment ID may not be consecutive. For more details, see [AUTO_INCREMENT](/auto-increment.md#implementation-principles). If your business has a strong dependence on the auto-increment ID, consider using the [SEQUENCE function](/sql-statements/sql-statement-create-sequence.md#sequence-function). -#### Usage of Clustered indexes +#### Usage of clustered indexes -When you create a table, you can state that the primary key is either a clustered index or a non-clustered index. The following sections describe the pros and cons of each solution. +When you create a table, you can state that the primary key is either a clustered index or a non-clustered index. The following sections describe the pros and cons of each choice. - Clustered indexes - [Clustered indexes](/clustered-indexes.md) use the primary key as the handle ID (row ID) for data storage. Querying using the primary key can avoid table lookup, which effectively improves the query performance. However, if the table is write-intensive and the primary key uses [`AUTO_INCREMENT`](/auto-increment.md), it is very likely to cause the [write hotspot problem](/best-practices/high-concurrency-best-practices.md#highly-concurrent-write-intensive-scenario) of data storage, resulting in a mediocre performance of the cluster and the performance bottleneck of a single storage node. + [Clustered indexes](/clustered-indexes.md) use the primary key as the handle ID (row ID) for data storage. Querying using the primary key can avoid table lookup, which effectively improves the query performance. However, if the table is write-intensive and the primary key uses [`AUTO_INCREMENT`](/auto-increment.md), it is very likely to cause [write hotspot problems](/best-practices/high-concurrency-best-practices.md#highly-concurrent-write-intensive-scenario), resulting in a mediocre performance of the cluster and the performance bottleneck of a single storage node. - Non-clustered indexes + `shard row id bit` - Using non-clustered indexes and `shard row id bit`, you can avoid the write hotspot problem when using `AUTO_INCREMENT`. However, table lookup in this scenario can affect the query performance when querying using the primary key. + Using non-clustered indexes and `shard row id bit`, you can avoid the write hotspot problem when using `AUTO_INCREMENT`. 
However, table lookup in this scenario can impact the query performance when querying using the primary key. - Clustered indexes + external distributed ID generators - If you want to use clustered indexes and keep the IDs consecutive, consider using external distributed ID generators, such as Snowflake and Leaf. The application program generates sequence IDs, which can guarantee the IDs consecutive to a certain extent. It also retains the benefits of using clustered indexes. But you need to customize the related applications. + If you want to use clustered indexes and keep the IDs consecutive, consider using external distributed ID generators, such as the Snowflake algorithm and Leaf. The application program generates sequence IDs, which can guarantee that the IDs are consecutive to a certain extent. It also retains the benefits of using clustered indexes. But you need to customize the applications. - Clustered indexes + `AUTO_RANDOM` - This solution can retain the benefits of using clustered indexes and avoid the write hotspot problem. It requires less effort for customization. You can modify the schema attribute when you switch to use TiDB as the write database. In the subsequent queries, you can sort using the ID column. If you have to use the ID column to sort data, you can use the [`AUTO_RANDOM`](/auto-random.md) ID column and left shift 5 bits to ensure the order of the query data. For example: + This solution can retain the benefits of using clustered indexes and avoid the write hotspot problem. It requires less effort for customization. You can modify the schema attribute when you switch to use TiDB as the write database. In subsequent queries, if you have to use the ID column to sort data, you can use the [`AUTO_RANDOM`](/auto-random.md) ID column and left shift 5 bits to ensure the order of the query data. For example: ```sql CREATE TABLE t (a bigint PRIMARY KEY AUTO_RANDOM, b varchar(255)); @@ -72,20 +72,20 @@ The following table summarizes the pros and cons of each solution. | Scenario | Recommended solution | Pros | Cons | | :--- | :--- | :--- | :--- | -| TiDB will act as the primary and write-intensive database. The business logic is strongly dependent on the continuity of the primary key IDs. | Create the table with non-clustered indexes and set `SHARD_ROW_ID_BIT`. Use `SEQUENCE` as the primary key column. | It can avoid data writing hotspots and ensure the continuity and monotonic increment of business data. | The throughput capacity of data write is reduced (to ensure data write continuity). The performance of primary key queries is reduced. | -| TiDB will act as the primary and write-intensive database. The business logic is strongly dependent on the increment of the primary key IDs. | Create the table with non-clustered indexes and set `SHARD_ROW_ID_BIT`. Use the application ID generator to define the primary key IDs. | It can avoid data writing hotspots, guarantees the performance of data write, guarantees the increment of business data, but cannot guarantees continuity. | Need code customization. External ID generators strongly depend on the clock accuracy and might introduce failure risks | -| TiDB will act as the primary and write-intensive database. The business logic does not depend on the continuity of the primary key IDs. | Create the table with clustered indexes and set `AUTO_RANDOM` for the primary key column. | It can avoid data writing hotspots. Limited write throughput ability. Excellent query performance of the primary keys. 
You can smoothly switch `AUTO_INCREMENT` to `AUTO_RANDOM`. | The primary key ID is random. It is recommended to sort the business data by using inserting the time column. If you have to use the primary key ID to sort data, you can left shift 5 bits to query, which can guarantee the increment of the data. | -| TiDB will act as the read-only database. | Create the table with non-clustered indexes and set `SHARD_ROW_ID_BIT`. Keep the primary key column consistent with the data source. | It can avoid data writing hotspots. Low customization cost. | The query performance of the primary keys is impacted. | +|
<ul><li>TiDB will act as the primary and write-intensive database.</li><li>The business logic strongly depends on the continuity of the primary key IDs.</li></ul> | Create the table with non-clustered indexes and set `SHARD_ROW_ID_BIT`. Use `SEQUENCE` as the primary key column. | It can avoid data write hotspots and ensure the continuity and monotonic increment of business data. | <ul><li>The throughput capacity of data write is reduced to ensure data write continuity.</li><li>The performance of primary key queries is reduced.</li></ul> |
|
<ul><li>TiDB will act as the primary and write-intensive database.</li><li>The business logic strongly depends on the increment of the primary key IDs.</li></ul> | Create the table with non-clustered indexes and set `SHARD_ROW_ID_BIT`. Use an application ID generator to generate the primary key IDs. | It can avoid data write hotspots, guarantee the performance of data write, guarantee the increment of business data, but cannot guarantee continuity. | <ul><li>You need to customize the application.</li><li>External ID generators strongly depend on the clock accuracy and might introduce failure risks.</li></ul> |
|
<ul><li>TiDB will act as the primary and write-intensive database.</li><li>The business logic does not depend on the continuity of the primary key IDs.</li></ul> | Create the table with clustered indexes and set `AUTO_RANDOM` for the primary key column. | <ul><li>It can avoid data write hotspots and has excellent query performance of the primary keys.</li><li>You can smoothly switch `AUTO_INCREMENT` to `AUTO_RANDOM`.</li></ul> | <ul><li>The primary key ID is random.</li><li>The write throughput ability is limited.</li><li>It is recommended to sort the business data by inserting a time column.</li><li>If you have to use the primary key ID to sort data, you can left shift 5 bits to query, which can guarantee the increment of the data.</li></ul> |
| TiDB will act as the read-only database. | Create the table with non-clustered indexes and set `SHARD_ROW_ID_BIT`. Keep the primary key column consistent with the data source. |
  • It can avoid data write hotspots.
  • It requires less customization cost.
  • | The query performance of the primary keys is impacted. | ### Key points for MySQL shards #### Splitting and merging -It is recommended that you use DM to [migrate and merge MySQL shards of small datasets to TiDB](/migrate-small-mysql-shards-to-tidb.md). The benefits are not described here. +It is recommended that you use DM to [migrate and merge MySQL shards of small datasets to TiDB](/migrate-small-mysql-shards-to-tidb.md). -Besides data merging, another typical scenario is data archiving. Data is constantly being written. As time goes by, large amounts of data gradually change from hot data to warm or even cold data. Fortunately, in TiDB, you can use [placement rules](/configure-placement-rules.md) to set different placement rules for data. The minimum granularity is [partitioned tables](/partitioned-table.md). +Besides data merging, another typical scenario is data archiving. Data is constantly being written. As time goes by, large amounts of data gradually change from hot data to warm or even cold data. Fortunately, in TiDB, you can use [placement rules](/configure-placement-rules.md) to set different placement rules for data. The minimum granularity is [a partition](/partitioned-table.md). -Therefore, it is recommended that for write-intensive scenarios, you need to evaluate from the beginning whether you need to archive data and store hot and cold data on different media separately. If you need to archive data, you need to set the partition rules before migration (TiDB does not support Table Rebuild operations yet). This can save you from the need to create tables and import data in future. +Therefore, it is recommended that for write-intensive scenarios, you need to evaluate from the beginning whether you need to archive data and store hot and cold data on different media separately. If you need to archive data, you need to set the partitioning rules before migration (TiDB does not support Table Rebuild operations yet). This can save you from the need to create tables and import data in future. #### The pessimistic mode and the optimistic mode @@ -95,12 +95,12 @@ DM uses the pessimistic mode by default. In scenarios of migrating and merging M - If you do not want to block data write due to upstream schema changes, consider using the optimistic mode. In this case, DM will not block the data migration even when it spots changes in the upstream shard schemas, but will continue to migrate the data. However, if DM spots incompatible formats in upstream and downstream, the migration task will stop. You need to resolve this issue manually. -The following table summarizes the pros and cons of optimistic mode and pessimistic mode. +The following table summarizes the pros and cons of optimistic mode and pessimistic modes. | Scenario | Pros | Cons | | :--- | :--- | :--- | -| Optimistic mode (Default) | It can ensure that the data migrated to the downstream will not go wrong. | If there are a large number of shards, the migration task will be blocked for a long time, or even stopped if the upstream Binlogs have been cleaned up. You can enable the Relay log to avoid this problem. For more limitations, see [Use Relay log](#use-relay-log). | -| Pessimistic mode| Data can not be blocked during migration. | In this mode, ensure that schema changes are compatible (whether the incremental column has a default value). It is possible that inconsistent data can be overlooked. 
For more limitations, see [Merge and Migrate Data from Sharded Tables in Optimistic Mode](/dm/feature-shard-merge-optimistic.md#restrictions).| +| Pessimistic mode (Default) | It can ensure that the data migrated to the downstream will not go wrong. | If there are a large number of shards, the migration task will be blocked for a long time, or even stop if the upstream Binlogs have been cleaned up. You can enable the Relay log to avoid this problem. For more information, see [Use Relay log](#use-relay-log). | +| Optimistic mode| Data can not be blocked during migration. | In this mode, ensure that schema changes are compatible (whether the incremental column has a default value). It is possible that inconsistent data can be overlooked. For more information, see [Merge and Migrate Data from Sharded Tables in Optimistic Mode](/dm/feature-shard-merge-optimistic.md#restrictions).| ### Other restrictions and impact @@ -110,9 +110,11 @@ TiDB supports most MySQL data types. However, some special types are not support #### Character sets and collations -Since TiDB v6.0.0, the new framework for collations are used by default. If you want TiDB to support utf8_general_ci, utf8mb4_general_ci, utf8_unicode_ci, utf8mb4_unicode_ci, gbk_chinese_ci and gbk_bin, you need explicitly state it when creating the cluster by setting the value of `new_collations_enabled_on_first_bootstrap` to `true`. For more information, see [New framework for collations](/character-set-and-collation.md#new-framework-for-collations) +Since TiDB v6.0.0, the new framework for collations are used by default. If you want TiDB to support utf8_general_ci, utf8mb4_general_ci, utf8_unicode_ci, utf8mb4_unicode_ci, gbk_chinese_ci and gbk_bin, you need explicitly state it when creating the cluster by setting the value of `new_collations_enabled_on_first_bootstrap` to `true`. For more information, see [New framework for collations](/character-set-and-collation.md#new-framework-for-collations). -The default character set in TiDB is utf8mb4. It is recommended that you use utf8mb4 for the upstream and downstream databases and applications. If the upstream database has explicitly specified a character set or collation, you need to check whether TiDB supports it. Since TiDB v6.0.0, GBK is supported. For more information, see the following documents: +The default character set in TiDB is utf8mb4. It is recommended that you use utf8mb4 for the upstream and downstream databases and applications. If the upstream database has explicitly specified a character set or collation, you need to check whether TiDB supports it. + +Since TiDB v6.0.0, GBK is supported. For more information, see the following documents: - [Character Set and Collation](/character-set-and-collation.md) - [GBK compatibility](/character-set-gbk.md#mysql-compatibility) @@ -123,30 +125,32 @@ The default character set in TiDB is utf8mb4. It is recommended that you use utf DM consists of DM-master and DM-worker. -- DM-master manages the metadata of migration tasks and is the central scheduling of DM-worker. It is the core of the whole DM platform. Therefore, DM-master can be deployed as clusters to ensure the availability of the DM platform. +- DM-master manages the metadata of migration tasks and scheduls DM-worker nodes. It is the core of the whole DM platform. Therefore, DM-master can be deployed as clusters to ensure the availability of the DM platform. - DM-worker executes upstream and downstream migration tasks. It is a stateless node. You can deploy at most 1000 DM-worker nodes. 
When using DM, you can reserve some idle DM-workers to ensure high availability. #### Plan the migration tasks -Splitting the migration task can only guarantee the final consistency of data. Real-time consistency may deviate significantly due to various reasons. +When migrating and merging MySQL shards, you can split the migration task according to the types of shards in the upstream. For example, if `usertable_1~50` and `Logtable_1~50` are two types of shards, you can create two migration tasks. It can simplify the migration task template and effectively control the impact of interruption in data migration. + +For migration of large datasets, you can refer to the following suggestions to split the migration task: -- When migrating and merging MySQL shards, you can split the migration task according to the types of shards in the upstream. For example, if `usertable_1~50` and `Logtable_1~50` are two types of shards, you can create two migration tasks. It can simplify the migration task template and effectively control the impact of interruption in data migration. +- If you need to migrate multiple databases in the upstream, you can split the migration task in terms of databases. -- For large-scale data migration, you can refer to the following to split the migration task: - - If you need to migrate multiple databases in the upstream, you can split the migration task in terms of databases. - - Split the task according to the write pressure in the upstream. That is, split the tables with frequent DML operations in the upstream to a separate migration task. Use another migration task to migrate the tables without frequent DML operations. This method can speed up the progress of the migration task to some extent. Especially when there are a large number of logs written to a table in the upstream. But if this table does not matter, this method can effectively solve such problems. +- Split the task according to the write pressure in the upstream, that is, split the tables with frequent DML operations in the upstream to a separate migration task. Use another migration task to migrate the tables without frequent DML operations. This method can speed up the migration progress, especially when there are a large number of logs written to a table in the upstream. But if this table that contains logs does not affect the whole business, this method still works well. -The following table gives the recommended deployment plans for DM-master and DM-worker in different data migration scenarios. +Note that splitting the migration task can only guarantee the final consistency of data. Real-time consistency may deviate significantly due to various reasons. + +The following table describes the recommended deployment plans for DM-master and DM-worker in different scenarios. | Scenario | DM-master deployment | DM-worker deployment | | :--- | :--- | :--- | -| Small dataset (less than 1 TB) and one-time data migration | Deploy 1 DM-master node | Deploy 1~N DM-worker nodes according to the number of upstream data sources. Generally, 1 DM-worker node is recommended. | -| Large dataset (more than 1 TB) and MySQL shards, one-time data migration | It is recommended to deploy 3 DM-master nodes to ensure the availability of the DM cluster during long-time data migration. | Deploy DM-worker nodes according to the number of data sources or migration tasks. It is recommended to deploy 1~3 idle DM-worker nodes. | -| Long-term data migration | It is necessary to deploy 3 DM-master nodes. 
If you deploy DM-master nodes on the cloud, try to deploy them in different availability zones (AZ). | Deploy DM-worker nodes according to the number of data sources or migration tasks. It is necessary to deploy 1.5~2 times the number of DM-worker nodes that are actually needed. | +|
<ul><li>Small dataset (less than 1 TB)</li><li>One-time data migration</li></ul> | Deploy 1 DM-master node | Deploy 1~N DM-worker nodes according to the number of upstream data sources. Generally, 1 DM-worker node is recommended. |
+|
  • Large dataset (more than 1 TB) and migrating and merging MySQL shards
  • One-time data migration
  • | It is recommended to deploy 3 DM-master nodes to ensure the availability of the DM cluster during long-time data migration. | Deploy DM-worker nodes according to the number of data sources or migration tasks. Besides working DM-worker nodes, it is recommended to deploy 1~3 idle DM-worker nodes. | +| Long-term data replication | It is necessary to deploy 3 DM-master nodes. If you deploy DM-master nodes on the cloud, try to deploy them in different availability zones (AZ). | Deploy DM-worker nodes according to the number of data sources or migration tasks. It is necessary to deploy 1.5~2 times the number of DM-worker nodes that are actually needed. | #### Choose and configure the upstream data source -DM supports full data migration, but when doing it, it backs up the full data of the entire database. DM uses the parallel logical backup method. During the backup, it uses a relatively heavy lock [`FLUSH TABLES WITH READ LOCK`](https://dev.mysql.com/doc/refman/8.0/en/flush.html#flush-tables-with-read-lock). At this time, DML and DDL operations of the upstream database will be blocked for a short time. Therefore, it is strongly recommended to use a backup database to perform the full data backup, and enable the GTID function of the data source at the same time (`enable-gtid: true`). This way, you can not only avoid the impact of the upstream during migration, but also switch to the master node in the upstream to reduce the delay during the incremental migration. For the method of switching the upstream MySQL data source, see [Switch DM-worker Connection between Upstream MySQL Instances](/dm/usage-scenario-master-slave-switch.md#switch-dm-worker-connection-via-virtual-ip). +DM backs up the full data of the entire database when performing full data migration, and uses the parallel logical backup method. During the backup, it uses a relatively heavy lock [`FLUSH TABLES WITH READ LOCK`](https://dev.mysql.com/doc/refman/8.0/en/flush.html#flush-tables-with-read-lock). DML and DDL operations of the upstream database will be blocked for a short time. Therefore, it is strongly recommended to use a backup database to perform the full data backup, and enable the GTID function of the data source (`enable-gtid: true`). This way, you can avoid the impact from the upstream, and switch to the master node in the upstream to reduce the delay during the incremental migration. For the method of switching the upstream MySQL data source, see [Switch DM-worker Connection between Upstream MySQL Instances](/dm/usage-scenario-master-slave-switch.md#switch-dm-worker-connection-via-virtual-ip). Note the following: @@ -162,7 +166,7 @@ Note the following: #### Capitalization -TiDB is case-insensitive to Schema name by default, that is, `lower_case_table_names:2`. But most upstream MySQL use Linux systems that are case-sensitive by default. In this case, you need to set `case-sensitive` to `true` to ensure that the schema can be correctly migrated from the upstream. +TiDB schema names are case-insensitive by default, that is, `lower_case_table_names:2`. But most upstream MySQL databases use Linux systems that are case-sensitive by default. In this case, you need to set `case-sensitive` to `true` to ensure that the schema can be correctly migrated from the upstream. 
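
As a sketch of where this option lives, `case-sensitive` is set in the DM task configuration file; the field placement follows the task configuration reference and the task name below is a placeholder:

```yaml
# Fragment of a DM task configuration file; the name is a placeholder.
name: "mysql-to-tidb"
task-mode: all
# Match upstream schema and table names case-sensitively so that
# objects that differ only in case are handled correctly.
case-sensitive: true
```
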
In a special case, for example, if there is a database in the upstream that has both uppercase tables such as `Table` and lowercase tables such as `table`, then an error occurs when creating the schema: @@ -170,32 +174,34 @@ In a special case, for example, if there is a database in the upstream that has #### Filter rules -This section does not introduce the filter rules in detail. It is reminded that you configure the filter rules as soon as you configure the data source. The benefits of configuring the filter rules are: +This section does not elaborate on the filter rules. It is reminded that you configure the filter rules as soon as you configure the data source. The benefits of configuring the filter rules are: - Reduce the number of Binlog events that the downstream needs to process, thereby improving migration efficiency. -- Reduce unnecessary Relay log storage, saving disk space. +- Reduce unnecessary Relay log storage, thereby saving disk space. > **Note:** > > When you migrate and merge MySQL shards, if you have configured filter rules in the data source, you must make sure that the rules match between the data source and the migration task. If they do not match, it may cause the issue that the migration task can not receive incremental data for a long time. -### Use Relay log +### Use the relay log -In the MySQL master/secondary mechanism, the secondary node saves a copy of the Relay logs to ensure the reliability and efficiency of asynchronous replication. DM also supports saving a copy of Relay logs on DM-worker. You can configure information such as the storage location and expiration time. This feature applies to the following scenarios: +In the MySQL master/standby mechanism, the standby node saves a copy of relay logs to ensure the reliability and efficiency of asynchronous replication. DM also supports saving a copy of relay logs on DM-worker. You can configure information such as the storage location and expiration time. This feature applies to the following scenarios: -- During full and incremental data migration, because the amount of full data is large, the entire process takes more time than the time for the upstream Binlog to be archived. It causes the incremental replication task fail to start normally. If you enable Relay log, DM-worker will start receiving Relay log when the full migration is started. This avoids the failure of the incremental task. +- During full and incremental data migration, if the amount of full data is large, the entire process takes more time than the time for the upstream Binlog to be archived. It causes the incremental replication task to fail to start normally. If you enable the relay log, DM-worker will start receiving relay logs when the full migration is started. This avoids the failure of the incremental task. -- When you use DM to perform long-time data migration, sometimes the migration task is blocked for a long time due to various reasons. If you enable Relay log, you can effectively deal with the problem of upstream Binlog being recycled due to the blocking of the migration task. +- When you use DM to perform long-time data replication, sometimes the migration task is blocked for a long time due to various reasons. If you enable the relay log, you can effectively deal with the problem of upstream Binlog being recycled due to the blocking of the migration task. -There are some restrictions using Relay log. DM supports high availability. 
When a DM-worker fails, it will try to promote an idle DM-worker instance to a working instance. If the upstream Binlog does not contain the necessary migration logs, it may cause interruption. You need to intervene manually to copy the Relay log to the new DM-worker node as soon as possible, and modify the corresponding Relay meta file. For details, see [Troubleshooting](/dm/dm-error-handling.md#the-relay-unit-throws-error-event-from--in--diff-from-passed-in-event--or-a-migration-task-is-interrupted-with-failing-to-get-or-parse-binlog-errors-like-get-binlog-error-error-1236-hy000-and-binlog-checksum-mismatch-data-may-be-corrupted-returned). +There are some restrictions using the relay log. DM supports high availability. When a DM-worker fails, it will try to promote an idle DM-worker instance to a working instance. If the upstream Binlog does not contain the necessary migration logs, it may cause interruption. You need to intervene manually to copy the relay log to the new DM-worker node as soon as possible, and modify the corresponding relay meta file. For details, see [Troubleshooting](/dm/dm-error-handling.md#the-relay-unit-throws-error-event-from--in--diff-from-passed-in-event--or-a-migration-task-is-interrupted-with-failing-to-get-or-parse-binlog-errors-like-get-binlog-error-error-1236-hy000-and-binlog-checksum-mismatch-data-may-be-corrupted-returned). #### Use PT-osc/GH-ost in upstream -In daily MySQL operation and maintenance, usually you use tools such as PT-osc/GH-ost to change the schema online to minimize impact on the business. However, the whole process will be logged to MySQL Binlog. Migrating such data to TiDB downstream will result in a lot of write operations, which is neither efficient nor economical. DM supports third-party data tools such as PT-osc or GH-ost when you configure the migration task. After the configuration, DM will not migrate redundant data and ensure data consistency. For details, see [Migrate from Databases that Use GH-ost/PT-osc](/dm/feature-online-ddl.md). +In daily MySQL operation and maintenance, usually you use tools such as PT-osc/GH-ost to change the schema online to minimize impact on the business. However, the whole process will be logged to MySQL Binlog. Migrating such data to TiDB downstream will result in a lot of write operations, which is neither efficient nor economical. + +To resolve this issue, DM supports third-party data tools such as PT-osc and GH-ost when you configure the migration task. When you use such tools, DM does not migrate redundant data and ensure data consistency. For details, see [Migrate from Databases that Use GH-ost/PT-osc](/dm/feature-online-ddl.md). ## Best practices during migration -This section introduce how to troubleshoot problems you encounter during migration. +This section introduces how to troubleshoot problems you encounter during migration. ### Inconsistent schemas in upstream and downstream @@ -206,7 +212,7 @@ Common errors include: Usually such issues are caused by changed or added indexes in the downstream TiDB, or there are more columns in the downstream. When such errors occur, check whether the upstream and downstream schemas are inconsistent. -To resolve such issues, update the schema information cached in DM to be consistent with the downstream TiDB Schema. For details, see [Manage Table Schemas of Tables to be Migrated](/dm/dm-manage-schema.md). +To resolve such issues, update the schema information cached in DM to be consistent with the downstream TiDB schema. 
For details, see [Manage Table Schemas of Tables to be Migrated](/dm/dm-manage-schema.md). If the downstream has more columns, see [Migrate Data to a Downstream TiDB Table with More Columns](/migrate-with-more-columns-downstream.md). @@ -224,4 +230,4 @@ Since DM v6.2, data validation is also supported for incremental replication. Fo ## Long-term data replication -If you use DM as a long-term data replication platform, it is necessary to back up the metadata. On the one hand, it ensures the ability to rebuild the migration cluster. On the other hand, it can implement the version control of the migration task through the version control capability. For details, see [Export and Import Data Sources and Task Configuration of Clusters](/dm/dm-export-import-config.md). +If you use DM to perform a long-term data replication task, it is necessary to back up the metadata. On the one hand, it ensures the ability to rebuild the migration cluster. On the other hand, it can implement the version control of the migration task. For details, see [Export and Import Data Sources and Task Configuration of Clusters](/dm/dm-export-import-config.md). From 05141fc94074a82e85560f4b3776c4a4fe911c31 Mon Sep 17 00:00:00 2001 From: xixirangrang <35301108+hfxsd@users.noreply.github.com> Date: Thu, 29 Sep 2022 11:03:41 +0800 Subject: [PATCH 12/25] Update dm-best-practices.md --- dm/dm-best-practices.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/dm/dm-best-practices.md b/dm/dm-best-practices.md index e51f9dbde3df..4daa527a279d 100644 --- a/dm/dm-best-practices.md +++ b/dm/dm-best-practices.md @@ -1,5 +1,5 @@ --- -title: Data Migration (DM) Best Practices +title: TiDB DM Best Practices summary: Learn about best practices when you use TiDB Data Migration (DM) to migrate data. --- @@ -74,8 +74,8 @@ The following table summarizes the pros and cons of each solution. | :--- | :--- | :--- | :--- | |
<ul><li>TiDB will act as the primary and write-intensive database.</li><li>The business logic strongly depends on the continuity of the primary key IDs.</li></ul> | Create the table with non-clustered indexes and set `SHARD_ROW_ID_BIT`. Use `SEQUENCE` as the primary key column. | It can avoid data write hotspots and ensure the continuity and monotonic increment of business data. | <ul><li>The throughput capacity of data write is reduced to ensure data write continuity.</li><li>The performance of primary key queries is reduced.</li></ul> |
|
<ul><li>TiDB will act as the primary and write-intensive database.</li><li>The business logic strongly depends on the increment of the primary key IDs.</li></ul> | Create the table with non-clustered indexes and set `SHARD_ROW_ID_BIT`. Use an application ID generator to generate the primary key IDs. | It can avoid data write hotspots, guarantee the performance of data write, guarantee the increment of business data, but cannot guarantee continuity. | <ul><li>You need to customize the application.</li><li>External ID generators strongly depend on the clock accuracy and might introduce failure risks.</li></ul> |
-|
<ul><li>TiDB will act as the primary and write-intensive database.</li><li>The business logic does not depend on the continuity of the primary key IDs.</li></ul> | Create the table with clustered indexes and set `AUTO_RANDOM` for the primary key column. | <ul><li>It can avoid data write hotspots and has excellent query performance of the primary keys.</li><li>You can smoothly switch `AUTO_INCREMENT` to `AUTO_RANDOM`.</li></ul> | <ul><li>The primary key ID is random.</li><li>The write throughput ability is limited.</li><li>It is recommended to sort the business data by inserting a time column.</li><li>If you have to use the primary key ID to sort data, you can left shift 5 bits to query, which can guarantee the increment of the data.</li></ul> |
| TiDB will act as the read-only database. | Create the table with non-clustered indexes and set `SHARD_ROW_ID_BIT`. Keep the primary key column consistent with the data source. | <ul><li>It can avoid data write hotspots.</li><li>It requires less customization cost.</li></ul> | The query performance of the primary keys is impacted. |
+|
<ul><li>TiDB will act as the primary and write-intensive database.</li><li>The business logic does not depend on the continuity of the primary key IDs.</li></ul> | Create the table with clustered indexes and set `AUTO_RANDOM` for the primary key column. | <ul><li>It can avoid data write hotspots and has excellent query performance of the primary keys.</li><li>You can smoothly switch `AUTO_INCREMENT` to `AUTO_RANDOM`.</li></ul> | <ul><li>The primary key ID is random.</li><li>The write throughput ability is limited.</li><li>It is recommended to sort the business data by inserting a time column.</li><li>If you have to use the primary key ID to sort data, you can left shift 5 bits to query, which can guarantee the increment of the data.</li></ul> |
| TiDB will act as the read-only database. | Create the table with non-clustered indexes and set `SHARD_ROW_ID_BIT`. Keep the primary key column consistent with the data source. | <ul><li>It can avoid data write hotspots.</li><li>It requires less customization cost.</li></ul>
  • | The query performance of the primary keys is impacted. | ### Key points for MySQL shards From d0e3337e9835daa35dd0dcaacb74ab0736328f3f Mon Sep 17 00:00:00 2001 From: xixirangrang <35301108+hfxsd@users.noreply.github.com> Date: Thu, 29 Sep 2022 11:06:43 +0800 Subject: [PATCH 13/25] Update dm-best-practices.md --- dm/dm-best-practices.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/dm/dm-best-practices.md b/dm/dm-best-practices.md index 4daa527a279d..14f456792b68 100644 --- a/dm/dm-best-practices.md +++ b/dm/dm-best-practices.md @@ -183,7 +183,7 @@ This section does not elaborate on the filter rules. It is reminded that you con > > When you migrate and merge MySQL shards, if you have configured filter rules in the data source, you must make sure that the rules match between the data source and the migration task. If they do not match, it may cause the issue that the migration task can not receive incremental data for a long time. -### Use the relay log +#### Use the relay log In the MySQL master/standby mechanism, the standby node saves a copy of relay logs to ensure the reliability and efficiency of asynchronous replication. DM also supports saving a copy of relay logs on DM-worker. You can configure information such as the storage location and expiration time. This feature applies to the following scenarios: From 554ede96c79be323abbce98d83c48bea0e563f46 Mon Sep 17 00:00:00 2001 From: xixirangrang <35301108+hfxsd@users.noreply.github.com> Date: Thu, 29 Sep 2022 22:04:19 +0800 Subject: [PATCH 14/25] Update dm-best-practices.md --- dm/dm-best-practices.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/dm/dm-best-practices.md b/dm/dm-best-practices.md index 14f456792b68..97dd78d32981 100644 --- a/dm/dm-best-practices.md +++ b/dm/dm-best-practices.md @@ -150,13 +150,13 @@ The following table describes the recommended deployment plans for DM-master and #### Choose and configure the upstream data source -DM backs up the full data of the entire database when performing full data migration, and uses the parallel logical backup method. During the backup, it uses a relatively heavy lock [`FLUSH TABLES WITH READ LOCK`](https://dev.mysql.com/doc/refman/8.0/en/flush.html#flush-tables-with-read-lock). DML and DDL operations of the upstream database will be blocked for a short time. Therefore, it is strongly recommended to use a backup database to perform the full data backup, and enable the GTID function of the data source (`enable-gtid: true`). This way, you can avoid the impact from the upstream, and switch to the master node in the upstream to reduce the delay during the incremental migration. For the method of switching the upstream MySQL data source, see [Switch DM-worker Connection between Upstream MySQL Instances](/dm/usage-scenario-master-slave-switch.md#switch-dm-worker-connection-via-virtual-ip). +DM backs up the full data of the entire database when performing full data migration, and uses the parallel logical backup method. During the backup, it adds a global read lock [`FLUSH TABLES WITH READ LOCK`](https://dev.mysql.com/doc/refman/8.0/en/flush.html#flush-tables-with-read-lock). DML and DDL operations of the upstream database will be blocked for a short time. Therefore, it is strongly recommended to use a backup database in upstream to perform the full data backup, and enable the GTID function of the data source (`enable-gtid: true`). 
This way, you can avoid the impact from the upstream, and switch to the master node in the upstream to reduce the delay during the incremental migration. For the method of switching the upstream MySQL data source, see [Switch DM-worker Connection between Upstream MySQL Instances](/dm/usage-scenario-master-slave-switch.md#switch-dm-worker-connection-via-virtual-ip). Note the following: - You can only perform full data backup on the master node of the upstream database. - In this scenario, you can set the consistency parameter to `none` in the configuration file, `mydumpers.global.extra-args: "--consistency none"`, to avoid adding a heavy lock to the master node. But this may damage the data consistency of the full backup, which may lead to inconsistent data between the upstream and downstream. + In this scenario, you can set the consistency parameter to `none` in the configuration file, `mydumpers.global.extra-args: "--consistency none"`, to avoid adding a global read lock to the master node. But this may damage the data consistency of the full backup, which may lead to inconsistent data between the upstream and downstream. - Use backup snapshots to perform full data migration (only applicable to the migration of MySQL RDS and Aurora RDS on AWS) @@ -174,7 +174,7 @@ In a special case, for example, if there is a database in the upstream that has #### Filter rules -This section does not elaborate on the filter rules. It is reminded that you configure the filter rules as soon as you configure the data source. The benefits of configuring the filter rules are: +You can configure the filter rules as soon as you start configuring the data source. For more information, see [Data Migration Task Configuration Guide](/dm/dm-task-configuration-guide.md). The benefits of configuring the filter rules are: - Reduce the number of Binlog events that the downstream needs to process, thereby improving migration efficiency. - Reduce unnecessary Relay log storage, thereby saving disk space. @@ -222,11 +222,11 @@ DM supports skipping or replacing DDL statements that cause the migration task t ## Data validation after data migration -Usually you need to validate the consistency of data after data migration. TiDB provides [sync-diff-inspector](/sync-diff-inspector/sync-diff-inspector-overview.md) to help you complete the data validation. +It is recommended that you validate the consistency of data after data migration. TiDB provides [sync-diff-inspector](/sync-diff-inspector/sync-diff-inspector-overview.md) to help you complete the data validation. Now sync-diff-inspector can automatically manage the table list to be checked for data consistency through DM tasks. Compared with the previous manual configuration, it is more efficient. For details, see [Data Check in the DM Replication Scenario](/sync-diff-inspector/dm-diff.md). -Since DM v6.2, data validation is also supported for incremental replication. For details, see [Continuous Data Validation in DM](/dm/dm-continuous-data-validation.md). +Since DM v6.2, DM supports continuous data validation for incremental replication. For details, see [Continuous Data Validation in DM](/dm/dm-continuous-data-validation.md). 
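
As an illustration, continuous validation is enabled in the task configuration by pointing a source at a validator. The fragment below is only a sketch based on the options described in that document; the source ID is a placeholder:

```yaml
# Fragment of a DM task configuration file; the source ID is a placeholder.
mysql-instances:
  - source-id: "mysql-replica-01"
    validator-config-name: "global"

validators:
  global:
    mode: full   # "fast" only checks that the replicated rows exist in the downstream
```
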
## Long-term data replication From c4433108805c73827f20ad15f938086febfe9eb1 Mon Sep 17 00:00:00 2001 From: xixirangrang <35301108+hfxsd@users.noreply.github.com> Date: Thu, 29 Sep 2022 22:07:05 +0800 Subject: [PATCH 15/25] Update dm-best-practices.md --- dm/dm-best-practices.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/dm/dm-best-practices.md b/dm/dm-best-practices.md index 97dd78d32981..94835dac03be 100644 --- a/dm/dm-best-practices.md +++ b/dm/dm-best-practices.md @@ -1,16 +1,16 @@ --- -title: TiDB DM Best Practices +title: TiDB Data Migration (DM) Best Practices summary: Learn about best practices when you use TiDB Data Migration (DM) to migrate data. --- -# TiDB DM Best Practices +# TiDB Data Migration (DM) Best Practices [TiDB Data Migration (DM)](https://github.com/pingcap/tiflow/tree/master/dm) is an data migration tool developed by PingCAP. It supports full and incremental data migration from MySQL-compatible databases such as MySQL, Percona MySQL, MariaDB, Amazon RDS for MySQL and Amazon Aurora into TiDB. You can use DM in the following scenarios: - Perform full and incremental data migration from a single MySQL-compatible database instance to TiDB -- Migrate and merge MySQL shards of small datasets (less than 1 TB) to TiDB +- Migrate and merge MySQL shards of small datasets (less than 1 TiB) to TiDB - In the Data HUB scenario, such as the middle platform of business data, and real-time aggregation of business data, use DM as the middleware for data migration This document introduces how to use DM in an elegant and efficient way, and how to avoid the common mistakes when using DM. @@ -144,8 +144,8 @@ The following table describes the recommended deployment plans for DM-master and | Scenario | DM-master deployment | DM-worker deployment | | :--- | :--- | :--- | -|
<ul><li>Small dataset (less than 1 TB)</li><li>One-time data migration</li></ul> | Deploy 1 DM-master node | Deploy 1~N DM-worker nodes according to the number of upstream data sources. Generally, 1 DM-worker node is recommended. |
-|
<ul><li>Large dataset (more than 1 TB) and migrating and merging MySQL shards</li><li>One-time data migration</li></ul> | It is recommended to deploy 3 DM-master nodes to ensure the availability of the DM cluster during long-time data migration. | Deploy DM-worker nodes according to the number of data sources or migration tasks. Besides working DM-worker nodes, it is recommended to deploy 1~3 idle DM-worker nodes. |
+|
<ul><li>Small dataset (less than 1 TiB)</li><li>One-time data migration</li></ul> | Deploy 1 DM-master node | Deploy 1~N DM-worker nodes according to the number of upstream data sources. Generally, 1 DM-worker node is recommended. |
+|
  • Large dataset (more than 1 TiB) and migrating and merging MySQL shards
  • One-time data migration
  • | It is recommended to deploy 3 DM-master nodes to ensure the availability of the DM cluster during long-time data migration. | Deploy DM-worker nodes according to the number of data sources or migration tasks. Besides working DM-worker nodes, it is recommended to deploy 1~3 idle DM-worker nodes. | | Long-term data replication | It is necessary to deploy 3 DM-master nodes. If you deploy DM-master nodes on the cloud, try to deploy them in different availability zones (AZ). | Deploy DM-worker nodes according to the number of data sources or migration tasks. It is necessary to deploy 1.5~2 times the number of DM-worker nodes that are actually needed. | #### Choose and configure the upstream data source From 554b047f678c7f268b40318d86d1804416ddd22a Mon Sep 17 00:00:00 2001 From: xixirangrang Date: Fri, 30 Sep 2022 13:08:00 +0800 Subject: [PATCH 16/25] Update dm/dm-best-practices.md --- dm/dm-best-practices.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/dm/dm-best-practices.md b/dm/dm-best-practices.md index 94835dac03be..a1ab8a775094 100644 --- a/dm/dm-best-practices.md +++ b/dm/dm-best-practices.md @@ -150,7 +150,7 @@ The following table describes the recommended deployment plans for DM-master and #### Choose and configure the upstream data source -DM backs up the full data of the entire database when performing full data migration, and uses the parallel logical backup method. During the backup, it adds a global read lock [`FLUSH TABLES WITH READ LOCK`](https://dev.mysql.com/doc/refman/8.0/en/flush.html#flush-tables-with-read-lock). DML and DDL operations of the upstream database will be blocked for a short time. Therefore, it is strongly recommended to use a backup database in upstream to perform the full data backup, and enable the GTID function of the data source (`enable-gtid: true`). This way, you can avoid the impact from the upstream, and switch to the master node in the upstream to reduce the delay during the incremental migration. For the method of switching the upstream MySQL data source, see [Switch DM-worker Connection between Upstream MySQL Instances](/dm/usage-scenario-master-slave-switch.md#switch-dm-worker-connection-via-virtual-ip). +DM backs up the full data of the entire database when performing full data migration, and uses the parallel logical backup method. During backing up MySQL, it adds a global read lock [`FLUSH TABLES WITH READ LOCK`](https://dev.mysql.com/doc/refman/8.0/en/flush.html#flush-tables-with-read-lock). DML and DDL operations of the upstream database will be blocked for a short time. Therefore, it is strongly recommended to use a backup database in upstream to perform the full data backup, and enable the GTID function of the data source (`enable-gtid: true`). This way, you can avoid the impact from the upstream, and switch to the master node in the upstream to reduce the delay during the incremental migration. For the method of switching the upstream MySQL data source, see [Switch DM-worker Connection between Upstream MySQL Instances](/dm/usage-scenario-master-slave-switch.md#switch-dm-worker-connection-via-virtual-ip). 
Note the following: From 11d415979f2046e342f025fdf820da1eba876b04 Mon Sep 17 00:00:00 2001 From: xixirangrang Date: Fri, 30 Sep 2022 14:36:13 +0800 Subject: [PATCH 17/25] Apply suggestions from code review --- dm/dm-best-practices.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/dm/dm-best-practices.md b/dm/dm-best-practices.md index a1ab8a775094..cb55f72a324a 100644 --- a/dm/dm-best-practices.md +++ b/dm/dm-best-practices.md @@ -45,7 +45,7 @@ If your business has a strong dependence on the auto-increment ID, consider usin #### Usage of clustered indexes -When you create a table, you can state that the primary key is either a clustered index or a non-clustered index. The following sections describe the pros and cons of each choice. +When you create a table, you can declare that the primary key is either a clustered index or a non-clustered index. The following sections describe the pros and cons of each choice. - Clustered indexes @@ -110,7 +110,7 @@ TiDB supports most MySQL data types. However, some special types are not support #### Character sets and collations -Since TiDB v6.0.0, the new framework for collations are used by default. If you want TiDB to support utf8_general_ci, utf8mb4_general_ci, utf8_unicode_ci, utf8mb4_unicode_ci, gbk_chinese_ci and gbk_bin, you need explicitly state it when creating the cluster by setting the value of `new_collations_enabled_on_first_bootstrap` to `true`. For more information, see [New framework for collations](/character-set-and-collation.md#new-framework-for-collations). +Since TiDB v6.0.0, the new framework for collations are used by default. If you want TiDB to support utf8_general_ci, utf8mb4_general_ci, utf8_unicode_ci, utf8mb4_unicode_ci, gbk_chinese_ci and gbk_bin, you need explicitly declare it when creating the cluster by setting the value of `new_collations_enabled_on_first_bootstrap` to `true`. For more information, see [New framework for collations](/character-set-and-collation.md#new-framework-for-collations). The default character set in TiDB is utf8mb4. It is recommended that you use utf8mb4 for the upstream and downstream databases and applications. If the upstream database has explicitly specified a character set or collation, you need to check whether TiDB supports it. From 9f805622ea318aa0dcb76255548db978b1218f47 Mon Sep 17 00:00:00 2001 From: xixirangrang Date: Fri, 30 Sep 2022 14:44:11 +0800 Subject: [PATCH 18/25] Update dm/dm-best-practices.md --- dm/dm-best-practices.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/dm/dm-best-practices.md b/dm/dm-best-practices.md index cb55f72a324a..197960014a30 100644 --- a/dm/dm-best-practices.md +++ b/dm/dm-best-practices.md @@ -99,8 +99,8 @@ The following table summarizes the pros and cons of optimistic mode and pessimis | Scenario | Pros | Cons | | :--- | :--- | :--- | -| Pessimistic mode (Default) | It can ensure that the data migrated to the downstream will not go wrong. | If there are a large number of shards, the migration task will be blocked for a long time, or even stop if the upstream Binlogs have been cleaned up. You can enable the Relay log to avoid this problem. For more information, see [Use Relay log](#use-relay-log). | -| Optimistic mode| Data can not be blocked during migration. | In this mode, ensure that schema changes are compatible (whether the incremental column has a default value). It is possible that inconsistent data can be overlooked. 
For more information, see [Merge and Migrate Data from Sharded Tables in Optimistic Mode](/dm/feature-shard-merge-optimistic.md#restrictions).| +| Pessimistic mode (Default) | It can ensure that the data migrated to the downstream will not go wrong. | If there are a large number of shards, the migration task will be blocked for a long time, or even stop if the upstream binlogs have been cleaned up. You can enable the relay log to avoid this problem. For more information, see [Use relay log](#use-relay-log). | +| Optimistic mode| Data can not be blocked during migration. | In this mode, ensure that schema changes are compatible (whether the incremental column has a default value). It is possible that the inconsistent data can be overlooked. For more information, see [Merge and Migrate Data from Sharded Tables in Optimistic Mode](/dm/feature-shard-merge-optimistic.md#restrictions).| ### Other restrictions and impact From a8fc88bc44be2e60b04de22ee070af3708a337a9 Mon Sep 17 00:00:00 2001 From: xixirangrang Date: Fri, 30 Sep 2022 14:50:36 +0800 Subject: [PATCH 19/25] Update dm/dm-best-practices.md --- dm/dm-best-practices.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/dm/dm-best-practices.md b/dm/dm-best-practices.md index 197960014a30..a0c7311d3314 100644 --- a/dm/dm-best-practices.md +++ b/dm/dm-best-practices.md @@ -72,7 +72,7 @@ The following table summarizes the pros and cons of each solution. | Scenario | Recommended solution | Pros | Cons | | :--- | :--- | :--- | :--- | -|
  • TiDB will act as the primary and write-intensive database.
  • The business logic strongly depends on the continuity of the primary key IDs.
  • | Create the table with non-clustered indexes and set `SHARD_ROW_ID_BIT`. Use `SEQUENCE` as the primary key column. | It can avoid data write hotspots and ensure the continuity and monotonic increment of business data. |
  • The throughput capacity of data write is reduced to ensure data write continuity.
  • The performance of primary key queries is reduced.
  • | +|
  • TiDB will act as the primary and write-intensive database.
  • The business logic strongly depends on the continuity of the primary key IDs.
  • | Create the table with non-clustered indexes and set `SHARD_ROW_ID_BIT`. Use `SEQUENCE` as the primary key column. | It can avoid data write hotspots and ensure the continuity and monotonic increment of business data. |
  • The throughput capacity of data write is decreased to ensure data write continuity.
  • The performance of primary key queries is decreased.
  • | |
  • TiDB will act as the primary and write-intensive database.
  • The business logic strongly depends on the increment of the primary key IDs.
  • | Create the table with non-clustered indexes and set `SHARD_ROW_ID_BIT`. Use an application ID generator to generate the primary key IDs. | It can avoid data write hotspots, guarantee the performance of data write, guarantee the increment of business data, but cannot guarantee continuity. |
  • You need to customize the application.
  • External ID generators strongly depend on the clock accuracy and might introduce failure risks.
  • | |
  • TiDB will act as the primary and write-intensive database.
  • The business logic does not depend on the continuity of the primary key IDs.
  • | Create the table with clustered indexes and set `AUTO_RANDOM` for the primary key column. |
  • It can avoid data write hotspots and has excellent query performance of the primary keys.
  • You can smoothly switch `AUTO_INCREMENT` to `AUTO_RANDOM`.
  • |
  • The primary key ID is random.
  • The write throughput ability is limited.
  • It is recommended to sort the business data by using inserting the time column.
  • If you have to use the primary key ID to sort data, you can left shift 5 bits to query, which can guarantee the increment of the data.
  • | | TiDB will act as the read-only database. | Create the table with non-clustered indexes and set `SHARD_ROW_ID_BIT`. Keep the primary key column consistent with the data source. |
  • It can avoid data write hotspots.
  • It requires less customization cost.
  • | The query performance of the primary keys is impacted. | From a31086d7b673ce10a7f1922ff5692d382451df60 Mon Sep 17 00:00:00 2001 From: xixirangrang Date: Fri, 30 Sep 2022 15:01:21 +0800 Subject: [PATCH 20/25] Update dm/dm-best-practices.md --- dm/dm-best-practices.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/dm/dm-best-practices.md b/dm/dm-best-practices.md index a0c7311d3314..80f972478fee 100644 --- a/dm/dm-best-practices.md +++ b/dm/dm-best-practices.md @@ -85,7 +85,7 @@ It is recommended that you use DM to [migrate and merge MySQL shards of small da Besides data merging, another typical scenario is data archiving. Data is constantly being written. As time goes by, large amounts of data gradually change from hot data to warm or even cold data. Fortunately, in TiDB, you can use [placement rules](/configure-placement-rules.md) to set different placement rules for data. The minimum granularity is [a partition](/partitioned-table.md). -Therefore, it is recommended that for write-intensive scenarios, you need to evaluate from the beginning whether you need to archive data and store hot and cold data on different media separately. If you need to archive data, you need to set the partitioning rules before migration (TiDB does not support Table Rebuild operations yet). This can save you from the need to create tables and import data in future. +Therefore, it is recommended that for write-intensive scenarios, you need to evaluate from the beginning whether you need to archive data and store hot and cold data on different media separately. If you need to archive data, you can set the partitioning rules before migration (TiDB does not support Table Rebuild operations yet). It saves you from the need to create tables and import data in future. #### The pessimistic mode and the optimistic mode From 5c351feba7c38e69d5012ad1d71384aa65693e1e Mon Sep 17 00:00:00 2001 From: xixirangrang Date: Fri, 30 Sep 2022 15:54:20 +0800 Subject: [PATCH 21/25] Apply suggestions from code review --- dm/dm-best-practices.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/dm/dm-best-practices.md b/dm/dm-best-practices.md index 80f972478fee..1e4df6becb73 100644 --- a/dm/dm-best-practices.md +++ b/dm/dm-best-practices.md @@ -91,7 +91,7 @@ Therefore, it is recommended that for write-intensive scenarios, you need to eva DM uses the pessimistic mode by default. In scenarios of migrating and merging MySQL shards, changes in upstream shard schemas can block DML writing to downstream databases. You need to wait for all the schemas to change and have the same structure, and then continue migration from the break point. -- If the upstream schema changes take a long time, it might cause the upstream Binlog to be cleaned up. You can enable the Relay log to avoid this problem. +- If the upstream schema changes take a long time, it might cause the upstream Binlog to be cleaned up. You can enable the relay log to avoid this problem. For more information, see [Use relay log](#use-relay-log). - If you do not want to block data write due to upstream schema changes, consider using the optimistic mode. In this case, DM will not block the data migration even when it spots changes in the upstream shard schemas, but will continue to migrate the data. However, if DM spots incompatible formats in upstream and downstream, the migration task will stop. You need to resolve this issue manually. 
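If you decide to use the optimistic mode, it is selected per migration task through the `shard-mode` field in the task configuration file. The following is a minimal sketch, assuming a single upstream source; the task name, addresses, and credentials are placeholder values:

```yaml
# A minimal DM task configuration sketch that opts into the optimistic mode (placeholder values).
name: "merge-shards-task"
task-mode: "all"              # perform the full migration first, then incremental replication
shard-mode: "optimistic"      # use "pessimistic" for the default coordination behavior described above
target-database:
  host: "10.0.0.10"           # downstream TiDB address
  port: 4000
  user: "dm_user"
  password: "******"
mysql-instances:
  - source-id: "mysql-replica-01"
```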
@@ -100,7 +100,7 @@ The following table summarizes the pros and cons of optimistic mode and pessimis | Scenario | Pros | Cons | | :--- | :--- | :--- | | Pessimistic mode (Default) | It can ensure that the data migrated to the downstream will not go wrong. | If there are a large number of shards, the migration task will be blocked for a long time, or even stop if the upstream binlogs have been cleaned up. You can enable the relay log to avoid this problem. For more information, see [Use relay log](#use-relay-log). | -| Optimistic mode| Data can not be blocked during migration. | In this mode, ensure that schema changes are compatible (whether the incremental column has a default value). It is possible that the inconsistent data can be overlooked. For more information, see [Merge and Migrate Data from Sharded Tables in Optimistic Mode](/dm/feature-shard-merge-optimistic.md#restrictions).| +| Optimistic mode| Data can not be blocked during migration. | In this mode, ensure that schema changes are compatible (check whether the incremental column has a default value). It is possible that the inconsistent data can be overlooked. For more information, see [Merge and Migrate Data from Sharded Tables in Optimistic Mode](/dm/feature-shard-merge-optimistic.md#restrictions).| ### Other restrictions and impact @@ -110,7 +110,7 @@ TiDB supports most MySQL data types. However, some special types are not support #### Character sets and collations -Since TiDB v6.0.0, the new framework for collations are used by default. If you want TiDB to support utf8_general_ci, utf8mb4_general_ci, utf8_unicode_ci, utf8mb4_unicode_ci, gbk_chinese_ci and gbk_bin, you need explicitly declare it when creating the cluster by setting the value of `new_collations_enabled_on_first_bootstrap` to `true`. For more information, see [New framework for collations](/character-set-and-collation.md#new-framework-for-collations). +Since TiDB v6.0.0, the new framework for collations is used by default. If you want TiDB to support utf8_general_ci, utf8mb4_general_ci, utf8_unicode_ci, utf8mb4_unicode_ci, gbk_chinese_ci and gbk_bin, you need explicitly declare it when creating the cluster by setting the value of `new_collations_enabled_on_first_bootstrap` to `true`. For more information, see [New framework for collations](/character-set-and-collation.md#new-framework-for-collations). The default character set in TiDB is utf8mb4. It is recommended that you use utf8mb4 for the upstream and downstream databases and applications. If the upstream database has explicitly specified a character set or collation, you need to check whether TiDB supports it. @@ -134,7 +134,7 @@ When migrating and merging MySQL shards, you can split the migration task accord For migration of large datasets, you can refer to the following suggestions to split the migration task: -- If you need to migrate multiple databases in the upstream, you can split the migration task in terms of databases. +- If you need to migrate multiple databases in the upstream, you can split the migration task according to the number of databases. - Split the task according to the write pressure in the upstream, that is, split the tables with frequent DML operations in the upstream to a separate migration task. Use another migration task to migrate the tables without frequent DML operations. This method can speed up the migration progress, especially when there are a large number of logs written to a table in the upstream. 
But if this table that contains logs does not affect the whole business, this method still works well. @@ -226,7 +226,7 @@ It is recommended that you validate the consistency of data after data migration Now sync-diff-inspector can automatically manage the table list to be checked for data consistency through DM tasks. Compared with the previous manual configuration, it is more efficient. For details, see [Data Check in the DM Replication Scenario](/sync-diff-inspector/dm-diff.md). -Since DM v6.2, DM supports continuous data validation for incremental replication. For details, see [Continuous Data Validation in DM](/dm/dm-continuous-data-validation.md). +Since DM v6.2.0, DM supports continuous data validation for incremental replication. For details, see [Continuous Data Validation in DM](/dm/dm-continuous-data-validation.md). ## Long-term data replication From 3688b129a44ad894859977ffb005da87d285166f Mon Sep 17 00:00:00 2001 From: xixirangrang Date: Fri, 30 Sep 2022 16:21:06 +0800 Subject: [PATCH 22/25] Apply suggestions from code review --- dm/dm-best-practices.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/dm/dm-best-practices.md b/dm/dm-best-practices.md index 1e4df6becb73..b85f61f7b697 100644 --- a/dm/dm-best-practices.md +++ b/dm/dm-best-practices.md @@ -74,7 +74,7 @@ The following table summarizes the pros and cons of each solution. | :--- | :--- | :--- | :--- | |
  • TiDB will act as the primary and write-intensive database.
  • The business logic strongly depends on the continuity of the primary key IDs.
  • | Create the table with non-clustered indexes and set `SHARD_ROW_ID_BIT`. Use `SEQUENCE` as the primary key column. | It can avoid data write hotspots and ensure the continuity and monotonic increment of business data. |
  • The throughput capacity of data write is decreased to ensure data write continuity.
  • The performance of primary key queries is decreased.
  • | |
  • TiDB will act as the primary and write-intensive database.
  • The business logic strongly depends on the increment of the primary key IDs.
  • | Create the table with non-clustered indexes and set `SHARD_ROW_ID_BIT`. Use an application ID generator to generate the primary key IDs. | It can avoid data write hotspots, guarantee the performance of data write, guarantee the increment of business data, but cannot guarantee continuity. |
  • You need to customize the application.
  • External ID generators strongly depend on the clock accuracy and might introduce failure risks.
  • | -|
  • TiDB will act as the primary and write-intensive database.
  • The business logic does not depend on the continuity of the primary key IDs.
  • | Create the table with clustered indexes and set `AUTO_RANDOM` for the primary key column. |
  • It can avoid data write hotspots and has excellent query performance of the primary keys.
  • You can smoothly switch `AUTO_INCREMENT` to `AUTO_RANDOM`.
  • |
  • The primary key ID is random.
  • The write throughput ability is limited.
  • It is recommended to sort the business data by using inserting the time column.
  • If you have to use the primary key ID to sort data, you can left shift 5 bits to query, which can guarantee the increment of the data.
  • | +|
  • TiDB will act as the primary and write-intensive database.
  • The business logic does not depend on the continuity of the primary key IDs.
  • | Create the table with clustered indexes and set `AUTO_RANDOM` for the primary key column. |
  • It can avoid data write hotspots and has excellent query performance of the primary keys.
  • You can smoothly switch from `AUTO_INCREMENT` to `AUTO_RANDOM`.
  • |
  • The primary key IDs are random.
  • The write throughput ability is limited.
  • It is recommended to sort the business data by using the insert time column.
  • If you have to use the primary key ID to sort data, you can left shift 5 bits to query, which can guarantee the increment of the data.
  • | | TiDB will act as the read-only database. | Create the table with non-clustered indexes and set `SHARD_ROW_ID_BIT`. Keep the primary key column consistent with the data source. |
  • It can avoid data write hotspots.
  • It requires less customization cost.
  • | The query performance of the primary keys is impacted. | ### Key points for MySQL shards From 75b8683ee2f9f28aae1e362e1c3e9bc936a0eb84 Mon Sep 17 00:00:00 2001 From: xixirangrang Date: Fri, 30 Sep 2022 16:39:08 +0800 Subject: [PATCH 23/25] Update dm/dm-best-practices.md --- dm/dm-best-practices.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/dm/dm-best-practices.md b/dm/dm-best-practices.md index b85f61f7b697..2226e981a7cb 100644 --- a/dm/dm-best-practices.md +++ b/dm/dm-best-practices.md @@ -125,8 +125,9 @@ Since TiDB v6.0.0, GBK is supported. For more information, see the following doc DM consists of DM-master and DM-worker. -- DM-master manages the metadata of migration tasks and scheduls DM-worker nodes. It is the core of the whole DM platform. Therefore, DM-master can be deployed as clusters to ensure the availability of the DM platform. -- DM-worker executes upstream and downstream migration tasks. It is a stateless node. You can deploy at most 1000 DM-worker nodes. When using DM, you can reserve some idle DM-workers to ensure high availability. +- DM-master manages the metadata of migration tasks and scheduls DM-worker nodes. It is the core of the whole DM platform. Therefore, you can deploy DM-master as clusters to ensure high availability of the DM platform. + +- DM-worker executes upstream and downstream migration tasks. A DM-worker node is stateless. You can deploy at most 1000 DM-worker nodes. When using DM, it is recommended that you reserve some idle DM-workers to ensure high availability. #### Plan the migration tasks From 5e8857a0ce6520ab1b31e7f9681c585cdfa486d2 Mon Sep 17 00:00:00 2001 From: xixirangrang Date: Fri, 30 Sep 2022 17:31:59 +0800 Subject: [PATCH 24/25] Apply suggestions from code review --- dm/dm-best-practices.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/dm/dm-best-practices.md b/dm/dm-best-practices.md index 2226e981a7cb..6ec43bfb6b5e 100644 --- a/dm/dm-best-practices.md +++ b/dm/dm-best-practices.md @@ -91,7 +91,7 @@ Therefore, it is recommended that for write-intensive scenarios, you need to eva DM uses the pessimistic mode by default. In scenarios of migrating and merging MySQL shards, changes in upstream shard schemas can block DML writing to downstream databases. You need to wait for all the schemas to change and have the same structure, and then continue migration from the break point. -- If the upstream schema changes take a long time, it might cause the upstream Binlog to be cleaned up. You can enable the relay log to avoid this problem. For more information, see [Use relay log](#use-relay-log). +- If the upstream schema changes take a long time, it might cause the upstream Binlog to be cleaned up. You can enable the relay log to avoid this problem. For more information, see [Use the relay log](#use-the-relay-log). - If you do not want to block data write due to upstream schema changes, consider using the optimistic mode. In this case, DM will not block the data migration even when it spots changes in the upstream shard schemas, but will continue to migrate the data. However, if DM spots incompatible formats in upstream and downstream, the migration task will stop. You need to resolve this issue manually. @@ -99,7 +99,7 @@ The following table summarizes the pros and cons of optimistic mode and pessimis | Scenario | Pros | Cons | | :--- | :--- | :--- | -| Pessimistic mode (Default) | It can ensure that the data migrated to the downstream will not go wrong. 
| If there are a large number of shards, the migration task will be blocked for a long time, or even stop if the upstream binlogs have been cleaned up. You can enable the relay log to avoid this problem. For more information, see [Use relay log](#use-relay-log). | +| Pessimistic mode (Default) | It can ensure that the data migrated to the downstream will not go wrong. | If there are a large number of shards, the migration task will be blocked for a long time, or even stop if the upstream binlogs have been cleaned up. You can enable the relay log to avoid this problem. For more information, see [Use the relay log](#use-the relay-log). | | Optimistic mode| Data can not be blocked during migration. | In this mode, ensure that schema changes are compatible (check whether the incremental column has a default value). It is possible that the inconsistent data can be overlooked. For more information, see [Merge and Migrate Data from Sharded Tables in Optimistic Mode](/dm/feature-shard-merge-optimistic.md#restrictions).| ### Other restrictions and impact From e87f1097d95a47f88792494c75432b5d245b3e0e Mon Sep 17 00:00:00 2001 From: xixirangrang Date: Fri, 30 Sep 2022 19:01:08 +0800 Subject: [PATCH 25/25] Apply suggestions from code review Co-authored-by: Grace Cai --- dm/dm-best-practices.md | 68 ++++++++++++++++++++--------------------- 1 file changed, 34 insertions(+), 34 deletions(-) diff --git a/dm/dm-best-practices.md b/dm/dm-best-practices.md index 6ec43bfb6b5e..c38618d27007 100644 --- a/dm/dm-best-practices.md +++ b/dm/dm-best-practices.md @@ -5,15 +5,15 @@ summary: Learn about best practices when you use TiDB Data Migration (DM) to mig # TiDB Data Migration (DM) Best Practices -[TiDB Data Migration (DM)](https://github.com/pingcap/tiflow/tree/master/dm) is an data migration tool developed by PingCAP. It supports full and incremental data migration from MySQL-compatible databases such as MySQL, Percona MySQL, MariaDB, Amazon RDS for MySQL and Amazon Aurora into TiDB. +[TiDB Data Migration (DM)](https://github.com/pingcap/tiflow/tree/master/dm) is a data migration tool developed by PingCAP. It supports full and incremental data migration from MySQL-compatible databases such as MySQL, Percona MySQL, MariaDB, Amazon RDS for MySQL, and Amazon Aurora into TiDB. You can use DM in the following scenarios: - Perform full and incremental data migration from a single MySQL-compatible database instance to TiDB - Migrate and merge MySQL shards of small datasets (less than 1 TiB) to TiDB -- In the Data HUB scenario, such as the middle platform of business data, and real-time aggregation of business data, use DM as the middleware for data migration +- In the data hub scenario, such as the middle platform of business data, and real-time aggregation of business data, use DM as the middleware for data migration -This document introduces how to use DM in an elegant and efficient way, and how to avoid the common mistakes when using DM. +This document introduces how to use DM in an elegant and efficient way, and how to avoid common mistakes when using DM. ## Performance limitations @@ -25,23 +25,23 @@ This document introduces how to use DM in an elegant and efficient way, and how | Max Binlog throughput | 20 MB/s/worker | | Table number limit per task | Unlimited | -- DM supports managing 1000 work nodes simultaneously, and the maximum number of tasks is 600. To ensure the high availability of work nodes, you should reserve some work nodes as standby nodes. 
Reserve 20% to 50% of the number of the work nodes that have running migration tasks. -- A single work node can theoretically support replication QPS of up to 30K QPS/worker. It varies for different schemas and workloads. The ability to handle upstream binlog is up to 20 MB/s/worker. -- If you use DM as a data replication middleware that will run for a long time, you need to carefully design the deployment architecture of DM components. For more information, see [Deploy DM-master and DM-worker](#deploy-dm-master-and-dm-worker) +- DM supports managing 1000 work nodes simultaneously, and the maximum number of tasks is 600. To ensure the high availability of work nodes, you should reserve some work nodes as standby nodes. The recommended number of standby nodes is 20% to 50% of the number of the work nodes that have running migration tasks. +- A single work node can theoretically support replication QPS of up to 30K QPS/worker. It varies for different schemas and workloads. The ability to handle upstream binlogs is up to 20 MB/s/worker. +- If you want to use DM as a data replication middleware for long-term use, you need to carefully design the deployment architecture of DM components. For more information, see [Deploy DM-master and DM-worker](#deploy-dm-master-and-dm-worker) ## Before data migration -Before data migration, the design of the overall solution is critical. The following sections describe best practices and scenarios from the business side and the implementation side. +Before data migration, the design of the overall solution is critical. The following sections describe best practices and scenarios from the business perspective and the implementation perspective. ### Best practices for the business side -To distribute the workload evenly on multiple nodes, the design for the distributed database is different from traditional databases. It is designed for both low migration cost and logic correctness after migration. The following sections describe best practices before data migration. +To distribute the workload evenly on multiple nodes, the design for the distributed database is different from traditional databases. The solution needs to ensure both low migration cost and logic correctness after migration. The following sections describe best practices before data migration. #### Business impact of AUTO_INCREMENT in schema design -`AUTO_INCREMENT` in TiDB is compatible with `AUTO_INCREMENT` in MySQL. However, as a distributed database, TiDB usually has multiple computing nodes (entry on the client end). When the application data is written, the workload is evenly distributed. This leads to the result that when there is an `AUTO_INCREMENT` column in the table, the auto-increment ID may not be consecutive. For more details, see [AUTO_INCREMENT](/auto-increment.md#implementation-principles). +`AUTO_INCREMENT` in TiDB is compatible with `AUTO_INCREMENT` in MySQL. However, as a distributed database, TiDB usually has multiple computing nodes (entries for the client end). When the application data is written, the workload is evenly distributed. This leads to the result that when there is an `AUTO_INCREMENT` column in the table, the auto-increment IDs of the column might be inconsecutive. For more details, see [AUTO_INCREMENT](/auto-increment.md#implementation-principles). -If your business has a strong dependence on the auto-increment ID, consider using the [SEQUENCE function](/sql-statements/sql-statement-create-sequence.md#sequence-function). 
+If your business has a strong dependence on auto-increment IDs, consider using the [SEQUENCE function](/sql-statements/sql-statement-create-sequence.md#sequence-function). #### Usage of clustered indexes @@ -53,7 +53,7 @@ When you create a table, you can declare that the primary key is either a cluste - Non-clustered indexes + `shard row id bit` - Using non-clustered indexes and `shard row id bit`, you can avoid the write hotspot problem when using `AUTO_INCREMENT`. However, table lookup in this scenario can impact the query performance when querying using the primary key. + Using non-clustered indexes and `shard row id bit`, you can avoid the write hotspot problem when using `AUTO_INCREMENT`. However, table lookup in this scenario can affect the query performance when querying using the primary key. - Clustered indexes + external distributed ID generators @@ -72,10 +72,10 @@ The following table summarizes the pros and cons of each solution. | Scenario | Recommended solution | Pros | Cons | | :--- | :--- | :--- | :--- | -|
  • TiDB will act as the primary and write-intensive database.
  • The business logic strongly depends on the continuity of the primary key IDs.
  • | Create the table with non-clustered indexes and set `SHARD_ROW_ID_BIT`. Use `SEQUENCE` as the primary key column. | It can avoid data write hotspots and ensure the continuity and monotonic increment of business data. |
  • The throughput capacity of data write is decreased to ensure data write continuity.
  • The performance of primary key queries is decreased.
  • | -|
  • TiDB will act as the primary and write-intensive database.
  • The business logic strongly depends on the increment of the primary key IDs.
  • | Create the table with non-clustered indexes and set `SHARD_ROW_ID_BIT`. Use an application ID generator to generate the primary key IDs. | It can avoid data write hotspots, guarantee the performance of data write, guarantee the increment of business data, but cannot guarantee continuity. |
  • You need to customize the application.
  • External ID generators strongly depend on the clock accuracy and might introduce failure risks.
  • | -|
  • TiDB will act as the primary and write-intensive database.
  • The business logic does not depend on the continuity of the primary key IDs.
  • | Create the table with clustered indexes and set `AUTO_RANDOM` for the primary key column. |
  • It can avoid data write hotspots and has excellent query performance of the primary keys.
  • You can smoothly switch from `AUTO_INCREMENT` to `AUTO_RANDOM`.
  • |
  • The primary key IDs are random.
  • The write throughput ability is limited.
  • It is recommended to sort the business data by using the insert time column.
  • If you have to use the primary key ID to sort data, you can left shift 5 bits to query, which can guarantee the increment of the data.
  • | -| TiDB will act as the read-only database. | Create the table with non-clustered indexes and set `SHARD_ROW_ID_BIT`. Keep the primary key column consistent with the data source. |
  • It can avoid data write hotspots.
  • It requires less customization cost.
  • | The query performance of the primary keys is impacted. | +|
  • TiDB will act as the primary and write-intensive database.
  • The business logic strongly relies on the continuity of the primary key IDs.
  • | Create tables with non-clustered indexes and set `SHARD_ROW_ID_BIT`. Use `SEQUENCE` as the primary key column. | It can avoid data write hotspots and ensure the continuity and monotonic increment of business data. |
  • The throughput capacity of data write is decreased to ensure data write continuity.
  • The performance of primary key queries is decreased.
  • | +|
  • TiDB will act as the primary and write-intensive database.
  • The business logic strongly relies on the increment of the primary key IDs.
  • | Create tables with non-clustered indexes and set `SHARD_ROW_ID_BIT`. Use an application ID generator to generate the primary key IDs. | It can avoid data write hotspots, guarantee the performance of data write, and guarantee the increment of business data, but cannot guarantee continuity. |
  • You need to customize the application.
  • External ID generators strongly rely on the clock accuracy and might introduce failures.
  • | +|
  • TiDB will act as the primary and write-intensive database.
  • The business logic does not rely on the continuity of the primary key IDs.
  • | Create tables with clustered indexes and set `AUTO_RANDOM` for the primary key column. |
  • It can avoid data write hotspots and has excellent query performance of primary keys.
  • You can smoothly switch from `AUTO_INCREMENT` to `AUTO_RANDOM`.
  • |
  • The primary key IDs are random.
  • The write throughput ability is limited.
  • It is recommended to sort the business data by the insertion time column.
  • If you have to sort data by the primary key ID, you can left-shift the ID by 5 bits to remove the shard bits, so that the result increases with the insertion order.
  • | +| TiDB will act as a read-only database. | Create tables with non-clustered indexes and set `SHARD_ROW_ID_BIT`. Keep the primary key column consistent with the data source. |
  • It can avoid data write hotspots.
  • It requires less customization cost.
  • | The query performance of primary keys is impacted. | ### Key points for MySQL shards @@ -83,13 +83,13 @@ The following table summarizes the pros and cons of each solution. It is recommended that you use DM to [migrate and merge MySQL shards of small datasets to TiDB](/migrate-small-mysql-shards-to-tidb.md). -Besides data merging, another typical scenario is data archiving. Data is constantly being written. As time goes by, large amounts of data gradually change from hot data to warm or even cold data. Fortunately, in TiDB, you can use [placement rules](/configure-placement-rules.md) to set different placement rules for data. The minimum granularity is [a partition](/partitioned-table.md). +Besides data merging, another typical scenario is data archiving. Data is constantly being written. As time goes by, large amounts of data gradually change from hot data to warm or even cold data. Fortunately, in TiDB, you can set different [placement rules](/configure-placement-rules.md) for data. The minimum granularity of the rules is [a partition](/partitioned-table.md). -Therefore, it is recommended that for write-intensive scenarios, you need to evaluate from the beginning whether you need to archive data and store hot and cold data on different media separately. If you need to archive data, you can set the partitioning rules before migration (TiDB does not support Table Rebuild operations yet). It saves you from the need to create tables and import data in future. +Therefore, it is recommended that for write-intensive scenarios, you need to evaluate from the beginning whether you need to archive data and store hot and cold data on different media separately. If you need to archive data, you can set the partitioning rules before migration (TiDB does not support Table Rebuild operations yet). It saves you from the need to recreate tables and import data in future. #### The pessimistic mode and the optimistic mode -DM uses the pessimistic mode by default. In scenarios of migrating and merging MySQL shards, changes in upstream shard schemas can block DML writing to downstream databases. You need to wait for all the schemas to change and have the same structure, and then continue migration from the break point. +DM uses the pessimistic mode by default. In scenarios of migrating and merging MySQL shards, changes in upstream shard schemas can block DML writing to downstream databases. You need to wait until all the schemas are changed and have the same structure, and then continue the migration from the breakpoint. - If the upstream schema changes take a long time, it might cause the upstream Binlog to be cleaned up. You can enable the relay log to avoid this problem. For more information, see [Use the relay log](#use-the-relay-log). @@ -100,7 +100,7 @@ The following table summarizes the pros and cons of optimistic mode and pessimis | Scenario | Pros | Cons | | :--- | :--- | :--- | | Pessimistic mode (Default) | It can ensure that the data migrated to the downstream will not go wrong. | If there are a large number of shards, the migration task will be blocked for a long time, or even stop if the upstream binlogs have been cleaned up. You can enable the relay log to avoid this problem. For more information, see [Use the relay log](#use-the relay-log). | -| Optimistic mode| Data can not be blocked during migration. | In this mode, ensure that schema changes are compatible (check whether the incremental column has a default value). It is possible that the inconsistent data can be overlooked. 
For more information, see [Merge and Migrate Data from Sharded Tables in Optimistic Mode](/dm/feature-shard-merge-optimistic.md#restrictions).| +| Optimistic mode| Upstream schema changes will not cause data migration latency. | In this mode, ensure that schema changes are compatible (check whether the incremental column has a default value). It is possible that the inconsistent data can be overlooked. For more information, see [Merge and Migrate Data from Sharded Tables in Optimistic Mode](/dm/feature-shard-merge-optimistic.md#restrictions).| ### Other restrictions and impact @@ -110,7 +110,7 @@ TiDB supports most MySQL data types. However, some special types are not support #### Character sets and collations -Since TiDB v6.0.0, the new framework for collations is used by default. If you want TiDB to support utf8_general_ci, utf8mb4_general_ci, utf8_unicode_ci, utf8mb4_unicode_ci, gbk_chinese_ci and gbk_bin, you need explicitly declare it when creating the cluster by setting the value of `new_collations_enabled_on_first_bootstrap` to `true`. For more information, see [New framework for collations](/character-set-and-collation.md#new-framework-for-collations). +Since TiDB v6.0.0, the new framework for collations is used by default. In earlier versions, if you want TiDB to support utf8_general_ci, utf8mb4_general_ci, utf8_unicode_ci, utf8mb4_unicode_ci, gbk_chinese_ci and gbk_bin, you need to explicitly declare it when creating the cluster by setting the value of `new_collations_enabled_on_first_bootstrap` to `true`. For more information, see [New framework for collations](/character-set-and-collation.md#new-framework-for-collations). The default character set in TiDB is utf8mb4. It is recommended that you use utf8mb4 for the upstream and downstream databases and applications. If the upstream database has explicitly specified a character set or collation, you need to check whether TiDB supports it. @@ -123,21 +123,21 @@ Since TiDB v6.0.0, GBK is supported. For more information, see the following doc #### Deploy DM-master and DM-worker -DM consists of DM-master and DM-worker. +DM consists of DM-master and DM-worker nodes. -- DM-master manages the metadata of migration tasks and scheduls DM-worker nodes. It is the core of the whole DM platform. Therefore, you can deploy DM-master as clusters to ensure high availability of the DM platform. +- DM-master manages the metadata of migration tasks and schedules DM-worker nodes. It is the core of the whole DM platform. Therefore, you can deploy DM-master as clusters to ensure high availability of the DM platform. - DM-worker executes upstream and downstream migration tasks. A DM-worker node is stateless. You can deploy at most 1000 DM-worker nodes. When using DM, it is recommended that you reserve some idle DM-workers to ensure high availability. #### Plan the migration tasks -When migrating and merging MySQL shards, you can split the migration task according to the types of shards in the upstream. For example, if `usertable_1~50` and `Logtable_1~50` are two types of shards, you can create two migration tasks. It can simplify the migration task template and effectively control the impact of interruption in data migration. +When migrating and merging MySQL shards, you can split a migration task according to the types of shards in the upstream. For example, if `usertable_1~50` and `Logtable_1~50` are two types of shards, you can create two migration tasks. 
It can simplify the migration task template and effectively control the impact of interruption in data migration. For migration of large datasets, you can refer to the following suggestions to split the migration task: - If you need to migrate multiple databases in the upstream, you can split the migration task according to the number of databases. -- Split the task according to the write pressure in the upstream, that is, split the tables with frequent DML operations in the upstream to a separate migration task. Use another migration task to migrate the tables without frequent DML operations. This method can speed up the migration progress, especially when there are a large number of logs written to a table in the upstream. But if this table that contains logs does not affect the whole business, this method still works well. +- Split the task according to the write pressure in the upstream, that is, split the tables with frequent DML operations in the upstream to a separate migration task. Use another migration task to migrate the tables without frequent DML operations. This method can speed up the migration progress, especially when there are a large number of logs written to a table in the upstream. But if this table that contains a large number of logs does not affect the whole business, this method still works well. Note that splitting the migration task can only guarantee the final consistency of data. Real-time consistency may deviate significantly due to various reasons. @@ -151,13 +151,13 @@ The following table describes the recommended deployment plans for DM-master and #### Choose and configure the upstream data source -DM backs up the full data of the entire database when performing full data migration, and uses the parallel logical backup method. During backing up MySQL, it adds a global read lock [`FLUSH TABLES WITH READ LOCK`](https://dev.mysql.com/doc/refman/8.0/en/flush.html#flush-tables-with-read-lock). DML and DDL operations of the upstream database will be blocked for a short time. Therefore, it is strongly recommended to use a backup database in upstream to perform the full data backup, and enable the GTID function of the data source (`enable-gtid: true`). This way, you can avoid the impact from the upstream, and switch to the master node in the upstream to reduce the delay during the incremental migration. For the method of switching the upstream MySQL data source, see [Switch DM-worker Connection between Upstream MySQL Instances](/dm/usage-scenario-master-slave-switch.md#switch-dm-worker-connection-via-virtual-ip). +DM backs up the full data of the entire database when performing full data migration, and uses the parallel logical backup method. During backing up MySQL, it adds a global read lock [`FLUSH TABLES WITH READ LOCK`](https://dev.mysql.com/doc/refman/8.0/en/flush.html#flush-tables-with-read-lock). DML and DDL operations of the upstream database will be blocked for a short time. Therefore, it is strongly recommended to use a backup database in upstream to perform the full data backup, and enable the GTID function of the data source (`enable-gtid: true`). In this way, you can avoid the impact from the upstream, and switch to the master node in the upstream to reduce the latency during the incremental migration. For the instructions of switching the upstream MySQL data source, see [Switch DM-worker Connection between Upstream MySQL Instances](/dm/usage-scenario-master-slave-switch.md#switch-dm-worker-connection-via-virtual-ip). 
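The behavior of the parallel logical backup can be adjusted in the `mydumpers` section of the task configuration file. The following is a minimal sketch with placeholder values; the `--consistency none` option applies only to the special case described in the note below:

```yaml
# A mydumpers configuration sketch in a DM task file (placeholder values).
mydumpers:
  global:
    threads: 4                        # number of parallel export threads
    chunk-filesize: 64                # split exported data files at about 64 MB
    extra-args: "--consistency none"  # only if you must dump from the master node; see the note below
mysql-instances:
  - source-id: "mysql-replica-01"
    mydumper-config-name: "global"
```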
Note the following: - You can only perform full data backup on the master node of the upstream database. - In this scenario, you can set the consistency parameter to `none` in the configuration file, `mydumpers.global.extra-args: "--consistency none"`, to avoid adding a global read lock to the master node. But this may damage the data consistency of the full backup, which may lead to inconsistent data between the upstream and downstream. + In this scenario, you can set the `consistency` parameter to `none` in the configuration file, `mydumpers.global.extra-args: "--consistency none"`, to avoid adding a global read lock to the master node. But this might affect the data consistency of the full backup, which may lead to inconsistent data between the upstream and downstream. - Use backup snapshots to perform full data migration (only applicable to the migration of MySQL RDS and Aurora RDS on AWS) @@ -167,7 +167,7 @@ Note the following: #### Capitalization -TiDB schema names are case-insensitive by default, that is, `lower_case_table_names:2`. But most upstream MySQL databases use Linux systems that are case-sensitive by default. In this case, you need to set `case-sensitive` to `true` to ensure that the schema can be correctly migrated from the upstream. +TiDB schema names are case-insensitive by default, that is, `lower_case_table_names:2`. But most upstream MySQL databases use Linux systems that are case-sensitive by default. In this case, you need to set `case-sensitive` to `true` in the DM task configuration file to ensure that the schema can be correctly migrated from the upstream. In a special case, for example, if there is a database in the upstream that has both uppercase tables such as `Table` and lowercase tables such as `table`, then an error occurs when creating the schema: @@ -178,7 +178,7 @@ In a special case, for example, if there is a database in the upstream that has You can configure the filter rules as soon as you start configuring the data source. For more information, see [Data Migration Task Configuration Guide](/dm/dm-task-configuration-guide.md). The benefits of configuring the filter rules are: - Reduce the number of Binlog events that the downstream needs to process, thereby improving migration efficiency. -- Reduce unnecessary Relay log storage, thereby saving disk space. +- Reduce unnecessary relay log storage, thereby saving disk space. > **Note:** > @@ -188,21 +188,21 @@ You can configure the filter rules as soon as you start configuring the data sou In the MySQL master/standby mechanism, the standby node saves a copy of relay logs to ensure the reliability and efficiency of asynchronous replication. DM also supports saving a copy of relay logs on DM-worker. You can configure information such as the storage location and expiration time. This feature applies to the following scenarios: -- During full and incremental data migration, if the amount of full data is large, the entire process takes more time than the time for the upstream Binlog to be archived. It causes the incremental replication task to fail to start normally. If you enable the relay log, DM-worker will start receiving relay logs when the full migration is started. This avoids the failure of the incremental task. +- During full and incremental data migration, if the amount of full data is large, the entire process takes more time than the time for the upstream binlogs to be archived. It causes the incremental replication task to fail to start normally. 
If you enable the relay log, DM-worker will start receiving relay logs when the full migration is started. This avoids the failure of the incremental task. -- When you use DM to perform long-time data replication, sometimes the migration task is blocked for a long time due to various reasons. If you enable the relay log, you can effectively deal with the problem of upstream Binlog being recycled due to the blocking of the migration task. +- When you use DM to perform long-time data replication, sometimes the migration task is blocked for a long time due to various reasons. If you enable the relay log, you can effectively deal with the problem of upstream binlogs being recycled due to the blocking of the migration task. -There are some restrictions using the relay log. DM supports high availability. When a DM-worker fails, it will try to promote an idle DM-worker instance to a working instance. If the upstream Binlog does not contain the necessary migration logs, it may cause interruption. You need to intervene manually to copy the relay log to the new DM-worker node as soon as possible, and modify the corresponding relay meta file. For details, see [Troubleshooting](/dm/dm-error-handling.md#the-relay-unit-throws-error-event-from--in--diff-from-passed-in-event--or-a-migration-task-is-interrupted-with-failing-to-get-or-parse-binlog-errors-like-get-binlog-error-error-1236-hy000-and-binlog-checksum-mismatch-data-may-be-corrupted-returned). +There are some restrictions on using the relay log. DM supports high availability. When a DM-worker fails, it will try to promote an idle DM-worker instance to a working instance. If the upstream binlogs do not contain the necessary migration logs, it may cause interruption. You need to intervene manually to copy the relay log to the new DM-worker node as soon as possible, and modify the corresponding relay meta file. For details, see [Troubleshooting](/dm/dm-error-handling.md#the-relay-unit-throws-error-event-from--in--diff-from-passed-in-event--or-a-migration-task-is-interrupted-with-failing-to-get-or-parse-binlog-errors-like-get-binlog-error-error-1236-hy000-and-binlog-checksum-mismatch-data-may-be-corrupted-returned). #### Use PT-osc/GH-ost in upstream -In daily MySQL operation and maintenance, usually you use tools such as PT-osc/GH-ost to change the schema online to minimize impact on the business. However, the whole process will be logged to MySQL Binlog. Migrating such data to TiDB downstream will result in a lot of write operations, which is neither efficient nor economical. +In daily MySQL operation and maintenance, usually you use tools such as PT-osc/GH-ost to change the schema online to minimize impact on the business. However, the whole process will be logged to MySQL Binlog. Migrating such data to TiDB downstream will result in a lot of unnecessary write operations, which is neither efficient nor economical. To resolve this issue, DM supports third-party data tools such as PT-osc and GH-ost when you configure the migration task. When you use such tools, DM does not migrate redundant data and ensure data consistency. For details, see [Migrate from Databases that Use GH-ost/PT-osc](/dm/feature-online-ddl.md). ## Best practices during migration -This section introduces how to troubleshoot problems you encounter during migration. +This section introduces how to troubleshoot problems you might encounter during migration. 
### Inconsistent schemas in upstream and downstream @@ -219,7 +219,7 @@ If the downstream has more columns, see [Migrate Data to a Downstream TiDB Table ### Interrupted migration task due to failed DDL -DM supports skipping or replacing DDL statements that cause the migration task to interrupt. For details, see [Handle Failed DDL Statements](/dm/handle-failed-ddl-statements.md#usage-examples). +DM supports skipping or replacing DDL statements that cause a migration task to interrupt. For details, see [Handle Failed DDL Statements](/dm/handle-failed-ddl-statements.md#usage-examples). ## Data validation after data migration