From de26baf5abecac4124d28bccdfe5e4aa37a638cb Mon Sep 17 00:00:00 2001 From: Ran Date: Fri, 17 Jun 2022 16:17:40 +0800 Subject: [PATCH 01/11] add doc for dm safe mode Signed-off-by: Ran --- TOC.md | 1 + dm/dm-safe-mode.md | 98 ++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 99 insertions(+) create mode 100644 dm/dm-safe-mode.md diff --git a/TOC.md b/TOC.md index af3b65c828b85..dc869cef2de05 100644 --- a/TOC.md +++ b/TOC.md @@ -481,6 +481,7 @@ - Reference - Architecture - [DM-worker](/dm/dm-worker-intro.md) + - [Safe Mode](/dm/dm-safe-mode.md) - [Relay Log](/dm/relay-log.md) - [DDL Handling](/dm/dm-ddl-compatible.md) - Command Line diff --git a/dm/dm-safe-mode.md b/dm/dm-safe-mode.md new file mode 100644 index 0000000000000..b65aab5459823 --- /dev/null +++ b/dm/dm-safe-mode.md @@ -0,0 +1,98 @@ +--- +title: DM Safe Mode +summary: Introduces the DM safe mode, its purpose, working principles and how to use it. +--- + +# DM Safe Mode + +Safe mode is an operation mode for DM to perform incremental replication. In safe mode, when the DM's incremental replication component replicates binlog events, it forcibly rewrites all the `INSERT` and `UPDATE` statements before executing them in the downstream. + +In safe mode, the same binlog event can be replicated repeatedly to the downstream and the result is guaranteed to be idempotent. Thus, the incremental replication is ensured to be *safe*. + +After DM resumes a data replication task from the checkpoint, it might repeatedly execute some binlog events, which leads to the following issues: + +1. During incremental replication, the operation of executing DML and the operation of writing checkpoint are not silmultaneous. The operation of writing checkpoint and writing data into the downstream database are not atomic. Therefore, **when DM exits abnormally, checkpoint might only record to a recovery point before the exit moment**. +2. 
When DM restarts a task and resumes incremental replication from the checkpoint, some data between the checkpoint and the exit moment might already be processed before the abnormal exit. This causes **some SQL statements to be repeatedly executed**. +3. If the `INSERT` statement is executed for more than once, the primary key or the unique index might encounter conflict and cause the replication to fail. If the `UPDATE` statement is executed for more than once, the fitler condition might not be able to locate the updated records. + +In safe mode, DM can resolve the above issues by rewriting SQL statements. + +## Working principle + +Safe mode guarantees the idempotency of binlog events by rewriting SQL statements. Specifically, the following SQL statements are rewritten: + +* `INSERT` is rewritten to `REPLACE`. +* `UPDATE` is analyzed to obtain the value of the primary key or the unique index of the row updated. `UPDATE` is then rewritten to `DELETE` + `REPLACE` in the following two steps: DM deletes the old record using the primary key or unique index, and insert the new record using the `REPLACE` statement. + +`REPLACE` is a MySQL-specific data insertion syntax. When you insert data with `REPLACE`, if the new data and existing data has a primary key or unique constraint conflict, MySQL deletes all the conflicting records and executes the insert operation, which is equivalent to "force insert". For details, see [`REPLACE` statement](https://dev.mysql.com/doc/refman/8.0/en/replace.html) in MySQL documentation. + +For example, a `dummydb.dummytbl` table has a primary key `id`. 
Execute the following SQL statements repeatedly on this table: + +```sql +INSERT INTO dummydb.dummytbl (id, int_value, str_value) VALUES (123, 999, 'abc'); +UPDATE dummydb.dummytbl SET int_value = 888999 WHERE int_value = 999; -- If there is no other record with int_value = 999 +UPDATE dummydb.dummytbl SET id = 999 WHERE id = 888; -- Update the primary key +``` + +With safe mode enabled, when the preceding SQL statement is executed again in the downstream, it is rewritten as follows: + +```sql +REPLACE INTO dummydb.dummytbl (id, int_value, str_value) VALUES (123, 999, 'abc'); +DELETE FROM dummydb.dummytbl WHERE id = 123; +REPLACE INTO dummydb.dummytbl (id, int_value, str_value) VALUES (123, 888999, 'abc'); +DELETE FROM dummydb.dummytbl WHERE id = 888; +REPLACE INTO dummydb.dummytbl (id, int_value, str_value) VALUES (999, 888888, 'abc888'); +``` + +In the preceding statement, `UPDATE` is rewritten as `DELETE` + `REPLACE`, rather than `DELETE` + `INSERT`. If `INSERT` is used here, when you insert a duplicate record with `id = 999`, the database reports a primary key conflict. This is why `REPLACE` is used instead. The new record will replace the existing record. + +By rewriting SQL statements, before duplicate insert or update operations, DM uses the new row data to overwrite the existing row data. This guarantees that insert and update operations can be executed repeatedly. + +## Enable safe mode + +### Automatically enable + +When DM resumes an incremental replication task from the checkpoint (DM worker restart or network reconnection), DM automatically enables safe mode for a period. + +Whether to enable safe mode is related to the `safemode_exit_point` in the checkpoint. When an incremental replication task is paused abnormally, DM tries to replicate all DML statements in the memory to the downstream and records the latest binlog position in the memory pulled from the upstream as `safemode_exit_point`. 
The `safemode_exit_point` is saved in the last checkpoint before the abnormal pause. + +When DM resumes an incremental replication task from the checkpoint, it determines whether to enable safe mode based on the following logic: + +- If the checkpoint contains `safemode_exit_point`, the incremental replication task is paused abnormally. When DM resumes the task, if DM detects that the binlog position of the checkpoint to be resumed is earlier than `safemode_exit_point`, it means that the binlog events between the checkpoint and the `safemode_exit_point` might have been processed in the downstream. When the task is resumed, some binlog events might be executed repeatedly. Therefore, DM determines that safe mode should be enabled for these binlog positions. After the binlog position exceeds the `safemode_exit_point`, if safe mode is not manually enabled, DM automatically disables safe mode. + +- If the checkpoint does not contain `safemode_exit_point`, there are two cases: + + 1. This is a new task, or this task is paused as expected. + 2. This task is paused abnormally and it fails to record `safemode_exit_point`, or the DM process exits abnormally. + + In the second case, DM does not know which binlog events after the checkpoint are executed in the downstream. To be safe, if DM does not find `safemode_exit_point` in the checkpoint, it automatically enables safe mode between the preceding two checkpoints to ensure that repeatedly executed binlog events do not cause any problems. The default interval between two checkpoints is 30 seconds, which means when a normal incremental replication task starts, safe mode is enforced for the first 60 seconds (2 * 30 seconds). + + You can change the checkpoint interval by setting the `checkpoint-flush-interval` item in syncer configuration and thereby adjust the safe mode period at the beginning of the incremental replication task. It is not recommended to adjust this setting. 
If necessary, you can [manually enable safe mode](#manually-enable). + +### Manually enable + +You can control whether to enable safe mode throughout by setting the `safe-mode` item in the syncer configuration. `safe-mode` is a bool type parameter and `false` by default. If it is set to `true`, DM enables safe mode for the whole incremental replication process. The following is a task configuration example with safe mode enabled: + +``` +syncers: # The running configurations of the sync processing unit. + global: # Configuration name. + # Other configuration items are ignored. + safe-mode: true # Enables safe mode for the whole incremental replication process. + # Other configuration items are ignored. +# ----------- Instance configuration ----------- +mysql-instances: + - + source-id: "mysql-replica-01" + # Other configuration items are ignored. + syncer-config-name: "global" # Name of the syncers configuration. +``` + +## Notes for safe mode + +If you want to enable safe mode throughout for safety reasons, you must be aware of the following: + +- **Safe mode has extra overhead for incremental replication.** Frequent `DELETE` + `REPLACE` operations result in frequent changes to primary keys or unique indexes, which creates a greater performance overhead than a simple `UPDATE` statement. +- **Safe mode forces the replacement of records with the same primary key, which might result in data loss in the downstream.** When you merge and migrate shards from the upstream to the downstream, incorrect configuration might lead to a large number of primary key or unique key conflicts. If safe mode is enabled, the downstream might lose lots of data. The task might not show any exception, resulting in severe data inconsistency. +- **Safe mode relies on the primary key or unique index to detect conflicts.** if the downstream table has no primary key or unique index, DM cannot use `REPLACE` to replace and insert records. 
In such case, even if DM rewrites `INSERT` to `REPLACE`, duplicate records are still inserted into the downstream. + +In summary, if the upstream database has data with duplicate primary keys, and the application can tolerate loss of duplicate records and performance overhead, you can enable safe mode to ignore data duplication. From c1a66ae9e320ea69c40c3a04bc32778f6c8e65b5 Mon Sep 17 00:00:00 2001 From: Ran Date: Wed, 20 Jul 2022 15:14:58 +0800 Subject: [PATCH 02/11] Apply suggestions from code review Co-authored-by: shichun-0415 <89768198+shichun-0415@users.noreply.github.com> --- dm/dm-safe-mode.md | 58 ++++++++++++++++++++++++---------------------- 1 file changed, 30 insertions(+), 28 deletions(-) diff --git a/dm/dm-safe-mode.md b/dm/dm-safe-mode.md index b65aab5459823..fcc9e25a9b3fc 100644 --- a/dm/dm-safe-mode.md +++ b/dm/dm-safe-mode.md @@ -9,24 +9,24 @@ Safe mode is an operation mode for DM to perform incremental replication. In saf In safe mode, the same binlog event can be replicated repeatedly to the downstream and the result is guaranteed to be idempotent. Thus, the incremental replication is ensured to be *safe*. -After DM resumes a data replication task from the checkpoint, it might repeatedly execute some binlog events, which leads to the following issues: +After resuming a data replication task from a checkpoint, DM might repeatedly execute some binlog events, which leads to the following issues: -1. During incremental replication, the operation of executing DML and the operation of writing checkpoint are not silmultaneous. The operation of writing checkpoint and writing data into the downstream database are not atomic. Therefore, **when DM exits abnormally, checkpoint might only record to a recovery point before the exit moment**. -2. When DM restarts a task and resumes incremental replication from the checkpoint, some data between the checkpoint and the exit moment might already be processed before the abnormal exit. 
This causes **some SQL statements to be repeatedly executed**. -3. If the `INSERT` statement is executed for more than once, the primary key or the unique index might encounter conflict and cause the replication to fail. If the `UPDATE` statement is executed for more than once, the fitler condition might not be able to locate the updated records. +- During incremental replication, the operation of executing DML and the operation of writing checkpoint are not simultaneous. The operation of writing checkpoints and writing data into the downstream database is not atomic. Therefore, **when DM exits abnormally, checkpoints might only record a restoration point before the exit moment**. +- When DM restarts a replication task and resumes incremental replication from a checkpoint, some data between the checkpoint and the exit moment might already be processed before the abnormal exit. This causes **some SQL statements to be repeatedly executed**. +- If an `INSERT` statement is executed more than once, the primary key or the unique index might encounter a conflict, resulting in replication failure. If an `UPDATE` statement is executed more than once, the filter condition might not be able to locate updated records. -In safe mode, DM can resolve the above issues by rewriting SQL statements. +In safe mode, DM can resolve the preceding issues by rewriting SQL statements. ## Working principle -Safe mode guarantees the idempotency of binlog events by rewriting SQL statements. Specifically, the following SQL statements are rewritten: +In safe mode, DM guarantees the idempotency of binlog events by rewriting SQL statements. Specifically, the following SQL statements are rewritten: * `INSERT` is rewritten to `REPLACE`. -* `UPDATE` is analyzed to obtain the value of the primary key or the unique index of the row updated. 
`UPDATE` is then rewritten to `DELETE` + `REPLACE` in the following two steps: DM deletes the old record using the primary key or unique index, and insert the new record using the `REPLACE` statement. +* `UPDATE` is analyzed to obtain the value of the primary key or the unique index of the row updated. `UPDATE` is then rewritten to `DELETE` + `REPLACE` in the following two steps: DM deletes the old record using the primary key or unique index, and inserts the new record using the `REPLACE` statement. -`REPLACE` is a MySQL-specific data insertion syntax. When you insert data with `REPLACE`, if the new data and existing data has a primary key or unique constraint conflict, MySQL deletes all the conflicting records and executes the insert operation, which is equivalent to "force insert". For details, see [`REPLACE` statement](https://dev.mysql.com/doc/refman/8.0/en/replace.html) in MySQL documentation. +`REPLACE` is a MySQL-specific syntax for inserting data. When you insert data using `REPLACE`, and the new data and existing data have a primary key or unique constraint conflict, MySQL deletes all the conflicting records and executes the insert operation, which is equivalent to "force insert". For details, see [`REPLACE` statement](https://dev.mysql.com/doc/refman/8.0/en/replace.html) in MySQL documentation. -For example, a `dummydb.dummytbl` table has a primary key `id`. Execute the following SQL statements repeatedly on this table: +Assume that a `dummydb.dummytbl` table has a primary key `id`. 
Execute the following SQL statements repeatedly on this table: ```sql INSERT INTO dummydb.dummytbl (id, int_value, str_value) VALUES (123, 999, 'abc'); @@ -34,7 +34,7 @@ UPDATE dummydb.dummytbl SET int_value = 888999 WHERE int_value = 999; -- If UPDATE dummydb.dummytbl SET id = 999 WHERE id = 888; -- Update the primary key ``` -With safe mode enabled, when the preceding SQL statement is executed again in the downstream, it is rewritten as follows: +With safe mode enabled, when the preceding SQL statements are executed again in the downstream, they are rewritten as follows: ```sql REPLACE INTO dummydb.dummytbl (id, int_value, str_value) VALUES (123, 999, 'abc'); @@ -44,55 +44,57 @@ DELETE FROM dummydb.dummytbl WHERE id = 888; REPLACE INTO dummydb.dummytbl (id, int_value, str_value) VALUES (999, 888888, 'abc888'); ``` -In the preceding statement, `UPDATE` is rewritten as `DELETE` + `REPLACE`, rather than `DELETE` + `INSERT`. If `INSERT` is used here, when you insert a duplicate record with `id = 999`, the database reports a primary key conflict. This is why `REPLACE` is used instead. The new record will replace the existing record. +In the preceding statements, `UPDATE` is rewritten as `DELETE` + `REPLACE`, rather than `DELETE` + `INSERT`. If `INSERT` is used here, when you insert a duplicate record with `id = 999`, the database reports a primary key conflict. This is why `REPLACE` is used instead. The new record will replace the existing record. -By rewriting SQL statements, before duplicate insert or update operations, DM uses the new row data to overwrite the existing row data. This guarantees that insert and update operations can be executed repeatedly. +By rewriting SQL statements, before duplicate insert or update operations, DM uses the new row data to overwrite the existing row data. This guarantees that insert and update operations can be safely executed repeatedly. ## Enable safe mode +You can enable safe mode either automatically or manually. 
This section describes the detailed steps. + ### Automatically enable -When DM resumes an incremental replication task from the checkpoint (DM worker restart or network reconnection), DM automatically enables safe mode for a period. +When DM resumes an incremental replication task from a checkpoint (DM worker restart or network reconnection), DM automatically enables safe mode for a period. -Whether to enable safe mode is related to the `safemode_exit_point` in the checkpoint. When an incremental replication task is paused abnormally, DM tries to replicate all DML statements in the memory to the downstream and records the latest binlog position in the memory pulled from the upstream as `safemode_exit_point`. The `safemode_exit_point` is saved in the last checkpoint before the abnormal pause. +Whether to enable safe mode is related to `safemode_exit_point` in the checkpoint. When an incremental replication task is paused abnormally, DM replicates all DML statements in the memory to the downstream and records the latest binlog position in the memory pulled from the upstream as `safemode_exit_point`, which is saved in the last checkpoint before the abnormal pause. -When DM resumes an incremental replication task from the checkpoint, it determines whether to enable safe mode based on the following logic: +When resuming an incremental replication task from the checkpoint, DM determines whether to enable safe mode based on the following logic: -- If the checkpoint contains `safemode_exit_point`, the incremental replication task is paused abnormally. When DM resumes the task, if DM detects that the binlog position of the checkpoint to be resumed is earlier than `safemode_exit_point`, it means that the binlog events between the checkpoint and the `safemode_exit_point` might have been processed in the downstream. When the task is resumed, some binlog events might be executed repeatedly. Therefore, DM determines that safe mode should be enabled for these binlog positions. 
After the binlog position exceeds the `safemode_exit_point`, if safe mode is not manually enabled, DM automatically disables safe mode. +- If the checkpoint contains `safemode_exit_point`, the incremental replication task is paused abnormally. When DM resumes the task, if DM detects that the binlog position of the checkpoint to be resumed is earlier than `safemode_exit_point`, the binlog events between the checkpoint and the `safemode_exit_point` might have been processed in the downstream. After the task is resumed, some binlog events are executed repeatedly. Therefore, DM determines that safe mode should be enabled for these binlog positions. After the binlog position exceeds the `safemode_exit_point`, if safe mode is not manually enabled, DM automatically disables safe mode. - If the checkpoint does not contain `safemode_exit_point`, there are two cases: 1. This is a new task, or this task is paused as expected. - 2. This task is paused abnormally and it fails to record `safemode_exit_point`, or the DM process exits abnormally. + 2. This task is paused abnormally but DM fails to record `safemode_exit_point`, or the DM process exits abnormally. - In the second case, DM does not know which binlog events after the checkpoint are executed in the downstream. To be safe, if DM does not find `safemode_exit_point` in the checkpoint, it automatically enables safe mode between the preceding two checkpoints to ensure that repeatedly executed binlog events do not cause any problems. The default interval between two checkpoints is 30 seconds, which means when a normal incremental replication task starts, safe mode is enforced for the first 60 seconds (2 * 30 seconds). + In the second case, DM does not know which binlog events after the checkpoint are executed in the downstream. To ensure that repeatedly executed binlog events do not cause any problems, DM automatically enables safe mode between the preceding two checkpoints. 
The default interval between two checkpoints is 30 seconds, which means when a normal incremental replication task starts, safe mode is enforced for the first 60 seconds (2 * 30 seconds). - You can change the checkpoint interval by setting the `checkpoint-flush-interval` item in syncer configuration and thereby adjust the safe mode period at the beginning of the incremental replication task. It is not recommended to adjust this setting. If necessary, you can [manually enable safe mode](#manually-enable). + You can change the checkpoint interval by setting the `checkpoint-flush-interval` item in syncer configuration, thereby adjusting the safe mode period at the beginning of a incremental replication task. It is not recommended to adjust this setting. If necessary, you can [manually enable safe mode](#manually-enable). ### Manually enable -You can control whether to enable safe mode throughout by setting the `safe-mode` item in the syncer configuration. `safe-mode` is a bool type parameter and `false` by default. If it is set to `true`, DM enables safe mode for the whole incremental replication process. The following is a task configuration example with safe mode enabled: +You can control whether to enable safe mode during the entire replication process by setting the `safe-mode` item in the syncer configuration. `safe-mode` is a bool type parameter and is `false` by default. If it is set to `true`, DM enables safe mode for the whole incremental replication process. The following is a task configuration example with safe mode enabled: ``` syncers: # The running configurations of the sync processing unit. global: # Configuration name. - # Other configuration items are ignored. + # Other configuration items are not provided in this example. safe-mode: true # Enables safe mode for the whole incremental replication process. - # Other configuration items are ignored. + # Other configuration items are not provided in this example. 
# ----------- Instance configuration ----------- mysql-instances: - source-id: "mysql-replica-01" - # Other configuration items are ignored. + # Other configuration items are not provided in this example. syncer-config-name: "global" # Name of the syncers configuration. ``` ## Notes for safe mode -If you want to enable safe mode throughout for safety reasons, you must be aware of the following: +If you want to enable safe mode during the entire replication process for safety reasons, be aware of the following: -- **Safe mode has extra overhead for incremental replication.** Frequent `DELETE` + `REPLACE` operations result in frequent changes to primary keys or unique indexes, which creates a greater performance overhead than a simple `UPDATE` statement. -- **Safe mode forces the replacement of records with the same primary key, which might result in data loss in the downstream.** When you merge and migrate shards from the upstream to the downstream, incorrect configuration might lead to a large number of primary key or unique key conflicts. If safe mode is enabled, the downstream might lose lots of data. The task might not show any exception, resulting in severe data inconsistency. -- **Safe mode relies on the primary key or unique index to detect conflicts.** if the downstream table has no primary key or unique index, DM cannot use `REPLACE` to replace and insert records. In such case, even if DM rewrites `INSERT` to `REPLACE`, duplicate records are still inserted into the downstream. +- **Incremental replication in safe mode consumes extra overhead.** Frequent `DELETE` + `REPLACE` operations result in frequent changes to primary keys or unique indexes, which creates a greater performance overhead than executing `UPDATE` statements only. 
+- **Safe mode forces the replacement of records with the same primary key, which might result in data loss in the downstream.** When you merge and migrate shards from the upstream to the downstream, incorrect configuration might lead to a large number of primary key or unique key conflicts. If safe mode is enabled in this situation, the downstream might lose lots of data without showing any exception, resulting in severe data inconsistency. +- **Safe mode relies on the primary key or unique index to detect conflicts.** If the downstream table has no primary key or unique index, DM cannot use `REPLACE` to replace and insert records. In this case, even if safe mode is enabled and DM rewrites `INSERT` to `REPLACE`, duplicate records are still inserted into the downstream. -In summary, if the upstream database has data with duplicate primary keys, and the application can tolerate loss of duplicate records and performance overhead, you can enable safe mode to ignore data duplication. +In summary, if the upstream database has data with duplicate primary keys, and your application tolerates loss of duplicate records and performance overhead, you can enable safe mode to ignore data duplication. From 34770efb988444e12397d0472c251a895cd40b00 Mon Sep 17 00:00:00 2001 From: Ran Date: Thu, 21 Jul 2022 10:58:33 +0800 Subject: [PATCH 03/11] Update dm/dm-safe-mode.md --- dm/dm-safe-mode.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/dm/dm-safe-mode.md b/dm/dm-safe-mode.md index fcc9e25a9b3fc..85dc74591c9d7 100644 --- a/dm/dm-safe-mode.md +++ b/dm/dm-safe-mode.md @@ -69,7 +69,7 @@ When resuming an incremental replication task from the checkpoint, DM determines In the second case, DM does not know which binlog events after the checkpoint are executed in the downstream. To ensure that repeatedly executed binlog events do not cause any problems, DM automatically enables safe mode between the preceding two checkpoints. 
The default interval between two checkpoints is 30 seconds, which means when a normal incremental replication task starts, safe mode is enforced for the first 60 seconds (2 * 30 seconds). - You can change the checkpoint interval by setting the `checkpoint-flush-interval` item in syncer configuration, thereby adjusting the safe mode period at the beginning of a incremental replication task. It is not recommended to adjust this setting. If necessary, you can [manually enable safe mode](#manually-enable). + Usually, it is not recommended to change the checkpoint interval to adjust the safe mode period at the beginning of the incremental replication task. However, if you do need a change, you can [manually enable safe mode](#manually-enable) (recommended) or set the `checkpoint-flush-interval` item in syncer configuration. ### Manually enable From 23dd161d16daa7b2c591719d7df37a0ba80c6216 Mon Sep 17 00:00:00 2001 From: Ran Date: Fri, 16 Sep 2022 10:32:17 +0800 Subject: [PATCH 04/11] Apply suggestions from code review Co-authored-by: okJiang --- dm/dm-safe-mode.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/dm/dm-safe-mode.md b/dm/dm-safe-mode.md index 85dc74591c9d7..a3b78c3218d61 100644 --- a/dm/dm-safe-mode.md +++ b/dm/dm-safe-mode.md @@ -5,13 +5,13 @@ summary: Introduces the DM safe mode, its purpose, working principles and how to # DM Safe Mode -Safe mode is an operation mode for DM to perform incremental replication. In safe mode, when the DM's incremental replication component replicates binlog events, it forcibly rewrites all the `INSERT` and `UPDATE` statements before executing them in the downstream. +Safe mode is a special operation mode for DM to perform incremental replication. During safe mode, when the DM's incremental replication component replicates binlog events, DM forcibly rewrites all the `INSERT` and `UPDATE` statements before executing them in the downstream. 
In safe mode, the same binlog event can be replicated repeatedly to the downstream and the result is guaranteed to be idempotent. Thus, the incremental replication is ensured to be *safe*. After resuming a data replication task from a checkpoint, DM might repeatedly execute some binlog events, which leads to the following issues: -- During incremental replication, the operation of executing DML and the operation of writing checkpoint are not simultaneous. The operation of writing checkpoints and writing data into the downstream database is not atomic. Therefore, **when DM exits abnormally, checkpoints might only record a restoration point before the exit moment**. +- During incremental replication, the operation of executing DML and the operation of writing checkpoint are not simultaneous. The operation of writing checkpoints and writing data into the downstream database is not atomic. Therefore, **when DM exits abnormally, checkpoints might only record a restoration point before the exit point**. - When DM restarts a replication task and resumes incremental replication from a checkpoint, some data between the checkpoint and the exit moment might already be processed before the abnormal exit. This causes **some SQL statements to be repeatedly executed**. - If an `INSERT` statement is executed more than once, the primary key or the unique index might encounter a conflict, resulting in replication failure. If an `UPDATE` statement is executed more than once, the filter condition might not be able to locate updated records. @@ -64,12 +64,12 @@ When resuming an incremental replication task from the checkpoint, DM determines - If the checkpoint does not contain `safemode_exit_point`, there are two cases: - 1. This is a new task, or this task is paused as expected. + 1. This is a new task, or this task is exited as expected. 2. This task is paused abnormally but DM fails to record `safemode_exit_point`, or the DM process exits abnormally. 
In the second case, DM does not know which binlog events after the checkpoint are executed in the downstream. To ensure that repeatedly executed binlog events do not cause any problems, DM automatically enables safe mode between the preceding two checkpoints. The default interval between two checkpoints is 30 seconds, which means when a normal incremental replication task starts, safe mode is enforced for the first 60 seconds (2 * 30 seconds). - Usually, it is not recommended to change the checkpoint interval to adjust the safe mode period at the beginning of the incremental replication task. However, if you do need a change, you can [manually enable safe mode](#manually-enable) (recommended) or set the `checkpoint-flush-interval` item in syncer configuration. + Usually, it is not recommended to change the checkpoint interval to adjust the safe mode period at the beginning of the incremental replication task. However, if you do need a change, you can [manually enable safe mode](#manually-enable) (recommended) or change the `checkpoint-flush-interval` item in syncer configuration. ### Manually enable From 5366f396b50566022950b2c6535eed77f07e4c5a Mon Sep 17 00:00:00 2001 From: Ran Date: Mon, 28 Nov 2022 15:49:20 +0800 Subject: [PATCH 05/11] Apply suggestions from code review Co-authored-by: okJiang --- dm/dm-safe-mode.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/dm/dm-safe-mode.md b/dm/dm-safe-mode.md index a3b78c3218d61..4daaff0689689 100644 --- a/dm/dm-safe-mode.md +++ b/dm/dm-safe-mode.md @@ -7,19 +7,19 @@ summary: Introduces the DM safe mode, its purpose, working principles and how to Safe mode is a special operation mode for DM to perform incremental replication. During safe mode, when the DM's incremental replication component replicates binlog events, DM forcibly rewrites all the `INSERT` and `UPDATE` statements before executing them in the downstream. 
-In safe mode, the same binlog event can be replicated repeatedly to the downstream and the result is guaranteed to be idempotent. Thus, the incremental replication is ensured to be *safe*. +During safe mode, the duplicate binlog event can be replicated repeatedly to the downstream and make sure the result is idempotent. Thus, the incremental replication is *safe*. After resuming a data replication task from a checkpoint, DM might repeatedly execute some binlog events, which leads to the following issues: - During incremental replication, the operation of executing DML and the operation of writing checkpoint are not simultaneous. The operation of writing checkpoints and writing data into the downstream database is not atomic. Therefore, **when DM exits abnormally, checkpoints might only record a restoration point before the exit point**. -- When DM restarts a replication task and resumes incremental replication from a checkpoint, some data between the checkpoint and the exit moment might already be processed before the abnormal exit. This causes **some SQL statements to be repeatedly executed**. -- If an `INSERT` statement is executed more than once, the primary key or the unique index might encounter a conflict, resulting in replication failure. If an `UPDATE` statement is executed more than once, the filter condition might not be able to locate updated records. +- When DM restarts a replication task and resumes incremental replication from a checkpoint, some data between the checkpoint and the exit point might already be processed before the abnormal exit. This causes **some SQL statements executed repeatedly**. +- If an `INSERT` statement is executed more than once, the primary key or the unique index might encounter a conflict, which leads to a replication failure. If an `UPDATE` statement is executed more than once, the filter condition might not be able to locate the previously updated records. 
-In safe mode, DM can resolve the preceding issues by rewriting SQL statements. +During safe mode, DM can rewrite SQL statements to resolve the preceding issues. ## Working principle -In safe mode, DM guarantees the idempotency of binlog events by rewriting SQL statements. Specifically, the following SQL statements are rewritten: +During safe mode, DM guarantees the idempotency of binlog events by rewriting SQL statements. Specifically, the following SQL statements are rewritten: * `INSERT` is rewritten to `REPLACE`. * `UPDATE` is analyzed to obtain the value of the primary key or the unique index of the row updated. `UPDATE` is then rewritten to `DELETE` + `REPLACE` in the following two steps: DM deletes the old record using the primary key or unique index, and inserts the new record using the `REPLACE` statement. @@ -56,9 +56,9 @@ You can enable safe mode either automatically or manually. This section describe When DM resumes an incremental replication task from a checkpoint (DM worker restart or network reconnection), DM automatically enables safe mode for a period. -Whether to enable safe mode is related to `safemode_exit_point` in the checkpoint. When an incremental replication task is paused abnormally, DM replicates all DML statements in the memory to the downstream and records the latest binlog position in the memory pulled from the upstream as `safemode_exit_point`, which is saved in the last checkpoint before the abnormal pause. +Whether to enable safe mode is related to `safemode_exit_point` in the checkpoint. When an incremental replication task is paused abnormally, DM tries to replicate all DML statements in the memory to the downstream and records the latest binlog position among the DML statements as `safemode_exit_point`, which is saved to the last checkpoint. 
-When resuming an incremental replication task from the checkpoint, DM determines whether to enable safe mode based on the following logic: +When resuming an incremental replication task from the checkpoint, whether DM enables safe mode depends on the following logic: - If the checkpoint contains `safemode_exit_point`, the incremental replication task is paused abnormally. When DM resumes the task, if DM detects that the binlog position of the checkpoint to be resumed is earlier than `safemode_exit_point`, the binlog events between the checkpoint and the `safemode_exit_point` might have been processed in the downstream. After the task is resumed, some binlog events are executed repeatedly. Therefore, DM determines that safe mode should be enabled for these binlog positions. After the binlog position exceeds the `safemode_exit_point`, if safe mode is not manually enabled, DM automatically disables safe mode. From 35d3caa056aa74daadae750964911b0e860e358f Mon Sep 17 00:00:00 2001 From: Ran Date: Mon, 28 Nov 2022 15:50:47 +0800 Subject: [PATCH 06/11] Apply suggestions from code review Co-authored-by: okJiang --- dm/dm-safe-mode.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/dm/dm-safe-mode.md b/dm/dm-safe-mode.md index 4daaff0689689..cd1d3bbd38aec 100644 --- a/dm/dm-safe-mode.md +++ b/dm/dm-safe-mode.md @@ -67,13 +67,13 @@ When resuming an incremental replication task from the checkpoint, whether DM en 1. This is a new task, or this task is exited as expected. 2. This task is paused abnormally but DM fails to record `safemode_exit_point`, or the DM process exits abnormally. - In the second case, DM does not know which binlog events after the checkpoint are executed in the downstream. To ensure that repeatedly executed binlog events do not cause any problems, DM automatically enables safe mode between the preceding two checkpoints. 
The default interval between two checkpoints is 30 seconds, which means when a normal incremental replication task starts, safe mode is enforced for the first 60 seconds (2 * 30 seconds). + In the second case, DM does not know which binlog events after the checkpoint are executed in the downstream. To ensure that repeatedly executed binlog events do not cause any problems, DM automatically enables safe mode during the first two checkpoint intervals. The default interval between two checkpoints is 30 seconds, which means when a normal incremental replication task starts, safe mode is enforced for the first 60 seconds (2 * 30 seconds). Usually, it is not recommended to change the checkpoint interval to adjust the safe mode period at the beginning of the incremental replication task. However, if you do need a change, you can [manually enable safe mode](#manually-enable) (recommended) or change the `checkpoint-flush-interval` item in syncer configuration. ### Manually enable -You can control whether to enable safe mode during the entire replication process by setting the `safe-mode` item in the syncer configuration. `safe-mode` is a bool type parameter and is `false` by default. If it is set to `true`, DM enables safe mode for the whole incremental replication process. The following is a task configuration example with safe mode enabled: +You can set the `safe-mode` item in the syncer configuration to enable safe mode during the entire replication process. `safe-mode` is a bool type parameter and is `false` by default. If it is set to `true`, DM enables safe mode for the whole incremental replication process. The following is a task configuration example with safe mode enabled: ``` syncers: # The running configurations of the sync processing unit. 
From d994f59711f365d583d0c1c969312a2098e34d03 Mon Sep 17 00:00:00 2001 From: Ran Date: Fri, 7 Apr 2023 17:44:21 +0800 Subject: [PATCH 07/11] Apply suggestions from code review Co-authored-by: shichun-0415 <89768198+shichun-0415@users.noreply.github.com> Co-authored-by: okJiang --- dm/dm-safe-mode.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/dm/dm-safe-mode.md b/dm/dm-safe-mode.md index cd1d3bbd38aec..959e9bc9a6a13 100644 --- a/dm/dm-safe-mode.md +++ b/dm/dm-safe-mode.md @@ -7,7 +7,7 @@ summary: Introduces the DM safe mode, its purpose, working principles and how to Safe mode is a special operation mode for DM to perform incremental replication. During safe mode, when the DM's incremental replication component replicates binlog events, DM forcibly rewrites all the `INSERT` and `UPDATE` statements before executing them in the downstream. -During safe mode, the duplicate binlog event can be replicated repeatedly to the downstream and make sure the result is idempotent. Thus, the incremental replication is *safe*. +During safe mode, a duplicate binlog event can be replicated repeatedly to the downstream with idempotence guaranteed. Thus, the incremental replication is *safe*. After resuming a data replication task from a checkpoint, DM might repeatedly execute some binlog events, which leads to the following issues: @@ -30,8 +30,8 @@ Assume that a `dummydb.dummytbl` table has a primary key `id`. 
Execute the follo ```sql INSERT INTO dummydb.dummytbl (id, int_value, str_value) VALUES (123, 999, 'abc'); -UPDATE dummydb.dummytbl SET int_value = 888999 WHERE int_value = 999; -- If there is no other record with int_value = 999 -UPDATE dummydb.dummytbl SET id = 999 WHERE id = 888; -- Update the primary key +UPDATE dummydb.dummytbl SET int_value = 888999 WHERE int_value = 999; -- Suppose there is no other record with int_value = 999 +UPDATE dummydb.dummytbl SET id = 999 WHERE id = 888; -- Update the primary key ``` With safe mode enabled, when the preceding SQL statements are executed again in the downstream, they are rewritten as follows: @@ -46,7 +46,7 @@ REPLACE INTO dummydb.dummytbl (id, int_value, str_value) VALUES (999, 888888, 'a In the preceding statements, `UPDATE` is rewritten as `DELETE` + `REPLACE`, rather than `DELETE` + `INSERT`. If `INSERT` is used here, when you insert a duplicate record with `id = 999`, the database reports a primary key conflict. This is why `REPLACE` is used instead. The new record will replace the existing record. -By rewriting SQL statements, before duplicate insert or update operations, DM uses the new row data to overwrite the existing row data. This guarantees that insert and update operations are executed repeatedly. +By rewriting SQL statements, DM overwrites the existing row data using the new row data when performing duplicate insert or update operations. This guarantees that insert and update operations are executed repeatedly. ## Enable safe mode @@ -58,13 +58,13 @@ When DM resumes an incremental replication task from a checkpoint (DM worker res Whether to enable safe mode is related to `safemode_exit_point` in the checkpoint. When an incremental replication task is paused abnormally, DM tries to replicate all DML statements in the memory to the downstream and records the latest binlog position among the DML statements as `safemode_exit_point`, which is saved to the last checkpoint. 
-When resuming an incremental replication task from the checkpoint, whether DM enables safe mode depends on the following logic: +The detailed logic is as follows: -- If the checkpoint contains `safemode_exit_point`, the incremental replication task is paused abnormally. When DM resumes the task, if DM detects that the binlog position of the checkpoint to be resumed is earlier than `safemode_exit_point`, the binlog events between the checkpoint and the `safemode_exit_point` might have been processed in the downstream. After the task is resumed, some binlog events are executed repeatedly. Therefore, DM determines that safe mode should be enabled for these binlog positions. After the binlog position exceeds the `safemode_exit_point`, if safe mode is not manually enabled, DM automatically disables safe mode. +- If the checkpoint contains `safemode_exit_point`, the incremental replication task is paused abnormally. When DM resumes the task, the binlog position of the checkpoint to be resumed (**begin position**) is earlier than `safemode_exit_point`, which represents the binlog events between the begin position and the `safemode_exit_point` might have been processed in the downstream. So, during the resume process, some binlog events might be executed repeatedly. Therefore, enabling safe mode can make these binlog positions **safe**. After the binlog position exceeds the `safemode_exit_point`, DM automatically disables safe mode unless safe mode is enabled manually. - If the checkpoint does not contain `safemode_exit_point`, there are two cases: - 1. This is a new task, or this task is exited as expected. + 1. This is a new task, or this task is paused as expected. 2. This task is paused abnormally but DM fails to record `safemode_exit_point`, or the DM process exits abnormally. In the second case, DM does not know which binlog events after the checkpoint are executed in the downstream. 
To ensure that repeatedly executed binlog events do not cause any problems, DM automatically enables safe mode during the first two checkpoint intervals. The default interval between two checkpoints is 30 seconds, which means when a normal incremental replication task starts, safe mode is enforced for the first 60 seconds (2 * 30 seconds). From b02ee935f97dda0e21750d10b9cafec5cf82279f Mon Sep 17 00:00:00 2001 From: Ran Date: Fri, 7 Apr 2023 17:44:34 +0800 Subject: [PATCH 08/11] Update dm/dm-safe-mode.md Co-authored-by: shichun-0415 <89768198+shichun-0415@users.noreply.github.com> --- dm/dm-safe-mode.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/dm/dm-safe-mode.md b/dm/dm-safe-mode.md index 959e9bc9a6a13..9f443138961e8 100644 --- a/dm/dm-safe-mode.md +++ b/dm/dm-safe-mode.md @@ -11,7 +11,7 @@ During safe mode, a duplicate binlog event can be replicated repeatedly to the d After resuming a data replication task from a checkpoint, DM might repeatedly execute some binlog events, which leads to the following issues: -- During incremental replication, the operation of executing DML and the operation of writing checkpoint are not simultaneous. The operation of writing checkpoints and writing data into the downstream database is not atomic. Therefore, **when DM exits abnormally, checkpoints might only record a restoration point before the exit point**. +- During incremental replication, the operation of executing DML and the operation of writing checkpoint are not simultaneous. The operation of writing checkpoints and writing data into the downstream database is not atomic. Therefore, **when DM exits abnormally, checkpoints might only record the restoration point before the exit point**. - When DM restarts a replication task and resumes incremental replication from a checkpoint, some data between the checkpoint and the exit point might already be processed before the abnormal exit. This causes **some SQL statements executed repeatedly**. 
- If an `INSERT` statement is executed more than once, the primary key or the unique index might encounter a conflict, which leads to a replication failure. If an `UPDATE` statement is executed more than once, the filter condition might not be able to locate the previously updated records. From 154dd6bed5377a9cb05f5afef022488cdaa031ff Mon Sep 17 00:00:00 2001 From: Ran Date: Mon, 10 Apr 2023 16:11:06 +0800 Subject: [PATCH 09/11] Apply suggestions from code review Co-authored-by: xixirangrang --- dm/dm-safe-mode.md | 26 ++++++++++++++------------ 1 file changed, 14 insertions(+), 12 deletions(-) diff --git a/dm/dm-safe-mode.md b/dm/dm-safe-mode.md index 9f443138961e8..4f171fae61c6b 100644 --- a/dm/dm-safe-mode.md +++ b/dm/dm-safe-mode.md @@ -5,24 +5,24 @@ summary: Introduces the DM safe mode, its purpose, working principles and how to # DM Safe Mode -Safe mode is a special operation mode for DM to perform incremental replication. During safe mode, when the DM's incremental replication component replicates binlog events, DM forcibly rewrites all the `INSERT` and `UPDATE` statements before executing them in the downstream. +Safe mode is a special operation mode for DM to perform incremental replication. In safe mode, when the DM incremental replication component replicates binlog events, DM forcibly rewrites all the `INSERT` and `UPDATE` statements before executing them in the downstream. -During safe mode, a duplicate binlog event can be replicated repeatedly to the downstream with idempotence guaranteed. Thus, the incremental replication is *safe*. +During safe mode, one binlog event can be replicated repeatedly to the downstream with idempotence guaranteed. Thus, the incremental replication is *safe*. 
-After resuming a data replication task from a checkpoint, DM might repeatedly execute some binlog events, which leads to the following issues: +After resuming a data replication task from a checkpoint, DM might repeatedly replicate some binlog events, which leads to the following issues: -- During incremental replication, the operation of executing DML and the operation of writing checkpoint are not simultaneous. The operation of writing checkpoints and writing data into the downstream database is not atomic. Therefore, **when DM exits abnormally, checkpoints might only record the restoration point before the exit point**. -- When DM restarts a replication task and resumes incremental replication from a checkpoint, some data between the checkpoint and the exit point might already be processed before the abnormal exit. This causes **some SQL statements executed repeatedly**. -- If an `INSERT` statement is executed more than once, the primary key or the unique index might encounter a conflict, which leads to a replication failure. If an `UPDATE` statement is executed more than once, the filter condition might not be able to locate the previously updated records. +- During incremental replication, the operation of executing DML and the operation of writing checkpoints are not simultaneous. The operation of writing checkpoints and writing data into the downstream database is not atomic. Therefore, **when DM exits abnormally, checkpoints might only record the restoration point before the exit point**. +- When DM restarts a replication task and resumes incremental replication from a checkpoint, some data between the checkpoint and the exit point might already be processed before the abnormal exit. This causes **some SQL statements to be executed repeatedly**. +- If an `INSERT` statement is executed repeatedly, the primary key or the unique index might encounter a conflict, which leads to a replication failure. 
If an `UPDATE` statement is executed repeatedly, the filter condition might not be able to locate the previously updated records. -During safe mode, DM can rewrite SQL statements to resolve the preceding issues. +In safe mode, DM can rewrite SQL statements to resolve the preceding issues. ## Working principle -During safe mode, DM guarantees the idempotency of binlog events by rewriting SQL statements. Specifically, the following SQL statements are rewritten: +In safe mode, DM guarantees the idempotency of binlog events by rewriting SQL statements. Specifically, the following SQL statements are rewritten: -* `INSERT` is rewritten to `REPLACE`. -* `UPDATE` is analyzed to obtain the value of the primary key or the unique index of the row updated. `UPDATE` is then rewritten to `DELETE` + `REPLACE` in the following two steps: DM deletes the old record using the primary key or unique index, and inserts the new record using the `REPLACE` statement. +* `INSERT` statements are rewritten to `REPLACE` statements. +* `UPDATE` statements are analyzed to obtain the value of the primary key or the unique index of the row updated. `UPDATE` statements are then rewritten to `DELETE` + `REPLACE` statements in the following two steps: DM deletes the old record using the primary key or unique index, and inserts the new record using the `REPLACE` statement. `REPLACE` is a MySQL-specific syntax for inserting data. When you insert data using `REPLACE`, and the new data and existing data have a primary key or unique constraint conflict, MySQL deletes all the conflicting records and executes the insert operation, which is equivalent to "force insert". For details, see [`REPLACE` statement](https://dev.mysql.com/doc/refman/8.0/en/replace.html) in MySQL documentation. @@ -73,7 +73,9 @@ The detailed logic is as follows: ### Manually enable -You can set the `safe-mode` item in the syncer configuration to enable safe mode during the entire replication process. 
`safe-mode` is a bool type parameter and is `false` by default. If it is set to `true`, DM enables safe mode for the whole incremental replication process. The following is a task configuration example with safe mode enabled: +You can set the `safe-mode` item in the syncer configuration to enable safe mode during the entire replication process. `safe-mode` is a bool type parameter and is `false` by default. If it is set to `true`, DM enables safe mode for the whole incremental replication process. + +The following is a task configuration example with safe mode enabled: ``` syncers: # The running configurations of the sync processing unit. @@ -95,6 +97,6 @@ If you want to enable safe mode during the entire replication process for safety - **Incremental replication in safe mode consumes extra overhead.** Frequent `DELETE` + `REPLACE` operations result in frequent changes to primary keys or unique indexes, which creates a greater performance overhead than executing `UPDATE` statements only. - **Safe mode forces the replacement of records with the same primary key, which might result in data loss in the downstream.** When you merge and migrate shards from the upstream to the downstream, incorrect configuration might lead to a large number of primary key or unique key conflicts. If safe mode is enabled in this situation, the downstream might lose lots of data without showing any exception, resulting in severe data inconsistency. -- **Safe mode relies on the primary key or unique index to detect conflicts.** If the downstream table has no primary key or unique index, DM cannot use `REPLACE` to replace and insert records. In this case, even if safe mode is enabled and DM rewrites `INSERT` to `REPLACE`, duplicate records are still inserted into the downstream. +- **Safe mode relies on the primary key or unique index to detect conflicts.** If the downstream table has no primary key or unique index, DM cannot use `REPLACE` to replace and insert records. 
In this case, even if safe mode is enabled and DM rewrites `INSERT` to `REPLACE` statements, duplicate records are still inserted into the downstream. In summary, if the upstream database has data with duplicate primary keys, and your application tolerates loss of duplicate records and performance overhead, you can enable safe mode to ignore data duplication. From 091f0c2c358fd3702a1408fcf0ff95882c087aaa Mon Sep 17 00:00:00 2001 From: Ran Date: Tue, 11 Apr 2023 14:49:51 +0800 Subject: [PATCH 10/11] Update dm/dm-safe-mode.md Co-authored-by: xixirangrang --- dm/dm-safe-mode.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/dm/dm-safe-mode.md b/dm/dm-safe-mode.md index 4f171fae61c6b..6a4b67b04b21b 100644 --- a/dm/dm-safe-mode.md +++ b/dm/dm-safe-mode.md @@ -54,7 +54,7 @@ You can enable safe mode either automatically or manually. This section describe ### Automatically enable -When DM resumes an incremental replication task from a checkpoint (DM worker restart or network reconnection), DM automatically enables safe mode for a period. +When DM resumes an incremental replication task from a checkpoint (For example, DM worker restart or network reconnection), DM automatically enables safe mode for a period. Whether to enable safe mode is related to `safemode_exit_point` in the checkpoint. When an incremental replication task is paused abnormally, DM tries to replicate all DML statements in the memory to the downstream and records the latest binlog position among the DML statements as `safemode_exit_point`, which is saved to the last checkpoint. 
From 32ee5485ccd68e168ecb441dbfc2d62ea1e2a01f Mon Sep 17 00:00:00 2001 From: Ran Date: Tue, 11 Apr 2023 15:26:52 +0800 Subject: [PATCH 11/11] Update dm/dm-safe-mode.md Co-authored-by: xixirangrang --- dm/dm-safe-mode.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/dm/dm-safe-mode.md b/dm/dm-safe-mode.md index 6a4b67b04b21b..52b7dd10c169d 100644 --- a/dm/dm-safe-mode.md +++ b/dm/dm-safe-mode.md @@ -54,7 +54,7 @@ You can enable safe mode either automatically or manually. This section describe ### Automatically enable -When DM resumes an incremental replication task from a checkpoint (For example, DM worker restart or network reconnection), DM automatically enables safe mode for a period. +When DM resumes an incremental replication task from a checkpoint (for example, DM worker restart or network reconnection), DM automatically enables safe mode for a period (60 seconds by default). Whether to enable safe mode is related to `safemode_exit_point` in the checkpoint. When an incremental replication task is paused abnormally, DM tries to replicate all DML statements in the memory to the downstream and records the latest binlog position among the DML statements as `safemode_exit_point`, which is saved to the last checkpoint.
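The rewrite rules documented in this patch series (`INSERT` to `REPLACE`, `UPDATE` to `DELETE` + `REPLACE`) can be sketched as below. This is a minimal illustration and not DM's implementation: it uses Python's built-in `sqlite3` module, which also accepts `REPLACE INTO`, as a stand-in for the downstream database. The helper names `apply_insert`, `apply_update`, and `replay`, and the row values, are assumptions made for the example.

```python
import sqlite3

COLUMNS = "(id, int_value, str_value)"


def apply_insert(conn, row):
    # Safe-mode style rewrite: INSERT becomes REPLACE, so a repeated
    # insert overwrites the existing row instead of raising a key conflict.
    conn.execute(f"REPLACE INTO dummytbl {COLUMNS} VALUES (?, ?, ?)", row)


def apply_update(conn, old_id, new_row):
    # Safe-mode style rewrite: UPDATE becomes DELETE (by the old primary key)
    # followed by REPLACE (with the full new row image).
    conn.execute("DELETE FROM dummytbl WHERE id = ?", (old_id,))
    conn.execute(f"REPLACE INTO dummytbl {COLUMNS} VALUES (?, ?, ?)", new_row)


def replay(conn):
    # Hypothetical batch of three row changes, loosely following the
    # dummydb.dummytbl example in the document.
    apply_insert(conn, (123, 999, "abc"))
    apply_update(conn, 123, (123, 888999, "abc"))     # non-key column changes
    apply_update(conn, 888, (999, 888888, "abc888"))  # primary key changes


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dummytbl ("
    "id INTEGER PRIMARY KEY, int_value INTEGER, str_value TEXT)"
)
# Pre-existing downstream row that the third change will move to id = 999.
conn.execute("INSERT INTO dummytbl VALUES (888, 888888, 'abc888')")

replay(conn)
first = conn.execute("SELECT * FROM dummytbl ORDER BY id").fetchall()

# Simulate resuming from an older checkpoint: the same changes run again.
replay(conn)
second = conn.execute("SELECT * FROM dummytbl ORDER BY id").fetchall()

print(first == second)
```

Replaying the batch a second time leaves the table unchanged, which is the idempotency that safe mode relies on. The sketch works only because the table has a primary key. As the documentation notes, without a primary key or unique index `REPLACE` cannot detect the conflict and duplicate rows are inserted.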