From 44baa1437cab3e602087562b9f16d15366a4f4c1 Mon Sep 17 00:00:00 2001
From: King-Dylan <50897894+King-Dylan@users.noreply.github.com>
Date: Sun, 28 Jun 2020 11:30:53 +0800
Subject: [PATCH 01/52] Create troubleshoot-high-disk-io.md

---
 troubleshoot-high-disk-io.md | 95 ++++++++++++++++++++++++++++++++++++
 1 file changed, 95 insertions(+)
 create mode 100644 troubleshoot-high-disk-io.md

diff --git a/troubleshoot-high-disk-io.md b/troubleshoot-high-disk-io.md
new file mode 100644
index 0000000000000..20a3f7a256510
--- /dev/null
+++ b/troubleshoot-high-disk-io.md
@@ -0,0 +1,95 @@
+---
+title: TiDB Disk I/O Excessive Treatment
+summary: Learn how to locate and deal with the problem of high TiDB storage I/O.
+category: reference
+---
+
+# The processing method of TiDB disk io usage is too high
+
+This article mainly introduces how to locate and deal with the problem of high TiDB disk I/O usage.
+
+## Confirm the current I/O indicators
+
+When the system response slows down, if the bottleneck of the CPU and the bottleneck of data transaction conflicts have been investigated, you need to start with I/O indicators to help determine the current system bottleneck.
+
+### Locate I/O problems from monitor
+
+The fastest position method is to view the overall I/O situation from the monitor. You can view the correspond I/O monitor from the Grafana monitor component, which is deployed by the default cluster deployment tool (TiDB-Ansible, TiUP). The Dashboard relate to I/O has `Overview`, `Node_exporter`, `Disk-Performance`.
+
+#### The first type of panel
+
+In `Overview`> `System Info`> `IO Util`, you can see the I/O status of each machine in the cluster. This indicator is similar to util in Linux iostat monitor. The higher percentage represents the higher disk I/O usage:
+
+- If there is only one machine with high I/O in the monitor, it can assist in judging that there are currently reading and writing hot spots.
+- If the I/O of most machines in the monitor is high, then the cluster now has a high I/O load.
+
+If you find that the I/O of a certain machine is relatively high, you can further monitor the use of I/O from monitor `Disk-Performance Dashboard`, combined with metrics such as `Disk Latency` and `Disk Load` to determine whether there is an abnormality, and if necessary use the fio tool to test the disk.
+
+#### The second type of panel
+
+The main persistence component of the TiDB cluster is TiKV cluster. One TiKV instance contains two RocksDB instances: one for storing Raft logs, located in data/raft, and one for storing real data, located in data/db.
+
+In `TiKV-Details`> `Raft IO`, you can see the relevant metrics for disk writes of these two instances:
+
+- `Append log duration`: This monitor indicates the response time of RocksDB writes that store Raft logs. The .99 response should be within 50ms.
+- `Apply log duration`: This monitor indicates the response time for RocksDB writes that store real data. The .99 response should be within 100ms.
+
+These two monitors also have `.. per server` monitor panels to provide assistance to view hotspot writes.
+
+#### The third type of panel
+
+In `TiKV-Details`> `Storage`, there are monitor related to storage:
+
+- `Storage command total`: the number of different commands received.
+- `Storage async write duration`: includes monitor items such as disk sync duration, which may be related to Raft IO. If you encounter an abnormal situation, you need to check the working status of related components by logs.
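+
+The following sections refer to several TiKV configuration items. As a rough orientation only, here is a minimal `tikv.toml` sketch that collects them in one place. The values are illustrative (taken from the default or example values mentioned below) and are not tuning recommendations; adjust them according to the guidance in the next sections:
+
+```toml
+[raftstore]
+# Threads that write Raft logs; see the "raftstore thread is busy" case below.
+store-pool-size = 2
+# Threads that apply committed Raft logs to the data RocksDB; see the "Apply log is slow" case below.
+apply-pool-size = 2
+# Sliding-window size of the Raft flow control; see the "Raft commit log is slow" case below.
+raft-max-inflight-msgs = 256
+
+[rocksdb]
+# Sub-tasks that one level0-to-level1 compaction can be split into; see the "Write stall" case below.
+max-sub-compactions = 2
+
+[rocksdb.defaultcf]
+# Per-level compression; a higher-ratio algorithm trades CPU for disk throughput, as described below.
+compression-per-level = ["no", "no", "lz4", "lz4", "lz4", "zstd", "zstd"]
+```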
+
+#### Other panels
+
+In addition, some other content may be needed to help locate whether the bottleneck is I/O, and you can try to set some recommended parameters. By checking the prewrite/commit/raw-put of TiKV gRPC (raw kv cluster only) duration, it is confirmed that TiKV write is indeed slow. Several common situations are as follows:
+
+- Append log is slow. TiKV Grafana's Raft I/O and append log duration are relatively high. Usually, it is due to slow disk writing. You can check the value of WAL Sync Duration max of RocksDB-raft to confirm, otherwise you may need to report a bug.
+- The raftstore thread is busy. TiKV grafana's Raft Propose/propose wait duration is significantly higher than append log duration. Please check the following two points:
+
+    - Is the `store-pool-size` configuration of `[raftstore]` too small (this value is recommended between [1,5], not too large).
+    - Is machine's CPU insufficient.
+
+- Apply log is slow. TiKV Grafana's Raft I/O and apply log duration are relatively high, usually accompanied by a relatively high Raft Propose/apply wait duration. The possible situations are as follows:
+
+    - The `apply-pool-size` configuration of `[raftstore]` is too small (recommended between [1, 5], not recommended to be too large), Thread CPU/apply cpu is relatively high;
+    - The machine's CPU resources are not enough.
+    - Region write hotspot issue, the CPU usage of a single apply thread is relatively high (by modifying the Grafana expression, plus by (instance, name) to see the CPU usage of each thread), temporarily for the hot write of a single Region is no solution, this scene is being optimized recently.
+    - It is slow to write RocksDB, and the RocksDB kv/max write duration is relatively high (a single Raft log may contain many kvs. When writing RocksDB, 128 kvs will be written to RocksDB in a batch write, so one apply log may involve multiple RocksDB writes).
+    - In other cases, bugs need to be reported.
+
+- Raft commit log is slow. TiKV Grafana's Raft I/O and commit log duration are relatively high (this metric is only available in Grafana 4.x). Each Region corresponds to an independent Raft group. Raft has a flow control mechanism, similar to the sliding window mechanism of TCP, through the parameter [raftstore] raft-max-inflight-msgs = 256 to control the size of the sliding window, if there is a hot spot Write and commit log duration is relatively high, you can moderately change the parameters, such as 1024.
+
+### Locate I/O problems from log
+
+- If the client reports `server is busy` error, especially the error message of `raftstore is busy`, it will be related to I/O problem.
+
+    You can check the monitor: grafana -> TiKV -> errors to confirm the specific busy reason. Among them, `server is busy` is TiKV's flow control mechanism. In this way, TiKV informs `tidb/ti-client` that the current pressure of TiKV is too high, and try again later.
+
+- "Write stall" appears in TiKV RocksDB logs.
+
+    It may be that too much level0 sst causes stalls. You can add the parameter `[rocksdb] max-sub-compactions = 2 (or 3)` to speed up the compaction of level0 sst. This parameter means that the compaction tasks of level0 to level1 can be divided into `max-sub-compactions` subtasks to multi-threaded concurrent execution.
+
+    If the disk's I/O capability fail to keep up with the write, it is recommended to scale-in. If the throughput of the disk reaches the upper limit (for example, the throughput of SATA SSD will be much lower than that of NVME SSD), resulting in write stall, but the CPU resource is relatively sufficient, you can try to use a higher compression ratio compression algorithm to relieve the pressure on the disk, use CPU resources Change disk resources.
+
+    For example, when the pressure of `default cf compaction` is relatively high, you can change the parameter`[rocksdb.defaultcf] compression-per-level = ["no", "no", "lz4", "lz4", "lz4", "zstd" , "zstd"]` to `compression-per-level = ["no", "no", "zstd", "zstd", "zstd", "zstd", "zstd"]`.
+
+### I/O problem found from alarm
+
+The cluster deployment tool (TiDB-Ansible, TiUP) is an alarm component that is deployed by default. Officials have preset related alarm items and thresholds. I/O related items include:
+
+- TiKV_write_stall
+- TiKV_raft_log_lag
+- TiKV_async_request_snapshot_duration_seconds
+- TiKV_async_request_write_duration_seconds
+- TiKV_raft_append_log_duration_secs
+- TiKV_raft_apply_log_duration_secs
+
+## I/O problem handling plan
+
+1. When it is confirmed as a I/O hotspot issue, you need to refer to [TiDB Hot Issue Processing] (/troubleshoot-hot-spot-issues.md) to eliminate the related I/O hotspot situation.
+2. When it is confirmed that the overall I/O has reached the bottleneck, and the ability to judge the I/O from the business side will continue to keep up, then you can take advantage of the distributed database's scale capability and adopt the scheme of expanding the number of TiKV nodes to obtain greater overall I/O throughput.
+3. Adjust some of the parameters in the above description, and use computing/memory resources in exchange for disk storage resources.

From e5a5a578cf210fe4367d263b852c8c5db40efebe Mon Sep 17 00:00:00 2001
From: King-Dylan <50897894+King-Dylan@users.noreply.github.com>
Date: Thu, 2 Jul 2020 16:50:31 +0800
Subject: [PATCH 02/52] Update troubleshoot-high-disk-io.md

Co-authored-by: TomShawn <41534398+TomShawn@users.noreply.github.com>
---
 troubleshoot-high-disk-io.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/troubleshoot-high-disk-io.md b/troubleshoot-high-disk-io.md
index 20a3f7a256510..12f23288a1e5a 100644
--- a/troubleshoot-high-disk-io.md
+++ b/troubleshoot-high-disk-io.md
@@ -1,5 +1,5 @@
 ---
-title: TiDB Disk I/O Excessive Treatment
+title: Troubleshoot High Disk I/O Usage in TiDB
 summary: Learn how to locate and deal with the problem of high TiDB storage I/O.
 category: reference
 ---

From ff2ee920f1ec07b41ccd1a3ecb3d05c9c5b1cef2 Mon Sep 17 00:00:00 2001
From: King-Dylan <50897894+King-Dylan@users.noreply.github.com>
Date: Thu, 2 Jul 2020 16:50:42 +0800
Subject: [PATCH 03/52] Update troubleshoot-high-disk-io.md

Co-authored-by: TomShawn <41534398+TomShawn@users.noreply.github.com>
---
 troubleshoot-high-disk-io.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/troubleshoot-high-disk-io.md b/troubleshoot-high-disk-io.md
index 12f23288a1e5a..70a5a674e7885 100644
--- a/troubleshoot-high-disk-io.md
+++ b/troubleshoot-high-disk-io.md
@@ -1,6 +1,6 @@
 ---
 title: Troubleshoot High Disk I/O Usage in TiDB
-summary: Learn how to locate and deal with the problem of high TiDB storage I/O.
+summary: Learn how to locate and address the issue of high TiDB storage I/O usage.
category: reference --- From 00898dea6703383c5478dde09d131250c2392428 Mon Sep 17 00:00:00 2001 From: King-Dylan <50897894+King-Dylan@users.noreply.github.com> Date: Thu, 2 Jul 2020 16:50:50 +0800 Subject: [PATCH 04/52] Update troubleshoot-high-disk-io.md Co-authored-by: TomShawn <41534398+TomShawn@users.noreply.github.com> --- troubleshoot-high-disk-io.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/troubleshoot-high-disk-io.md b/troubleshoot-high-disk-io.md index 70a5a674e7885..06cce922453ec 100644 --- a/troubleshoot-high-disk-io.md +++ b/troubleshoot-high-disk-io.md @@ -8,7 +8,7 @@ category: reference This article mainly introduces how to locate and deal with the problem of high TiDB disk I/O usage. -## Confirm the current I/O indicators +## Check the current I/O metrics When the system response slows down, if the bottleneck of the CPU and the bottleneck of data transaction conflicts have been investigated, you need to start with I/O indicators to help determine the current system bottleneck. From 1d2f95e39edb3dec08e92c73652f65c015d91a99 Mon Sep 17 00:00:00 2001 From: King-Dylan <50897894+King-Dylan@users.noreply.github.com> Date: Thu, 2 Jul 2020 16:51:02 +0800 Subject: [PATCH 05/52] Update troubleshoot-high-disk-io.md Co-authored-by: TomShawn <41534398+TomShawn@users.noreply.github.com> --- troubleshoot-high-disk-io.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/troubleshoot-high-disk-io.md b/troubleshoot-high-disk-io.md index 06cce922453ec..ce9af691381c7 100644 --- a/troubleshoot-high-disk-io.md +++ b/troubleshoot-high-disk-io.md @@ -6,7 +6,7 @@ category: reference # The processing method of TiDB disk io usage is too high -This article mainly introduces how to locate and deal with the problem of high TiDB disk I/O usage. +This document introduces how to locate and address the issue of high disk I/O usage in TiDB. ## Check the current I/O metrics From 33dc1cc5ca27441123a8fada98c1c37ac14aa938 Mon Sep 17 00:00:00 2001 From: King-Dylan <50897894+King-Dylan@users.noreply.github.com> Date: Thu, 2 Jul 2020 16:51:10 +0800 Subject: [PATCH 06/52] Update troubleshoot-high-disk-io.md Co-authored-by: TomShawn <41534398+TomShawn@users.noreply.github.com> --- troubleshoot-high-disk-io.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/troubleshoot-high-disk-io.md b/troubleshoot-high-disk-io.md index ce9af691381c7..1e21200059835 100644 --- a/troubleshoot-high-disk-io.md +++ b/troubleshoot-high-disk-io.md @@ -57,7 +57,7 @@ In addition, some other content may be needed to help locate whether the bottlen - The `apply-pool-size` configuration of `[raftstore]` is too small (recommended between [1, 5], not recommended to be too large), Thread CPU/apply cpu is relatively high; - The machine's CPU resources are not enough. - - Region write hotspot issue, the CPU usage of a single apply thread is relatively high (by modifying the Grafana expression, plus by (instance, name) to see the CPU usage of each thread), temporarily for the hot write of a single Region is no solution, this scene is being optimized recently. + - Write hotspot issue of a single Region (Currently, the solution to this issue is still on the way). The CPU usage of a single `apply` thread is high (which can be viewed by modifying the Grafana expression, appended with `by (instance, name)`). - It is slow to write RocksDB, and the RocksDB kv/max write duration is relatively high (a single Raft log may contain many kvs. 
When writing RocksDB, 128 kvs will be written to RocksDB in a batch write, so one apply log may involve multiple RocksDB writes). - In other cases, bugs need to be reported. From 795827ad21c1b891b80d80997f40e46c3f6e2f1f Mon Sep 17 00:00:00 2001 From: King-Dylan <50897894+King-Dylan@users.noreply.github.com> Date: Thu, 2 Jul 2020 16:51:26 +0800 Subject: [PATCH 07/52] Update troubleshoot-high-disk-io.md Co-authored-by: TomShawn <41534398+TomShawn@users.noreply.github.com> --- troubleshoot-high-disk-io.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/troubleshoot-high-disk-io.md b/troubleshoot-high-disk-io.md index 1e21200059835..14172b69b344b 100644 --- a/troubleshoot-high-disk-io.md +++ b/troubleshoot-high-disk-io.md @@ -58,7 +58,7 @@ In addition, some other content may be needed to help locate whether the bottlen - The `apply-pool-size` configuration of `[raftstore]` is too small (recommended between [1, 5], not recommended to be too large), Thread CPU/apply cpu is relatively high; - The machine's CPU resources are not enough. - Write hotspot issue of a single Region (Currently, the solution to this issue is still on the way). The CPU usage of a single `apply` thread is high (which can be viewed by modifying the Grafana expression, appended with `by (instance, name)`). - - It is slow to write RocksDB, and the RocksDB kv/max write duration is relatively high (a single Raft log may contain many kvs. When writing RocksDB, 128 kvs will be written to RocksDB in a batch write, so one apply log may involve multiple RocksDB writes). + - Slow write into RocksDB, and `RocksDB kv`/`max write duration` is high. A single Raft log might contain multiple key-value pairs (kv). 128 kvs are written to RocksDB in a batch, so one `apply` log might involve multiple RocksDB writes. - In other cases, bugs need to be reported. - Raft commit log is slow. TiKV Grafana's Raft I/O and commit log duration are relatively high (this metric is only available in Grafana 4.x). Each Region corresponds to an independent Raft group. Raft has a flow control mechanism, similar to the sliding window mechanism of TCP, through the parameter [raftstore] raft-max-inflight-msgs = 256 to control the size of the sliding window, if there is a hot spot Write and commit log duration is relatively high, you can moderately change the parameters, such as 1024. From 0dec9a0bd00e87b30632a91981e3a32ea3a99e3c Mon Sep 17 00:00:00 2001 From: King-Dylan <50897894+King-Dylan@users.noreply.github.com> Date: Thu, 2 Jul 2020 16:51:34 +0800 Subject: [PATCH 08/52] Update troubleshoot-high-disk-io.md Co-authored-by: TomShawn <41534398+TomShawn@users.noreply.github.com> --- troubleshoot-high-disk-io.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/troubleshoot-high-disk-io.md b/troubleshoot-high-disk-io.md index 14172b69b344b..62a162ee8ddd0 100644 --- a/troubleshoot-high-disk-io.md +++ b/troubleshoot-high-disk-io.md @@ -56,7 +56,7 @@ In addition, some other content may be needed to help locate whether the bottlen - Apply log is slow. TiKV Grafana's Raft I/O and apply log duration are relatively high, usually accompanied by a relatively high Raft Propose/apply wait duration. The possible situations are as follows: - The `apply-pool-size` configuration of `[raftstore]` is too small (recommended between [1, 5], not recommended to be too large), Thread CPU/apply cpu is relatively high; - - The machine's CPU resources are not enough. + - Insufficient CPU resources on the machine. 
- Write hotspot issue of a single Region (Currently, the solution to this issue is still on the way). The CPU usage of a single `apply` thread is high (which can be viewed by modifying the Grafana expression, appended with `by (instance, name)`). - Slow write into RocksDB, and `RocksDB kv`/`max write duration` is high. A single Raft log might contain multiple key-value pairs (kv). 128 kvs are written to RocksDB in a batch, so one `apply` log might involve multiple RocksDB writes. - In other cases, bugs need to be reported. From db0d36ed1356f2f9dc4f67b298aad3b8e2618900 Mon Sep 17 00:00:00 2001 From: King-Dylan <50897894+King-Dylan@users.noreply.github.com> Date: Thu, 2 Jul 2020 16:51:46 +0800 Subject: [PATCH 09/52] Update troubleshoot-high-disk-io.md Co-authored-by: TomShawn <41534398+TomShawn@users.noreply.github.com> --- troubleshoot-high-disk-io.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/troubleshoot-high-disk-io.md b/troubleshoot-high-disk-io.md index 62a162ee8ddd0..0f79655afe7ec 100644 --- a/troubleshoot-high-disk-io.md +++ b/troubleshoot-high-disk-io.md @@ -61,7 +61,7 @@ In addition, some other content may be needed to help locate whether the bottlen - Slow write into RocksDB, and `RocksDB kv`/`max write duration` is high. A single Raft log might contain multiple key-value pairs (kv). 128 kvs are written to RocksDB in a batch, so one `apply` log might involve multiple RocksDB writes. - In other cases, bugs need to be reported. -- Raft commit log is slow. TiKV Grafana's Raft I/O and commit log duration are relatively high (this metric is only available in Grafana 4.x). Each Region corresponds to an independent Raft group. Raft has a flow control mechanism, similar to the sliding window mechanism of TCP, through the parameter [raftstore] raft-max-inflight-msgs = 256 to control the size of the sliding window, if there is a hot spot Write and commit log duration is relatively high, you can moderately change the parameters, such as 1024. +- `raft commit log` is slow. In TiKV Grafana, `Raft I/O` and `commit log duration` (only available in Grafana 4.x) metrics are relatively high. Each Region corresponds to an independent Raft group. Raft has a flow control mechanism similar to the sliding window mechanism of TCP. To control the size of a sliding window, adjust the `[raftstore] raft-max-inflight-msgs` parameter. if there is a write hotspot and `commit log duration` is high, you can properly set this parameter to a larger value, such as `1024`. ### Locate I/O problems from log From 98c513248654784a6f1069b83f2c99f7fb19d021 Mon Sep 17 00:00:00 2001 From: King-Dylan <50897894+King-Dylan@users.noreply.github.com> Date: Thu, 2 Jul 2020 16:51:53 +0800 Subject: [PATCH 10/52] Update troubleshoot-high-disk-io.md Co-authored-by: TomShawn <41534398+TomShawn@users.noreply.github.com> --- troubleshoot-high-disk-io.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/troubleshoot-high-disk-io.md b/troubleshoot-high-disk-io.md index 0f79655afe7ec..4a98b7a05e9f2 100644 --- a/troubleshoot-high-disk-io.md +++ b/troubleshoot-high-disk-io.md @@ -59,7 +59,7 @@ In addition, some other content may be needed to help locate whether the bottlen - Insufficient CPU resources on the machine. - Write hotspot issue of a single Region (Currently, the solution to this issue is still on the way). The CPU usage of a single `apply` thread is high (which can be viewed by modifying the Grafana expression, appended with `by (instance, name)`). 
- Slow write into RocksDB, and `RocksDB kv`/`max write duration` is high. A single Raft log might contain multiple key-value pairs (kv). 128 kvs are written to RocksDB in a batch, so one `apply` log might involve multiple RocksDB writes. - - In other cases, bugs need to be reported. + - For other causes, report them as bugs. - `raft commit log` is slow. In TiKV Grafana, `Raft I/O` and `commit log duration` (only available in Grafana 4.x) metrics are relatively high. Each Region corresponds to an independent Raft group. Raft has a flow control mechanism similar to the sliding window mechanism of TCP. To control the size of a sliding window, adjust the `[raftstore] raft-max-inflight-msgs` parameter. if there is a write hotspot and `commit log duration` is high, you can properly set this parameter to a larger value, such as `1024`. From 000e62357159c4d5fffd2b35daa99775a9680e57 Mon Sep 17 00:00:00 2001 From: King-Dylan <50897894+King-Dylan@users.noreply.github.com> Date: Thu, 2 Jul 2020 16:52:06 +0800 Subject: [PATCH 11/52] Update troubleshoot-high-disk-io.md Co-authored-by: TomShawn <41534398+TomShawn@users.noreply.github.com> --- troubleshoot-high-disk-io.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/troubleshoot-high-disk-io.md b/troubleshoot-high-disk-io.md index 4a98b7a05e9f2..6aa0323cf542e 100644 --- a/troubleshoot-high-disk-io.md +++ b/troubleshoot-high-disk-io.md @@ -4,7 +4,7 @@ summary: Learn how to locate and address the issue of high TiDB storage I/O usag category: reference --- -# The processing method of TiDB disk io usage is too high +# Troubleshoot High Disk I/O Usage in TiDB This document introduces how to locate and address the issue of high disk I/O usage in TiDB. From 81e9bd3cd6d540a91a6d67d021e0ea6868a91e4b Mon Sep 17 00:00:00 2001 From: King-Dylan <50897894+King-Dylan@users.noreply.github.com> Date: Thu, 2 Jul 2020 16:52:19 +0800 Subject: [PATCH 12/52] Update troubleshoot-high-disk-io.md Co-authored-by: TomShawn <41534398+TomShawn@users.noreply.github.com> --- troubleshoot-high-disk-io.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/troubleshoot-high-disk-io.md b/troubleshoot-high-disk-io.md index 6aa0323cf542e..7121c3703d1ed 100644 --- a/troubleshoot-high-disk-io.md +++ b/troubleshoot-high-disk-io.md @@ -10,7 +10,7 @@ This document introduces how to locate and address the issue of high disk I/O us ## Check the current I/O metrics -When the system response slows down, if the bottleneck of the CPU and the bottleneck of data transaction conflicts have been investigated, you need to start with I/O indicators to help determine the current system bottleneck. +If TiDB's response slows down after you have troubleshot the CPU bottleneck and the bottleneck caused by transaction conflicts, you need to check I/O metrics to help determine the current system bottleneck. 
### Locate I/O problems from monitor From a19112db15bd5b5d18338660862489c39ba84846 Mon Sep 17 00:00:00 2001 From: King-Dylan <50897894+King-Dylan@users.noreply.github.com> Date: Thu, 2 Jul 2020 16:53:27 +0800 Subject: [PATCH 13/52] Update troubleshoot-high-disk-io.md Co-authored-by: TomShawn <41534398+TomShawn@users.noreply.github.com> --- troubleshoot-high-disk-io.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/troubleshoot-high-disk-io.md b/troubleshoot-high-disk-io.md index 7121c3703d1ed..d758c3854188a 100644 --- a/troubleshoot-high-disk-io.md +++ b/troubleshoot-high-disk-io.md @@ -12,7 +12,7 @@ This document introduces how to locate and address the issue of high disk I/O us If TiDB's response slows down after you have troubleshot the CPU bottleneck and the bottleneck caused by transaction conflicts, you need to check I/O metrics to help determine the current system bottleneck. -### Locate I/O problems from monitor +### Locate I/O issues from monitor The fastest position method is to view the overall I/O situation from the monitor. You can view the correspond I/O monitor from the Grafana monitor component, which is deployed by the default cluster deployment tool (TiDB-Ansible, TiUP). The Dashboard relate to I/O has `Overview`, `Node_exporter`, `Disk-Performance`. From 55a145c7a618c9e73b081eac4b710ef764591df8 Mon Sep 17 00:00:00 2001 From: King-Dylan <50897894+King-Dylan@users.noreply.github.com> Date: Thu, 2 Jul 2020 16:53:42 +0800 Subject: [PATCH 14/52] Update troubleshoot-high-disk-io.md Co-authored-by: TomShawn <41534398+TomShawn@users.noreply.github.com> --- troubleshoot-high-disk-io.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/troubleshoot-high-disk-io.md b/troubleshoot-high-disk-io.md index d758c3854188a..47e9fe08bf65b 100644 --- a/troubleshoot-high-disk-io.md +++ b/troubleshoot-high-disk-io.md @@ -18,7 +18,7 @@ The fastest position method is to view the overall I/O situation from the monito #### The first type of panel -In `Overview`> `System Info`> `IO Util`, you can see the I/O status of each machine in the cluster. This indicator is similar to util in Linux iostat monitor. The higher percentage represents the higher disk I/O usage: +In `Overview`> `System Info`> `IO Util`, you can see the I/O status of each machine in the cluster. This metric is similar to `util` in the Linux `iostat` monitor. The higher percentage represents higher disk I/O usage: - If there is only one machine with high I/O in the monitor, it can assist in judging that there are currently reading and writing hot spots. - If the I/O of most machines in the monitor is high, then the cluster now has a high I/O load. From 6f7f2b840264220ead694415b0ecdaf7326425c7 Mon Sep 17 00:00:00 2001 From: King-Dylan <50897894+King-Dylan@users.noreply.github.com> Date: Thu, 2 Jul 2020 16:53:56 +0800 Subject: [PATCH 15/52] Update troubleshoot-high-disk-io.md Co-authored-by: TomShawn <41534398+TomShawn@users.noreply.github.com> --- troubleshoot-high-disk-io.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/troubleshoot-high-disk-io.md b/troubleshoot-high-disk-io.md index 47e9fe08bf65b..ae09008484a9c 100644 --- a/troubleshoot-high-disk-io.md +++ b/troubleshoot-high-disk-io.md @@ -14,7 +14,7 @@ If TiDB's response slows down after you have troubleshot the CPU bottleneck and ### Locate I/O issues from monitor -The fastest position method is to view the overall I/O situation from the monitor. 
You can view the correspond I/O monitor from the Grafana monitor component, which is deployed by the default cluster deployment tool (TiDB-Ansible, TiUP). The Dashboard relate to I/O has `Overview`, `Node_exporter`, `Disk-Performance`. +The quickest way to locate I/O issues is to view the overall I/O status from the monitor, such as the Grafana dashboard which is deployed by default by TiDB Ansible and TiUP. The dashboard panels related to I/O include **Overview**, **Node_exporter**, **Disk-Performance**. #### The first type of panel From 684446c4e310c7a4aa8c7d877f7ca4233fd862b2 Mon Sep 17 00:00:00 2001 From: King-Dylan <50897894+King-Dylan@users.noreply.github.com> Date: Thu, 2 Jul 2020 16:54:10 +0800 Subject: [PATCH 16/52] Update troubleshoot-high-disk-io.md Co-authored-by: TomShawn <41534398+TomShawn@users.noreply.github.com> --- troubleshoot-high-disk-io.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/troubleshoot-high-disk-io.md b/troubleshoot-high-disk-io.md index ae09008484a9c..10bcd377f6f8b 100644 --- a/troubleshoot-high-disk-io.md +++ b/troubleshoot-high-disk-io.md @@ -20,7 +20,7 @@ The quickest way to locate I/O issues is to view the overall I/O status from th In `Overview`> `System Info`> `IO Util`, you can see the I/O status of each machine in the cluster. This metric is similar to `util` in the Linux `iostat` monitor. The higher percentage represents higher disk I/O usage: -- If there is only one machine with high I/O in the monitor, it can assist in judging that there are currently reading and writing hot spots. +- If there is only one machine with high I/O usage in the monitor, currently there might be read and write hotspots on this machine. - If the I/O of most machines in the monitor is high, then the cluster now has a high I/O load. If you find that the I/O of a certain machine is relatively high, you can further monitor the use of I/O from monitor `Disk-Performance Dashboard`, combined with metrics such as `Disk Latency` and `Disk Load` to determine whether there is an abnormality, and if necessary use the fio tool to test the disk. From ce9c7c89f6a123983f0765429b9cc121602d75bc Mon Sep 17 00:00:00 2001 From: King-Dylan <50897894+King-Dylan@users.noreply.github.com> Date: Thu, 2 Jul 2020 16:54:22 +0800 Subject: [PATCH 17/52] Update troubleshoot-high-disk-io.md Co-authored-by: TomShawn <41534398+TomShawn@users.noreply.github.com> --- troubleshoot-high-disk-io.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/troubleshoot-high-disk-io.md b/troubleshoot-high-disk-io.md index 10bcd377f6f8b..7381d0ee8a0dd 100644 --- a/troubleshoot-high-disk-io.md +++ b/troubleshoot-high-disk-io.md @@ -21,7 +21,7 @@ The quickest way to locate I/O issues is to view the overall I/O status from th In `Overview`> `System Info`> `IO Util`, you can see the I/O status of each machine in the cluster. This metric is similar to `util` in the Linux `iostat` monitor. The higher percentage represents higher disk I/O usage: - If there is only one machine with high I/O usage in the monitor, currently there might be read and write hotspots on this machine. -- If the I/O of most machines in the monitor is high, then the cluster now has a high I/O load. +- If the I/O usage of most machines in the monitor is high, the cluster now has high I/O loads. 
If you find that the I/O of a certain machine is relatively high, you can further monitor the use of I/O from monitor `Disk-Performance Dashboard`, combined with metrics such as `Disk Latency` and `Disk Load` to determine whether there is an abnormality, and if necessary use the fio tool to test the disk. From df83f536c735d468b95c43817c0e947cce7a7643 Mon Sep 17 00:00:00 2001 From: King-Dylan <50897894+King-Dylan@users.noreply.github.com> Date: Thu, 2 Jul 2020 16:54:35 +0800 Subject: [PATCH 18/52] Update troubleshoot-high-disk-io.md Co-authored-by: TomShawn <41534398+TomShawn@users.noreply.github.com> --- troubleshoot-high-disk-io.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/troubleshoot-high-disk-io.md b/troubleshoot-high-disk-io.md index 7381d0ee8a0dd..a9f7dc21f23af 100644 --- a/troubleshoot-high-disk-io.md +++ b/troubleshoot-high-disk-io.md @@ -23,7 +23,7 @@ In `Overview`> `System Info`> `IO Util`, you can see the I/O status of each mach - If there is only one machine with high I/O usage in the monitor, currently there might be read and write hotspots on this machine. - If the I/O usage of most machines in the monitor is high, the cluster now has high I/O loads. -If you find that the I/O of a certain machine is relatively high, you can further monitor the use of I/O from monitor `Disk-Performance Dashboard`, combined with metrics such as `Disk Latency` and `Disk Load` to determine whether there is an abnormality, and if necessary use the fio tool to test the disk. +For the first situation above (only one machine with high I/O usage), you can further observe I/O metrics from the **Disk-Performance Dashboard** such as `Disk Latency` and `Disk Load` to determine whether any anomaly exists. If necessary, use the fio tool to check the disk. #### The second type of panel From 9041f374fb15da09587b51ed4d13273bbaaf22a5 Mon Sep 17 00:00:00 2001 From: King-Dylan <50897894+King-Dylan@users.noreply.github.com> Date: Thu, 2 Jul 2020 16:54:56 +0800 Subject: [PATCH 19/52] Update troubleshoot-high-disk-io.md Co-authored-by: TomShawn <41534398+TomShawn@users.noreply.github.com> --- troubleshoot-high-disk-io.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/troubleshoot-high-disk-io.md b/troubleshoot-high-disk-io.md index a9f7dc21f23af..c3567d57b7a25 100644 --- a/troubleshoot-high-disk-io.md +++ b/troubleshoot-high-disk-io.md @@ -47,7 +47,7 @@ In `TiKV-Details`> `Storage`, there are monitor related to storage: In addition, some other content may be needed to help locate whether the bottleneck is I/O, and you can try to set some recommended parameters. By checking the prewrite/commit/raw-put of TiKV gRPC (raw kv cluster only) duration, it is confirmed that TiKV write is indeed slow. Several common situations are as follows: -- Append log is slow. TiKV Grafana's Raft I/O and append log duration are relatively high. Usually, it is due to slow disk writing. You can check the value of WAL Sync Duration max of RocksDB-raft to confirm, otherwise you may need to report a bug. +- `append log` is slow. TiKV Grafana's `Raft I/O` and `append log duration` metrics are relatively high, which is often due to slow disk writes. You can check the value of `WAL Sync Duration max` in **RocksDB-raft** to determine the cause of slow `append log`. Otherwise, you might need to report a bug. - The raftstore thread is busy. TiKV grafana's Raft Propose/propose wait duration is significantly higher than append log duration. 
Please check the following two points: - Is the `store-pool-size` configuration of `[raftstore]` too small (this value is recommended between [1,5], not too large). From 01035dcdd1462973c2ef9d8dcdd5c49ffa5e93a4 Mon Sep 17 00:00:00 2001 From: King-Dylan <50897894+King-Dylan@users.noreply.github.com> Date: Thu, 2 Jul 2020 16:55:12 +0800 Subject: [PATCH 20/52] Update troubleshoot-high-disk-io.md Co-authored-by: TomShawn <41534398+TomShawn@users.noreply.github.com> --- troubleshoot-high-disk-io.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/troubleshoot-high-disk-io.md b/troubleshoot-high-disk-io.md index c3567d57b7a25..0bec2678a5098 100644 --- a/troubleshoot-high-disk-io.md +++ b/troubleshoot-high-disk-io.md @@ -16,7 +16,7 @@ If TiDB's response slows down after you have troubleshot the CPU bottleneck and The quickest way to locate I/O issues is to view the overall I/O status from the monitor, such as the Grafana dashboard which is deployed by default by TiDB Ansible and TiUP. The dashboard panels related to I/O include **Overview**, **Node_exporter**, **Disk-Performance**. -#### The first type of panel +#### The first type of monitoring panels In `Overview`> `System Info`> `IO Util`, you can see the I/O status of each machine in the cluster. This metric is similar to `util` in the Linux `iostat` monitor. The higher percentage represents higher disk I/O usage: From 1bf17f670cad3e2bdc24ffb4a4582547de5a7bb2 Mon Sep 17 00:00:00 2001 From: King-Dylan <50897894+King-Dylan@users.noreply.github.com> Date: Thu, 2 Jul 2020 16:55:28 +0800 Subject: [PATCH 21/52] Update troubleshoot-high-disk-io.md Co-authored-by: TomShawn <41534398+TomShawn@users.noreply.github.com> --- troubleshoot-high-disk-io.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/troubleshoot-high-disk-io.md b/troubleshoot-high-disk-io.md index 0bec2678a5098..698d1a7d4b532 100644 --- a/troubleshoot-high-disk-io.md +++ b/troubleshoot-high-disk-io.md @@ -45,7 +45,7 @@ In `TiKV-Details`> `Storage`, there are monitor related to storage: #### Other panels -In addition, some other content may be needed to help locate whether the bottleneck is I/O, and you can try to set some recommended parameters. By checking the prewrite/commit/raw-put of TiKV gRPC (raw kv cluster only) duration, it is confirmed that TiKV write is indeed slow. Several common situations are as follows: +In addition, some other panel metrics might help you determine whether the bottleneck is I/O, and you can try to set some parameters. By checking the prewrite/commit/raw-put (for raw key-value clusters only) of TiKV gRPC duration, you can determine that the bottleneck is indeed the slow TiKV write. The common situations of slow TiKV writes are as follows: - `append log` is slow. TiKV Grafana's `Raft I/O` and `append log duration` metrics are relatively high, which is often due to slow disk writes. You can check the value of `WAL Sync Duration max` in **RocksDB-raft** to determine the cause of slow `append log`. Otherwise, you might need to report a bug. - The raftstore thread is busy. TiKV grafana's Raft Propose/propose wait duration is significantly higher than append log duration. 
Please check the following two points: From be98f318d705b0d018cbdcd5998bba49bff828ce Mon Sep 17 00:00:00 2001 From: King-Dylan <50897894+King-Dylan@users.noreply.github.com> Date: Thu, 2 Jul 2020 16:55:44 +0800 Subject: [PATCH 22/52] Update troubleshoot-high-disk-io.md Co-authored-by: TomShawn <41534398+TomShawn@users.noreply.github.com> --- troubleshoot-high-disk-io.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/troubleshoot-high-disk-io.md b/troubleshoot-high-disk-io.md index 698d1a7d4b532..9a7de531e9dbf 100644 --- a/troubleshoot-high-disk-io.md +++ b/troubleshoot-high-disk-io.md @@ -25,7 +25,7 @@ In `Overview`> `System Info`> `IO Util`, you can see the I/O status of each mach For the first situation above (only one machine with high I/O usage), you can further observe I/O metrics from the **Disk-Performance Dashboard** such as `Disk Latency` and `Disk Load` to determine whether any anomaly exists. If necessary, use the fio tool to check the disk. -#### The second type of panel +#### The second type of monitoring panels The main persistence component of the TiDB cluster is TiKV cluster. One TiKV instance contains two RocksDB instances: one for storing Raft logs, located in data/raft, and one for storing real data, located in data/db. From 1d238efb720365f6d5093641adb004f8d0b0ab4c Mon Sep 17 00:00:00 2001 From: King-Dylan <50897894+King-Dylan@users.noreply.github.com> Date: Thu, 2 Jul 2020 16:55:57 +0800 Subject: [PATCH 23/52] Update troubleshoot-high-disk-io.md Co-authored-by: TomShawn <41534398+TomShawn@users.noreply.github.com> --- troubleshoot-high-disk-io.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/troubleshoot-high-disk-io.md b/troubleshoot-high-disk-io.md index 9a7de531e9dbf..febefab6719a9 100644 --- a/troubleshoot-high-disk-io.md +++ b/troubleshoot-high-disk-io.md @@ -27,7 +27,7 @@ For the first situation above (only one machine with high I/O usage), you can fu #### The second type of monitoring panels -The main persistence component of the TiDB cluster is TiKV cluster. One TiKV instance contains two RocksDB instances: one for storing Raft logs, located in data/raft, and one for storing real data, located in data/db. +The main storage component of the TiDB cluster is TiKV. One TiKV instance contains two RocksDB instances: one for storing Raft logs, located in `data/raft`, and the other for storing real data, located in `data/db`. In `TiKV-Details`> `Raft IO`, you can see the relevant metrics for disk writes of these two instances: From 2d1e3e31cd563211ff3654ec47a5baa5cc59458c Mon Sep 17 00:00:00 2001 From: King-Dylan <50897894+King-Dylan@users.noreply.github.com> Date: Thu, 2 Jul 2020 16:56:12 +0800 Subject: [PATCH 24/52] Update troubleshoot-high-disk-io.md Co-authored-by: TomShawn <41534398+TomShawn@users.noreply.github.com> --- troubleshoot-high-disk-io.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/troubleshoot-high-disk-io.md b/troubleshoot-high-disk-io.md index febefab6719a9..dc4d2e647c391 100644 --- a/troubleshoot-high-disk-io.md +++ b/troubleshoot-high-disk-io.md @@ -29,7 +29,7 @@ For the first situation above (only one machine with high I/O usage), you can fu The main storage component of the TiDB cluster is TiKV. One TiKV instance contains two RocksDB instances: one for storing Raft logs, located in `data/raft`, and the other for storing real data, located in `data/db`. 
-In `TiKV-Details`> `Raft IO`, you can see the relevant metrics for disk writes of these two instances: +In **TiKV-Details** > **Raft IO**, you can see the metrics related to disk writes of these two instances: - `Append log duration`: This monitor indicates the response time of RocksDB writes that store Raft logs. The .99 response should be within 50ms. - `Apply log duration`: This monitor indicates the response time for RocksDB writes that store real data. The .99 response should be within 100ms. From a0acb80081d8318982d383bd6b8c0130887f93f5 Mon Sep 17 00:00:00 2001 From: King-Dylan <50897894+King-Dylan@users.noreply.github.com> Date: Thu, 2 Jul 2020 16:56:26 +0800 Subject: [PATCH 25/52] Update troubleshoot-high-disk-io.md Co-authored-by: TomShawn <41534398+TomShawn@users.noreply.github.com> --- troubleshoot-high-disk-io.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/troubleshoot-high-disk-io.md b/troubleshoot-high-disk-io.md index dc4d2e647c391..e8835080d6503 100644 --- a/troubleshoot-high-disk-io.md +++ b/troubleshoot-high-disk-io.md @@ -31,7 +31,7 @@ The main storage component of the TiDB cluster is TiKV. One TiKV instance contai In **TiKV-Details** > **Raft IO**, you can see the metrics related to disk writes of these two instances: -- `Append log duration`: This monitor indicates the response time of RocksDB writes that store Raft logs. The .99 response should be within 50ms. +- `Append log duration`: This metric indicates the response time of writes into RockDB that stores Raft logs. The `.99` response time should be within 50 ms. - `Apply log duration`: This monitor indicates the response time for RocksDB writes that store real data. The .99 response should be within 100ms. These two monitors also have `.. per server` monitor panels to provide assistance to view hotspot writes. From 12dff5a42437a67d819a2b8aec66b036157a4a15 Mon Sep 17 00:00:00 2001 From: King-Dylan <50897894+King-Dylan@users.noreply.github.com> Date: Thu, 2 Jul 2020 16:56:45 +0800 Subject: [PATCH 26/52] Update troubleshoot-high-disk-io.md Co-authored-by: TomShawn <41534398+TomShawn@users.noreply.github.com> --- troubleshoot-high-disk-io.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/troubleshoot-high-disk-io.md b/troubleshoot-high-disk-io.md index e8835080d6503..abb60da870d25 100644 --- a/troubleshoot-high-disk-io.md +++ b/troubleshoot-high-disk-io.md @@ -41,7 +41,7 @@ These two monitors also have `.. per server` monitor panels to provide assistanc In `TiKV-Details`> `Storage`, there are monitor related to storage: - `Storage command total`: the number of different commands received. -- `Storage async write duration`: includes monitor items such as disk sync duration, which may be related to Raft IO. If you encounter an abnormal situation, you need to check the working status of related components by logs. +- `Storage async write duration`: Includes monitoring metrics such as `disk sync duration`, which might be related to Raft I/O. If you encounter an abnormal situation, check the working statuses of related components by checking logs. 
#### Other panels From d792ad3f6e06a8a040c354d92865a05aec7930e4 Mon Sep 17 00:00:00 2001 From: King-Dylan <50897894+King-Dylan@users.noreply.github.com> Date: Thu, 2 Jul 2020 16:57:00 +0800 Subject: [PATCH 27/52] Update troubleshoot-high-disk-io.md Co-authored-by: TomShawn <41534398+TomShawn@users.noreply.github.com> --- troubleshoot-high-disk-io.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/troubleshoot-high-disk-io.md b/troubleshoot-high-disk-io.md index abb60da870d25..cfb8643a39fb5 100644 --- a/troubleshoot-high-disk-io.md +++ b/troubleshoot-high-disk-io.md @@ -32,7 +32,7 @@ The main storage component of the TiDB cluster is TiKV. One TiKV instance contai In **TiKV-Details** > **Raft IO**, you can see the metrics related to disk writes of these two instances: - `Append log duration`: This metric indicates the response time of writes into RockDB that stores Raft logs. The `.99` response time should be within 50 ms. -- `Apply log duration`: This monitor indicates the response time for RocksDB writes that store real data. The .99 response should be within 100ms. +- `Apply log duration`: This metric indicates the response time of writes into RockDB that stores real data. The `.99` response should be within 100 ms. These two monitors also have `.. per server` monitor panels to provide assistance to view hotspot writes. From 52da55020e51e6376525c87a595073bd262eebbe Mon Sep 17 00:00:00 2001 From: King-Dylan <50897894+King-Dylan@users.noreply.github.com> Date: Thu, 2 Jul 2020 16:57:14 +0800 Subject: [PATCH 28/52] Update troubleshoot-high-disk-io.md Co-authored-by: TomShawn <41534398+TomShawn@users.noreply.github.com> --- troubleshoot-high-disk-io.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/troubleshoot-high-disk-io.md b/troubleshoot-high-disk-io.md index cfb8643a39fb5..4984079f06f28 100644 --- a/troubleshoot-high-disk-io.md +++ b/troubleshoot-high-disk-io.md @@ -34,7 +34,7 @@ In **TiKV-Details** > **Raft IO**, you can see the metrics related to disk write - `Append log duration`: This metric indicates the response time of writes into RockDB that stores Raft logs. The `.99` response time should be within 50 ms. - `Apply log duration`: This metric indicates the response time of writes into RockDB that stores real data. The `.99` response should be within 100 ms. -These two monitors also have `.. per server` monitor panels to provide assistance to view hotspot writes. +These two metrics also have the **.. per server** monitoring panel to help you view the write hotspots. #### The third type of panel From f6b1df51417622e6e6824f1ec0a8feacdfcee3cf Mon Sep 17 00:00:00 2001 From: King-Dylan <50897894+King-Dylan@users.noreply.github.com> Date: Thu, 2 Jul 2020 16:59:57 +0800 Subject: [PATCH 29/52] Update troubleshoot-high-disk-io.md Co-authored-by: TomShawn <41534398+TomShawn@users.noreply.github.com> --- troubleshoot-high-disk-io.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/troubleshoot-high-disk-io.md b/troubleshoot-high-disk-io.md index 4984079f06f28..8fd1ba3068c04 100644 --- a/troubleshoot-high-disk-io.md +++ b/troubleshoot-high-disk-io.md @@ -36,7 +36,7 @@ In **TiKV-Details** > **Raft IO**, you can see the metrics related to disk write These two metrics also have the **.. per server** monitoring panel to help you view the write hotspots. 
-#### The third type of panel +#### The third type of monitoring panels In `TiKV-Details`> `Storage`, there are monitor related to storage: From f1f2633898d054447e5da96efa0487ddfe9dd7cc Mon Sep 17 00:00:00 2001 From: King-Dylan <50897894+King-Dylan@users.noreply.github.com> Date: Thu, 2 Jul 2020 17:00:15 +0800 Subject: [PATCH 30/52] Update troubleshoot-high-disk-io.md Co-authored-by: TomShawn <41534398+TomShawn@users.noreply.github.com> --- troubleshoot-high-disk-io.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/troubleshoot-high-disk-io.md b/troubleshoot-high-disk-io.md index 8fd1ba3068c04..1aa853966c2d8 100644 --- a/troubleshoot-high-disk-io.md +++ b/troubleshoot-high-disk-io.md @@ -38,7 +38,7 @@ These two metrics also have the **.. per server** monitoring panel to help you v #### The third type of monitoring panels -In `TiKV-Details`> `Storage`, there are monitor related to storage: +In **TiKV-Details** > **Storage**, there are monitoring metrics related to storage: - `Storage command total`: the number of different commands received. - `Storage async write duration`: Includes monitoring metrics such as `disk sync duration`, which might be related to Raft I/O. If you encounter an abnormal situation, check the working statuses of related components by checking logs. From 3bdc1935b2d949ad9a3c7564ebbec495c2c9f840 Mon Sep 17 00:00:00 2001 From: King-Dylan <50897894+King-Dylan@users.noreply.github.com> Date: Thu, 2 Jul 2020 17:00:29 +0800 Subject: [PATCH 31/52] Update troubleshoot-high-disk-io.md Co-authored-by: TomShawn <41534398+TomShawn@users.noreply.github.com> --- troubleshoot-high-disk-io.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/troubleshoot-high-disk-io.md b/troubleshoot-high-disk-io.md index 1aa853966c2d8..5ce4957689381 100644 --- a/troubleshoot-high-disk-io.md +++ b/troubleshoot-high-disk-io.md @@ -40,7 +40,7 @@ These two metrics also have the **.. per server** monitoring panel to help you v In **TiKV-Details** > **Storage**, there are monitoring metrics related to storage: -- `Storage command total`: the number of different commands received. +- `Storage command total`: Indicates the number of different commands received. - `Storage async write duration`: Includes monitoring metrics such as `disk sync duration`, which might be related to Raft I/O. If you encounter an abnormal situation, check the working statuses of related components by checking logs. #### Other panels From 8b92873ea33bd0059e23793a7e1c6e70449094ed Mon Sep 17 00:00:00 2001 From: King-Dylan <50897894+King-Dylan@users.noreply.github.com> Date: Thu, 2 Jul 2020 17:01:55 +0800 Subject: [PATCH 32/52] Update troubleshoot-high-disk-io.md Co-authored-by: TomShawn <41534398+TomShawn@users.noreply.github.com> --- troubleshoot-high-disk-io.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/troubleshoot-high-disk-io.md b/troubleshoot-high-disk-io.md index 5ce4957689381..3c9324b962784 100644 --- a/troubleshoot-high-disk-io.md +++ b/troubleshoot-high-disk-io.md @@ -48,7 +48,7 @@ In **TiKV-Details** > **Storage**, there are monitoring metrics related to stora In addition, some other panel metrics might help you determine whether the bottleneck is I/O, and you can try to set some parameters. By checking the prewrite/commit/raw-put (for raw key-value clusters only) of TiKV gRPC duration, you can determine that the bottleneck is indeed the slow TiKV write. The common situations of slow TiKV writes are as follows: - `append log` is slow. 
TiKV Grafana's `Raft I/O` and `append log duration` metrics are relatively high, which is often due to slow disk writes. You can check the value of `WAL Sync Duration max` in **RocksDB-raft** to determine the cause of slow `append log`. Otherwise, you might need to report a bug. -- The raftstore thread is busy. TiKV grafana's Raft Propose/propose wait duration is significantly higher than append log duration. Please check the following two points: +- The `raftstore` thread is busy. In TiKV Grafana, `Raft Propose`/`propose wait duration` is significantly higher than `append log duration`. Please check the following aspects for troubleshooting: - Is the `store-pool-size` configuration of `[raftstore]` too small (this value is recommended between [1,5], not too large). - Is machine's CPU insufficient. From 0038773c85c0e47ebc2d27e9826f2feb624a5c71 Mon Sep 17 00:00:00 2001 From: King-Dylan <50897894+King-Dylan@users.noreply.github.com> Date: Thu, 2 Jul 2020 17:02:22 +0800 Subject: [PATCH 33/52] Update troubleshoot-high-disk-io.md Co-authored-by: TomShawn <41534398+TomShawn@users.noreply.github.com> --- troubleshoot-high-disk-io.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/troubleshoot-high-disk-io.md b/troubleshoot-high-disk-io.md index 3c9324b962784..e381a94d383ba 100644 --- a/troubleshoot-high-disk-io.md +++ b/troubleshoot-high-disk-io.md @@ -50,7 +50,7 @@ In addition, some other panel metrics might help you determine whether the bottl - `append log` is slow. TiKV Grafana's `Raft I/O` and `append log duration` metrics are relatively high, which is often due to slow disk writes. You can check the value of `WAL Sync Duration max` in **RocksDB-raft** to determine the cause of slow `append log`. Otherwise, you might need to report a bug. - The `raftstore` thread is busy. In TiKV Grafana, `Raft Propose`/`propose wait duration` is significantly higher than `append log duration`. Please check the following aspects for troubleshooting: - - Is the `store-pool-size` configuration of `[raftstore]` too small (this value is recommended between [1,5], not too large). + - Whether the value of `store-pool-size` of `[raftstore]` is too small. It is recommended to set this value between `[1,5]` and not too large. - Is machine's CPU insufficient. - Apply log is slow. TiKV Grafana's Raft I/O and apply log duration are relatively high, usually accompanied by a relatively high Raft Propose/apply wait duration. The possible situations are as follows: From 2271242c31a0dd0637788849577b3d9ff57b01da Mon Sep 17 00:00:00 2001 From: King-Dylan <50897894+King-Dylan@users.noreply.github.com> Date: Thu, 2 Jul 2020 17:02:41 +0800 Subject: [PATCH 34/52] Update troubleshoot-high-disk-io.md Co-authored-by: TomShawn <41534398+TomShawn@users.noreply.github.com> --- troubleshoot-high-disk-io.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/troubleshoot-high-disk-io.md b/troubleshoot-high-disk-io.md index e381a94d383ba..fc787e53ac691 100644 --- a/troubleshoot-high-disk-io.md +++ b/troubleshoot-high-disk-io.md @@ -51,7 +51,7 @@ In addition, some other panel metrics might help you determine whether the bottl - The `raftstore` thread is busy. In TiKV Grafana, `Raft Propose`/`propose wait duration` is significantly higher than `append log duration`. Please check the following aspects for troubleshooting: - Whether the value of `store-pool-size` of `[raftstore]` is too small. It is recommended to set this value between `[1,5]` and not too large. - - Is machine's CPU insufficient. 
+ - Whether the CPU resource of the machine is insufficient. - Apply log is slow. TiKV Grafana's Raft I/O and apply log duration are relatively high, usually accompanied by a relatively high Raft Propose/apply wait duration. The possible situations are as follows: From c0c55175f8766d5bb9005161752018f098e50555 Mon Sep 17 00:00:00 2001 From: King-Dylan <50897894+King-Dylan@users.noreply.github.com> Date: Thu, 2 Jul 2020 17:03:02 +0800 Subject: [PATCH 35/52] Update troubleshoot-high-disk-io.md Co-authored-by: TomShawn <41534398+TomShawn@users.noreply.github.com> --- troubleshoot-high-disk-io.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/troubleshoot-high-disk-io.md b/troubleshoot-high-disk-io.md index fc787e53ac691..13bb7be7a0475 100644 --- a/troubleshoot-high-disk-io.md +++ b/troubleshoot-high-disk-io.md @@ -53,7 +53,7 @@ In addition, some other panel metrics might help you determine whether the bottl - Whether the value of `store-pool-size` of `[raftstore]` is too small. It is recommended to set this value between `[1,5]` and not too large. - Whether the CPU resource of the machine is insufficient. -- Apply log is slow. TiKV Grafana's Raft I/O and apply log duration are relatively high, usually accompanied by a relatively high Raft Propose/apply wait duration. The possible situations are as follows: +- `append log` is slow. TiKV Grafana's `Raft I/O` and `append log duration` metrics are relatively high, which might usually occur along with relatively high `Raft Propose`/`apply wait duration`. The possible causes are as follows: - The `apply-pool-size` configuration of `[raftstore]` is too small (recommended between [1, 5], not recommended to be too large), Thread CPU/apply cpu is relatively high; - Insufficient CPU resources on the machine. From 34b9e22e634c87a8f466d83c1e2132f704399c77 Mon Sep 17 00:00:00 2001 From: King-Dylan <50897894+King-Dylan@users.noreply.github.com> Date: Thu, 2 Jul 2020 17:03:28 +0800 Subject: [PATCH 36/52] Update troubleshoot-high-disk-io.md Co-authored-by: TomShawn <41534398+TomShawn@users.noreply.github.com> --- troubleshoot-high-disk-io.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/troubleshoot-high-disk-io.md b/troubleshoot-high-disk-io.md index 13bb7be7a0475..b98b81f0ce96e 100644 --- a/troubleshoot-high-disk-io.md +++ b/troubleshoot-high-disk-io.md @@ -55,7 +55,7 @@ In addition, some other panel metrics might help you determine whether the bottl - `append log` is slow. TiKV Grafana's `Raft I/O` and `append log duration` metrics are relatively high, which might usually occur along with relatively high `Raft Propose`/`apply wait duration`. The possible causes are as follows: - - The `apply-pool-size` configuration of `[raftstore]` is too small (recommended between [1, 5], not recommended to be too large), Thread CPU/apply cpu is relatively high; + - The value of `apply-pool-size` of `[raftstore]` is too small. It is recommended to set this value between `[1, 5]` and not too large. The value of `Thread CPU`/`apply cpu` is also relatively high. - Insufficient CPU resources on the machine. - Write hotspot issue of a single Region (Currently, the solution to this issue is still on the way). The CPU usage of a single `apply` thread is high (which can be viewed by modifying the Grafana expression, appended with `by (instance, name)`). - Slow write into RocksDB, and `RocksDB kv`/`max write duration` is high. A single Raft log might contain multiple key-value pairs (kv). 
128 kvs are written to RocksDB in a batch, so one `apply` log might involve multiple RocksDB writes. From c5b137005d217295b5bd5d1c5614c8d0dafd7028 Mon Sep 17 00:00:00 2001 From: King-Dylan <50897894+King-Dylan@users.noreply.github.com> Date: Mon, 6 Jul 2020 14:46:26 +0800 Subject: [PATCH 37/52] Update troubleshoot-high-disk-io.md Co-authored-by: TomShawn <41534398+TomShawn@users.noreply.github.com> --- troubleshoot-high-disk-io.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/troubleshoot-high-disk-io.md b/troubleshoot-high-disk-io.md index b98b81f0ce96e..a1bc91c650dd8 100644 --- a/troubleshoot-high-disk-io.md +++ b/troubleshoot-high-disk-io.md @@ -63,7 +63,7 @@ In addition, some other panel metrics might help you determine whether the bottl - `raft commit log` is slow. In TiKV Grafana, `Raft I/O` and `commit log duration` (only available in Grafana 4.x) metrics are relatively high. Each Region corresponds to an independent Raft group. Raft has a flow control mechanism similar to the sliding window mechanism of TCP. To control the size of a sliding window, adjust the `[raftstore] raft-max-inflight-msgs` parameter. if there is a write hotspot and `commit log duration` is high, you can properly set this parameter to a larger value, such as `1024`. -### Locate I/O problems from log +### Locate I/O issues from log - If the client reports `server is busy` error, especially the error message of `raftstore is busy`, it will be related to I/O problem. From 2dc60a30cbfdccfbaa97c8958e707f2d1331f685 Mon Sep 17 00:00:00 2001 From: King-Dylan <50897894+King-Dylan@users.noreply.github.com> Date: Mon, 6 Jul 2020 14:46:38 +0800 Subject: [PATCH 38/52] Update troubleshoot-high-disk-io.md Co-authored-by: TomShawn <41534398+TomShawn@users.noreply.github.com> --- troubleshoot-high-disk-io.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/troubleshoot-high-disk-io.md b/troubleshoot-high-disk-io.md index a1bc91c650dd8..b2ffaf62e1334 100644 --- a/troubleshoot-high-disk-io.md +++ b/troubleshoot-high-disk-io.md @@ -92,4 +92,4 @@ The cluster deployment tool (TiDB-Ansible, TiUP) is an alarm component that is d 1. When it is confirmed as a I/O hotspot issue, you need to refer to [TiDB Hot Issue Processing] (/troubleshoot-hot-spot-issues.md) to eliminate the related I/O hotspot situation. 2. When it is confirmed that the overall I/O has reached the bottleneck, and the ability to judge the I/O from the business side will continue to keep up, then you can take advantage of the distributed database's scale capability and adopt the scheme of expanding the number of TiKV nodes to obtain greater overall I/O throughput. -3. Adjust some of the parameters in the above description, and use computing/memory resources in exchange for disk storage resources. ++ Adjust some of the parameters as described above, and use computing/memory resources to make up for disk storage resources. 
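For reference, the `[raftstore]` parameters discussed above (`store-pool-size`, `apply-pool-size`, and `raft-max-inflight-msgs`) live in the TiKV configuration file. The following is a minimal sketch with illustrative values only; keep them within the recommended ranges above and tune per workload:

```toml
# TiKV configuration file -- illustrative sketch, not tuning advice.
[raftstore]
# Thread pool that appends Raft logs (append log); recommended range is [1, 5].
store-pool-size = 2
# Thread pool that applies committed logs to RocksDB (apply log); recommended range is [1, 5].
apply-pool-size = 2
# Raft sliding-window size (default 256); can be raised, for example to 1024,
# when there is a write hotspot and commit log duration stays high.
raft-max-inflight-msgs = 256
```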
From 5f7329dc6ad3fe26324e5721433e8c767854f682 Mon Sep 17 00:00:00 2001 From: King-Dylan <50897894+King-Dylan@users.noreply.github.com> Date: Mon, 6 Jul 2020 14:46:48 +0800 Subject: [PATCH 39/52] Update troubleshoot-high-disk-io.md Co-authored-by: TomShawn <41534398+TomShawn@users.noreply.github.com> --- troubleshoot-high-disk-io.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/troubleshoot-high-disk-io.md b/troubleshoot-high-disk-io.md index b2ffaf62e1334..fc8bea5ee45b5 100644 --- a/troubleshoot-high-disk-io.md +++ b/troubleshoot-high-disk-io.md @@ -91,5 +91,5 @@ The cluster deployment tool (TiDB-Ansible, TiUP) is an alarm component that is d ## I/O problem handling plan 1. When it is confirmed as a I/O hotspot issue, you need to refer to [TiDB Hot Issue Processing] (/troubleshoot-hot-spot-issues.md) to eliminate the related I/O hotspot situation. -2. When it is confirmed that the overall I/O has reached the bottleneck, and the ability to judge the I/O from the business side will continue to keep up, then you can take advantage of the distributed database's scale capability and adopt the scheme of expanding the number of TiKV nodes to obtain greater overall I/O throughput. ++ When it is confirmed that the overall I/O performance has become the bottleneck, and you can determine that the I/O performance will keep falling behind from the application side, then you can take advantage of the distributed database's capability to scale and scale out the number of TiKV nodes to have greater overall I/O throughput. + Adjust some of the parameters as described above, and use computing/memory resources to make up for disk storage resources. From 2cddcf36f8c7b5259605f9158315df635e9e19c1 Mon Sep 17 00:00:00 2001 From: King-Dylan <50897894+King-Dylan@users.noreply.github.com> Date: Mon, 6 Jul 2020 14:46:56 +0800 Subject: [PATCH 40/52] Update troubleshoot-high-disk-io.md Co-authored-by: TomShawn <41534398+TomShawn@users.noreply.github.com> --- troubleshoot-high-disk-io.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/troubleshoot-high-disk-io.md b/troubleshoot-high-disk-io.md index fc8bea5ee45b5..e258f3311cc58 100644 --- a/troubleshoot-high-disk-io.md +++ b/troubleshoot-high-disk-io.md @@ -90,6 +90,6 @@ The cluster deployment tool (TiDB-Ansible, TiUP) is an alarm component that is d ## I/O problem handling plan -1. When it is confirmed as a I/O hotspot issue, you need to refer to [TiDB Hot Issue Processing] (/troubleshoot-hot-spot-issues.md) to eliminate the related I/O hotspot situation. ++ When an I/O hotspot issue is confirmed to occur, you need to refer to Handle TiDB Hotspot Issues to eliminate the I/O hotspots. + When it is confirmed that the overall I/O performance has become the bottleneck, and you can determine that the I/O performance will keep falling behind from the application side, then you can take advantage of the distributed database's capability to scale and scale out the number of TiKV nodes to have greater overall I/O throughput. + Adjust some of the parameters as described above, and use computing/memory resources to make up for disk storage resources. 
From ce902b126b2ded0471b8fa1faf25c861b9bd6991 Mon Sep 17 00:00:00 2001 From: King-Dylan <50897894+King-Dylan@users.noreply.github.com> Date: Mon, 6 Jul 2020 14:47:08 +0800 Subject: [PATCH 41/52] Update troubleshoot-high-disk-io.md Co-authored-by: TomShawn <41534398+TomShawn@users.noreply.github.com> --- troubleshoot-high-disk-io.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/troubleshoot-high-disk-io.md b/troubleshoot-high-disk-io.md index e258f3311cc58..14d86cdca3003 100644 --- a/troubleshoot-high-disk-io.md +++ b/troubleshoot-high-disk-io.md @@ -88,7 +88,7 @@ The cluster deployment tool (TiDB-Ansible, TiUP) is an alarm component that is d - TiKV_raft_append_log_duration_secs - TiKV_raft_apply_log_duration_secs -## I/O problem handling plan +## Handle I/O issues + When an I/O hotspot issue is confirmed to occur, you need to refer to Handle TiDB Hotspot Issues to eliminate the I/O hotspots. + When it is confirmed that the overall I/O performance has become the bottleneck, and you can determine that the I/O performance will keep falling behind from the application side, then you can take advantage of the distributed database's capability to scale and scale out the number of TiKV nodes to have greater overall I/O throughput. From f2ad947e958c392052964d49e33c4309bfe4fde6 Mon Sep 17 00:00:00 2001 From: King-Dylan <50897894+King-Dylan@users.noreply.github.com> Date: Mon, 6 Jul 2020 14:47:16 +0800 Subject: [PATCH 42/52] Update troubleshoot-high-disk-io.md Co-authored-by: TomShawn <41534398+TomShawn@users.noreply.github.com> --- troubleshoot-high-disk-io.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/troubleshoot-high-disk-io.md b/troubleshoot-high-disk-io.md index 14d86cdca3003..dca6dd19a48c7 100644 --- a/troubleshoot-high-disk-io.md +++ b/troubleshoot-high-disk-io.md @@ -71,7 +71,7 @@ In addition, some other panel metrics might help you determine whether the bottl - "Write stall" appears in TiKV RocksDB logs. - It may be that too much level0 sst causes stalls. You can add the parameter `[rocksdb] max-sub-compactions = 2 (or 3)` to speed up the compaction of level0 sst. This parameter means that the compaction tasks of level0 to level1 can be divided into `max-sub-compactions` subtasks to multi-threaded concurrent execution. + It might be that too many level-0 SST files cause the write stall. To address the issue, you can add the `[rocksdb] max-sub-compactions = 2 (or 3)` parameter to speed up the compaction of level-0 SST files. This parameter means that the compaction tasks of level-0 to level-1 can be divided into `max-sub-compactions` subtasks for multi-threaded concurrent execution. If the disk's I/O capability fail to keep up with the write, it is recommended to scale-in. If the throughput of the disk reaches the upper limit (for example, the throughput of SATA SSD will be much lower than that of NVME SSD), resulting in write stall, but the CPU resource is relatively sufficient, you can try to use a higher compression ratio compression algorithm to relieve the pressure on the disk, use CPU resources Change disk resources. 
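As a minimal sketch, the write-stall mitigation above would go into the `[rocksdb]` section of the TiKV configuration file (the value `2` or `3` follows the suggestion above):

```toml
# Illustrative sketch only: relieve write stall caused by too many level-0 SST files.
[rocksdb]
# Split the level-0 to level-1 compaction into this many subtasks that run concurrently.
max-sub-compactions = 2
```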
From bbc3154a3375b0fcba413416a40bdfc57b5830b3 Mon Sep 17 00:00:00 2001 From: King-Dylan <50897894+King-Dylan@users.noreply.github.com> Date: Mon, 6 Jul 2020 14:47:23 +0800 Subject: [PATCH 43/52] Update troubleshoot-high-disk-io.md Co-authored-by: TomShawn <41534398+TomShawn@users.noreply.github.com> --- troubleshoot-high-disk-io.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/troubleshoot-high-disk-io.md b/troubleshoot-high-disk-io.md index dca6dd19a48c7..62084e6aecf65 100644 --- a/troubleshoot-high-disk-io.md +++ b/troubleshoot-high-disk-io.md @@ -79,7 +79,7 @@ In addition, some other panel metrics might help you determine whether the bottl ### I/O problem found from alarm -The cluster deployment tool (TiDB-Ansible, TiUP) is an alarm component that is deployed by default. Officials have preset related alarm items and thresholds. I/O related items include: +The cluster deployment tools (TiDB Ansible and TiUP) deploy the cluster with alert components by default that have built-in alert items and thresholds. The following alert items are related to I/O: - TiKV_write_stall - TiKV_raft_log_lag From ba3a37eb36e9e52e80645a6f6e25171664043135 Mon Sep 17 00:00:00 2001 From: King-Dylan <50897894+King-Dylan@users.noreply.github.com> Date: Mon, 6 Jul 2020 14:47:32 +0800 Subject: [PATCH 44/52] Update troubleshoot-high-disk-io.md Co-authored-by: TomShawn <41534398+TomShawn@users.noreply.github.com> --- troubleshoot-high-disk-io.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/troubleshoot-high-disk-io.md b/troubleshoot-high-disk-io.md index 62084e6aecf65..135147f473e2e 100644 --- a/troubleshoot-high-disk-io.md +++ b/troubleshoot-high-disk-io.md @@ -67,7 +67,7 @@ In addition, some other panel metrics might help you determine whether the bottl - If the client reports `server is busy` error, especially the error message of `raftstore is busy`, it will be related to I/O problem. - You can check the monitor: grafana -> TiKV -> errors to confirm the specific busy reason. Among them, `server is busy` is TiKV's flow control mechanism. In this way, TiKV informs `tidb/ti-client` that the current pressure of TiKV is too high, and try again later. + You can check the monitoring panel (**Grafana** -> **TiKV** -> **errors**) to confirm the specific cause of the `busy` error. `server is busy` is TiKV's flow control mechanism. In this way, TiKV informs `tidb/ti-client` that the current pressure of TiKV is too high, and the client should try later. - "Write stall" appears in TiKV RocksDB logs. From 84d239c197860f21b5ca5ad308a075bc4d423e28 Mon Sep 17 00:00:00 2001 From: King-Dylan <50897894+King-Dylan@users.noreply.github.com> Date: Mon, 6 Jul 2020 14:47:42 +0800 Subject: [PATCH 45/52] Update troubleshoot-high-disk-io.md Co-authored-by: TomShawn <41534398+TomShawn@users.noreply.github.com> --- troubleshoot-high-disk-io.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/troubleshoot-high-disk-io.md b/troubleshoot-high-disk-io.md index 135147f473e2e..ec6f4cbbade5a 100644 --- a/troubleshoot-high-disk-io.md +++ b/troubleshoot-high-disk-io.md @@ -69,7 +69,7 @@ In addition, some other panel metrics might help you determine whether the bottl You can check the monitoring panel (**Grafana** -> **TiKV** -> **errors**) to confirm the specific cause of the `busy` error. `server is busy` is TiKV's flow control mechanism. In this way, TiKV informs `tidb/ti-client` that the current pressure of TiKV is too high, and the client should try later. 
-- "Write stall" appears in TiKV RocksDB logs. +- `Write stall` appears in TiKV RocksDB logs. It might be that too many level-0 SST files cause the write stall. To address the issue, you can add the `[rocksdb] max-sub-compactions = 2 (or 3)` parameter to speed up the compaction of level-0 SST files. This parameter means that the compaction tasks of level-0 to level-1 can be divided into `max-sub-compactions` subtasks for multi-threaded concurrent execution. From d90fbea034e6cc20fd33f3886702c44ae9ef6e82 Mon Sep 17 00:00:00 2001 From: King-Dylan <50897894+King-Dylan@users.noreply.github.com> Date: Mon, 6 Jul 2020 14:47:55 +0800 Subject: [PATCH 46/52] Update troubleshoot-high-disk-io.md Co-authored-by: TomShawn <41534398+TomShawn@users.noreply.github.com> --- troubleshoot-high-disk-io.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/troubleshoot-high-disk-io.md b/troubleshoot-high-disk-io.md index ec6f4cbbade5a..5364b4dabb856 100644 --- a/troubleshoot-high-disk-io.md +++ b/troubleshoot-high-disk-io.md @@ -65,7 +65,7 @@ In addition, some other panel metrics might help you determine whether the bottl ### Locate I/O issues from log -- If the client reports `server is busy` error, especially the error message of `raftstore is busy`, it will be related to I/O problem. +- If the client reports errors such as `server is busy` or especially `raftstore is busy`, the errors might be related to I/O issues. You can check the monitoring panel (**Grafana** -> **TiKV** -> **errors**) to confirm the specific cause of the `busy` error. `server is busy` is TiKV's flow control mechanism. In this way, TiKV informs `tidb/ti-client` that the current pressure of TiKV is too high, and the client should try later. From 482319aa52500f430b1a02d5390faad016ebf684 Mon Sep 17 00:00:00 2001 From: TomShawn <41534398+TomShawn@users.noreply.github.com> Date: Tue, 7 Jul 2020 16:20:15 +0800 Subject: [PATCH 47/52] Update troubleshoot-high-disk-io.md --- troubleshoot-high-disk-io.md | 1 + 1 file changed, 1 insertion(+) diff --git a/troubleshoot-high-disk-io.md b/troubleshoot-high-disk-io.md index 5364b4dabb856..d6f0e341de230 100644 --- a/troubleshoot-high-disk-io.md +++ b/troubleshoot-high-disk-io.md @@ -2,6 +2,7 @@ title: Troubleshoot High Disk I/O Usage in TiDB summary: Learn how to locate and address the issue of high TiDB storage I/O usage. category: reference +aliases: ['/docs/dev/troubleshoot-high-disk-io/'] --- # Troubleshoot High Disk I/O Usage in TiDB From 736366dbe97ff56becc4b3c1e3c2cf996d55fcea Mon Sep 17 00:00:00 2001 From: TomShawn <41534398+TomShawn@users.noreply.github.com> Date: Tue, 7 Jul 2020 17:41:54 +0800 Subject: [PATCH 48/52] Apply suggestions from code review --- troubleshoot-high-disk-io.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/troubleshoot-high-disk-io.md b/troubleshoot-high-disk-io.md index d6f0e341de230..16f00086ffe88 100644 --- a/troubleshoot-high-disk-io.md +++ b/troubleshoot-high-disk-io.md @@ -15,7 +15,7 @@ If TiDB's response slows down after you have troubleshot the CPU bottleneck and ### Locate I/O issues from monitor -The quickest way to locate I/O issues is to view the overall I/O status from the monitor, such as the Grafana dashboard which is deployed by default by TiDB Ansible and TiUP. The dashboard panels related to I/O include **Overview**, **Node_exporter**, **Disk-Performance**. 
+The quickest way to locate I/O issues is to view the overall I/O status from the monitor, such as the Grafana dashboard which is deployed by default by TiDB Ansible and TiUP. The dashboard panels related to I/O include **Overview**, **Node_exporter**, **Disk-Performance**. #### The first type of monitoring panels @@ -49,7 +49,7 @@ In **TiKV-Details** > **Storage**, there are monitoring metrics related to stora In addition, some other panel metrics might help you determine whether the bottleneck is I/O, and you can try to set some parameters. By checking the prewrite/commit/raw-put (for raw key-value clusters only) of TiKV gRPC duration, you can determine that the bottleneck is indeed the slow TiKV write. The common situations of slow TiKV writes are as follows: - `append log` is slow. TiKV Grafana's `Raft I/O` and `append log duration` metrics are relatively high, which is often due to slow disk writes. You can check the value of `WAL Sync Duration max` in **RocksDB-raft** to determine the cause of slow `append log`. Otherwise, you might need to report a bug. -- The `raftstore` thread is busy. In TiKV Grafana, `Raft Propose`/`propose wait duration` is significantly higher than `append log duration`. Please check the following aspects for troubleshooting: +- The `raftstore` thread is busy. In TiKV Grafana, `Raft Propose`/`propose wait duration` is significantly higher than `append log duration`. Check the following aspects for troubleshooting: - Whether the value of `store-pool-size` of `[raftstore]` is too small. It is recommended to set this value between `[1,5]` and not too large. - Whether the CPU resource of the machine is insufficient. @@ -62,7 +62,7 @@ In addition, some other panel metrics might help you determine whether the bottl - Slow write into RocksDB, and `RocksDB kv`/`max write duration` is high. A single Raft log might contain multiple key-value pairs (kv). 128 kvs are written to RocksDB in a batch, so one `apply` log might involve multiple RocksDB writes. - For other causes, report them as bugs. -- `raft commit log` is slow. In TiKV Grafana, `Raft I/O` and `commit log duration` (only available in Grafana 4.x) metrics are relatively high. Each Region corresponds to an independent Raft group. Raft has a flow control mechanism similar to the sliding window mechanism of TCP. To control the size of a sliding window, adjust the `[raftstore] raft-max-inflight-msgs` parameter. if there is a write hotspot and `commit log duration` is high, you can properly set this parameter to a larger value, such as `1024`. +- `raft commit log` is slow. In TiKV Grafana, `Raft I/O` and `commit log duration` (only available in Grafana 4.x) metrics are relatively high. Each Region corresponds to an independent Raft group. Raft has a flow control mechanism similar to the sliding window mechanism of TCP. To control the size of a sliding window, adjust the `[raftstore] raft-max-inflight-msgs` parameter. If there is a write hotspot and `commit log duration` is high, you can properly set this parameter to a larger value, such as `1024`. ### Locate I/O issues from log @@ -74,11 +74,11 @@ In addition, some other panel metrics might help you determine whether the bottl It might be that too many level-0 SST files cause the write stall. To address the issue, you can add the `[rocksdb] max-sub-compactions = 2 (or 3)` parameter to speed up the compaction of level-0 SST files. 
This parameter means that the compaction tasks of level-0 to level-1 can be divided into `max-sub-compactions` subtasks for multi-threaded concurrent execution. - If the disk's I/O capability fail to keep up with the write, it is recommended to scale-in. If the throughput of the disk reaches the upper limit (for example, the throughput of SATA SSD will be much lower than that of NVME SSD), resulting in write stall, but the CPU resource is relatively sufficient, you can try to use a higher compression ratio compression algorithm to relieve the pressure on the disk, use CPU resources Change disk resources. + If the disk's I/O capability fails to keep up with the write, it is recommended to scale up the disk. If the throughput of the disk reaches the upper limit (for example, the throughput of SATA SSD is much lower than that of NVMe SSD), which results in write stall, but the CPU resource is relatively sufficient, you can try to use a compression algorithm of higher compression ratio to relieve the pressure on the disk, that is, use CPU resources to make up for disk resources. For example, when the pressure of `default cf compaction` is relatively high, you can change the parameter`[rocksdb.defaultcf] compression-per-level = ["no", "no", "lz4", "lz4", "lz4", "zstd" , "zstd"]` to `compression-per-level = ["no", "no", "zstd", "zstd", "zstd", "zstd", "zstd"]`. -### I/O problem found from alarm +### I/O issues found in alerts The cluster deployment tools (TiDB Ansible and TiUP) deploy the cluster with alert components by default that have built-in alert items and thresholds. The following alert items are related to I/O: @@ -92,5 +92,5 @@ The cluster deployment tools (TiDB Ansible and TiUP) deploy the cluster with ale ## Handle I/O issues + When an I/O hotspot issue is confirmed to occur, you need to refer to Handle TiDB Hotspot Issues to eliminate the I/O hotspots. -+ When it is confirmed that the overall I/O performance has become the bottleneck, and you can determine that the I/O performance will keep falling behind from the application side, then you can take advantage of the distributed database's capability to scale and scale out the number of TiKV nodes to have greater overall I/O throughput. ++ When it is confirmed that the overall I/O performance has become the bottleneck, and you can determine that the I/O performance will keep falling behind in the application side, then you can take advantage of the distributed database's capability of scaling and scale out the number of TiKV nodes to have greater overall I/O throughput. + Adjust some of the parameters as described above, and use computing/memory resources to make up for disk storage resources. 
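Written out in the TiKV configuration file, the compression change described above might look as follows (illustrative sketch; zstd on more levels trades CPU for disk bandwidth):

```toml
# Illustrative sketch only: use zstd from level 2 onward for the default CF
# to relieve disk pressure at the cost of extra CPU.
[rocksdb.defaultcf]
compression-per-level = ["no", "no", "zstd", "zstd", "zstd", "zstd", "zstd"]
```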
From 133fffebee4d31632fbdab3ac52c8ae051bd355a Mon Sep 17 00:00:00 2001 From: TomShawn <41534398+TomShawn@users.noreply.github.com> Date: Tue, 7 Jul 2020 18:10:24 +0800 Subject: [PATCH 49/52] Update TOC.md --- TOC.md | 1 + 1 file changed, 1 insertion(+) diff --git a/TOC.md b/TOC.md index 2917937ac066c..4c54fb52252c3 100644 --- a/TOC.md +++ b/TOC.md @@ -84,6 +84,7 @@ + [Identify Expensive Queries](/identify-expensive-queries.md) + [Statement Summary Tables](/statement-summary-tables.md) + [Troubleshoot Cluster Setup](/troubleshoot-tidb-cluster.md) + + [Troubleshoot High Disk I/O Usage](/troubleshoot-high-disk-io.md) + [TiDB Troubleshooting Map](/tidb-troubleshooting-map.md) + [Troubleshoot TiCDC](/ticdc/troubleshoot-ticdc.md) + [Troubleshoot TiFlash](/tiflash/troubleshoot-tiflash.md) From 1db21cbd7470e9aa9241cb0b06b28d9c19e2893a Mon Sep 17 00:00:00 2001 From: TomShawn <41534398+TomShawn@users.noreply.github.com> Date: Tue, 7 Jul 2020 22:30:27 +0800 Subject: [PATCH 50/52] Update troubleshoot-high-disk-io.md --- troubleshoot-high-disk-io.md | 1 - 1 file changed, 1 deletion(-) diff --git a/troubleshoot-high-disk-io.md b/troubleshoot-high-disk-io.md index 16f00086ffe88..533511fb2beb2 100644 --- a/troubleshoot-high-disk-io.md +++ b/troubleshoot-high-disk-io.md @@ -2,7 +2,6 @@ title: Troubleshoot High Disk I/O Usage in TiDB summary: Learn how to locate and address the issue of high TiDB storage I/O usage. category: reference -aliases: ['/docs/dev/troubleshoot-high-disk-io/'] --- # Troubleshoot High Disk I/O Usage in TiDB From 8a5f2df3e8c13cedbdc178290a3442e79e2f2004 Mon Sep 17 00:00:00 2001 From: TomShawn <41534398+TomShawn@users.noreply.github.com> Date: Thu, 16 Jul 2020 18:01:24 +0800 Subject: [PATCH 51/52] Update troubleshoot-high-disk-io.md --- troubleshoot-high-disk-io.md | 1 - 1 file changed, 1 deletion(-) diff --git a/troubleshoot-high-disk-io.md b/troubleshoot-high-disk-io.md index 533511fb2beb2..2b0ada5c52c95 100644 --- a/troubleshoot-high-disk-io.md +++ b/troubleshoot-high-disk-io.md @@ -1,7 +1,6 @@ --- title: Troubleshoot High Disk I/O Usage in TiDB summary: Learn how to locate and address the issue of high TiDB storage I/O usage. -category: reference --- # Troubleshoot High Disk I/O Usage in TiDB From 16d020d67d310a67c928fafcb6f2191ad7f668fa Mon Sep 17 00:00:00 2001 From: TomShawn <41534398+TomShawn@users.noreply.github.com> Date: Fri, 17 Jul 2020 15:19:06 +0800 Subject: [PATCH 52/52] Apply suggestions from code review --- troubleshoot-high-disk-io.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/troubleshoot-high-disk-io.md b/troubleshoot-high-disk-io.md index 2b0ada5c52c95..2d7e29d978ef2 100644 --- a/troubleshoot-high-disk-io.md +++ b/troubleshoot-high-disk-io.md @@ -13,11 +13,11 @@ If TiDB's response slows down after you have troubleshot the CPU bottleneck and ### Locate I/O issues from monitor -The quickest way to locate I/O issues is to view the overall I/O status from the monitor, such as the Grafana dashboard which is deployed by default by TiDB Ansible and TiUP. The dashboard panels related to I/O include **Overview**, **Node_exporter**, **Disk-Performance**. +The quickest way to locate I/O issues is to view the overall I/O status from the monitor, such as the Grafana dashboard which is deployed by default by TiDB Ansible and TiUP. The dashboard panels related to I/O include **Overview**, **Node_exporter**, and **Disk-Performance**. 
 #### The first type of monitoring panels
 
-In `Overview`> `System Info`> `IO Util`, you can see the I/O status of each machine in the cluster. This metric is similar to `util` in the Linux `iostat` monitor. The higher percentage represents higher disk I/O usage:
+In **Overview** > **System Info** > **IO Util**, you can see the I/O status of each machine in the cluster. This metric is similar to `util` in the Linux `iostat` monitor. A higher percentage indicates higher disk I/O usage:
 
 - If there is only one machine with high I/O usage in the monitor, currently there might be read and write hotspots on this machine.
 - If the I/O usage of most machines in the monitor is high, the cluster now has high I/O loads.