diff --git a/grafana-pd-dashboard.md b/grafana-pd-dashboard.md
index 318980033fb9b..b6631f2267686 100644
--- a/grafana-pd-dashboard.md
+++ b/grafana-pd-dashboard.md
@@ -13,122 +13,151 @@ The Grafana dashboard is divided into a series of sub dashboards which include O
 You can get an overview of the component PD status from the PD dashboard, where the key metrics are displayed. This document provides a detailed description of these key metrics.
 
+The following is the description of PD Dashboard metrics items:
+
+- PD role: The role of the current PD instance
+- Storage capacity: The total storage capacity for this TiDB cluster
+- Current storage size: The storage size that is currently used by the TiDB cluster
+- Current storage usage: The current storage usage rate
+- Normal stores: The count of healthy storage instances
+- Number of Regions: The total count of cluster Regions
+- Abnormal stores: The count of unhealthy stores. The normal value is `0`. If the number is greater than `0`, it means at least one instance is abnormal.
+- Region health: The health status of Regions indicated via the count of unusual Regions including pending peers, down peers, extra peers, offline peers, missing peers, learner peers and incorrect namespaces. Generally, the number of pending peers should be less than `100`. The missing peers should not be persistently greater than `0`. If many empty Regions exist, enable Region Merge promptly.
+- Current peer count: The current count of all cluster peers
+![PD Dashboard - Header](/media/pd-dashboard-header-v4.png)
+
 ## Key metrics description
 
-To understand the key metrics displayed on the Overview dashboard, check the following table:
-
-Service | Panel name | Description | Normal range
----------------- | ---------------- | ---------------------------------- | --------------
-Cluster | PD role | The role of the current PD |
-Cluster | Storage capacity | The total capacity size of the cluster |
-Cluster | Current storage size | The current storage size of the cluster |
-Cluster | Current storage usage | The total number of Regions without replicas |
-Cluster | Normal stores | The count of healthy stores |
-Cluster | Abnormal stores | The count of unhealthy stores | The normal value is `0`. If the number is bigger than `0`, it means at least one instance is abnormal.
-Cluster | Current peer count | The current peer count of the cluster |
-Cluster | Number of Regions | The total number of Regions of the cluster |
-Cluster | PD scheduler config | The list of PD scheduler configurations |
-Cluster | Region label isolation level | The number of Regions in different label levels
+## Cluster
+
+- PD scheduler config: The list of PD scheduler configurations
+- Cluster ID: The unique identifier of the cluster
+- Current TSO: The physical part of the currently allocated TSO
+- Current ID allocation: The maximum allocatable ID for new store/peer
+- Region label isolation level: The number of Regions in different label levels
+- Label distribution: The distribution status of the labels in the cluster
+
+![PD Dashboard - Cluster metrics](/media/pd-dashboard-cluster-v4.png)
+
+## Operator
+
+- Schedule operator create: The number of newly created operators per type
+- Schedule operator check: The number of checked operators per type. It mainly checks whether the current step is finished; if yes, it returns the next step to be executed
+- Schedule operator finish: The number of finished operators per type
+- Schedule operator timeout: The number of timeout operators per type
+- Schedule operator replaced or canceled: The number of replaced or canceled operators per type
+- Schedule operators count by state: The number of operators per state
+- Operator finish duration: The maximum duration of finished operators
+- Operator step duration: The maximum duration of finished operator steps
+
+![PD Dashboard - Operator metrics](/media/pd-dashboard-operator-v4.png)
+
+## Statistics - Balance
+
+- Store capacity: The capacity size per TiKV instance
+- Store available: The available capacity size per TiKV instance
+- Store used: The used capacity size per TiKV instance
+- Size amplification: The size amplification ratio per TiKV instance, which is equal to (Store Region size)/(Store used capacity size)
+- Size available ratio: The size availability ratio per TiKV instance, which is equal to (Store available capacity size)/(Store capacity size)
+- Store leader score: The leader score per TiKV instance
+- Store Region score: The Region score per TiKV instance
+- Store leader size: The total leader size per TiKV instance
+- Store Region size: The total Region size per TiKV instance
+- Store leader count: The leader count per TiKV instance
+- Store Region count: The Region count per TiKV instance
+
+![PD Dashboard - Balance metrics](/media/pd-dashboard-balance-v4.png)
+
+## Statistics - hot write
+
+- Hot Region's leader distribution: The total number of leader Regions that have become write hotspots on each TiKV instance
+- Total written bytes on hot leader Regions: The total written bytes by leader Regions that have become write hotspots on each TiKV instance
+- Hot write Region's peer distribution: The total number of peer Regions that have become write hotspots on each TiKV instance
+- Total written bytes on hot peer Regions: The written bytes of all peer Regions that have become write hotspots on each TiKV instance
+- Store Write rate bytes: The total written bytes on each TiKV instance
+- Store Write rate keys: The total written keys on each TiKV instance
+- Hot cache write entry number: The number of peers on each TiKV instance that are in the write hotspot statistics module
+- Selector events: The event count of Selector in the hotspot scheduling module
+- Direction of hotspot move leader: The direction of leader movement in the hotspot scheduling. A positive number means scheduling into the instance. A negative number means scheduling out of the instance
+- Direction of hotspot move peer: The direction of peer movement in the hotspot scheduling. A positive number means scheduling into the instance. A negative number means scheduling out of the instance
+
+![PD Dashboard - Hot write metrics](/media/pd-dashboard-hotwrite-v4.png)
+
+## Statistics - hot read
+
+- Hot Region's leader distribution: The total number of leader Regions that have become read hotspots on each TiKV instance
+- Total read bytes on hot leader Regions: The total read bytes of leaders that have become read hotspots on each TiKV instance
+- Store read rate bytes: The total read bytes of each TiKV instance
+- Store read rate keys: The total read keys of each TiKV instance
+- Hot cache read entry number: The number of peers that are in the read hotspot statistics module on each TiKV instance
+
+![PD Dashboard - Hot read metrics](/media/pd-dashboard-hotread-v4.png)
+
+## Scheduler
+
+- Scheduler is running: The currently running schedulers
+- Balance leader movement: The leader movement details among TiKV instances
+- Balance Region movement: The Region movement details among TiKV instances
+- Balance leader event: The count of balance leader events
+- Balance Region event: The count of balance Region events
+- Balance leader scheduler: The inner status of balance leader scheduler
+- Balance Region scheduler: The inner status of balance Region scheduler
+- Replica checker: The replica checker's status
+- Rule checker: The rule checker's status
+- Region merge checker: The merge checker's status
+- Filter target: The number of attempts that the store is selected as the scheduling target but failed to pass the filter
+- Filter source: The number of attempts that the store is selected as the scheduling source but failed to pass the filter
+- Balance Direction: The number of times that the Store is selected as the target or source of scheduling
+- Store Limit: The flow control limitation of scheduling on the Store
+
+![PD Dashboard - Scheduler metrics](/media/pd-dashboard-scheduler-v4.png)
+
+## gRPC
+
+- Completed commands rate: The rate per command type at which gRPC commands are completed
+- 99% Completed commands duration: The time consumed to complete gRPC commands per command type (P99)
+
+![PD Dashboard - gRPC metrics](/media/pd-dashboard-grpc-v2.png)
+
+## etcd
+
+- Handle transactions count: The rate at which etcd handles transactions
+- 99% Handle transactions duration: The time consumed for handling etcd transactions (P99)
+- 99% WAL fsync duration: The time consumed for writing WAL into the persistent storage (P99). The value should be less than `1s`
+- 99% Peer round trip time seconds: The network latency for etcd (P99). The value should be less than `1s`
+- etcd disk WAL fsync rate: The rate of writing WAL into the persistent storage
+- Raft term: The current term of Raft
+- Raft committed index: The last committed index of Raft
+- Raft applied index: The last applied index of Raft
+
+![PD Dashboard - etcd metrics](/media/pd-dashboard-etcd-v2.png)
+
 Cluster | Label distribution | The distribution status of the labels in the cluster |
 Cluster | pd_cluster_metadata | The metadata of the PD cluster including cluster ID, the timestamp, and the generated ID. |
 Cluster | Region health | The health status of Regions indicated via count of unusual Regions including pending peers, down peers, extra peers, offline peers, missing peers, learner peers and incorrect namespaces | The number of pending peers should be less than `100`. The missing peers should not be persistently greater than `0`.
-Statistics - Balance | Store capacity | The capacity size per TiKV instance |
-Statistics - Balance | Store available | The available capacity size per TiKV instance |
-Statistics - Balance | Store used | The used capacity size per TiKV instance |
-Statistics - Balance | Size amplification | The size amplification ratio per TiKV instance, which is equal to (Store Region size)/(Store used capacity size) |
-Statistics - Balance | Size available ratio | The size availability ratio per TiKV instance, which is equal to (Store available capacity size)/(Store capacity size) |
-Statistics - Balance | Store leader score | The leader score per TiKV instance |
-Statistics - Balance | Store Region score | The Region score per TiKV instance |
-Statistics - Balance | Store leader size | The total leader size per TiKV instance |
-Statistics - Balance | Store Region size | The total Region size per TiKV instance |
-Statistics - Balance | Store leader count | The leader count per TiKV instance |
-Statistics - Balance | Store Region count | The Region count per TiKV instance |
-Statistics - Hotspot | Leader distribution in hot write Regions | The total number of leader Regions in hot write on each TiKV instance |
-Statistics - Hotspot | Peer distribution in hot write Regions | The total number of peer Regions under in hot write on each TiKV instance |
-Statistics - Hotspot | Leader written bytes in hot write Regions | The total written bytes by Leader regions in hot write on leader Regions for each TiKV instance |
-Statistics - Hotspot | Peer written bytes in hot write Regions | The total bytes of hot write on peer Regions per each TiKV instance |
-Statistics - Hotspot | Leader distribution in hot read Regions | The total number of leader Regions in hot read per each TiKV instance |
-Statistics - Hotspot | Peer distribution in hot read Regions | The total number of Regions which are not leader under hot read per each TiKV instance |
-Statistics - Hotspot | Leader read bytes in hot read Regions | The total bytes of hot read on leader Regions per each TiKV instance |
-Statistics - Hotspot | Peer read bytes in hot read Regions | The total bytes of hot read on peer Regions per TiKV instance |
-Scheduler | Running schedulers | The current running schedulers |
-Scheduler | Balance leader movement | The leader movement details among TiKV instances |
-Scheduler | Balance Region movement | The Region movement details among TiKV instances |
-Scheduler | Balance leader event | The count of balance leader events |
-Scheduler | Balance Region event | The count of balance Region events |
-Scheduler | Balance leader scheduler | The inner status of balance leader scheduler |
-Scheduler | Balance Region scheduler | The inner status of balance Region scheduler |
-Scheduler | Namespace checker | The namespace checker's status |
-Scheduler | Replica checker | The replica checker's status |
-Scheduler | Region merge checker | The merge checker's status |
-Operator | Schedule operator create | The number of newly created operators per type |
-Operator | Schedule operator check | The number of checked operator per type. It mainly checks if the current step is finished; if yes, it returns the next step to be executed. |
-Operator | Schedule operator finish | The number of finished operators per type |
-Operator | Schedule operator timeout | The number of timeout operators per type |
-Operator | Schedule operator replaced or canceled | The number of replaced or canceled operators per type |
-Operator | Schedule operators count by state | The number of operators per state |
-Operator | 99% Operator finish duration | The operator step duration (P99) |
-Operator | 50% Operator finish duration | The operator duration (P50) |
-Operator | 99% Operator step duration | The operator step duration (P99) |
-Operator | 50% Operator step duration | The operator step duration (P50) |
-gRPC | Completed commands rate | The rate per command type type at which gRPC commands are completed|
-gRPC | 99% Completed commands duration | The rate per command type type at which gRPC commands are completed (P99) |
-etcd | Transaction handling rate | The rate at which etcd handles transactions |
-etcd | 99% transactions duration | The transaction handling rate (P99) |
-etcd | 99% WAL fsync duration | The time consumed for writing WAL into the persistent storage (P99) | The value is less than `1s`.
-etcd | 99% Peer round trip time seconds | The network latency for etcd (P99) | The value is less than `1s`.
-etcd | etcd disk wal fsync rate | The rate of writing WAL into the persistent storage |
-etcd | Raft term | The current term of Raft |
-etcd | Raft committed index | The last committed index of Raft |
-etcd | Raft applied index | The last applied index of Raft |
-TiDB | Handled requests count | The count of TiDB requests |
-TiDB | Request handling duration | The time consumed for handling TiDB requests | It should be less than `100ms` (P99).
-Heartbeat | Region heartbeat report | The count of heartbeats reported to PD per instance |
-Heartbeat | Region heartbeat report error | The count of heartbeats with the `error` status |
-Heartbeat | Region heartbeat report active | The count of heartbeats with the `ok` status |
-Heartbeat | Region schedule push | The count of corresponding schedule commands sent from PD per TiKV instance |
-Heartbeat | 99% Region heartbeat latency | The heartbeat latency per TiKV instance (P99) |
-Region storage | Syncer index | The maximum index in the Region change history recorded by the leader |
-Region storage | History last index | The last index where the Region change history is synchronized successfully with the follower |
-
-## PD dashboard interface
-
-### Cluster
-
-![PD Dashboard - Cluster metrics](/media/pd-dashboard-cluster-v2.png)
-
-### Statistics - Balance
-
-![PD Dashboard - Statistics - Balance metrics](/media/pd-dashboard-balance-v2.png)
-
-### Statistics - Hotspot
-
-![PD Dashboard - Statistics - Hotspot metrics](/media/pd-dashboard-hotspot.png)
-
-### Scheduler
-
-![PD Dashboard - Scheduler metrics](/media/pd-dashboard-scheduler-v2.png)
-
-### Operator
-
-![PD Dashboard - Operator metrics](/media/pd-dashboard-operator-v2.png)
-
-### gRPC
-![PD Dashboard - gRPC metrics](/media/pd-dashboard-grpc-v2.png)
+## TiDB
 
-### etcd
+- PD Server TSO handle time and Client recv time: The duration between PD receiving the TSO request and the PD client getting the TSO response
+- Handle requests count: The count of TiDB requests
+- Handle requests duration: The time consumed for handling TiDB requests. It should be less than `100ms` (P99)
 
-![PD Dashboard - etcd metrics](/media/pd-dashboard-etcd-v2.png)
+![PD Dashboard - TiDB metrics](/media/pd-dashboard-tidb-v4.png)
 
-### TiDB
+## Heartbeat
 
-![PD Dashboard - TiDB metrics](/media/pd-dashboard-tidb-v2.png)
+- Heartbeat region event QPS: The QPS of handling heartbeat messages, including updating the cache and persisting data
+- Region heartbeat report: The count of heartbeats reported to PD per instance
+- Region heartbeat report error: The count of heartbeats with the `error` status
+- Region heartbeat report active: The count of heartbeats with the `ok` status
+- Region schedule push: The count of corresponding schedule commands sent from PD per TiKV instance
+- 99% Region heartbeat latency: The heartbeat latency per TiKV instance (P99)
 
-### Heartbeat
+![PD Dashboard - Heartbeat metrics](/media/pd-dashboard-heartbeat-v4.png)
 
-![PD Dashboard - Heartbeat metrics](/media/pd-dashboard-heartbeat-v2.png)
+## Region storage
 
-### Region storage
+- Syncer Index: The maximum index in the Region change history recorded by the leader
+- history last index: The last index where the Region change history is synchronized successfully with the follower
 
-![PD Dashboard - Region storage](/media/pd-dashboard-region-storage.png)
\ No newline at end of file
+![PD Dashboard - Region storage](/media/pd-dashboard-region-storage.png)
diff --git a/media/pd-dashboard-balance-v4.png b/media/pd-dashboard-balance-v4.png
new file mode 100644
index 0000000000000..e13a97c8c8cd3
Binary files /dev/null and b/media/pd-dashboard-balance-v4.png differ
diff --git a/media/pd-dashboard-cluster-v4.png b/media/pd-dashboard-cluster-v4.png
new file mode 100644
index 0000000000000..1546480621d4f
Binary files /dev/null and b/media/pd-dashboard-cluster-v4.png differ
diff --git a/media/pd-dashboard-header-v4.png b/media/pd-dashboard-header-v4.png
new file mode 100644
index 0000000000000..01b7efe29a794
Binary files /dev/null and b/media/pd-dashboard-header-v4.png differ
diff --git a/media/pd-dashboard-heartbeat-v4.png b/media/pd-dashboard-heartbeat-v4.png
new file mode 100644
index 0000000000000..455c52e5c4953
Binary files /dev/null and b/media/pd-dashboard-heartbeat-v4.png differ
diff --git a/media/pd-dashboard-hotread-v4.png b/media/pd-dashboard-hotread-v4.png
new file mode 100644
index 0000000000000..526e1acda5c8b
Binary files /dev/null and b/media/pd-dashboard-hotread-v4.png differ
diff --git a/media/pd-dashboard-hotwrite-v4.png b/media/pd-dashboard-hotwrite-v4.png
new file mode 100644
index 0000000000000..cafdc0a65a06d
Binary files /dev/null and b/media/pd-dashboard-hotwrite-v4.png differ
diff --git a/media/pd-dashboard-operator-v4.png b/media/pd-dashboard-operator-v4.png
new file mode 100644
index 0000000000000..0e41372dbee97
Binary files /dev/null and b/media/pd-dashboard-operator-v4.png differ
diff --git a/media/pd-dashboard-scheduler-v4.png b/media/pd-dashboard-scheduler-v4.png
new file mode 100644
index 0000000000000..2c58b2de548ba
Binary files /dev/null and b/media/pd-dashboard-scheduler-v4.png differ
diff --git a/media/pd-dashboard-tidb-v4.png b/media/pd-dashboard-tidb-v4.png
new file mode 100644
index 0000000000000..0970eb42ce996
Binary files /dev/null and b/media/pd-dashboard-tidb-v4.png differ
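
The Size amplification and Size available ratio panels described under the new Statistics - Balance section are plain ratios of per-store quantities. The Go sketch below is an editor-added illustration, not part of the patch or of the PD codebase; the `storeStats` type and its field names are hypothetical and stand in for the Store capacity, Store available, Store used, and Store Region size values shown on the dashboard.

```go
package main

import "fmt"

// storeStats mirrors the per-TiKV quantities shown on the
// "Statistics - Balance" panels (illustrative names, not PD API types).
type storeStats struct {
	capacityGiB  float64 // Store capacity
	availableGiB float64 // Store available
	usedGiB      float64 // Store used
	regionGiB    float64 // Store Region size
}

// sizeAmplification = (Store Region size) / (Store used capacity size)
func sizeAmplification(s storeStats) float64 {
	return s.regionGiB / s.usedGiB
}

// sizeAvailableRatio = (Store available capacity size) / (Store capacity size)
func sizeAvailableRatio(s storeStats) float64 {
	return s.availableGiB / s.capacityGiB
}

func main() {
	// Example numbers: 280 GiB of Region data on 100 GiB of used space
	// gives an amplification of 2.80; 350 GiB free out of 500 GiB gives
	// an available ratio of 0.70.
	s := storeStats{capacityGiB: 500, availableGiB: 350, usedGiB: 100, regionGiB: 280}
	fmt.Printf("size amplification: %.2f\n", sizeAmplification(s))
	fmt.Printf("size available ratio: %.2f\n", sizeAvailableRatio(s))
}
```

A high amplification simply means the logical Region size PD tracks is much larger than the bytes actually stored (for example, due to compression), while a low available ratio flags stores that are close to full.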