diff --git a/TOC.md b/TOC.md
index 465043bf2bd83..f18f5907957bf 100644
--- a/TOC.md
+++ b/TOC.md
@@ -73,8 +73,9 @@
 + [Maintain TiDB Using TiUP](/maintain-tidb-using-tiup.md)
 + [Maintain TiDB Using Ansible](/maintain-tidb-using-ansible.md)
 + Monitor and Alert
-  + [Monitoring Framework](/tidb-monitoring-framework.md)
-  + [Monitor a TiDB Cluster](/monitor-a-tidb-cluster.md)
+  + [Monitoring Framework Overview](/tidb-monitoring-framework.md)
+  + [Monitoring API](/tidb-monitoring-api.md)
+  + [Deploy Monitoring Services](/deploy-monitoring-services.md)
 + [TiDB Cluster Alert Rules](/alert-rules.md)
 + [TiFlash Alert Rules](/tiflash/tiflash-alert-rules.md)
 + Troubleshoot
diff --git a/monitor-a-tidb-cluster.md b/deploy-monitoring-services.md
similarity index 62%
rename from monitor-a-tidb-cluster.md
rename to deploy-monitoring-services.md
index 85bc9d0a31bbc..72b6703d716fd 100644
--- a/monitor-a-tidb-cluster.md
+++ b/deploy-monitoring-services.md
@@ -1,88 +1,17 @@
 ---
-title: Monitor a TiDB Cluster
-summary: Learn how to monitor the state of a TiDB cluster.
+title: Deploy Monitoring Services for the TiDB Cluster
+summary: Learn how to deploy monitoring services for the TiDB cluster.
 category: how-to
-aliases: ['/docs/dev/how-to/monitor/monitor-a-cluster/']
+aliases: ['/docs/dev/how-to/monitor/monitor-a-cluster/','/docs/dev/monitor-a-tidb-cluster/']
 ---

-# Monitor a TiDB Cluster
+# Deploy Monitoring Services for the TiDB Cluster

-You can use the following two types of interfaces to monitor the TiDB cluster state:
+This document is intended for users who want to manually deploy TiDB monitoring and alert services.

-- [The state interface](#use-the-state-interface): this interface uses the HTTP interface to get the component information.
-- [The metrics interface](#use-the-metrics-interface): this interface uses Prometheus to record the detailed information of the various operations in components and views these metrics using Grafana.
+If you deploy the TiDB cluster using TiUP, the monitoring and alert services are automatically deployed, so no manual deployment is needed.

-## Use the state interface
-
-The state interface monitors the basic information of a specific component in the TiDB cluster. It can also act as the monitor interface for Keepalive messages. In addition, the state interface for the Placement Driver (PD) can get the details of the entire TiKV cluster.
-
-### TiDB server
-
-- TiDB API address: `http://${host}:${port}`
-- Default port: `10080`
-- Details about API names: see [TiDB HTTP API](https://github.com/pingcap/tidb/blob/master/docs/tidb_http_api.md)
-
-The following example uses `http://${host}:${port}/status` to get the current state of the TiDB server and to determine whether the server is alive. The result is returned in JSON format.
-
-```bash
-curl http://127.0.0.1:10080/status
-{
-  connections: 0,  # The current number of clients connected to the TiDB server.
-  version: "5.7.25-TiDB-v3.0.0-beta-250-g778c3f4a5",  # The TiDB version number.
-  git_hash: "778c3f4a5a716880bcd1d71b257c8165685f0d70"  # The Git Hash of the current TiDB code.
-}
-```
-
-### PD server
-
-- PD API address: `http://${host}:${port}/pd/api/v1/${api_name}`
-- Default port: `2379`
-- Details about API names: see [PD API doc](https://download.pingcap.com/pd-api-v1.html)
-
-The PD interface provides the state of all the TiKV servers and the information about load balancing. See the following example for the information about a single-node TiKV cluster:
-
-```bash
-curl http://127.0.0.1:2379/pd/api/v1/stores
-{
-  "count": 1,  # The number of TiKV nodes.
-  "stores": [  # The list of TiKV nodes.
-    # The details about the single TiKV node.
-    {
-      "store": {
-        "id": 1,
-        "address": "127.0.0.1:20160",
-        "version": "3.0.0-beta",
-        "state_name": "Up"
-      },
-      "status": {
-        "capacity": "20 GiB",  # The total capacity.
-        "available": "16 GiB",  # The available capacity.
-        "leader_count": 17,
-        "leader_weight": 1,
-        "leader_score": 17,
-        "leader_size": 17,
-        "region_count": 17,
-        "region_weight": 1,
-        "region_score": 17,
-        "region_size": 17,
-        "start_ts": "2019-03-21T14:09:32+08:00",  # The starting timestamp.
-        "last_heartbeat_ts": "2019-03-21T14:14:22.961171958+08:00",  # The timestamp of the last heartbeat.
-        "uptime": "4m50.961171958s"
-      }
-    }
-  ]
-```
-
-## Use the metrics interface
-
-The metrics interface monitors the state and performance of the entire TiDB cluster.
-
-- If you use TiDB Ansible to deploy the TiDB cluster, the monitoring system (Prometheus and Grafana) is deployed at the same time.
-- If you use other deployment ways, [deploy Prometheus and Grafana](#deploy-prometheus-and-grafana) before using this interface.
-
-After Prometheus and Grafana are successfully deployed, [configure Grafana](#configure-grafana).
-
-### Deploy Prometheus and Grafana
+## Deploy Prometheus and Grafana

 Assume that the TiDB cluster topology is as follows:

@@ -95,7 +24,7 @@ Assume that the TiDB cluster topology is as follows:

 | Node5 | 192.168.199.117 | TiKV2, node_exporter |
 | Node6 | 192.168.199.118 | TiKV3, node_exporter |

-#### Step 1: Download the binary package
+### Step 1: Download the binary package

 {{< copyable "shell-regular" >}}

@@ -115,7 +44,7 @@ tar -xzf node_exporter-0.17.0.linux-amd64.tar.gz
 tar -xzf grafana-6.1.6.linux-amd64.tar.gz
 ```

-#### Step 2: Start `node_exporter` on Node1, Node2, Node3, and Node4
+### Step 2: Start `node_exporter` on Node1, Node2, Node3, and Node4

 {{< copyable "shell-regular" >}}

@@ -127,7 +56,7 @@ $ ./node_exporter --web.listen-address=":9100" \
     --log.level="info" &
 ```

-#### Step 3: Start Prometheus on Node1
+### Step 3: Start Prometheus on Node1

 Edit the Prometheus configuration file:

@@ -200,7 +129,7 @@ $ ./prometheus \
     --storage.tsdb.retention="15d" &
 ```

-#### Step 4: Start Grafana on Node1
+### Step 4: Start Grafana on Node1

 Edit the Grafana configuration file:

@@ -259,11 +188,11 @@ $ ./bin/grafana-server \
     --config="./conf/grafana.ini" &
 ```

-### Configure Grafana
+## Configure Grafana

 This section describes how to configure Grafana.

-#### Step 1: Add a Prometheus data source
+### Step 1: Add a Prometheus data source

 1. Log in to the Grafana Web interface.

@@ -288,7 +217,7 @@ This section describes how to configure Grafana.

 5. Click **Add** to save the new data source.

-#### Step 2: Import a Grafana dashboard
+### Step 2: Import a Grafana dashboard

 To import a Grafana dashboard for the PD server, the TiKV server, and the TiDB server, take the following steps respectively:

@@ -308,7 +237,7 @@ To import a Grafana dashboard for the PD server, the TiKV server, and the TiDB s

 6. Click **Import**. A Prometheus dashboard is imported.

-### View component metrics
+## View component metrics

 Click **New dashboard** in the top menu and choose the dashboard you want to view.
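+
+To verify the deployment, you can probe the started services from Node1 over HTTP. The following is a minimal sketch, not part of the original steps: it assumes Prometheus and Grafana listen on their default ports (9090 and 3000); both endpoints are standard Prometheus and Grafana APIs.
+
+```bash
+# Ask Prometheus which scrape targets it sees and whether each one is healthy.
+curl -s http://127.0.0.1:9090/api/v1/targets | grep -o '"health":"[^"]*"'
+
+# Ask Grafana whether it is up and its backing database is reachable.
+curl -s http://127.0.0.1:3000/api/health
+```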
diff --git a/grafana-overview-dashboard.md b/grafana-overview-dashboard.md
index 76178918a7308..130b805347fe1 100644
--- a/grafana-overview-dashboard.md
+++ b/grafana-overview-dashboard.md
@@ -75,4 +75,4 @@ System Info | IO Util | the disk usage ratio, 100% at a maximum; generally you n

 ## Interface of the Overview dashboard

-![Overview Dashboard](/media/overview.png)
+![overview](/media/grafana-monitor-overview.png)
diff --git a/media/grafana-monitor-overview.png b/media/grafana-monitor-overview.png
new file mode 100644
index 0000000000000..a3a5a3bb1222a
Binary files /dev/null and b/media/grafana-monitor-overview.png differ
diff --git a/media/grafana-monitored-groups.png b/media/grafana-monitored-groups.png
new file mode 100644
index 0000000000000..5ce778446a19d
Binary files /dev/null and b/media/grafana-monitored-groups.png differ
diff --git a/media/overview.png b/media/overview.png
deleted file mode 100644
index 8a665fb4d82e8..0000000000000
Binary files a/media/overview.png and /dev/null differ
diff --git a/mysql-compatibility.md b/mysql-compatibility.md
index 4f6bfa32b2933..6a7766de860a3 100644
--- a/mysql-compatibility.md
+++ b/mysql-compatibility.md
@@ -76,7 +76,7 @@ mysql> select _tidb_rowid, id from t;

 ### Performance schema

-TiDB uses a combination of Prometheus and Grafana to store and query the performance monitoring metrics. Some performance schema tables return empty results in TiDB.
+TiDB uses a combination of [Prometheus and Grafana](/tidb-monitoring-api.md) to store and query the performance monitoring metrics. Some performance schema tables return empty results in TiDB.

 ### Query Execution Plan

diff --git a/production-deployment-from-binary-tarball.md b/production-deployment-from-binary-tarball.md
index 3de41ecdecab8..62f311c9a4192 100644
--- a/production-deployment-from-binary-tarball.md
+++ b/production-deployment-from-binary-tarball.md
@@ -205,4 +205,4 @@ Follow the steps below to start PD, TiKV, and TiDB:
 > - To tune TiKV, see [Performance Tuning for TiKV](/tune-tikv-performance.md).
 > - If you use `nohup` to start the cluster in the production environment, write the startup commands in a script and then run the script. If not, the `nohup` process might abort because it receives exceptions when the Shell command exits. For more information, see [The TiDB/TiKV/PD process aborts unexpectedly](/troubleshoot-tidb-cluster.md#the-tidbtikvpd-process-aborts-unexpectedly).

-For the deployment and use of TiDB monitoring services, see [Monitor a TiDB Cluster](/monitor-a-tidb-cluster.md).
+For the deployment and use of TiDB monitoring services, see [Deploy Monitoring Services for the TiDB Cluster](/deploy-monitoring-services.md) and [TiDB Monitoring API](/tidb-monitoring-api.md).
diff --git a/tidb-monitoring-api.md b/tidb-monitoring-api.md
new file mode 100644
index 0000000000000..a66d9ff9decb7
--- /dev/null
+++ b/tidb-monitoring-api.md
@@ -0,0 +1,81 @@
+---
+title: TiDB Monitoring API
+summary: Learn the API of TiDB monitoring services.
+category: how-to
+---
+
+# TiDB Monitoring API
+
+You can use the following two types of interfaces to monitor the TiDB cluster state:
+
+- [The state interface](#use-the-state-interface): this interface uses the HTTP interface to get the component information.
+- [The metrics interface](#use-the-metrics-interface): this interface uses Prometheus to record the detailed information of the various operations in components. You can view these metrics using Grafana.
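+
+The sections below show `curl` examples for the state interface. The metrics interface can be probed the same way. The following sketch assumes the default ports and that each component serves Prometheus-format metrics under the conventional `/metrics` path:
+
+```bash
+curl http://127.0.0.1:10080/metrics   # TiDB metrics, on the same status port as /status
+curl http://127.0.0.1:2379/metrics    # PD metrics
+curl http://127.0.0.1:20180/metrics   # TiKV metrics, assuming the default status port 20180
+```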
+
+## Use the state interface
+
+The state interface monitors the basic information of a specific component in the TiDB cluster. It can also act as the monitor interface for Keepalive messages. In addition, the state interface for the Placement Driver (PD) can get the details of the entire TiKV cluster.
+
+### TiDB server
+
+- TiDB API address: `http://${host}:${port}`
+- Default port: `10080`
+- Details about API names: see [TiDB HTTP API](https://github.com/pingcap/tidb/blob/master/docs/tidb_http_api.md)
+
+The following example uses `http://${host}:${port}/status` to get the current state of the TiDB server and to determine whether the server is alive. The result is returned in JSON format.
+
+```bash
+curl http://127.0.0.1:10080/status
+{
+  connections: 0,  # The current number of clients connected to the TiDB server.
+  version: "5.7.25-TiDB-v3.0.0-beta-250-g778c3f4a5",  # The TiDB version number.
+  git_hash: "778c3f4a5a716880bcd1d71b257c8165685f0d70"  # The Git Hash of the current TiDB code.
+}
+```
+
+### PD server
+
+- PD API address: `http://${host}:${port}/pd/api/v1/${api_name}`
+- Default port: `2379`
+- Details about API names: see [PD API doc](https://download.pingcap.com/pd-api-v1.html)
+
+The PD interface provides the state of all the TiKV servers and the information about load balancing. See the following example for the information about a single-node TiKV cluster:
+
+```bash
+curl http://127.0.0.1:2379/pd/api/v1/stores
+{
+  "count": 1,  # The number of TiKV nodes.
+  "stores": [  # The list of TiKV nodes.
+    # The details about the single TiKV node.
+    {
+      "store": {
+        "id": 1,
+        "address": "127.0.0.1:20160",
+        "version": "3.0.0-beta",
+        "state_name": "Up"
+      },
+      "status": {
+        "capacity": "20 GiB",  # The total capacity.
+        "available": "16 GiB",  # The available capacity.
+        "leader_count": 17,
+        "leader_weight": 1,
+        "leader_score": 17,
+        "leader_size": 17,
+        "region_count": 17,
+        "region_weight": 1,
+        "region_score": 17,
+        "region_size": 17,
+        "start_ts": "2019-03-21T14:09:32+08:00",  # The starting timestamp.
+        "last_heartbeat_ts": "2019-03-21T14:14:22.961171958+08:00",  # The timestamp of the last heartbeat.
+        "uptime": "4m50.961171958s"
+      }
+    }
+  ]
+}
+```
+
+## Use the metrics interface
+
+The metrics interface monitors the state and performance of the entire TiDB cluster.
+
+- If you use TiDB Ansible to deploy the TiDB cluster, the monitoring system (Prometheus and Grafana) is deployed at the same time.
+- If you use other deployment methods, [deploy Prometheus and Grafana](/deploy-monitoring-services.md) before using this interface.
+
+After Prometheus and Grafana are successfully deployed, [configure Grafana](/deploy-monitoring-services.md#configure-grafana).
diff --git a/tidb-monitoring-framework.md b/tidb-monitoring-framework.md
index 3a938e4232e26..55ddd68d68afd 100644
--- a/tidb-monitoring-framework.md
+++ b/tidb-monitoring-framework.md
@@ -27,4 +27,27 @@ The diagram is as follows:

 Grafana is an open source project for analyzing and visualizing metrics. TiDB uses Grafana to display the performance metrics as follows:

-![screenshot](/media/grafana-screenshot.png)
+![Grafana monitored_groups](/media/grafana-monitored-groups.png)
+
+- {TiDB_Cluster_name}-Backup-Restore: Monitoring metrics related to backup and restore.
+- {TiDB_Cluster_name}-Binlog: Monitoring metrics related to TiDB Binlog.
+- {TiDB_Cluster_name}-Blackbox_exporter: Monitoring metrics related to network probes.
+- {TiDB_Cluster_name}-Disk-Performance: Monitoring metrics related to disk performance.
+- {TiDB_Cluster_name}-Kafka-Overview: Monitoring metrics related to Kafka.
+- {TiDB_Cluster_name}-Lightning: Monitoring metrics related to TiDB Lightning.
+- {TiDB_Cluster_name}-Node_exporter: Monitoring metrics related to the operating system.
+- {TiDB_Cluster_name}-Overview: Monitoring overview related to important components.
+- {TiDB_Cluster_name}-PD: Monitoring metrics related to the PD server.
+- {TiDB_Cluster_name}-Performance-Read: Monitoring metrics related to read performance.
+- {TiDB_Cluster_name}-Performance-Write: Monitoring metrics related to write performance.
+- {TiDB_Cluster_name}-TiDB: Detailed monitoring metrics related to the TiDB server.
+- {TiDB_Cluster_name}-TiDB-Summary: Monitoring overview related to TiDB.
+- {TiDB_Cluster_name}-TiFlash-Proxy-Summary: Monitoring overview of the proxy server that is used to replicate data to TiFlash.
+- {TiDB_Cluster_name}-TiFlash-Summary: Monitoring overview related to TiFlash.
+- {TiDB_Cluster_name}-TiKV-Details: Detailed monitoring metrics related to the TiKV server.
+- {TiDB_Cluster_name}-TiKV-Summary: Monitoring overview related to the TiKV server.
+- {TiDB_Cluster_name}-TiKV-Trouble-Shooting: Monitoring metrics related to TiKV error diagnostics.
+
+Each group has multiple panel labels of monitoring metrics, and each panel contains detailed information of multiple monitoring metrics. For example, the **Overview** monitoring group has five panel labels, and each label corresponds to a monitoring panel. See the following UI:
+
+![Grafana Overview](/media/grafana-monitor-overview.png)
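+
+To list the imported dashboard groups from the command line rather than the Grafana UI, you can query the Grafana search API. This is a sketch only: it assumes Grafana runs locally on its default port 3000 and still uses the default `admin:admin` credentials.
+
+```bash
+# Print the title of every dashboard known to this Grafana instance.
+curl -s -u admin:admin http://127.0.0.1:3000/api/search | grep -o '"title":"[^"]*"'
+```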