From 87a166129e52e77df6bfe009501cf93fde5135e6 Mon Sep 17 00:00:00 2001 From: TomShawn <1135243111@qq.com> Date: Sun, 12 Apr 2020 22:21:56 +0800 Subject: [PATCH 1/5] reference: add 3 metrics system tables --- TOC.md | 3 + reference/system-databases/metrics-schema.md | 179 +++++++++++++++++ reference/system-databases/metrics-summary.md | 189 ++++++++++++++++++ reference/system-databases/metrics-tables.md | 35 ++++ reference/system-databases/sql-diagnosis.md | 8 +- 5 files changed, 410 insertions(+), 4 deletions(-) create mode 100644 reference/system-databases/metrics-schema.md create mode 100644 reference/system-databases/metrics-summary.md create mode 100644 reference/system-databases/metrics-tables.md diff --git a/TOC.md b/TOC.md index e071510e489bd..b663a7ed9f140 100644 --- a/TOC.md +++ b/TOC.md @@ -277,6 +277,9 @@ - [`cluster_info`](/reference/system-databases/cluster-info.md) - [`cluster_hardware`](/reference/system-databases/cluster-hardware.md) - [`cluster_config`](/reference/system-databases/cluster-config.md) + - [`metrics_schema`](/reference/system-databases/metrics-schema.md) + - [`metrics_tables`](/reference/system-databases/metrics-tables.md) + - [`metrics_summary`](/reference/system-databases/metrics-summary.md) - [Errors Codes](/reference/error-codes.md) - [Supported Client Drivers](/reference/supported-clients.md) + Garbage Collection (GC) diff --git a/reference/system-databases/metrics-schema.md b/reference/system-databases/metrics-schema.md new file mode 100644 index 0000000000000..3972bfdacd33b --- /dev/null +++ b/reference/system-databases/metrics-schema.md @@ -0,0 +1,179 @@ +--- +title: METRICS_SCHEMA +summary: Learn the `METRICS_SCHEMA` system table. +category: reference +--- + +# Metrics Schema + +To dynamically observe and compare cluster conditions in different time periods, the SQL diagnosis system introduces cluster monitoring system tables. All monitoring tables are in `metrics_schema`, and you can query the monitoring information using SQL statements. In fact, the data of the three monitoring-related summary tables ([`metrics_summary`](/reference/system-databases/metrics-summary.md), [`metrics_summary_by_label`](/reference/system-databases/metrics-summary.md), and `inspection_result`) are obtained by querying the monitoring tables in `metrics_schema`. Currently, many system tables are added and you can query the information of these tables through the [`information_schema.metrics_tables`](/reference/system-databases/metrics-tables.md) table. + +## Overview + +The following example uses the `tidb_query_duration` table to introduce the usage and working principles of the monitoring table. The working principles of other monitoring tables are similar. 
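+
+Before drilling into one table, you can first list the monitoring tables available in this schema. The following is a minimal check (the exact table list depends on the TiDB version and deployment):
+
+{{< copyable "sql" >}}
+
+```sql
+use metrics_schema;
+show tables;
+```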
+ +Query the information related to the `tidb_query_duration` table on `information_schema.metrics_tables`: + +{{< copyable "sql" >}} + +```sql +select * from information_schema.metrics_tables where table_name='tidb_query_duration'; +``` + +``` ++---------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+----------+----------------------------------------------+ +| TABLE_NAME | PROMQL | LABELS | QUANTILE | COMMENT | ++---------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+----------+----------------------------------------------+ +| tidb_query_duration | histogram_quantile($QUANTILE, sum(rate(tidb_server_handle_query_duration_seconds_bucket{$LABEL_CONDITIONS}[$RANGE_DURATION])) by (le,sql_type,instance)) | instance,sql_type | 0.9 | The quantile of TiDB query durations(second) | ++---------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+----------+----------------------------------------------+ +``` + +Field description: + +* `TABLE_NAME`: Corresponds to the table name in `metrics_schema`. In this example, the table name is `tidb_query_duration`. +* `PROMQL`: The working principle of the monitoring table is to map SQL statements to `PromQL` and convert Prometheus results into SQL query results. This field is the expression template of `PromQL`. When getting the data of the monitoring table, the query conditions are used to rewrite the variables in this template to generate the final query expression. +* `LABELS`: The label for the monitoring item. `tidb_query_duration` has two labels: `instance` and `sql_type`. +* `QUANTILE`: The percentile. For monitoring data of the histogram type, specify a default percentile. If the value of this field is `0`, it means that the monitoring item corresponding to the monitoring table is not a histogram. +* `COMMENT`: The comment for the monitoring table. You can see that the `tidb_query_duration` table is used to query the percentile time of the TiDB query execution, such as the query time of P999/P99/P90. The unit is second. + +The structure of the `tidb_query_duration` table is queried as follows: + +{{< copyable "sql" >}} + +```sql +show create table metrics_schema.tidb_query_duration; +``` + +``` ++---------------------+--------------------------------------------------------------------------------------------------------------------+ +| Table | Create Table | ++---------------------+--------------------------------------------------------------------------------------------------------------------+ +| tidb_query_duration | CREATE TABLE `tidb_query_duration` ( | +| | `time` datetime unsigned DEFAULT CURRENT_TIMESTAMP, | +| | `instance` varchar(512) DEFAULT NULL, | +| | `sql_type` varchar(512) DEFAULT NULL, | +| | `quantile` double unsigned DEFAULT '0.9', | +| | `value` double unsigned DEFAULT NULL | +| | ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin COMMENT='The quantile of TiDB query durations(second)' | ++---------------------+--------------------------------------------------------------------------------------------------------------------+ +``` + +* `time`: The time of the monitoring item. 
+* `instance` and `sql_type`: The labels of the `tidb_query_duration` monitoring item. `instance` means the monitoring address. `sql_type` means the type of the executed SQL statement. +* `quantile`: The percentile. The monitoring item of the Histogram type has this column, which indicates the percentile time of the query. For example, `quantile = 0.9` means to query the time of P90. +* `value`: The value of the monitoring item. + +The following statement queries the P99 time within the range of [`2020-03-25 23:40:00`, `2020-03-25 23:42:00`]. + +{{< copyable "sql" >}} + +```sql +select * from metrics_schema.tidb_query_duration where value is not null and time>='2020-03-25 23:40:00' and time <= '2020-03-25 23:42:00' and quantile=0.99; +``` + +``` ++---------------------+-------------------+----------+----------+----------------+ +| time | instance | sql_type | quantile | value | ++---------------------+-------------------+----------+----------+----------------+ +| 2020-03-25 23:40:00 | 172.16.5.40:10089 | Insert | 0.99 | 0.509929485256 | +| 2020-03-25 23:41:00 | 172.16.5.40:10089 | Insert | 0.99 | 0.494690793986 | +| 2020-03-25 23:42:00 | 172.16.5.40:10089 | Insert | 0.99 | 0.493460506934 | +| 2020-03-25 23:40:00 | 172.16.5.40:10089 | Select | 0.99 | 0.152058493415 | +| 2020-03-25 23:41:00 | 172.16.5.40:10089 | Select | 0.99 | 0.152193879678 | +| 2020-03-25 23:42:00 | 172.16.5.40:10089 | Select | 0.99 | 0.140498483232 | +| 2020-03-25 23:40:00 | 172.16.5.40:10089 | internal | 0.99 | 0.47104 | +| 2020-03-25 23:41:00 | 172.16.5.40:10089 | internal | 0.99 | 0.11776 | +| 2020-03-25 23:42:00 | 172.16.5.40:10089 | internal | 0.99 | 0.11776 | ++---------------------+-------------------+----------+----------+----------------+ +``` + +The first row of the above query result means that at the time of 2020-03-25 23:40:00, on the TiDB instance `172.16.5.40:10089`, the P99 execution time of the `Insert` type statement is 0.509929485256 seconds. The meanings of other rows are similar. Other values of the `sql_type` column is described as follows: + +* `Select`: The `select` type statement is executed. +* `internal`: The internal SQL statement of TiDB, which is used to update the statistical information and get the global variables. 
+ +The execution plan of the above statement is as follows: + +{{< copyable "sql" >}} + +```sql +desc select * from metrics_schema.tidb_query_duration where value is not null and time>='2020-03-25 23:40:00' and time <= '2020-03-25 23:42:00' and quantile=0.99; +``` + +``` ++------------------+----------+------+---------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| id | estRows | task | access object | operator info | ++------------------+----------+------+---------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| Selection_5 | 8000.00 | root | | not(isnull(Column#5)) | +| └─MemTableScan_6 | 10000.00 | root | table:tidb_query_duration | PromQL:histogram_quantile(0.99, sum(rate(tidb_server_handle_query_duration_seconds_bucket{}[60s])) by (le,sql_type,instance)), start_time:2020-03-25 23:40:00, end_time:2020-03-25 23:42:00, step:1m0s | ++------------------+----------+------+---------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +``` + +From the above result, you can see that `PromQL`, `start_time`, `end_time`, and the value of `step`. During actual execution, TiDB calls the `query_range` HTTP API interface of Prometheus to query the monitoring data. + +You might find that during the range of [`2020-03-25 23:40:00`, `2020-03-25 23:42:00`], each label only has three time values. In the execution plan, the value of `step` is 1 minute, which is determined by the following two variables: + +* `tidb_metric_query_step`: The resolution step of the query. To get the `query_range` data from Prometheus, you need to specify `start`, `end`, and `step`. `step` uses the value of this variable. +* `tidb_metric_query_range_duration`: When querying the monitoring, the `$ RANGE_DURATION` field in `PROMQL` is replaced with the value of this variable. The default value is 60 seconds. + +To view the values of monitoring items with different granularities, you can modify the above two session variables before querying the monitoring table. For example: + +1. Modify the values of the two session variables and set the time granularity to 30 seconds. + + > **Note:** + > + > The minimum granularity supported by Prometheus is 30 seconds. + + {{< copyable "sql" >}} + + ```sql + set @@tidb_metric_query_step=30; + set @@tidb_metric_query_range_duration=30; + ``` + +2. Query the `tidb_query_duration` monitoring item as follows. From the result, you can see that within the 3-minute time range, each label has 6 time values, and the interval between each value is 30 seconds. 
+ + {{< copyable "sql" >}} + + ```sql + select * from metrics_schema.tidb_query_duration where value is not null and time>='2020-03-25 23:40:00' and time <= '2020-03-25 23:42:00' and quantile=0.99; + ``` + + ``` + +---------------------+-------------------+----------+----------+-----------------+ + | time | instance | sql_type | quantile | value | + +---------------------+-------------------+----------+----------+-----------------+ + | 2020-03-25 23:40:00 | 172.16.5.40:10089 | Insert | 0.99 | 0.483285651924 | + | 2020-03-25 23:40:30 | 172.16.5.40:10089 | Insert | 0.99 | 0.484151462113 | + | 2020-03-25 23:41:00 | 172.16.5.40:10089 | Insert | 0.99 | 0.504576 | + | 2020-03-25 23:41:30 | 172.16.5.40:10089 | Insert | 0.99 | 0.493577384561 | + | 2020-03-25 23:42:00 | 172.16.5.40:10089 | Insert | 0.99 | 0.49482474311 | + | 2020-03-25 23:40:00 | 172.16.5.40:10089 | Select | 0.99 | 0.189253402185 | + | 2020-03-25 23:40:30 | 172.16.5.40:10089 | Select | 0.99 | 0.184224951851 | + | 2020-03-25 23:41:00 | 172.16.5.40:10089 | Select | 0.99 | 0.151673410553 | + | 2020-03-25 23:41:30 | 172.16.5.40:10089 | Select | 0.99 | 0.127953838989 | + | 2020-03-25 23:42:00 | 172.16.5.40:10089 | Select | 0.99 | 0.127455434547 | + | 2020-03-25 23:40:00 | 172.16.5.40:10089 | internal | 0.99 | 0.0624 | + | 2020-03-25 23:40:30 | 172.16.5.40:10089 | internal | 0.99 | 0.12416 | + | 2020-03-25 23:41:00 | 172.16.5.40:10089 | internal | 0.99 | 0.0304 | + | 2020-03-25 23:41:30 | 172.16.5.40:10089 | internal | 0.99 | 0.06272 | + | 2020-03-25 23:42:00 | 172.16.5.40:10089 | internal | 0.99 | 0.0629333333333 | + +---------------------+-------------------+----------+----------+-----------------+ + ``` + +3. View the execution plan. From the result, you can also see that the values of `PromQL` and `step` in the execution plan have been changed to 30 seconds. 
+ + {{< copyable "sql" >}} + + ```sql + desc select * from metrics_schema.tidb_query_duration where value is not null and time>='2020-03-25 23:40:00' and time <= '2020-03-25 23:42:00' and quantile=0.99; + ``` + + ``` + +------------------+----------+------+---------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | id | estRows | task | access object | operator info | + +------------------+----------+------+---------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Selection_5 | 8000.00 | root | | not(isnull(Column#5)) | + | └─MemTableScan_6 | 10000.00 | root | table:tidb_query_duration | PromQL:histogram_quantile(0.99, sum(rate(tidb_server_handle_query_duration_seconds_bucket{}[30s])) by (le,sql_type,instance)), start_time:2020-03-25 23:40:00, end_time:2020-03-25 23:42:00, step:30s | + +------------------+----------+------+---------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + ``` diff --git a/reference/system-databases/metrics-summary.md b/reference/system-databases/metrics-summary.md new file mode 100644 index 0000000000000..5b4bf80b1e605 --- /dev/null +++ b/reference/system-databases/metrics-summary.md @@ -0,0 +1,189 @@ +--- +title: METRICS_SUMMARY +summary: Learn the `METRICS_SUMMARY` system table. +category: reference +--- + +# METRICS_SUMMARY + +Because the TiDB cluster has many monitoring metrics, the SQL diagnosis system also provides the following two monitoring summary tables for you to easily find abnormal monitoring items: + +* `information_schema.metrics_summary` +* `information_schema.metrics_summary_by_label` + +The two tables summarize all monitoring data to for you to check each monitoring metric with higher efficiency. Compare to `information_schema.metrics_summary`, the `information_schema.metrics_summary_by_label` table has an additional `label` column and performs differentiated statistics according to different labels. + +{{< copyable "sql" >}} + +```sql +mysql> desc metrics_summary; +``` + +``` ++--------------+-----------------------+------+------+---------+-------+ +| Field | Type | Null | Key | Default | Extra | ++--------------+-----------------------+------+------+---------+-------+ +| METRICS_NAME | varchar(64) | YES | | NULL | | +| QUANTILE | double unsigned | YES | | NULL | | +| SUM_VALUE | double(22,6) unsigned | YES | | NULL | | +| AVG_VALUE | double(22,6) unsigned | YES | | NULL | | +| MIN_VALUE | double(22,6) unsigned | YES | | NULL | | +| MAX_VALUE | double(22,6) unsigned | YES | | NULL | | +| COMMENT | varchar(256) | YES | | NULL | | ++--------------+-----------------------+------+------+---------+-------+ +``` + +Field description: + +* `METRICS_NAME`: The monitoring table name. +* `QUANTILE`: The percentile. You can specify `QUANTILE` using SQL statements. For example: + * `select * from metrics_summary where quantile=0.99` specifies viewing the data of the 0.99 percentile. + * `select * from metrics_summary where quantile in (0.80, 0.99, 0.99, 0.999)` specifies viewing the data of the 0.8, 0.99, 0.99, 0.999 percentiles at the same time. 
+* `SUM_VALUE, AVG_VALUE, MIN_VALUE, and MAX_VALUE` respectively mean the sum, the average value, the minimum value, and the maximum value. +* `COMMENT`: The comment for the corresponding comment, + +For example: + +To query the three groups of monitoring items with the highest average time consumption in the TiDB cluster in the time range of `'2020-03-08 13:23:00', '2020-03-08 13: 33: 00'`, you can directly query the `information_schema.metrics_summary` table and use the `/*+ time_range() */` hint to specify the time range. The SQL statement is built as follows: + +{{< copyable "sql" >}} + +```sql +select /*+ time_range('2020-03-08 13:23:00','2020-03-08 13:33:00') */ * +from information_schema.`METRICS_SUMMARY` +where metrics_name like 'tidb%duration' + and avg_value > 0 + and quantile = 0.99 +order by avg_value desc +limit 3\G +``` + +``` +***************************[ 1. row ]*************************** +METRICS_NAME | tidb_get_token_duration +QUANTILE | 0.99 +SUM_VALUE | 8.972509 +AVG_VALUE | 0.996945 +MIN_VALUE | 0.996515 +MAX_VALUE | 0.997458 +COMMENT | The quantile of Duration (us) for getting token, it should be small until concurrency limit is reached(second) +***************************[ 2. row ]*************************** +METRICS_NAME | tidb_query_duration +QUANTILE | 0.99 +SUM_VALUE | 0.269079 +AVG_VALUE | 0.007272 +MIN_VALUE | 0.000667 +MAX_VALUE | 0.01554 +COMMENT | The quantile of TiDB query durations(second) +***************************[ 3. row ]*************************** +METRICS_NAME | tidb_kv_request_duration +QUANTILE | 0.99 +SUM_VALUE | 0.170232 +AVG_VALUE | 0.004601 +MIN_VALUE | 0.000975 +MAX_VALUE | 0.013 +COMMENT | The quantile of kv requests durations by store +``` + +Similarly, here is an example of querying the `metrics_summary_by_label` monitoring summary table. + +{{< copyable "sql" >}} + +```sql +select /*+ time_range('2020-03-08 13:23:00','2020-03-08 13:33:00') */ * +from information_schema.`METRICS_SUMMARY_BY_LABEL` +where metrics_name like 'tidb%duration' + and avg_value > 0 + and quantile = 0.99 +order by avg_value desc +limit 10\G +``` + +``` +***************************[ 1. row ]*************************** +INSTANCE | 172.16.5.40:10089 +METRICS_NAME | tidb_get_token_duration +LABEL | +QUANTILE | 0.99 +SUM_VALUE | 8.972509 +AVG_VALUE | 0.996945 +MIN_VALUE | 0.996515 +MAX_VALUE | 0.997458 +COMMENT | The quantile of Duration (us) for getting token, it should be small until concurrency limit is reached(second) +***************************[ 2. row ]*************************** +INSTANCE | 172.16.5.40:10089 +METRICS_NAME | tidb_query_duration +LABEL | Select +QUANTILE | 0.99 +SUM_VALUE | 0.072083 +AVG_VALUE | 0.008009 +MIN_VALUE | 0.007905 +MAX_VALUE | 0.008241 +COMMENT | The quantile of TiDB query durations(second) +***************************[ 3. row ]*************************** +INSTANCE | 172.16.5.40:10089 +METRICS_NAME | tidb_query_duration +LABEL | Rollback +QUANTILE | 0.99 +SUM_VALUE | 0.072083 +AVG_VALUE | 0.008009 +MIN_VALUE | 0.007905 +MAX_VALUE | 0.008241 +COMMENT | The quantile of TiDB query durations(second) +``` + +The second and third lines of the above query results indicate that the `Select` and `Rollback` statements on `tidb_query_duration` have a long average execution time. + +In addition to the above example, you can use the monitoring summary table to quickly find the module with the largest change from the monitoring data by comparing the full link monitoring items of the two time periods, and quickly locate the bottleneck. 
The following example compares all monitoring items in two periods (where `t1` is the baseline) and sorts these items according to the greatest difference: + +* Period t1:`("2020-03-03 17:08:00", "2020-03-03 17:11:00")` +* Period t2:`("2020-03-03 17:18:00", "2020-03-03 17:21:00")` + +The monitoring items of the two time periods are joined according to `METRICS_NAME` and sorted according to the difference value. `TIME_RANGE` is the hint that specifies the quey time. + +{{< copyable "sql" >}} + +```sql +SELECT GREATEST(t1.avg_value,t2.avg_value)/LEAST(t1.avg_value, + t2.avg_value) AS ratio, + t1.metrics_name, + t1.avg_value as t1_avg_value, + t2.avg_value as t2_avg_value, + t2.comment +FROM + (SELECT /*+ time_range("2020-03-03 17:08:00", "2020-03-03 17:11:00")*/ * + FROM information_schema.metrics_summary ) t1 +JOIN + (SELECT /*+ time_range("2020-03-03 17:18:00", "2020-03-03 17:21:00")*/ * + FROM information_schema.metrics_summary ) t2 + ON t1.metrics_name = t2.metrics_name +ORDER BY ratio DESC limit 10; +``` + +``` ++----------------+------------------------------------------+----------------+------------------+---------------------------------------------------------------------------------------------+ +| ratio | metrics_name | t1_avg_value | t2_avg_value | comment | ++----------------+------------------------------------------+----------------+------------------+---------------------------------------------------------------------------------------------+ +| 5865.59537065 | tidb_slow_query_cop_process_total_time | 0.016333 | 95.804724 | The total time of TiDB slow query statistics with slow query total cop process time(second) | +| 3648.74109023 | tidb_distsql_partial_scan_key_total_num | 10865.666667 | 39646004.4394 | The total num of distsql partial scan key numbers | +| 267.002351165 | tidb_slow_query_cop_wait_total_time | 0.003333 | 0.890008 | The total time of TiDB slow query statistics with slow query total cop wait time(second) | +| 192.43267836 | tikv_cop_total_response_total_size | 2515333.66667 | 484032394.445 | | +| 192.43267836 | tikv_cop_total_response_size_per_seconds | 41922.227778 | 8067206.57408 | | +| 152.780296296 | tidb_distsql_scan_key_total_num | 5304.333333 | 810397.618317 | The total num of distsql scan numbers | +| 126.042290167 | tidb_distsql_execution_total_time | 0.421622 | 53.142143 | The total time of distsql execution(second) | +| 105.164020657 | tikv_cop_scan_details | 134.450733 | 14139.379665 | | +| 105.164020657 | tikv_cop_scan_details_total | 8067.043981 | 848362.77991 | | +| 101.635495394 | tikv_cop_scan_keys_num | 1070.875 | 108838.91113 | | ++----------------+------------------------------------------+----------------+------------------+---------------------------------------------------------------------------------------------+ +``` + +From the above query result: + +* `tib_slow_query_cop_process_total_time` (the time consumption of `cop process` in TiDB slow queries) in the period t2 is 5,865 times higher than that in period t1. +* `tidb_distsql_partial_scan_key_total_num` (the number of keys to scan requested by TiDB’s `distsql`) in period t2 is 3,648 times higher than that in period t1. During period t2, `tidb_slow_query_cop_wait_total_time` (the time of cop in the TiDB slow query requesting to queue up) is 267 times higher than that in period t1. +* `tikv_cop_total_response_size` (the size of the TiKV cop request result) in period t2 is 192 times higher than that in period t1. 
+* `tikv_cop_scan_details` in period t2 (the scan requested by the TiKV cop) is 105 times higher than that in period t1. + +From the above result, you can see that the cop request in period t2 is much higher than period t1, which causes the TiKV Coprocessor to be overloaded, and there is a `cop task` waiting. It might be that some large queries appear in period t2 that bring more load. + +In fact, during the entire time period from t1 to t2, the `go-ycsb` pressure test is run. Then 20 `tpch` queries are run during period t2, so it is the `tpch` queries that cause many cop requests. diff --git a/reference/system-databases/metrics-tables.md b/reference/system-databases/metrics-tables.md new file mode 100644 index 0000000000000..18d66cb9f3b80 --- /dev/null +++ b/reference/system-databases/metrics-tables.md @@ -0,0 +1,35 @@ +--- +title: METRICS_TABLES +summary: Learn the `METRICS_TABLES` system table. +category: reference +--- + +# METRICS_TABLES + +The `METRICS_TABLES` table provides information of all monitoring tables in [`metrics_schema`](/reference/system-databases/metrics-schema.md). + +{{< copyable "sql" >}} + +```sql +desc metrics_tables; +``` + +``` ++------------+-----------------+------+------+---------+-------+ +| Field | Type | Null | Key | Default | Extra | ++------------+-----------------+------+------+---------+-------+ +| TABLE_NAME | varchar(64) | YES | | NULL | | +| PROMQL | varchar(64) | YES | | NULL | | +| LABELS | varchar(64) | YES | | NULL | | +| QUANTILE | double unsigned | YES | | NULL | | +| COMMENT | varchar(256) | YES | | NULL | | ++------------+-----------------+------+------+---------+-------+ +``` + +Field description: + +* `TABLE_NAME`: Corresponds to the table name in `metrics_schema`. +* `PROMQL`: The working principle of the monitoring table is to map SQL statements to `PromQL` and convert Prometheus results into SQL query results. This field is the expression template of `PromQL`. When getting the data of the monitoring table, the query conditions are used to rewrite the variables in this template to generate the final query expression. +* `LABELS`: The label for the monitoring item. Each label corresponds to a column in the monitoring table. If the SQL statement contains filter of the corresponding column, the corresponding `PromQL` changes accordingly. +* `QUANTILE`: The percentile. For monitoring data of the histogram type, specify a default percentile. If the value of this field is `0`, it means that the monitoring item corresponding to the monitoring table is not a histogram. +* `COMMENT`: The comment for the monitoring table. diff --git a/reference/system-databases/sql-diagnosis.md b/reference/system-databases/sql-diagnosis.md index 6765bf556f496..54e171d035753 100644 --- a/reference/system-databases/sql-diagnosis.md +++ b/reference/system-databases/sql-diagnosis.md @@ -40,14 +40,14 @@ On the system tables earlier than TiDB v4.0, you can only view the current node. ## Cluster monitoring tables -To dynamically observe and compare cluster conditions in different time periods, the SQL diagnosis system introduces cluster monitoring system tables. All monitoring tables are in `metrics_schema`, and you can query the monitoring information SQL statements. Using this method, you can perform correlated queries on all monitoring information of the entire cluster and compare the results of different time periods to quickly identify performance bottlenecks. 
+To dynamically observe and compare cluster conditions in different time periods, the SQL diagnosis system introduces cluster monitoring system tables. All monitoring tables are in `metrics_schema`, and you can query the monitoring information using SQL statements. Using this method, you can perform correlated queries on all monitoring information of the entire cluster and compare the results of different time periods to quickly identify performance bottlenecks.
 
-+ `information_schema.metrics_tables`: Because many system tables exist now, you can query meta-information of these monitoring tables on the `information_schema.metrics_tables` table.
++ [`information_schema.metrics_tables`](/reference/system-databases/metrics-tables.md): Because many system tables exist now, you can query meta-information of these monitoring tables on the `information_schema.metrics_tables` table.
 
 Because the TiDB cluster has many monitoring metrics, TiDB provides the following monitoring summary tables in v4.0:
 
-+ The monitoring summary table `information_schema.metrics_summary` summarizes all monitoring data to for you to check each monitoring metric with higher efficiency.
-+ The monitoring summary table `information_schema.metrics_summary_by_label` also summarizes all monitoring data, but this table performs differentiated statistics according to different labels.
++ The monitoring summary table [`information_schema.metrics_summary`](/reference/system-databases/metrics-summary.md) summarizes all monitoring data for you to check each monitoring metric with higher efficiency.
++ The monitoring summary table [`information_schema.metrics_summary_by_label`](/reference/system-databases/metrics-summary.md) also summarizes all monitoring data, but this table performs differentiated statistics according to different labels.
 
 ## Automatic diagnosis

From a885aae88cc4b28f9208b13678b86e0a3ff5ce8e Mon Sep 17 00:00:00 2001
From: TomShawn <1135243111@qq.com>
Date: Tue, 14 Apr 2020 14:25:52 +0800
Subject: [PATCH 2/5] refine language

---
 reference/system-databases/metrics-schema.md  | 12 +++++------
 reference/system-databases/metrics-summary.md | 20 +++++++++----------
 reference/system-databases/metrics-tables.md  |  4 ++--
 3 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/reference/system-databases/metrics-schema.md b/reference/system-databases/metrics-schema.md
index 3972bfdacd33b..6d05d9154a08b 100644
--- a/reference/system-databases/metrics-schema.md
+++ b/reference/system-databases/metrics-schema.md
@@ -1,12 +1,12 @@
 ---
-title: METRICS_SCHEMA
-summary: Learn the `METRICS_SCHEMA` system table.
+title: Metrics Schema
+summary: Learn the `METRICS_SCHEMA` schema.
 category: reference
 ---
 
 # Metrics Schema
 
-To dynamically observe and compare cluster conditions in different time periods, the SQL diagnosis system introduces cluster monitoring system tables. All monitoring tables are in `metrics_schema`, and you can query the monitoring information using SQL statements. In fact, the data of the three monitoring-related summary tables ([`metrics_summary`](/reference/system-databases/metrics-summary.md), [`metrics_summary_by_label`](/reference/system-databases/metrics-summary.md), and `inspection_result`) are obtained by querying the monitoring tables in `metrics_schema`. Currently, many system tables are added and you can query the information of these tables through the [`information_schema.metrics_tables`](/reference/system-databases/metrics-tables.md) table.
+To dynamically observe and compare cluster conditions of different time periods, the SQL diagnosis system introduces cluster monitoring system tables. All monitoring tables are in the metrics schema, and you can query the monitoring information using SQL statements in this schema. In fact, the data of the three monitoring-related summary tables ([`metrics_summary`](/reference/system-databases/metrics-summary.md), [`metrics_summary_by_label`](/reference/system-databases/metrics-summary.md), and `inspection_result`) are obtained by querying the monitoring tables in the metrics schema. Currently, many system tables are added and you can query the information of these tables through the [`information_schema.metrics_tables`](/reference/system-databases/metrics-tables.md) table. ## Overview @@ -30,10 +30,10 @@ select * from information_schema.metrics_tables where table_name='tidb_query_dur Field description: -* `TABLE_NAME`: Corresponds to the table name in `metrics_schema`. In this example, the table name is `tidb_query_duration`. +* `TABLE_NAME`: Corresponds to the table name in the metrics schema. In this example, the table name is `tidb_query_duration`. * `PROMQL`: The working principle of the monitoring table is to map SQL statements to `PromQL` and convert Prometheus results into SQL query results. This field is the expression template of `PromQL`. When getting the data of the monitoring table, the query conditions are used to rewrite the variables in this template to generate the final query expression. * `LABELS`: The label for the monitoring item. `tidb_query_duration` has two labels: `instance` and `sql_type`. -* `QUANTILE`: The percentile. For monitoring data of the histogram type, specify a default percentile. If the value of this field is `0`, it means that the monitoring item corresponding to the monitoring table is not a histogram. +* `QUANTILE`: The percentile. For monitoring data of the histogram type, a default percentile is specified. If the value of this field is `0`, it means that the monitoring item corresponding to the monitoring table is not a histogram. * `COMMENT`: The comment for the monitoring table. You can see that the `tidb_query_duration` table is used to query the percentile time of the TiDB query execution, such as the query time of P999/P99/P90. The unit is second. The structure of the `tidb_query_duration` table is queried as follows: @@ -60,7 +60,7 @@ show create table metrics_schema.tidb_query_duration; * `time`: The time of the monitoring item. * `instance` and `sql_type`: The labels of the `tidb_query_duration` monitoring item. `instance` means the monitoring address. `sql_type` means the type of the executed SQL statement. -* `quantile`: The percentile. The monitoring item of the Histogram type has this column, which indicates the percentile time of the query. For example, `quantile = 0.9` means to query the time of P90. +* `quantile`: The percentile. The monitoring item of the histogram type has this column, which indicates the percentile time of the query. For example, `quantile = 0.9` means to query the time of P90. * `value`: The value of the monitoring item. The following statement queries the P99 time within the range of [`2020-03-25 23:40:00`, `2020-03-25 23:42:00`]. 
diff --git a/reference/system-databases/metrics-summary.md b/reference/system-databases/metrics-summary.md index 5b4bf80b1e605..c4e332a2a440e 100644 --- a/reference/system-databases/metrics-summary.md +++ b/reference/system-databases/metrics-summary.md @@ -16,7 +16,7 @@ The two tables summarize all monitoring data to for you to check each monitoring {{< copyable "sql" >}} ```sql -mysql> desc metrics_summary; +desc metrics_summary; ``` ``` @@ -38,9 +38,9 @@ Field description: * `METRICS_NAME`: The monitoring table name. * `QUANTILE`: The percentile. You can specify `QUANTILE` using SQL statements. For example: * `select * from metrics_summary where quantile=0.99` specifies viewing the data of the 0.99 percentile. - * `select * from metrics_summary where quantile in (0.80, 0.99, 0.99, 0.999)` specifies viewing the data of the 0.8, 0.99, 0.99, 0.999 percentiles at the same time. + * `select * from metrics_summary where quantile in (0.80, 0.90, 0.99, 0.999)` specifies viewing the data of the 0.8, 0.90, 0.99, 0.999 percentiles at the same time. * `SUM_VALUE, AVG_VALUE, MIN_VALUE, and MAX_VALUE` respectively mean the sum, the average value, the minimum value, and the maximum value. -* `COMMENT`: The comment for the corresponding comment, +* `COMMENT`: The comment for the corresponding monitoring table. For example: @@ -85,7 +85,7 @@ MAX_VALUE | 0.013 COMMENT | The quantile of kv requests durations by store ``` -Similarly, here is an example of querying the `metrics_summary_by_label` monitoring summary table. +Similarly, below is an example of querying the `metrics_summary_by_label` monitoring summary table: {{< copyable "sql" >}} @@ -134,7 +134,7 @@ COMMENT | The quantile of TiDB query durations(second) The second and third lines of the above query results indicate that the `Select` and `Rollback` statements on `tidb_query_duration` have a long average execution time. -In addition to the above example, you can use the monitoring summary table to quickly find the module with the largest change from the monitoring data by comparing the full link monitoring items of the two time periods, and quickly locate the bottleneck. The following example compares all monitoring items in two periods (where `t1` is the baseline) and sorts these items according to the greatest difference: +In addition to the above example, you can use the monitoring summary table to quickly find the module with the largest change from the monitoring data by comparing the full link monitoring items of the two time periods, and quickly locate the bottleneck. The following example compares all monitoring items in two periods (where t1 is the baseline) and sorts these items according to the greatest difference: * Period t1:`("2020-03-03 17:08:00", "2020-03-03 17:11:00")` * Period t2:`("2020-03-03 17:18:00", "2020-03-03 17:21:00")` @@ -180,10 +180,10 @@ ORDER BY ratio DESC limit 10; From the above query result: * `tib_slow_query_cop_process_total_time` (the time consumption of `cop process` in TiDB slow queries) in the period t2 is 5,865 times higher than that in period t1. -* `tidb_distsql_partial_scan_key_total_num` (the number of keys to scan requested by TiDB’s `distsql`) in period t2 is 3,648 times higher than that in period t1. During period t2, `tidb_slow_query_cop_wait_total_time` (the time of cop in the TiDB slow query requesting to queue up) is 267 times higher than that in period t1. -* `tikv_cop_total_response_size` (the size of the TiKV cop request result) in period t2 is 192 times higher than that in period t1. 
-* `tikv_cop_scan_details` in period t2 (the scan requested by the TiKV cop) is 105 times higher than that in period t1. +* `tidb_distsql_partial_scan_key_total_num` (the number of keys to scan requested by TiDB’s `distsql`) in period t2 is 3,648 times higher than that in period t1. During period t2, `tidb_slow_query_cop_wait_total_time` (the waiting time of Coprocessor requesting to queue up in the TiDB slow query) is 267 times higher than that in period t1. +* `tikv_cop_total_response_size` (the size of the TiKV Coprocessor request result) in period t2 is 192 times higher than that in period t1. +* `tikv_cop_scan_details` in period t2 (the scan requested by the TiKV Coprocessor) is 105 times higher than that in period t1. -From the above result, you can see that the cop request in period t2 is much higher than period t1, which causes the TiKV Coprocessor to be overloaded, and there is a `cop task` waiting. It might be that some large queries appear in period t2 that bring more load. +From the above result, you can see that the Coprocessor request in period t2 is much higher than period t1, which causes TiKV Coprocessor to be overloaded, and there is a `cop task` waiting. It might be that some large queries appear in period t2 that bring more load. -In fact, during the entire time period from t1 to t2, the `go-ycsb` pressure test is run. Then 20 `tpch` queries are run during period t2, so it is the `tpch` queries that cause many cop requests. +In fact, during the entire time period from t1 to t2, the `go-ycsb` pressure test is being run. Then 20 `tpch` queries are run during period t2, so it is the `tpch` queries that cause many Coprocessor requests. diff --git a/reference/system-databases/metrics-tables.md b/reference/system-databases/metrics-tables.md index 18d66cb9f3b80..3fd0fd29f3821 100644 --- a/reference/system-databases/metrics-tables.md +++ b/reference/system-databases/metrics-tables.md @@ -6,7 +6,7 @@ category: reference # METRICS_TABLES -The `METRICS_TABLES` table provides information of all monitoring tables in [`metrics_schema`](/reference/system-databases/metrics-schema.md). +The `METRICS_TABLES` table provides information of all monitoring tables in the [metrics schema](/reference/system-databases/metrics-schema.md). {{< copyable "sql" >}} @@ -31,5 +31,5 @@ Field description: * `TABLE_NAME`: Corresponds to the table name in `metrics_schema`. * `PROMQL`: The working principle of the monitoring table is to map SQL statements to `PromQL` and convert Prometheus results into SQL query results. This field is the expression template of `PromQL`. When getting the data of the monitoring table, the query conditions are used to rewrite the variables in this template to generate the final query expression. * `LABELS`: The label for the monitoring item. Each label corresponds to a column in the monitoring table. If the SQL statement contains filter of the corresponding column, the corresponding `PromQL` changes accordingly. -* `QUANTILE`: The percentile. For monitoring data of the histogram type, specify a default percentile. If the value of this field is `0`, it means that the monitoring item corresponding to the monitoring table is not a histogram. +* `QUANTILE`: The percentile. For monitoring data of the histogram type, a default percentile is specified. If the value of this field is `0`, it means that the monitoring item corresponding to the monitoring table is not a histogram. * `COMMENT`: The comment for the monitoring table. 
From d29afa754e0e2405f0b90e02ec8447acd9509d34 Mon Sep 17 00:00:00 2001 From: TomShawn <1135243111@qq.com> Date: Tue, 21 Apr 2020 16:21:00 +0800 Subject: [PATCH 3/5] address comments --- reference/system-databases/metrics-schema.md | 30 +++++++++---------- reference/system-databases/metrics-summary.md | 16 +++++----- reference/system-databases/metrics-tables.md | 2 +- 3 files changed, 24 insertions(+), 24 deletions(-) diff --git a/reference/system-databases/metrics-schema.md b/reference/system-databases/metrics-schema.md index 6d05d9154a08b..df1ec0c3f28b1 100644 --- a/reference/system-databases/metrics-schema.md +++ b/reference/system-databases/metrics-schema.md @@ -6,7 +6,7 @@ category: reference # Metrics Schema -To dynamically observe and compare cluster conditions of different time periods, the SQL diagnosis system introduces cluster monitoring system tables. All monitoring tables are in the metrics schema, and you can query the monitoring information using SQL statements in this schema. In fact, the data of the three monitoring-related summary tables ([`metrics_summary`](/reference/system-databases/metrics-summary.md), [`metrics_summary_by_label`](/reference/system-databases/metrics-summary.md), and `inspection_result`) are obtained by querying the monitoring tables in the metrics schema. Currently, many system tables are added and you can query the information of these tables through the [`information_schema.metrics_tables`](/reference/system-databases/metrics-tables.md) table. +To dynamically observe and compare cluster conditions of different time ranges, the SQL diagnosis system introduces cluster monitoring system tables. All monitoring tables are in the metrics schema, and you can query the monitoring information using SQL statements in this schema. The data of the three monitoring-related summary tables ([`metrics_summary`](/reference/system-databases/metrics-summary.md), [`metrics_summary_by_label`](/reference/system-databases/metrics-summary.md), and `inspection_result`) are all obtained by querying the monitoring tables in the metrics schema. Currently, many system tables are added, so you can query the information of these tables using the [`information_schema.metrics_tables`](/reference/system-databases/metrics-tables.md) table. ## Overview @@ -20,7 +20,7 @@ Query the information related to the `tidb_query_duration` table on `information select * from information_schema.metrics_tables where table_name='tidb_query_duration'; ``` -``` +```sql +---------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+----------+----------------------------------------------+ | TABLE_NAME | PROMQL | LABELS | QUANTILE | COMMENT | +---------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+----------+----------------------------------------------+ @@ -36,7 +36,7 @@ Field description: * `QUANTILE`: The percentile. For monitoring data of the histogram type, a default percentile is specified. If the value of this field is `0`, it means that the monitoring item corresponding to the monitoring table is not a histogram. * `COMMENT`: The comment for the monitoring table. You can see that the `tidb_query_duration` table is used to query the percentile time of the TiDB query execution, such as the query time of P999/P99/P90. 
The unit is second. -The structure of the `tidb_query_duration` table is queried as follows: +To query the schema of the `tidb_query_duration` table, execute the following statement: {{< copyable "sql" >}} @@ -44,7 +44,7 @@ The structure of the `tidb_query_duration` table is queried as follows: show create table metrics_schema.tidb_query_duration; ``` -``` +```sql +---------------------+--------------------------------------------------------------------------------------------------------------------+ | Table | Create Table | +---------------------+--------------------------------------------------------------------------------------------------------------------+ @@ -71,7 +71,7 @@ The following statement queries the P99 time within the range of [`2020-03-25 23 select * from metrics_schema.tidb_query_duration where value is not null and time>='2020-03-25 23:40:00' and time <= '2020-03-25 23:42:00' and quantile=0.99; ``` -``` +```sql +---------------------+-------------------+----------+----------+----------------+ | time | instance | sql_type | quantile | value | +---------------------+-------------------+----------+----------+----------------+ @@ -87,12 +87,12 @@ select * from metrics_schema.tidb_query_duration where value is not null and tim +---------------------+-------------------+----------+----------+----------------+ ``` -The first row of the above query result means that at the time of 2020-03-25 23:40:00, on the TiDB instance `172.16.5.40:10089`, the P99 execution time of the `Insert` type statement is 0.509929485256 seconds. The meanings of other rows are similar. Other values of the `sql_type` column is described as follows: +The first row of the query result above means that at the time of 2020-03-25 23:40:00, on the TiDB instance `172.16.5.40:10089`, the P99 execution time of the `Insert` type statement is 0.509929485256 seconds. The meanings of other rows are similar. Other values of the `sql_type` column are described as follows: * `Select`: The `select` type statement is executed. * `internal`: The internal SQL statement of TiDB, which is used to update the statistical information and get the global variables. 
-The execution plan of the above statement is as follows: +To view the execution plan of the statement above, execute the following statement: {{< copyable "sql" >}} @@ -100,7 +100,7 @@ The execution plan of the above statement is as follows: desc select * from metrics_schema.tidb_query_duration where value is not null and time>='2020-03-25 23:40:00' and time <= '2020-03-25 23:42:00' and quantile=0.99; ``` -``` +```sql +------------------+----------+------+---------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | id | estRows | task | access object | operator info | +------------------+----------+------+---------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ @@ -109,14 +109,14 @@ desc select * from metrics_schema.tidb_query_duration where value is not null an +------------------+----------+------+---------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ ``` -From the above result, you can see that `PromQL`, `start_time`, `end_time`, and the value of `step`. During actual execution, TiDB calls the `query_range` HTTP API interface of Prometheus to query the monitoring data. +From the result above, you can see that `PromQL`, `start_time`, `end_time`, and `step` are in the execution plan. During the execution process, TiDB calls the `query_range` HTTP API of Prometheus to query the monitoring data. -You might find that during the range of [`2020-03-25 23:40:00`, `2020-03-25 23:42:00`], each label only has three time values. In the execution plan, the value of `step` is 1 minute, which is determined by the following two variables: +You might find that in the range of [`2020-03-25 23:40:00`, `2020-03-25 23:42:00`], each label only has three time values. In the execution plan, the value of `step` is 1 minute, which is determined by the following two variables: -* `tidb_metric_query_step`: The resolution step of the query. To get the `query_range` data from Prometheus, you need to specify `start`, `end`, and `step`. `step` uses the value of this variable. -* `tidb_metric_query_range_duration`: When querying the monitoring, the `$ RANGE_DURATION` field in `PROMQL` is replaced with the value of this variable. The default value is 60 seconds. +* `tidb_metric_query_step`: The query resolution step width. To get the `query_range` data from Prometheus, you need to specify `start_time`, `end_time`, and `step`. `step` uses the value of this variable. +* `tidb_metric_query_range_duration`: When the monitoring data is queried, the value of the `$ RANGE_DURATION` field in `PROMQL` is replaced with the value of this variable. The default value is 60 seconds. -To view the values of monitoring items with different granularities, you can modify the above two session variables before querying the monitoring table. For example: +To view the values of monitoring items with different granularities, you can modify the two session variables above before querying the monitoring table. For example: 1. Modify the values of the two session variables and set the time granularity to 30 seconds. 
@@ -139,7 +139,7 @@ To view the values of monitoring items with different granularities, you can mod select * from metrics_schema.tidb_query_duration where value is not null and time>='2020-03-25 23:40:00' and time <= '2020-03-25 23:42:00' and quantile=0.99; ``` - ``` + ```sql +---------------------+-------------------+----------+----------+-----------------+ | time | instance | sql_type | quantile | value | +---------------------+-------------------+----------+----------+-----------------+ @@ -169,7 +169,7 @@ To view the values of monitoring items with different granularities, you can mod desc select * from metrics_schema.tidb_query_duration where value is not null and time>='2020-03-25 23:40:00' and time <= '2020-03-25 23:42:00' and quantile=0.99; ``` - ``` + ```sql +------------------+----------+------+---------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | id | estRows | task | access object | operator info | +------------------+----------+------+---------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ diff --git a/reference/system-databases/metrics-summary.md b/reference/system-databases/metrics-summary.md index c4e332a2a440e..fa494216b7c6d 100644 --- a/reference/system-databases/metrics-summary.md +++ b/reference/system-databases/metrics-summary.md @@ -19,7 +19,7 @@ The two tables summarize all monitoring data to for you to check each monitoring desc metrics_summary; ``` -``` +```sql +--------------+-----------------------+------+------+---------+-------+ | Field | Type | Null | Key | Default | Extra | +--------------+-----------------------+------+------+---------+-------+ @@ -58,7 +58,7 @@ order by avg_value desc limit 3\G ``` -``` +```sql ***************************[ 1. row ]*************************** METRICS_NAME | tidb_get_token_duration QUANTILE | 0.99 @@ -99,7 +99,7 @@ order by avg_value desc limit 10\G ``` -``` +```sql ***************************[ 1. row ]*************************** INSTANCE | 172.16.5.40:10089 METRICS_NAME | tidb_get_token_duration @@ -132,9 +132,9 @@ MAX_VALUE | 0.008241 COMMENT | The quantile of TiDB query durations(second) ``` -The second and third lines of the above query results indicate that the `Select` and `Rollback` statements on `tidb_query_duration` have a long average execution time. +The second and third lines of the query results above indicate that the `Select` and `Rollback` statements on `tidb_query_duration` have a long average execution time. -In addition to the above example, you can use the monitoring summary table to quickly find the module with the largest change from the monitoring data by comparing the full link monitoring items of the two time periods, and quickly locate the bottleneck. The following example compares all monitoring items in two periods (where t1 is the baseline) and sorts these items according to the greatest difference: +In addition to the example above, you can use the monitoring summary table to quickly find the module with the largest change from the monitoring data by comparing the full link monitoring items of the two time periods, and quickly locate the bottleneck. 
The following example compares all monitoring items in two periods (where t1 is the baseline) and sorts these items according to the greatest difference: * Period t1:`("2020-03-03 17:08:00", "2020-03-03 17:11:00")` * Period t2:`("2020-03-03 17:18:00", "2020-03-03 17:21:00")` @@ -160,7 +160,7 @@ JOIN ORDER BY ratio DESC limit 10; ``` -``` +```sql +----------------+------------------------------------------+----------------+------------------+---------------------------------------------------------------------------------------------+ | ratio | metrics_name | t1_avg_value | t2_avg_value | comment | +----------------+------------------------------------------+----------------+------------------+---------------------------------------------------------------------------------------------+ @@ -177,13 +177,13 @@ ORDER BY ratio DESC limit 10; +----------------+------------------------------------------+----------------+------------------+---------------------------------------------------------------------------------------------+ ``` -From the above query result: +From the query above result: * `tib_slow_query_cop_process_total_time` (the time consumption of `cop process` in TiDB slow queries) in the period t2 is 5,865 times higher than that in period t1. * `tidb_distsql_partial_scan_key_total_num` (the number of keys to scan requested by TiDB’s `distsql`) in period t2 is 3,648 times higher than that in period t1. During period t2, `tidb_slow_query_cop_wait_total_time` (the waiting time of Coprocessor requesting to queue up in the TiDB slow query) is 267 times higher than that in period t1. * `tikv_cop_total_response_size` (the size of the TiKV Coprocessor request result) in period t2 is 192 times higher than that in period t1. * `tikv_cop_scan_details` in period t2 (the scan requested by the TiKV Coprocessor) is 105 times higher than that in period t1. -From the above result, you can see that the Coprocessor request in period t2 is much higher than period t1, which causes TiKV Coprocessor to be overloaded, and there is a `cop task` waiting. It might be that some large queries appear in period t2 that bring more load. +From the result above, you can see that the Coprocessor request in period t2 is much higher than period t1, which causes TiKV Coprocessor to be overloaded, and there is a `cop task` waiting. It might be that some large queries appear in period t2 that bring more load. In fact, during the entire time period from t1 to t2, the `go-ycsb` pressure test is being run. Then 20 `tpch` queries are run during period t2, so it is the `tpch` queries that cause many Coprocessor requests. 
diff --git a/reference/system-databases/metrics-tables.md b/reference/system-databases/metrics-tables.md index 3fd0fd29f3821..89c1b8c0cd4d5 100644 --- a/reference/system-databases/metrics-tables.md +++ b/reference/system-databases/metrics-tables.md @@ -14,7 +14,7 @@ The `METRICS_TABLES` table provides information of all monitoring tables in the desc metrics_tables; ``` -``` +```sql +------------+-----------------+------+------+---------+-------+ | Field | Type | Null | Key | Default | Extra | +------------+-----------------+------+------+---------+-------+ From 2016b4478fc8012a6fd6a1573b83062a0336b083 Mon Sep 17 00:00:00 2001 From: TomShawn <1135243111@qq.com> Date: Tue, 21 Apr 2020 16:22:51 +0800 Subject: [PATCH 4/5] address a comment --- reference/system-databases/metrics-schema.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/reference/system-databases/metrics-schema.md b/reference/system-databases/metrics-schema.md index df1ec0c3f28b1..9efbf7b6b5451 100644 --- a/reference/system-databases/metrics-schema.md +++ b/reference/system-databases/metrics-schema.md @@ -34,7 +34,7 @@ Field description: * `PROMQL`: The working principle of the monitoring table is to map SQL statements to `PromQL` and convert Prometheus results into SQL query results. This field is the expression template of `PromQL`. When getting the data of the monitoring table, the query conditions are used to rewrite the variables in this template to generate the final query expression. * `LABELS`: The label for the monitoring item. `tidb_query_duration` has two labels: `instance` and `sql_type`. * `QUANTILE`: The percentile. For monitoring data of the histogram type, a default percentile is specified. If the value of this field is `0`, it means that the monitoring item corresponding to the monitoring table is not a histogram. -* `COMMENT`: The comment for the monitoring table. You can see that the `tidb_query_duration` table is used to query the percentile time of the TiDB query execution, such as the query time of P999/P99/P90. The unit is second. +* `COMMENT`: Explanations for the monitoring table. You can see that the `tidb_query_duration` table is used to query the percentile time of the TiDB query execution, such as the query time of P999/P99/P90. The unit is second. To query the schema of the `tidb_query_duration` table, execute the following statement: From 20e2c893cbdfa395e3231539eddcd2b4287c45f2 Mon Sep 17 00:00:00 2001 From: TomShawn <1135243111@qq.com> Date: Thu, 23 Apr 2020 15:44:17 +0800 Subject: [PATCH 5/5] address comments --- reference/system-databases/metrics-summary.md | 20 +++++++++---------- reference/system-databases/metrics-tables.md | 4 ++-- 2 files changed, 12 insertions(+), 12 deletions(-) diff --git a/reference/system-databases/metrics-summary.md b/reference/system-databases/metrics-summary.md index fa494216b7c6d..c18362e063fee 100644 --- a/reference/system-databases/metrics-summary.md +++ b/reference/system-databases/metrics-summary.md @@ -6,12 +6,12 @@ category: reference # METRICS_SUMMARY -Because the TiDB cluster has many monitoring metrics, the SQL diagnosis system also provides the following two monitoring summary tables for you to easily find abnormal monitoring items: +The TiDB cluster has many monitoring metrics. 
+The TiDB cluster has many monitoring metrics. To make it easy to detect abnormal monitoring metrics, TiDB 4.0 introduces the following two monitoring summary tables:
 
 * `information_schema.metrics_summary`
 * `information_schema.metrics_summary_by_label`
 
-The two tables summarize all monitoring data to for you to check each monitoring metric with higher efficiency. Compare to `information_schema.metrics_summary`, the `information_schema.metrics_summary_by_label` table has an additional `label` column and performs differentiated statistics according to different labels.
+The two tables summarize all monitoring data for you to check each monitoring metric efficiently. Compared with `information_schema.metrics_summary`, the `information_schema.metrics_summary_by_label` table has an additional `label` column and performs differentiated statistics according to different labels.
 
 {{< copyable "sql" >}}
 
@@ -39,12 +39,12 @@ Field description:
 * `QUANTILE`: The percentile. You can specify `QUANTILE` using SQL statements. For example:
     * `select * from metrics_summary where quantile=0.99` specifies viewing the data of the 0.99 percentile.
     * `select * from metrics_summary where quantile in (0.80, 0.90, 0.99, 0.999)` specifies viewing the data of the 0.8, 0.90, 0.99, 0.999 percentiles at the same time.
-* `SUM_VALUE, AVG_VALUE, MIN_VALUE, and MAX_VALUE` respectively mean the sum, the average value, the minimum value, and the maximum value.
+* `SUM_VALUE`, `AVG_VALUE`, `MIN_VALUE`, and `MAX_VALUE` respectively mean the sum, the average value, the minimum value, and the maximum value.
 * `COMMENT`: The comment for the corresponding monitoring table.
 
 For example:
 
-To query the three groups of monitoring items with the highest average time consumption in the TiDB cluster in the time range of `'2020-03-08 13:23:00', '2020-03-08 13: 33: 00'`, you can directly query the `information_schema.metrics_summary` table and use the `/*+ time_range() */` hint to specify the time range. The SQL statement is built as follows:
+To query the three groups of monitoring items with the highest average time consumption in the TiDB cluster within the time range of `'2020-03-08 13:23:00', '2020-03-08 13:33:00'`, you can directly query the `information_schema.metrics_summary` table and use the `/*+ time_range() */` hint to specify the time range. The SQL statement is as follows:
 
 {{< copyable "sql" >}}
 
@@ -85,7 +85,7 @@ MAX_VALUE | 0.013
 COMMENT | The quantile of kv requests durations by store
 ```
 
-Similarly, below is an example of querying the `metrics_summary_by_label` monitoring summary table:
+Similarly, the following example queries the `metrics_summary_by_label` monitoring summary table:
 
 {{< copyable "sql" >}}
 
@@ -132,14 +132,14 @@ MAX_VALUE | 0.008241
 COMMENT | The quantile of TiDB query durations(second)
 ```
 
-The second and third lines of the query results above indicate that the `Select` and `Rollback` statements on `tidb_query_duration` have a long average execution time.
+The second and third rows of the query results above indicate that the `Select` and `Rollback` statements on `tidb_query_duration` have a long average execution time.
 
 In addition to the example above, you can use the monitoring summary table to quickly find the module with the largest change from the monitoring data by comparing the full-link monitoring items of the two time periods, and quickly locate the bottleneck.
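A note for readers of this patch: the statement whose output the `@@ -85,7` hunk shows is elided by the diff context. Under the assumption that it follows the pattern described above (a `time_range` hint plus a `quantile` filter), a query of that shape might look like the sketch below; the metric-name pattern and thresholds are illustrative, not necessarily the exact ones in the file:

```sql
-- Top three monitoring items by average value within the hinted time window.
-- quantile = 0.99 picks the P99 series for histogram-type metrics;
-- avg_value > 0 drops items with no data in the window.
SELECT /*+ time_range('2020-03-08 13:23:00', '2020-03-08 13:33:00') */
       metrics_name, avg_value, max_value, comment
FROM information_schema.metrics_summary
WHERE metrics_name LIKE '%duration'   -- illustrative name filter
  AND avg_value > 0
  AND quantile = 0.99
ORDER BY avg_value DESC
LIMIT 3;
```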
 The following example compares all monitoring items in two periods (where t1 is the baseline) and sorts these items according to the greatest difference:
 
 * Period t1: `("2020-03-03 17:08:00", "2020-03-03 17:11:00")`
 * Period t2: `("2020-03-03 17:18:00", "2020-03-03 17:21:00")`
 
-The monitoring items of the two time periods are joined according to `METRICS_NAME` and sorted according to the difference value. `TIME_RANGE` is the hint that specifies the quey time.
+The monitoring items of the two time periods are joined according to `METRICS_NAME` and sorted according to the difference value. `TIME_RANGE` is the hint that specifies the query time.
 
 {{< copyable "sql" >}}
 
@@ -177,13 +177,13 @@ ORDER BY ratio DESC limit 10;
 +----------------+------------------------------------------+----------------+------------------+---------------------------------------------------------------------------------------------+
 ```
 
-From the query above result:
+From the query result above, you can get the following information:
 
 * `tidb_slow_query_cop_process_total_time` (the time consumption of `cop process` in TiDB slow queries) in period t2 is 5,865 times higher than that in period t1.
 * `tidb_distsql_partial_scan_key_total_num` (the number of keys to scan requested by TiDB’s `distsql`) in period t2 is 3,648 times higher than that in period t1.
 * `tidb_slow_query_cop_wait_total_time` (the waiting time of Coprocessor requests queuing up in TiDB slow queries) in period t2 is 267 times higher than that in period t1.
 * `tikv_cop_total_response_size` (the size of the TiKV Coprocessor request result) in period t2 is 192 times higher than that in period t1.
 * `tikv_cop_scan_details` (the scan requested by the TiKV Coprocessor) in period t2 is 105 times higher than that in period t1.
 
-From the above result, you can see that the Coprocessor request in period t2 is much higher than period t1, which causes TiKV Coprocessor to be overloaded, and there is a `cop task` waiting. It might be that some large queries appear in period t2 that bring more load.
+From the result above, you can see that there are far more Coprocessor requests in period t2 than in period t1. This causes the TiKV Coprocessor to be overloaded, and `cop task`s have to wait. It might be that some large queries in period t2 bring more load.
 
-In fact, during the entire time period from t1 to t2, the `go-ycsb` pressure test is being run. Then 20 `tpch` queries are run during period t2, so it is the `tpch` queries that cause many Coprocessor requests.
+In fact, the `go-ycsb` pressure test is running during the entire period from t1 to t2, and 20 `tpch` queries are run during period t2. So it is the `tpch` queries that cause the large number of Coprocessor requests.

diff --git a/reference/system-databases/metrics-tables.md b/reference/system-databases/metrics-tables.md
index 89c1b8c0cd4d5..622039ac4e068 100644
--- a/reference/system-databases/metrics-tables.md
+++ b/reference/system-databases/metrics-tables.md
@@ -30,6 +30,6 @@ Field description:
 * `TABLE_NAME`: Corresponds to the table name in `metrics_schema`.
 * `PROMQL`: The working principle of the monitoring table is to map SQL statements to `PromQL` and convert Prometheus results into SQL query results. This field is the expression template of `PromQL`. When getting the data of the monitoring table, the query conditions are used to rewrite the variables in this template to generate the final query expression.
-* `LABELS`: The label for the monitoring item. Each label corresponds to a column in the monitoring table. If the SQL statement contains filter of the corresponding column, the corresponding `PromQL` changes accordingly.
+* `LABELS`: The label for the monitoring item. Each label corresponds to a column in the monitoring table. If the SQL statement contains a filter on the corresponding column, the corresponding `PromQL` changes accordingly.
 * `QUANTILE`: The percentile. For monitoring data of the histogram type, a default percentile is specified. If the value of this field is `0`, it means that the monitoring item corresponding to the monitoring table is not a histogram.
-* `COMMENT`: The comment for the monitoring table.
+* `COMMENT`: The comment about the monitoring table.
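A closing note for this patch: the `LABELS` behavior described above is easiest to see with a concrete query. Below is a minimal sketch against the `tidb_query_duration` table discussed earlier in this series, with an invented instance address; the point is that predicates on label columns are rewritten into the generated `PromQL` rather than filtered afterwards:

```sql
-- The instance and sql_type predicates become PromQL label matchers,
-- e.g. {instance="10.0.1.9:10080",sql_type="Select"}, so only the matching
-- series are fetched from Prometheus.
SELECT time, instance, sql_type, value
FROM metrics_schema.tidb_query_duration
WHERE instance = '10.0.1.9:10080'   -- hypothetical TiDB instance address
  AND sql_type = 'Select'
  AND quantile = 0.99               -- overrides the table's default percentile
  AND time >= '2020-03-03 17:18:00'
  AND time <  '2020-03-03 17:21:00';
```

Because the conditions narrow the generated PromQL query itself, this is cheaper than pulling the whole table and filtering on the client side.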