Conversation

@TomShawn TomShawn commented Apr 12, 2020

What is changed, added or deleted? (Required)

Add metrics_schema, metrics_tables, and metrics_summary system tables.

Which TiDB version(s) do your changes apply to? (Required)

  • master (the latest development version)
  • v4.0 (TiDB 4.0 versions)
  • v3.1 (TiDB 3.1 versions)
  • v3.0 (TiDB 3.0 versions)
  • v2.1 (TiDB 2.1 versions)

If you select two or more versions from above, to trigger the bot to cherry-pick this PR to your desired release version branch(es), you must add corresponding labels such as needs-cherry-pick-4.0, needs-cherry-pick-3.1, needs-cherry-pick-3.0, and needs-cherry-pick-2.1.

What is the related PR or file link(s)?

@TomShawn TomShawn added translation/from-docs-cn This PR is translated from a PR in pingcap/docs-cn. v4.0 This PR/issue applies to TiDB v4.0. size/large Changes of a large size. needs-cherry-pick-4.0 labels Apr 12, 2020
@TomShawn TomShawn requested a review from reafans April 12, 2020 14:23
@yikeke yikeke added the status/PTAL This PR is ready for reviewing. label Apr 14, 2020
@yikeke yikeke requested a review from lilin90 April 14, 2020 06:25
sre-bot commented Apr 16, 2020

@lilin90, @reafans, PTAL.


# Metrics Schema

To dynamically observe and compare cluster conditions of different time periods, the SQL diagnosis system introduces cluster monitoring system tables. All monitoring tables are in the metrics schema, and you can query the monitoring information using SQL statements in this schema. In fact, the data of the three monitoring-related summary tables ([`metrics_summary`](/reference/system-databases/metrics-summary.md), [`metrics_summary_by_label`](/reference/system-databases/metrics-summary.md), and `inspection_result`) are obtained by querying the monitoring tables in the metrics schema. Currently, many system tables are added and you can query the information of these tables through the [`information_schema.metrics_tables`](/reference/system-databases/metrics-tables.md) table.

Suggested change
To dynamically observe and compare cluster conditions of different time periods, the SQL diagnosis system introduces cluster monitoring system tables. All monitoring tables are in the metrics schema, and you can query the monitoring information using SQL statements in this schema. In fact, the data of the three monitoring-related summary tables ([`metrics_summary`](/reference/system-databases/metrics-summary.md), [`metrics_summary_by_label`](/reference/system-databases/metrics-summary.md), and `inspection_result`) are obtained by querying the monitoring tables in the metrics schema. Currently, many system tables are added and you can query the information of these tables through the [`information_schema.metrics_tables`](/reference/system-databases/metrics-tables.md) table.
To dynamically observe and compare cluster conditions of different time ranges, the SQL diagnosis system introduces cluster monitoring system tables. All monitoring tables are in the metrics schema, and you can query the monitoring information using SQL statements in this schema. The data of the three monitoring-related summary tables ([`metrics_summary`](/reference/system-databases/metrics-summary.md), [`metrics_summary_by_label`](/reference/system-databases/metrics-summary.md), and `inspection_result`) are all obtained by querying the monitoring tables in the metrics schema. Currently, many system tables are added, so you can query the information of these tables using the [`information_schema.metrics_tables`](/reference/system-databases/metrics-tables.md) table.
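
As a quick illustration of the paragraph above, the definition of a single monitoring table can be looked up in the `information_schema.metrics_tables` table. The following is a minimal sketch; the bullets below explain each of the returned fields:

```sql
-- A minimal sketch: look up how one monitoring table is defined.
-- The returned fields (TABLE_NAME, PROMQL, LABELS, QUANTILE, COMMENT) are
-- described in the bullets that follow.
SELECT * FROM information_schema.metrics_tables
WHERE table_name = 'tidb_query_duration'\G
```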

* `PROMQL`: The working principle of the monitoring table is to map SQL statements to `PromQL` and convert Prometheus results into SQL query results. This field is the expression template of `PromQL`. When getting the data of the monitoring table, the query conditions are used to rewrite the variables in this template to generate the final query expression.
* `LABELS`: The label for the monitoring item. `tidb_query_duration` has two labels: `instance` and `sql_type`.
* `QUANTILE`: The percentile. For monitoring data of the histogram type, a default percentile is specified. If the value of this field is `0`, it means that the monitoring item corresponding to the monitoring table is not a histogram.
* `COMMENT`: The comment for the monitoring table. You can see that the `tidb_query_duration` table is used to query the percentile time of the TiDB query execution, such as the query time of P999/P99/P90. The unit is second.

I see that the Chinese version itself is confusing: 可以看出 tidb_query_duration 表的是用来查询 TiDB query 执行的百分位时间,如 P999/P99/P90 的查询耗时,单位是秒。 (That is: the `tidb_query_duration` table is used to query the percentile durations of TiDB query execution, such as the P999/P99/P90 query durations, in seconds.) @reafans Would you please rephrase 表的是 in a clearer way? @TomShawn Please confirm it and update if necessary.

Suggested change
* `COMMENT`: The comment for the monitoring table. You can see that the `tidb_query_duration` table is used to query the percentile time of the TiDB query execution, such as the query time of P999/P99/P90. The unit is second.
* `COMMENT`: Explanations for the monitoring table. You can see that the `tidb_query_duration` table is used to query the percentile time of the TiDB query execution, such as the query time of P999/P99/P90. The unit is second.


It should be 可以看出 tidb_query_duration 表是用来查询 TiDB query 执行的百分位时间的 ("You can see that the tidb_query_duration table is used to query the percentile durations of TiDB query execution"). I'll update it in the Chinese version.


The structure of the `tidb_query_duration` table is queried as follows:

Suggested change
The structure of the `tidb_query_duration` table is queried as follows:
To query the schema of the `tidb_query_duration` table, execute the following statement:

(query statement and result output truncated)
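
For reference, a query of the shape being interpreted below might look like the following sketch. The time range and the 0.99 quantile are the values mentioned in the surrounding text, and the columns follow the labels and fields described above:

```sql
-- A sketch of the kind of query whose result is discussed below.
-- time/value are the sample timestamp and metric value; instance and sql_type
-- are the two labels of tidb_query_duration; quantile selects the percentile.
SELECT time, instance, sql_type, quantile, value
FROM metrics_schema.tidb_query_duration
WHERE time >= '2020-03-25 23:40:00'
  AND time < '2020-03-25 23:42:00'
  AND quantile = 0.99;
```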

The first row of the above query result means that at the time of 2020-03-25 23:40:00, on the TiDB instance `172.16.5.40:10089`, the P99 execution time of the `Insert` type statement is 0.509929485256 seconds. The meanings of other rows are similar. Other values of the `sql_type` column is described as follows:

Suggested change
The first row of the above query result means that at the time of 2020-03-25 23:40:00, on the TiDB instance `172.16.5.40:10089`, the P99 execution time of the `Insert` type statement is 0.509929485256 seconds. The meanings of other rows are similar. Other values of the `sql_type` column is described as follows:
The first row of the above query result means that at the time of 2020-03-25 23:40:00, on the TiDB instance `172.16.5.40:10089`, the P99 execution time of the `Insert` type statement is 0.509929485256 seconds. The meanings of other rows are similar. Other values of the `sql_type` column are described as follows:

(table of `sql_type` values truncated)

From the above result, you can see that `PromQL`, `start_time`, `end_time`, and the value of `step`. During actual execution, TiDB calls the `query_range` HTTP API interface of Prometheus to query the monitoring data.

Suggested change
From the above result, you can see that `PromQL`, `start_time`, `end_time`, and the value of `step`. During actual execution, TiDB calls the `query_range` HTTP API interface of Prometheus to query the monitoring data.
From the above result, you can see that `PromQL`, `start_time`, `end_time`, and `step` are in the execution plan. During the execution process, TiDB calls the `query_range` HTTP API of Prometheus to query the monitoring data.
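
As a hedged sketch, the execution plan referred to here can be inspected with an ordinary `EXPLAIN` on the monitoring table; the exact operator output depends on the TiDB version:

```sql
-- A sketch: inspect how a monitoring-table query is rewritten.
-- The operator info is expected to show the generated PromQL together with
-- start_time, end_time, and step.
EXPLAIN SELECT * FROM metrics_schema.tidb_query_duration
WHERE time >= '2020-03-25 23:40:00'
  AND time < '2020-03-25 23:42:00'
  AND quantile = 0.99;
```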


You might find that during the range of [`2020-03-25 23:40:00`, `2020-03-25 23:42:00`], each label only has three time values. In the execution plan, the value of `step` is 1 minute, which is determined by the following two variables:

Suggested change
You might find that during the range of [`2020-03-25 23:40:00`, `2020-03-25 23:42:00`], each label only has three time values. In the execution plan, the value of `step` is 1 minute, which is determined by the following two variables:
You might find that in the range of [`2020-03-25 23:40:00`, `2020-03-25 23:42:00`], each label only has three time values. In the execution plan, the value of `step` is 1 minute, which is determined by the following two variables:


* `tidb_metric_query_step`: The resolution step of the query. To get the `query_range` data from Prometheus, you need to specify `start`, `end`, and `step`. `step` uses the value of this variable.

Suggested change
* `tidb_metric_query_step`: The resolution step of the query. To get the `query_range` data from Prometheus, you need to specify `start`, `end`, and `step`. `step` uses the value of this variable.
* `tidb_metric_query_step`: The query resolution step width. To get the `query_range` data from Prometheus, you need to specify `start`, `end`, and `step`. `step` uses the value of this variable.


Start_time/end_time is more accurate. I'll fix it in the Chinese version.

* `tidb_metric_query_range_duration`: When querying the monitoring, the `$ RANGE_DURATION` field in `PROMQL` is replaced with the value of this variable. The default value is 60 seconds.

It seems that the subjects are not consistent.

Suggested change
* `tidb_metric_query_range_duration`: When querying the monitoring, the `$ RANGE_DURATION` field in `PROMQL` is replaced with the value of this variable. The default value is 60 seconds.
* `tidb_metric_query_range_duration`: When the monitoring data is queried, the value of the `$ RANGE_DURATION` field in `PROMQL` is replaced with the value of this variable. The default value is 60 seconds.


To view the values of monitoring items with different granularities, you can modify the above two session variables before querying the monitoring table. For example:
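
A minimal sketch, using the two session variables documented above with an illustrative 30-second granularity:

```sql
-- A sketch: use a finer granularity (30 seconds here is an illustrative value)
-- before querying the monitoring table.
SET @@tidb_metric_query_step = 30;
SET @@tidb_metric_query_range_duration = 30;

SELECT time, instance, sql_type, quantile, value
FROM metrics_schema.tidb_query_duration
WHERE time >= '2020-03-25 23:40:00'
  AND time < '2020-03-25 23:42:00'
  AND quantile = 0.99;
```

With the smaller step and range duration, the same time range returns more, finer-grained sample points.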

I suggest placing "above" after the noun as shown below, which is the more common usage. Please also update other places in this PR.

Suggested change
To view the values of monitoring items with different granularities, you can modify the above two session variables before querying the monitoring table. For example:
To view the values of monitoring items with different granularities, you can modify the two session variables above before querying the monitoring table. For example:


Sure.

@lilin90 lilin90 added the status/require-change Needs the author to address comments. label Apr 16, 2020

sre-bot commented Apr 18, 2020

@lilin90, @reafans, PTAL.

1 similar comment

sre-bot commented Apr 20, 2020

@lilin90, @reafans, PTAL.

@TomShawn

@lilin90 All comments are addressed, PTAL again, thanks!

reafans commented Apr 22, 2020

LGTM


# METRICS_SUMMARY

Because the TiDB cluster has many monitoring metrics, the SQL diagnosis system also provides the following two monitoring summary tables for you to easily find abnormal monitoring items:

Suggested change
Because the TiDB cluster has many monitoring metrics, the SQL diagnosis system also provides the following two monitoring summary tables for you to easily find abnormal monitoring items:
The TiDB cluster has many monitoring metrics. To make it easy to detect abnormal monitoring metrics, TiDB 4.0 introduces the following two monitoring summary tables:

* `information_schema.metrics_summary`
* `information_schema.metrics_summary_by_label`

The two tables summarize all monitoring data to for you to check each monitoring metric with higher efficiency. Compare to `information_schema.metrics_summary`, the `information_schema.metrics_summary_by_label` table has an additional `label` column and performs differentiated statistics according to different labels.

Typo and grammar mistake?

Suggested change
The two tables summarize all monitoring data to for you to check each monitoring metric with higher efficiency. Compare to `information_schema.metrics_summary`, the `information_schema.metrics_summary_by_label` table has an additional `label` column and performs differentiated statistics according to different labels.
The two tables summarize all monitoring data for you to check each monitoring metric efficiently. Compared with `information_schema.metrics_summary`, the `information_schema.metrics_summary_by_label` table has an additional `label` column and performs differentiated statistics according to different labels.

* `QUANTILE`: The percentile. You can specify `QUANTILE` using SQL statements. For example:
* `select * from metrics_summary where quantile=0.99` specifies viewing the data of the 0.99 percentile.
* `select * from metrics_summary where quantile in (0.80, 0.90, 0.99, 0.999)` specifies viewing the data of the 0.8, 0.90, 0.99, 0.999 percentiles at the same time.
* `SUM_VALUE, AVG_VALUE, MIN_VALUE, and MAX_VALUE` respectively mean the sum, the average value, the minimum value, and the maximum value.

Suggested change
* `SUM_VALUE, AVG_VALUE, MIN_VALUE, and MAX_VALUE` respectively mean the sum, the average value, the minimum value, and the maximum value.
* `SUM_VALUE`, `AVG_VALUE`, `MIN_VALUE`, and `MAX_VALUE` respectively mean the sum, the average value, the minimum value, and the maximum value.


For example:

To query the three groups of monitoring items with the highest average time consumption in the TiDB cluster in the time range of `'2020-03-08 13:23:00', '2020-03-08 13: 33: 00'`, you can directly query the `information_schema.metrics_summary` table and use the `/*+ time_range() */` hint to specify the time range. The SQL statement is built as follows:
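
A sketch of such a statement, using the `time_range()` hint and the columns documented here; the filter on the metric name is an illustrative assumption:

```sql
-- A sketch: the top 3 monitoring items by average value within the hinted time range.
-- The metrics_name filter is an assumed, illustrative condition.
SELECT /*+ time_range('2020-03-08 13:23:00', '2020-03-08 13:33:00') */ *
FROM information_schema.metrics_summary
WHERE metrics_name LIKE 'tidb%duration'
  AND quantile = 0.99
ORDER BY avg_value DESC
LIMIT 3;
```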

Suggested change
To query the three groups of monitoring items with the highest average time consumption in the TiDB cluster in the time range of `'2020-03-08 13:23:00', '2020-03-08 13: 33: 00'`, you can directly query the `information_schema.metrics_summary` table and use the `/*+ time_range() */` hint to specify the time range. The SQL statement is built as follows:
To query the three groups of monitoring items with the highest average time consumption in the TiDB cluster within the time range of `'2020-03-08 13:23:00', '2020-03-08 13: 33: 00'`, you can directly query the `information_schema.metrics_summary` table and use the `/*+ time_range() */` hint to specify the time range. The SQL statement is as follows:

(query output truncated; last visible field: COMMENT | The quantile of kv requests durations by store)

Similarly, below is an example of querying the `metrics_summary_by_label` monitoring summary table:

Suggested change
Similarly, below is an example of querying the `metrics_summary_by_label` monitoring summary table:
Similarly, the following example queries the `metrics_summary_by_label` monitoring summary table:

(query output truncated)

From the query above result:

Suggested change
From the query above result:
From the query result above, you can get the following information:

* `tikv_cop_total_response_size` (the size of the TiKV Coprocessor request result) in period t2 is 192 times higher than that in period t1.
* `tikv_cop_scan_details` in period t2 (the scan requested by the TiKV Coprocessor) is 105 times higher than that in period t1.

From the result above, you can see that the Coprocessor request in period t2 is much higher than period t1, which causes TiKV Coprocessor to be overloaded, and there is a `cop task` waiting. It might be that some large queries appear in period t2 that bring more load.

Suggested change
From the result above, you can see that the Coprocessor request in period t2 is much higher than period t1, which causes TiKV Coprocessor to be overloaded, and there is a `cop task` waiting. It might be that some large queries appear in period t2 that bring more load.
From the result above, you can see that the Coprocessor requests in period t2 are much more than those in period t1. This causes TiKV Coprocessor to be overloaded, and the `cop task` has to wait. It might be that some large queries appear in period t2 that bring more load.


In fact, during the entire time period from t1 to t2, the `go-ycsb` pressure test is being run. Then 20 `tpch` queries are run during period t2, so it is the `tpch` queries that cause many Coprocessor requests.

Suggested change
In fact, during the entire time period from t1 to t2, the `go-ycsb` pressure test is being run. Then 20 `tpch` queries are run during period t2, so it is the `tpch` queries that cause many Coprocessor requests.
In fact, during the entire time period from t1 to t2, the `go-ycsb` pressure test is running. Then 20 `tpch` queries are running during period t2. So it is the `tpch` queries that cause many Coprocessor requests.
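
A comparison like the one between period t1 and period t2 above can be sketched as a self-join of two hinted queries. In the sketch below, the two time ranges are placeholders for t1 and t2, and the `metrics_name` column name is assumed from the result discussion above:

```sql
-- A sketch: rank metrics by how much their average value grew from period t1 to t2.
-- The two time ranges are placeholders; replace them with your own t1 and t2.
SELECT t2.metrics_name, t2.label, t2.avg_value / t1.avg_value AS ratio
FROM
  (SELECT /*+ time_range('2020-03-03 17:18:00', '2020-03-03 17:21:00') */ *
   FROM information_schema.metrics_summary_by_label) AS t2
JOIN
  (SELECT /*+ time_range('2020-03-03 17:08:00', '2020-03-03 17:11:00') */ *
   FROM information_schema.metrics_summary_by_label) AS t1
  ON t1.metrics_name = t2.metrics_name
 AND t1.label = t2.label
 AND t1.quantile = t2.quantile
 AND t1.avg_value > 0
ORDER BY ratio DESC
LIMIT 10;
```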


* `TABLE_NAME`: Corresponds to the table name in `metrics_schema`.
* `PROMQL`: The working principle of the monitoring table is to map SQL statements to `PromQL` and convert Prometheus results into SQL query results. This field is the expression template of `PromQL`. When getting the data of the monitoring table, the query conditions are used to rewrite the variables in this template to generate the final query expression.
* `LABELS`: The label for the monitoring item. Each label corresponds to a column in the monitoring table. If the SQL statement contains filter of the corresponding column, the corresponding `PromQL` changes accordingly.

Suggested change
* `LABELS`: The label for the monitoring item. Each label corresponds to a column in the monitoring table. If the SQL statement contains filter of the corresponding column, the corresponding `PromQL` changes accordingly.
* `LABELS`: The label for the monitoring item. Each label corresponds to a column in the monitoring table. If the SQL statement contains the filter of the corresponding column, the corresponding `PromQL` changes accordingly.

* `QUANTILE`: The percentile. For monitoring data of the histogram type, a default percentile is specified. If the value of this field is `0`, it means that the monitoring item corresponding to the monitoring table is not a histogram.
* `COMMENT`: The comment for the monitoring table.

Generally, use "about" or "on".

Suggested change
* `COMMENT`: The comment for the monitoring table.
* `COMMENT`: The comment about the monitoring table.
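
As a small illustration of how these fields fit together, the following sketch lists monitoring tables that are backed by histogram metrics (per the `QUANTILE` description above, a value of `0` means the metric is not a histogram):

```sql
-- A sketch: list monitoring tables whose underlying metric is a histogram,
-- that is, tables with a non-zero default percentile.
SELECT table_name, quantile, comment
FROM information_schema.metrics_tables
WHERE quantile > 0
ORDER BY table_name
LIMIT 10;
```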

@TomShawn

@lilin90 Comment addressed, PTAL again, thanks!

@lilin90 lilin90 left a comment

LGTM

@lilin90 lilin90 added the status/can-merge Indicates a PR has been approved by a committer. label Apr 24, 2020

sre-bot commented Apr 24, 2020

/run-all-tests

@lilin90 lilin90 removed the status/PTAL This PR is ready for reviewing. label Apr 24, 2020
@lilin90 lilin90 removed the status/require-change Needs the author to address comments. label Apr 24, 2020
@sre-bot sre-bot merged commit 9f650e8 into pingcap:master Apr 24, 2020
sre-bot pushed a commit to sre-bot/docs that referenced this pull request Apr 24, 2020

sre-bot commented Apr 24, 2020

cherry pick to release-4.0 in PR #2394

@TomShawn TomShawn deleted the metrics-tables branch April 24, 2020 08:12
TomShawn added a commit that referenced this pull request Apr 24, 2020
Co-authored-by: TomShawn <41534398+TomShawn@users.noreply.github.com>