proposal: tidb built-in sql diagnostics #13481

lonng · 2019-11-15T02:36:14Z

Signed-off-by: Lonng <heng@lonng.org>

Summary

Currently, TiDB obtains diagnostic information mainly through external tools (perf/iosnoop/iotop/iostat/vmstat/sar/...), monitoring systems (Prometheus/Grafana), log files, HTTP APIs, and system tables provided by TiDB. The scattered toolchains and cumbersome acquisition methods raise the barrier to using TiDB clusters, make operation and maintenance difficult, and make it hard to detect problems in advance or to investigate, diagnose, and recover clusters in time.
This proposal introduces a new method of acquiring diagnostic information in TiDB and exposing it through system tables so that users can query it with SQL.

Motivation

This proposal mainly solves the following problems in how TiDB obtains diagnostic information:

- The toolchains are scattered, so users must switch back and forth between different tools, and some Linux distributions either do not ship the corresponding tools or ship versions other than the expected ones.
- The information acquisition methods are inconsistent: SQL, HTTP, exported monitoring data, logging in to each node to view logs, and so on.
- A TiDB cluster has many components, and correlating monitoring information across components is inefficient and cumbersome.
- TiDB has no centralized log management component, and there is no efficient way to filter, retrieve, analyze, and aggregate logs across the entire cluster.
- The system tables contain only the current node's information and do not reflect the state of the entire cluster, for example SLOW_QUERY, PROCESSLIST, and STATEMENTS_SUMMARY.

Once the multi-dimensional cluster-level system tables and the cluster diagnostic rule framework are provided, cluster-wide information queries, state acquisition, log retrieval, one-click inspection, and fault diagnosis will become more efficient. They will also provide the basic data for a subsequent early-warning function for anomalies.

codecov · 2019-11-15T06:31:51Z

Codecov Report

Merging #13481 into master will not change coverage.
The diff coverage is n/a.

@@             Coverage Diff             @@
##             master     #13481   +/-   ##
===========================================
  Coverage   80.1103%   80.1103%           
===========================================
  Files           474        474           
  Lines        117257     117257           
===========================================
  Hits          93935      93935           
  Misses        15948      15948           
  Partials       7374       7374

siddontang · 2019-11-18T09:30:13Z

docs/design/2019-11-14-tidb-builtin-diagnostics-zh_CN.md

+|   33 | pd   | pd-0   | 127.0.0.1:2379  | log.format                  | text          |
+|   34 | pd   | pd-0   | 127.0.0.1:2379  | log.level                   |               |
+|   35 | pd   | pd-0   | 127.0.0.1:2379  | log.sampling                | <nil>         |
+|  114 | tidb | tidb-0 | 127.0.0.1:4000  | log.disable-error-stack     | <nil>         |


tidb-0? can we name a TiDB instance?

Actually, we cannot; it's a temporary name generated by TIDB_CLUSTER_INFO. Do we need to remove it?

siddontang · 2019-11-19T10:34:38Z

docs/design/2019-11-14-tidb-builtin-diagnostics-zh_CN.md

+
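+#### Monitoring information system tables
+
+Monitoring metrics are added and removed as the program iterates, and the same metric may be queried with different expressions to obtain information along different dimensions. Given these two requirements, a flexible framework for monitoring system tables is needed. For now, this proposal adopts the following approach: map each expression to a system table in the `metrics_schema` database. Expressions and system tables can be associated as follows: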


We also need to provide a mechanism for checking that the query is valid.
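As an illustration of the mapping and the validity check discussed in this thread, here is a minimal sketch. It assumes a hypothetical registry that maps each `metrics_schema` table name to a PromQL template; the table names, the metric names, the `$PLACEHOLDER` syntax, and the `render` helper are all assumptions made for the example and are not taken from the proposal.

```python
import re

# Hypothetical registry mapping a metrics_schema table name to a PromQL
# template. Table names, metric names, and the $PLACEHOLDER syntax are
# assumptions made for this sketch.
METRIC_TABLES = {
    "qps": "sum(rate(tidb_server_query_total{$LABELS}[$RANGE])) by (instance)",
    "query_duration": (
        "histogram_quantile($QUANTILE, "
        "sum(rate(tidb_server_handle_query_duration_seconds_bucket{$LABELS}[$RANGE])) by (le))"
    ),
}

PLACEHOLDER = re.compile(r"\$([A-Z_]+)")


def render(table, params):
    """Validate a query against the registry, then render its PromQL."""
    if table not in METRIC_TABLES:
        raise KeyError(f"unknown metrics_schema table: {table}")
    template = METRIC_TABLES[table]
    required = set(PLACEHOLDER.findall(template))
    missing = required - set(params)
    unknown = set(params) - required
    if missing or unknown:
        # Reject the query before anything is sent to the metrics backend.
        raise ValueError(f"missing={sorted(missing)} unknown={sorted(unknown)}")
    return PLACEHOLDER.sub(lambda m: str(params[m.group(1)]), template)
```

Rejecting unknown tables and incomplete parameter sets up front keeps malformed expressions from reaching the metrics backend; a real check would probably also need to validate the rendered PromQL itself.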

tennix · 2019-11-21T10:19:30Z

docs/design/2019-11-14-tidb-builtin-diagnostics-zh_CN.md

+|  126 | tidb | tidb-0 | 127.0.0.1:4000  | log.record-plan-in-slow-log | 1             |
+|  127 | tidb | tidb-0 | 127.0.0.1:4000  | log.slow-query-file         | tidb-slow.log |
+|  128 | tidb | tidb-0 | 127.0.0.1:4000  | log.slow-threshold          | 300           |
+|  213 | tikv | tikv-0 | 127.0.0.1:20160 | log-file                    |               |


What is the meaning of the tikv name suffix, store id?

It has been removed and I have not updated this part yet.
See: #13587

tennix · 2019-11-21T10:21:26Z

docs/design/2019-11-14-tidb-builtin-diagnostics-zh_CN.md

+------+------+--------+-----------------+-----------------------------+---------------+
+| ID   | TYPE | NAME   | ADDRESS         | KEY                         | VALUE         |
+------+------+--------+-----------------+-----------------------------+---------------+
+|   21 | pd   | pd-0   | 127.0.0.1:2379  | log-file                    |               |


What is the meaning of ID?

It has been removed and I have not updated this part yet.
See: #13587

tennix · 2019-11-21T11:52:19Z

docs/design/2019-11-14-tidb-builtin-diagnostics-zh_CN.md

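+    - Disadvantages: it makes cluster operation and maintenance harder, and third-party components are not easy to integrate with TiDB's internal SQL; log collection tools collect the full logs, and the collection process consumes system resources (disk I/O, network I/O)
+- Each node provides a log service; TiDB pushes predicates down to the log search interface of each node and directly merges the logs returned by the nodes
+    - Advantages: no third-party components are introduced; after predicate pushdown only the filtered logs are returned; it integrates easily with TiDB SQL and can reuse the SQL engine's filtering, aggregation, and so on
+    - Disadvantages: once a node's logs are deleted, the corresponding logs can no longer be retrieved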


This can be resolved by allowing TiDB to query other logging systems, that is, by making the log storage pluggable. The same approach can be applied to the monitoring system: if users want more persistent monitoring/logging data, they can replace the built-in storage engines for this kind of data.

We choose a lightweight way for the time being.
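For reference, the lightweight option quoted above (push the predicates down to each node, then merge the filtered results) can be sketched as follows. This is an illustration only: `LogEntry`, `search_node`, and `search_cluster` are hypothetical names, the in-memory lists stand in for each node's log search interface, and each node is assumed to return its entries in time order.

```python
import heapq
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEntry:
    time_ms: int
    node: str
    level: str
    content: str


def search_node(entries, start_ms, end_ms, levels, pattern):
    """Stand-in for one node's log search interface: the predicates are
    evaluated on the node, so only matching entries are returned."""
    regex = re.compile(pattern)
    return [
        e for e in entries
        if start_ms <= e.time_ms <= end_ms
        and e.level in levels
        and regex.search(e.content)
    ]


def search_cluster(nodes, start_ms, end_ms, levels, pattern):
    """Push the same predicates to every node, then merge the already
    time-ordered per-node results into one cluster-wide stream."""
    per_node = [
        search_node(entries, start_ms, end_ms, levels, pattern)
        for entries in nodes.values()
    ]
    return list(heapq.merge(*per_node, key=lambda e: e.time_ms))
```

Because filtering happens before the merge, the coordinator only handles rows that can appear in the result, which is the advantage the quoted text attributes to this option.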

tennix · 2019-11-21T11:56:00Z

docs/design/2019-11-14-tidb-builtin-diagnostics-zh_CN.md

+
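+#### Node configuration information
+
+所有节点都包含当前节点的生效配置，不需要额外的步骤既可拿到配置信息。


Suggested change (fixes the typo 既可 → 即可; in English the sentence reads "Every node holds its own effective configuration, so the configuration can be obtained without extra steps"):

-所有节点都包含当前节点的生效配置，不需要额外的步骤既可拿到配置信息。
+所有节点都包含当前节点的生效配置，不需要额外的步骤即可拿到配置信息。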

tennix · 2019-11-21T12:23:47Z

docs/design/2019-11-14-tidb-builtin-diagnostics-zh_CN.md

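+    - rxcmp/s: compressed packets received per second
+    - txcmp/s: compressed packets transmitted per second
+    - rxmcst/s: multicast packets received per second
+- Common system configuration: sysctl -a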


ulimit -a

aylei · 2019-11-21T12:20:41Z

docs/design/2019-11-14-tidb-builtin-diagnostics-zh_CN.md

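+1. Add a `remote-metrics-storage` configuration item to PD, temporarily set to the address of the Prometheus server. PD acts as a proxy and forwards requests to Prometheus for execution, mainly for the following reasons:
+    - When PD later implements the query interface itself and becomes self-contained, TiDB needs no further changes
+    - Users who run their own monitoring service instead of the Prometheus deployed with TiDB can still use SQL to query monitoring information and use the diagnostic framework
+2. Extract the Prometheus modules responsible for storing and querying time-series data and embed them in PD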


If we are going to embed Prometheus into PD, how about shimming TiKV in as the metrics storage via https://github.com/bragfoo/TiPrometheus?
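To make the first option in the quoted hunk concrete (PD forwarding metric queries to the address configured in `remote-metrics-storage`), here is a rough sketch of how the forwarded request could be formed. It targets the `/api/v1/query_range` endpoint of the Prometheus HTTP API; the function name and the idea that the proxy simply rewrites the URL are assumptions for illustration, not PD's actual behavior.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit


def build_query_range_url(remote_metrics_storage, promql, start, end, step):
    """Build the Prometheus range-query URL a proxy could forward to.

    remote_metrics_storage is the configured base address, for example
    "http://127.0.0.1:9090" (an assumed example value).
    """
    parts = urlsplit(remote_metrics_storage)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"invalid remote-metrics-storage address: {remote_metrics_storage!r}")
    # Keep any path prefix the operator configured in front of Prometheus.
    path = parts.path.rstrip("/") + "/api/v1/query_range"
    query = urlencode({"query": promql, "start": start, "end": end, "step": step})
    return urlunsplit((parts.scheme, parts.netloc, path, query, ""))
```

Keeping the backend address in one configuration item means the caller never needs to know whether the query is served by an external Prometheus or, later, by PD itself.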

aylei · 2019-11-21T12:24:18Z

docs/design/2019-11-14-tidb-builtin-diagnostics-zh_CN.md

+
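+- Use the Prometheus client and PromQL to query data from the Prometheus server
+    - Advantages: an off-the-shelf solution exists; only the Prometheus server address needs to be registered with TiDB, so it is simple to implement
+    - Disadvantages: it increases TiDB's dependence on Prometheus and makes it harder to remove Prometheus completely later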


I am curious how we would integrate the community's alerting and graphing solutions, such as Alertmanager and Grafana, after we remove Prometheus.

winkyao

LGTM

sre-bot · 2019-11-28T06:34:24Z

/run-all-tests

bb7133

lgtm

siddontang · 2019-11-28T07:27:37Z

docs/design/2019-11-14-tidb-builtin-diagnostics-zh_CN.md

+---------+-------------+------+------+---------+-------+
+5 rows in set (0.00 sec)
+
+mysql> select * from tidb_cluster_log where content like '%412134239937495042%'; -- query the full-link logs for TSO 412134239937495042


mysql> -- query the full-link logs for TSO 412134239937495042
mysql> select * from tidb_cluster_log where content like '%412134239937495042%';

may look better

siddontang · 2019-11-28T07:28:15Z

docs/design/2019-11-14-tidb-builtin-diagnostics-zh_CN.md

+------+--------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
+| TYPE | ADDRESS                | LEVEL | CONTENT                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
+------+------------------------+-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
+| tidb | 10.9.120.251:10080     | INFO  | [coprocessor.go:725] ["[TIME_COP_PROCESS] resp_time:501.60574ms txnStartTS:412134239937495042 region_id:180 store_addr:10.9.82.29:20160 kv_process_ms:416 scan_total_write:340807 scan_processed_write:340806 scan_total_data:0 scan_processed_data:0 scan_total_lock:1 scan_processed_lock:0"]                                                                                                                                                                                                                                                             |


I think the log is too verbose here; can you simplify it and list only a few entries?
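One way to make such long entries easier to read, sketched here purely as an illustration, is to split the CONTENT column into its `key:value` fields on the client side. The parser below assumes the field layout of the `[TIME_COP_PROCESS]` line quoted above; `parse_cop_fields` is a hypothetical helper, not part of TiDB.

```python
import re

# A field is "name:value"; the value runs until whitespace, a double quote,
# or a closing bracket, so addresses such as 10.9.82.29:20160 stay intact.
FIELD = re.compile(r"(\w+):([^\s\"\]]+)")


def parse_cop_fields(content):
    """Extract the key:value fields that follow the [TIME_COP_PROCESS] marker."""
    _, found, tail = content.partition("[TIME_COP_PROCESS]")
    if not found:
        return {}
    return dict(FIELD.findall(tail))
```

With the fields separated, a demo could show only a handful of them (for example `txnStartTS`, `region_id`, and `resp_time`) instead of the full line.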

siddontang

Rest LGTM

lonng mentioned this pull request Nov 15, 2019

[Draft] proposal: TiDB built-in SQL Diagnostics tikv/rfcs#34

Closed

proposal: tidb built-in sql diagnostics

aa230ea

Signed-off-by: Lonng <heng@lonng.org>

djshow832 reviewed Nov 15, 2019

dcalvin self-requested a review November 15, 2019 06:47

djshow832 reviewed Nov 15, 2019

AilinKid reviewed Nov 15, 2019

djshow832 reviewed Nov 15, 2019

djshow832 reviewed Nov 15, 2019

djshow832 reviewed Nov 15, 2019

djshow832 reviewed Nov 15, 2019

djshow832 reviewed Nov 15, 2019

djshow832 reviewed Nov 15, 2019

djshow832 reviewed Nov 15, 2019

lonng added 2 commits November 15, 2019 16:46

address comments

2303609

Signed-off-by: Lonng <heng@lonng.org>

address comments

8aa0d88

Signed-off-by: Lonng <heng@lonng.org>

lonng mentioned this pull request Nov 15, 2019

Incubating Program: TiDB built-in SQL Diagnostics pingcap/community#81

Closed

crazycs520 reviewed Nov 18, 2019

crazycs520 reviewed Nov 18, 2019

crazycs520 reviewed Nov 18, 2019

lonng added 3 commits November 18, 2019 09:57

address comment

637593c

Signed-off-by: Lonng <heng@lonng.org>

polish the RFC

30c26d8

Signed-off-by: Lonng <heng@lonng.org>

address comment

9fc3299

Signed-off-by: Lonng <heng@lonng.org>

winkyao reviewed Nov 18, 2019

winkyao reviewed Nov 18, 2019

siddontang reviewed Nov 18, 2019

siddontang reviewed Nov 18, 2019

siddontang reviewed Nov 18, 2019

update promql to query metric

434410a

siddontang reviewed Nov 18, 2019

change the diagnostics gRPC service defination

26cf6a0

Signed-off-by: Lonng <heng@lonng.org>

siddontang reviewed Nov 19, 2019

lonng mentioned this pull request Nov 20, 2019

Implement the Diagnostics gRPC service tikv/tikv#5980

Merged

edits from calvin

9f7ace3

tennix reviewed Nov 21, 2019

tennix reviewed Nov 21, 2019

AstroProfundis reviewed Nov 21, 2019

aylei reviewed Nov 21, 2019

aylei mentioned this pull request Nov 23, 2019

Extend tidbcluster v1alpha1 to support configuration and pump pingcap/tidb-operator#1193

Merged

sre-bot mentioned this pull request Nov 25, 2019

perfschema: support retrieve CPU/memory/mutex/block/allocs profile from PD via SQL #13717

Merged

lonng and others added 4 commits November 28, 2019 12:29

update tidb_cluster_log demo

75bea03

Signed-off-by: Lonng <heng@lonng.org>

Apply suggestions from code review

1ec3aba

Co-Authored-By: Tennix <tennix@users.noreply.github.com>

address comment

e0d3242

Signed-off-by: Lonng <heng@lonng.org>

Merge remote-tracking branch 'origin/master' into sql-diagnositics

372fe9b

winkyao approved these changes Nov 28, 2019

sre-bot added the status/can-merge Indicates a PR has been approved by a committer. label Nov 28, 2019

Merge branch 'master' into sql-diagnositics

3e0814e

lonng removed the status/can-merge Indicates a PR has been approved by a committer. label Nov 28, 2019

bb7133 reviewed Nov 28, 2019

lonng merged commit 08f0072 into pingcap:master Nov 28, 2019

lonng deleted the sql-diagnositics branch November 28, 2019 06:45

siddontang reviewed Nov 28, 2019

siddontang approved these changes Nov 28, 2019

XiaTianliang pushed a commit to XiaTianliang/tidb that referenced this pull request Dec 21, 2019

proposal: tidb built-in sql diagnostics (pingcap#13481)

749fd64

Signed-off-by: Lonng <heng@lonng.org>
