Conversation

Contributor

@TomShawn TomShawn commented Apr 13, 2020

What is changed, added or deleted? (Required)

Add inspection_result and inspection_summary tables.

Which TiDB version(s) do your changes apply to? (Required)

  • master (the latest development version)
  • v4.0 (TiDB 4.0 versions)
  • v3.1 (TiDB 3.1 versions)
  • v3.0 (TiDB 3.0 versions)
  • v2.1 (TiDB 2.1 versions)

If you select two or more versions from above, to trigger the bot to cherry-pick this PR to your desired release version branch(es), you must add corresponding labels such as needs-cherry-pick-4.0, needs-cherry-pick-3.1, needs-cherry-pick-3.0, and needs-cherry-pick-2.1.

What is the related PR or file link(s)?

@TomShawn TomShawn added translation/from-docs-cn This PR is translated from a PR in pingcap/docs-cn. v4.0 This PR/issue applies to TiDB v4.0. size/large Changes of a large size. needs-cherry-pick-4.0 labels Apr 13, 2020
@TomShawn TomShawn requested a review from reafans April 13, 2020 12:03
@TomShawn TomShawn requested a review from lilin90 April 14, 2020 06:43
@TomShawn TomShawn added the status/PTAL This PR is ready for reviewing. label Apr 14, 2020
Member

@lilin90 lilin90 left a comment

I only finished reviewing part of this PR (up to the `## Diagnosis rules` line in inspection-result.md). There are many comprehension or technical writing issues. Please resolve my comments and check all the changes in this PR again. Thanks!


# INSPECTION_RESULT

TiDB has some built-in diagnosis rules for detecting faults and hidden dangers in the system.
Member

"danger" might be a little bit too strong.

Suggested change
TiDB has some built-in diagnosis rules for detecting faults and hidden dangers in the system.
TiDB has some built-in diagnosis rules for detecting faults and hidden issues in the system.


TiDB has some built-in diagnosis rules for detecting faults and hidden dangers in the system.

This diagnosis feature can help you quickly find problems and reduce your repetitive manual work. You can use the `select * from information_schema.inspection_result` statement to trigger the internal diagnosis.
Member

Suggested change
This diagnosis feature can help you quickly find problems and reduce your repetitive manual work. You can use the `select * from information_schema.inspection_result` statement to trigger the internal diagnosis.
The `INSPECTION_RESULT` diagnosis feature can help you quickly find problems and reduce your repetitive manual work. You can use the `select * from information_schema.inspection_result` statement to trigger the internal diagnosis.
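As a quick reference, the statement mentioned above can be run as-is (assuming the client is connected to a TiDB instance):

```sql
select * from information_schema.inspection_result;
```

Each row of the result corresponds to one diagnosis finding.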


Field description:

* `RULE`: The name of the diagnosis rules. Below are the currently available rules:
Member

Suggested change
* `RULE`: The name of the diagnosis rules. Below are the currently available rules:
* `RULE`: The name of the diagnosis rule. Currently, the following rules are available:

* `config`: The consistency check of configuration. If the same configuration is inconsistent on different instances, a `warning` diagnosis result is generated.
* `version`: The consistency check of version. If the same version is inconsistent on different instances, a `warning` diagnosis result is generated.
* `current-load`: If the current system load is too high, the corresponding `warning` diagnosis result is generated.
* `critical-error`: Each module of the system defines a serious error. If a certain serious error exceeds the threshold within the corresponding time period, a warning diagnosis result is generated.
Member

@reafans Please help confirm whether it's "an error" or "errors".
@TomShawn Please keep words consistent with that in the rule name.

Suggested change
* `critical-error`: Each module of the system defines a serious error. If a certain serious error exceeds the threshold within the corresponding time period, a warning diagnosis result is generated.
* `critical-error`: Each module of the system defines critical errors. If a critical error exceeds the threshold within the corresponding time period, a warning diagnosis result is generated.

* `version`: The consistency check of version. If the same version is inconsistent on different instances, a `warning` diagnosis result is generated.
* `current-load`: If the current system load is too high, the corresponding `warning` diagnosis result is generated.
* `critical-error`: Each module of the system defines a serious error. If a certain serious error exceeds the threshold within the corresponding time period, a warning diagnosis result is generated.
* `threshold-check`: The diagnosis system determines thresholds of many metrics. If a threshold is exceeded, the corresponding diagnosis information is generated.
Member

Please try to understand it based on the context. It does not mean "determine" here.

Suggested change
* `threshold-check`: The diagnosis system determines thresholds of many metrics. If a threshold is exceeded, the corresponding diagnosis information is generated.
* `threshold-check`: The diagnosis system checks the thresholds of a large number of metrics. If a threshold is exceeded, the corresponding diagnosis information is generated.


You can have the following findings from the above diagnosis result:

* The first line indicates that TiDB's `log.slow-threshold` configuration value is `0`, which might affect performance.
Member

Check the code block, please. It's "row", not "line".

Suggested change
* The first line indicates that TiDB's `log.slow-threshold` configuration value is `0`, which might affect performance.
* The first row indicates that TiDB's `log.slow-threshold` value is configured to `0`, which might affect performance.

You can have the following findings from the above diagnosis result:

* The first line indicates that TiDB's `log.slow-threshold` configuration value is `0`, which might affect performance.
* The second line indicates that two different TiDB versions exist in the cluster.
Member

Suggested change
* The second line indicates that two different TiDB versions exist in the cluster.
* The second row indicates that two different TiDB versions exist in the cluster.


* The first line indicates that TiDB's `log.slow-threshold` configuration value is `0`, which might affect performance.
* The second line indicates that two different TiDB versions exist in the cluster.
* The third and fourth lines indicate that the TiKV write delay is too long, and the expected delay is no more than 0.1s. The actual delay is far beyond the expectation.
Member

"beyond the expectation" is inappropriate and confusing here.

Suggested change
* The third and fourth lines indicate that the TiKV write delay is too long, and the expected delay is no more than 0.1s. The actual delay is far beyond the expectation.
* The third and fourth rows indicate that the TiKV write delay is too long. The expected delay is no more than 0.1 second, while the actual delay is far longer than expected.

* The second line indicates that two different TiDB versions exist in the cluster.
* The third and fourth lines indicate that the TiKV write delay is too long, and the expected delay is no more than 0.1s. The actual delay is far beyond the expectation.

Diagnose the cluster problem from "2020-03-26 00:03:00" to "2020-03-26 00:08:00". To specify the time range, you need to use the SQL Hint of `/*+ time_range() */`. See the following query example:
Member

Please pay attention to the connection between paragraphs.

Suggested change
Diagnose the cluster problem from "2020-03-26 00:03:00" to "2020-03-26 00:08:00". To specify the time range, you need to use the SQL Hint of `/*+ time_range() */`. See the following query example:
You can also diagnose issues existing within a specified range, such as from "2020-03-26 00:03:00" to "2020-03-26 00:08:00". To specify the time range, use the SQL Hint of `/*+ time_range() */`. See the following query example:
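A full statement using this hint might look like the following (a sketch; the hint syntax is as described above, and the timestamps are the ones from this example):

```sql
select /*+ time_range("2020-03-26 00:03:00", "2020-03-26 00:08:00") */ *
from information_schema.inspection_result;
```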

Comment on lines 137 to 156
You can have the following findings from the above result:

* The first line indicates that the `172.16.5.40:4009` TiDB instance is restarted at `2020/03/26 00:05:45.670`.
* The second line indicates that the maximum `get-token-duration` time of the `172.16.5.40:10089` TiDB instance is 0.234s, but the expected time is less than 0.001s.

You can also specify conditions, for example, to query the `critical` level diagnosis results:

{{< copyable "sql" >}}

```sql
select * from inspection_result where severity='critical';
```

Query only the diagnosis result of the `critical-error` rule:

{{< copyable "sql" >}}

```sql
select * from inspection_result where rule='critical-error';
```
Member

Please refer to the suggestion for the previous similar section and update accordingly.

@TomShawn
Contributor Author

@lilin90 All comments are addressed and applied to similar sections in this PR, PTAL again, thanks!

@sre-bot
Contributor

sre-bot commented Apr 17, 2020

@lilin90, @reafans, PTAL.

1 similar comment
@sre-bot
Contributor

sre-bot commented Apr 19, 2020

@lilin90, @reafans, PTAL.

Contributor

@reafans reafans left a comment

LGTM


### `config` diagnosis rule

The following two diagnosis rules are executed as the `config` diagnosis by querying the `CLUSTER_CONFIG` system table:
Member

Suggested change
The following two diagnosis rules are executed as the `config` diagnosis by querying the `CLUSTER_CONFIG` system table:
In the `config` diagnosis rule, the following two diagnosis rules are executed by querying the `CLUSTER_CONFIG` system table:


The following two diagnosis rules are executed as the `config` diagnosis by querying the `CLUSTER_CONFIG` system table:

* Check whether the configuration values of the same component are consistent. Not all configuration items has this consistency check. The white list of consistency check is shown below:
Member

Suggested change
* Check whether the configuration values of the same component are consistent. Not all configuration items has this consistency check. The white list of consistency check is shown below:
* Check whether the configuration values of the same component are consistent. Not all configuration items have this consistency check. The white list of the consistency check is as follows:


### `critical-error` diagnosis rule

The following two diagnosis rules are executed as the the `critical-error` diagnosis:
Member

Suggested change
The following two diagnosis rules are executed as the the `critical-error` diagnosis:
In the `critical-error` diagnosis rule, the following two diagnosis rules are executed:

| Component | Error name | Monitoring table | Error description |
| ---- | ---- | ---- | ---- |
| TiDB | panic-count | tidb_panic_count_total_count | Panic occurs in TiDB. |
| TiDB | binlog-error | tidb_binlog_error_total_count | An error occurs when TiDB writes binlog files. |
Member

Suggested change
| TiDB | binlog-error | tidb_binlog_error_total_count | An error occurs when TiDB writes binlog files. |
| TiDB | binlog-error | tidb_binlog_error_total_count | An error occurs when TiDB writes binlog. |

| TiKV | channel-is-full | tikv_channel_full_total_count | The "channel full" error occurs in TiKV. |
| TiKV | tikv_engine_write_stall | tikv_engine_write_stall | The "stall" error occurs in TiKV. |

* Check if any component is restarted by querying the `metrics_schema.up` monitoring table and the `CLUSTER_LOG` system table.
Member

"if" has multiple meanings.

Suggested change
* Check if any component is restarted by querying the `metrics_schema.up` monitoring table and the `CLUSTER_LOG` system table.
* Check whether any component is restarted by querying the `metrics_schema.up` monitoring table and the `CLUSTER_LOG` system table.


Field description:

* `RULE`: Summary rules. New rules are being added, and you can execute the `select * from inspection_rules where type='summary'` statement to check the latest rule list.
Member

Suggested change
* `RULE`: Summary rules. New rules are being added, and you can execute the `select * from inspection_rules where type='summary'` statement to check the latest rule list.
* `RULE`: Summary rules. Because new rules are being added continuously, you can execute the `select * from inspection_rules where type='summary'` statement to query the latest rule list.
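The rule-list statement from this suggestion, written out as a copyable block (the `information_schema` qualifier is an assumption, in case the current database differs):

```sql
select * from information_schema.inspection_rules where type='summary';
```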


* `RULE`: Summary rules. New rules are being added, and you can execute the `select * from inspection_rules where type='summary'` statement to check the latest rule list.
* `INSTANCE`: The monitored instance.
* `METRIC_NAME`: The monitoring metrics name.
Member

  • Please keep the field name consistent with that in the code block.
  • Please check the meaning of this field because it seems inconsistent with the Chinese version 监控表 (monitoring table).
Suggested change
* `METRIC_NAME`: The monitoring metrics name.
* `METRICS_NAME`: The monitoring metrics name.

Contributor Author

As confirmed with @reafans, "The monitoring metrics name" is OK.


> **Note:**
>
> Because summarizing all results causes overhead, the rules in `information_summary` are triggered passively, which means that the specified `rule` is displayed in the SQL predicate before the rule runs. For example, executing the `select * from inspection_summary` statement returns an empty result set. Executing `select * from inspection_summary where rule in ('read-link', 'ddl')` summarizes the read link and DDL-related monitoring metrics.
Member

Suggested change
> Because summarizing all results causes overhead, the rules in `information_summary` are triggered passively, which means that the specified `rule` is displayed in the SQL predicate before the rule runs. For example, executing the `select * from inspection_summary` statement returns an empty result set. Executing `select * from inspection_summary where rule in ('read-link', 'ddl')` summarizes the read link and DDL-related monitoring metrics.
> Because summarizing all results causes overhead, the rules in `information_summary` are triggered passively. That is, the specified `rule` runs only when it displays in the SQL predicate. For example, executing the `select * from inspection_summary` statement returns an empty result set. Executing `select * from inspection_summary where rule in ('read-link', 'ddl')` summarizes the read link and DDL-related monitoring metrics.
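To illustrate the passive triggering described in this note, the two statements it mentions are (verbatim from the note, with the schema qualifier added as an assumption):

```sql
-- Returns an empty result set because no rule is specified:
select * from information_schema.inspection_summary;

-- Summarizes the read link and DDL-related monitoring metrics:
select * from information_schema.inspection_summary where rule in ('read-link', 'ddl');
```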


Usage example:

Both the diagnosis result table and the diagnosis monitoring summary table can specify the diagnosis time range using `hint`. `select /*+ time_range('2020-03-07 12:00:00','2020-03-07 13:00:00') */ * from inspection_summary` is the monitoring summary for the `2020-03-07 12:00:00` to `2020-03-07 13:00:00` period. Like the monitoring summary table, you can use the diagnosis result table to quickly find the monitoring items with large differences by comparing the data of two different periods. The following is an example:
Member

Suggested change
Both the diagnosis result table and the diagnosis monitoring summary table can specify the diagnosis time range using `hint`. `select /*+ time_range('2020-03-07 12:00:00','2020-03-07 13:00:00') */ * from inspection_summary` is the monitoring summary for the `2020-03-07 12:00:00` to `2020-03-07 13:00:00` period. Like the monitoring summary table, you can use the diagnosis result table to quickly find the monitoring items with large differences by comparing the data of two different periods. The following is an example:
Both the diagnosis result table and the diagnosis monitoring summary table can specify the diagnosis time range using `hint`. `select /*+ time_range('2020-03-07 12:00:00','2020-03-07 13:00:00') */ * from inspection_summary` is the monitoring summary for the `2020-03-07 12:00:00` to `2020-03-07 13:00:00` period. Like the monitoring summary table, you can use the diagnosis result table to quickly find the monitoring items with large differences by comparing the data of two different periods.


Both the diagnosis result table and the diagnosis monitoring summary table can specify the diagnosis time range using `hint`. `select /*+ time_range('2020-03-07 12:00:00','2020-03-07 13:00:00') */ * from inspection_summary` is the monitoring summary for the `2020-03-07 12:00:00` to `2020-03-07 13:00:00` period. Like the monitoring summary table, you can use the diagnosis result table to quickly find the monitoring items with large differences by comparing the data of two different periods. The following is an example:
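The example referred to above is not quoted in this thread. A sketch of such a two-period comparison (the subquery-plus-join shape, the column names `metrics_name`/`avg_value`, and the ratio threshold are assumptions for illustration):

```sql
select t1.metrics_name,
       t1.avg_value as earlier_avg,
       t2.avg_value as later_avg
from (select /*+ time_range("2020-03-07 12:00:00", "2020-03-07 13:00:00") */ *
      from information_schema.inspection_summary where rule = 'read-link') t1
join (select /*+ time_range("2020-03-07 13:00:00", "2020-03-07 14:00:00") */ *
      from information_schema.inspection_summary where rule = 'read-link') t2
  on t1.metrics_name = t2.metrics_name
where t1.avg_value > 0 and t2.avg_value / t1.avg_value > 2;
```

Rows returned this way point at metrics whose average more than doubled between the two periods.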

You can also diagnose issues existing within a specified range, such as from "2020-01-16 16:00:54.933" to "2020-01-16 16:10:54.933":
Member

Suggested change
You can also diagnose issues existing within a specified range, such as from "2020-01-16 16:00:54.933" to "2020-01-16 16:10:54.933":
See the following example that diagnoses issues within a specified range, from "2020-01-16 16:00:54.933" to "2020-01-16 16:10:54.933":

@TomShawn
Contributor Author

@lilin90 All comments are addressed. PTAL again, thanks!

Member

@lilin90 lilin90 left a comment

LGTM

@lilin90 lilin90 added the status/can-merge Indicates a PR has been approved by a committer. label Apr 24, 2020
@sre-bot
Contributor

sre-bot commented Apr 24, 2020

Your auto merge job has been accepted, waiting for:

  • 2251

@TomShawn TomShawn merged commit b948a68 into pingcap:master Apr 24, 2020
sre-bot pushed a commit to sre-bot/docs that referenced this pull request Apr 24, 2020
Signed-off-by: sre-bot <sre-bot@pingcap.com>
@sre-bot
Contributor

sre-bot commented Apr 24, 2020

cherry pick to release-4.0 in PR #2395

@TomShawn TomShawn deleted the inspection-system-tables branch April 24, 2020 08:14
TomShawn added a commit that referenced this pull request Apr 24, 2020
* cherry pick #2261 to release-4.0

Signed-off-by: sre-bot <sre-bot@pingcap.com>

* resolve conflict

Co-authored-by: TomShawn <41534398+TomShawn@users.noreply.github.com>
Co-authored-by: TomShawn <1135243111@qq.com>