reference: add 2 inspection tables #2261
lilin90
left a comment
I only finished reviewing part of this PR (till the ## Diagnosis rules line in inspection-result.md). Many understanding or technical writing issues exist. Please resolve my comments and check all changes in PR again. Thanks!
|
|
||
| # INSPECTION_RESULT | ||
|
|
||
| TiDB has some built-in diagnosis rules for detecting faults and hidden dangers in the system. |
danger might be a little bit too strong.
| TiDB has some built-in diagnosis rules for detecting faults and hidden dangers in the system. | |
| TiDB has some built-in diagnosis rules for detecting faults and hidden issues in the system. |
|
|
||
| TiDB has some built-in diagnosis rules for detecting faults and hidden dangers in the system. | ||
|
|
||
| This diagnosis feature can help you quickly find problems and reduce your repetitive manual work. You can use the `select * from information_schema.inspection_result` statement to trigger the internal diagnosis. |
| This diagnosis feature can help you quickly find problems and reduce your repetitive manual work. You can use the `select * from information_schema.inspection_result` statement to trigger the internal diagnosis. | |
| The `INSPECTION_RESULT` diagnosis feature can help you quickly find problems and reduce your repetitive manual work. You can use the `select * from information_schema.inspection_result` statement to trigger the internal diagnosis. |
|
|
||
| Field description: | ||
|
|
||
| * `RULE`: The name of the diagnosis rules. Below are the currently available rules: |
| * `RULE`: The name of the diagnosis rules. Below are the currently available rules: | |
| * `RULE`: The name of the diagnosis rule. Currently, the following rules are available: |
| * `config`: The consistency check of configuration. If the same configuration is inconsistent on different instances, a `warning` diagnosis result is generated. | ||
| * `version`: The consistency check of version. If the same version is inconsistent on different instances, a `warning` diagnosis result is generated. | ||
| * `current-load`: If the current system load is too high, the corresponding `warning` diagnosis result is generated. | ||
| * `critical-error`: Each module of the system defines a serious error. If a certain serious error exceeds the threshold within the corresponding time period, a warning diagnosis result is generated. |
@reafans Please help confirm whether it's "an error" or "errors".
@TomShawn Please keep words consistent with that in the rule name.
| * `critical-error`: Each module of the system defines a serious error. If a certain serious error exceeds the threshold within the corresponding time period, a warning diagnosis result is generated. | |
| * `critical-error`: Each module of the system defines critical errors. If a critical error exceeds the threshold within the corresponding time period, a warning diagnosis result is generated. |
| * `version`: The consistency check of version. If the same version is inconsistent on different instances, a `warning` diagnosis result is generated. | ||
| * `current-load`: If the current system load is too high, the corresponding `warning` diagnosis result is generated. | ||
| * `critical-error`: Each module of the system defines a serious error. If a certain serious error exceeds the threshold within the corresponding time period, a warning diagnosis result is generated. | ||
| * `threshold-check`: The diagnosis system determines thresholds of many metrics. If a threshold is exceeded, the corresponding diagnosis information is generated. |
Please try to understand it based on the context. It does not mean determine here.
| * `threshold-check`: The diagnosis system determines thresholds of many metrics. If a threshold is exceeded, the corresponding diagnosis information is generated. | |
| * `threshold-check`: The diagnosis system checks the thresholds of a large number of metrics. If a threshold is exceeded, the corresponding diagnosis information is generated. |
|
|
||
| You can have the following findings from the above diagnosis result: | ||
|
|
||
| * The first line indicates that TiDB's `log.slow-threshold` configuration value is `0`, which might affect performance. |
Check the code block, please. It's row, not line.
| * The first line indicates that TiDB's `log.slow-threshold` configuration value is `0`, which might affect performance. | |
| * The first row indicates that TiDB's `log.slow-threshold` value is configured to `0`, which might affect performance. |
| You can have the following findings from the above diagnosis result: | ||
|
|
||
| * The first line indicates that TiDB's `log.slow-threshold` configuration value is `0`, which might affect performance. | ||
| * The second line indicates that two different TiDB versions exist in the cluster. |
| * The second line indicates that two different TiDB versions exist in the cluster. | |
| * The second row indicates that two different TiDB versions exist in the cluster. |
|
|
||
| * The first line indicates that TiDB's `log.slow-threshold` configuration value is `0`, which might affect performance. | ||
| * The second line indicates that two different TiDB versions exist in the cluster. | ||
| * The third and fourth lines indicate that the TiKV write delay is too long, and the expected delay is no more than 0.1s. The actual delay is far beyond the expectation. |
beyond the expectation is inappropriate and confusing here.
| * The third and fourth lines indicate that the TiKV write delay is too long, and the expected delay is no more than 0.1s. The actual delay is far beyond the expectation. | |
| * The third and fourth rows indicate that the TiKV write delay is too long. The expected delay is no more than 0.1 second, while the actual delay is far longer than expected. |
| * The second line indicates that two different TiDB versions exist in the cluster. | ||
| * The third and fourth lines indicate that the TiKV write delay is too long, and the expected delay is no more than 0.1s. The actual delay is far beyond the expectation. | ||
|
|
||
| Diagnose the cluster problem from "2020-03-26 00:03:00" to "2020-03-26 00:08:00". To specify the time range, you need to use the SQL Hint of `/*+ time_range() */`. See the following query example: |
Please pay attention to the connection between paragraphs.
| Diagnose the cluster problem from "2020-03-26 00:03:00" to "2020-03-26 00:08:00". To specify the time range, you need to use the SQL Hint of `/*+ time_range() */`. See the following query example: | |
| You can also diagnose issues existing within a specified range, such as from "2020-03-26 00:03:00" to "2020-03-26 00:08:00". To specify the time range, use the SQL Hint of `/*+ time_range() */`. See the following query example: |
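A minimal sketch of such a query, assuming the `time_range` hint takes the start and end timestamps as two quoted arguments and that the table is addressed as `information_schema.inspection_result`, as in the statement quoted earlier in this review:

```sql
-- Diagnose only the five-minute window named in the sentence above.
-- The hint sits between `select` and the column list.
select /*+ time_range("2020-03-26 00:03:00", "2020-03-26 00:08:00") */ *
from information_schema.inspection_result;
```

Without the hint, the diagnosis presumably covers a default recent window; the exact default is not stated in this PR.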
| You can have the following findings from the above result: | ||
|
|
||
| * The first line indicates that the `172.16.5.40:4009` TiDB instance is restarted at `2020/03/26 00:05:45.670`. | ||
| * The second line indicates that the maximum `get-token-duration` time of the `172.16.5.40:10089` TiDB instance is 0.234s, but the expected time is less than 0.001s. | ||
|
|
||
| You can also specify conditions, for example, to query the `critical` level diagnosis results: | ||
|
|
||
| {{< copyable "sql" >}} | ||
|
|
||
| ```sql | ||
| select * from inspection_result where severity='critical'; | ||
| ``` | ||
|
|
||
| Query only the diagnosis result of the `critical-error` rule: | ||
|
|
||
| {{< copyable "sql" >}} | ||
|
|
||
| ```sql | ||
| select * from inspection_result where rule='critical-error'; | ||
| ``` |
Please refer to suggestion for the previous similar section and update accordingly.
|
@lilin90 All comments are addressed and applied to similar sections in this PR, PTAL again, thanks! |
reafans
left a comment
LGTM
|
|
||
| ### `config` diagnosis rule | ||
|
|
||
| The following two diagnosis rules are executed as the `config` diagnosis by querying the `CLUSTER_CONFIG` system table: |
| The following two diagnosis rules are executed as the `config` diagnosis by querying the `CLUSTER_CONFIG` system table: | |
| In the `config` diagnosis rule, the following two diagnosis rules are executed by querying the `CLUSTER_CONFIG` system table: |
|
|
||
| The following two diagnosis rules are executed as the `config` diagnosis by querying the `CLUSTER_CONFIG` system table: | ||
|
|
||
| * Check whether the configuration values of the same component are consistent. Not all configuration items has this consistency check. The white list of consistency check is shown below: |
| * Check whether the configuration values of the same component are consistent. Not all configuration items has this consistency check. The white list of consistency check is shown below: | |
| * Check whether the configuration values of the same component are consistent. Not all configuration items have this consistency check. The white list of the consistency check is as follows: |
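The consistency check can be approximated by hand. The sketch below assumes that `CLUSTER_CONFIG` lives in `information_schema` and exposes `TYPE`, `KEY`, and `VALUE` columns; these names are not stated in this PR, so verify them against your TiDB version. It also ignores the white list, so it reports every inconsistent item rather than only the ones the rule checks.

```sql
-- For each component type and configuration key, count how many
-- distinct values exist across instances (column names are assumed).
select type, `key`, count(distinct value) as distinct_values
from information_schema.cluster_config
group by type, `key`
-- More than one distinct value means the instances disagree.
having count(distinct value) > 1;
```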
|
|
||
| ### `critical-error` diagnosis rule | ||
|
|
||
| The following two diagnosis rules are executed as the the `critical-error` diagnosis: |
| The following two diagnosis rules are executed as the the `critical-error` diagnosis: | |
| In the `critical-error` diagnosis rule, the following two diagnosis rules are executed: |
| | Component | Error name | Monitoring table | Error description | | ||
| | ---- | ---- | ---- | ---- | | ||
| | TiDB | panic-count | tidb_panic_count_total_count | Panic occurs in TiDB. | | ||
| | TiDB | binlog-error | tidb_binlog_error_total_count | An error occurs when TiDB writes binlog files. | |
| | TiDB | binlog-error | tidb_binlog_error_total_count | An error occurs when TiDB writes binlog files. | | |
| | TiDB | binlog-error | tidb_binlog_error_total_count | An error occurs when TiDB writes binlog. | |
| | TiKV | channel-is-full | tikv_channel_full_total_count | The "channel full" error occurs in TiKV. | | ||
| | TiKV | tikv_engine_write_stall | tikv_engine_write_stall | The "stall" error occurs in TiKV. | | ||
|
|
||
| * Check if any component is restarted by querying the `metrics_schema.up` monitoring table and the `CLUSTER_LOG` system table. |
if has multiple meanings.
| * Check if any component is restarted by querying the `metrics_schema.up` monitoring table and the `CLUSTER_LOG` system table. | |
| * Check whether any component is restarted by querying the `metrics_schema.up` monitoring table and the `CLUSTER_LOG` system table. |
|
|
||
| Field description: | ||
|
|
||
| * `RULE`: Summary rules. New rules are being added, and you can execute the `select * from inspection_rules where type='summary'` statement to check the latest rule list. |
| * `RULE`: Summary rules. New rules are being added, and you can execute the `select * from inspection_rules where type='summary'` statement to check the latest rule list. | |
| * `RULE`: Summary rules. Because new rules are being added continuously, you can execute the `select * from inspection_rules where type='summary'` statement to query the latest rule list. |
|
|
||
| * `RULE`: Summary rules. New rules are being added, and you can execute the `select * from inspection_rules where type='summary'` statement to check the latest rule list. | ||
| * `INSTANCE`: The monitored instance. | ||
| * `METRIC_NAME`: The monitoring metrics name. |
- Please keep the field name consistent with that in the code block.
- Please check the meaning of this field because it seems inconsistent with the Chinese version, 监控表 ("monitoring table").
| * `METRIC_NAME`: The monitoring metrics name. | |
| * `METRICS_NAME`: The monitoring metrics name. |
As confirmed with @reafans, The monitoring metrics name is OK.
|
|
||
| > **Note:** | ||
| > | ||
| > Because summarizing all results causes overhead, the rules in `information_summary` are triggered passively, which means that the specified `rule` is displayed in the SQL predicate before the rule runs. For example, executing the `select * from inspection_summary` statement returns an empty result set. Executing `select * from inspection_summary where rule in ('read-link', 'ddl')` summarizes the read link and DDL-related monitoring metrics. |
| > Because summarizing all results causes overhead, the rules in `information_summary` are triggered passively, which means that the specified `rule` is displayed in the SQL predicate before the rule runs. For example, executing the `select * from inspection_summary` statement returns an empty result set. Executing `select * from inspection_summary where rule in ('read-link', 'ddl')` summarizes the read link and DDL-related monitoring metrics. | |
| > Because summarizing all results causes overhead, the rules in `inspection_summary` are triggered passively. That is, a specified `rule` runs only when it appears in the SQL predicate. For example, executing the `select * from inspection_summary` statement returns an empty result set. Executing `select * from inspection_summary where rule in ('read-link', 'ddl')` summarizes the read link and DDL-related monitoring metrics. |
|
|
||
| Usage example: | ||
|
|
||
| Both the diagnosis result table and the diagnosis monitoring summary table can specify the diagnosis time range using `hint`. `select /*+ time_range('2020-03-07 12:00:00','2020-03-07 13:00:00') */ * from inspection_summary` is the monitoring summary for the `2020-03-07 12:00:00` to `2020-03-07 13:00:00` period. Like the monitoring summary table, you can use the diagnosis result table to quickly find the monitoring items with large differences by comparing the data of two different periods. The following is an example: |
| Both the diagnosis result table and the diagnosis monitoring summary table can specify the diagnosis time range using `hint`. `select /*+ time_range('2020-03-07 12:00:00','2020-03-07 13:00:00') */ * from inspection_summary` is the monitoring summary for the `2020-03-07 12:00:00` to `2020-03-07 13:00:00` period. Like the monitoring summary table, you can use the diagnosis result table to quickly find the monitoring items with large differences by comparing the data of two different periods. The following is an example: | |
| Both the diagnosis result table and the diagnosis monitoring summary table can specify the diagnosis time range using `hint`. `select /*+ time_range('2020-03-07 12:00:00','2020-03-07 13:00:00') */ * from inspection_summary` is the monitoring summary for the `2020-03-07 12:00:00` to `2020-03-07 13:00:00` period. Like the monitoring summary table, you can use the diagnosis result table to quickly find the monitoring items with large differences by comparing the data of two different periods. |
|
|
||
| Both the diagnosis result table and the diagnosis monitoring summary table can specify the diagnosis time range using `hint`. `select /*+ time_range('2020-03-07 12:00:00','2020-03-07 13:00:00') */ * from inspection_summary` is the monitoring summary for the `2020-03-07 12:00:00` to `2020-03-07 13:00:00` period. Like the monitoring summary table, you can use the diagnosis result table to quickly find the monitoring items with large differences by comparing the data of two different periods. The following is an example: | ||
|
|
||
| You can also diagnose issues existing within a specified range, such as from "2020-01-16 16:00:54.933" to "2020-01-16 16:10:54.933": |
| You can also diagnose issues existing within a specified range, such as from "2020-01-16 16:00:54.933" to "2020-01-16 16:10:54.933": | |
| See the following example that diagnoses issues within a specified range, from "2020-01-16 16:00:54.933" to "2020-01-16 16:10:54.933": |
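One way to compare two periods is to query the summary table twice with different `time_range` hints and join the results. The sketch below is illustrative only: the `AVG_VALUE` and `LABEL` columns are assumed (only `RULE`, `INSTANCE`, and `METRICS_NAME` appear in this review), and the second window, the ten minutes after the one named above, is chosen arbitrarily for the comparison.

```sql
-- t1: the window from the sentence above; t2: the following ten minutes.
-- AVG_VALUE and LABEL are assumed column names; adjust to your schema.
select
  t1.avg_value / t2.avg_value as ratio,
  t1.metrics_name,
  t1.instance,
  t1.avg_value as first_window_avg,
  t2.avg_value as second_window_avg
from
  (
    select /*+ time_range("2020-01-16 16:00:54.933", "2020-01-16 16:10:54.933") */ *
    from information_schema.inspection_summary
    where rule = 'read-link'
  ) t1
  join
  (
    select /*+ time_range("2020-01-16 16:10:54.933", "2020-01-16 16:20:54.933") */ *
    from information_schema.inspection_summary
    where rule = 'read-link'
  ) t2
  on  t1.metrics_name = t2.metrics_name
  and t1.instance = t2.instance
  and t1.label = t2.label
-- Largest relative change first.
order by ratio desc;
```

The `rule = 'read-link'` predicate is required in both subqueries because summary rules run only when named in the predicate.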
|
@lilin90 All comments are addressed. PTAL again, thanks! |
lilin90
left a comment
LGTM
|
Your auto merge job has been accepted, waiting for:
|
Signed-off-by: sre-bot <sre-bot@pingcap.com>
|
cherry pick to release-4.0 in PR #2395 |
What is changed, added or deleted? (Required)
Add `inspection_result` and `inspection_summary` tables.
Which TiDB version(s) do your changes apply to? (Required)
If you select two or more versions from above, to trigger the bot to cherry-pick this PR to your desired release version branch(es), you must add corresponding labels such as needs-cherry-pick-4.0, needs-cherry-pick-3.1, needs-cherry-pick-3.0, and needs-cherry-pick-2.1.
What is the related PR or file link(s)?