36 changes: 17 additions & 19 deletions system-tables/system-table-cluster-log.md
@@ -40,35 +40,33 @@ Field description:

> **Note:**
>
> + All fields of the cluster log table are pushed down to the corresponding instance for execution. So to reduce the overhead of using the cluster log table, specify as many conditions as possible. For example, the `select * from cluter_log where instance='tikv-1'` statement only executes the log search on the `tikv-1` instance.
> + All fields of the cluster log table are pushed down to the corresponding instance for execution. To reduce the overhead of using the cluster log table, you must specify the keywords used for the search, the time range, and as many conditions as possible. For example, `select * from cluster_log where message like '%ddl%' and time > '2020-05-18 20:40:00' and time < '2020-05-18 21:40:00' and type='tidb'`.
>
> + The `message` field supports the `like` and `regexp` regular expressions, and the corresponding pattern is encoded as `regexp`. Specifying multiple `message` conditions is equivalent to the `pipeline` form of the `grep` command. For example, executing the `select * from cluster_log where message like 'coprocessor%' and message regexp '.*slow.*'` statement is equivalent to executing `grep 'coprocessor' xxx.log | grep -E '.*slow.*'` on all cluster instances.
> + The `message` field supports the `like` and `regexp` operators, and the corresponding pattern is rewritten as a regular expression. Specifying multiple `message` conditions is equivalent to the `pipeline` form of the `grep` command. For example, executing the `select * from cluster_log where message like 'coprocessor%' and message regexp '.*slow.*' and time > '2020-05-18 20:40:00' and time < '2020-05-18 21:40:00'` statement is equivalent to executing `grep 'coprocessor' xxx.log | grep -E '.*slow.*'` on all cluster instances.
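
For instance, a minimal sketch that combines several pushdown conditions is shown below; the TiKV instance address is a hypothetical value:

{{< copyable "sql" >}}

```sql
select * from information_schema.cluster_log
where type = 'tikv'
  and instance = '127.0.0.1:20160' -- hypothetical instance address
  and time > '2020-05-18 20:40:00'
  and time < '2020-05-18 21:40:00'
  and message like '%slow%';
```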

The following example shows how to query the execution process of a DDL statement using the `CLUSTER_LOG` table:

{{< copyable "sql" >}}

```sql
select * from information_schema.cluster_log where message like '%ddl%' and message like '%job%58%' and type='tidb' and time > '2020-03-27 15:39:00';
select time, instance, left(message,150) from information_schema.cluster_log where message like '%ddl%job%ID.80%' and type='tidb' and time > '2020-05-18 20:40:00' and time < '2020-05-18 21:40:00';
```

```sql
+-------------------------+------+------------------+-------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| TIME | TYPE | INSTANCE | LEVEL | MESSAGE |
+-------------------------+------+------------------+-------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 2020/03/27 15:39:36.140 | tidb | 172.16.5.40:4008 | INFO | [ddl_worker.go:253] ["[ddl] add DDL jobs"] ["batch count"=1] [jobs="ID:58, Type:create table, State:none, SchemaState:none, SchemaID:1, TableID:57, RowCount:0, ArgLen:1, start time: 2020-03-27 15:39:36.129 +0800 CST, Err:<nil>, ErrCount:0, SnapshotVersion:0; "] |
| 2020/03/27 15:39:36.140 | tidb | 172.16.5.40:4008 | INFO | [ddl.go:457] ["[ddl] start DDL job"] [job="ID:58, Type:create table, State:none, SchemaState:none, SchemaID:1, TableID:57, RowCount:0, ArgLen:1, start time: 2020-03-27 15:39:36.129 +0800 CST, Err:<nil>, ErrCount:0, SnapshotVersion:0"] [query="create table t3 (a int, b int,c int)"] |
| 2020/03/27 15:39:36.879 | tidb | 172.16.5.40:4009 | INFO | [ddl_worker.go:554] ["[ddl] run DDL job"] [worker="worker 1, tp general"] [job="ID:58, Type:create table, State:none, SchemaState:none, SchemaID:1, TableID:57, RowCount:0, ArgLen:0, start time: 2020-03-27 15:39:36.129 +0800 CST, Err:<nil>, ErrCount:0, SnapshotVersion:0"] |
| 2020/03/27 15:39:36.936 | tidb | 172.16.5.40:4009 | INFO | [ddl_worker.go:739] ["[ddl] wait latest schema version changed"] [worker="worker 1, tp general"] [ver=35] ["take time"=52.165811ms] [job="ID:58, Type:create table, State:done, SchemaState:public, SchemaID:1, TableID:57, RowCount:0, ArgLen:1, start time: 2020-03-27 15:39:36.129 +0800 CST, Err:<nil>, ErrCount:0, SnapshotVersion:0"] |
| 2020/03/27 15:39:36.938 | tidb | 172.16.5.40:4009 | INFO | [ddl_worker.go:359] ["[ddl] finish DDL job"] [worker="worker 1, tp general"] [job="ID:58, Type:create table, State:synced, SchemaState:public, SchemaID:1, TableID:57, RowCount:0, ArgLen:0, start time: 2020-03-27 15:39:36.129 +0800 CST, Err:<nil>, ErrCount:0, SnapshotVersion:0"] |
| 2020/03/27 15:39:36.140 | tidb | 172.16.5.40:4009 | INFO | [ddl_worker.go:253] ["[ddl] add DDL jobs"] ["batch count"=1] [jobs="ID:58, Type:create table, State:none, SchemaState:none, SchemaID:1, TableID:57, RowCount:0, ArgLen:1, start time: 2020-03-27 15:39:36.129 +0800 CST, Err:<nil>, ErrCount:0, SnapshotVersion:0; "] |
| 2020/03/27 15:39:36.140 | tidb | 172.16.5.40:4009 | INFO | [ddl.go:457] ["[ddl] start DDL job"] [job="ID:58, Type:create table, State:none, SchemaState:none, SchemaID:1, TableID:57, RowCount:0, ArgLen:1, start time: 2020-03-27 15:39:36.129 +0800 CST, Err:<nil>, ErrCount:0, SnapshotVersion:0"] [query="create table t3 (a int, b int,c int)"] |
| 2020/03/27 15:39:37.141 | tidb | 172.16.5.40:4008 | INFO | [ddl.go:489] ["[ddl] DDL job is finished"] [jobID=58] |
+-------------------------+------+------------------+-------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
+-------------------------+----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
| time | instance | left(message,150) |
+-------------------------+----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
| 2020/05/18 21:37:54.784 | 127.0.0.1:4002 | [ddl_worker.go:261] ["[ddl] add DDL jobs"] ["batch count"=1] [jobs="ID:80, Type:create table, State:none, SchemaState:none, SchemaID:1, TableID:79, Ro |
| 2020/05/18 21:37:54.784 | 127.0.0.1:4002 | [ddl.go:477] ["[ddl] start DDL job"] [job="ID:80, Type:create table, State:none, SchemaState:none, SchemaID:1, TableID:79, RowCount:0, ArgLen:1, start |
| 2020/05/18 21:37:55.327 | 127.0.0.1:4000 | [ddl_worker.go:568] ["[ddl] run DDL job"] [worker="worker 1, tp general"] [job="ID:80, Type:create table, State:none, SchemaState:none, SchemaID:1, Ta |
| 2020/05/18 21:37:55.381 | 127.0.0.1:4000 | [ddl_worker.go:763] ["[ddl] wait latest schema version changed"] [worker="worker 1, tp general"] [ver=70] ["take time"=50.809848ms] [job="ID:80, Type: |
| 2020/05/18 21:37:55.382 | 127.0.0.1:4000 | [ddl_worker.go:359] ["[ddl] finish DDL job"] [worker="worker 1, tp general"] [job="ID:80, Type:create table, State:synced, SchemaState:public, SchemaI |
| 2020/05/18 21:37:55.786 | 127.0.0.1:4002 | [ddl.go:509] ["[ddl] DDL job is finished"] [jobID=80] |
+-------------------------+----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
```

The above query results show the following process:
The query results above show the process of executing a DDL statement:

1. The request with a DDL JOB ID of `58` is sent to the `172.16.5.40: 4008` TiDB instance.
2. The `172.16.5.40: 4009` TiDB instance processes this DDL request, which indicates that the `172.16.5.40: 4009` instance is the DDL owner at that time.
3. The request with a DDL JOB ID of `58` has been processed.
1. The DDL job with ID `80` is sent to the `127.0.0.1:4002` TiDB instance.
2. The `127.0.0.1:4000` TiDB instance processes this DDL request, which indicates that `127.0.0.1:4000` is the DDL owner at that time.
3. The DDL job with ID `80` has been processed.
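
In addition to tracing DDL statements, you can filter logs by level. The following is a minimal sketch, assuming an illustrative time range and a lowercase `level` value:

{{< copyable "sql" >}}

```sql
select time, type, instance, left(message,150)
from information_schema.cluster_log
where level = 'error'
  and type = 'tikv'
  and time > '2020-05-18 20:40:00'
  and time < '2020-05-18 21:40:00';
```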
14 changes: 7 additions & 7 deletions system-tables/system-table-inspection-result.md
@@ -39,17 +39,17 @@ desc information_schema.inspection_result;
Field description:

* `RULE`: The name of the diagnosis rule. Currently, the following rules are available:
* `config`: The consistency check of configuration. If the same configuration is inconsistent on different instances, a `warning` diagnosis result is generated.
* `config`: Checks whether the configuration is consistent and proper. If the same configuration item has different values on different instances, a `warning` diagnosis result is generated.
* `version`: Checks version consistency. If the same component runs different versions on different instances, a `warning` diagnosis result is generated.
* `node-load`: If the current system load is too high, the corresponding `warning` diagnosis result is generated.
* `node-load`: Checks the server load. If the current system load is too high, the corresponding `warning` diagnosis result is generated.
* `critical-error`: Each module of the system defines critical errors. If the number of critical errors exceeds the threshold within the corresponding time period, a `warning` diagnosis result is generated.
* `threshold-check`: The diagnosis system checks the thresholds of a large number of metrics. If a threshold is exceeded, the corresponding diagnosis information is generated.
* `threshold-check`: The diagnosis system checks the thresholds of key metrics. If a threshold is exceeded, the corresponding diagnosis information is generated.
* `ITEM`: Each rule diagnoses different items. This field indicates the specific diagnosis items corresponding to each rule.
* `TYPE`: The instance type of the diagnosis. The optional values are `tidb`, `pd`, and `tikv`.
* `INSTANCE`: The specific address of the diagnosed instance.
* `STATUS_ADDRESS`: The HTTP API service address of the instance.
* `VALUE`: The value of a specific diagnosis item.
* `REFERENCE`: The reference value (threshold value) for this diagnosis item. If the difference between `VALUE` and the threshold is very large, the corresponding diagnosis information is generated.
* `REFERENCE`: The reference value (threshold value) for this diagnosis item. If `VALUE` exceeds the threshold, the corresponding diagnosis information is generated.
* `SEVERITY`: The severity level. The optional values are `warning` and `critical`.
* `DETAILS`: Diagnosis details, which might also contain SQL statement(s) or document links for further diagnosis.
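
As a hedged sketch of how these fields combine in a query, the following lists only `critical` diagnosis results; the selected columns are illustrative:

{{< copyable "sql" >}}

```sql
select rule, item, instance, value, reference, details
from information_schema.inspection_result
where severity = 'critical';
```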

@@ -160,7 +160,7 @@ select * from information_schema.inspection_result where rule='critical-error';

## Diagnosis rules

The diagnosis module contains a series of rules. These rules compare the results with the preset thresholds after querying the existing monitoring tables and cluster information tables. If the results exceed the thresholds or fall below the thresholds, the result of `warning` or `critical` is generated and the corresponding information is provided in the `details` column.
The diagnosis module contains a series of rules. These rules compare the results with the thresholds after querying the existing monitoring tables and cluster information tables. If the results exceed the thresholds, a `warning` or `critical` diagnosis result is generated and the corresponding information is provided in the `details` column.

You can query the existing diagnosis rules by querying the `inspection_rules` system table:

@@ -274,9 +274,9 @@ The `threshold-check` diagnosis rule checks whether the following metrics in the

| Component | Monitoring metric | Monitoring table | Expected value | Description |
| :---- | :---- | :---- | :---- | :---- |
| TiDB | tso-duration | pd_tso_wait_duration | < 50ms | The time it takes to get the transaction TSO timestamp. |
| TiDB | tso-duration | pd_tso_wait_duration | < 50ms | The wait duration of getting the transaction TSO. |
| TiDB | get-token-duration | tidb_get_token_duration | < 1ms | The time it takes to get the token. The related TiDB configuration item is [`token-limit`](/command-line-flags-for-tidb-configuration.md#token-limit). |
| TiDB | load-schema-duration | tidb_load_schema_duration | < 1s | The time it takes for TiDB to update and load the schema metadata.|
| TiDB | load-schema-duration | tidb_load_schema_duration | < 1s | The time it takes for TiDB to update the schema metadata.|
| TiKV | scheduler-cmd-duration | tikv_scheduler_command_duration | < 0.1s | The time it takes for TiKV to execute the KV `cmd` request. |
| TiKV | handle-snapshot-duration | tikv_handle_snapshot_duration | < 30s | The time it takes for TiKV to handle the snapshot. |
| TiKV | storage-write-duration | tikv_storage_async_request_duration | < 0.1s | The write latency of TiKV. |
11 changes: 7 additions & 4 deletions system-tables/system-table-inspection-summary.md
@@ -7,7 +7,7 @@ aliases: ['/docs/dev/reference/system-databases/inspection-summary/']

# INSPECTION_SUMMARY

In some scenarios, you might pay attention only to the monitoring summary of specific links or modules. For example, the number of threads for Coprocessor in the thread pool is configured as 8. If the CPU usage of Coprocessor reaches 750%, you can determine that a risk exists and Coprocessor might become a bottleneck in advance. However, some monitoring metrics vary greatly due to different user workloads, so it is difficult to define specific thresholds. It is important to troubleshoot issues in this scenario, so TiDB provides the `inspection_summary` table for link summary.
In some scenarios, you might need to pay attention only to the monitoring summary of specific links or modules. For example, if the number of Coprocessor threads in the thread pool is configured as 8 and the CPU usage of Coprocessor reaches 750%, you can determine in advance that a risk exists and that Coprocessor might become a bottleneck. However, some monitoring metrics vary greatly with different user workloads, so it is difficult to define specific thresholds. Troubleshooting is still important in this scenario, so TiDB provides the `inspection_summary` table to summarize monitoring data by link.

The structure of the `information_schema.inspection_summary` inspection summary table is as follows:

@@ -48,9 +48,12 @@ Field description:

Usage example:

Both the diagnosis result table and the diagnosis monitoring summary table can specify the diagnosis time range using `hint`. `select **+ time_range('2020-03-07 12:00:00','2020-03-07 13:00:00') */* from inspection_summary` is the monitoring summary for the `2020-03-07 12:00:00` to `2020-03-07 13:00:00` period. Like the monitoring summary table, you can use the diagnosis result table to quickly find the monitoring items with large differences by comparing the data of two different periods.
Both the diagnosis result table and the diagnosis monitoring summary table can specify the diagnosis time range using a hint. `select /*+ time_range('2020-03-07 12:00:00','2020-03-07 13:00:00') */ * from inspection_summary` returns the monitoring summary for the `2020-03-07 12:00:00` to `2020-03-07 13:00:00` period. Like the diagnosis result table, you can use the `inspection_summary` table to quickly find the monitoring items with large differences by comparing the data of two different periods.
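
A minimal sketch of the hint syntax follows; the time range is illustrative, and `read-link` is the rule used in the comparison example below:

{{< copyable "sql" >}}

```sql
select /*+ time_range('2020-03-07 12:00:00', '2020-03-07 13:00:00') */ *
from information_schema.inspection_summary
where rule = 'read-link';
```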

See the following example that diagnoses issues within a specified range, from "2020-01-16 16:00:54.933" to "2020-01-16 16:10:54.933":
The following example compares the monitoring metrics of the read link (`rule='read-link'`) in two time periods:

* `(2020-01-16 16:00:54.933, 2020-01-16 16:10:54.933)`
* `(2020-01-16 16:10:54.933, 2020-01-16 16:20:54.933)`

{{< copyable "sql" >}}

@@ -68,7 +71,7 @@ FROM
JOIN
(
SELECT
/*+ time_range("2020-01-16 16:10:54.933","2020-01-16 16:20:54.933")*/ *
/*+ time_range("2020-01-16 16:10:54.933", "2020-01-16 16:20:54.933")*/ *
FROM information_schema.inspection_summary WHERE rule='read-link'
) t2
ON t1.metrics_name = t2.metrics_name