diff --git a/dashboard/top-sql.md b/dashboard/top-sql.md
index 4d8d8cbe5eb8b..73c3b5e0b9185 100644
--- a/dashboard/top-sql.md
+++ b/dashboard/top-sql.md
@@ -1,37 +1,32 @@
 ---
 title: TiDB Dashboard Top SQL page
-summary: Use Top SQL to identify queries that consume the most CPU, network, and logical IO resources
+summary: TiDB Dashboard Top SQL allows real-time monitoring and visualization of CPU overhead for SQL statements in your database. It helps optimize performance by identifying high CPU load statements and provides detailed execution information. It's suitable for analyzing performance issues and can be accessed through TiDB Dashboard or a browser. The feature has a slight impact on cluster performance and is now generally available for production use.
 ---
 
 # TiDB Dashboard Top SQL Page
 
-On the Top SQL page of TiDB Dashboard, you can view and analyze the most resource-consuming SQL queries on a specified TiDB or TiKV node over a period of time.
-
-- After you enable Top SQL, this feature continuously collects CPU workload data from existing TiDB and TiKV nodes and retains the data for up to 30 days.
-- Starting from v8.5.6, you can also enable **TiKV Network IO collection (multi-dimensional)** in the Top SQL settings to further view metrics such as `Network Bytes` and `Logical IO Bytes` for specified TiKV nodes, and perform aggregation analysis in dimensions of `By Query`, `By Table`, `By DB`, and `By Region`.
+With Top SQL, you can monitor and visually explore the CPU overhead of each SQL statement in your database in real-time, which helps you optimize and resolve database performance issues. Top SQL continuously collects and stores CPU load data summarized by SQL statements at any seconds from all TiDB and TiKV instances. The collected data can be stored for up to 30 days. Top SQL presents you with visual charts and tables to quickly pinpoint which SQL statements are contributing the high CPU load of a TiDB or TiKV instance over a certain period of time.
 
 Top SQL provides the following features:
 
-* Visualize the top `5`, `20`, or `100` SQL queries with the most resource consumption in the current time range through charts and tables, with the remaining records automatically summarized as `Others`.
-* Display resource consumption hotspots sorted by CPU time or network bytes. When selecting a TiKV node, you can also sort by logical IO bytes.
-* Display SQL and execution plan details by query. When selecting a TiKV node, you can also aggregate analysis in dimensions of `By Table`, `By DB`, and `By Region`.
-* Zoom in on a selected time range in the chart, manually refresh data, enable auto refresh, and export table data to CSV.
+* Visualize the top 5 types of SQL statements with the highest CPU overhead through charts and tables.
+* Display detailed execution information such as queries per second, average latency, and query plan.
 * Collect all SQL statements that are executed, including those that are still running.
-* Display data of a specific TiDB or TiKV node.
+* Allow viewing data of a specific TiDB and TiKV instance.
 
 ## Recommended scenarios
 
 Top SQL is suitable for analyzing performance issues. The following are some typical Top SQL scenarios:
 
-* You discovered that an individual TiDB or TiKV node in the cluster has a very high CPU usage. You want to quickly locate which type of SQL is consuming a lot of CPU resources.
-* The overall cluster queries become slow. You want to find out which SQL is currently consuming the most resources, or compare the main query differences before and after the workload changes.
-* You need to locate hotspots from a higher dimension and want to aggregate and view resource consumption on the TiKV side by `Table`, `DB`, or `Region`.
-* You need to troubleshoot TiKV hotspots from the perspective of network traffic or logical IO, not just limited to the CPU dimension.
+* You discovered that an individual TiKV instance in the cluster has a very high CPU usage through the Grafana charts. You want to know which SQL statements cause the CPU hotspots so that you can optimize them and better leverage all of your distributed resources.
+* You discovered that the cluster has a very high CPU usage overall and queries are slow. You want to quickly figure out which SQL statements are currently consuming the most CPU resources so that you can optimize them.
+* The CPU usage of the cluster has drastically changed and you want to know the major cause.
+* Analyze the most resource-intensive SQL statements in the cluster and optimize them to reduce hardware costs.
 
 Top SQL cannot be used in the following scenarios:
 
 - Top SQL cannot be used to pinpoint non-performance issues, such as incorrect data or abnormal crashes.
-- Top SQL is not suitable for directly analyzing lock conflicts, transaction semantic errors, or other issues not caused by resource consumption.
+- Top SQL does not support analyzing database performance issues that are not caused by high CPU load, such as transaction lock conflicts.
 
 ## Access the page
 
@@ -39,9 +34,9 @@ You can access the Top SQL page using either of the following methods:
 
 * After logging in to TiDB Dashboard, click **Top SQL** in the left navigation menu.
 
-    ![Top SQL](/media/dashboard/v8.5-top-sql-access.png)
+    ![Top SQL](/media/dashboard/top-sql-access.png)
 
-* Visit in your browser. Replace `127.0.0.1:2379` with the actual PD node address and port.
+* Visit in your browser. Replace `127.0.0.1:2379` with the actual PD instance address and port.
 
 ## Enable Top SQL
 
@@ -52,10 +47,10 @@ You can access the Top SQL page using either of the following methods:
 Top SQL is not enabled by default as it has a slight impact on cluster performance (within 3% on average) when enabled. You can enable Top SQL by the following steps:
 
 1. Visit the [Top SQL page](#access-the-page).
-2. Click **Open Settings**. In the **Settings** area on the right side of the page, enable the **Enable Feature** switch.
+2. Click **Open Settings**. On the right side of the **Settings** area, switch on **Enable Feature**.
 3. Click **Save**.
 
-After enabling Top SQL, you can only view data collected starting from this point in time, while historical data before enabling will not be backfilled. Data display usually has a delay of about 1 minute, so you need to wait a moment to see new data. After disabling Top SQL, if historical data has not expired, the Top SQL page still displays this historical data, but new data will no longer be collected or displayed.
+After enabling the feature, wait up to 1 minute for Top SQL to load the data. Then you can see the CPU load details.
 
 In addition to the UI, you can also enable the Top SQL feature by setting the TiDB system variable [`tidb_enable_top_sql`](/system-variables.md#tidb_enable_top_sql-new-in-v540):
 
@@ -65,104 +60,57 @@ In addition to the UI, you can also enable the Top SQL feature by setting the Ti
 SET GLOBAL tidb_enable_top_sql = 1;
 ```
 
-### (Optional) Enable TiKV Network IO collection New in v8.5.6
-
-To view Top SQL by `Order By Network` or `Order By Logical IO` for TiKV nodes, or to use the `By Region` aggregation, you can enable the **Enable TiKV Network IO collection (multi-dimensional)** switch in Top SQL settings and save the changes.
-
-- **Order By Network**: Sorts by the number of network bytes generated during TiKV request processing.
-- **Order By Logical IO**: Sorts by the amount of logical data (in bytes) processed by TiKV at the storage layer for TiKV requests, such as the data scanned or processed during reads and the data written by write requests.
-
-As shown in the following screenshot, the right **Settings** panel displays both the **Enable Feature** and **Enable TiKV Network IO collection (multi-dimensional)** switches.
-
-![Enable TiKV Network IO collection](/media/dashboard/v8.5-top-sql-settings-enable-tikv-network-io.png)
-
-**Enabling TiKV Network IO collection (multi-dimensional)** increases storage and query overhead. After enabling, the configuration is delivered to all current TiKV nodes; data display might also have a delay of about 1 minute. If some TiKV nodes fail to enable this feature, the page shows a warning, and new data might be incomplete.
-
-For newly added TiKV nodes, this switch does not take effect automatically. You need to set the **Enable TiKV Network IO collection (multi-dimensional)** switch to all enabled in the Top SQL settings panel and save, so the configuration is delivered to all TiKV nodes again. If you want newly added TiKV nodes to automatically enable this feature, add the following configuration under `server_configs.tikv` in the TiUP cluster topology file and use TiUP to re-deliver the TiKV configuration:
-
-```yaml
-server_configs:
-  tikv:
-    resource-metering.enable-network-io-collection: true
-```
-
-For more information about TiUP topology configuration, see [TiUP cluster topology file configuration](/tiup/tiup-cluster-topology-reference.md).
-
 ## Use Top SQL
 
 The following are the common steps to use Top SQL.
 
 1. Visit the [Top SQL page](#access-the-page).
-2. Select a particular TiDB or TiKV node that you want to observe the workload.
-
-    ![Select a TiDB or TiKV node](/media/dashboard/v8.5-top-sql-usage-select-instance.png)
-
-    If you are not sure which node to observe, you can first locate the node with abnormal workload from Grafana or the [TiDB Dashboard Overview page](/dashboard/dashboard-overview.md), and then return to the Top SQL page for further analysis.
-
-3. Set the time range and refresh data as needed.
-
-    You can adjust the time range in the time picker or zoom the observation window by selecting a time range in the chart. Setting a smaller time range displays more fine-grained data, with a precision of up to 1 second.
+2. Select a particular TiDB or TiKV instance that you want to observe the load.
 
-    ![Change time range](/media/dashboard/v8.5-top-sql-usage-change-timerange.png)
+    ![Select Instance](/media/dashboard/top-sql-usage-select-instance.png)
 
-    If the chart is out of date, click **Refresh** to refresh once, or select the data auto-refresh frequency from the **Refresh** drop-down list.
+    If you are unsure of which TiDB or TiKV instance to observe, you can select an arbitrary instance. Also, when the cluster CPU load is extremely unbalanced, you can first use Grafana charts to determine the specific instance you want to observe.
 
-    ![Refresh](/media/dashboard/v8.5-top-sql-usage-refresh.png)
+3. Observe the charts and tables presented by Top SQL.
 
-4. Select the observation mode.
+    ![Chart and Table](/media/dashboard/top-sql-usage-chart.png)
 
-    - Use `Limit` to display the Top `5`, `20`, or `100` SQL queries.
-    - The default aggregation dimension is `By Query`. If you select a TiKV node, you can also aggregate in dimensions of `By Table`, `By DB`, or `By Region`.
+    The size of the bars in the bar chart represents the size of CPU resources consumed by the SQL statement at that moment. Different colors distinguish different types of SQL statements. In most cases, you only need to focus on the SQL statements that have a higher CPU resource overhead in the corresponding time range in the chart.
 
-    ![Select aggregation dimension](/media/dashboard/v8.5-top-sql-usage-select-agg-by.png)
+4. Click a SQL statement in the table to show more information. You can see detailed execution metrics of different plans of that statement, such as Call/sec (average queries per second) and Scan Indexes/sec (average number of index rows scanned per second).
 
-    - The default sort order is `Order By CPU` (sorted by CPU time). If you select a TiKV node and have [enabled TiKV Network IO collection (multi-dimensional)](#optional-enable-tikv-network-io-collection-new-in-v856), you can also select `Order By Network` (sorted by network bytes) or `Order By Logical IO` (sorted by logical IO bytes).
+    ![Details](/media/dashboard/top-sql-details.png)
 
-    ![Select order by](/media/dashboard/v8.5-top-sql-usage-select-order-by.png)
+5. Based on these initial clues, you can further explore the [SQL Statement](/dashboard/dashboard-statement-list.md) or [Slow Queries](/dashboard/dashboard-slow-query.md) page to find the root cause of high CPU consumption or large data scans of the SQL statement.
 
-    > **Note**
-    >
-    > `By Region`, `Order By Network`, and `Order By Logical IO` are only available when [TiKV Network IO collection (multi-dimensional)](#optional-enable-tikv-network-io-collection-new-in-v856) is enabled. If this feature is not enabled but historical data still exists, the page continues to display historical data and prompt that new data cannot be fully collected.
+    You can adjust the time range in the time picker or select a time range in the chart to get a more precise and detailed look at the problem. A smaller time range can provide more detailed data, with precision of up to 1 second.
 
-5. Observe the resource consumption hotspot records in the chart and table.
+    ![Change time range](/media/dashboard/top-sql-usage-change-timerange.png)
 
-    ![Chart and Table](/media/dashboard/v8.5-top-sql-usage-chart.png)
+    If the chart is out of date, you can click the **Refresh** button or select Auto Refresh options from the **Refresh** drop-down list.
 
-    The bar chart shows resource consumption under the current sort dimension, with different colors representing different records. The table displays cumulative values according to the current sort dimension, and provides an `Others` row at the end to summarize all non-Top N records.
+    ![Refresh](/media/dashboard/top-sql-usage-refresh.png)
 
-6. In the `By Query` view, click a row in the table to view the execution plan details for that type of SQL.
+6. View the CPU resource usage by table or database level to quickly identify resource usage at a higher level. Currently, only TiKV instances are supported.
 
-    ![Details](/media/dashboard/v8.5-top-sql-details.png)
+    Select a TiKV instance, and then select **By TABLE** or **By DB**:
 
-    In the SQL statement details, you can view the corresponding SQL template, Query template ID, Plan template ID, and execution plan text. The SQL statement details table displays different metrics depending on the node type:
+    ![Select aggregation dimension](/media/dashboard/top-sql-usage-select-agg-by.png)
 
-    - TiDB nodes usually show `Call/sec` and `Latency/call`.
-    - TiKV nodes usually show `Call/sec`, `Scan Rows/sec`, and `Scan Indexes/sec`.
+    View the aggregated results at a higher level:
 
-    > **Note**
-    >
-    > If you select the `By Table`, `By DB`, or `By Region` aggregation view, the page displays the aggregation results and does not show SQL statement details by SQL execution plan.
-
-    In the `By Query` view, you can also click **Search in SQL Statements** in the Top SQL table to jump to the corresponding SQL Statement Analysis page. If you need to analyze the current table results offline, you can click **Download to CSV** above the table to export the current table data.
-
-7. On TiKV nodes, if you need to locate hotspots from a higher dimension, you can switch to `By Table`, `By DB`, or `By Region` to view the aggregated results.
-
-    ![Aggregated results at DB level](/media/dashboard/v8.5-top-sql-usage-agg-by-db-detail.png)
-
-8. Based on these initial clues, you can further analyze the root cause using the [SQL Statement](/dashboard/dashboard-statement-list.md) or [Slow Queries](/dashboard/dashboard-slow-query.md) page.
+    ![Aggregated results at DB level](/media/dashboard/top-sql-usage-agg-by-db-detail.png)
 
 ## Disable Top SQL
 
 You can disable this feature by following these steps:
 
-1. Visit the [Top SQL page](#access-the-page).
-2. Click the gear icon in the upper right corner to open the settings pane and disable the **Enable Feature** switch.
+1. Visit [Top SQL page](#access-the-page).
+2. Click the gear icon in the upper right corner to open the settings screen and switch off **Enable Feature**.
 3. Click **Save**.
 4. In the popped-up dialog box, click **Disable**.
 
-After you disable Top SQL, new Top SQL data collection will stop, but historical data can still be viewed before it expires.
-
 In addition to the UI, you can also disable the Top SQL feature by setting the TiDB system variable [`tidb_enable_top_sql`](/system-variables.md#tidb_enable_top_sql-new-in-v540):
 
 {{< copyable "sql" >}}
@@ -171,15 +119,6 @@ In addition to the UI, you can also disable the Top SQL feature by setting the T
 SET GLOBAL tidb_enable_top_sql = 0;
 ```
 
-### Disable TiKV Network IO collection
-
-If you only want to stop collecting multi-dimensional data such as `Network Bytes` and `Logical IO Bytes` for TiKV, while retaining the CPU dimension analysis capability of Top SQL, disable the **Enable TiKV Network IO collection (multi-dimensional)** switch in the Top SQL settings panel.
-
-After disabling:
-
-- The Top SQL page can still display previously collected, unexpired historical network IO and logical IO data.
-- New network IO and logical IO data, as well as `By Region` data, will no longer be collected.
-
 ## Frequently asked questions
 
 **1. Top SQL cannot be enabled and the UI displays "required component NgMonitoring is not started"**.
@@ -188,37 +127,24 @@ See [TiDB Dashboard FAQ](/dashboard/dashboard-faq.md#a-required-component-ngmoni
 
 **2. Will performance be affected after enabling Top SQL?**
 
-Enabling Top SQL has a slight impact on cluster performance. According to measurements, the average performance impact is less than 3%. If you also enable TiKV Network IO collection (multi-dimensional), there will be additional storage and query overhead.
+This feature has a slight impact on cluster performance. According to our benchmark, the average performance impact is usually less than 3% when the feature is enabled.
 
 **3. What is the status of this feature?**
 
 It is now a generally available (GA) feature and can be used in production environments.
 
-**4. What does `Others` mean in the UI?**
+**4. What is the meaning of "Other Statements"?**
 
-`Others` represents the summary result of all non-Top N records under the current sort dimension. You can use it to understand how much of the total workload comes from the Top N records.
+"Other Statement" counts the total CPU overhead of all non-Top 5 statements. With this information, you can learn the CPU overhead contributed by the Top 5 statements compared with the overall.
 
 **5. What is the relationship between the CPU overhead displayed by Top SQL and the actual CPU usage of the process?**
 
 Their correlation is strong but they are not exactly the same thing. For example, the cost of writing multiple replicas is not counted in the TiKV CPU overhead displayed by Top SQL. In general, SQL statements with higher CPU usage result in higher CPU overhead displayed in Top SQL.
 
-**6. What does the Y-axis of the Top SQL chart mean?**
-
-The Y-axis of the Top SQL chart represents the resource consumption under the current sort dimension.
+**6. What is the meaning of the Y-axis of the Top SQL chart?**
 
-- When `Order By CPU` is selected, the Y-axis represents CPU time.
-- When `Order By Network` is selected, the Y-axis represents network bytes.
-- When `Order By Logical IO` is selected, the Y-axis represents logical IO bytes.
+It represents the size of CPU resources consumed. The more resources consumed by a SQL statement, the higher the value is. In most cases, you do not need to care about the meaning or unit of the specific value.
 
 **7. Does Top SQL collect running (unfinished) SQL statements?**
 
-Yes. After you enable Top SQL, TiDB Dashboard collects resource consumption for all running SQL statements, including unfinished ones.
-
-**8. Why is there no new data for `Order By Network`, `Order By Logical IO`, or `By Region`?**
-
-These views depend on TiKV Network IO collection (multi-dimensional). You can check the following items:
-
-- You have selected a TiKV node.
-- The **Enable TiKV Network IO collection (multi-dimensional)** switch in the Top SQL settings panel is enabled.
-- The relevant TiKV nodes in the cluster have all successfully enabled this configuration. If only some nodes enable this configuration, the Top SQL page prompts that new data might be incomplete.
-- For newly added TiKV nodes, you need to manually enable the **Enable TiKV Network IO collection (multi-dimensional)** switch in the Top SQL settings panel and save the changes again. To make this setting automatically enabled for newly added nodes, also enable `resource-metering.enable-network-io-collection` in the TiKV default configuration of TiUP.
+Yes. The bars displayed in the Top SQL chart at each moment indicate the CPU overhead of all running SQL statements at that moment.
diff --git a/releases/release-8.5.6.md b/releases/release-8.5.6.md
index a5d6540c3b19e..0190761e7b683 100644
--- a/releases/release-8.5.6.md
+++ b/releases/release-8.5.6.md
@@ -43,14 +43,6 @@ Quick access: [Quick start](https://docs.pingcap.com/tidb/v8.5/quick-start-with-
 
     For more information, see [documentation](https://docs.pingcap.com/tidb/v8.5/identify-slow-queries).
 
-- The Top SQL page in TiDB Dashboard now supports collecting and displaying TiKV network traffic and logical I/O metrics [#62916](https://github.com/pingcap/tidb/issues/62916) @[yibin87](https://github.com/yibin87)
-
-    In earlier versions, TiDB Dashboard identified Top SQL queries based only on CPU-related metrics, making it difficult to identify performance bottlenecks related to network or storage access in complex scenarios.
-
-    Starting from v8.5.6, you can enable **TiKV Network IO collection (multi-dimensional)** in the Top SQL settings to view metrics such as `Network Bytes` and `Logical IO Bytes` for TiKV nodes. You can also analyze these metrics across multiple dimensions, including `By Query`, `By Table`, `By DB`, and `By Region`, helping you identify resource hotspots more comprehensively.
-
-    For more information, see [documentation](https://docs.pingcap.com/tidb/v8.5/top-sql).
-
 ### SQL
 
 - Support column-level privilege management [#61706](https://github.com/pingcap/tidb/issues/61706) @[CbcWestwolf](https://github.com/CbcWestwolf) @[fzzf678](https://github.com/fzzf678)
@@ -117,7 +109,6 @@ For TiDB clusters newly deployed in v8.5.5 (that is, not upgraded from versions
 | TiKV | [`gc.auto-compaction.mvcc-read-aware-enabled`](https://docs.pingcap.com/tidb/v8.5/tikv-configuration-file#mvcc-read-aware-enabled-new-in-v856) | Newly added | Controls whether to enable MVCC-read-aware compaction. The default value is `false`. |
 | TiKV | [`gc.auto-compaction.mvcc-read-weight`](https://docs.pingcap.com/tidb/v8.5/tikv-configuration-file#mvcc-read-weight-new-in-v856) | Newly added | The weight multiplier applied to MVCC read activity when calculating the compaction priority score for a Region. The default value is `3.0`. |
 | TiKV | [`gc.auto-compaction.mvcc-scan-threshold`](https://docs.pingcap.com/tidb/v8.5/tikv-configuration-file#mvcc-scan-threshold-new-in-v856) | Newly added | The minimum number of MVCC versions scanned per read request to mark a Region as a compaction candidate. The default value is `1000`. |
-| TiKV | [`resource-metering.enable-network-io-collection`](https://docs.pingcap.com/tidb/v8.5/tikv-configuration-file#enable-network-io-collection-new-in-v856) | Newly added | Controls whether TiKV network traffic and logical I/O metrics are additionally collected in Top SQL. The default value is `false`. |
 | TiCDC | [`sink.csv.output-field-header`](https://docs.pingcap.com/tidb/v8.5/ticdc-csv#use-csv) | Newly added | Controls whether a header row is output in CSV files. The default value is `false`. This parameter applies only to the TiCDC new architecture. |
 
 ### System table changes
@@ -143,8 +134,6 @@ For TiDB clusters newly deployed in v8.5.5 (that is, not upgraded from versions
 
     - Introduce a load-based compaction mechanism, which detects MVCC read overhead and prioritizes compaction for Regions with higher read cost to improve query performance [#19133](https://github.com/tikv/tikv/issues/19133) @[mittalrishabh](https://github.com/mittalrishabh)
     - Optimize the stale range cleanup logic during cluster scale-out and scale-in operations by deleting stale keys directly instead of cleaning them up through SST file ingestion, thereby reducing the impact on online request latency [#18042](https://github.com/tikv/tikv/issues/18042) @[LykxSassinator](https://github.com/LykxSassinator)
-    - Support collecting TiKV network traffic and logical I/O metrics for Top SQL, which helps you diagnose SQL performance issues more accurately [#18815](https://github.com/tikv/tikv/issues/18815) @[yibin87](https://github.com/yibin87)
-
 + PD
 
     - Return `404` instead of `200` when deleting a non-existent label [#10089](https://github.com/tikv/pd/issues/10089) @[lhy1024](https://github.com/lhy1024)
diff --git a/tikv-configuration-file.md b/tikv-configuration-file.md
index eb7d63e659d13..0b32caf8af150 100644
--- a/tikv-configuration-file.md
+++ b/tikv-configuration-file.md
@@ -2661,24 +2661,6 @@ To reduce write latency, TiKV periodically fetches and caches a batch of timesta
 + In a default TSO physical time update interval (`50ms`), PD provides at most 262144 TSOs. When requested TSOs exceed this number, PD provides no more TSOs. This configuration item is used to avoid exhausting TSOs and the reverse impact of TSO exhaustion on other businesses. If you increase the value of this configuration item to improve high availability, you need to decrease the value of [`tso-update-physical-interval`](/pd-configuration-file.md#tso-update-physical-interval) at the same time to get enough TSOs.
 + Default value: `8192`
 
-## resource-metering
-
-Configuration items related to resource metering.
-
-### `enable-network-io-collection` New in v8.5.6
-
-+ Controls whether to collect TiKV network traffic and logical I/O information in [Top SQL](/dashboard/top-sql.md) in addition to CPU data.
-+ When enabled, TiKV additionally records inbound network bytes, outbound network bytes, logical read bytes, and logical write bytes during request processing.
-+ When reporting resource consumption, TiKV filters the Top N records based on CPU time, network traffic, and logical I/O, and additionally reports these statistics by Region for more fine-grained analysis of hotspot requests or resource usage sources.
-+ Default value: `false`
-
-> **Note:**
->
-> Logical I/O is not equivalent to physical I/O and cannot be directly correlated:
->
-> - Logical I/O refers to the logical amount of data processed by requests at the TiKV storage layer, such as data scanned or processed during reads and data written by write requests.
-> - Physical I/O refers to the actual disk read/write traffic on the underlying storage device, which is affected by block cache, compaction, flush, and other factors.
-
 ## resource-control
 
 Configuration items related to resource control of the TiKV storage layer.