From cf79a21dbbcdc5818f1218e4d4f42dc08e7ca692 Mon Sep 17 00:00:00 2001
From: Ran
Date: Mon, 15 Jun 2020 19:00:23 +0800
Subject: [PATCH 1/7] create tidb-best-practice

---
 TOC.md                 |   1 +
 tidb-best-practices.md | 129 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 130 insertions(+)
 create mode 100644 tidb-best-practices.md

diff --git a/TOC.md b/TOC.md
index 1df5259b08854..15224c387fac9 100644
--- a/TOC.md
+++ b/TOC.md
@@ -106,6 +106,7 @@
 + Tutorials
   + [Geo-Redundant Deployment](/geo-redundancy-deployment.md)
 + Best Practices
+  + [Use TiDB](/tidb-best-practices.md)
   + [Java Application Development](/best-practices/java-app-best-practices.md)
   + [Use HAProxy](/best-practices/haproxy-best-practices.md)
   + [Highly Concurrent Write](/best-practices/high-concurrency-best-practices.md)
diff --git a/tidb-best-practices.md b/tidb-best-practices.md
new file mode 100644
index 0000000000000..5c1e19ee9c257
--- /dev/null
+++ b/tidb-best-practices.md
@@ -0,0 +1,129 @@
---
title: TiDB Best Practice
summary:
category: reference
---

# TiDB Best Practice

This document summarizes the best practices of using TiDB, including the use of SQL and optimization tips for OLAP/OLTP scenarios, especially the optimization switches specific to TiDB.

Before you read this document, it is recommended that you read three blog posts that introduce the technical principles of TiDB:

* [TiDB Internal (I) - Data Storage](https://pingcap.com/blog/2017-07-11-tidbinternal1/)
* [TiDB Internal (II) - Computing](https://pingcap.com/blog/2017-07-11-tidbinternal2/)
* [TiDB Internal (III) - Scheduling](https://pingcap.com/blog/2017-07-20-tidbinternal3/)

## Foreword

The database is a generic component of infrastructure. When building a database, developers must have multiple target scenarios in mind, and in a specific scenario, users need to adjust the parameters or the way of use according to the application.

TiDB is a distributed database compatible with the MySQL protocol. However, due to TiDB's internal implementation, especially its support for distributed storage and distributed transactions, some usage is different from that of MySQL.

## Concept

TiDB's best practices are closely related to its implementation principles. It is recommended that you learn some of the basic mechanisms, including the Raft consensus algorithm, distributed transactions, data sharding, load balancing, the mapping scheme from SQL to KV, the implementation method of secondary indexes, and distributed execution engines.

This section is an introduction to these concepts. For detailed information, refer to [PingCAP blog posts](https://pingcap.com/blog/).

### Raft

Raft is a consensus algorithm that guarantees strongly consistent data replication. TiDB replicates data at the bottom layer by using Raft. Each write operation is written into a majority of replicas before TiDB reports the write as successful. In this way, even if a few replicas are lost, TiDB still has the latest data.

For example, when there are 3 replicas in the system, a write operation is considered successful only when the data is written into at least 2 replicas. Whenever 1 replica is lost, at least one of the two surviving replicas has the latest data.

Compared to the master-slave replication method, which also keeps three replicas, Raft is more efficient. The latency of a write operation depends on the two fastest replicas, not on the slowest one.
Therefore, Raft replication makes the geo-distributed, multi-active scenario possible. In a typical scenario of three data centers across two cities, each write operation only needs to succeed in the local data center and the closest one to ensure data consistency, without requiring successful writes in all three data centers.

However, this does not mean that you can build a cross-center deployment in every scenario. When writes are heavy, the bandwidth and latency between data centers become critical factors. If the write speed exceeds the bandwidth between the data centers, or if the latency between them is too high, the Raft replication mechanism does not work well.

### Distributed transactions

TiDB provides a fully distributed transaction model, which is optimized on top of [Google Percolator](https://research.google.com/pubs/pub36726.html). This document introduces the following features:

* Optimistic transaction model

    In TiDB, the optimistic transaction model performs conflict checks only when the transaction is committed. If conflicts exist, the transaction needs a retry. In high-contention scenarios, this model is inefficient, because all operations before the retry become invalid and must be performed again.

    Take an extreme case as an example: when the database is used as a counter, in a highly concurrent scenario, serious conflicts cause a large number of retries and even timeouts.

    If the conflicts are not serious, the optimistic transaction model is efficient. Otherwise, it is recommended that you use the pessimistic transaction model, or solve the issue in the system architecture, such as putting the counter in Redis.

* Pessimistic transaction model

    In TiDB, the pessimistic transaction model has almost the same behavior as in MySQL. The transaction applies a lock during the execution phase, which avoids retries in conflict situations and ensures a higher success ratio. With pessimistic locking, you can also lock data in advance using `select for update`.

    However, if the business scenario itself has few conflicts, the optimistic transaction model has better performance.

* Transaction size limit

    Distributed transactions must perform two-phase commit (2PC), and data is replicated at the bottom layer via the Raft consensus algorithm. Therefore, if a transaction is very large, the commit process becomes so slow that it blocks the Raft replication. To avoid blocking the system, the transaction size has the following limits:

    - A single transaction contains no more than 5,000 SQL statements (default).
    - A single KV entry is not larger than 6 MB.
    - The total size of KV entries is not larger than 10 GB.

    Similar limits can also be found in Google's [Cloud Spanner](https://cloud.google.com/spanner/quotas).

### Data sharding

TiKV automatically shards the bottom-layer data by key range. Each Region is a range of keys, a left-closed and right-open interval from `StartKey` to `EndKey`. When the number of Key-Value pairs in a Region exceeds a certain limit, the Region is automatically split into two.

### Load balancing

PD schedules the load of the cluster according to the status of the entire TiKV cluster. Scheduling is performed automatically in the unit of Region and follows the policies configured in PD as the scheduling logic.

### SQL on KV

TiDB automatically maps SQL to Key-Value. For details, see [TiDB Internal (II) - Computing](https://pingcap.com/blog/2017-07-11-tidbinternal2/).
Briefly speaking, TiDB performs the following operations:

* A row of data is mapped to a Key-Value pair. The key takes `TableID` as the prefix and the row ID as the suffix.
* An index is mapped to a Key-Value pair. The key takes `TableID+IndexID` as the prefix and the index value as the suffix.

The data and indexes of the same table share the same prefix, so these Key-Value pairs are located at adjacent places in the key space of TiKV. When heavy writes land on the same table, this leads to write hotspots. The situation is especially bad when the consecutively written data contains consecutive index values (for example, time-incremental fields like `update time`): hotspots occur on a few Regions and become the bottleneck of the whole system.

Similarly, if all read operations focus on a small range (such as several tens of thousands of consecutive rows), this leads to read hotspots.

### Secondary index

TiDB provides full support for global secondary indexes, and many queries can be optimized through indexes. Thus, it is important for applications to make good use of secondary indexes.

A large amount of experience with MySQL is applicable to TiDB, but note that TiDB also has some exclusive features. This section introduces some considerations when you use secondary indexes in TiDB.

* The more secondary indexes, the better?

    Secondary indexes can accelerate queries, but adding an index has side effects. The previous section introduces the storage model of indexes: when a new index is added, each insert of a row requires one more Key-Value pair. Therefore, the more indexes, the slower the write operation and the more space it occupies.

    In addition, too many indexes also affect the optimizer runtime, and inappropriate indexes might mislead the optimizer. Thus, it is not necessarily true that the more indexes, the better.

* Which columns are better to create indexes on?

    As mentioned above, indexes are important, but more is not always better, so you need to create the right indexes for your application. In principle, you need to create indexes on the columns used in queries to improve performance. The following situations are suitable for creating indexes:

    - Columns with a high degree of differentiation. By using such indexes, you can significantly reduce the number of filtered rows.
    - When there are multiple query conditions, you can select a combined index. Note that you need to put the columns used in equivalence conditions at the front of the combined index.

    For example, assume that a frequent query is `select * from t where c1 = 10 and c2 = 100 and c3 > 10`. You might create a combined index `Index cidx (c1, c2, c3)`, so the query conditions can be used to construct an index prefix for the scan.
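    As a minimal sketch of the example above (the table `t` and its columns are hypothetical), the index and the query would look like this:

    ```sql
    CREATE TABLE t (id BIGINT PRIMARY KEY, c1 INT, c2 INT, c3 INT);
    CREATE INDEX cidx ON t (c1, c2, c3);

    -- The equivalence conditions on c1 and c2 plus the range condition on c3
    -- form an index prefix, so the scan touches only the matching index range.
    SELECT * FROM t WHERE c1 = 10 AND c2 = 100 AND c3 > 10;
    ```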
* The difference between querying through an index and directly scanning the table

## Scenarios and practices

### Deploy

### Import data

### Write data

### Query

### Monitoring and log

### Documentation

## Best scenarios of TiDB
\ No newline at end of file

From 6f8ad3ef5588b59fc9defbf5aa4ff94ae1a78263 Mon Sep 17 00:00:00 2001
From: Ran
Date: Mon, 15 Jun 2020 21:38:17 +0800
Subject: [PATCH 2/7] align with en blog

---
 tidb-best-practices.md | 177 ++++++++++++++++++++++++++++++-----------
 1 file changed, 131 insertions(+), 46 deletions(-)

diff --git a/tidb-best-practices.md b/tidb-best-practices.md
index 5c1e19ee9c257..656e436105a18 100644
--- a/tidb-best-practices.md
+++ b/tidb-best-practices.md
@@ -16,114 +16,199 @@ Before you read this document, it is recommended that you read three blog posts

## Foreword

Database is a generic infrastructure system. It is important, for one thing, to consider various user scenarios during the development process and, for another, to adjust the parameters or the way of use according to actual situations in specific business scenarios.

TiDB is a distributed database compatible with the MySQL protocol and syntax. But because of its internal implementation, and its support for distributed storage and distributed transactions, some ways of using TiDB differ from MySQL.

## Basic Concepts

TiDB's best practices are closely related to its implementation principles. It is recommended that you learn some of the basic mechanisms, including the Raft consensus algorithm, distributed transactions, data sharding, load balancing, the mapping solution from SQL to KV, the implementation method of secondary indexing, and distributed execution engines.

This section is an introduction to these concepts. For detailed information, refer to [PingCAP blog posts](https://pingcap.com/blog/).

### Raft

Raft is a consensus algorithm that ensures data replication with strong consistency. At the bottom layer, TiDB uses Raft to synchronize data. TiDB writes data to the majority of the replicas before returning the result of success. In this way, the system still has the latest data even though a few replicas might get lost. For example, if there are three replicas, the system does not return the result of success until data has been written to two replicas.
Whenever a replica is lost, at least one of the remaining two replicas has the latest data.

To store three replicas, Raft is more efficient than Master-Slave replication. The write latency of Raft depends on the two fastest replicas, instead of the slowest one. Therefore, the implementation of geo-distributed, multiple active data centers becomes possible by using Raft replication. In the typical scenario of three data centers distributed across two sites, to guarantee data consistency, TiDB only needs to successfully write data into the local data center and the closer one, instead of writing to all three data centers. However, this does not mean that cross-data center deployment can be implemented in any scenario. When the amount of data to be written is large, the bandwidth and latency between data centers become the key factors. If the write speed exceeds the bandwidth or the latency is too high, the Raft synchronization mechanism still cannot work well.

### Distributed transactions

TiDB provides complete distributed transactions, and the model has some optimizations on the basis of [Google Percolator](https://research.google.com/pubs/pub36726.html). This document introduces the following features:

* Optimistic transaction model

    TiDB's transaction model uses optimistic locking and does not detect conflicts until the commit phase. If there are conflicts, the transaction needs a retry.
    But this model is inefficient if the conflicts are severe, because the operations before the retry are invalidated and need to be repeated. Assume that the database is used as a counter: high access concurrency might lead to severe conflicts, resulting in multiple retries or even timeouts. Therefore, in the scenario of severe conflicts, it is recommended to solve the problem at the system architecture level, such as placing the counter in Redis. Nonetheless, the optimistic lock model is efficient if the access conflicts are not very severe.

* Pessimistic transaction model

    In TiDB, the pessimistic transaction model has almost the same behavior as in MySQL. The transaction applies a lock during the execution phase, which avoids retries in conflict situations and ensures a higher success rate. With pessimistic locking, you can also lock data in advance using `select for update`.

    However, if the business scenario itself has few conflicts, the optimistic transaction model has better performance.

* Transaction size limit

    As distributed transactions need to conduct two-phase commit and the bottom layer performs Raft replication, if a transaction is very large, the commit process would be quite slow, and the Raft replication flow behind it is thus stuck. To avoid this problem, we limit the transaction size:

    - A transaction is limited to 5000 SQL statements (by default).
    - Each Key-Value entry is no more than 6 MB.
    - The total size of Key-Value entries is no more than 10 GB.

    Similar limits can also be found in [Google Cloud Spanner](https://cloud.google.com/spanner/quotas).

### Data sharding

TiKV automatically shards bottom-layer data according to key ranges. Each Region is a range of keys, a left-closed and right-open interval `[StartKey, EndKey)`. When the number of Key-Value pairs in a Region exceeds a certain value, the Region automatically splits into two.

### Load balancing

PD balances the load of the cluster according to the status of the entire TiKV cluster. The unit of scheduling is the Region, and the scheduling logic is the strategy configured in PD.
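As a quick way to observe data sharding and scheduling in practice, recent TiDB versions provide a `SHOW TABLE ... REGIONS` statement (illustrative; the exact output columns depend on the version):

```sql
-- List the Regions that currently hold the data of table t,
-- with their key ranges and the TiKV stores PD has placed them on.
SHOW TABLE t REGIONS;
```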
### SQL on KV

TiDB automatically maps the SQL structure to a Key-Value structure. For details, see [TiDB Internal (II) - Computing](https://pingcap.com/blog/2017-07-11-tidbinternal2/).

Simply put, TiDB performs the following operations:

* A row of data is mapped to a Key-Value pair. The key is prefixed with `TableID` and suffixed with the row ID.
* An index is mapped to a Key-Value pair. The key is prefixed with `TableID+IndexID` and suffixed with the index value.

The data and indexes of the same table have the same prefix, so these Key-Values are at adjacent positions in the key space of TiKV. Therefore, when the amount of data to be written is large and all of it is written to one table, a write hotspot is created. The situation gets worse when some index values of the continuously written data are also continuous (e.g. fields that increase with time, like `update time`), which creates a few write hotspots that become the bottleneck of the entire system.

Similarly, if all data is read from a focused small range (e.g. tens or hundreds of thousands of continuous rows of data), a read hotspot is likely to occur.
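Where the write pattern itself cannot be changed, one mitigation TiDB offers is to scatter implicit row IDs across several ranges so that consecutive inserts land on different Regions. A minimal sketch, assuming a hypothetical log table without a primary key (check your version's documentation for the exact table option):

```sql
-- Without a primary key, TiDB generates a hidden row ID for each row.
-- SHARD_ROW_ID_BITS = 4 scatters new row IDs across 2^4 = 16 ranges,
-- spreading consecutive inserts over multiple Regions.
CREATE TABLE access_log (
    ts  DATETIME,
    msg VARCHAR(255),
    KEY (ts)
) SHARD_ROW_ID_BITS = 4;
```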
### Secondary indexes

TiDB supports complete secondary indexes, which are also global indexes. Many queries can be optimized by indexes, so it is important for applications to make good use of secondary indexes.

Lots of MySQL experience is also applicable to TiDB, but note that TiDB has some unique features of its own. The following are a few notes on using secondary indexes in TiDB.

* The more secondary indexes, the better?

    Secondary indexes can speed up queries, but adding an index has side effects. The previous section introduces the storage model of indexes: for each additional index, there is one more Key-Value pair when inserting a piece of data. Therefore, the more indexes, the slower the writing speed and the more space they take up. In addition, too many indexes influence the runtime of the optimizer, and inappropriate indexes mislead the optimizer. Thus, more secondary indexes are not necessarily better.

* Which columns should indexes be created on?

    As mentioned above, indexes are important, but the number of indexes should be proper. Appropriate indexes need to be created according to the characteristics of the business. In principle, indexes are needed on the columns used in queries, the purpose being to improve performance. The following are conditions that call for creating indexes:

    - For columns with a high degree of differentiation, the number of filtered rows is remarkably reduced through an index.
    - If there are multiple query criteria, you can choose composite indexes. Note that the columns used in equivalence conditions should be put at the front of a composite index.

    For example, if a commonly used query is `select * from t where c1 = 10 and c2 = 100 and c3 > 10`, you can create a composite index `Index cidx (c1, c2, c3)`. In this way, the query conditions form an index prefix for the scan.

* The difference between querying through indexes and directly scanning the table

    TiDB has implemented global indexes, so the indexes and the data of a table are not necessarily on the same data shard. When querying through an index, TiDB first scans the index to get the corresponding row IDs and then uses the row IDs to get the data. Thus, this method involves two network requests and has a certain performance overhead.

    If the query involves lots of rows, the index scan proceeds concurrently, and getting the data of the table can start as soon as the first batch of results is returned: this is a parallel + pipeline model. Though the two accesses create overhead, the latency is not high.

    The following two conditions do not have the problem of two accesses:

    - Columns of the index already satisfy the query. Assume that the `c` column of the `t` table has an index and the query is `select c from t where c > 10;`. All needed data can be obtained by accessing the index alone. This condition is called a covering index.
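        A minimal sketch of a covering index (the index name `idx_c` is hypothetical):

        ```sql
        -- With an index on column c, this query is answered entirely from
        -- the index; no second lookup into the table data is needed.
        CREATE INDEX idx_c ON t (c);
        SELECT c FROM t WHERE c > 10;
        ```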
        If you focus more on query performance, you can also put columns that do not need to be filtered, but do need to be returned in the query result, into the index, creating a composite index. Take `select c1, c2 from t where c1 > 10;` as an example: you can optimize this query by creating the composite index `Index c12 (c1, c2)`.

    - The primary key of the table is an integer. In this case, TiDB uses the value of the primary key as the row ID. Thus, if the query condition is on the primary key, TiDB can directly construct the range of row IDs, scan the table data, and get the result.

* Query concurrency

    As data is distributed across many Regions, TiDB executes queries concurrently. But the default concurrency is not high, so as not to consume lots of system resources. Besides, an OLTP query usually does not involve a large amount of data, so low concurrency is enough. An OLAP query, however, often needs high concurrency, and TiDB lets you modify the query concurrency through system variables:

    - [`tidb_distsql_scan_concurrency`](/tidb-specific-system-variables.md#tidb_distsql_scan_concurrency):

        The concurrency of scanning data, including scanning the table and index data.

    - [`tidb_index_lookup_size`](/tidb-specific-system-variables.md#tidb_index_lookup_size):

        If the index needs to be accessed to get row IDs before accessing table data, TiDB uses a batch of row IDs as a single request to access the table data. This parameter sets the size of a batch. A larger batch increases latency, while a smaller one may lead to more queries. The proper size of this parameter is related to the amount of data the query involves. Generally, no modification is required.

    - [`tidb_index_lookup_concurrency`](/tidb-specific-system-variables.md#tidb_index_lookup_concurrency):

        If the index needs to be accessed to get row IDs before accessing table data, this parameter modifies the concurrency of getting data through row IDs.

* Ensure the order of results through indexes

    An index can be used not only to filter data but also to sort data: first get the row IDs in index order, then return the row content in the order the row IDs are returned. In this way, the results are ordered by the index columns. As mentioned above, the model of scanning the index and getting rows is parallel + pipeline; if rows must be returned in index order, high concurrency does not reduce latency. Thus, the default concurrency is low here, and it can be modified through the [`tidb_index_serial_scan_concurrency`](/tidb-specific-system-variables.md#tidb_index_serial_scan_concurrency) variable.

* Reverse index scan

    As in MySQL 5.7, all indexes in TiDB are in ascending order. TiDB supports the ability to read an ascending index in reverse order, at a performance overhead of about 23%. Earlier versions of TiDB had a higher performance penalty, and thus reverse index scans were not recommended.

## Scenarios and practices

In the last section, we discussed some basic implementation mechanisms of TiDB and their influence on usage. Let's start with specific usage scenarios and operation practices, going from deployment to business support.

### Deployment

Please read [Software and Hardware Requirements](/hardware-and-software-requirements.md) before deployment.

It is recommended to deploy the TiDB cluster using [TiUP](/production-deployment-using-tiup.md).
This tool can deploy, stop, destroy, and update the whole cluster, which is quite convenient.

### Import data

To improve the write performance during the import process, you can tune TiKV's parameters as stated in [Tune TiKV Performance](/tune-tikv-memory-performance.md).

### Write

As mentioned before, TiDB limits the size of a single transaction in the Key-Value layer. As for the SQL layer, a row of data is mapped to a Key-Value entry, and for each additional index there is one more Key-Value entry.

> **Note:**
>
> The size limit for transactions needs to take into account the overhead of TiDB encoding and the extra transaction keys. It is recommended that **the number of rows in each transaction be less than 200 and the data size of a single row be less than 100 KB**; otherwise, performance degrades.

It is recommended to split statements into batches or add a `limit` to the statements, whether they are `insert`, `update`, or `delete` statements.

When deleting a large amount of data, it is recommended to use `DELETE FROM t WHERE xx LIMIT 5000;`, deleting in a loop and using `Affected Rows == 0` as the condition to end the loop, so as not to exceed the transaction size limit.

If the amount of data to be deleted at a time is large, this loop method gets slower and slower, because each deletion traverses backward: after the previous data is deleted, lots of delete flags remain for a short period (before they are all garbage collected), which affects the following `Delete` statements. If possible, it is recommended to refine the `Where` condition. Assume that you need to delete all data on `2017-05-26`; you can do the following:

```sql
for i from 0 to 23:
    while affected_rows > 0:
        DELETE FROM t WHERE insert_time >= i:00:00 AND insert_time < (i+1):00:00 LIMIT 5000;
        affected_rows = select affected_rows()
```

This pseudocode splits a huge chunk of data into small ones and then deletes them, so that the earlier `Delete` statements do not affect the later ones.

### Query

For query requirements and specific statements, refer to [TiDB Specific System Variables](/tidb-specific-system-variables.md).

You can control the concurrency of SQL execution through the `SET` statement, and the selection of the `Join` operator through hints.

In addition, you can also use MySQL's standard index-selection hint syntax to make the optimizer select or skip an index through the `Use Index`/`Ignore Index` hints.

If the business scenario needs both OLTP and OLAP, you can send the TP requests and AP requests to different tidb-servers, diminishing the impact of the AP business on TP. It is recommended to use high-end machines (e.g. more processor cores, larger memory) for the tidb-servers that carry the AP business.

To completely isolate OLTP and OLAP workloads, it is recommended to run OLAP applications on TiFlash. TiFlash is a columnar storage engine with great performance in OLAP scenarios. TiFlash achieves physical isolation at the storage layer and guarantees consistent reads.

### Monitoring and log

Monitoring metrics are the best way to learn the status of the system. It is recommended that you deploy the monitoring system.

TiDB uses [Grafana + Prometheus](/tidb-monitoring-framework.md) to monitor the system state. The monitoring system is automatically deployed and configured if you use TiUP.

There are lots of items in the monitoring system, the majority of which are for TiDB developers.
You do not need to understand these items unless you want in-depth knowledge of the source code. Some items related to the business or to the state of key system components are picked out in a separate panel for users.

In addition to monitoring, you can also view the system logs. The three components of TiDB, tidb-server, tikv-server, and pd-server, each have a `--log-file` parameter. If this parameter is configured at startup, logs are stored in the configured file, and log files are automatically archived on a daily basis. If the `--log-file` parameter is not configured, logs are output to `stderr`.

Starting from TiDB 4.0, TiDB provides the TiDB Dashboard UI to improve usability. You can access TiDB Dashboard in your browser. TiDB Dashboard provides features such as viewing cluster status, performance analysis, traffic visualization, SQL diagnosis, and log searching.

### Documentation

The best way to learn about a system or solve problems with it is to read its documentation and understand its implementation principles.

TiDB has a large number of official documents in both Chinese and English. You can also search the issue list for a solution.

If you have met an issue, you can start with the FAQ and troubleshooting sections. If the issue is not documented, please file an issue.

For more information, see our website and our technical blog.

## Best Scenarios for TiDB

Simply put, TiDB fits the following scenarios:

- The amount of data is too large for a standalone database
- You do not want to use a sharding solution
- The access mode has no obvious hotspot
- Transactions, strong consistency, and disaster recovery are required
- Real-time HTAP is desired

From 3fd8ea06629b0e4d1359f840527201974b9c3b55 Mon Sep 17 00:00:00 2001
From: Ran
Date: Tue, 16 Jun 2020 15:36:56 +0800
Subject: [PATCH 3/7] refine wording

---
 tidb-best-practices.md | 128 +++++++++++++++++++++--------------------
 1 file changed, 66 insertions(+), 62 deletions(-)

diff --git a/tidb-best-practices.md b/tidb-best-practices.md
index 656e436105a18..0dba60ba4d594 100644
--- a/tidb-best-practices.md
+++ b/tidb-best-practices.md
@@ -1,36 +1,36 @@
---
title: TiDB Best Practices
summary: Learn the best practices of using TiDB.
category: reference
---

# TiDB Best Practices

This document summarizes the best practices of using TiDB, including the use of SQL and optimization tips for OLAP/OLTP scenarios, especially the optimization options specific to TiDB.

Before you read this document, it is recommended that you read three blog posts that introduce the technical principles of TiDB:

* [TiDB Internal (I) - Data Storage](https://pingcap.com/blog/2017-07-11-tidbinternal1/)
* [TiDB Internal (II) - Computing](https://pingcap.com/blog/2017-07-11-tidbinternal2/)
* [TiDB Internal (III) - Scheduling](https://pingcap.com/blog/2017-07-20-tidbinternal3/)

## Preface

A database is a generic infrastructure system.
It is important to consider various user scenarios during the development process and to adjust the parameters or the way of use according to actual situations in specific business scenarios.

TiDB is a distributed database compatible with the MySQL protocol and syntax. But because of its internal implementation, and its support for distributed storage and distributed transactions, some ways of using TiDB differ from MySQL.

## Basic Concepts

TiDB's best practices are closely related to its implementation principles. It is recommended that you learn some of the basic mechanisms, including the Raft consensus algorithm, distributed transactions, data sharding, load balancing, the mapping solution from SQL to KV, the implementation method of secondary indexing, and distributed execution engines.

This section is an introduction to these concepts. For detailed information, refer to [PingCAP blog posts](https://pingcap.com/blog/).

### Raft

Raft is a consensus algorithm that ensures data replication with strong consistency. At the bottom layer, TiDB uses Raft to replicate data. TiDB writes data to the majority of the replicas before returning the result of success. In this way, even though a few replicas might get lost, the system still has the latest data. For example, if there are three replicas, the system does not return the result of success until data has been written to two replicas. Whenever a replica is lost, at least one of the remaining two replicas has the latest data.

To store three replicas, Raft is more efficient than Master-Slave replication. The write latency of Raft depends on the two fastest replicas, instead of the slowest one. Therefore, the implementation of geo-distributed, multiple active data centers becomes possible by using Raft replication. In the typical scenario of three data centers distributed across two sites, to guarantee data consistency, TiDB only needs to successfully write data into the local data center and the closer one, instead of writing to all three data centers. However, this does not mean that cross-data center deployment can be implemented in any scenario.
When the amount of data to be written is large, the bandwidth and latency between data centers become the key factors. If the write speed exceeds the bandwidth or the latency is too high, the Raft replication mechanism still cannot work well.

### Distributed transactions

TiDB provides complete distributed transactions, and the model has some optimizations on the basis of [Google Percolator](https://research.google.com/pubs/pub36726.html). This document introduces the following features:

* Optimistic transaction model

    TiDB's optimistic transaction model does not detect conflicts until the commit phase. If there are conflicts, the transaction needs a retry. But this model is inefficient if the conflicts are severe, because the operations before the retry are invalidated and need to be repeated.

    Assume that the database is used as a counter. High access concurrency might lead to severe conflicts, resulting in multiple retries or even timeouts. Therefore, in the scenario of severe conflicts, it is recommended to use the pessimistic transaction mode or to solve the problem at the system architecture level, such as placing the counter in Redis. Nonetheless, the optimistic transaction model is efficient if the access conflicts are not very severe.

* Pessimistic transaction model

    In TiDB, the pessimistic transaction model has almost the same behavior as in MySQL. The transaction applies a lock during the execution phase, which avoids retries in conflict situations and ensures a higher success rate. With pessimistic locking, you can also lock data in advance using `select for update`.

    However, if the business scenario itself has few conflicts, the optimistic transaction model has better performance.
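    A minimal sketch of the pessimistic mode (`BEGIN PESSIMISTIC` is a TiDB extension; the table and values are hypothetical):

    ```sql
    -- Pessimistic mode: the SELECT ... FOR UPDATE locks the row up front,
    -- so the transaction does not need a retry on a write conflict.
    BEGIN PESSIMISTIC;
    SELECT balance FROM accounts WHERE id = 1 FOR UPDATE;
    UPDATE accounts SET balance = balance - 10 WHERE id = 1;
    COMMIT;
    ```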
* Transaction size limit

    As distributed transactions need to conduct two-phase commit and the bottom layer performs Raft replication, if a transaction is very large, the commit process would be quite slow, and the Raft replication process behind it is thus stuck. To avoid this problem, the transaction size is limited:

    - A transaction is limited to 5000 SQL statements (by default).
    - Each Key-Value entry is no more than 6 MB.
    - The total size of Key-Value entries is no more than 10 GB.

    Similar limits can also be found in [Google Cloud Spanner](https://cloud.google.com/spanner/quotas).

### Data sharding

TiKV automatically shards bottom-layer data according to key ranges. Each Region is a range of keys, a left-closed and right-open interval `[StartKey, EndKey)`. When the number of Key-Value pairs in a Region exceeds a certain value, the Region automatically splits into two.

### Load balancing

PD balances the load of the cluster according to the status of the entire TiKV cluster. The unit of scheduling is the Region, and the scheduling logic is the strategy configured in PD.

### SQL on KV

TiDB automatically maps the SQL structure to a Key-Value structure. For details, see [TiDB Internal (II) - Computing](https://pingcap.com/blog/2017-07-11-tidbinternal2/).

Simply put, TiDB performs the following operations:

* A row of data is mapped to a Key-Value pair. The key is prefixed with `TableID` and suffixed with the row ID.
* An index is mapped to a Key-Value pair. The key is prefixed with `TableID+IndexID` and suffixed with the index value.

The data and indexes of the same table have the same prefix, so these Key-Values are at adjacent positions in the key space of TiKV. Therefore, when the amount of data to be written is large and all of it is written to one table, a write hotspot is created. The situation gets worse when some index values of the continuously written data are also continuous (e.g. fields that increase with time, like `update time`), which creates a few write hotspots that become the bottleneck of the entire system.

Similarly, if all data is read from a focused small range (e.g. tens or hundreds of thousands of continuous rows of data), a read hotspot is likely to occur.

### Secondary index

TiDB supports complete secondary indexes, which are also global indexes. Many queries can be optimized by indexes, so it is important for applications to make good use of secondary indexes.

Lots of MySQL experience is also applicable to TiDB, but note that TiDB has some unique features of its own. The following are a few notes on using secondary indexes in TiDB.

* The more secondary indexes, the better?

    Secondary indexes can speed up queries, but adding an index has side effects.
    The previous section introduces the storage model of indexes: for each additional index, there is one more Key-Value pair when inserting a piece of data. Therefore, the more indexes, the slower the writing speed and the more space they take up.

    In addition, too many indexes affect the runtime of the optimizer, and inappropriate indexes mislead the optimizer. Thus, more secondary indexes are not necessarily better.

* Which columns should indexes be created on?

    As mentioned above, indexes are important, but the number of indexes should be proper. Appropriate indexes need to be created according to the characteristics of applications. In principle, indexes should be created on the columns used in queries to improve performance. The following are situations that call for creating indexes:

    - For columns with a high degree of differentiation, the number of filtered rows is remarkably reduced through an index.
    - If there are multiple query criteria, you can choose composite indexes. Note that the columns used in equivalence conditions should be put at the front of a composite index.

    For example, if a commonly used query is `select * from t where c1 = 10 and c2 = 100 and c3 > 10`, you can create a composite index `Index cidx (c1, c2, c3)`. In this way, the query conditions form an index prefix for the scan.

* The difference between querying through indexes and directly scanning the table

    TiDB has implemented global indexes, so the indexes and the data of a table are not necessarily on the same data shard. When querying through an index, TiDB first scans the index to get the corresponding row IDs and then uses the row IDs to get the data. Thus, this method involves two network requests and has a certain performance overhead.

    If the query involves lots of rows, the index scan proceeds concurrently.
    When the first batch of results is returned, getting the data of the table can then proceed, so this is a parallel + pipeline model. Though the two accesses create overhead, the latency is not high.

    The following two conditions do not have the problem of two accesses:

    - Columns of the index already satisfy the query. Assume that the `c` column of the `t` table has an index and the query is `select c from t where c > 10;`. All needed data can be obtained by accessing the index alone. This situation is called a covering index. If you focus more on query performance, you can also put columns that do not need to be filtered, but do need to be returned in the query result, into the index, creating a composite index. Take `select c1, c2 from t where c1 > 10;` as an example: you can optimize this query by creating the composite index `Index c12 (c1, c2)`.

    - The primary key of the table is an integer. In this case, TiDB uses the value of the primary key as the row ID. Thus, if the query condition is on the primary key, TiDB can directly construct the range of row IDs, scan the table data, and get the result.

* Query concurrency

    As data is distributed across many Regions, TiDB executes queries concurrently. But the default concurrency is not high, so as not to consume lots of system resources. Besides, an OLTP query usually does not involve a large amount of data, so low concurrency is enough. An OLAP query, however, often needs high concurrency, and TiDB lets you modify the query concurrency through the following system variables:

    - [`tidb_distsql_scan_concurrency`](/tidb-specific-system-variables.md#tidb_distsql_scan_concurrency):

        The concurrency of scanning data, including scanning the table and index data.
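        For example, for a large OLAP query, you might raise it for the current session (the value `30` is only illustrative):

        ```sql
        -- Raise the scan concurrency for this session only;
        -- other sessions keep the default.
        SET SESSION tidb_distsql_scan_concurrency = 30;
        ```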
- [`tidb_index_lookup_size`](/tidb-specific-system-variables.md#tidb_index_lookup_size): - If it needs to access the index to get row IDs before accessing Table data, it uses a batch of row IDs as a single request to access Table data. This parameter can set the size of Batch. The larger Batch increases latency while the smaller one may lead to more queries. The proper size of this parameter is related to the amount of data that the query involves. Generally, no modification is required. + If it needs to access the index to get row IDs before accessing the table data, it uses a batch of row IDs as a single request to access the table data. This parameter can set the size of a batch. The larger batch increases latency, while the smaller one might lead to more queries. The proper size of this parameter is related to the amount of data that the query involves. Generally, no modification is required. - [`tidb_index_lookup_concurrency`](/tidb-specific-system-variables.md#tidb_index_lookup_concurrency): - If it needs to access the index to get row IDs before accessing Table data, the concurrency of getting data through row IDs every time is modified through this parameter. + If it needs to access the index to get row IDs before accessing the table data, the concurrency of getting data through row IDs every time is modified through this parameter. * Ensure the order of results through index - Index cannot only be used to filter data, but also to sort data. Firstly, get row IDs according to the index order. Then return the row content according to the return order of row IDs. In this way, the return results are ordered according to the index column. I've mentioned that the model of scanning index and getting Row is parallel + Pipeline. If Row is returned according to the index order, a high concurrency between two queries will not reduce latency. Thus, the concurrency is low by default, but it can be modified through the [`tidb_index_serial_scan_concurrency`](/tidb-specific-system-variables.md#tidb_index_serial_scan_concurrency) variable. + Index cannot only be used to filter data, but also to sort data. Firstly, get row IDs according to the index order. Then return the row content according to the return order of row IDs. In this way, the return results are ordered according to the index column. It has been mentioned earlier that the model of scanning index and getting row is parallel + pipeline. If the row is returned according to the index order, a high concurrency between two queries does not reduce latency. Thus, the concurrency is low by default, but it can be modified through the [`tidb_index_serial_scan_concurrency`](/tidb-specific-system-variables.md#tidb_index_serial_scan_concurrency) variable. * Reverse index scan - As in MySQL 5.7, all indexes in TiDB are in ascending order. TiDB supports the ability to read an ascending index in reverse order, at a performance overhead of about 23%. Earlier versions of TiDB had a higher performance penalty, and thus reverse index scans were not recommended. + TiDB supports scanning an ascending index in reverse order, at a speed slower than normal scan by 20%. If the data is changed frequently and thus too many versions exist, the performance overhead might be higher. It is recommended to avoid reverse index scan as much as possible. ## Scenarios and practices -In the last section, we discussed some basic implementation mechanisms of TiDB and their influence on usage. Let's start from specific usage scenarios and operation practices. 
We'll go through from deployment to business supporting.
+In the last section, we discussed some basic implementation mechanisms of TiDB and their influence on usage. This section introduces specific usage scenarios and operation practices, from deployment to application usage.
 
 ### Deployment
 
-Please read [Software and Hardware Requirements](/hardware-and-software-requirements.md) before deployment.
+Before deployment, read [Software and Hardware Requirements](/hardware-and-software-requirements.md).
 
-It is recommended to deploy the TiDB cluster using [TiUP](/production-deployment-using-tiup.md). This tool can deploy, stop, destroy, and update the whole cluster, which is quite convenient.
+It is recommended to deploy the TiDB cluster using [TiUP](/production-deployment-using-tiup.md). This tool can deploy, stop, destroy, and upgrade the whole cluster, which is quite convenient. It is not recommended to manually deploy the TiDB cluster, which might be troublesome to maintain and upgrade later.
 
-### Import data
+### Data Import
 
-In order to improve the write performance during the import process, you can tune TiKV's parameters as stated in [Tune TiKV Performance](tune-tikv-memory-performance.md).
+To improve the write performance during the import process, you can tune TiKV's parameters as stated in [Tune TiKV Performance](/tune-tikv-memory-performance.md).
 
 ### Write
 
-As mentioned before, TiDB limits the size of a single transaction in the Key-Value layer. As for the SQL layer, a row of data is mapped to a Key-Value entry. For each additional index, there will be one more Key-Value entries.
+As mentioned before, TiDB limits the size of a single transaction in the Key-Value layer. As for the SQL layer, a row of data is mapped to a Key-Value entry. For each additional index, one more Key-Value entry is added.
 
 > **Note:**
 >
 > The size limit for transactions needs to consider the overhead of TiDB encoding and the extra transaction Key. It is recommended that **the number of rows of each transaction is less than 200 and the data size of a single row is less than 100 KB**; otherwise, the performance is bad.
 
-It is recommended to split statements to batches or add `limit` to the statements whether it is an `insert`, `update` or `delete` statement.
+It is recommended to split statements into batches or add `limit` to the statements, whether they are `insert`, `update` or `delete` statements.
 
-When deleting a large amount of data, it is recommended to use `Delete * from t where xx limit 5000;`. It deletes through the loop and use `Affected Rows == 0` as a condition to end the loop, so as not to exceed the limit of transaction size.
+When deleting a large amount of data, it is recommended to use `DELETE FROM t WHERE xx LIMIT 5000;`. Run the statement in a loop and use `Affected Rows == 0` as a condition to end the loop.
 
-If the amount of data that needs to be deleted at a time is large, this loop method will get slower and slower because each deletion traverses backward. After deleting the previous data, lots of deleted flags will remain in a short period (then all will be Garbage Collected) and affect the following Delete statement. If possible, it is recommended to refine the `Where` condition. Assume that you need to delete all data on `2017-05-26`, you can:
+If the amount of data that needs to be deleted at a time is large, this loop method gets slower and slower because each deletion traverses backward.
After deleting the previous data, lots of deleted flags remain for a short period (then all is garbage collected) and affect the following `Delete` statement. If possible, it is recommended to refine the `Where` condition. Assume that you need to delete all data on `2017-05-26`, you can use the following statements:
 
 ```sql
 for i from 0 to 23:
@@ -167,7 +171,7 @@ for i from 0 to 23:
     affected_rows = select affected_rows()
 ```
 
-This pseudocode means to split huge chunks of data into small ones and then delete, so that the following Delete statement will not be influenced.
+This pseudocode means to split huge chunks of data into small ones and then delete, so that the earlier `Delete` statements do not affect the later ones.
 
 ### Query
 
@@ -175,40 +179,40 @@ For query requirements and specific statements, refer to [TiDB Specific System V
 
 You can control the concurrency of SQL execution through the `SET` statement and the selection of the `Join` operator through Hint.
 
-In addition, you can also use MySQL's standard index selection, the `Hint` syntax, to control the optimizer to select index through `Use Index/Ignore Index` hint.
+In addition, you can also use MySQL's standard index selection, the `Hint` syntax, or control the optimizer to select index through `Use Index/Ignore Index hint`.
 
-If the business scenario needs both OLTP and OLAP, you can send the TP request and AP request to different tidb-servers, diminishing the impact of AP business on TP. It is recommended to use high-end machines (e.g. more processor cores, larger memory, etc.) for the tidb-server that carries AP business.
+If the business scenario has both OLTP and OLAP workloads, you can send the OLTP request and OLAP request to different TiDB servers, diminishing the impact of OLAP on OLTP. It is recommended to use machines with high configurations (e.g. more processor cores, larger memory) for the TiDB server that carries OLAP business.
 
-To completely isolate OLTP and OLAP workloads, it is recommended to run OLAP applications on TiFlash. TiFlash is a columnar storage engine with great performance on OLAP scenarios. TiFlash can achieve physical isolation on the storage layer and guarantees consistency read.
+To completely isolate OLTP and OLAP workloads, it is recommended to run OLAP applications on TiFlash. TiFlash is a columnar storage engine with great performance on OLAP workloads. TiFlash can achieve physical isolation on the storage layer and guarantees consistency read.
 
 ### Monitoring and log
 
-The monitoring metrics is the best method to learn the status of the system. It is recommended that you deploy the monitoring system.
+The monitoring metrics are the best method to learn the status of the system. It is recommended that you deploy the monitoring system along with your TiDB cluster.
 
-TiDB uses [Grafana + Prometheus](/tidb-monitoring-framework.md) to monitor the system state. The monitoring system is automatically deployed and configured if using TiUP.
+TiDB uses [Grafana + Prometheus](/tidb-monitoring-framework.md) to monitor the system status. The monitoring system is automatically deployed and configured if you deploy TiDB using TiUP.
 
-There are lots of items in the monitoring system, the majority of which are for TiDB developers. There is no need to understand these items but for an in-depth knowledge of the source code. We've picked out some items that are related to business or to the state of system key components in a separate panel for users.
+There are lots of items in the monitoring system, the majority of which are for TiDB developers. You do not need to understand these items unless you want an in-depth knowledge of the source code. Some items that are related to applications or to the state of system key components are selected and put in a separate `overview` panel for users.
 
-In addition to monitoring, you can also view the system logs. The three components of TiDB, tidb-server, tikv-server and pd-server, each has a `--log-file` parameter. If this parameter has been configured when initiating, logs will be stored in the file configured by the parameter and Log files are automatically archived on a daily basis. If the `--log-file` parameter has not been configured, log will be output to `stderr`.
+In addition to monitoring, you can also view the system logs. The three components of TiDB, tidb-server, tikv-server and pd-server, each has a `--log-file` parameter. If this parameter has been configured when the cluster is started, logs will be stored in the file configured by the parameter and log files are automatically archived on a daily basis. If the `--log-file` parameter has not been configured, log will be output to `stderr`.
 
-Starting from TiDB 4.0, TiDB provides TiDB Dashboard UI to improve the usability. You can access TiDB Dashboard by visiting `http://127.0.0.1:2379/dashboard` in your browser. TiDB Dashboard provides features such as viewing cluster status, performance analysis, traffic visualization, SQL diagnosis and log searching.
+Starting from TiDB 4.0, TiDB provides TiDB Dashboard UI to improve usability. You can access TiDB Dashboard by visiting `http://127.0.0.1:2379/dashboard` in your browser. TiDB Dashboard provides features such as viewing cluster status, performance analysis, traffic visualization, SQL diagnosis and log searching.
 
 ### Documentation
 
 The best way to learn about a system or solve the problem is to read its documentation and understand its implementation principles.
 
-TiDB has a large number of official documents both in Chinese and English. You can also search the issue list for a solution.
-If you have met an issue, you can start from the FAQ and Troubleshooting sections. If the issue is not documented, please file an issue.
+TiDB has a large number of official documents both in Chinese and English. If you run into an issue, you can start from the [FAQ](/faq/tidb-faq.md) and [TiDB Cluster Troubleshooting Guide](/troubleshoot-tidb-cluster.md). You can also search the issue list or create an issue in the [TiDB repository on GitHub](https://github.com/pingcap/tidb).
 
-For more information, see our website and our Technical Blog.
+TiDB also has many useful ecosystem tools. See [Ecosystem Tool Overview](/ecosystem-tool-user-guide.md) for details.
 
+For more articles on the technical details of TiDB, see the [PingCAP official blog site](https://pingcap.com/blog/).
 
 ## Best Scenarios for TiDB
 
-Simply put, TiDB can be used in the following scenarios:
+TiDB is suitable for the following scenarios:
 
-- The amount of data is too large for a standalone database
-Don't want to use the sharding solutions
+- The data volume is too large for a standalone database
+- You do not want to do sharding
 - The access mode has no obvious hotspot
-- Transactions, strong consistency, and disaster recovery
-- Hope to have real-time HTAP.
+- Transactions, strong consistency, and disaster recovery are required
+- You hope to have real-time HTAP and reduce storage link

From 423591b607129fda1df200134080823feb1fafcd Mon Sep 17 00:00:00 2001
From: Ran 
Date: Tue, 16 Jun 2020 15:58:22 +0800
Subject: [PATCH 4/7] fix ci

---
 tidb-best-practices.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tidb-best-practices.md b/tidb-best-practices.md
index 0dba60ba4d594..0eb94e4b2399c 100644
--- a/tidb-best-practices.md
+++ b/tidb-best-practices.md
@@ -148,7 +148,7 @@ It is recommended to deploy the TiDB cluster using [TiUP](/production-deployment
 
 ### Data Import
 
-To improve the write performance during the import process, you can tune TiKV's parameters as stated in [Tune TiKV Performance](/tune-tikv-memory-performance.md).
+To improve the write performance during the import process, you can tune TiKV's parameters as stated in [Tune TiKV Performance](/tune-tikv-performance.md).

From 626aba98654d2f80f8dcbd8fcfba0b3e66d39ed6 Mon Sep 17 00:00:00 2001
From: Ran 
Date: Wed, 24 Jun 2020 22:23:35 +0800
Subject: [PATCH 5/7] Apply suggestions from code review

Co-authored-by: Caitin <34535727+CaitinChen@users.noreply.github.com>
---
 tidb-best-practices.md | 70 +++++++++++++++++++++---------------------
 1 file changed, 35 insertions(+), 35 deletions(-)

diff --git a/tidb-best-practices.md b/tidb-best-practices.md
index 0eb94e4b2399c..651757c34cc8f 100644
--- a/tidb-best-practices.md
+++ b/tidb-best-practices.md
@@ -6,9 +6,9 @@ category: reference
 
 # TiDB Best Practices
 
-This document summarizes the best practices of using TiDB, including the use of SQL and optimization tips for OLAP/OLTP scenarios, especially the optimization options specific for TiDB.
+This document summarizes the best practices of using TiDB, including the use of SQL and optimization tips for Online Analytical Processing (OLAP) and Online Transactional Processing (OLTP) scenarios, especially the optimization options specific for TiDB.
 
-Before you read this document, it is recommended that you read three blog posts that introduces the technical principles of TiDB:
+Before you read this document, it is recommended that you read three blog posts that introduce the technical principles of TiDB:
 
 * [TiDB Internal (I) - Data Storage](https://pingcap.com/blog/2017-07-11-tidbinternal1/)
 * [TiDB Internal (II) - Computing](https://pingcap.com/blog/2017-07-11-tidbinternal2/)
 * [TiDB Internal (III) - Scheduling](https://pingcap.com/blog/2017-07-20-tidbinternal3/)
 
@@ -20,9 +20,9 @@ Database is a generic infrastructure system. It is important to consider various
 
 TiDB is a distributed database compatible with the MySQL protocol and syntax. But with the internal implementation and support of distributed storage and transactions, the way of using TiDB is different from MySQL.
 
-## Basic Concepts
+## Basic concepts
 
-The best practices are closely related to its implementation principles. It is recommended that you learn some of the basic mechanisms, including the Raft consensus algorithm, distributed transactions, data sharding, load balancing, the mapping solution from SQL to KV, the implementation method of secondary indexing, and distributed execution engines.
+The best practices are closely related to the implementation principles of TiDB. It is recommended that you learn some of the basic mechanisms, including the Raft consensus algorithm, distributed transactions, data sharding, load balancing, the mapping solution from SQL to Key-Value (KV), the implementation method of secondary indexing, and distributed execution engines.
 This section is an introduction to these concepts. For detailed information, ref
 
@@ -30,7 +30,7 @@
 
 Raft is a consensus algorithm that ensures data replication with strong consistency. At the bottom layer, TiDB uses Raft to replicate data. TiDB writes data to the majority of the replicas before returning the result of success. In this way, even though a few replicas might get lost, the system still has the latest data. For example, if there are three replicas, the system does not return the result of success until data has been written to two replicas. Whenever a replica is lost, at least one of the remaining two replicas has the latest data.
 
-To store three replicas, compared with the replication of Master-Slave, Raft is more efficient. The write latency of Raft depends on the two fastest replicas, instead of the slowest one. Therefore, the implementation of geo-distributed and multiple active data centers becomes possible by using the Raft replication. In the typical scenario of three data centers distributing in two sites, to guarantee the data consistency, TiDB just needs to successfully write data into the local data center and the closer one, instead of writing to all three data-centers. However, this does not mean that cross-data center deployment can be implemented in any scenario. When the amount of data to be written is large, the bandwidth and latency between data centers become the key factors. If the write speed exceeds the bandwidth or the latency is too high, the Raft replication mechanism still cannot work well.
+To store three replicas, Raft is more efficient than master-slave replication. The write latency of Raft depends on the two fastest replicas, instead of the slowest one. Therefore, the implementation of geo-distributed and multiple active data centers becomes possible by using the Raft replication. In the typical scenario of three data centers distributed across two sites, to guarantee the data consistency, TiDB just needs to successfully write data into the local data center and the closer one, instead of writing to all three data centers. However, this does not mean that cross-data center deployment can be implemented in any scenario. When the amount of data to be written is large, the bandwidth and latency between data centers become the key factors. If the write speed exceeds the bandwidth or the latency is too high, the Raft replication mechanism still cannot work well.
 
 ### Distributed transactions
 
@@ -44,19 +44,19 @@ TiDB provides complete distributed transactions and the model has some optimizat
 
 * Pessimistic transaction model
 
-   In TiDB, the pessimistic transaction model has almost the same behavior as in MySQL. The transaction applies a lock during the execution phase, which avoids retries in conflict situations and ensures a higher success rate. By applying the pessimistic locking, you can also lock data in advance using `select for update`.
+   In TiDB, the pessimistic transaction model has almost the same behavior as in MySQL. The transaction applies a lock during the execution phase, which avoids retries in conflict situations and ensures a higher success rate. By applying the pessimistic locking, you can also lock data in advance using `SELECT FOR UPDATE`.
 
-   However, if the business scenario itself has fewer conflicts, the optimistic transaction model has better performance.
+   However, if the application scenario has fewer conflicts, the optimistic transaction model has better performance.
 
 * Transaction size limit
 
-   As distributed transactions need to conduct two-phase commit and the bottom layer performs Raft replication, if a transaction is very large, the commit process would be quite slow, and the following Raft replication process is thus struck. To avoid this problem, the transaction size is limited:
+   As distributed transactions need to conduct two-phase commit and the bottom layer performs Raft replication, if a transaction is very large, the commit process would be quite slow, and the following Raft replication process is thus stuck. To avoid this problem, the transaction size is limited:
 
-   - A transaction is limited to 5000 SQL statements (by default)
+   - A transaction is limited to 5,000 SQL statements (by default)
    - Each Key-Value entry is no more than 6 MB
-   - The total size of Key-Value entry is no more than 10 GB.
+   - The total size of Key-Value entries is no more than 10 GB.
 
-   Similar limits can also be found in [Google Cloud Spanner](https://cloud.google.com/spanner/quotas).
+   You can find similar limits in [Google Cloud Spanner](https://cloud.google.com/spanner/quotas).
 
 ### Data sharding
 
 TiKV automatically shards bottom-layered data according to the Range of Key. Each Region is a range of Key, which is a left-closed and right-open interval, `[StartKey, EndKey)`. When the amount of Key-Value pairs in a Region exceeds a certain value, the Region automatically splits into two.
 
 ### Load balancing
 
-PD balances the load of the cluster according to the status of the entire TiKV cluster. The unit of scheduling is Region and the logic is the strategy configured by PD.
+Placement Driver (PD) balances the load of the cluster according to the status of the entire TiKV cluster. The unit of scheduling is the Region, and the scheduling logic follows the strategies configured in PD.
 
 ### SQL on KV
 
@@ -87,18 +87,18 @@ Lots of MySQL experience is also applicable to TiDB. It is noted that TiDB has i
 
 * The more secondary indexes, the better?
 
-   Secondary indexes can speed up query, but adding an index has side effects. The previous section introduces the storage model of index. For each additional index, there will be one more Key-Value when inserting a piece of data. Therefore, the more indexes, the slower the writing speed and the more space it takes up.
+   Secondary indexes can speed up queries, but adding an index has side effects. The previous section introduces the storage model of indexes. For each additional index, there will be one more Key-Value when inserting a piece of data. Therefore, the more indexes, the slower the writing speed and the more space it takes up.
 
-   In addition, too many indexes affects the runtime of the optimizer, and inappropriate index misleads the optimizer. Thus, the more secondary indexes is not necessarily the better.
+   In addition, too many indexes affect the runtime of the optimizer, and inappropriate indexes mislead the optimizer. Thus, having more secondary indexes does not mean better performance.
 
* Which columns should create indexes?
 
-   As is mentioned above, index is important but the number of indexes should be proper. Appropriate indexes needs to be created according to the characteristics of applications. In principle, indexes should be created for the columns needed in the query to improve the performance.
+   As is mentioned above, indexes are important, but the number of indexes should be proper. You must create appropriate indexes according to the application characteristics. In principle, you need to create an index on the columns involved in the query to improve the performance.
The following are situations that need to create indexes:
 
-   - For columns with a high degree of differentiation, the number of filtered rows is remarkably reduced through index.
+   - For columns with a high degree of differentiation, filtered rows are remarkably reduced through indexes.
-   - If there are multiple query criteria, you can choose composite indexes. Note to put the columns with the equivalent condition before composite index.
+   - If there are multiple query criteria, you can choose composite indexes. Note that you need to put the columns with equivalent conditions at the front of the composite index.
 
-   For example, if a commonly-used query is `select * from t where c1 = 10 and c2 = 100 and c3 > 10`, you can create a composite index `Index cidx (c1, c2, c3)`. In this way, you can use the query condition to create an index prefix and then scan.
+   For example, if a commonly used query is `select * from t where c1 = 10 and c2 = 100 and c3 > 10`, you can create a composite index `Index cidx (c1, c2, c3)`. In this way, you can use the query condition to create an index prefix and then scan.
 
 * The difference between querying through indexes and directly scanning the table
 
@@ -114,7 +114,7 @@
 
 * Query concurrency
 
-   As data is distributed across many Regions, TiDB makes query concurrently. But the concurrency by default is not high in case it consumes lots of system resources. Besides, the OLTP query usually does not involve a large amount of data and the low concurrency is enough. But for the OLAP query, the concurrency is high and TiDB modifies the query concurrency through the following system variables:
+   As data is distributed across many Regions, queries run in TiDB concurrently. But the default concurrency is not high, so that queries do not consume too many system resources. Besides, the OLTP query usually does not involve a large amount of data, so low concurrency is enough. But for the OLAP query, the concurrency is high, and you can modify the query concurrency through the following system variables:
 
 - [`tidb_distsql_scan_concurrency`](/tidb-specific-system-variables.md#tidb_distsql_scan_concurrency):
 
@@ -122,19 +122,19 @@
 
 - [`tidb_index_lookup_size`](/tidb-specific-system-variables.md#tidb_index_lookup_size):
 
-    If it needs to access the index to get row IDs before accessing the table data, it uses a batch of row IDs as a single request to access the table data. This parameter can set the size of a batch. The larger batch increases latency, while the smaller one might lead to more queries. The proper size of this parameter is related to the amount of data that the query involves. Generally, no modification is required.
+    If it needs to access the index to get row IDs before accessing the table data, it uses a batch of row IDs as a single request to access the table data. This parameter sets the size of a batch. The larger batch increases latency, while the smaller one might lead to more queries. The proper size of this parameter is related to the amount of data that the query involves. Generally, no modification is required.
 
 - [`tidb_index_lookup_concurrency`](/tidb-specific-system-variables.md#tidb_index_lookup_concurrency):
 
    If it needs to access the index to get row IDs before accessing the table data, the concurrency of getting data through row IDs every time is modified through this parameter.
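+
+    For example, for a session that mainly runs analytical queries, you can raise the read concurrency at the session level. The following is a minimal sketch; the variable values are only illustrative, and the table `t` is hypothetical:
+
+    ```sql
+    -- Raise the read concurrency for the current session only.
+    SET SESSION tidb_distsql_scan_concurrency = 30;
+    SET SESSION tidb_index_lookup_concurrency = 8;
+
+    -- Subsequent queries in this session scan data with the higher concurrency.
+    SELECT COUNT(*) FROM t WHERE c1 > 10;
+    ```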
-* Ensure the order of results through index
+* Ensure the order of results through indexes
 
-   Index cannot only be used to filter data, but also to sort data. Firstly, get row IDs according to the index order. Then return the row content according to the return order of row IDs. In this way, the return results are ordered according to the index column. It has been mentioned earlier that the model of scanning index and getting row is parallel + pipeline. If the row is returned according to the index order, a high concurrency between two queries does not reduce latency. Thus, the concurrency is low by default, but it can be modified through the [`tidb_index_serial_scan_concurrency`](/tidb-specific-system-variables.md#tidb_index_serial_scan_concurrency) variable.
+   You can use indexes to filter or sort data. Firstly, get row IDs according to the index order. Then, return the row content according to the return order of row IDs. In this way, the returned results are ordered according to the index column. It has been mentioned earlier that the model of scanning the index and getting the rows is parallel + pipeline. If the rows are returned according to the index order, a high concurrency between two queries does not reduce latency. Thus, the concurrency is low by default, but it can be modified through the [`tidb_index_serial_scan_concurrency`](/tidb-specific-system-variables.md#tidb_index_serial_scan_concurrency) variable.
 
 * Reverse index scan
 
-   TiDB supports scanning an ascending index in reverse order, at a speed slower than normal scan by 20%. If the data is changed frequently and thus too many versions exist, the performance overhead might be higher. It is recommended to avoid reverse index scan as much as possible.
+   TiDB supports scanning an ascending index in reverse order, at a speed about 20% slower than a normal scan. If the data is changed frequently and thus too many versions exist, the performance overhead might be higher. It is recommended to avoid reverse index scans as much as possible.
 
 ## Scenarios and practices
 
@@ -146,7 +146,7 @@ Before deployment, read [Software and Hardware Requirements](/hardware-and-softw
 
 It is recommended to deploy the TiDB cluster using [TiUP](/production-deployment-using-tiup.md). This tool can deploy, stop, destroy, and upgrade the whole cluster, which is quite convenient. It is not recommended to manually deploy the TiDB cluster, which might be troublesome to maintain and upgrade later.
 
-### Data Import
+### Data import
 
 To improve the write performance during the import process, you can tune TiKV's parameters as stated in [Tune TiKV Performance](/tune-tikv-performance.md).
 
 ### Write
 
 As mentioned before, TiDB limits the size of a single transaction in the Key-Value layer. As for the SQL layer, a row of data is mapped to a Key-Value entry. For each additional index, one more Key-Value entry is added.
 
 > **Note:**
 >
-> The size limit for transactions needs to consider the overhead of TiDB encoding and the extra transaction Key. It is recommended that **the number of rows of each transaction is less than 200 and the data size of a single row is less than 100 KB**; otherwise, the performance is bad.
+> When you set the size limit for transactions, you need to consider the overhead of TiDB encoding and the extra transaction Key. It is recommended that **the number of rows of each transaction is less than 200 and the data size of a single row is less than 100 KB**; otherwise, the performance is bad.
 
-It is recommended to split statements into batches or add `limit` to the statements, whether they are `insert`, `update` or `delete` statements.
+It is recommended to split statements into batches or add a limit to the statements, whether they are `INSERT`, `UPDATE` or `DELETE` statements.
 
 When deleting a large amount of data, it is recommended to use `DELETE FROM t WHERE xx LIMIT 5000;`. Run the statement in a loop and use `Affected Rows == 0` as a condition to end the loop.
 
-If the amount of data that needs to be deleted at a time is large, this loop method gets slower and slower because each deletion traverses backward. After deleting the previous data, lots of deleted flags remain for a short period (then all is garbage collected) and affect the following `Delete` statement. If possible, it is recommended to refine the `Where` condition. Assume that you need to delete all data on `2017-05-26`, you can use the following statements:
+If the amount of data that needs to be deleted at a time is large, this loop method gets slower and slower because each deletion traverses backward. After deleting the previous data, lots of deleted flags remain for a short period (then all is cleared by Garbage Collection) and affect the following `DELETE` statement. If possible, it is recommended to refine the `WHERE` condition. Assume that you need to delete all data on `2017-05-26`, you can use the following statements:
 
 ```sql
 for i from 0 to 23:
@@ -177,13 +177,13 @@ This pseudocode means to split huge chunks of data into small ones and then dele
 
 For query requirements and specific statements, refer to [TiDB Specific System V
 
-You can control the concurrency of SQL execution through the `SET` statement and the selection of the `Join` operator through Hint.
+You can control the concurrency of SQL execution through the `SET` statement and the selection of the `Join` operator through hints.
 
-In addition, you can also use MySQL's standard index selection, the hint syntax, or control the optimizer to select index through `Use Index/Ignore Index hint`.
+In addition, you can also use MySQL's standard index selection hint syntax, `USE INDEX`/`IGNORE INDEX`, to control how the optimizer selects indexes, as shown in the example at the end of this section.
 
-If the business scenario has both OLTP and OLAP workloads, you can send the OLTP request and OLAP request to different TiDB servers, diminishing the impact of OLAP on OLTP. It is recommended to use machines with high configurations (e.g. more processor cores, larger memory) for the TiDB server that carries OLAP business.
+If the application scenario has both OLTP and OLAP workloads, you can send the OLTP request and OLAP request to different TiDB servers, diminishing the impact of OLAP on OLTP. It is recommended to use machines with high-performance hardware (e.g. more processor cores, larger memory) for the TiDB server that processes OLAP workloads.
 
-To completely isolate OLTP and OLAP workloads, it is recommended to run OLAP applications on TiFlash. TiFlash is a columnar storage engine with great performance on OLAP workloads. TiFlash can achieve physical isolation on the storage layer and guarantees consistency read.
+To completely isolate OLTP and OLAP workloads, it is recommended to run OLAP applications on TiFlash. TiFlash is a columnar storage engine with great performance on OLAP workloads. TiFlash can achieve physical isolation on the storage layer and guarantees consistent reads.
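+
+The hints mentioned above look like the following minimal sketch. The tables `t1` and `t2`, the index `idx_c1`, and the chosen hints are hypothetical examples rather than recommendations:
+
+```sql
+-- Make the optimizer use (or ignore) a specific index, with MySQL-compatible syntax.
+SELECT * FROM t1 USE INDEX (idx_c1) WHERE c1 > 10;
+SELECT * FROM t1 IGNORE INDEX (idx_c1) WHERE c1 > 10;
+
+-- Select the join algorithm through an optimizer hint.
+SELECT /*+ HASH_JOIN(t1, t2) */ * FROM t1 JOIN t2 ON t1.id = t2.id;
+```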
 ### Monitoring and log
 
@@ -193,9 +193,9 @@ TiDB uses [Grafana + Prometheus](/tidb-monitoring-framework.md) to monitor the s
 
 There are lots of items in the monitoring system, the majority of which are for TiDB developers. You do not need to understand these items unless you want an in-depth knowledge of the source code. Some items that are related to applications or to the state of system key components are selected and put in a separate `overview` panel for users.
 
-In addition to monitoring, you can also view the system logs. The three components of TiDB, tidb-server, tikv-server and pd-server, each has a `--log-file` parameter. If this parameter has been configured when the cluster is started, logs will be stored in the file configured by the parameter and log files are automatically archived on a daily basis. If the `--log-file` parameter has not been configured, log will be output to `stderr`.
+In addition to monitoring, you can also view the system logs. The three components of TiDB, tidb-server, tikv-server, and pd-server, each has a `--log-file` parameter. If this parameter has been configured when the cluster is started, logs are stored in the file configured by the parameter and log files are automatically archived on a daily basis. If the `--log-file` parameter has not been configured, the log is output to `stderr`.
 
-Starting from TiDB 4.0, TiDB provides TiDB Dashboard UI to improve usability. You can access TiDB Dashboard by visiting `http://127.0.0.1:2379/dashboard` in your browser. TiDB Dashboard provides features such as viewing cluster status, performance analysis, traffic visualization, SQL diagnosis and log searching.
+Starting from TiDB 4.0, TiDB provides TiDB Dashboard UI to improve usability. You can access TiDB Dashboard by visiting `http://127.0.0.1:2379/dashboard` in your browser. TiDB Dashboard provides features such as viewing cluster status, performance analysis, traffic visualization, cluster diagnostics, and log searching.
 
 ### Documentation
 
 The best way to learn about a system or solve the problem is to read its documentation and understand its implementation principles.
 
 TiDB has a large number of official documents both in Chinese and English. If you run into an issue, you can start from the [FAQ](/faq/tidb-faq.md) and [TiDB Cluster Troubleshooting Guide](/troubleshoot-tidb-cluster.md). You can also search the issue list or create an issue in the [TiDB repository on GitHub](https://github.com/pingcap/tidb).
 
 TiDB also has many useful ecosystem tools. See [Ecosystem Tool Overview](/ecosystem-tool-user-guide.md) for details.
 
 For more articles on the technical details of TiDB, see the [PingCAP official blog site](https://pingcap.com/blog/).
 
-## Best Scenarios for TiDB
+## Best scenarios for TiDB
 
 TiDB is suitable for the following scenarios:
 
@@ -215,4 +215,4 @@
- The data volume is too large for a standalone database
- You do not want to do sharding
 - The access mode has no obvious hotspot
-- Transactions, strong consistency, and disaster recovery are required
-- You hope to have real-time HTAP and reduce storage link
+- Transactions, strong consistency, and disaster recovery are required
+- You hope to have real-time Hybrid Transaction/Analytical Processing (HTAP) analytics and reduce storage links

From 69eda00058e8604ed3bd4ed6ed2034a0f6d25436 Mon Sep 17 00:00:00 2001
From: Ran 
Date: Wed, 24 Jun 2020 22:40:21 +0800
Subject: [PATCH 6/7] replace Key with key

---
 tidb-best-practices.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/tidb-best-practices.md b/tidb-best-practices.md
index 651757c34cc8f..641fb2465004d 100644
--- a/tidb-best-practices.md
+++ b/tidb-best-practices.md
@@ -72,10 +72,10 @@ TiDB automatically maps the SQL structure into Key-Value structure. For details,
 
 Simply put, TiDB performs the following operations:
 
-* A row of data is mapped to a Key-Value pair. Key is prefixed with `TableID` and suffixed with the row ID.
+* A row of data is mapped to a Key-Value pair. The key is prefixed with `TableID` and suffixed with the row ID.
+* An index is mapped as a Key-Value pair. The key is prefixed with `TableID+IndexID` and suffixed with the index value.
 
-The data or indexes in the same table have the same prefix. These Key-Values are at adjacent positions in the Key space of TiKV. Therefore, when the amount of data to be written is large and all is written to one table, the write hotspot is created. The situation gets worse when some index values of the continuous written data is also continuous (e.g. fields that increase with time, like `update time`), which creates a few write hotspots and becomes the bottleneck of the entire system.
+The data or indexes in the same table have the same prefix. These Key-Values are at adjacent positions in the key space of TiKV. Therefore, when the amount of data to be written is large and all is written to one table, a write hotspot is created. The situation gets worse when some index values of the continuously written data are also continuous (e.g. fields that increase with time, like `update time`), which creates a few write hotspots that become the bottleneck of the entire system.
 
 Similarly, if all data is read from a focused small range (e.g. the continuous tens or hundreds of thousands of rows of data), an access hotspot of data is likely to occur.
 
@@ -110,7 +110,7 @@
 
   - Columns of the index have already met the query requirement. Assume that the `c` column on the `t` table has an index and the query is `select c from t where c > 10;`. At this time, all needed data can be obtained if you access the index. This situation is called `Covering Index`. But if you focus more on the query performance, you can put into index a portion of columns that do not need to be filtered but need to be returned in the query result, creating composite index. Take `select c1, c2 from t where c1 > 10;` as an example. You can optimize this query by creating composite index `Index c12 (c1, c2)`.
 
-   - The Primary Key of the table is integer. In this case, TiDB uses the value of Primary Key as row ID. Thus, if the query condition is on Primary Key, you can directly construct the range of the row ID, scan the table data, and get the result.
+   - The primary key of the table is integer. In this case, TiDB uses the value of the primary key as row ID. Thus, if the query condition is on Primary Key, you can directly construct the range of the row ID, scan the table data, and get the result.

From 3c4cce214e7ed75913ca8e8ae4bd11f3e4705f80 Mon Sep 17 00:00:00 2001
From: Ran 
Date: Sun, 28 Jun 2020 11:02:56 +0800
Subject: [PATCH 7/7] update key

---
 tidb-best-practices.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/tidb-best-practices.md b/tidb-best-practices.md
index 641fb2465004d..7bd3232e643ec 100644
--- a/tidb-best-practices.md
+++ b/tidb-best-practices.md
@@ -60,7 +60,7 @@
 
 ### Data sharding
 
-TiKV automatically shards bottom-layered data according to the Range of Key. Each Region is a range of Key, which is a left-closed and right-open interval, `[StartKey, EndKey)`. When the amount of Key-Value pairs in a Region exceeds a certain value, the Region automatically splits into two.
+TiKV automatically shards bottom-layered data according to the range of keys. Each Region is a range of keys, which is a left-closed and right-open interval, `[StartKey, EndKey)`.
When the amount of Key-Value pairs in a Region exceeds a certain value, the Region automatically splits into two.
 
 ### Load balancing
 
@@ -110,7 +110,7 @@
 
   - Columns of the index have already met the query requirement. Assume that the `c` column on the `t` table has an index and the query is `select c from t where c > 10;`. At this time, all needed data can be obtained if you access the index. This situation is called `Covering Index`. But if you focus more on the query performance, you can put into index a portion of columns that do not need to be filtered but need to be returned in the query result, creating composite index. Take `select c1, c2 from t where c1 > 10;` as an example. You can optimize this query by creating composite index `Index c12 (c1, c2)`.
 
-   - The primary key of the table is integer. In this case, TiDB uses the value of the primary key as row ID. Thus, if the query condition is on Primary Key, you can directly construct the range of the row ID, scan the table data, and get the result.
+   - The primary key of the table is an integer. In this case, TiDB uses the value of the primary key as row ID. Thus, if the query condition is on the primary key, you can directly construct the range of the row ID, scan the table data, and get the result.
 
@@ -156,7 +156,7 @@
 
 > **Note:**
 >
-> When you set the size limit for transactions, you need to consider the overhead of TiDB encoding and the extra transaction Key. It is recommended that **the number of rows of each transaction is less than 200 and the data size of a single row is less than 100 KB**; otherwise, the performance is bad.
+> When you set the size limit for transactions, you need to consider the overhead of TiDB encoding and the extra transaction key. It is recommended that **the number of rows of each transaction is less than 200 and the data size of a single row is less than 100 KB**; otherwise, the performance is bad.
 
 It is recommended to split statements into batches or add a limit to the statements, whether they are `INSERT`, `UPDATE` or `DELETE` statements.
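+
+The same batching pattern applies to updates. The following is a minimal sketch in which the table `t` and its `status` column are hypothetical. Because updated rows no longer match the `WHERE` condition, rerunning the statement until it reports `0` affected rows processes all matching data in small transactions:
+
+```sql
+-- Update at most 5000 rows per transaction.
+-- Rerun this statement until Affected Rows is 0.
+UPDATE t SET status = 'archived' WHERE status = 'expired' LIMIT 5000;
+```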