From 42cb262e895afb0589b956ce194c005918f38382 Mon Sep 17 00:00:00 2001 From: ireneontheway <48651140+ireneontheway@users.noreply.github.com> Date: Thu, 17 Sep 2020 13:10:05 +0800 Subject: [PATCH 1/5] cherry pick #3825 to release-3.1 Signed-off-by: ti-srebot --- TOC.md | 140 +++++++++++++++++ configure-placement-rules.md | 15 ++ pd-configuration-file.md | 22 +++ schedule-replicas-by-topology-labels.md | 194 ++++++++++++++++++++++++ tidb-scheduling.md | 141 +++++++++++++++++ 5 files changed, 512 insertions(+) create mode 100644 schedule-replicas-by-topology-labels.md create mode 100644 tidb-scheduling.md diff --git a/TOC.md b/TOC.md index 49009599d880a..f4a3c9c35ed81 100644 --- a/TOC.md +++ b/TOC.md @@ -148,6 +148,7 @@ - [`SET`](/data-type-string.md#set-type) - [JSON Type](/data-type-json.md) + Functions and Operators +<<<<<<< HEAD - [Function and Operator Reference](/functions-and-operators/functions-and-operators-overview.md) - [Type Conversion in Expression Evaluation](/functions-and-operators/type-conversion-in-expression-evaluation.md) - [Operators](/functions-and-operators/operators.md) @@ -386,6 +387,145 @@ - [TiKV Control](/tikv-control.md) - [TiDB Control](/tidb-control.md) - [TiDB in Kubernetes](https://pingcap.com/docs/tidb-in-kubernetes/stable/) +======= + + [Overview](/functions-and-operators/functions-and-operators-overview.md) + + [Type Conversion in Expression Evaluation](/functions-and-operators/type-conversion-in-expression-evaluation.md) + + [Operators](/functions-and-operators/operators.md) + + [Control Flow Functions](/functions-and-operators/control-flow-functions.md) + + [String Functions](/functions-and-operators/string-functions.md) + + [Numeric Functions and Operators](/functions-and-operators/numeric-functions-and-operators.md) + + [Date and Time Functions](/functions-and-operators/date-and-time-functions.md) + + [Bit Functions and Operators](/functions-and-operators/bit-functions-and-operators.md) + + [Cast Functions and Operators](/functions-and-operators/cast-functions-and-operators.md) + + [Encryption and Compression Functions](/functions-and-operators/encryption-and-compression-functions.md) + + [Information Functions](/functions-and-operators/information-functions.md) + + [JSON Functions](/functions-and-operators/json-functions.md) + + [Aggregate (GROUP BY) Functions](/functions-and-operators/aggregate-group-by-functions.md) + + [Window Functions](/functions-and-operators/window-functions.md) + + [Miscellaneous Functions](/functions-and-operators/miscellaneous-functions.md) + + [Precision Math](/functions-and-operators/precision-math.md) + + [List of Expressions for Pushdown](/functions-and-operators/expressions-pushed-down.md) + + [Constraints](/constraints.md) + + [Generated Columns](/generated-columns.md) + + [SQL Mode](/sql-mode.md) + + Transactions + + [Overview](/transaction-overview.md) + + [Isolation Levels](/transaction-isolation-levels.md) + + [Optimistic Transactions](/optimistic-transaction.md) + + [Pessimistic Transactions](/pessimistic-transaction.md) + + Garbage Collection (GC) + + [Overview](/garbage-collection-overview.md) + + [Configuration](/garbage-collection-configuration.md) + + [Views](/views.md) + + [Partitioning](/partitioned-table.md) + + [Character Set and Collation](/character-set-and-collation.md) + + System Tables + + [`mysql`](/mysql-schema.md) + + INFORMATION_SCHEMA + + [Overview](/information-schema/information-schema.md) + + [`ANALYZE_STATUS`](/information-schema/information-schema-analyze-status.md) + + 
[`CHARACTER_SETS`](/information-schema/information-schema-character-sets.md) + + [`CLUSTER_CONFIG`](/information-schema/information-schema-cluster-config.md) + + [`CLUSTER_HARDWARE`](/information-schema/information-schema-cluster-hardware.md) + + [`CLUSTER_INFO`](/information-schema/information-schema-cluster-info.md) + + [`CLUSTER_LOAD`](/information-schema/information-schema-cluster-load.md) + + [`CLUSTER_LOG`](/information-schema/information-schema-cluster-log.md) + + [`CLUSTER_SYSTEMINFO`](/information-schema/information-schema-cluster-systeminfo.md) + + [`COLLATIONS`](/information-schema/information-schema-collations.md) + + [`COLLATION_CHARACTER_SET_APPLICABILITY`](/information-schema/information-schema-collation-character-set-applicability.md) + + [`COLUMNS`](/information-schema/information-schema-columns.md) + + [`DDL_JOBS`](/information-schema/information-schema-ddl-jobs.md) + + [`ENGINES`](/information-schema/information-schema-engines.md) + + [`INSPECTION_RESULT`](/information-schema/information-schema-inspection-result.md) + + [`INSPECTION_RULES`](/information-schema/information-schema-inspection-rules.md) + + [`INSPECTION_SUMMARY`](/information-schema/information-schema-inspection-summary.md) + + [`KEY_COLUMN_USAGE`](/information-schema/information-schema-key-column-usage.md) + + [`METRICS_SUMMARY`](/information-schema/information-schema-metrics-summary.md) + + [`METRICS_TABLES`](/information-schema/information-schema-metrics-tables.md) + + [`PARTITIONS`](/information-schema/information-schema-partitions.md) + + [`PROCESSLIST`](/information-schema/information-schema-processlist.md) + + [`SCHEMATA`](/information-schema/information-schema-schemata.md) + + [`SEQUENCES`](/information-schema/information-schema-sequences.md) + + [`SESSION_VARIABLES`](/information-schema/information-schema-session-variables.md) + + [`SLOW_QUERY`](/information-schema/information-schema-slow-query.md) + + [`STATISTICS`](/information-schema/information-schema-statistics.md) + + [`TABLES`](/information-schema/information-schema-tables.md) + + [`TABLE_CONSTRAINTS`](/information-schema/information-schema-table-constraints.md) + + [`TABLE_STORAGE_STATS`](/information-schema/information-schema-table-storage-stats.md) + + [`TIDB_HOT_REGIONS`](/information-schema/information-schema-tidb-hot-regions.md) + + [`TIDB_INDEXES`](/information-schema/information-schema-tidb-indexes.md) + + [`TIDB_SERVERS_INFO`](/information-schema/information-schema-tidb-servers-info.md) + + [`TIFLASH_REPLICA`](/information-schema/information-schema-tiflash-replica.md) + + [`TIKV_REGION_PEERS`](/information-schema/information-schema-tikv-region-peers.md) + + [`TIKV_REGION_STATUS`](/information-schema/information-schema-tikv-region-status.md) + + [`TIKV_STORE_STATUS`](/information-schema/information-schema-tikv-store-status.md) + + [`USER_PRIVILEGES`](/information-schema/information-schema-user-privileges.md) + + [`VIEWS`](/information-schema/information-schema-views.md) + + [`METRICS_SCHEMA`](/metrics-schema.md) + + UI + + TiDB Dashboard + + [Overview](/dashboard/dashboard-intro.md) + + Maintain + + [Deploy](/dashboard/dashboard-ops-deploy.md) + + [Reverse Proxy](/dashboard/dashboard-ops-reverse-proxy.md) + + [Secure](/dashboard/dashboard-ops-security.md) + + [Access](/dashboard/dashboard-access.md) + + [Overview Page](/dashboard/dashboard-overview.md) + + [Cluster Info Page](/dashboard/dashboard-cluster-info.md) + + [Key Visualizer Page](/dashboard/dashboard-key-visualizer.md) + + SQL Statements Analysis + + [SQL Statements 
Page](/dashboard/dashboard-statement-list.md) + + [SQL Details Page](/dashboard/dashboard-statement-details.md) + + [Slow Queries Page](/dashboard/dashboard-slow-query.md) + + Cluster Diagnostics + + [Access Cluster Diagnostics Page](/dashboard/dashboard-diagnostics-access.md) + + [View Diagnostics Report](/dashboard/dashboard-diagnostics-report.md) + + [Use Diagnostics](/dashboard/dashboard-diagnostics-usage.md) + + [Search Logs Page](/dashboard/dashboard-log-search.md) + + [Profile Instances Page](/dashboard/dashboard-profiling.md) + + [FAQ](/dashboard/dashboard-faq.md) + + CLI + + [tikv-ctl](/tikv-control.md) + + [pd-ctl](/pd-control.md) + + [tidb-ctl](/tidb-control.md) + + [pd-recover](/pd-recover.md) + + Command Line Flags + + [tidb-server](/command-line-flags-for-tidb-configuration.md) + + [tikv-server](/command-line-flags-for-tikv-configuration.md) + + [tiflash-server](/tiflash/tiflash-command-line-flags.md) + + [pd-server](/command-line-flags-for-pd-configuration.md) + + Configuration File Parameters + + [tidb-server](/tidb-configuration-file.md) + + [tikv-server](/tikv-configuration-file.md) + + [tiflash-server](/tiflash/tiflash-configuration.md) + + [pd-server](/pd-configuration-file.md) + + [System Variables](/system-variables.md) + + Storage Engines + + TiKV + + [TiKV Overview](/tikv-overview.md) + + [RocksDB Overview](/storage-engine/rocksdb-overview.md) + + TiFlash + + [Overview](/tiflash/tiflash-overview.md) + + [Use TiFlash](/tiflash/use-tiflash.md) + + TiUP + + [Documentation Guide](/tiup/tiup-documentation-guide.md) + + [Overview](/tiup/tiup-overview.md) + + [Terminology and Concepts](/tiup/tiup-terminology-and-concepts.md) + + [Manage TiUP Components](/tiup/tiup-component-management.md) + + [FAQ](/tiup/tiup-faq.md) + + [Troubleshooting Guide](/tiup/tiup-troubleshooting-guide.md) + + TiUP Components + + [tiup-playground](/tiup/tiup-playground.md) + + [tiup-cluster](/tiup/tiup-cluster.md) + + [tiup-mirror](/tiup/tiup-mirror.md) + + [tiup-bench](/tiup/tiup-bench.md) + + [Telemetry](/telemetry.md) + + [Errors Codes](/error-codes.md) + + [TiCDC Overview](/ticdc/ticdc-overview.md) + + [TiCDC Open Protocol](/ticdc/ticdc-open-protocol.md) + + [Table Filter](/table-filter.md) + + [Schedule Replicas by Topology Labels](/schedule-replicas-by-topology-labels.md) +>>>>>>> 24f31fb... improve location-awareness (#3825) + FAQs - [TiDB FAQs](/faq/tidb-faq.md) - [TiDB Lightning FAQs](/tidb-lightning/tidb-lightning-faq.md) diff --git a/configure-placement-rules.md b/configure-placement-rules.md index 805a06d9a4ea5..3e3ff9fa1cb68 100644 --- a/configure-placement-rules.md +++ b/configure-placement-rules.md @@ -48,6 +48,21 @@ The following table shows the meaning of each field in a rule: The meaning and function of `LocationLabels` are the same with those earlier than v4.0. For example, if you have deployed `[zone,rack,host]` that defines a three-layer topology: the cluster has multiple zones (Availability Zones), each zone has multiple racks, and each rack has multiple hosts. When performing schedule, PD first tries to place the Region's peers in different zones. If this try fails (such as there are three replicas but only two zones in total), PD guarantees to place these replicas in different racks. If the number of racks is not enough to guarantee isolation, then PD tries the host-level isolation. +<<<<<<< HEAD +======= +The meaning and function of `IsolationLevel` is elaborated in [Cluster topology configuration](/schedule-replicas-by-topology-labels.md). 
For example, if you have deployed `[zone,rack,host]` that defines a three-layer topology with `LocationLabels` and set `IsolationLevel` to `zone`, then PD ensures that all peers of each Region are placed in different zones during the scheduling. If the minimum isolation level restriction on `IsolationLevel` cannot be met (for example, 3 replicas are configured but there are only 2 data zones in total), PD will not try to make up to meet this restriction. The default value of `IsolationLevel` is an empty string, which means that it is disabled. + +### Fields of the rule group + +The following table shows the description of each field in a rule group: + +| Field name | Type and restriction | Description | +| :--- | :--- | :--- | +| `ID` | `string` | The group ID that marks the source of the rule. | +| `Index` | `int` | The stacking sequence of different groups. | +| `Override` | `true`/`false` | Whether to override groups with smaller indexes. | + +>>>>>>> 24f31fb... improve location-awareness (#3825) ## Configure rules The operations in this section are based on [pd-ctl](/pd-control.md), and the commands involved in the operations also support calls via HTTP API. diff --git a/pd-configuration-file.md b/pd-configuration-file.md index e121182bdf1ce..267e79a77db50 100644 --- a/pd-configuration-file.md +++ b/pd-configuration-file.md @@ -256,6 +256,28 @@ Configuration items related to replicas + The topology information of a TiKV cluster + Default value: `[]` +<<<<<<< HEAD +======= ++ [Cluster topology configuration](/schedule-replicas-by-topology-labels.md) + +### `isolation-level` + ++ The minimum topological isolation level of a TiKV cluster ++ Default value: `""` ++ [Cluster topology configuration](/schedule-replicas-by-topology-labels.md) + +### `strictly-match-label` + ++ Enables the strict check for whether the TiKV label matches PD's `location-labels`. ++ Default value: `false` + +### `enable-placement-rules` + ++ Enables `placement-rules`. ++ Default value: `false` ++ See [Placement Rules](/configure-placement-rules.md). ++ An experimental feature of TiDB 4.0. +>>>>>>> 24f31fb... improve location-awareness (#3825) ## `label-property` diff --git a/schedule-replicas-by-topology-labels.md b/schedule-replicas-by-topology-labels.md new file mode 100644 index 0000000000000..546bab38743aa --- /dev/null +++ b/schedule-replicas-by-topology-labels.md @@ -0,0 +1,194 @@ +--- +title: Schedule Replicas by Topology Labels +summary: Learn how to schedule replicas by topology labels. +aliases: ['/docs/dev/location-awareness/','/docs/dev/how-to/deploy/geographic-redundancy/location-awareness/','/tidb/dev/location-awareness'] +--- + +# Schedule Replicas by Topology Labels + +To improve the high availability and disaster recovery capability of TiDB clusters, it is recommended that TiKV nodes are physically scattered as much as possible. For example, TiKV nodes can be distributed on different racks or even in different data centers. According to the topology information of TiKV, the PD scheduler automatically performs scheduling at the background to isolate each replica of a Region as much as possible, which maximizes the capability of disaster recovery. + +To make this mechanism effective, you need to properly configure TiKV and PD so that the topology information of the cluster, especially the TiKV location information, is reported to PD during deployment. Before you begin, see [Deploy TiDB Using TiUP](/production-deployment-using-tiup.md) first. 
+
+## Configure `labels` based on the cluster topology
+
+### Configure `labels` for TiKV
+
+You can use the command-line flag or set the TiKV configuration file to bind some attributes in the form of key-value pairs. These attributes are called `labels`. After TiKV is started, it reports its `labels` to PD so users can identify the location of TiKV nodes.
+
+Assume that the topology has three layers: zone > rack > host. You can use these labels (zone, rack, host) to set the TiKV location in one of the following ways:
+
++ Use the command-line flag:
+
+    {{< copyable "" >}}
+
+    ```
+    tikv-server --labels zone=<zone>,rack=<rack>,host=<host>
+    ```
+
++ Configure in the TiKV configuration file:
+
+    {{< copyable "" >}}
+
+    ```toml
+    [server]
+    labels = "zone=<zone>,rack=<rack>,host=<host>"
+    ```
+
+### Configure `location-labels` for PD
+
+According to the description above, the label can be any key-value pair used to describe TiKV attributes. But PD cannot identify the location-related labels and the layer relationship of these labels. Therefore, you need to make the following configuration for PD to understand the TiKV node topology.
+
++ If the PD cluster is not initialized, configure `location-labels` in the PD configuration file:
+
+    {{< copyable "" >}}
+
+    ```toml
+    [replication]
+    location-labels = ["zone", "rack", "host"]
+    ```
+
++ If the PD cluster is already initialized, use the pd-ctl tool to make online changes:
+
+    {{< copyable "shell-regular" >}}
+
+    ```bash
+    pd-ctl config set location-labels zone,rack,host
+    ```
+
+The `location-labels` configuration is an array of strings, and each item corresponds to the key of TiKV `labels`. The sequence of the keys represents the layer relationship of different labels.
+
+> **Note:**
+>
+> You must configure `location-labels` for PD and `labels` for TiKV at the same time for the configurations to take effect. Otherwise, PD does not perform scheduling according to the topology.
+
+### Configure `isolation-level` for PD
+
+If `location-labels` has been configured, you can further enhance the topological isolation requirements on TiKV clusters by configuring `isolation-level` in the PD configuration file.
+
+Assume that you have created a three-layer cluster topology (zone -> rack -> host) by configuring `location-labels` according to the instructions above. You can then set `isolation-level` to `zone` as follows:
+
+{{< copyable "" >}}
+
+```toml
+[replication]
+isolation-level = "zone"
+```
+
+If the PD cluster is already initialized, you need to use the pd-ctl tool to make online changes:
+
+{{< copyable "shell-regular" >}}
+
+```bash
+pd-ctl config set isolation-level zone
+```
+
+The `isolation-level` configuration is a string, which must correspond to one of the keys in `location-labels`. This parameter sets the minimum and mandatory isolation level requirement on TiKV topology clusters.
+
+> **Note:**
+>
+> `isolation-level` is empty by default, which means there is no mandatory restriction on the isolation level. To set it, you need to configure `location-labels` for PD and ensure that the value of `isolation-level` is one of the `location-labels` names.
+
+### Configure a cluster using TiUP (recommended)
+
+When using TiUP to deploy a cluster, you can configure the TiKV location in the [initialization configuration file](/production-deployment-using-tiup.md#step-3-edit-the-initialization-configuration-file). TiUP will generate the corresponding TiKV and PD configuration files during deployment.
+
+In the following example, a two-layer topology of `zone/host` is defined. The TiKV nodes of the cluster are distributed among three zones, each zone with two hosts. In z1, two TiKV instances are deployed per host. In z2 and z3, one TiKV instance is deployed per host. Here, `tikv-n` represents the IP address of the `n`th TiKV node.
+
+```
+server_configs:
+  pd:
+    replication.location-labels: ["zone", "host"]
+
+tikv_servers:
+# z1
+  - host: tikv-1
+    config:
+      server.labels:
+        zone: z1
+        host: h1
+  - host: tikv-2
+    config:
+      server.labels:
+        zone: z1
+        host: h1
+  - host: tikv-3
+    config:
+      server.labels:
+        zone: z1
+        host: h2
+  - host: tikv-4
+    config:
+      server.labels:
+        zone: z1
+        host: h2
+# z2
+  - host: tikv-5
+    config:
+      server.labels:
+        zone: z2
+        host: h1
+  - host: tikv-6
+    config:
+      server.labels:
+        zone: z2
+        host: h2
+# z3
+  - host: tikv-7
+    config:
+      server.labels:
+        zone: z3
+        host: h1
+  - host: tikv-8
+    config:
+      server.labels:
+        zone: z3
+        host: h2
+```
+
+For details, see [Geo-distributed Deployment topology](/geo-distributed-deployment-topology.md).
+
+### Configure a cluster using TiDB Ansible
+
+When using TiDB Ansible to deploy a cluster, you can directly configure the TiKV location in the `inventory.ini` file. `tidb-ansible` will generate the corresponding TiKV and PD configuration files during deployment.
+
+In the following example, a two-layer topology of `zone/host` is defined. The TiKV nodes of the cluster are distributed among three zones, each zone with two hosts. In z1, two TiKV instances are deployed per host. In z2 and z3, one TiKV instance is deployed per host.
+
+```
+[tikv_servers]
+# z1
+tikv-1 labels="zone=z1,host=h1"
+tikv-2 labels="zone=z1,host=h1"
+tikv-3 labels="zone=z1,host=h2"
+tikv-4 labels="zone=z1,host=h2"
+# z2
+tikv-5 labels="zone=z2,host=h1"
+tikv-6 labels="zone=z2,host=h2"
+# z3
+tikv-7 labels="zone=z3,host=h1"
+tikv-8 labels="zone=z3,host=h2"
+
+[pd_servers:vars]
+location_labels = ["zone", "host"]
+```
+
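Whichever deployment method is used, it is worth confirming that the labels were actually reported to PD before relying on them for disaster recovery. The following is a sketch that assumes pd-ctl can already reach the target PD instance; the exact JSON fields in the output vary between PD versions.

```bash
# Show the replication configuration that PD is using.
# The output should list the configured hierarchy, for example "zone,host".
pd-ctl config show replication

# List all TiKV stores and the labels each of them reported.
# Every store entry should carry a "labels" field such as zone=z1, host=h1.
pd-ctl store
```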
+ +## PD schedules based on topology label + +PD schedules replicas according to the label layer to make sure that different replicas of the same data are scattered as much as possible. + +Take the topology in the previous section as an example. + +Assume that the number of cluster replicas is 3 (`max-replicas=3`). Because there are 3 zones in total, PD ensures that the 3 replicas of each Region are respectively placed in z1, z2, and z3. In this way, the TiDB cluster is still available when one data center fails. + +Then, assume that the number of cluster replicas is 5 (`max-replicas=5`). Because there are only 3 zones in total, PD cannot guarantee the isolation of each replica at the zone level. In this situation, the PD scheduler will ensure replica isolation at the host level. In other words, multiple replicas of a Region might be distributed in the same zone but not on the same host. + +In the case of the 5-replica configuration, if z3 fails or is isolated as a whole, and cannot be recovered after a period of time (controlled by `max-store-down-time`), PD will make up the 5 replicas through scheduling. At this time, only 3 hosts are available. This means that host-level isolation cannot be guaranteed and that multiple replicas might be scheduled to the same host. But if the `isolation-level` value is set to `zone` instead of being left empty, this specifies the minimum physical isolation requirements for Region replicas. That is to say, PD will ensure that replicas of the same Region are scattered among different zones. PD will not perform corresponding scheduling even if following this isolation restriction does not meet the requirement of `max-replicas` for multiple replicas. + +For example, a TiKV cluster is distributed across three data zones z1, z2, and z3. Each Region has three replicas as required, and PD distributes the three replicas of the same Region to these three data zones respectively. If a power outage occurs in z1 and cannot be recovered after a period of time, PD determines that the Region replicas on z1 are no longer available. However, because `isolation-level` is set to `zone`, PD needs to strictly guarantee that different replicas of the same Region will not be scheduled on the same data zone. Because both z2 and z3 already have replicas, PD will not perform any scheduling under the minimum isolation level restriction of `isolation-level`, even if there are only two replicas at this moment. + +Similarly, when `isolation-level` is set to `rack`, the minimum isolation level applies to different racks in the same data center. With this configuration, the isolation at the zone layer is guaranteed first if possible. When the isolation at the zone level cannot be guaranteed, PD tries to avoid scheduling different replicas to the same rack in the same zone. The scheduling works similarly when `isolation-level` is set to `host` where PD first guarantees the isolation level of rack, and then the level of host. + +In summary, PD maximizes the disaster recovery of the cluster according to the current topology. Therefore, if you want to achieve a certain level of disaster recovery, deploy more machines on different sites according to the topology than the number of `max-replicas`. TiDB also provides mandatory configuration items such as `isolation-level` for you to more flexibly control the topological isolation level of data according to different scenarios. 
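To reproduce the 5-replica scenario described above, you can adjust both settings with pd-ctl. The following commands are a sketch that assumes `location-labels` is already configured and that the PD version in use supports `isolation-level`:

```bash
# Keep five replicas of every Region instead of the default three.
pd-ctl config set max-replicas 5

# Require that replicas of the same Region never share a zone,
# even if this prevents PD from placing all five replicas.
pd-ctl config set isolation-level zone
```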
diff --git a/tidb-scheduling.md b/tidb-scheduling.md
new file mode 100644
index 0000000000000..84e9cae9f2225
--- /dev/null
+++ b/tidb-scheduling.md
@@ -0,0 +1,141 @@
+---
+title: TiDB Scheduling
+summary: Introduces the PD scheduling component in a TiDB cluster.
+---
+
+# TiDB Scheduling
+
+PD works as the manager in a TiDB cluster, and it also schedules Regions in the cluster. This article introduces the design and core concepts of the PD scheduling component.
+
+## Scheduling situations
+
+TiKV is the distributed key-value storage engine used by TiDB. In TiKV, data is organized as Regions, which are replicated on several stores. In all replicas, a leader is responsible for reading and writing, and followers are responsible for replicating Raft logs from the leader.
+
+Now consider the following situations:
+
+* To utilize storage space efficiently, multiple replicas of the same Region need to be properly distributed on different nodes according to the Region size;
+* For multiple data center topologies, one data center failure only causes one replica to fail for all Regions;
+* When a new TiKV store is added, data can be rebalanced to it;
+* When a TiKV store fails, PD needs to consider:
+    * The recovery time of the failed store.
+        * If it is short (for example, the service is restarted), whether scheduling is necessary or not.
+        * If it is long (for example, the disk is faulty and data is lost), how to perform scheduling.
+    * The replicas of all Regions.
+        * If the number of replicas is not enough for some Regions, PD needs to complete them.
+        * If the number of replicas is more than expected (for example, the failed store re-joins the cluster after recovery), PD needs to delete them.
+* Read/write operations are performed on leaders, so leaders cannot be concentrated on only a few individual stores;
+* Not all Regions are hot, so the load of all TiKV stores needs to be balanced;
+* When Regions are being rebalanced, data transfer uses a lot of network/disk traffic and CPU time, which can affect online services.
+
+These situations can occur at the same time, which makes them harder to resolve. Also, the whole system is changing dynamically, so a scheduler is needed to collect all information about the cluster and then adjust the cluster. So, PD is introduced into the TiDB cluster.
+
+## Scheduling requirements
+
+The above situations can be classified into two types:
+
+1. A distributed and highly available storage system must meet the following requirements:
+
+    * The right number of replicas.
+    * Replicas need to be distributed on different machines according to different topologies.
+    * The cluster can perform automatic disaster recovery from TiKV peers' failure.
+
+2. A good distributed system needs to have the following optimizations:
+
+    * All Region leaders are distributed evenly on stores;
+    * The storage capacity of all TiKV peers is balanced;
+    * Hot spots are balanced;
+    * The speed of load balancing for Regions needs to be limited to ensure that online services are stable;
+    * Maintainers are able to take peers online/offline manually.
+
+After the first type of requirements is satisfied, the system is failure tolerant. After the second type of requirements is satisfied, resources are utilized more efficiently and the system has better scalability.
+
+To achieve these goals, PD first needs to collect information, such as the state of peers, information about Raft groups, and the statistics of peer access. Then we need to specify some strategies for PD, so that PD can make scheduling plans from this information and these strategies. Finally, PD distributes some operators to TiKV peers to complete the scheduling plans.
+
+## Basic scheduling operators
+
+All scheduling plans contain three basic operators:
+
+* Add a new replica
+* Remove a replica
+* Transfer a Region leader between replicas in a Raft group
+
+They are implemented by the Raft commands `AddReplica`, `RemoveReplica`, and `TransferLeader`.
+
+## Information collection
+
+Scheduling is based on information collection. In short, the PD scheduling component needs to know the states of all TiKV peers and all Regions. TiKV peers report the following information to PD:
+
+- State information reported by each TiKV peer:
+
+    Each TiKV peer sends heartbeats to PD periodically. PD not only checks whether the store is alive, but also collects [`StoreState`](https://github.com/pingcap/kvproto/blob/release-3.1/proto/pdpb.proto#L421) in the heartbeat message. `StoreState` includes:
+
+    * Total disk space
+    * Available disk space
+    * The number of Regions
+    * Data read/write speed
+    * The number of snapshots that are sent/received (the data might be replicated between replicas through snapshots)
+    * Whether the store is overloaded
+    * Labels (see [Perception of Topology](/schedule-replicas-by-topology-labels.md))
+
+- Information reported by Region leaders:
+
+    Each Region leader sends heartbeats to PD periodically to report [`RegionState`](https://github.com/pingcap/kvproto/blob/release-3.1/proto/pdpb.proto#L271), including:
+
+    * Position of the leader itself
+    * Positions of other replicas
+    * The number of offline replicas
+    * Data read/write speed
+
+PD collects cluster information through these two types of heartbeats and then makes decisions based on it.
+
+Besides, PD can get more information from an expanded interface to make more precise decisions. For example, if a store's heartbeats are broken, PD cannot know whether the peer steps down temporarily or forever. It just waits a while (30 minutes by default) and then treats the store as offline if no heartbeats are received. Then PD balances all Regions on the store to other stores.
+
+But sometimes stores are manually set offline by a maintainer, so the maintainer can tell PD this through the PD control interface. Then PD can balance all Regions immediately.
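Most of the information described in this section can also be inspected manually, which is useful when debugging scheduling behavior. The following pd-ctl sketch assumes an accessible PD endpoint; `1` and `2` are placeholder store and Region IDs.

```bash
# Show the state that store 1 last reported in its heartbeat,
# including capacity, available space, Region count, and labels.
pd-ctl store 1

# Show what PD currently knows about Region 2:
# its peers, its leader, and the stores they live on.
pd-ctl region 2
```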
+
+## Scheduling strategies
+
+After collecting the information, PD needs some strategies to make scheduling plans.
+
+**Strategy 1: The number of replicas of a Region needs to be correct**
+
+PD can learn from a Region leader's heartbeat that the replica count of the Region is incorrect. If this happens, PD can adjust the replica count by adding or removing replicas. The reasons for an incorrect replica count can be:
+
+* A store fails, so the replica count of some Regions becomes less than expected;
+* A failed store recovers, so the replica count of some Regions could be more than expected;
+* [`max-replicas`](https://github.com/pingcap/pd/blob/v4.0.0-beta/conf/config.toml#L95) is changed.
+
+**Strategy 2: Replicas of a Region need to be at different positions**
+
+Note that here "position" is different from "machine". Generally, PD can only ensure that replicas of a Region are not on the same peer, so that the peer's failure does not cause more than one replica to be lost. However, in production, you might have the following requirements:
+
+* Multiple TiKV peers are on one machine;
+* TiKV peers are on multiple racks, and the system is expected to be available even if a rack fails;
+* TiKV peers are in multiple data centers, and the system is expected to be available even if a data center fails.
+
+The key to these requirements is that peers can have the same "position", which is the smallest unit for failure tolerance. Replicas of a Region must not be in one unit. So, we can configure [labels](https://github.com/tikv/tikv/blob/v4.0.0-beta/etc/config-template.toml#L140) for the TiKV peers, and set [location-labels](https://github.com/pingcap/pd/blob/v4.0.0-beta/conf/config.toml#L100) on PD to specify which labels are used for marking positions.
+
+**Strategy 3: Replicas need to be balanced between stores**
+
+The size limit of a Region replica is fixed, so keeping the replicas balanced between stores is helpful for data size balance.
+
+**Strategy 4: Leaders need to be balanced between stores**
+
+Read and write operations are performed on leaders according to the Raft protocol, so PD needs to spread leaders across the whole cluster instead of concentrating them on several peers.
+
+**Strategy 5: Hot spots need to be balanced between stores**
+
+PD can detect hot spots from store heartbeats and Region heartbeats, so that it can distribute hot spots across stores.
+
+**Strategy 6: Storage size needs to be balanced between stores**
+
+When starting up, a TiKV store reports the `capacity` of its storage, which indicates the store's space limit. PD considers this when scheduling.
+
+**Strategy 7: Adjust scheduling speed to stabilize online services**
+
+Scheduling utilizes CPU, memory, network, and I/O traffic. Too much resource utilization influences online services. Therefore, PD needs to limit the number of concurrent scheduling tasks. By default, this strategy is conservative, but it can be changed if quicker scheduling is required.
+
+## Scheduling implementation
+
+PD collects cluster information from store heartbeats and Region heartbeats, and then makes scheduling plans from the information and strategies. Scheduling plans are a sequence of basic operators. Every time PD receives a Region heartbeat from a Region leader, it checks whether there is a pending operator on the Region. If PD needs to dispatch a new operator to a Region, it puts the operator into the heartbeat response and monitors the operator by checking follow-up Region heartbeats.
+
+Note that the "operators" here are only suggestions to the Region leader, which can be skipped by Regions. The leader of a Region can decide whether to skip a scheduling operator based on its current status.
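The operators described above can also be listed and created manually through pd-ctl, which makes the scheduling loop easier to observe. This is a sketch; the Region ID and store ID below are placeholders.

```bash
# List the operators that PD currently has in flight.
pd-ctl operator show

# Manually create a basic operator: transfer the leader of Region 1 to store 2.
# PD will dispatch it to the Region leader through a heartbeat response, as described above.
pd-ctl operator add transfer-leader 1 2
```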
From dc596433324862885607c58a3915d05db5474b69 Mon Sep 17 00:00:00 2001 From: TomShawn <41534398+TomShawn@users.noreply.github.com> Date: Thu, 17 Sep 2020 13:40:24 +0800 Subject: [PATCH 2/5] 3.1-change --- TOC.md | 142 +----------------------- location-awareness.md | 91 --------------- pd-configuration-file.md | 22 ---- schedule-replicas-by-topology-labels.md | 101 +---------------- tidb-scheduling.md | 141 ----------------------- 5 files changed, 6 insertions(+), 491 deletions(-) delete mode 100644 location-awareness.md delete mode 100644 tidb-scheduling.md diff --git a/TOC.md b/TOC.md index f4a3c9c35ed81..0e9d70fd0d942 100644 --- a/TOC.md +++ b/TOC.md @@ -58,7 +58,7 @@ - [Docker Deployment](/test-deployment-using-docker.md) + Geographic Redundancy - [Overview](/geo-redundancy-deployment.md) - - [Configure Location Awareness](/location-awareness.md) + - [Schedule Replicas by Topology Labels](/schedule-replicas-by-topology-labels.md) - [Data Migration with Ansible](https://pingcap.com/docs/tidb-data-migration/stable/deploy-a-dm-cluster-using-ansible/) + Configure - [Time Zone](/configure-time-zone.md) @@ -148,7 +148,6 @@ - [`SET`](/data-type-string.md#set-type) - [JSON Type](/data-type-json.md) + Functions and Operators -<<<<<<< HEAD - [Function and Operator Reference](/functions-and-operators/functions-and-operators-overview.md) - [Type Conversion in Expression Evaluation](/functions-and-operators/type-conversion-in-expression-evaluation.md) - [Operators](/functions-and-operators/operators.md) @@ -387,145 +386,6 @@ - [TiKV Control](/tikv-control.md) - [TiDB Control](/tidb-control.md) - [TiDB in Kubernetes](https://pingcap.com/docs/tidb-in-kubernetes/stable/) -======= - + [Overview](/functions-and-operators/functions-and-operators-overview.md) - + [Type Conversion in Expression Evaluation](/functions-and-operators/type-conversion-in-expression-evaluation.md) - + [Operators](/functions-and-operators/operators.md) - + [Control Flow Functions](/functions-and-operators/control-flow-functions.md) - + [String Functions](/functions-and-operators/string-functions.md) - + [Numeric Functions and Operators](/functions-and-operators/numeric-functions-and-operators.md) - + [Date and Time Functions](/functions-and-operators/date-and-time-functions.md) - + [Bit Functions and Operators](/functions-and-operators/bit-functions-and-operators.md) - + [Cast Functions and Operators](/functions-and-operators/cast-functions-and-operators.md) - + [Encryption and Compression Functions](/functions-and-operators/encryption-and-compression-functions.md) - + [Information Functions](/functions-and-operators/information-functions.md) - + [JSON Functions](/functions-and-operators/json-functions.md) - + [Aggregate (GROUP BY) Functions](/functions-and-operators/aggregate-group-by-functions.md) - + [Window Functions](/functions-and-operators/window-functions.md) - + [Miscellaneous Functions](/functions-and-operators/miscellaneous-functions.md) - + [Precision Math](/functions-and-operators/precision-math.md) - + [List of Expressions for Pushdown](/functions-and-operators/expressions-pushed-down.md) - + [Constraints](/constraints.md) - + [Generated Columns](/generated-columns.md) - + [SQL Mode](/sql-mode.md) - + Transactions - + [Overview](/transaction-overview.md) - + [Isolation Levels](/transaction-isolation-levels.md) - + [Optimistic Transactions](/optimistic-transaction.md) - + [Pessimistic Transactions](/pessimistic-transaction.md) - + Garbage Collection (GC) - + [Overview](/garbage-collection-overview.md) - + 
[Configuration](/garbage-collection-configuration.md) - + [Views](/views.md) - + [Partitioning](/partitioned-table.md) - + [Character Set and Collation](/character-set-and-collation.md) - + System Tables - + [`mysql`](/mysql-schema.md) - + INFORMATION_SCHEMA - + [Overview](/information-schema/information-schema.md) - + [`ANALYZE_STATUS`](/information-schema/information-schema-analyze-status.md) - + [`CHARACTER_SETS`](/information-schema/information-schema-character-sets.md) - + [`CLUSTER_CONFIG`](/information-schema/information-schema-cluster-config.md) - + [`CLUSTER_HARDWARE`](/information-schema/information-schema-cluster-hardware.md) - + [`CLUSTER_INFO`](/information-schema/information-schema-cluster-info.md) - + [`CLUSTER_LOAD`](/information-schema/information-schema-cluster-load.md) - + [`CLUSTER_LOG`](/information-schema/information-schema-cluster-log.md) - + [`CLUSTER_SYSTEMINFO`](/information-schema/information-schema-cluster-systeminfo.md) - + [`COLLATIONS`](/information-schema/information-schema-collations.md) - + [`COLLATION_CHARACTER_SET_APPLICABILITY`](/information-schema/information-schema-collation-character-set-applicability.md) - + [`COLUMNS`](/information-schema/information-schema-columns.md) - + [`DDL_JOBS`](/information-schema/information-schema-ddl-jobs.md) - + [`ENGINES`](/information-schema/information-schema-engines.md) - + [`INSPECTION_RESULT`](/information-schema/information-schema-inspection-result.md) - + [`INSPECTION_RULES`](/information-schema/information-schema-inspection-rules.md) - + [`INSPECTION_SUMMARY`](/information-schema/information-schema-inspection-summary.md) - + [`KEY_COLUMN_USAGE`](/information-schema/information-schema-key-column-usage.md) - + [`METRICS_SUMMARY`](/information-schema/information-schema-metrics-summary.md) - + [`METRICS_TABLES`](/information-schema/information-schema-metrics-tables.md) - + [`PARTITIONS`](/information-schema/information-schema-partitions.md) - + [`PROCESSLIST`](/information-schema/information-schema-processlist.md) - + [`SCHEMATA`](/information-schema/information-schema-schemata.md) - + [`SEQUENCES`](/information-schema/information-schema-sequences.md) - + [`SESSION_VARIABLES`](/information-schema/information-schema-session-variables.md) - + [`SLOW_QUERY`](/information-schema/information-schema-slow-query.md) - + [`STATISTICS`](/information-schema/information-schema-statistics.md) - + [`TABLES`](/information-schema/information-schema-tables.md) - + [`TABLE_CONSTRAINTS`](/information-schema/information-schema-table-constraints.md) - + [`TABLE_STORAGE_STATS`](/information-schema/information-schema-table-storage-stats.md) - + [`TIDB_HOT_REGIONS`](/information-schema/information-schema-tidb-hot-regions.md) - + [`TIDB_INDEXES`](/information-schema/information-schema-tidb-indexes.md) - + [`TIDB_SERVERS_INFO`](/information-schema/information-schema-tidb-servers-info.md) - + [`TIFLASH_REPLICA`](/information-schema/information-schema-tiflash-replica.md) - + [`TIKV_REGION_PEERS`](/information-schema/information-schema-tikv-region-peers.md) - + [`TIKV_REGION_STATUS`](/information-schema/information-schema-tikv-region-status.md) - + [`TIKV_STORE_STATUS`](/information-schema/information-schema-tikv-store-status.md) - + [`USER_PRIVILEGES`](/information-schema/information-schema-user-privileges.md) - + [`VIEWS`](/information-schema/information-schema-views.md) - + [`METRICS_SCHEMA`](/metrics-schema.md) - + UI - + TiDB Dashboard - + [Overview](/dashboard/dashboard-intro.md) - + Maintain - + 
[Deploy](/dashboard/dashboard-ops-deploy.md) - + [Reverse Proxy](/dashboard/dashboard-ops-reverse-proxy.md) - + [Secure](/dashboard/dashboard-ops-security.md) - + [Access](/dashboard/dashboard-access.md) - + [Overview Page](/dashboard/dashboard-overview.md) - + [Cluster Info Page](/dashboard/dashboard-cluster-info.md) - + [Key Visualizer Page](/dashboard/dashboard-key-visualizer.md) - + SQL Statements Analysis - + [SQL Statements Page](/dashboard/dashboard-statement-list.md) - + [SQL Details Page](/dashboard/dashboard-statement-details.md) - + [Slow Queries Page](/dashboard/dashboard-slow-query.md) - + Cluster Diagnostics - + [Access Cluster Diagnostics Page](/dashboard/dashboard-diagnostics-access.md) - + [View Diagnostics Report](/dashboard/dashboard-diagnostics-report.md) - + [Use Diagnostics](/dashboard/dashboard-diagnostics-usage.md) - + [Search Logs Page](/dashboard/dashboard-log-search.md) - + [Profile Instances Page](/dashboard/dashboard-profiling.md) - + [FAQ](/dashboard/dashboard-faq.md) - + CLI - + [tikv-ctl](/tikv-control.md) - + [pd-ctl](/pd-control.md) - + [tidb-ctl](/tidb-control.md) - + [pd-recover](/pd-recover.md) - + Command Line Flags - + [tidb-server](/command-line-flags-for-tidb-configuration.md) - + [tikv-server](/command-line-flags-for-tikv-configuration.md) - + [tiflash-server](/tiflash/tiflash-command-line-flags.md) - + [pd-server](/command-line-flags-for-pd-configuration.md) - + Configuration File Parameters - + [tidb-server](/tidb-configuration-file.md) - + [tikv-server](/tikv-configuration-file.md) - + [tiflash-server](/tiflash/tiflash-configuration.md) - + [pd-server](/pd-configuration-file.md) - + [System Variables](/system-variables.md) - + Storage Engines - + TiKV - + [TiKV Overview](/tikv-overview.md) - + [RocksDB Overview](/storage-engine/rocksdb-overview.md) - + TiFlash - + [Overview](/tiflash/tiflash-overview.md) - + [Use TiFlash](/tiflash/use-tiflash.md) - + TiUP - + [Documentation Guide](/tiup/tiup-documentation-guide.md) - + [Overview](/tiup/tiup-overview.md) - + [Terminology and Concepts](/tiup/tiup-terminology-and-concepts.md) - + [Manage TiUP Components](/tiup/tiup-component-management.md) - + [FAQ](/tiup/tiup-faq.md) - + [Troubleshooting Guide](/tiup/tiup-troubleshooting-guide.md) - + TiUP Components - + [tiup-playground](/tiup/tiup-playground.md) - + [tiup-cluster](/tiup/tiup-cluster.md) - + [tiup-mirror](/tiup/tiup-mirror.md) - + [tiup-bench](/tiup/tiup-bench.md) - + [Telemetry](/telemetry.md) - + [Errors Codes](/error-codes.md) - + [TiCDC Overview](/ticdc/ticdc-overview.md) - + [TiCDC Open Protocol](/ticdc/ticdc-open-protocol.md) - + [Table Filter](/table-filter.md) - + [Schedule Replicas by Topology Labels](/schedule-replicas-by-topology-labels.md) ->>>>>>> 24f31fb... improve location-awareness (#3825) + FAQs - [TiDB FAQs](/faq/tidb-faq.md) - [TiDB Lightning FAQs](/tidb-lightning/tidb-lightning-faq.md) diff --git a/location-awareness.md b/location-awareness.md deleted file mode 100644 index c73d4837fb0fd..0000000000000 --- a/location-awareness.md +++ /dev/null @@ -1,91 +0,0 @@ ---- -title: Cluster Topology Configuration -summary: Learn to configure cluster topology to maximize the capacity for disaster recovery. -aliases: ['/docs/v3.1/location-awareness/','/docs/v3.1/how-to/deploy/geographic-redundancy/location-awareness/'] ---- - -# Cluster Topology Configuration - -## Overview - -PD schedules according to the topology of the TiKV cluster to maximize the TiKV's capability for disaster recovery. 
- -Before you begin, see [Deploy TiDB Using TiDB Ansible (Recommended)](/online-deployment-using-ansible.md) and [Deploy TiDB Using Docker](/test-deployment-using-docker.md). - -## TiKV reports the topological information - -TiKV reports the topological information to PD according to the startup parameter or configuration of TiKV. - -Assuming that the topology has three structures: zone > rack > host, use lables to specify the following information: - -Startup parameter: - -``` -tikv-server --labels zone=,rack=,host= -``` - -Configuration: - -``` toml -[server] -labels = "zone=,rack=,host=" -``` - -## PD understands the TiKV topology - -PD gets the topology of TiKV cluster through the PD configuration. - -``` toml -[replication] -max-replicas = 3 -location-labels = ["zone", "rack", "host"] -``` - -`location-labels` needs to correspond to the TiKV `labels` name so that PD can understand that the `labels` represents the TiKV topology. - -> **Note:** -> -> You must configure `location-labels` for PD and `labels` for TiKV at the same time for `labels` to take effect. - -## PD schedules based on the TiKV topology - -PD makes optimal scheduling according to the topological information. You just need to care about what kind of topology can achieve the desired effect. - -If you use 3 replicas and hope that the TiDB cluster is always highly available even when a data zone goes down, you need at least 4 data zones. - -Assume that you have 4 data zones, each zone has 2 racks, and each rack has 2 hosts. You can start 2 TiKV instances on each host: - -``` -# zone=z1 -tikv-server --labels zone=z1,rack=r1,host=h1 -tikv-server --labels zone=z1,rack=r1,host=h2 -tikv-server --labels zone=z1,rack=r2,host=h1 -tikv-server --labels zone=z1,rack=r2,host=h2 - -# zone=z2 -tikv-server --labels zone=z2,rack=r1,host=h1 -tikv-server --labels zone=z2,rack=r1,host=h2 -tikv-server --labels zone=z2,rack=r2,host=h1 -tikv-server --labels zone=z2,rack=r2,host=h2 - -# zone=z3 -tikv-server --labels zone=z3,rack=r1,host=h1 -tikv-server --labels zone=z3,rack=r1,host=h2 -tikv-server --labels zone=z3,rack=r2,host=h1 -tikv-server --labels zone=z3,rack=r2,host=h2 - -# zone=z4 -tikv-server --labels zone=z4,rack=r1,host=h1 -tikv-server --labels zone=z4,rack=r1,host=h2 -tikv-server --labels zone=z4,rack=r2,host=h1 -tikv-server --labels zone=z4,rack=r2,host=h2 -``` - -In other words, 16 TiKV instances are distributed across 4 data zones, 8 racks and 16 machines. - -In this case, PD will schedule different replicas of each datum to different data zones. - -- If one of the data zones goes down, the high availability of the TiDB cluster is not affected. -- If the data zone cannot recover within a period of time, PD will remove the replica from this data zone. - -To sum up, PD maximizes the disaster recovery of the cluster according to the current topology. Therefore, if you want to reach a certain level of disaster recovery, deploy many machines in different sites according to the topology. The number of machines must be more than the number of `max-replicas`. 
diff --git a/pd-configuration-file.md b/pd-configuration-file.md index 267e79a77db50..e121182bdf1ce 100644 --- a/pd-configuration-file.md +++ b/pd-configuration-file.md @@ -256,28 +256,6 @@ Configuration items related to replicas + The topology information of a TiKV cluster + Default value: `[]` -<<<<<<< HEAD -======= -+ [Cluster topology configuration](/schedule-replicas-by-topology-labels.md) - -### `isolation-level` - -+ The minimum topological isolation level of a TiKV cluster -+ Default value: `""` -+ [Cluster topology configuration](/schedule-replicas-by-topology-labels.md) - -### `strictly-match-label` - -+ Enables the strict check for whether the TiKV label matches PD's `location-labels`. -+ Default value: `false` - -### `enable-placement-rules` - -+ Enables `placement-rules`. -+ Default value: `false` -+ See [Placement Rules](/configure-placement-rules.md). -+ An experimental feature of TiDB 4.0. ->>>>>>> 24f31fb... improve location-awareness (#3825) ## `label-property` diff --git a/schedule-replicas-by-topology-labels.md b/schedule-replicas-by-topology-labels.md index 546bab38743aa..47987f850ee79 100644 --- a/schedule-replicas-by-topology-labels.md +++ b/schedule-replicas-by-topology-labels.md @@ -1,14 +1,14 @@ --- title: Schedule Replicas by Topology Labels summary: Learn how to schedule replicas by topology labels. -aliases: ['/docs/dev/location-awareness/','/docs/dev/how-to/deploy/geographic-redundancy/location-awareness/','/tidb/dev/location-awareness'] +aliases: ['/docs/v3.1/location-awareness/','/docs/v3.1/how-to/deploy/geographic-redundancy/location-awareness/','/tidb/v3.1/location-awareness'] --- # Schedule Replicas by Topology Labels To improve the high availability and disaster recovery capability of TiDB clusters, it is recommended that TiKV nodes are physically scattered as much as possible. For example, TiKV nodes can be distributed on different racks or even in different data centers. According to the topology information of TiKV, the PD scheduler automatically performs scheduling at the background to isolate each replica of a Region as much as possible, which maximizes the capability of disaster recovery. -To make this mechanism effective, you need to properly configure TiKV and PD so that the topology information of the cluster, especially the TiKV location information, is reported to PD during deployment. Before you begin, see [Deploy TiDB Using TiUP](/production-deployment-using-tiup.md) first. +To make this mechanism effective, you need to properly configure TiKV and PD so that the topology information of the cluster, especially the TiKV location information, is reported to PD during deployment. Before you begin, see [Deploy TiDB Using TiDB Ansible](/online-deployment-using-ansible.md) first. ## Configure `labels` based on the cluster topology @@ -62,94 +62,7 @@ The `location-labels` configuration is an array of strings, and each item corres > > You must configure `location-labels` for PD and `labels` for TiKV at the same time for the configurations to take effect. Otherwise, PD does not perform scheduling according to the topology. -### Configure `isolation-level` for PD - -If `location-labels` has been configured, you can further enhance the topological isolation requirements on TiKV clusters by configuring `isolation-level` in the PD configuration file. 
- -Assume that you have made a three-layer cluster topology by configuring `location-labels` according to the instructions above: zone -> rack -> host, you can configure the `isolation-level` to `zone` as follows: - -{{< copyable "" >}} - -```toml -[replication] -isolation-level = "zone" -``` - -If the PD cluster is already initialized, you need to use the pd-ctl tool to make online changes: - -{{< copyable "shell-regular" >}} - -```bash -pd-ctl config set isolation-level zone -``` - -The `location-level` configuration is an array of strings, which needs to correspond to a key of `location-labels`. This parameter limits the minimum and mandatory isolation level requirements on TiKV topology clusters. - -> **Note:** -> -> `isolation-level` is empty by default, which means there is no mandatory restriction on the isolation level. To set it, you need to configure `location-labels` for PD and ensure that the value of `isolation-level` is one of `location-labels` names. - -### Configure a cluster using TiUP (recommended) - -When using TiUP to deploy a cluster, you can configure the TiKV location in the [initialization configuration file](/production-deployment-using-tiup.md#step-3-edit-the-initialization-configuration-file). TiUP will generate the corresponding TiKV and PD configuration files during deployment. - -In the following example, a two-layer topology of `zone/host` is defined. The TiKV nodes of the cluster are distributed among three zones, each zone with two hosts. In z1, two TiKV instances are deployed per host. In z2 and z3, one TiKV instance is deployed per host. In the following example, `tikv-n` represents the IP address of the `n`th TiKV node. - -``` -server_configs: - pd: - replication.location-labels: ["zone", "host"] - -tikv_servers: -# z1 - - host: tikv-1 - config: - server.labels: - zone: z1 - host: h1 - - host: tikv-2 - config: - server.labels: - zone: z1 - host: h1 - - host: tikv-3 - config: - server.labels: - zone: z1 - host: h2 - - host: tikv-4 - config: - server.labels: - zone: z1 - host: h2 -# z2 - - host: tikv-5 - config: - server.labels: - zone: z2 - host: h1 - - host: tikv-6 - config: - server.labels: - zone: z2 - host: h2 -# z3 - - host: tikv-7 - config: - server.labels: - zone: z3 - host: h1 - - host: tikv-8 - config: - server.labels: - zone: z3 - host: h2s -``` - -For details, see [Geo-distributed Deployment topology](/geo-distributed-deployment-topology.md). - -
- Configure a cluster using TiDB Ansible +### Configure a cluster using TiDB Ansible When using TiDB Ansible to deploy a cluster, you can directly configure the TiKV location in the `inventory.ini` file. `tidb-ansible` will generate the corresponding TiKV and PD configuration files during deployment. @@ -185,10 +98,6 @@ Assume that the number of cluster replicas is 3 (`max-replicas=3`). Because ther Then, assume that the number of cluster replicas is 5 (`max-replicas=5`). Because there are only 3 zones in total, PD cannot guarantee the isolation of each replica at the zone level. In this situation, the PD scheduler will ensure replica isolation at the host level. In other words, multiple replicas of a Region might be distributed in the same zone but not on the same host. -In the case of the 5-replica configuration, if z3 fails or is isolated as a whole, and cannot be recovered after a period of time (controlled by `max-store-down-time`), PD will make up the 5 replicas through scheduling. At this time, only 3 hosts are available. This means that host-level isolation cannot be guaranteed and that multiple replicas might be scheduled to the same host. But if the `isolation-level` value is set to `zone` instead of being left empty, this specifies the minimum physical isolation requirements for Region replicas. That is to say, PD will ensure that replicas of the same Region are scattered among different zones. PD will not perform corresponding scheduling even if following this isolation restriction does not meet the requirement of `max-replicas` for multiple replicas. - -For example, a TiKV cluster is distributed across three data zones z1, z2, and z3. Each Region has three replicas as required, and PD distributes the three replicas of the same Region to these three data zones respectively. If a power outage occurs in z1 and cannot be recovered after a period of time, PD determines that the Region replicas on z1 are no longer available. However, because `isolation-level` is set to `zone`, PD needs to strictly guarantee that different replicas of the same Region will not be scheduled on the same data zone. Because both z2 and z3 already have replicas, PD will not perform any scheduling under the minimum isolation level restriction of `isolation-level`, even if there are only two replicas at this moment. - -Similarly, when `isolation-level` is set to `rack`, the minimum isolation level applies to different racks in the same data center. With this configuration, the isolation at the zone layer is guaranteed first if possible. When the isolation at the zone level cannot be guaranteed, PD tries to avoid scheduling different replicas to the same rack in the same zone. The scheduling works similarly when `isolation-level` is set to `host` where PD first guarantees the isolation level of rack, and then the level of host. +In the case of the 5-replica configuration, if z3 fails or is isolated as a whole, and cannot be recovered after a period of time (controlled by `max-store-down-time`), PD will make up the 5 replicas through scheduling. At this time, only 3 hosts are available. This means that host-level isolation cannot be guaranteed and that multiple replicas might be scheduled to the same host. -In summary, PD maximizes the disaster recovery of the cluster according to the current topology. Therefore, if you want to achieve a certain level of disaster recovery, deploy more machines on different sites according to the topology than the number of `max-replicas`. 
TiDB also provides mandatory configuration items such as `isolation-level` for you to more flexibly control the topological isolation level of data according to different scenarios. +In summary, PD maximizes the disaster recovery of the cluster according to the current topology. Therefore, if you want to achieve a certain level of disaster recovery, deploy more machines on different sites according to the topology than the number of `max-replicas`. diff --git a/tidb-scheduling.md b/tidb-scheduling.md deleted file mode 100644 index 84e9cae9f2225..0000000000000 --- a/tidb-scheduling.md +++ /dev/null @@ -1,141 +0,0 @@ ---- -title: TiDB Scheduling -summary: Introduces the PD scheduling component in a TiDB cluster. ---- - -# TiDB Scheduling - -PD works as the manager in a TiDB cluster, and it also schedules Regions in the cluster. This article introduces the design and core concepts of the PD scheduling component. - -## Scheduling situations - -TiKV is the distributed key-value storage engine used by TiDB. In TiKV, data is organized as Regions, which are replicated on several stores. In all replicas, a leader is responsible for reading and writing, and followers are responsible for replicating Raft logs from the leader. - -Now consider about the following situations: - -* To utilize storage space in a high-efficient way, multiple Replicas of the same Region need to be properly distributed on different nodes according to the Region size; -* For multiple data center topologies, one data center failure only causes one replica to fail for all Regions; -* When a new TiKV store is added, data can be rebalanced to it; -* When a TiKV store fails, PD needs to consider: - * Recovery time of the failed store. - * If it's short (e.g. the service is restarted), whether scheduling is necessary or not. - * If it's long (e.g. disk fault, data is lost, etc.), how to do scheduling. - * Replicas of all Regions. - * If the number of replicas is not enough for some Regions, PD needs to complete them. - * If the number of replicas is more than expected (e.g. the failed store re-joins into the cluster after recovery), PD needs to delete them. -* Read/Write operations are performed on leaders, which can not be distributed only on a few individual stores; -* Not all Regions are hot, so loads of all TiKV stores need to be balanced; -* When Regions are in balancing, data transferring utilizes much network/disk traffic and CPU time, which can influence online services. - -These situations can occur at the same time, which makes it harder to resolve. Also, the whole system is changing dynamically, so a scheduler is needed to collect all information about the cluster, and then adjust the cluster. So, PD is introduced into the TiDB cluster. - -## Scheduling requirements - -The above situations can be classified into two types: - -1. A distributed and highly available storage system must meet the following requirements: - - * The right number of replicas. - * Replicas need to be distributed on different machines according to different topologies. - * The cluster can perform automatic disaster recovery from TiKV peers' failure. - -2. A good distributed system needs to have the following optimizations: - - * All Region leaders are distributed evenly on stores; - * Storage capacity of all TiKV peers are balanced; - * Hot spots are balanced; - * Speed of load balancing for the Regions needs to be limited to ensure that online services are stable; - * Maintainers are able to to take peers online/offline manually. 
- -After the first type of requirements is satisfied, the system will be failure tolerant. After the second type of requirements is satisfied, resources will be utilized more efficiently and the system will have better scalability. - -To achieve these goals, PD first needs to collect information, such as the state of peers, information about Raft groups, and the statistics of accesses to the peers. Then we need to specify some strategies for PD, so that PD can make scheduling plans from this information and these strategies. Finally, PD distributes some operators to TiKV peers to complete the scheduling plans. - -## Basic scheduling operators - -All scheduling plans contain three basic operators: - -* Add a new replica -* Remove a replica -* Transfer a Region leader between replicas in a Raft group - -They are implemented by the Raft commands `AddReplica`, `RemoveReplica`, and `TransferLeader`. - -## Information collection - -Scheduling is based on information collection. In short, the PD scheduling component needs to know the states of all TiKV peers and all Regions. TiKV peers report the following information to PD: - -- State information reported by each TiKV peer: - - Each TiKV peer sends heartbeats to PD periodically. PD not only checks whether the store is alive, but also collects [`StoreState`](https://github.com/pingcap/kvproto/blob/release-3.1/proto/pdpb.proto#L421) in the heartbeat message. `StoreState` includes: - - * Total disk space - * Available disk space - * The number of Regions - * Data read/write speed - * The number of snapshots that are sent/received (the data might be replicated between replicas through snapshots) - * Whether the store is overloaded - * Labels (See [Perception of Topology](/schedule-replicas-by-topology-labels.md)) - -- Information reported by Region leaders: - - Each Region leader sends heartbeats to PD periodically to report [`RegionState`](https://github.com/pingcap/kvproto/blob/release-3.1/proto/pdpb.proto#L271), including: - - * Position of the leader itself - * Positions of other replicas - * The number of offline replicas - * Data read/write speed - -PD collects cluster information through these two types of heartbeats and then makes decisions based on it. - -Besides, PD can get more information from an extended interface to make a more precise decision. For example, if a store's heartbeats are broken, PD cannot know whether the peer is down temporarily or permanently. It just waits a while (30 minutes by default) and then treats the store as offline if there are still no heartbeats received. Then PD balances all Regions on the store to other stores. - -But sometimes stores are manually set offline by a maintainer, and the maintainer can tell PD this through the PD control interface. Then PD can balance all Regions off the store immediately. - -## Scheduling strategies - -After collecting the information, PD needs some strategies to make scheduling plans. - -**Strategy 1: The number of replicas of a Region needs to be correct** - -PD can learn from a Region leader's heartbeat that the replica count of the Region is incorrect. If this happens, PD can adjust the replica count by adding or removing replicas. The reasons for an incorrect replica count can be: - -* Store failure, so the replica count of some Regions is less than expected; -* Store recovery after failure, so the replica count of some Regions could be more than expected; -* [`max-replicas`](https://github.com/pingcap/pd/blob/v4.0.0-beta/conf/config.toml#L95) is changed (a pd-ctl sketch of checking and adjusting it follows below).
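To make Strategy 1 concrete, the replica count that PD maintains can be inspected and changed at runtime through pd-ctl. This is an editorial sketch rather than part of the original file; the PD endpoint and the target value of `5` are assumptions:

```bash
pd-ctl -u http://127.0.0.1:2379

# Inside the pd-ctl shell:
>> config show replication   # shows max-replicas and the related replication settings
>> config set max-replicas 5 # after this, PD adds replicas for Regions that now have too few
```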
- -**Strategy 2: Replicas of a Region need to be at different positions** - -Note that here "position" is different from "machine". Generally, PD can only ensure that replicas of a Region are not on the same peer, so that a single peer's failure does not cause more than one replica to be lost. However, in production, you might have the following requirements: - -* Multiple TiKV peers are on one machine; -* TiKV peers are on multiple racks, and the system is expected to be available even if a rack fails; -* TiKV peers are in multiple data centers, and the system is expected to be available even if a data center fails; - -The key to these requirements is that peers can have the same "position", which is the smallest unit of fault tolerance. Replicas of a Region must not be in one such unit. So, we can configure [labels](https://github.com/tikv/tikv/blob/v4.0.0-beta/etc/config-template.toml#L140) for the TiKV peers, and set [location-labels](https://github.com/pingcap/pd/blob/v4.0.0-beta/conf/config.toml#L100) on PD to specify which labels are used for marking positions. - -**Strategy 3: Replicas need to be balanced between stores** - -The size limit of a Region replica is fixed, so keeping the replicas balanced between stores is helpful for data size balance. - -**Strategy 4: Leaders need to be balanced between stores** - -Read and write operations are performed on leaders according to the Raft protocol, so PD needs to distribute leaders across the whole cluster instead of concentrating them on several peers. - -**Strategy 5: Hot spots need to be balanced between stores** - -PD can detect hot spots from store heartbeats and Region heartbeats, so that it can distribute the hot spots across stores. - -**Strategy 6: Storage size needs to be balanced between stores** - -When it starts up, a TiKV store reports its storage `capacity`, which indicates the store's space limit. PD will consider this when scheduling. - -**Strategy 7: Adjust scheduling speed to stabilize online services** - -Scheduling utilizes CPU, memory, network, and I/O traffic. Too much resource utilization will influence online services. Therefore, PD needs to limit the number of concurrent scheduling tasks. By default, this strategy is conservative, but it can be changed if quicker scheduling is required. - -## Scheduling implementation - -PD collects cluster information from store heartbeats and Region heartbeats, and then makes scheduling plans from the information and strategies. Scheduling plans are a sequence of basic operators. Every time PD receives a Region heartbeat from a Region leader, it checks whether there is a pending operator on the Region or not. If PD needs to dispatch a new operator to a Region, it puts the operator into heartbeat responses, and monitors the operator by checking follow-up Region heartbeats. - -Note that the "operators" here are only suggestions to the Region leader, which can be skipped by Regions. The leader of a Region can decide whether to skip a scheduling operator or not based on its current status.
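The three basic operators described above can also be issued manually, which is a convenient way to observe how PD dispatches them through Region heartbeats. A tentative pd-ctl sketch; the Region ID `2` and store IDs `4`/`5` are made-up values, and the exact operator names should be verified against the pd-ctl version in use:

```bash
pd-ctl -u http://127.0.0.1:2379

# Inside the pd-ctl shell:
>> operator show                      # list the operators currently pending on Regions
>> operator add transfer-leader 2 5   # suggest moving the leader of Region 2 to store 5
>> operator add add-peer 2 4          # suggest adding a replica of Region 2 on store 4
>> operator add remove-peer 2 4       # suggest removing the replica of Region 2 on store 4
```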
From ad47f588be25fc1395f3cd29d811345c1122fbea Mon Sep 17 00:00:00 2001 From: TomShawn <41534398+TomShawn@users.noreply.github.com> Date: Thu, 17 Sep 2020 13:41:55 +0800 Subject: [PATCH 3/5] Update configure-placement-rules.md --- configure-placement-rules.md | 15 --------------- 1 file changed, 15 deletions(-) diff --git a/configure-placement-rules.md b/configure-placement-rules.md index 3e3ff9fa1cb68..805a06d9a4ea5 100644 --- a/configure-placement-rules.md +++ b/configure-placement-rules.md @@ -48,21 +48,6 @@ The following table shows the meaning of each field in a rule: The meaning and function of `LocationLabels` are the same as those earlier than v4.0. For example, if you have deployed `[zone,rack,host]` that defines a three-layer topology: the cluster has multiple zones (Availability Zones), each zone has multiple racks, and each rack has multiple hosts. When performing scheduling, PD first tries to place the Region's peers in different zones. If this attempt fails (for example, there are three replicas but only two zones in total), PD guarantees to place these replicas in different racks. If the number of racks is not enough to guarantee isolation, PD then tries host-level isolation. -<<<<<<< HEAD -======= -The meaning and function of `IsolationLevel` are elaborated in [Cluster topology configuration](/schedule-replicas-by-topology-labels.md). For example, if you have deployed `[zone,rack,host]` that defines a three-layer topology with `LocationLabels` and set `IsolationLevel` to `zone`, then PD ensures that all peers of each Region are placed in different zones during the scheduling. If the minimum isolation level restriction on `IsolationLevel` cannot be met (for example, 3 replicas are configured but there are only 2 data zones in total), PD will not try to make up to meet this restriction. The default value of `IsolationLevel` is an empty string, which means that it is disabled. - -### Fields of the rule group - -The following table shows the description of each field in a rule group: - -| Field name | Type and restriction | Description | -| :--- | :--- | :--- | -| `ID` | `string` | The group ID that marks the source of the rule. | -| `Index` | `int` | The stacking sequence of different groups. | -| `Override` | `true`/`false` | Whether to override groups with smaller indexes. | - ->>>>>>> 24f31fb... improve location-awareness (#3825) ## Configure rules The operations in this section are based on [pd-ctl](/pd-control.md), and the commands involved in the operations also support calls via HTTP API. From 1a705982a03d7add8894a86bf7980a217a82047b Mon Sep 17 00:00:00 2001 From: TomShawn <41534398+TomShawn@users.noreply.github.com> Date: Thu, 17 Sep 2020 13:46:17 +0800 Subject: [PATCH 4/5] Update schedule-replicas-by-topology-labels.md --- schedule-replicas-by-topology-labels.md | 1 - 1 file changed, 1 deletion(-) diff --git a/schedule-replicas-by-topology-labels.md b/schedule-replicas-by-topology-labels.md index 47987f850ee79..4329fc03194fd 100644 --- a/schedule-replicas-by-topology-labels.md +++ b/schedule-replicas-by-topology-labels.md @@ -86,7 +86,6 @@ tikv-8 labels="zone=z3,host=h2" location_labels = ["zone", "host"] ``` -
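For a cluster that is already bootstrapped, the `location_labels` setting shown in the configuration snippet above can also be changed online through pd-ctl instead of editing the configuration file. A minimal editorial sketch, assuming a reachable PD endpoint:

```bash
pd-ctl -u http://127.0.0.1:2379

# Inside the pd-ctl shell:
>> config set location-labels zone,host   # same effect as location_labels in the PD configuration file
```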
## PD schedules based on topology label From b4c95a51160a44e242d8f8b70ac0855952348aca Mon Sep 17 00:00:00 2001 From: TomShawn <41534398+TomShawn@users.noreply.github.com> Date: Thu, 17 Sep 2020 13:46:55 +0800 Subject: [PATCH 5/5] Update schedule-replicas-by-topology-labels.md --- schedule-replicas-by-topology-labels.md | 1 - 1 file changed, 1 deletion(-) diff --git a/schedule-replicas-by-topology-labels.md b/schedule-replicas-by-topology-labels.md index 4329fc03194fd..646f7c6bf6bd9 100644 --- a/schedule-replicas-by-topology-labels.md +++ b/schedule-replicas-by-topology-labels.md @@ -86,7 +86,6 @@ tikv-8 labels="zone=z3,host=h2" location_labels = ["zone", "host"] ``` - ## PD schedules based on topology label PD schedules replicas according to the label layer to make sure that different replicas of the same data are scattered as much as possible.
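If scattering by label layer is not strict enough and replicas must never share a zone, newer PD versions also expose an `isolation-level` setting that turns this preference into a hard constraint; whether the PD build in use supports it should be verified first. A hedged pd-ctl sketch (endpoint assumed):

```bash
pd-ctl -u http://127.0.0.1:2379

# Inside the pd-ctl shell:
>> config show replication         # check the current location-labels and isolation-level values
>> config set isolation-level zone # forbid placing two replicas of the same Region in one zone
```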