From eaca8f89c84edc4428e1138f17a84e5134721af3 Mon Sep 17 00:00:00 2001 From: qupeng Date: Wed, 1 Jul 2020 15:35:49 +0800 Subject: [PATCH 1/8] translate tidb-scheduling Signed-off-by: qupeng --- tidb-scheduling.md | 181 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 181 insertions(+) create mode 100644 tidb-scheduling.md diff --git a/tidb-scheduling.md b/tidb-scheduling.md new file mode 100644 index 0000000000000..ac26bc6d95f49 --- /dev/null +++ b/tidb-scheduling.md @@ -0,0 +1,181 @@ +--- +title: Overview about TiDB Scheduling +category: reference +--- + +# Overview +PD works as manager in a TiDB cluster, and it also schedules Regions in the +cluster. This article will introduces the design and core concepts about PD's +scheduling module. + +# Situations +TiKV is the distributed K/V storage engine used by TiDB. In TiKV, data is +organized as Regions, which are replicated on serveral stores. In all replicas, +Leader is responsible for reading and writing, Followers are responsible for +replicating Raft logs from the leader. + +So there are some situations need to be considered: + +* Regions need to be distributed fine in the cluster to utilize storage space + high-efficient; +* For multiple datacenter topologies, one datacenter failure should only cause + one replica fail for all Regions; +* When a new TiKV store is added, data can be rebalanced to it; +* When a TiKV store fails, PD needs to consider + * recovery time of the failure store. + * If it's short (e.g. the service is restarted), scheduling is necessary + or not. + * If it's long (e.g. disk fault, data is lost), how to do scheduling. + * replicas of all regions. + * If replicas are not enouth for some regions, needs to complete them. + * If replicas are more than expected (e.g. failed store re-joins into the + cluster after recovery), needs to delete them. +* Read/Write operations are performed on leaders, which shouldn't be distributed + only on invividual stores; +* Not all regions are hot, load of all TiKV stores needs to be balanced. +* When regions are in balancing, data transferring utilizes much network/disk + traffic and CPU time, which can influence online services. + +These situations can occur at the same time, which makes it harder to resolve. +And, the whole system is changing dynamically, so a single point is necessary +to collect all informations about the cluster, and then adjust the cluster. So, +PD is introduced into TiDB cluster. + +## Scheduling Requirements +The above situations can be classified into 2 classes: + +** The First: must be satisfied to reach high availability, includes ** + +* Count of replicas can't be more or less; +* Replicas needs to be distributed to different machines; +* The cluster can auto recovery from TiKV peers failure. + +** The Second: need to be satisfied as a good distributed system, includes ** + +* All Region leaders are balanced; +* Storeage size of all TiKV peers are balanced; +* Hot points are balanced; +* Speed of Region balance needs to be limited to ensure online services are stable; +* It's possible to online/offline peers manually. + +After the first class requirements are satisfied, the system will be failure tolerable. +After the second class requirements are satisfied, resources will be utilized more +efficent and the system will become well expandable. + +To achieve these targets, PD needs to collect informations firstly, such as state of peers, +informations about Raft groups and statistics of peers' accession. 
Then we can specify +some strategies on PD, so that PD can make sheculing plans from these information and +strategies. Finally, PD will distribute some operators to TiKVs to complete scheduling plans. + +## Basic Schedule operators + +All scheduling plan contain 3 basic operators: +* Add a new replica +* Remove a replica +* Transfer a Region leader between replicas + +They are implemented by Raft command `AddReplica`, `RemoveReplica` and `TransferLeader`. + +## Information collecting + +Scheduling is based on information collecting. In one word, scheduling needs to know +states of all TiKV peers and all Regions. TiKV peers report those information to PD. + +** Information reported by TiKV peers ** + +TiKV sends heartbeats to PD periodically. PD can not only check the store is active +nor not, but also collect [`StoreState`](https://github.com/pingcap/kvproto/blob/release-3.1/proto/pdpb.proto#L421) +in the message. `StoreState` includes + +* Total disk space +* Available disk space +* Region count +* Data read/write speed +* Send/receive snapshot count +* It's overload or not +* Labels (See [Perception of Topology](/location-awareness.md)) + +** Information reported by Region leaders ** + +Region leader send heartbeaets to PD periodically to report [`RegionState`](https://github.com/pingcap/kvproto/blob/release-3.1/proto/pdpb.proto#L271), +includes + +* Position of the leader itself +* Positions of other replicas +* Offline replicas count +* data read/write speed + +PD collects cluster information by these 2 type heartbeats and then makes dicision based on it. + +Beside these, PD can get more information from expanded interface. For example, +if a store's heartbeats are broken, PD can't know the peer steps down temporarily or forever. +It just waits a while (by default 30min) and then treats the store become offline if there +are still no heartbeats received. Then PD balances all regions on the store to other stores. + +But sometimes stores are set offline by maintainers manually, so that we can tell PD this by +PD control interface. Then PD can balance all regions immediately. + +## Scheduling stretagies + +PD needs some stretagies to make scheduling plans. + +** Replicas count of Regions need to be correct ** + +PD can know replica count of a Region is incorrect from Region leader's heartbeat. If it happens, +PD can adjust replica count by add/rmeove replica operation. The reason of incorrect replica counts +could be: + +* Store failure, so some Region's replica count will be less than expected; +* Store recovery after failure, so some Region's replica count could be more than expected; +* [`max-replicas`](https://github.com/pingcap/pd/blob/v4.0.0-beta/conf/config.toml#L95) is changed. + +** Replicas of a Region need to be at different positions ** + +Please note that 'position' is different from 'machine'. Generally PD can only ensure that +replicas of a Region won't be at a same peer to avoid the peer's failure cause more than one +replicas become lost. However in production, these requirements are possible: + +* Multiple TiKV peers are on one machine; +* TiKVs are on multiple racks, and the system is expected to be available even if a rack fails; +* TiKVs are in multiple datacenters, and the system is expected to be available even if a datacenter fails; + +The key of there requirements is that peers can have same 'position', which is the smallest unit +for failure-toleration. Replicas of a Region shouldn't be in one unit. 
So, we can configure +[labels](https://github.com/tikv/tikv/blob/v4.0.0-beta/etc/config-template.toml#L140) for TiKVs, +and set [location-labels](https://github.com/pingcap/pd/blob/v4.0.0-beta/conf/config.toml#L100) on PD +to specify which labels are used for marking positions. + +** Replicas should be balanced between stores ** + +Size limit of a Region is fixed, so make Region count be balanced between store is helpful for data size balance. + +** Leaders should be balanced between stores ** + +Read and write operations are performed on leaders in Raft. So PD needs to distributed leader into whole cluster +instead of serveral peers. + +** Hot points should be balanced between stores ** + +PD can detect hot points from store heartbets and Region heartbeats. So PD can disturb hot points. + +** Storage size needs to be balanced between stores ** + +TiKV reports `capacity` of storage when it starts up, which indicates the store's space limit. PD will consider this +when doing schedule. + +** Adjust scheduling speed to stabilize online services ** + +Scheduling utilizes CPU, memory, network and I/O traffic. Too much resource utilization will influence +online services. So PD needs to limit concurrent scheduling count. By default the strategy is conservative, +while it can be changed if quicker scheduling is required. + +## Scheduling implementation + +PD collects cluster information from store heartbeats and Region heartbeats, and then makes scheduling plan +from the information and stretagies. Scheduling plans are constructed by a sequence of basic operators. +Every time when PD receives a region heartbeat from a Region leader, it checks whether there is a pending +operator on the Region or not. If PD needs to dispatch a new operator to a Region, it puts the operator into +heartbeat responses, and monitors the operator by checking follow-up Region heartbeats. + +Note that operators are only suggestions, which could be skipeed by Regions. Leader of Regions can decide +whether to step a scheduling operator or not based on its current status. From 90ace57e99ad7ead2556eb0dfbd106ef7402a592 Mon Sep 17 00:00:00 2001 From: yikeke Date: Wed, 15 Jul 2020 12:35:49 +0800 Subject: [PATCH 2/8] Update TOC.md --- TOC.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/TOC.md b/TOC.md index 0ff644a96239e..18c3bbe04447e 100644 --- a/TOC.md +++ b/TOC.md @@ -16,7 +16,8 @@ - [Interaction Test on Online Workloads and `ADD INDEX` Operations](/benchmark/online-workloads-and-add-index-operations.md) - [Quick Start with TiDB](/quick-start-with-tidb.md) + Concepts - - [Architecture](/architecture.md) + + Cluster Architecture + - [Scheduling](/tidb-scheduling.md) + Key Features - [Horizontal Scalability](/key-features.md#horizontal-scalability) - [MySQL Compatible Syntax](/key-features.md#mysql-compatible-syntax) From 0ac749fcd4ac5549c5eb006f9fad7b390debd52a Mon Sep 17 00:00:00 2001 From: yikeke Date: Thu, 16 Jul 2020 14:39:29 +0800 Subject: [PATCH 3/8] Update tidb-scheduling.md --- tidb-scheduling.md | 17 ++++++++++------- 1 file changed, 10 insertions(+), 7 deletions(-) diff --git a/tidb-scheduling.md b/tidb-scheduling.md index ac26bc6d95f49..efa29d5f0f904 100644 --- a/tidb-scheduling.md +++ b/tidb-scheduling.md @@ -1,14 +1,15 @@ --- -title: Overview about TiDB Scheduling -category: reference +title: TiDB Scheduling +summary: Introduces the PD scheduling component in a TiDB cluster. --- -# Overview -PD works as manager in a TiDB cluster, and it also schedules Regions in the -cluster. 
This article will introduces the design and core concepts about PD's -scheduling module. +# TiDB Scheduling + +PD works as the manager in a TiDB cluster, and it also schedules Regions in the +cluster. This article introduces the design and core concepts of the PD scheduling component. + +## Use scenarios -# Situations TiKV is the distributed K/V storage engine used by TiDB. In TiKV, data is organized as Regions, which are replicated on serveral stores. In all replicas, Leader is responsible for reading and writing, Followers are responsible for @@ -42,6 +43,7 @@ to collect all informations about the cluster, and then adjust the cluster. So, PD is introduced into TiDB cluster. ## Scheduling Requirements + The above situations can be classified into 2 classes: ** The First: must be satisfied to reach high availability, includes ** @@ -70,6 +72,7 @@ strategies. Finally, PD will distribute some operators to TiKVs to complete sche ## Basic Schedule operators All scheduling plan contain 3 basic operators: + * Add a new replica * Remove a replica * Transfer a Region leader between replicas From 9878dfa5be45704d589955e972e82a3b2df8db33 Mon Sep 17 00:00:00 2001 From: yikeke Date: Thu, 16 Jul 2020 16:30:41 +0800 Subject: [PATCH 4/8] Update tidb-scheduling.md --- tidb-scheduling.md | 64 ++++++++++++++++++++++------------------------ 1 file changed, 31 insertions(+), 33 deletions(-) diff --git a/tidb-scheduling.md b/tidb-scheduling.md index efa29d5f0f904..ce47f06491742 100644 --- a/tidb-scheduling.md +++ b/tidb-scheduling.md @@ -8,57 +8,55 @@ summary: Introduces the PD scheduling component in a TiDB cluster. PD works as the manager in a TiDB cluster, and it also schedules Regions in the cluster. This article introduces the design and core concepts of the PD scheduling component. -## Use scenarios +## Scheduling situations TiKV is the distributed K/V storage engine used by TiDB. In TiKV, data is -organized as Regions, which are replicated on serveral stores. In all replicas, +organized as Regions, which are replicated on serveral Stores. In all replicas, Leader is responsible for reading and writing, Followers are responsible for -replicating Raft logs from the leader. +replicating Raft logs from the Leader. -So there are some situations need to be considered: +Now consider about the following situations: -* Regions need to be distributed fine in the cluster to utilize storage space - high-efficient; -* For multiple datacenter topologies, one datacenter failure should only cause - one replica fail for all Regions; +* To utilize storage space in a high-efficient way, multiple Replicas of the same Region need to be properly distributed on different nodes according to the Region size; +* For multiple data center topologies, one data center failure only causes one replica to fail for all Regions; * When a new TiKV store is added, data can be rebalanced to it; -* When a TiKV store fails, PD needs to consider - * recovery time of the failure store. +* When a TiKV store fails, PD needs to consider: + * Recovery time of the failed store. * If it's short (e.g. the service is restarted), scheduling is necessary or not. - * If it's long (e.g. disk fault, data is lost), how to do scheduling. - * replicas of all regions. - * If replicas are not enouth for some regions, needs to complete them. + * If it's long (e.g. disk fault, data is lost, etc.), how to do scheduling. + * Replicas of all Regions. + * If replicas are not enough for some Regions, the cluster needs to complete them. 
* If replicas are more than expected (e.g. failed store re-joins into the - cluster after recovery), needs to delete them. -* Read/Write operations are performed on leaders, which shouldn't be distributed - only on invividual stores; -* Not all regions are hot, load of all TiKV stores needs to be balanced. -* When regions are in balancing, data transferring utilizes much network/disk + cluster after recovery), the cluster needs to delete them. +* Read/Write operations are performed on leaders, which should not be distributed + only on individual stores; +* Not all Regions are hot, so load of all TiKV stores needs to be balanced. +* When Regions are in balancing, data transferring utilizes much network/disk traffic and CPU time, which can influence online services. These situations can occur at the same time, which makes it harder to resolve. -And, the whole system is changing dynamically, so a single point is necessary -to collect all informations about the cluster, and then adjust the cluster. So, -PD is introduced into TiDB cluster. +Also, the whole system is changing dynamically, so a new component is necessary +to collect all information about the cluster, and then adjust the cluster. So, +PD is introduced into the TiDB cluster. -## Scheduling Requirements +## Scheduling requirements -The above situations can be classified into 2 classes: +The above situations can be classified into two types: -** The First: must be satisfied to reach high availability, includes ** +1. A distributed and highly available storage system must meet the following requirements: -* Count of replicas can't be more or less; -* Replicas needs to be distributed to different machines; -* The cluster can auto recovery from TiKV peers failure. + * The right number of replicas. + * Replicas should be distributed on different machines according to different topologies. + * The cluster can perform automatic disaster recovery from TiKV peers failure. -** The Second: need to be satisfied as a good distributed system, includes ** +2. A good distributed system needs to have the following optimizations: -* All Region leaders are balanced; -* Storeage size of all TiKV peers are balanced; -* Hot points are balanced; -* Speed of Region balance needs to be limited to ensure online services are stable; -* It's possible to online/offline peers manually. + * All Region leaders are balanced; + * Storage size of all TiKV peers are balanced; + * Hot spots are balanced; + * Speed of Region balance needs to be limited to ensure online services are stable; + * It's possible to make peers online/offline manually. After the first class requirements are satisfied, the system will be failure tolerable. After the second class requirements are satisfied, resources will be utilized more From 7a2b51519040655d409330e573ba9b978c00aba3 Mon Sep 17 00:00:00 2001 From: yikeke Date: Fri, 17 Jul 2020 13:56:43 +0800 Subject: [PATCH 5/8] Update tidb-scheduling.md --- tidb-scheduling.md | 126 ++++++++++++++++----------------------------- 1 file changed, 43 insertions(+), 83 deletions(-) diff --git a/tidb-scheduling.md b/tidb-scheduling.md index ce47f06491742..59896322e3d79 100644 --- a/tidb-scheduling.md +++ b/tidb-scheduling.md @@ -5,15 +5,11 @@ summary: Introduces the PD scheduling component in a TiDB cluster. # TiDB Scheduling -PD works as the manager in a TiDB cluster, and it also schedules Regions in the -cluster. This article introduces the design and core concepts of the PD scheduling component. 
+PD works as the manager in a TiDB cluster, and it also schedules Regions in the cluster. This article introduces the design and core concepts of the PD scheduling component. ## Scheduling situations -TiKV is the distributed K/V storage engine used by TiDB. In TiKV, data is -organized as Regions, which are replicated on serveral Stores. In all replicas, -Leader is responsible for reading and writing, Followers are responsible for -replicating Raft logs from the Leader. +TiKV is the distributed K/V storage engine used by TiDB. In TiKV, data is organized as Regions, which are replicated on several Stores. In all replicas, Leader is responsible for reading and writing, and Followers are responsible for replicating Raft logs from the Leader. Now consider about the following situations: @@ -22,23 +18,16 @@ Now consider about the following situations: * When a new TiKV store is added, data can be rebalanced to it; * When a TiKV store fails, PD needs to consider: * Recovery time of the failed store. - * If it's short (e.g. the service is restarted), scheduling is necessary - or not. + * If it's short (e.g. the service is restarted), scheduling is necessary or not. * If it's long (e.g. disk fault, data is lost, etc.), how to do scheduling. * Replicas of all Regions. * If replicas are not enough for some Regions, the cluster needs to complete them. - * If replicas are more than expected (e.g. failed store re-joins into the - cluster after recovery), the cluster needs to delete them. -* Read/Write operations are performed on leaders, which should not be distributed - only on individual stores; -* Not all Regions are hot, so load of all TiKV stores needs to be balanced. -* When Regions are in balancing, data transferring utilizes much network/disk - traffic and CPU time, which can influence online services. - -These situations can occur at the same time, which makes it harder to resolve. -Also, the whole system is changing dynamically, so a new component is necessary -to collect all information about the cluster, and then adjust the cluster. So, -PD is introduced into the TiDB cluster. + * If replicas are more than expected (e.g. failed store re-joins into the cluster after recovery), the cluster needs to delete them. +* Read/Write operations are performed on leaders, which should not be distributed only on individual stores; +* Not all Regions are hot, so load of all TiKV stores needs to be balanced; +* When Regions are in balancing, data transferring utilizes much network/disk traffic and CPU time, which can influence online services. + +These situations can occur at the same time, which makes it harder to resolve. Also, the whole system is changing dynamically, so a new component is necessary to collect all information about the cluster, and then adjust the cluster. So, PD is introduced into the TiDB cluster. ## Scheduling requirements @@ -58,63 +47,51 @@ The above situations can be classified into two types: * Speed of Region balance needs to be limited to ensure online services are stable; * It's possible to make peers online/offline manually. -After the first class requirements are satisfied, the system will be failure tolerable. -After the second class requirements are satisfied, resources will be utilized more -efficent and the system will become well expandable. +After the first type of requirements are satisfied, the system will be failure tolerable. After the second type of requirements are satisfied, resources will be utilized more efficiently and the system will have better scalability. 
-To achieve these targets, PD needs to collect informations firstly, such as state of peers, -informations about Raft groups and statistics of peers' accession. Then we can specify -some strategies on PD, so that PD can make sheculing plans from these information and -strategies. Finally, PD will distribute some operators to TiKVs to complete scheduling plans. +To achieve these targets, PD needs to collect information firstly, such as state of peers, information about Raft groups and peers' accessing statistics. Then we can specify some strategies on PD, so that PD can make scheduling plans from these information and strategies. Finally, PD distributes some operators to TiKVs to complete scheduling plans. -## Basic Schedule operators +## Basic scheduling operators -All scheduling plan contain 3 basic operators: +All scheduling plans contain three basic operators: * Add a new replica * Remove a replica -* Transfer a Region leader between replicas +* Transfer a Region leader between replicas in a Raft group -They are implemented by Raft command `AddReplica`, `RemoveReplica` and `TransferLeader`. +They are implemented by the Raft commands `AddReplica`, `RemoveReplica`, and `TransferLeader`. -## Information collecting +## Information collection -Scheduling is based on information collecting. In one word, scheduling needs to know -states of all TiKV peers and all Regions. TiKV peers report those information to PD. +Scheduling is based on information collection. In short, the PD scheduling component needs to know the states of all TiKV peers and all Regions. TiKV peers report these information to PD: -** Information reported by TiKV peers ** +- State information reported by each TiKV peer: -TiKV sends heartbeats to PD periodically. PD can not only check the store is active -nor not, but also collect [`StoreState`](https://github.com/pingcap/kvproto/blob/release-3.1/proto/pdpb.proto#L421) -in the message. `StoreState` includes + TiKV sends heartbeats to PD periodically. PD not only checks whether the Store is alive, but also collects [`StoreState`](https://github.com/pingcap/kvproto/blob/release-3.1/proto/pdpb.proto#L421) in the message. `StoreState` includes: -* Total disk space -* Available disk space -* Region count -* Data read/write speed -* Send/receive snapshot count -* It's overload or not -* Labels (See [Perception of Topology](/location-awareness.md)) + * Total disk space + * Available disk space + * Region count + * Data read/write speed + * Send/receive snapshot count + * It's overload or not + * Labels (See [Perception of Topology](/location-awareness.md)) -** Information reported by Region leaders ** +- Information reported by Region leaders: -Region leader send heartbeaets to PD periodically to report [`RegionState`](https://github.com/pingcap/kvproto/blob/release-3.1/proto/pdpb.proto#L271), -includes + Region leader send heartbeaets to PD periodically to report [`RegionState`](https://github.com/pingcap/kvproto/blob/release-3.1/proto/pdpb.proto#L271), + includes -* Position of the leader itself -* Positions of other replicas -* Offline replicas count -* data read/write speed + * Position of the leader itself + * Positions of other replicas + * Offline replicas count + * data read/write speed PD collects cluster information by these 2 type heartbeats and then makes dicision based on it. -Beside these, PD can get more information from expanded interface. For example, -if a store's heartbeats are broken, PD can't know the peer steps down temporarily or forever. 
-It just waits a while (by default 30min) and then treats the store become offline if there -are still no heartbeats received. Then PD balances all regions on the store to other stores. +Beside these, PD can get more information from expanded interface. For example, if a store's heartbeats are broken, PD can't know the peer steps down temporarily or forever. It just waits a while (by default 30min) and then treats the store become offline if there are still no heartbeats received. Then PD balances all regions on the store to other stores. -But sometimes stores are set offline by maintainers manually, so that we can tell PD this by -PD control interface. Then PD can balance all regions immediately. +But sometimes stores are set offline by maintainers manually, so that we can tell PD this by PD control interface. Then PD can balance all regions immediately. ## Scheduling stretagies @@ -122,9 +99,7 @@ PD needs some stretagies to make scheduling plans. ** Replicas count of Regions need to be correct ** -PD can know replica count of a Region is incorrect from Region leader's heartbeat. If it happens, -PD can adjust replica count by add/rmeove replica operation. The reason of incorrect replica counts -could be: +PD can know replica count of a Region is incorrect from Region leader's heartbeat. If it happens, PD can adjust replica count by add/remove replica operation. The reason of incorrect replica counts could be: * Store failure, so some Region's replica count will be less than expected; * Store recovery after failure, so some Region's replica count could be more than expected; @@ -132,19 +107,13 @@ could be: ** Replicas of a Region need to be at different positions ** -Please note that 'position' is different from 'machine'. Generally PD can only ensure that -replicas of a Region won't be at a same peer to avoid the peer's failure cause more than one -replicas become lost. However in production, these requirements are possible: +Please note that 'position' is different from 'machine'. Generally PD can only ensure that replicas of a Region won't be at a same peer to avoid the peer's failure cause more than one replicas become lost. However in production, these requirements are possible: * Multiple TiKV peers are on one machine; * TiKVs are on multiple racks, and the system is expected to be available even if a rack fails; * TiKVs are in multiple datacenters, and the system is expected to be available even if a datacenter fails; -The key of there requirements is that peers can have same 'position', which is the smallest unit -for failure-toleration. Replicas of a Region shouldn't be in one unit. So, we can configure -[labels](https://github.com/tikv/tikv/blob/v4.0.0-beta/etc/config-template.toml#L140) for TiKVs, -and set [location-labels](https://github.com/pingcap/pd/blob/v4.0.0-beta/conf/config.toml#L100) on PD -to specify which labels are used for marking positions. +The key of there requirements is that peers can have same 'position', which is the smallest unit for failure-toleration. Replicas of a Region shouldn't be in one unit. So, we can configure [labels](https://github.com/tikv/tikv/blob/v4.0.0-beta/etc/config-template.toml#L140) for TiKVs, and set [location-labels](https://github.com/pingcap/pd/blob/v4.0.0-beta/conf/config.toml#L100) on PD to specify which labels are used for marking positions. 
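For example, for a deployment that spans three racks, the TiKV and PD configurations might look roughly like the following sketch (the label keys `zone`, `rack`, `host` and all values are illustrative, not required names):

```toml
# TiKV configuration: each store declares its own position
[server]
labels = { zone = "z1", rack = "rack-1", host = "tikv-1" }
```

```toml
# PD configuration: which label keys form a position, from coarsest to finest
[replication]
max-replicas = 3
location-labels = ["zone", "rack", "host"]
```

With such settings, PD tries to place the replicas of each Region on different racks whenever possible, so a single rack failure loses at most one replica of any Region.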
** Replicas should be balanced between stores ** @@ -152,8 +121,7 @@ Size limit of a Region is fixed, so make Region count be balanced between store ** Leaders should be balanced between stores ** -Read and write operations are performed on leaders in Raft. So PD needs to distributed leader into whole cluster -instead of serveral peers. +Read and write operations are performed on leaders in Raft. So PD needs to distributed leader into whole cluster instead of serveral peers. ** Hot points should be balanced between stores ** @@ -161,22 +129,14 @@ PD can detect hot points from store heartbets and Region heartbeats. So PD can d ** Storage size needs to be balanced between stores ** -TiKV reports `capacity` of storage when it starts up, which indicates the store's space limit. PD will consider this -when doing schedule. +TiKV reports `capacity` of storage when it starts up, which indicates the store's space limit. PD will consider this when doing schedule. ** Adjust scheduling speed to stabilize online services ** -Scheduling utilizes CPU, memory, network and I/O traffic. Too much resource utilization will influence -online services. So PD needs to limit concurrent scheduling count. By default the strategy is conservative, -while it can be changed if quicker scheduling is required. +Scheduling utilizes CPU, memory, network and I/O traffic. Too much resource utilization will influence online services. So PD needs to limit concurrent scheduling count. By default the strategy is conservative, while it can be changed if quicker scheduling is required. ## Scheduling implementation -PD collects cluster information from store heartbeats and Region heartbeats, and then makes scheduling plan -from the information and stretagies. Scheduling plans are constructed by a sequence of basic operators. -Every time when PD receives a region heartbeat from a Region leader, it checks whether there is a pending -operator on the Region or not. If PD needs to dispatch a new operator to a Region, it puts the operator into -heartbeat responses, and monitors the operator by checking follow-up Region heartbeats. +PD collects cluster information from store heartbeats and Region heartbeats, and then makes scheduling plan from the information and stretagies. Scheduling plans are constructed by a sequence of basic operators. Every time when PD receives a region heartbeat from a Region leader, it checks whether there is a pending operator on the Region or not. If PD needs to dispatch a new operator to a Region, it puts the operator into heartbeat responses, and monitors the operator by checking follow-up Region heartbeats. -Note that operators are only suggestions, which could be skipeed by Regions. Leader of Regions can decide -whether to step a scheduling operator or not based on its current status. +Note that operators are only suggestions, which could be skipeed by Regions. Leader of Regions can decide whether to step a scheduling operator or not based on its current status. From 7d36496511a9d88c0fe0a5ea3b3d97510b25cb54 Mon Sep 17 00:00:00 2001 From: yikeke Date: Mon, 20 Jul 2020 18:12:14 +0800 Subject: [PATCH 6/8] Update tidb-scheduling.md --- tidb-scheduling.md | 61 +++++++++++++++++++++++----------------------- 1 file changed, 30 insertions(+), 31 deletions(-) diff --git a/tidb-scheduling.md b/tidb-scheduling.md index 59896322e3d79..0a139136d9d21 100644 --- a/tidb-scheduling.md +++ b/tidb-scheduling.md @@ -71,72 +71,71 @@ Scheduling is based on information collection. 
In short, the PD scheduling compo * Total disk space * Available disk space - * Region count + * The number of Regions * Data read/write speed - * Send/receive snapshot count - * It's overload or not + * The number of snapshots that are sent/received (The data might be replicated between replicas through snapshots) + * Whether the store is overloaded * Labels (See [Perception of Topology](/location-awareness.md)) - Information reported by Region leaders: - Region leader send heartbeaets to PD periodically to report [`RegionState`](https://github.com/pingcap/kvproto/blob/release-3.1/proto/pdpb.proto#L271), - includes + Region leader sends heartbeats to PD periodically to report [`RegionState`](https://github.com/pingcap/kvproto/blob/release-3.1/proto/pdpb.proto#L271), including: * Position of the leader itself * Positions of other replicas - * Offline replicas count + * The number of offline replicas * data read/write speed -PD collects cluster information by these 2 type heartbeats and then makes dicision based on it. +PD collects cluster information by these 2 types of heartbeats and then makes decision based on it. -Beside these, PD can get more information from expanded interface. For example, if a store's heartbeats are broken, PD can't know the peer steps down temporarily or forever. It just waits a while (by default 30min) and then treats the store become offline if there are still no heartbeats received. Then PD balances all regions on the store to other stores. +Besides, PD can get more information from an expanded interface to make a more precise decision. For example, if a store's heartbeats are broken, PD can't know whether the peer steps down temporarily or forever. It just waits a while (by default 30min) and then treats the store as offline if there are still no heartbeats received. Then PD balances all regions on the store to other stores. -But sometimes stores are set offline by maintainers manually, so that we can tell PD this by PD control interface. Then PD can balance all regions immediately. +But sometimes stores are manually set offline by a maintainer, so that the maintainer can tell PD this by PD control interface. Then PD can balance all regions immediately. -## Scheduling stretagies +## Scheduling strategies -PD needs some stretagies to make scheduling plans. +After collecting the information, PD needs some strategies to make scheduling plans. -** Replicas count of Regions need to be correct ** +**Strategy 1: The number of replicas of a Region needs to be correct** -PD can know replica count of a Region is incorrect from Region leader's heartbeat. If it happens, PD can adjust replica count by add/remove replica operation. The reason of incorrect replica counts could be: +PD can know that the replica count of a Region is incorrect from Region leader's heartbeat. If it happens, PD can adjust the replica count by adding/removing replica(s). The reason for incorrect replica count could be: -* Store failure, so some Region's replica count will be less than expected; +* Store failure, so some Region's replica count is less than expected; * Store recovery after failure, so some Region's replica count could be more than expected; * [`max-replicas`](https://github.com/pingcap/pd/blob/v4.0.0-beta/conf/config.toml#L95) is changed. -** Replicas of a Region need to be at different positions ** +**Strategy 2: Replicas of a Region need to be at different positions** -Please note that 'position' is different from 'machine'. 
Generally PD can only ensure that replicas of a Region won't be at a same peer to avoid the peer's failure cause more than one replicas become lost. However in production, these requirements are possible: +Note that here "position" is different from "machine". Generally PD can only ensure that replicas of a Region are not at a same peer to avoid that the peer's failure causes more than one replicas to become lost. However in production, you might have the following requirements: * Multiple TiKV peers are on one machine; -* TiKVs are on multiple racks, and the system is expected to be available even if a rack fails; -* TiKVs are in multiple datacenters, and the system is expected to be available even if a datacenter fails; +* TiKV peers are on multiple racks, and the system is expected to be available even if a rack fails; +* TiKV peers are in multiple data centers, and the system is expected to be available even if a data center fails; -The key of there requirements is that peers can have same 'position', which is the smallest unit for failure-toleration. Replicas of a Region shouldn't be in one unit. So, we can configure [labels](https://github.com/tikv/tikv/blob/v4.0.0-beta/etc/config-template.toml#L140) for TiKVs, and set [location-labels](https://github.com/pingcap/pd/blob/v4.0.0-beta/conf/config.toml#L100) on PD to specify which labels are used for marking positions. +The key to these requirements is that peers can have the same "position", which is the smallest unit for failure toleration. Replicas of a Region must not be in one unit. So, we can configure [labels](https://github.com/tikv/tikv/blob/v4.0.0-beta/etc/config-template.toml#L140) for the TiKV peers, and set [location-labels](https://github.com/pingcap/pd/blob/v4.0.0-beta/conf/config.toml#L100) on PD to specify which labels are used for marking positions. -** Replicas should be balanced between stores ** +**Strategy 3: Replicas need to be balanced between stores** -Size limit of a Region is fixed, so make Region count be balanced between store is helpful for data size balance. +The size limit of a Region replica is fixed, so keeping the replicas balanced between stores is helpful for data size balance. -** Leaders should be balanced between stores ** +**Strategy 4: Leaders need to be balanced between stores** -Read and write operations are performed on leaders in Raft. So PD needs to distributed leader into whole cluster instead of serveral peers. +Read and write operations are performed on leaders according to the Raft protocol, so that PD needs to distribute leaders into the whole cluster instead of several peers. -** Hot points should be balanced between stores ** +**Strategy 5: Hot spots need to be balanced between stores** -PD can detect hot points from store heartbets and Region heartbeats. So PD can disturb hot points. +PD can detect hot spots from store heartbeats and Region heartbeats, so that PD can distribute hot spots. -** Storage size needs to be balanced between stores ** +**Strategy 6: Storage size needs to be balanced between stores** -TiKV reports `capacity` of storage when it starts up, which indicates the store's space limit. PD will consider this when doing schedule. +When started up, a TiKV store reports `capacity` of storage, which indicates the store's space limit. PD will consider this when scheduling. -** Adjust scheduling speed to stabilize online services ** +**Strategy 7: Adjust scheduling speed to stabilize online services** -Scheduling utilizes CPU, memory, network and I/O traffic. 
Too much resource utilization will influence online services. So PD needs to limit concurrent scheduling count. By default the strategy is conservative, while it can be changed if quicker scheduling is required. +Scheduling utilizes CPU, memory, network and I/O traffic. Too much resource utilization will influence online services. Therefore, PD needs to limit the number of the concurrent scheduling tasks. By default this strategy is conservative, while it can be changed if quicker scheduling is required. ## Scheduling implementation -PD collects cluster information from store heartbeats and Region heartbeats, and then makes scheduling plan from the information and stretagies. Scheduling plans are constructed by a sequence of basic operators. Every time when PD receives a region heartbeat from a Region leader, it checks whether there is a pending operator on the Region or not. If PD needs to dispatch a new operator to a Region, it puts the operator into heartbeat responses, and monitors the operator by checking follow-up Region heartbeats. +PD collects cluster information from store heartbeats and Region heartbeats, and then makes scheduling plans from the information and strategies. Scheduling plans are a sequence of basic operators. Every time PD receives a Region heartbeat from a Region leader, it checks whether there is a pending operator on the Region or not. If PD needs to dispatch a new operator to a Region, it puts the operator into heartbeat responses, and monitors the operator by checking follow-up Region heartbeats. -Note that operators are only suggestions, which could be skipeed by Regions. Leader of Regions can decide whether to step a scheduling operator or not based on its current status. +Note that here "operators" are only suggestions to the Region leader, which can be skipped by Regions. Leader of Regions can decide whether to skip a scheduling operator or not based on its current status. From 0a3f2a878a9021b27b88372e242cbb77b7b70697 Mon Sep 17 00:00:00 2001 From: yikeke Date: Mon, 20 Jul 2020 18:23:08 +0800 Subject: [PATCH 7/8] Update tidb-scheduling.md --- tidb-scheduling.md | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/tidb-scheduling.md b/tidb-scheduling.md index 0a139136d9d21..74cb7e0aa2f93 100644 --- a/tidb-scheduling.md +++ b/tidb-scheduling.md @@ -9,7 +9,7 @@ PD works as the manager in a TiDB cluster, and it also schedules Regions in the ## Scheduling situations -TiKV is the distributed K/V storage engine used by TiDB. In TiKV, data is organized as Regions, which are replicated on several Stores. In all replicas, Leader is responsible for reading and writing, and Followers are responsible for replicating Raft logs from the Leader. +TiKV is the distributed key-value storage engine used by TiDB. In TiKV, data is organized as Regions, which are replicated on several stores. In all replicas, a leader is responsible for reading and writing, and followers are responsible for replicating Raft logs from the leader. Now consider about the following situations: @@ -18,16 +18,16 @@ Now consider about the following situations: * When a new TiKV store is added, data can be rebalanced to it; * When a TiKV store fails, PD needs to consider: * Recovery time of the failed store. - * If it's short (e.g. the service is restarted), scheduling is necessary or not. + * If it's short (e.g. the service is restarted), whether scheduling is necessary or not. * If it's long (e.g. disk fault, data is lost, etc.), how to do scheduling. 
* Replicas of all Regions. - * If replicas are not enough for some Regions, the cluster needs to complete them. - * If replicas are more than expected (e.g. failed store re-joins into the cluster after recovery), the cluster needs to delete them. -* Read/Write operations are performed on leaders, which should not be distributed only on individual stores; -* Not all Regions are hot, so load of all TiKV stores needs to be balanced; + * If the number of replicas is not enough for some Regions, PD needs to complete them. + * If the number of replicas is more than expected (e.g. the failed store re-joins into the cluster after recovery), PD needs to delete them. +* Read/Write operations are performed on leaders, which can not be distributed only on a few individual stores; +* Not all Regions are hot, so loads of all TiKV stores need to be balanced; * When Regions are in balancing, data transferring utilizes much network/disk traffic and CPU time, which can influence online services. -These situations can occur at the same time, which makes it harder to resolve. Also, the whole system is changing dynamically, so a new component is necessary to collect all information about the cluster, and then adjust the cluster. So, PD is introduced into the TiDB cluster. +These situations can occur at the same time, which makes it harder to resolve. Also, the whole system is changing dynamically, so a scheduler is needed to collect all information about the cluster, and then adjust the cluster. So, PD is introduced into the TiDB cluster. ## Scheduling requirements @@ -67,7 +67,7 @@ Scheduling is based on information collection. In short, the PD scheduling compo - State information reported by each TiKV peer: - TiKV sends heartbeats to PD periodically. PD not only checks whether the Store is alive, but also collects [`StoreState`](https://github.com/pingcap/kvproto/blob/release-3.1/proto/pdpb.proto#L421) in the message. `StoreState` includes: + TiKV sends heartbeats to PD periodically. PD not only checks whether the store is alive, but also collects [`StoreState`](https://github.com/pingcap/kvproto/blob/release-3.1/proto/pdpb.proto#L421) in the message. `StoreState` includes: * Total disk space * Available disk space From 83d29a815c462d29f13937c85cbe31d2307fe186 Mon Sep 17 00:00:00 2001 From: yikeke Date: Mon, 20 Jul 2020 18:32:50 +0800 Subject: [PATCH 8/8] Update tidb-scheduling.md --- tidb-scheduling.md | 28 ++++++++++++++-------------- 1 file changed, 14 insertions(+), 14 deletions(-) diff --git a/tidb-scheduling.md b/tidb-scheduling.md index 74cb7e0aa2f93..dd0d645dbfb3e 100644 --- a/tidb-scheduling.md +++ b/tidb-scheduling.md @@ -36,20 +36,20 @@ The above situations can be classified into two types: 1. A distributed and highly available storage system must meet the following requirements: * The right number of replicas. - * Replicas should be distributed on different machines according to different topologies. - * The cluster can perform automatic disaster recovery from TiKV peers failure. + * Replicas need to be distributed on different machines according to different topologies. + * The cluster can perform automatic disaster recovery from TiKV peers' failure. 2. 
A good distributed system needs to have the following optimizations:

    * All Region leaders are distributed evenly on stores;
    * Storage capacity of all TiKV peers is balanced;
    * Hot spots are balanced;
    * Speed of load balancing for the Regions needs to be limited to ensure that online services are stable;
    * Maintainers are able to take peers online/offline manually.

After the first type of requirements is satisfied, the system will be failure-tolerant. After the second type of requirements is satisfied, resources will be utilized more efficiently and the system will have better scalability.

To achieve the goals, PD firstly needs to collect information, such as the state of peers, information about Raft groups, and the statistics of accessing the peers. Then we need to specify some strategies for PD, so that PD can make scheduling plans from this information and these strategies. Finally, PD distributes some operators to TiKV peers to complete the scheduling plans.

## Basic scheduling operators

All scheduling plans contain three basic operators:

* Add a new replica
* Remove a replica
* Transfer a Region leader between replicas in a Raft group

They are implemented by the Raft commands `AddReplica`, `RemoveReplica`, and `TransferLeader`.
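In daily operation these operators are generated by PD itself, but they can also be issued by hand through the PD control tool when debugging or repairing a cluster. A rough sketch, assuming pd-ctl is available and using illustrative Region and store IDs:

```bash
# Connect pd-ctl to a PD endpoint (the address is illustrative)
pd-ctl -u http://127.0.0.1:2379

# Inside the pd-ctl shell: add a replica of Region 2 on store 5
>> operator add add-peer 2 5

# Remove the replica of Region 2 from store 1
>> operator add remove-peer 2 1

# Transfer the leadership of Region 2 to the replica on store 5
>> operator add transfer-leader 2 5

# List the operators that are currently pending
>> operator show
```

Operators added this way are dispatched like PD's own: they are attached to the heartbeat responses of the target Region leader, as described in the scheduling implementation section.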
In short, the PD scheduling compo - Information reported by Region leaders: - Region leader sends heartbeats to PD periodically to report [`RegionState`](https://github.com/pingcap/kvproto/blob/release-3.1/proto/pdpb.proto#L271), including: + Each Region leader sends heartbeats to PD periodically to report [`RegionState`](https://github.com/pingcap/kvproto/blob/release-3.1/proto/pdpb.proto#L271), including: * Position of the leader itself * Positions of other replicas * The number of offline replicas * data read/write speed -PD collects cluster information by these 2 types of heartbeats and then makes decision based on it. +PD collects cluster information by the two types of heartbeats and then makes decision based on it. Besides, PD can get more information from an expanded interface to make a more precise decision. For example, if a store's heartbeats are broken, PD can't know whether the peer steps down temporarily or forever. It just waits a while (by default 30min) and then treats the store as offline if there are still no heartbeats received. Then PD balances all regions on the store to other stores. -But sometimes stores are manually set offline by a maintainer, so that the maintainer can tell PD this by PD control interface. Then PD can balance all regions immediately. +But sometimes stores are manually set offline by a maintainer, so the maintainer can tell PD this by the PD control interface. Then PD can balance all regions immediately. ## Scheduling strategies @@ -98,7 +98,7 @@ After collecting the information, PD needs some strategies to make scheduling pl **Strategy 1: The number of replicas of a Region needs to be correct** -PD can know that the replica count of a Region is incorrect from Region leader's heartbeat. If it happens, PD can adjust the replica count by adding/removing replica(s). The reason for incorrect replica count could be: +PD can know that the replica count of a Region is incorrect from the Region leader's heartbeat. If it happens, PD can adjust the replica count by adding/removing replica(s). The reason for incorrect replica count can be: * Store failure, so some Region's replica count is less than expected; * Store recovery after failure, so some Region's replica count could be more than expected;