From 0ad389a09e91886a5ea5da91bd42ce6bb246df59 Mon Sep 17 00:00:00 2001 From: ireneontheway <48651140+ireneontheway@users.noreply.github.com> Date: Thu, 17 Sep 2020 13:10:05 +0800 Subject: [PATCH 1/2] cherry pick #3825 to release-2.1 Signed-off-by: ti-srebot --- TOC.md | 140 ++++++++ configure-placement-rules.md | 436 ++++++++++++++++++++++++ pd-configuration-file.md | 275 +++++++++++++++ schedule-replicas-by-topology-labels.md | 194 +++++++++++ tidb-scheduling.md | 141 ++++++++ 5 files changed, 1186 insertions(+) create mode 100644 configure-placement-rules.md create mode 100644 schedule-replicas-by-topology-labels.md create mode 100644 tidb-scheduling.md diff --git a/TOC.md b/TOC.md index 03bd128ed1420..2bdc6a3731709 100644 --- a/TOC.md +++ b/TOC.md @@ -137,6 +137,7 @@ - [`SET`](/data-type-string.md#set-type) - [JSON Type](/data-type-json.md) + Functions and Operators +<<<<<<< HEAD - [Function and Operator Reference](/functions-and-operators/functions-and-operators-overview.md) - [Type Conversion in Expression Evaluation](/functions-and-operators/type-conversion-in-expression-evaluation.md) - [Operators](/functions-and-operators/operators.md) @@ -320,6 +321,145 @@ - [PD Recover](/pd-recover.md) - [TiKV Control](/tikv-control.md) - [TiDB Control](/tidb-control.md) +======= + + [Overview](/functions-and-operators/functions-and-operators-overview.md) + + [Type Conversion in Expression Evaluation](/functions-and-operators/type-conversion-in-expression-evaluation.md) + + [Operators](/functions-and-operators/operators.md) + + [Control Flow Functions](/functions-and-operators/control-flow-functions.md) + + [String Functions](/functions-and-operators/string-functions.md) + + [Numeric Functions and Operators](/functions-and-operators/numeric-functions-and-operators.md) + + [Date and Time Functions](/functions-and-operators/date-and-time-functions.md) + + [Bit Functions and Operators](/functions-and-operators/bit-functions-and-operators.md) + + [Cast Functions and Operators](/functions-and-operators/cast-functions-and-operators.md) + + [Encryption and Compression Functions](/functions-and-operators/encryption-and-compression-functions.md) + + [Information Functions](/functions-and-operators/information-functions.md) + + [JSON Functions](/functions-and-operators/json-functions.md) + + [Aggregate (GROUP BY) Functions](/functions-and-operators/aggregate-group-by-functions.md) + + [Window Functions](/functions-and-operators/window-functions.md) + + [Miscellaneous Functions](/functions-and-operators/miscellaneous-functions.md) + + [Precision Math](/functions-and-operators/precision-math.md) + + [List of Expressions for Pushdown](/functions-and-operators/expressions-pushed-down.md) + + [Constraints](/constraints.md) + + [Generated Columns](/generated-columns.md) + + [SQL Mode](/sql-mode.md) + + Transactions + + [Overview](/transaction-overview.md) + + [Isolation Levels](/transaction-isolation-levels.md) + + [Optimistic Transactions](/optimistic-transaction.md) + + [Pessimistic Transactions](/pessimistic-transaction.md) + + Garbage Collection (GC) + + [Overview](/garbage-collection-overview.md) + + [Configuration](/garbage-collection-configuration.md) + + [Views](/views.md) + + [Partitioning](/partitioned-table.md) + + [Character Set and Collation](/character-set-and-collation.md) + + System Tables + + [`mysql`](/mysql-schema.md) + + INFORMATION_SCHEMA + + [Overview](/information-schema/information-schema.md) + + [`ANALYZE_STATUS`](/information-schema/information-schema-analyze-status.md) + + 
[`CHARACTER_SETS`](/information-schema/information-schema-character-sets.md) + + [`CLUSTER_CONFIG`](/information-schema/information-schema-cluster-config.md) + + [`CLUSTER_HARDWARE`](/information-schema/information-schema-cluster-hardware.md) + + [`CLUSTER_INFO`](/information-schema/information-schema-cluster-info.md) + + [`CLUSTER_LOAD`](/information-schema/information-schema-cluster-load.md) + + [`CLUSTER_LOG`](/information-schema/information-schema-cluster-log.md) + + [`CLUSTER_SYSTEMINFO`](/information-schema/information-schema-cluster-systeminfo.md) + + [`COLLATIONS`](/information-schema/information-schema-collations.md) + + [`COLLATION_CHARACTER_SET_APPLICABILITY`](/information-schema/information-schema-collation-character-set-applicability.md) + + [`COLUMNS`](/information-schema/information-schema-columns.md) + + [`DDL_JOBS`](/information-schema/information-schema-ddl-jobs.md) + + [`ENGINES`](/information-schema/information-schema-engines.md) + + [`INSPECTION_RESULT`](/information-schema/information-schema-inspection-result.md) + + [`INSPECTION_RULES`](/information-schema/information-schema-inspection-rules.md) + + [`INSPECTION_SUMMARY`](/information-schema/information-schema-inspection-summary.md) + + [`KEY_COLUMN_USAGE`](/information-schema/information-schema-key-column-usage.md) + + [`METRICS_SUMMARY`](/information-schema/information-schema-metrics-summary.md) + + [`METRICS_TABLES`](/information-schema/information-schema-metrics-tables.md) + + [`PARTITIONS`](/information-schema/information-schema-partitions.md) + + [`PROCESSLIST`](/information-schema/information-schema-processlist.md) + + [`SCHEMATA`](/information-schema/information-schema-schemata.md) + + [`SEQUENCES`](/information-schema/information-schema-sequences.md) + + [`SESSION_VARIABLES`](/information-schema/information-schema-session-variables.md) + + [`SLOW_QUERY`](/information-schema/information-schema-slow-query.md) + + [`STATISTICS`](/information-schema/information-schema-statistics.md) + + [`TABLES`](/information-schema/information-schema-tables.md) + + [`TABLE_CONSTRAINTS`](/information-schema/information-schema-table-constraints.md) + + [`TABLE_STORAGE_STATS`](/information-schema/information-schema-table-storage-stats.md) + + [`TIDB_HOT_REGIONS`](/information-schema/information-schema-tidb-hot-regions.md) + + [`TIDB_INDEXES`](/information-schema/information-schema-tidb-indexes.md) + + [`TIDB_SERVERS_INFO`](/information-schema/information-schema-tidb-servers-info.md) + + [`TIFLASH_REPLICA`](/information-schema/information-schema-tiflash-replica.md) + + [`TIKV_REGION_PEERS`](/information-schema/information-schema-tikv-region-peers.md) + + [`TIKV_REGION_STATUS`](/information-schema/information-schema-tikv-region-status.md) + + [`TIKV_STORE_STATUS`](/information-schema/information-schema-tikv-store-status.md) + + [`USER_PRIVILEGES`](/information-schema/information-schema-user-privileges.md) + + [`VIEWS`](/information-schema/information-schema-views.md) + + [`METRICS_SCHEMA`](/metrics-schema.md) + + UI + + TiDB Dashboard + + [Overview](/dashboard/dashboard-intro.md) + + Maintain + + [Deploy](/dashboard/dashboard-ops-deploy.md) + + [Reverse Proxy](/dashboard/dashboard-ops-reverse-proxy.md) + + [Secure](/dashboard/dashboard-ops-security.md) + + [Access](/dashboard/dashboard-access.md) + + [Overview Page](/dashboard/dashboard-overview.md) + + [Cluster Info Page](/dashboard/dashboard-cluster-info.md) + + [Key Visualizer Page](/dashboard/dashboard-key-visualizer.md) + + SQL Statements Analysis + + [SQL Statements 
Page](/dashboard/dashboard-statement-list.md) + + [SQL Details Page](/dashboard/dashboard-statement-details.md) + + [Slow Queries Page](/dashboard/dashboard-slow-query.md) + + Cluster Diagnostics + + [Access Cluster Diagnostics Page](/dashboard/dashboard-diagnostics-access.md) + + [View Diagnostics Report](/dashboard/dashboard-diagnostics-report.md) + + [Use Diagnostics](/dashboard/dashboard-diagnostics-usage.md) + + [Search Logs Page](/dashboard/dashboard-log-search.md) + + [Profile Instances Page](/dashboard/dashboard-profiling.md) + + [FAQ](/dashboard/dashboard-faq.md) + + CLI + + [tikv-ctl](/tikv-control.md) + + [pd-ctl](/pd-control.md) + + [tidb-ctl](/tidb-control.md) + + [pd-recover](/pd-recover.md) + + Command Line Flags + + [tidb-server](/command-line-flags-for-tidb-configuration.md) + + [tikv-server](/command-line-flags-for-tikv-configuration.md) + + [tiflash-server](/tiflash/tiflash-command-line-flags.md) + + [pd-server](/command-line-flags-for-pd-configuration.md) + + Configuration File Parameters + + [tidb-server](/tidb-configuration-file.md) + + [tikv-server](/tikv-configuration-file.md) + + [tiflash-server](/tiflash/tiflash-configuration.md) + + [pd-server](/pd-configuration-file.md) + + [System Variables](/system-variables.md) + + Storage Engines + + TiKV + + [TiKV Overview](/tikv-overview.md) + + [RocksDB Overview](/storage-engine/rocksdb-overview.md) + + TiFlash + + [Overview](/tiflash/tiflash-overview.md) + + [Use TiFlash](/tiflash/use-tiflash.md) + + TiUP + + [Documentation Guide](/tiup/tiup-documentation-guide.md) + + [Overview](/tiup/tiup-overview.md) + + [Terminology and Concepts](/tiup/tiup-terminology-and-concepts.md) + + [Manage TiUP Components](/tiup/tiup-component-management.md) + + [FAQ](/tiup/tiup-faq.md) + + [Troubleshooting Guide](/tiup/tiup-troubleshooting-guide.md) + + TiUP Components + + [tiup-playground](/tiup/tiup-playground.md) + + [tiup-cluster](/tiup/tiup-cluster.md) + + [tiup-mirror](/tiup/tiup-mirror.md) + + [tiup-bench](/tiup/tiup-bench.md) + + [Telemetry](/telemetry.md) + + [Errors Codes](/error-codes.md) + + [TiCDC Overview](/ticdc/ticdc-overview.md) + + [TiCDC Open Protocol](/ticdc/ticdc-open-protocol.md) + + [Table Filter](/table-filter.md) + + [Schedule Replicas by Topology Labels](/schedule-replicas-by-topology-labels.md) +>>>>>>> 24f31fb... improve location-awareness (#3825) + FAQs - [TiDB FAQs](/faq/tidb-faq.md) - [TiDB Lightning FAQs](/tidb-lightning/tidb-lightning-faq.md) diff --git a/configure-placement-rules.md b/configure-placement-rules.md new file mode 100644 index 0000000000000..ddaa9608bbbe7 --- /dev/null +++ b/configure-placement-rules.md @@ -0,0 +1,436 @@ +--- +title: Placement Rules +summary: Learn how to configure Placement Rules. +aliases: ['/docs/dev/configure-placement-rules/','/docs/dev/how-to/configure/placement-rules/'] +--- + +# Placement Rules + +> **Warning:** +> +> In the scenario of using TiFlash, the Placement Rules feature has been extensively tested and can be used in the production environment. Except for the scenario where TiFlash is used, using Placement Rules alone has not been extensively tested, so it is not recommended to enable this feature separately in the production environment. + +Placement Rules is an experimental feature of the Placement Driver (PD) introduced in v4.0. It is a replica rule system that guides PD to generate corresponding schedules for different types of data. 
By combining different scheduling rules, you can finely control the attributes of any continuous data range, such as the number of replicas, the storage location, the host type, whether to participate in Raft election, and whether to act as the Raft leader. + +## Rule system + +The configuration of the whole rule system consists of multiple rules. Each rule can specify attributes such as the number of replicas, the Raft role, the placement location, and the key range in which this rule takes effect. When PD is performing schedule, it first finds the rule corresponding to the Region in the rule system according to the key range of the Region, and then generates the corresponding schedule to make the distribution of the Region replica comply with the rule. + +The key ranges of multiple rules can have overlapping parts, which means that a Region can match multiple rules. In this case, PD decides whether the rules overwrite each other or take effect at the same time according to the attributes of rules. If multiple rules take effect at the same time, PD will generate schedules in sequence according to the stacking order of the rules for rule matching. + +In addition, to meet the requirement that rules from different sources are isolated from each other, these rules can be organized in a more flexible way. Therefore, the concept of "Group" is introduced. Generally, users can place rules in different groups according to different sources. + +![Placement rules overview](/media/placement-rules-1.png) + +### Rule fields + +The following table shows the meaning of each field in a rule: + +| Field name | Type and restriction | Description | +| :--- | :--- | :--- | +| `GroupID` | `string` | The group ID that marks the source of the rule. | +| `ID` | `string` | The unique ID of a rule in a group. | +| `Index` | `int` | The stacking sequence of rules in a group. | +| `Override` | `true`/`false` | Whether to overwrite rules with smaller index (in a group). | +| `StartKey` | `string`, in hexadecimal form | Applies to the starting key of a range. | +| `EndKey` | `string`, in hexadecimal form | Applies to the ending key of a range. | +| `Role` | `string` | Replica roles, including leader/follower/learner. | +| `Count` | `int`, positive integer | The number of replicas. | +| `LabelConstraint` | `[]Constraint` | Filers nodes based on the label. | +| `LocationLabels` | `[]string` | Used for physical isolation. | +| `IsolationLevel` | `string` | Used to set the minimum physical isolation level + +`LabelConstraint` is similar to the function in Kubernetes that filters labels based on these four primitives: `in`, `notIn`, `exists`, and `notExists`. The meanings of these four primitives are as follows: + ++ `in`: the label value of the given key is included in the given list. ++ `notIn`: the label value of the given key is not included in the given list. ++ `exists`: includes the given label key. ++ `notExists`: does not include the given label key. + +The meaning and function of `LocationLabels` are the same with those earlier than v4.0. For example, if you have deployed `[zone,rack,host]` that defines a three-layer topology: the cluster has multiple zones (Availability Zones), each zone has multiple racks, and each rack has multiple hosts. When performing schedule, PD first tries to place the Region's peers in different zones. If this try fails (such as there are three replicas but only two zones in total), PD guarantees to place these replicas in different racks. 
If the number of racks is not enough to guarantee isolation, then PD tries the host-level isolation. + +The meaning and function of `IsolationLevel` is elaborated in [Cluster topology configuration](/schedule-replicas-by-topology-labels.md). For example, if you have deployed `[zone,rack,host]` that defines a three-layer topology with `LocationLabels` and set `IsolationLevel` to `zone`, then PD ensures that all peers of each Region are placed in different zones during the scheduling. If the minimum isolation level restriction on `IsolationLevel` cannot be met (for example, 3 replicas are configured but there are only 2 data zones in total), PD will not try to make up to meet this restriction. The default value of `IsolationLevel` is an empty string, which means that it is disabled. + +### Fields of the rule group + +The following table shows the description of each field in a rule group: + +| Field name | Type and restriction | Description | +| :--- | :--- | :--- | +| `ID` | `string` | The group ID that marks the source of the rule. | +| `Index` | `int` | The stacking sequence of different groups. | +| `Override` | `true`/`false` | Whether to override groups with smaller indexes. | + +## Configure rules + +The operations in this section are based on [pd-ctl](/pd-control.md), and the commands involved in the operations also support calls via HTTP API. + +### Enable Placement Rules + +By default, the Placement Rules feature is disabled. To enable this feature, you can modify the PD configuration file as follows before initializing the cluster: + +{{< copyable "" >}} + +```toml +[replication] +enable-placement-rules = true +``` + +In this way, PD enables this feature after the cluster is successfully bootstrapped and generates corresponding rules according to the `max-replicas` and `location-labels` configurations: + +{{< copyable "" >}} + +```json +{ + "group_id": "pd", + "id": "default", + "start_key": "", + "end_key": "", + "role": "voter", + "count": 3, + "location_labels": ["zone", "rack", "host"], + "isolation_level": "" +} +``` + +For a bootstrapped cluster, you can also enable Placement Rules online through pd-ctl: + +{{< copyable "shell-regular" >}} + +```bash +pd-ctl config placement-rules enable +``` + +PD also generates default rules based on the `max-replicas` and `location-labels` configurations. + +> **Note:** +> +> After enabling Placement Rules, the previously configured `max-replicas` and `location-labels` no longer take effect. To adjust the replica policy, use the interface related to Placement Rules. + +### Disable Placement Rules + +You can use pd-ctl to disable the Placement Rules feature and switch to the previous scheduling strategy. + +{{< copyable "shell-regular" >}} + +```bash +pd-ctl config placement-rules disable +``` + +> **Note:** +> +> After disabling Placement Rules, PD uses the original `max-replicas` and `location-labels` configurations. The modification of rules (when Placement Rules is enabled) will not update these two configurations in real time. In addition, all the rules that have been configured remain in PD and will be used the next time you enable Placement Rules. + +### Set rules using pd-ctl + +> **Note:** +> +> The change of rules affects the PD scheduling in real time. Improper rule setting might result in fewer replicas and affect the high availability of the system. + +pd-ctl supports using the following methods to view rules in the system, and the output is a JSON-format rule or a rule list. 
+ +- To view the list of all rules: + + {{< copyable "shell-regular" >}} + + ```bash + pd-ctl config placement-rules show + ``` + +- To view the list of all rules in a PD Group: + + {{< copyable "shell-regular" >}} + + ```bash + pd-ctl config placement-rules show --group=pd + ``` + +- To view the rule of a specific ID in a Group: + + {{< copyable "shell-regular" >}} + + ```bash + pd-ctl config placement-rules show --group=pd --id=default + ``` + +- To view the rule list that matches a Region: + + {{< copyable "shell-regular" >}} + + ```bash + pd-ctl config placement-rules show --region=2 + ``` + + In the above example, `2` is the Region ID. + +Adding rules and editing rules are similar. You need to write the corresponding rules into a file and then use the `save` command to save the rules to PD: + +{{< copyable "shell-regular" >}} + +```bash +cat > rules.json <}} + +```bash +cat > rules.json <}} + +```bash +pd-ctl config placement-rules load +``` + +Executing the above command saves all rules to the `rules.json` file. + +{{< copyable "shell-regular" >}} + +```bash +pd-ctl config placement-rules load --group=pd --out=rule.txt +``` + +The above command saves the rules of a PD Group to the `rules.txt` file. + +### Use pd-ctl to configure rule groups + +- To view the list of all rule groups: + + {{< copyable "shell-regular" >}} + + ```bash + pd-ctl config placement-rules rule-group show + ``` + +- To view the rule group of a specific ID: + + {{< copyable "shell-regular" >}} + + ```bash + pd-ctl config placement-rules rule-group show pd + ``` + +- To set the `index` and `override` attributes of the rule group: + + {{< copyable "shell-regular" >}} + + ```bash + pd-ctl config placement-rules rule-group set pd 100 true + ``` + +- To delete the configuration of a rule group (use the default group configuration if there is any rule in the group): + + {{< copyable "shell-regular" >}} + + ```bash + pd-ctl config placement-rules rule-group delete pd + ``` + +### Use tidb-ctl to query the table-related key range + +If you need special configuration for metadata or a specific table, you can execute the [`keyrange` command](https://github.com/pingcap/tidb-ctl/blob/master/doc/tidb-ctl_keyrange.md) in [tidb-ctl](https://github.com/pingcap/tidb-ctl) to query related keys. Remember to add `--encode` at the end of the command. + +{{< copyable "shell-regular" >}} + +```bash +tidb-ctl keyrange --database test --table ttt --encode +``` + +```text +global ranges: + meta: (6d00000000000000f8, 6e00000000000000f8) + table: (7400000000000000f8, 7500000000000000f8) +table ttt ranges: (NOTE: key range might be changed after DDL) + table: (7480000000000000ff2d00000000000000f8, 7480000000000000ff2e00000000000000f8) + table indexes: (7480000000000000ff2d5f690000000000fa, 7480000000000000ff2d5f720000000000fa) + index c2: (7480000000000000ff2d5f698000000000ff0000010000000000fa, 7480000000000000ff2d5f698000000000ff0000020000000000fa) + index c3: (7480000000000000ff2d5f698000000000ff0000020000000000fa, 7480000000000000ff2d5f698000000000ff0000030000000000fa) + index c4: (7480000000000000ff2d5f698000000000ff0000030000000000fa, 7480000000000000ff2d5f698000000000ff0000040000000000fa) + table rows: (7480000000000000ff2d5f720000000000fa, 7480000000000000ff2e00000000000000f8) +``` + +> **Note:** +> +> DDL and other operations can cause table ID changes, so you need to update the corresponding rules at the same time. + +## Typical usage scenarios + +This section introduces the typical usage scenarios of Placement Rules. 
+ +### Scenario 1: Use three replicas for normal tables and five replicas for the metadata to improve cluster disaster tolerance + +You only need to add a rule that limits the key range to the range of metadata, and set the value of `count` to `5`. Here is an example of this rule: + +{{< copyable "" >}} + +```json +{ + "group_id": "pd", + "id": "meta", + "index": 1, + "override": true, + "start_key": "6d00000000000000f8", + "end_key": "6e00000000000000f8", + "role": "voter", + "count": "5", + "location_labels": ["zone", "rack", "host"] +} +``` + +### Scenario 2: Place five replicas in three data centers in the proportion of 2:2:1, and the Leader should not be in the third data center + +Create three rules. Set the number of replicas to `2`, `2`, and `1` respectively. Limit the replicas to the corresponding data centers through `label_constraints` in each rule. In addition, change `role` to `follower` for the data center that does not need a Leader. + +{{< copyable "" >}} + +```json +[ + { + "group_id": "pd", + "id": "zone1", + "start_key": "", + "end_key": "", + "role": "voter", + "count": 2, + "label_constraints": [ + {"key": "zone", "op": "in", "values": ["zone1"]} + ], + "location_labels": ["rack", "host"] + }, + { + "group_id": "pd", + "id": "zone2", + "start_key": "", + "end_key": "", + "role": "voter", + "count": 2, + "label_constraints": [ + {"key": "zone", "op": "in", "values": ["zone2"]} + ], + "location_labels": ["rack", "host"] + }, + { + "group_id": "pd", + "id": "zone3", + "start_key": "", + "end_key": "", + "role": "follower", + "count": 1, + "label_constraints": [ + {"key": "zone", "op": "in", "values": ["zone3"]} + ], + "location_labels": ["rack", "host"] + } +] +``` + +### Scenario 3: Add two TiFlash replicas for a table + +Add a separate rule for the row key of the table and limit `count` to `2`. Use `label_constraints` to ensure that the replicas are generated on the node of `engine = tiflash`. Note that a separate `group_id` is used here to ensure that this rule does not overlap or conflict with rules from other sources in the system. + +{{< copyable "" >}} + +```json +{ + "group_id": "tiflash", + "id": "learner-replica-table-ttt", + "start_key": "7480000000000000ff2d5f720000000000fa", + "end_key": "7480000000000000ff2e00000000000000f8", + "role": "learner", + "count": 2, + "label_constraints": [ + {"key": "engine", "op": "in", "values": ["tiflash"]} + ], + "location_labels": ["host"] +} +``` + +### Scenario 4: Add two follower replicas for a table in the Beijing node with high-performance disks + +The following example shows a more complicated `label_constraints` configuration. In this rule, the replicas must be placed in the `bj1` or `bj2` machine room, and the disk type must not be `hdd`. + +{{< copyable "" >}} + +```json +{ + "group_id": "follower-read", + "id": "follower-read-table-ttt", + "start_key": "7480000000000000ff2d00000000000000f8", + "end_key": "7480000000000000ff2e00000000000000f8", + "role": "follower", + "count": 2, + "label_constraints": [ + {"key": "zone", "op": "in", "values": ["bj1", "bj2"]}, + {"key": "disk", "op": "notIn", "values": ["hdd"]} + ], + "location_labels": ["host"] +} +``` + +### Scenario 5: Migrate a table to the TiFlash cluster + +Different from scenario 3, this scenario is not to add new replica(s) on the basis of the existing configuration, but to forcibly override the other configuration of a data range. 
So you need to specify an `index` value large enough and set `override` to `true` in the rule group configuration to override the existing rule. + +The rule: + +{{< copyable "" >}} + +```json +{ + "group_id": "tiflash-override", + "id": "learner-replica-table-ttt", + "start_key": "7480000000000000ff2d5f720000000000fa", + "end_key": "7480000000000000ff2e00000000000000f8", + "role": "voter", + "count": 3, + "label_constraints": [ + {"key": "engine", "op": "in", "values": ["tiflash"]} + ], + "location_labels": ["host"] +} +``` + +The rule group: + +{{< copyable "" >}} + +```json +{ + "id": "tiflash-override", + "index": 1024, + "override": true, +} +``` diff --git a/pd-configuration-file.md b/pd-configuration-file.md index f84131eb5fad9..46933d0c504ce 100644 --- a/pd-configuration-file.md +++ b/pd-configuration-file.md @@ -56,7 +56,282 @@ PD is configurable using command-line flags and environment variables. pd1=http://192.168.100.113:2380, pd2=http://192.168.100.114:2380, pd3=192.168.100.115:2380 ``` +<<<<<<< HEAD ## `--join` +======= +### `initial-cluster-state` + ++ The initial state of the cluster ++ Default value: `"new"` + +### `initial-cluster-token` + ++ Identifies different clusters during the bootstrap phase. ++ Default value: `"pd-cluster"` ++ If multiple clusters that have nodes with same configurations are deployed successively, you must specify different tokens to isolate different cluster nodes. + +### `lease` + ++ The timeout of the PD Leader Key lease. After the timeout, the system re-elects a Leader. ++ Default value: `3` ++ Unit: second + +### `tso-save-interval` + ++ The interval for PD to allocate TSOs for persistent storage in etcd ++ Default value: `3` ++ Unit: second + +### `enable-prevote` + ++ Enables or disables `raft prevote` ++ Default value: `true` + +### `quota-backend-bytes` + ++ The storage size of the meta-information database, which is 2GB by default ++ Default value: `2147483648` + +### `auto-compaction-mod` + ++ The automatic compaction modes of the meta-information database ++ Available options: `periodic` (by cycle) and `revision` (by version number). ++ Default value: `periodic` + +### `auto-compaction-retention` + ++ The time interval for automatic compaction of the meta-information database when `auto-compaction-retention` is `periodic`. When the compaction mode is set to `revision`, this parameter indicates the version number for the automatic compaction. 
++ Default value: 1h + +### `force-new-cluster` + ++ Determines whether to force PD to start as a new cluster and modify the number of Raft members to `1` ++ Default value: `false` + +### `tick-interval` + ++ The tick period of etcd Raft ++ Default value: `100ms` + +### `election-interval` + ++ The timeout for the etcd leader election ++ Default value: `3s` + +### `use-region-storage` + ++ Enables or disables independent Region storage ++ Default value: `false` + +## security + +Configuration items related to security + +### `cacert-path` + ++ The path of the CA file ++ Default value: "" + +### `cert-path` + ++ The path of the Privacy Enhanced Mail (PEM) file that contains the X509 certificate ++ Default value: "" + +### `key-path` + ++ The path of the PEM file that contains the X509 key ++ Default value: "" + +## `log` + +Configuration items related to log + +### `format` + ++ The log format, which can be specified as "text", "json", or "console" ++ Default value: `text` + +### `disable-timestamp` + ++ Whether to disable the automatically generated timestamp in the log ++ Default value: `false` + +## `log.file` + +Configuration items related to the log file + +### `max-size` + ++ The maximum size of a single log file. When this value is exceeded, the system automatically splits the log into several files. ++ Default value: `300` ++ Unit: MiB ++ Minimum value: `1` + +### `max-days` + ++ The maximum number of days in which a log is kept ++ Default value: `28` ++ Minimum value: `1` + +### `max-backups` + ++ The maximum number of log files to keep ++ Default value: `7` ++ Minimum value: `1` + +## `metric` + +Configuration items related to monitoring + +### `interval` + ++ The interval at which monitoring metric data is pushed to Promethus ++ Default value: `15s` + +## `schedule` + +Configuration items related to scheduling + +### `max-merge-region-size` + ++ Controls the size limit of `Region Merge`. When the Region size is greater than the specified value, PD does not merge the Region with the adjacent Regions. ++ Default value: `20` + +### `max-merge-region-keys` + ++ Specifies the upper limit of the `Region Merge` key. When the Region key is greater than the specified value, the PD does not merge the Region with its adjacent Regions. ++ Default value: `200000` + +### `patrol-region-interval` + ++ Controls the running frequency at which `replicaChecker` checks the health state of a Region. The smaller this value is, the faster `replicaChecker` runs. Normally, you do not need to adjust this parameter. ++ Default value: `100ms` + +### `split-merge-interval` + ++ Controls the time interval between the `split` and `merge` operations on the same Region. That means a newly split Region will not be merged for a while. ++ Default value: `1h` + +### `max-snapshot-count` + ++ Control the maximum number of snapshots that a single store receives or sends at the same time. PD schedulers depend on this configuration to prevent the resources used for normal traffic from being preempted. ++ Default value value: `3` + +### `max-pending-peer-count` + ++ Controls the maximum number of pending peers in a single store. PD schedulers depend on this configuration to prevent too many Regions with outdated logs from being generated on some nodes. ++ Default value: `16` + +### `max-store-down-time` + ++ The downtime after which PD judges that the disconnected store can not be recovered. When PD fails to receive the heartbeat from a store after the specified period of time, it adds replicas at other nodes. 
++ Default value: `30m` + +### `leader-schedule-limit` + ++ The number of Leader scheduling tasks performed at the same time ++ Default value: `4` + +### `region-schedule-limit` + ++ The number of Region scheduling tasks performed at the same time ++ Default value: `2048` + +### `replica-schedule-limit` + ++ The number of Replica scheduling tasks performed at the same time ++ Default value: `64` + +### `merge-schedule-limit` + ++ The number of the `Region Merge` scheduling tasks performed at the same time. Set this parameter to `0` to disable `Region Merge`. ++ Default value: `8` + +### `high-space-ratio` + ++ The threshold ratio below which the capacity of the store is sufficient ++ Default value: `0.7` ++ Minimum value: greater than `0` ++ Maximum value: less than `1` + +### `low-space-ratio` + ++ The threshold ratio above which the capacity of the store is insufficient ++ Default value: `0.8` ++ Minimum value: greater than `0` ++ Maximum value: less than `1` + +### `tolerant-size-ratio` + ++ Controls the `balance` buffer size ++ Default value: `0` (automatically adjusts the buffer size) ++ Minimum value: `0` + +### `disable-remove-down-replica` + ++ Determines whether to disable the feature that automatically removes `DownReplica`. When this parameter is set to `true`, PD does not automatically clean up the copy in the down state. ++ Default value: `false` + +### `disable-replace-offline-replica` + ++ Determines whether to disable the feature that migrates `OfflineReplica`. When this parameter is set to `true`, PD does not migrate the replicas in the offline state. ++ Default value: `false` + +### `disable-make-up-replica` + ++ Determines whether to disable the feature that automatically supplements replicas. When this parameter is set to `true`, PD does not supplement replicas for the Region with insufficient replicas. ++ Default value: `false` + +### `disable-remove-extra-replica` + ++ Determines whether to disable the feature that removes extra replicas. When this parameter is set to `true`, PD does not remove the extra replicas from the Region with excessive replicas. ++ Default value: `false` + +### `disable-location-replacement` + ++ Determines whether to disable isolation level check. When this parameter is set to `true`, PD does not increase the isolation level of the Region replicas through scheduling. ++ Default value: `false` + +### `store-balance-rate` + ++ Determines the maximum number of operations related to adding peers within a minute ++ Default value: `15` + +## `replication` + +Configuration items related to replicas + +### `max-replicas` + ++ The number of replicas ++ Default value: `3` + +### `location-labels` + ++ The topology information of a TiKV cluster ++ Default value: `[]` ++ [Cluster topology configuration](/schedule-replicas-by-topology-labels.md) + +### `isolation-level` + ++ The minimum topological isolation level of a TiKV cluster ++ Default value: `""` ++ [Cluster topology configuration](/schedule-replicas-by-topology-labels.md) + +### `strictly-match-label` + ++ Enables the strict check for whether the TiKV label matches PD's `location-labels`. ++ Default value: `false` + +### `enable-placement-rules` + ++ Enables `placement-rules`. ++ Default value: `false` ++ See [Placement Rules](/configure-placement-rules.md). ++ An experimental feature of TiDB 4.0. + +## `label-property` +>>>>>>> 24f31fb... 
improve location-awareness (#3825) - Join the cluster dynamically - Default: "" diff --git a/schedule-replicas-by-topology-labels.md b/schedule-replicas-by-topology-labels.md new file mode 100644 index 0000000000000..546bab38743aa --- /dev/null +++ b/schedule-replicas-by-topology-labels.md @@ -0,0 +1,194 @@ +--- +title: Schedule Replicas by Topology Labels +summary: Learn how to schedule replicas by topology labels. +aliases: ['/docs/dev/location-awareness/','/docs/dev/how-to/deploy/geographic-redundancy/location-awareness/','/tidb/dev/location-awareness'] +--- + +# Schedule Replicas by Topology Labels + +To improve the high availability and disaster recovery capability of TiDB clusters, it is recommended that TiKV nodes are physically scattered as much as possible. For example, TiKV nodes can be distributed on different racks or even in different data centers. According to the topology information of TiKV, the PD scheduler automatically performs scheduling at the background to isolate each replica of a Region as much as possible, which maximizes the capability of disaster recovery. + +To make this mechanism effective, you need to properly configure TiKV and PD so that the topology information of the cluster, especially the TiKV location information, is reported to PD during deployment. Before you begin, see [Deploy TiDB Using TiUP](/production-deployment-using-tiup.md) first. + +## Configure `labels` based on the cluster topology + +### Configure `labels` for TiKV + +You can use the command-line flag or set the TiKV configuration file to bind some attributes in the form of key-value pairs. These attributes are called `labels`. After TiKV is started, it reports its `labels` to PD so users can identify the location of TiKV nodes. + +Assume that the topology has three layers: zone > rack > host, and you can use these labels (zone, rack, host) to set the TiKV location in one of the following methods: + ++ Use the command-line flag: + + {{< copyable "" >}} + + ``` + tikv-server --labels zone=,rack=,host= + ``` + ++ Configure in the TiKV configuration file: + + {{< copyable "" >}} + + ```toml + [server] + labels = "zone=,rack=,host=" + ``` + +### Configure `location-labels` for PD + +According to the description above, the label can be any key-value pair used to describe TiKV attributes. But PD cannot identify the location-related labels and the layer relationship of these labels. Therefore, you need to make the following configuration for PD to understand the TiKV node topology. + ++ If the PD cluster is not initialized, configure `location-labels` in the PD configuration file: + + {{< copyable "" >}} + + ```toml + [replication] + location-labels = ["zone", "rack", "host"] + ``` + ++ If the PD cluster is already initialized, use the pd-ctl tool to make online changes: + + {{< copyable "shell-regular" >}} + + ```bash + pd-ctl config set location-labels zone,rack,host + ``` + +The `location-labels` configuration is an array of strings, and each item corresponds to the key of TiKV `labels`. The sequence of each key represents the layer relationship of different labels. + +> **Note:** +> +> You must configure `location-labels` for PD and `labels` for TiKV at the same time for the configurations to take effect. Otherwise, PD does not perform scheduling according to the topology. 
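As a quick sanity check, you can ask PD which labels each store has reported. The following sketch assumes that pd-ctl can reach PD at `http://127.0.0.1:2379`; adjust the address to your deployment:

{{< copyable "shell-regular" >}}

```bash
# Assumes PD is reachable at 127.0.0.1:2379; change the address to match your cluster.
pd-ctl -u http://127.0.0.1:2379 store
```

In the output, the `labels` field of each store is expected to contain the key-value pairs configured on that TiKV node. If the field is empty, recheck the `server.labels` setting of the corresponding TiKV instance.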
+ +### Configure `isolation-level` for PD + +If `location-labels` has been configured, you can further enhance the topological isolation requirements on TiKV clusters by configuring `isolation-level` in the PD configuration file. + +Assume that you have made a three-layer cluster topology by configuring `location-labels` according to the instructions above: zone -> rack -> host, you can configure the `isolation-level` to `zone` as follows: + +{{< copyable "" >}} + +```toml +[replication] +isolation-level = "zone" +``` + +If the PD cluster is already initialized, you need to use the pd-ctl tool to make online changes: + +{{< copyable "shell-regular" >}} + +```bash +pd-ctl config set isolation-level zone +``` + +The `location-level` configuration is an array of strings, which needs to correspond to a key of `location-labels`. This parameter limits the minimum and mandatory isolation level requirements on TiKV topology clusters. + +> **Note:** +> +> `isolation-level` is empty by default, which means there is no mandatory restriction on the isolation level. To set it, you need to configure `location-labels` for PD and ensure that the value of `isolation-level` is one of `location-labels` names. + +### Configure a cluster using TiUP (recommended) + +When using TiUP to deploy a cluster, you can configure the TiKV location in the [initialization configuration file](/production-deployment-using-tiup.md#step-3-edit-the-initialization-configuration-file). TiUP will generate the corresponding TiKV and PD configuration files during deployment. + +In the following example, a two-layer topology of `zone/host` is defined. The TiKV nodes of the cluster are distributed among three zones, each zone with two hosts. In z1, two TiKV instances are deployed per host. In z2 and z3, one TiKV instance is deployed per host. In the following example, `tikv-n` represents the IP address of the `n`th TiKV node. + +``` +server_configs: + pd: + replication.location-labels: ["zone", "host"] + +tikv_servers: +# z1 + - host: tikv-1 + config: + server.labels: + zone: z1 + host: h1 + - host: tikv-2 + config: + server.labels: + zone: z1 + host: h1 + - host: tikv-3 + config: + server.labels: + zone: z1 + host: h2 + - host: tikv-4 + config: + server.labels: + zone: z1 + host: h2 +# z2 + - host: tikv-5 + config: + server.labels: + zone: z2 + host: h1 + - host: tikv-6 + config: + server.labels: + zone: z2 + host: h2 +# z3 + - host: tikv-7 + config: + server.labels: + zone: z3 + host: h1 + - host: tikv-8 + config: + server.labels: + zone: z3 + host: h2s +``` + +For details, see [Geo-distributed Deployment topology](/geo-distributed-deployment-topology.md). + +
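For reference, a topology file like the one above is usually passed to TiUP when the cluster is deployed. The command below is only a sketch; the cluster name `tidb-test`, the version `v4.0.0`, and the file name `topology.yaml` are placeholders rather than values taken from this document:

{{< copyable "shell-regular" >}}

```bash
# Placeholders: replace the cluster name, TiDB version, and topology file path with your own values.
tiup cluster deploy tidb-test v4.0.0 ./topology.yaml --user root -p
```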
+ Configure a cluster using TiDB Ansible + +When using TiDB Ansible to deploy a cluster, you can directly configure the TiKV location in the `inventory.ini` file. `tidb-ansible` will generate the corresponding TiKV and PD configuration files during deployment. + +In the following example, a two-layer topology of `zone/host` is defined. The TiKV nodes of the cluster are distributed among three zones, each zone with two hosts. In z1, two TiKV instances are deployed per host. In z2 and z3, one TiKV instance is deployed per host. + +``` +[tikv_servers] +# z1 +tikv-1 labels="zone=z1,host=h1" +tikv-2 labels="zone=z1,host=h1" +tikv-3 labels="zone=z1,host=h2" +tikv-4 labels="zone=z1,host=h2" +# z2 +tikv-5 labels="zone=z2,host=h1" +tikv-6 labels="zone=z2,host=h2" +# z3 +tikv-7 labels="zone=z3,host=h1" +tikv-8 labels="zone=z3,host=h2" + +[pd_servers:vars] +location_labels = ["zone", "host"] +``` + +
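After `inventory.ini` is edited, the labels take effect when the configuration is applied by the standard tidb-ansible playbooks. The commands below are only a sketch and assume that they are run from the tidb-ansible directory on the control machine:

{{< copyable "shell-regular" >}}

```bash
# Assumes the working directory is tidb-ansible on the control machine.
ansible-playbook deploy.yml
ansible-playbook start.yml
```

For a cluster that is already running, `ansible-playbook rolling_update.yml` is usually used to apply the changed configuration instead of a full redeployment.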
+ +## PD schedules based on topology label + +PD schedules replicas according to the label layer to make sure that different replicas of the same data are scattered as much as possible. + +Take the topology in the previous section as an example. + +Assume that the number of cluster replicas is 3 (`max-replicas=3`). Because there are 3 zones in total, PD ensures that the 3 replicas of each Region are respectively placed in z1, z2, and z3. In this way, the TiDB cluster is still available when one data center fails. + +Then, assume that the number of cluster replicas is 5 (`max-replicas=5`). Because there are only 3 zones in total, PD cannot guarantee the isolation of each replica at the zone level. In this situation, the PD scheduler will ensure replica isolation at the host level. In other words, multiple replicas of a Region might be distributed in the same zone but not on the same host. + +In the case of the 5-replica configuration, if z3 fails or is isolated as a whole, and cannot be recovered after a period of time (controlled by `max-store-down-time`), PD will make up the 5 replicas through scheduling. At this time, only 3 hosts are available. This means that host-level isolation cannot be guaranteed and that multiple replicas might be scheduled to the same host. But if the `isolation-level` value is set to `zone` instead of being left empty, this specifies the minimum physical isolation requirements for Region replicas. That is to say, PD will ensure that replicas of the same Region are scattered among different zones. PD will not perform corresponding scheduling even if following this isolation restriction does not meet the requirement of `max-replicas` for multiple replicas. + +For example, a TiKV cluster is distributed across three data zones z1, z2, and z3. Each Region has three replicas as required, and PD distributes the three replicas of the same Region to these three data zones respectively. If a power outage occurs in z1 and cannot be recovered after a period of time, PD determines that the Region replicas on z1 are no longer available. However, because `isolation-level` is set to `zone`, PD needs to strictly guarantee that different replicas of the same Region will not be scheduled on the same data zone. Because both z2 and z3 already have replicas, PD will not perform any scheduling under the minimum isolation level restriction of `isolation-level`, even if there are only two replicas at this moment. + +Similarly, when `isolation-level` is set to `rack`, the minimum isolation level applies to different racks in the same data center. With this configuration, the isolation at the zone layer is guaranteed first if possible. When the isolation at the zone level cannot be guaranteed, PD tries to avoid scheduling different replicas to the same rack in the same zone. The scheduling works similarly when `isolation-level` is set to `host` where PD first guarantees the isolation level of rack, and then the level of host. + +In summary, PD maximizes the disaster recovery of the cluster according to the current topology. Therefore, if you want to achieve a certain level of disaster recovery, deploy more machines on different sites according to the topology than the number of `max-replicas`. TiDB also provides mandatory configuration items such as `isolation-level` for you to more flexibly control the topological isolation level of data according to different scenarios. 
diff --git a/tidb-scheduling.md b/tidb-scheduling.md new file mode 100644 index 0000000000000..84e9cae9f2225 --- /dev/null +++ b/tidb-scheduling.md @@ -0,0 +1,141 @@ +--- +title: TiDB Scheduling +summary: Introduces the PD scheduling component in a TiDB cluster. +--- + +# TiDB Scheduling + +PD works as the manager in a TiDB cluster, and it also schedules Regions in the cluster. This article introduces the design and core concepts of the PD scheduling component. + +## Scheduling situations + +TiKV is the distributed key-value storage engine used by TiDB. In TiKV, data is organized as Regions, which are replicated on several stores. In all replicas, a leader is responsible for reading and writing, and followers are responsible for replicating Raft logs from the leader. + +Now consider about the following situations: + +* To utilize storage space in a high-efficient way, multiple Replicas of the same Region need to be properly distributed on different nodes according to the Region size; +* For multiple data center topologies, one data center failure only causes one replica to fail for all Regions; +* When a new TiKV store is added, data can be rebalanced to it; +* When a TiKV store fails, PD needs to consider: + * Recovery time of the failed store. + * If it's short (e.g. the service is restarted), whether scheduling is necessary or not. + * If it's long (e.g. disk fault, data is lost, etc.), how to do scheduling. + * Replicas of all Regions. + * If the number of replicas is not enough for some Regions, PD needs to complete them. + * If the number of replicas is more than expected (e.g. the failed store re-joins into the cluster after recovery), PD needs to delete them. +* Read/Write operations are performed on leaders, which can not be distributed only on a few individual stores; +* Not all Regions are hot, so loads of all TiKV stores need to be balanced; +* When Regions are in balancing, data transferring utilizes much network/disk traffic and CPU time, which can influence online services. + +These situations can occur at the same time, which makes it harder to resolve. Also, the whole system is changing dynamically, so a scheduler is needed to collect all information about the cluster, and then adjust the cluster. So, PD is introduced into the TiDB cluster. + +## Scheduling requirements + +The above situations can be classified into two types: + +1. A distributed and highly available storage system must meet the following requirements: + + * The right number of replicas. + * Replicas need to be distributed on different machines according to different topologies. + * The cluster can perform automatic disaster recovery from TiKV peers' failure. + +2. A good distributed system needs to have the following optimizations: + + * All Region leaders are distributed evenly on stores; + * Storage capacity of all TiKV peers are balanced; + * Hot spots are balanced; + * Speed of load balancing for the Regions needs to be limited to ensure that online services are stable; + * Maintainers are able to to take peers online/offline manually. + +After the first type of requirements is satisfied, the system will be failure tolerable. After the second type of requirements is satisfied, resources will be utilized more efficiently and the system will have better scalability. + +To achieve the goals, PD needs to collect information firstly, such as state of peers, information about Raft groups and the statistics of accessing the peers. 
Then we need to specify some strategies for PD, so that PD can make scheduling plans from these information and strategies. Finally, PD distributes some operators to TiKV peers to complete scheduling plans. + +## Basic scheduling operators + +All scheduling plans contain three basic operators: + +* Add a new replica +* Remove a replica +* Transfer a Region leader between replicas in a Raft group + +They are implemented by the Raft commands `AddReplica`, `RemoveReplica`, and `TransferLeader`. + +## Information collection + +Scheduling is based on information collection. In short, the PD scheduling component needs to know the states of all TiKV peers and all Regions. TiKV peers report the following information to PD: + +- State information reported by each TiKV peer: + + Each TiKV peer sends heartbeats to PD periodically. PD not only checks whether the store is alive, but also collects [`StoreState`](https://github.com/pingcap/kvproto/blob/release-3.1/proto/pdpb.proto#L421) in the heartbeat message. `StoreState` includes: + + * Total disk space + * Available disk space + * The number of Regions + * Data read/write speed + * The number of snapshots that are sent/received (The data might be replicated between replicas through snapshots) + * Whether the store is overloaded + * Labels (See [Perception of Topology](/schedule-replicas-by-topology-labels.md)) + +- Information reported by Region leaders: + + Each Region leader sends heartbeats to PD periodically to report [`RegionState`](https://github.com/pingcap/kvproto/blob/release-3.1/proto/pdpb.proto#L271), including: + + * Position of the leader itself + * Positions of other replicas + * The number of offline replicas + * data read/write speed + +PD collects cluster information by the two types of heartbeats and then makes decision based on it. + +Besides, PD can get more information from an expanded interface to make a more precise decision. For example, if a store's heartbeats are broken, PD can't know whether the peer steps down temporarily or forever. It just waits a while (by default 30min) and then treats the store as offline if there are still no heartbeats received. Then PD balances all regions on the store to other stores. + +But sometimes stores are manually set offline by a maintainer, so the maintainer can tell PD this by the PD control interface. Then PD can balance all regions immediately. + +## Scheduling strategies + +After collecting the information, PD needs some strategies to make scheduling plans. + +**Strategy 1: The number of replicas of a Region needs to be correct** + +PD can know that the replica count of a Region is incorrect from the Region leader's heartbeat. If it happens, PD can adjust the replica count by adding/removing replica(s). The reason for incorrect replica count can be: + +* Store failure, so some Region's replica count is less than expected; +* Store recovery after failure, so some Region's replica count could be more than expected; +* [`max-replicas`](https://github.com/pingcap/pd/blob/v4.0.0-beta/conf/config.toml#L95) is changed. + +**Strategy 2: Replicas of a Region need to be at different positions** + +Note that here "position" is different from "machine". Generally PD can only ensure that replicas of a Region are not at a same peer to avoid that the peer's failure causes more than one replicas to become lost. 
However in production, you might have the following requirements: + +* Multiple TiKV peers are on one machine; +* TiKV peers are on multiple racks, and the system is expected to be available even if a rack fails; +* TiKV peers are in multiple data centers, and the system is expected to be available even if a data center fails; + +The key to these requirements is that peers can have the same "position", which is the smallest unit for failure toleration. Replicas of a Region must not be in one unit. So, we can configure [labels](https://github.com/tikv/tikv/blob/v4.0.0-beta/etc/config-template.toml#L140) for the TiKV peers, and set [location-labels](https://github.com/pingcap/pd/blob/v4.0.0-beta/conf/config.toml#L100) on PD to specify which labels are used for marking positions. + +**Strategy 3: Replicas need to be balanced between stores** + +The size limit of a Region replica is fixed, so keeping the replicas balanced between stores is helpful for data size balance. + +**Strategy 4: Leaders need to be balanced between stores** + +Read and write operations are performed on leaders according to the Raft protocol, so that PD needs to distribute leaders into the whole cluster instead of several peers. + +**Strategy 5: Hot spots need to be balanced between stores** + +PD can detect hot spots from store heartbeats and Region heartbeats, so that PD can distribute hot spots. + +**Strategy 6: Storage size needs to be balanced between stores** + +When started up, a TiKV store reports `capacity` of storage, which indicates the store's space limit. PD will consider this when scheduling. + +**Strategy 7: Adjust scheduling speed to stabilize online services** + +Scheduling utilizes CPU, memory, network and I/O traffic. Too much resource utilization will influence online services. Therefore, PD needs to limit the number of the concurrent scheduling tasks. By default this strategy is conservative, while it can be changed if quicker scheduling is required. + +## Scheduling implementation + +PD collects cluster information from store heartbeats and Region heartbeats, and then makes scheduling plans from the information and strategies. Scheduling plans are a sequence of basic operators. Every time PD receives a Region heartbeat from a Region leader, it checks whether there is a pending operator on the Region or not. If PD needs to dispatch a new operator to a Region, it puts the operator into heartbeat responses, and monitors the operator by checking follow-up Region heartbeats. + +Note that here "operators" are only suggestions to the Region leader, which can be skipped by Regions. Leader of Regions can decide whether to skip a scheduling operator or not based on its current status. 
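For illustration, the three basic operators described above roughly correspond to the manual operator commands that pd-ctl exposes, which can help when you observe or reproduce a scheduling plan by hand. The Region ID `1` and the store IDs `2` and `5` below are placeholders:

{{< copyable "shell-regular" >}}

```bash
# Placeholders: replace Region ID 1 and store IDs 2 and 5 with real IDs from your cluster.
pd-ctl operator add add-peer 1 2         # add a replica of Region 1 on store 2
pd-ctl operator add remove-peer 1 2      # remove the replica of Region 1 from store 2
pd-ctl operator add transfer-leader 1 5  # transfer the leader of Region 1 to store 5
pd-ctl operator show                     # list the operators that are currently pending
```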
From eefcb083726115b502cb86d9442a7c75b6bdd557 Mon Sep 17 00:00:00 2001 From: TomShawn <41534398+TomShawn@users.noreply.github.com> Date: Thu, 17 Sep 2020 13:21:50 +0800 Subject: [PATCH 2/2] 2.1-changes --- TOC.md | 142 +------- configure-placement-rules.md | 436 ------------------------ location-awareness.md | 91 ----- pd-configuration-file.md | 275 --------------- schedule-replicas-by-topology-labels.md | 103 +----- tidb-scheduling.md | 141 -------- 6 files changed, 6 insertions(+), 1182 deletions(-) delete mode 100644 configure-placement-rules.md delete mode 100644 location-awareness.md delete mode 100644 tidb-scheduling.md diff --git a/TOC.md b/TOC.md index 2bdc6a3731709..9fbbcbea2d5c0 100644 --- a/TOC.md +++ b/TOC.md @@ -54,7 +54,7 @@ - [Docker Deployment](/test-deployment-using-docker.md) + Geographic Redundancy - [Overview](/geo-redundancy-deployment.md) - - [Configure Location Awareness](/location-awareness.md) + - [Schedule Replicas by Topology Labels](/schedule-replicas-by-topology-labels.md) - [Data Migration with Ansible](https://pingcap.com/docs/tidb-data-migration/stable/deploy-a-dm-cluster-using-ansible/) + Configure - [Time Zone](/configure-time-zone.md) @@ -137,7 +137,6 @@ - [`SET`](/data-type-string.md#set-type) - [JSON Type](/data-type-json.md) + Functions and Operators -<<<<<<< HEAD - [Function and Operator Reference](/functions-and-operators/functions-and-operators-overview.md) - [Type Conversion in Expression Evaluation](/functions-and-operators/type-conversion-in-expression-evaluation.md) - [Operators](/functions-and-operators/operators.md) @@ -321,145 +320,6 @@ - [PD Recover](/pd-recover.md) - [TiKV Control](/tikv-control.md) - [TiDB Control](/tidb-control.md) -======= - + [Overview](/functions-and-operators/functions-and-operators-overview.md) - + [Type Conversion in Expression Evaluation](/functions-and-operators/type-conversion-in-expression-evaluation.md) - + [Operators](/functions-and-operators/operators.md) - + [Control Flow Functions](/functions-and-operators/control-flow-functions.md) - + [String Functions](/functions-and-operators/string-functions.md) - + [Numeric Functions and Operators](/functions-and-operators/numeric-functions-and-operators.md) - + [Date and Time Functions](/functions-and-operators/date-and-time-functions.md) - + [Bit Functions and Operators](/functions-and-operators/bit-functions-and-operators.md) - + [Cast Functions and Operators](/functions-and-operators/cast-functions-and-operators.md) - + [Encryption and Compression Functions](/functions-and-operators/encryption-and-compression-functions.md) - + [Information Functions](/functions-and-operators/information-functions.md) - + [JSON Functions](/functions-and-operators/json-functions.md) - + [Aggregate (GROUP BY) Functions](/functions-and-operators/aggregate-group-by-functions.md) - + [Window Functions](/functions-and-operators/window-functions.md) - + [Miscellaneous Functions](/functions-and-operators/miscellaneous-functions.md) - + [Precision Math](/functions-and-operators/precision-math.md) - + [List of Expressions for Pushdown](/functions-and-operators/expressions-pushed-down.md) - + [Constraints](/constraints.md) - + [Generated Columns](/generated-columns.md) - + [SQL Mode](/sql-mode.md) - + Transactions - + [Overview](/transaction-overview.md) - + [Isolation Levels](/transaction-isolation-levels.md) - + [Optimistic Transactions](/optimistic-transaction.md) - + [Pessimistic Transactions](/pessimistic-transaction.md) - + Garbage Collection (GC) - + 
[Overview](/garbage-collection-overview.md) - + [Configuration](/garbage-collection-configuration.md) - + [Views](/views.md) - + [Partitioning](/partitioned-table.md) - + [Character Set and Collation](/character-set-and-collation.md) - + System Tables - + [`mysql`](/mysql-schema.md) - + INFORMATION_SCHEMA - + [Overview](/information-schema/information-schema.md) - + [`ANALYZE_STATUS`](/information-schema/information-schema-analyze-status.md) - + [`CHARACTER_SETS`](/information-schema/information-schema-character-sets.md) - + [`CLUSTER_CONFIG`](/information-schema/information-schema-cluster-config.md) - + [`CLUSTER_HARDWARE`](/information-schema/information-schema-cluster-hardware.md) - + [`CLUSTER_INFO`](/information-schema/information-schema-cluster-info.md) - + [`CLUSTER_LOAD`](/information-schema/information-schema-cluster-load.md) - + [`CLUSTER_LOG`](/information-schema/information-schema-cluster-log.md) - + [`CLUSTER_SYSTEMINFO`](/information-schema/information-schema-cluster-systeminfo.md) - + [`COLLATIONS`](/information-schema/information-schema-collations.md) - + [`COLLATION_CHARACTER_SET_APPLICABILITY`](/information-schema/information-schema-collation-character-set-applicability.md) - + [`COLUMNS`](/information-schema/information-schema-columns.md) - + [`DDL_JOBS`](/information-schema/information-schema-ddl-jobs.md) - + [`ENGINES`](/information-schema/information-schema-engines.md) - + [`INSPECTION_RESULT`](/information-schema/information-schema-inspection-result.md) - + [`INSPECTION_RULES`](/information-schema/information-schema-inspection-rules.md) - + [`INSPECTION_SUMMARY`](/information-schema/information-schema-inspection-summary.md) - + [`KEY_COLUMN_USAGE`](/information-schema/information-schema-key-column-usage.md) - + [`METRICS_SUMMARY`](/information-schema/information-schema-metrics-summary.md) - + [`METRICS_TABLES`](/information-schema/information-schema-metrics-tables.md) - + [`PARTITIONS`](/information-schema/information-schema-partitions.md) - + [`PROCESSLIST`](/information-schema/information-schema-processlist.md) - + [`SCHEMATA`](/information-schema/information-schema-schemata.md) - + [`SEQUENCES`](/information-schema/information-schema-sequences.md) - + [`SESSION_VARIABLES`](/information-schema/information-schema-session-variables.md) - + [`SLOW_QUERY`](/information-schema/information-schema-slow-query.md) - + [`STATISTICS`](/information-schema/information-schema-statistics.md) - + [`TABLES`](/information-schema/information-schema-tables.md) - + [`TABLE_CONSTRAINTS`](/information-schema/information-schema-table-constraints.md) - + [`TABLE_STORAGE_STATS`](/information-schema/information-schema-table-storage-stats.md) - + [`TIDB_HOT_REGIONS`](/information-schema/information-schema-tidb-hot-regions.md) - + [`TIDB_INDEXES`](/information-schema/information-schema-tidb-indexes.md) - + [`TIDB_SERVERS_INFO`](/information-schema/information-schema-tidb-servers-info.md) - + [`TIFLASH_REPLICA`](/information-schema/information-schema-tiflash-replica.md) - + [`TIKV_REGION_PEERS`](/information-schema/information-schema-tikv-region-peers.md) - + [`TIKV_REGION_STATUS`](/information-schema/information-schema-tikv-region-status.md) - + [`TIKV_STORE_STATUS`](/information-schema/information-schema-tikv-store-status.md) - + [`USER_PRIVILEGES`](/information-schema/information-schema-user-privileges.md) - + [`VIEWS`](/information-schema/information-schema-views.md) - + [`METRICS_SCHEMA`](/metrics-schema.md) - + UI - + TiDB Dashboard - + [Overview](/dashboard/dashboard-intro.md) - + 
Maintain - + [Deploy](/dashboard/dashboard-ops-deploy.md) - + [Reverse Proxy](/dashboard/dashboard-ops-reverse-proxy.md) - + [Secure](/dashboard/dashboard-ops-security.md) - + [Access](/dashboard/dashboard-access.md) - + [Overview Page](/dashboard/dashboard-overview.md) - + [Cluster Info Page](/dashboard/dashboard-cluster-info.md) - + [Key Visualizer Page](/dashboard/dashboard-key-visualizer.md) - + SQL Statements Analysis - + [SQL Statements Page](/dashboard/dashboard-statement-list.md) - + [SQL Details Page](/dashboard/dashboard-statement-details.md) - + [Slow Queries Page](/dashboard/dashboard-slow-query.md) - + Cluster Diagnostics - + [Access Cluster Diagnostics Page](/dashboard/dashboard-diagnostics-access.md) - + [View Diagnostics Report](/dashboard/dashboard-diagnostics-report.md) - + [Use Diagnostics](/dashboard/dashboard-diagnostics-usage.md) - + [Search Logs Page](/dashboard/dashboard-log-search.md) - + [Profile Instances Page](/dashboard/dashboard-profiling.md) - + [FAQ](/dashboard/dashboard-faq.md) - + CLI - + [tikv-ctl](/tikv-control.md) - + [pd-ctl](/pd-control.md) - + [tidb-ctl](/tidb-control.md) - + [pd-recover](/pd-recover.md) - + Command Line Flags - + [tidb-server](/command-line-flags-for-tidb-configuration.md) - + [tikv-server](/command-line-flags-for-tikv-configuration.md) - + [tiflash-server](/tiflash/tiflash-command-line-flags.md) - + [pd-server](/command-line-flags-for-pd-configuration.md) - + Configuration File Parameters - + [tidb-server](/tidb-configuration-file.md) - + [tikv-server](/tikv-configuration-file.md) - + [tiflash-server](/tiflash/tiflash-configuration.md) - + [pd-server](/pd-configuration-file.md) - + [System Variables](/system-variables.md) - + Storage Engines - + TiKV - + [TiKV Overview](/tikv-overview.md) - + [RocksDB Overview](/storage-engine/rocksdb-overview.md) - + TiFlash - + [Overview](/tiflash/tiflash-overview.md) - + [Use TiFlash](/tiflash/use-tiflash.md) - + TiUP - + [Documentation Guide](/tiup/tiup-documentation-guide.md) - + [Overview](/tiup/tiup-overview.md) - + [Terminology and Concepts](/tiup/tiup-terminology-and-concepts.md) - + [Manage TiUP Components](/tiup/tiup-component-management.md) - + [FAQ](/tiup/tiup-faq.md) - + [Troubleshooting Guide](/tiup/tiup-troubleshooting-guide.md) - + TiUP Components - + [tiup-playground](/tiup/tiup-playground.md) - + [tiup-cluster](/tiup/tiup-cluster.md) - + [tiup-mirror](/tiup/tiup-mirror.md) - + [tiup-bench](/tiup/tiup-bench.md) - + [Telemetry](/telemetry.md) - + [Errors Codes](/error-codes.md) - + [TiCDC Overview](/ticdc/ticdc-overview.md) - + [TiCDC Open Protocol](/ticdc/ticdc-open-protocol.md) - + [Table Filter](/table-filter.md) - + [Schedule Replicas by Topology Labels](/schedule-replicas-by-topology-labels.md) ->>>>>>> 24f31fb... improve location-awareness (#3825) + FAQs - [TiDB FAQs](/faq/tidb-faq.md) - [TiDB Lightning FAQs](/tidb-lightning/tidb-lightning-faq.md) diff --git a/configure-placement-rules.md b/configure-placement-rules.md deleted file mode 100644 index ddaa9608bbbe7..0000000000000 --- a/configure-placement-rules.md +++ /dev/null @@ -1,436 +0,0 @@ ---- -title: Placement Rules -summary: Learn how to configure Placement Rules. -aliases: ['/docs/dev/configure-placement-rules/','/docs/dev/how-to/configure/placement-rules/'] ---- - -# Placement Rules - -> **Warning:** -> -> In the scenario of using TiFlash, the Placement Rules feature has been extensively tested and can be used in the production environment. 
Except for the scenario where TiFlash is used, using Placement Rules alone has not been extensively tested, so it is not recommended to enable this feature separately in the production environment. - -Placement Rules is an experimental feature of the Placement Driver (PD) introduced in v4.0. It is a replica rule system that guides PD to generate corresponding schedules for different types of data. By combining different scheduling rules, you can finely control the attributes of any continuous data range, such as the number of replicas, the storage location, the host type, whether to participate in Raft election, and whether to act as the Raft leader. - -## Rule system - -The configuration of the whole rule system consists of multiple rules. Each rule can specify attributes such as the number of replicas, the Raft role, the placement location, and the key range in which this rule takes effect. When PD is performing schedule, it first finds the rule corresponding to the Region in the rule system according to the key range of the Region, and then generates the corresponding schedule to make the distribution of the Region replica comply with the rule. - -The key ranges of multiple rules can have overlapping parts, which means that a Region can match multiple rules. In this case, PD decides whether the rules overwrite each other or take effect at the same time according to the attributes of rules. If multiple rules take effect at the same time, PD will generate schedules in sequence according to the stacking order of the rules for rule matching. - -In addition, to meet the requirement that rules from different sources are isolated from each other, these rules can be organized in a more flexible way. Therefore, the concept of "Group" is introduced. Generally, users can place rules in different groups according to different sources. - -![Placement rules overview](/media/placement-rules-1.png) - -### Rule fields - -The following table shows the meaning of each field in a rule: - -| Field name | Type and restriction | Description | -| :--- | :--- | :--- | -| `GroupID` | `string` | The group ID that marks the source of the rule. | -| `ID` | `string` | The unique ID of a rule in a group. | -| `Index` | `int` | The stacking sequence of rules in a group. | -| `Override` | `true`/`false` | Whether to overwrite rules with smaller index (in a group). | -| `StartKey` | `string`, in hexadecimal form | Applies to the starting key of a range. | -| `EndKey` | `string`, in hexadecimal form | Applies to the ending key of a range. | -| `Role` | `string` | Replica roles, including leader/follower/learner. | -| `Count` | `int`, positive integer | The number of replicas. | -| `LabelConstraint` | `[]Constraint` | Filers nodes based on the label. | -| `LocationLabels` | `[]string` | Used for physical isolation. | -| `IsolationLevel` | `string` | Used to set the minimum physical isolation level - -`LabelConstraint` is similar to the function in Kubernetes that filters labels based on these four primitives: `in`, `notIn`, `exists`, and `notExists`. The meanings of these four primitives are as follows: - -+ `in`: the label value of the given key is included in the given list. -+ `notIn`: the label value of the given key is not included in the given list. -+ `exists`: includes the given label key. -+ `notExists`: does not include the given label key. - -The meaning and function of `LocationLabels` are the same with those earlier than v4.0. 
For example, if you have deployed `[zone,rack,host]` that defines a three-layer topology: the cluster has multiple zones (Availability Zones), each zone has multiple racks, and each rack has multiple hosts. When performing schedule, PD first tries to place the Region's peers in different zones. If this try fails (such as there are three replicas but only two zones in total), PD guarantees to place these replicas in different racks. If the number of racks is not enough to guarantee isolation, then PD tries the host-level isolation. - -The meaning and function of `IsolationLevel` is elaborated in [Cluster topology configuration](/schedule-replicas-by-topology-labels.md). For example, if you have deployed `[zone,rack,host]` that defines a three-layer topology with `LocationLabels` and set `IsolationLevel` to `zone`, then PD ensures that all peers of each Region are placed in different zones during the scheduling. If the minimum isolation level restriction on `IsolationLevel` cannot be met (for example, 3 replicas are configured but there are only 2 data zones in total), PD will not try to make up to meet this restriction. The default value of `IsolationLevel` is an empty string, which means that it is disabled. - -### Fields of the rule group - -The following table shows the description of each field in a rule group: - -| Field name | Type and restriction | Description | -| :--- | :--- | :--- | -| `ID` | `string` | The group ID that marks the source of the rule. | -| `Index` | `int` | The stacking sequence of different groups. | -| `Override` | `true`/`false` | Whether to override groups with smaller indexes. | - -## Configure rules - -The operations in this section are based on [pd-ctl](/pd-control.md), and the commands involved in the operations also support calls via HTTP API. - -### Enable Placement Rules - -By default, the Placement Rules feature is disabled. To enable this feature, you can modify the PD configuration file as follows before initializing the cluster: - -{{< copyable "" >}} - -```toml -[replication] -enable-placement-rules = true -``` - -In this way, PD enables this feature after the cluster is successfully bootstrapped and generates corresponding rules according to the `max-replicas` and `location-labels` configurations: - -{{< copyable "" >}} - -```json -{ - "group_id": "pd", - "id": "default", - "start_key": "", - "end_key": "", - "role": "voter", - "count": 3, - "location_labels": ["zone", "rack", "host"], - "isolation_level": "" -} -``` - -For a bootstrapped cluster, you can also enable Placement Rules online through pd-ctl: - -{{< copyable "shell-regular" >}} - -```bash -pd-ctl config placement-rules enable -``` - -PD also generates default rules based on the `max-replicas` and `location-labels` configurations. - -> **Note:** -> -> After enabling Placement Rules, the previously configured `max-replicas` and `location-labels` no longer take effect. To adjust the replica policy, use the interface related to Placement Rules. - -### Disable Placement Rules - -You can use pd-ctl to disable the Placement Rules feature and switch to the previous scheduling strategy. - -{{< copyable "shell-regular" >}} - -```bash -pd-ctl config placement-rules disable -``` - -> **Note:** -> -> After disabling Placement Rules, PD uses the original `max-replicas` and `location-labels` configurations. The modification of rules (when Placement Rules is enabled) will not update these two configurations in real time. 
In addition, all the rules that have been configured remain in PD and will be used the next time you enable Placement Rules. - -### Set rules using pd-ctl - -> **Note:** -> -> The change of rules affects the PD scheduling in real time. Improper rule setting might result in fewer replicas and affect the high availability of the system. - -pd-ctl supports using the following methods to view rules in the system, and the output is a JSON-format rule or a rule list. - -- To view the list of all rules: - - {{< copyable "shell-regular" >}} - - ```bash - pd-ctl config placement-rules show - ``` - -- To view the list of all rules in a PD Group: - - {{< copyable "shell-regular" >}} - - ```bash - pd-ctl config placement-rules show --group=pd - ``` - -- To view the rule of a specific ID in a Group: - - {{< copyable "shell-regular" >}} - - ```bash - pd-ctl config placement-rules show --group=pd --id=default - ``` - -- To view the rule list that matches a Region: - - {{< copyable "shell-regular" >}} - - ```bash - pd-ctl config placement-rules show --region=2 - ``` - - In the above example, `2` is the Region ID. - -Adding rules and editing rules are similar. You need to write the corresponding rules into a file and then use the `save` command to save the rules to PD: - -{{< copyable "shell-regular" >}} - -```bash -cat > rules.json <}} - -```bash -cat > rules.json <}} - -```bash -pd-ctl config placement-rules load -``` - -Executing the above command saves all rules to the `rules.json` file. - -{{< copyable "shell-regular" >}} - -```bash -pd-ctl config placement-rules load --group=pd --out=rule.txt -``` - -The above command saves the rules of a PD Group to the `rules.txt` file. - -### Use pd-ctl to configure rule groups - -- To view the list of all rule groups: - - {{< copyable "shell-regular" >}} - - ```bash - pd-ctl config placement-rules rule-group show - ``` - -- To view the rule group of a specific ID: - - {{< copyable "shell-regular" >}} - - ```bash - pd-ctl config placement-rules rule-group show pd - ``` - -- To set the `index` and `override` attributes of the rule group: - - {{< copyable "shell-regular" >}} - - ```bash - pd-ctl config placement-rules rule-group set pd 100 true - ``` - -- To delete the configuration of a rule group (use the default group configuration if there is any rule in the group): - - {{< copyable "shell-regular" >}} - - ```bash - pd-ctl config placement-rules rule-group delete pd - ``` - -### Use tidb-ctl to query the table-related key range - -If you need special configuration for metadata or a specific table, you can execute the [`keyrange` command](https://github.com/pingcap/tidb-ctl/blob/master/doc/tidb-ctl_keyrange.md) in [tidb-ctl](https://github.com/pingcap/tidb-ctl) to query related keys. Remember to add `--encode` at the end of the command. 
- -{{< copyable "shell-regular" >}} - -```bash -tidb-ctl keyrange --database test --table ttt --encode -``` - -```text -global ranges: - meta: (6d00000000000000f8, 6e00000000000000f8) - table: (7400000000000000f8, 7500000000000000f8) -table ttt ranges: (NOTE: key range might be changed after DDL) - table: (7480000000000000ff2d00000000000000f8, 7480000000000000ff2e00000000000000f8) - table indexes: (7480000000000000ff2d5f690000000000fa, 7480000000000000ff2d5f720000000000fa) - index c2: (7480000000000000ff2d5f698000000000ff0000010000000000fa, 7480000000000000ff2d5f698000000000ff0000020000000000fa) - index c3: (7480000000000000ff2d5f698000000000ff0000020000000000fa, 7480000000000000ff2d5f698000000000ff0000030000000000fa) - index c4: (7480000000000000ff2d5f698000000000ff0000030000000000fa, 7480000000000000ff2d5f698000000000ff0000040000000000fa) - table rows: (7480000000000000ff2d5f720000000000fa, 7480000000000000ff2e00000000000000f8) -``` - -> **Note:** -> -> DDL and other operations can cause table ID changes, so you need to update the corresponding rules at the same time. - -## Typical usage scenarios - -This section introduces the typical usage scenarios of Placement Rules. - -### Scenario 1: Use three replicas for normal tables and five replicas for the metadata to improve cluster disaster tolerance - -You only need to add a rule that limits the key range to the range of metadata, and set the value of `count` to `5`. Here is an example of this rule: - -{{< copyable "" >}} - -```json -{ - "group_id": "pd", - "id": "meta", - "index": 1, - "override": true, - "start_key": "6d00000000000000f8", - "end_key": "6e00000000000000f8", - "role": "voter", - "count": "5", - "location_labels": ["zone", "rack", "host"] -} -``` - -### Scenario 2: Place five replicas in three data centers in the proportion of 2:2:1, and the Leader should not be in the third data center - -Create three rules. Set the number of replicas to `2`, `2`, and `1` respectively. Limit the replicas to the corresponding data centers through `label_constraints` in each rule. In addition, change `role` to `follower` for the data center that does not need a Leader. - -{{< copyable "" >}} - -```json -[ - { - "group_id": "pd", - "id": "zone1", - "start_key": "", - "end_key": "", - "role": "voter", - "count": 2, - "label_constraints": [ - {"key": "zone", "op": "in", "values": ["zone1"]} - ], - "location_labels": ["rack", "host"] - }, - { - "group_id": "pd", - "id": "zone2", - "start_key": "", - "end_key": "", - "role": "voter", - "count": 2, - "label_constraints": [ - {"key": "zone", "op": "in", "values": ["zone2"]} - ], - "location_labels": ["rack", "host"] - }, - { - "group_id": "pd", - "id": "zone3", - "start_key": "", - "end_key": "", - "role": "follower", - "count": 1, - "label_constraints": [ - {"key": "zone", "op": "in", "values": ["zone3"]} - ], - "location_labels": ["rack", "host"] - } -] -``` - -### Scenario 3: Add two TiFlash replicas for a table - -Add a separate rule for the row key of the table and limit `count` to `2`. Use `label_constraints` to ensure that the replicas are generated on the node of `engine = tiflash`. Note that a separate `group_id` is used here to ensure that this rule does not overlap or conflict with rules from other sources in the system. 
- -{{< copyable "" >}} - -```json -{ - "group_id": "tiflash", - "id": "learner-replica-table-ttt", - "start_key": "7480000000000000ff2d5f720000000000fa", - "end_key": "7480000000000000ff2e00000000000000f8", - "role": "learner", - "count": 2, - "label_constraints": [ - {"key": "engine", "op": "in", "values": ["tiflash"]} - ], - "location_labels": ["host"] -} -``` - -### Scenario 4: Add two follower replicas for a table in the Beijing node with high-performance disks - -The following example shows a more complicated `label_constraints` configuration. In this rule, the replicas must be placed in the `bj1` or `bj2` machine room, and the disk type must not be `hdd`. - -{{< copyable "" >}} - -```json -{ - "group_id": "follower-read", - "id": "follower-read-table-ttt", - "start_key": "7480000000000000ff2d00000000000000f8", - "end_key": "7480000000000000ff2e00000000000000f8", - "role": "follower", - "count": 2, - "label_constraints": [ - {"key": "zone", "op": "in", "values": ["bj1", "bj2"]}, - {"key": "disk", "op": "notIn", "values": ["hdd"]} - ], - "location_labels": ["host"] -} -``` - -### Scenario 5: Migrate a table to the TiFlash cluster - -Different from scenario 3, this scenario is not to add new replica(s) on the basis of the existing configuration, but to forcibly override the other configuration of a data range. So you need to specify an `index` value large enough and set `override` to `true` in the rule group configuration to override the existing rule. - -The rule: - -{{< copyable "" >}} - -```json -{ - "group_id": "tiflash-override", - "id": "learner-replica-table-ttt", - "start_key": "7480000000000000ff2d5f720000000000fa", - "end_key": "7480000000000000ff2e00000000000000f8", - "role": "voter", - "count": 3, - "label_constraints": [ - {"key": "engine", "op": "in", "values": ["tiflash"]} - ], - "location_labels": ["host"] -} -``` - -The rule group: - -{{< copyable "" >}} - -```json -{ - "id": "tiflash-override", - "index": 1024, - "override": true, -} -``` diff --git a/location-awareness.md b/location-awareness.md deleted file mode 100644 index 5c9373ee35c37..0000000000000 --- a/location-awareness.md +++ /dev/null @@ -1,91 +0,0 @@ ---- -title: Cluster Topology Configuration -summary: Learn to configure cluster topology to maximize the capacity for disaster recovery. -aliases: ['/docs/v2.1/location-awareness/','/docs/v2.1/how-to/deploy/geographic-redundancy/location-awareness/'] ---- - -# Cluster Topology Configuration - -## Overview - -PD schedules according to the topology of the TiKV cluster to maximize the TiKV's capability for disaster recovery. - -Before you begin, see [Deploy TiDB Using TiDB Ansible (Recommended)](/online-deployment-using-ansible.md) and [Deploy TiDB Using Docker](/test-deployment-using-docker.md). - -## TiKV reports the topological information - -TiKV reports the topological information to PD according to the startup parameter or configuration of TiKV. - -Assuming that the topology has three structures: zone > rack > host, use lables to specify the following information: - -Startup parameter: - -``` -tikv-server --labels zone=,rack=,host= -``` - -Configuration: - -``` toml -[server] -labels = "zone=,rack=,host=" -``` - -## PD understands the TiKV topology - -PD gets the topology of TiKV cluster through the PD configuration. - -``` toml -[replication] -max-replicas = 3 -location-labels = ["zone", "rack", "host"] -``` - -`location-labels` needs to correspond to the TiKV `labels` name so that PD can understand that the `labels` represents the TiKV topology. 
- -> **Note:** -> -> You must configure `location-labels` for PD and `labels` for TiKV at the same time for `labels` to take effect. - -## PD schedules based on the TiKV topology - -PD makes optimal scheduling according to the topological information. You just need to care about what kind of topology can achieve the desired effect. - -If you use 3 replicas and hope that the TiDB cluster is always highly available even when a data zone goes down, you need at least 4 data zones. - -Assume that you have 4 data zones, each zone has 2 racks, and each rack has 2 hosts. You can start 2 TiKV instances on each host: - -``` -# zone=z1 -tikv-server --labels zone=z1,rack=r1,host=h1 -tikv-server --labels zone=z1,rack=r1,host=h2 -tikv-server --labels zone=z1,rack=r2,host=h1 -tikv-server --labels zone=z1,rack=r2,host=h2 - -# zone=z2 -tikv-server --labels zone=z2,rack=r1,host=h1 -tikv-server --labels zone=z2,rack=r1,host=h2 -tikv-server --labels zone=z2,rack=r2,host=h1 -tikv-server --labels zone=z2,rack=r2,host=h2 - -# zone=z3 -tikv-server --labels zone=z3,rack=r1,host=h1 -tikv-server --labels zone=z3,rack=r1,host=h2 -tikv-server --labels zone=z3,rack=r2,host=h1 -tikv-server --labels zone=z3,rack=r2,host=h2 - -# zone=z4 -tikv-server --labels zone=z4,rack=r1,host=h1 -tikv-server --labels zone=z4,rack=r1,host=h2 -tikv-server --labels zone=z4,rack=r2,host=h1 -tikv-server --labels zone=z4,rack=r2,host=h2 -``` - -In other words, 16 TiKV instances are distributed across 4 data zones, 8 racks and 16 machines. - -In this case, PD will schedule different replicas of each datum to different data zones. - -- If one of the data zones goes down, the high availability of the TiDB cluster is not affected. -- If the data zone cannot recover within a period of time, PD will remove the replica from this data zone. - -To sum up, PD maximizes the disaster recovery of the cluster according to the current topology. Therefore, if you want to reach a certain level of disaster recovery, deploy many machines in different sites according to the topology. The number of machines must be more than the number of `max-replicas`. diff --git a/pd-configuration-file.md b/pd-configuration-file.md index 46933d0c504ce..f84131eb5fad9 100644 --- a/pd-configuration-file.md +++ b/pd-configuration-file.md @@ -56,282 +56,7 @@ PD is configurable using command-line flags and environment variables. pd1=http://192.168.100.113:2380, pd2=http://192.168.100.114:2380, pd3=192.168.100.115:2380 ``` -<<<<<<< HEAD ## `--join` -======= -### `initial-cluster-state` - -+ The initial state of the cluster -+ Default value: `"new"` - -### `initial-cluster-token` - -+ Identifies different clusters during the bootstrap phase. -+ Default value: `"pd-cluster"` -+ If multiple clusters that have nodes with same configurations are deployed successively, you must specify different tokens to isolate different cluster nodes. - -### `lease` - -+ The timeout of the PD Leader Key lease. After the timeout, the system re-elects a Leader. 
-+ Default value: `3` -+ Unit: second - -### `tso-save-interval` - -+ The interval for PD to allocate TSOs for persistent storage in etcd -+ Default value: `3` -+ Unit: second - -### `enable-prevote` - -+ Enables or disables `raft prevote` -+ Default value: `true` - -### `quota-backend-bytes` - -+ The storage size of the meta-information database, which is 2GB by default -+ Default value: `2147483648` - -### `auto-compaction-mod` - -+ The automatic compaction modes of the meta-information database -+ Available options: `periodic` (by cycle) and `revision` (by version number). -+ Default value: `periodic` - -### `auto-compaction-retention` - -+ The time interval for automatic compaction of the meta-information database when `auto-compaction-retention` is `periodic`. When the compaction mode is set to `revision`, this parameter indicates the version number for the automatic compaction. -+ Default value: 1h - -### `force-new-cluster` - -+ Determines whether to force PD to start as a new cluster and modify the number of Raft members to `1` -+ Default value: `false` - -### `tick-interval` - -+ The tick period of etcd Raft -+ Default value: `100ms` - -### `election-interval` - -+ The timeout for the etcd leader election -+ Default value: `3s` - -### `use-region-storage` - -+ Enables or disables independent Region storage -+ Default value: `false` - -## security - -Configuration items related to security - -### `cacert-path` - -+ The path of the CA file -+ Default value: "" - -### `cert-path` - -+ The path of the Privacy Enhanced Mail (PEM) file that contains the X509 certificate -+ Default value: "" - -### `key-path` - -+ The path of the PEM file that contains the X509 key -+ Default value: "" - -## `log` - -Configuration items related to log - -### `format` - -+ The log format, which can be specified as "text", "json", or "console" -+ Default value: `text` - -### `disable-timestamp` - -+ Whether to disable the automatically generated timestamp in the log -+ Default value: `false` - -## `log.file` - -Configuration items related to the log file - -### `max-size` - -+ The maximum size of a single log file. When this value is exceeded, the system automatically splits the log into several files. -+ Default value: `300` -+ Unit: MiB -+ Minimum value: `1` - -### `max-days` - -+ The maximum number of days in which a log is kept -+ Default value: `28` -+ Minimum value: `1` - -### `max-backups` - -+ The maximum number of log files to keep -+ Default value: `7` -+ Minimum value: `1` - -## `metric` - -Configuration items related to monitoring - -### `interval` - -+ The interval at which monitoring metric data is pushed to Promethus -+ Default value: `15s` - -## `schedule` - -Configuration items related to scheduling - -### `max-merge-region-size` - -+ Controls the size limit of `Region Merge`. When the Region size is greater than the specified value, PD does not merge the Region with the adjacent Regions. -+ Default value: `20` - -### `max-merge-region-keys` - -+ Specifies the upper limit of the `Region Merge` key. When the Region key is greater than the specified value, the PD does not merge the Region with its adjacent Regions. -+ Default value: `200000` - -### `patrol-region-interval` - -+ Controls the running frequency at which `replicaChecker` checks the health state of a Region. The smaller this value is, the faster `replicaChecker` runs. Normally, you do not need to adjust this parameter. 
-+ Default value: `100ms` - -### `split-merge-interval` - -+ Controls the time interval between the `split` and `merge` operations on the same Region. That means a newly split Region will not be merged for a while. -+ Default value: `1h` - -### `max-snapshot-count` - -+ Control the maximum number of snapshots that a single store receives or sends at the same time. PD schedulers depend on this configuration to prevent the resources used for normal traffic from being preempted. -+ Default value value: `3` - -### `max-pending-peer-count` - -+ Controls the maximum number of pending peers in a single store. PD schedulers depend on this configuration to prevent too many Regions with outdated logs from being generated on some nodes. -+ Default value: `16` - -### `max-store-down-time` - -+ The downtime after which PD judges that the disconnected store can not be recovered. When PD fails to receive the heartbeat from a store after the specified period of time, it adds replicas at other nodes. -+ Default value: `30m` - -### `leader-schedule-limit` - -+ The number of Leader scheduling tasks performed at the same time -+ Default value: `4` - -### `region-schedule-limit` - -+ The number of Region scheduling tasks performed at the same time -+ Default value: `2048` - -### `replica-schedule-limit` - -+ The number of Replica scheduling tasks performed at the same time -+ Default value: `64` - -### `merge-schedule-limit` - -+ The number of the `Region Merge` scheduling tasks performed at the same time. Set this parameter to `0` to disable `Region Merge`. -+ Default value: `8` - -### `high-space-ratio` - -+ The threshold ratio below which the capacity of the store is sufficient -+ Default value: `0.7` -+ Minimum value: greater than `0` -+ Maximum value: less than `1` - -### `low-space-ratio` - -+ The threshold ratio above which the capacity of the store is insufficient -+ Default value: `0.8` -+ Minimum value: greater than `0` -+ Maximum value: less than `1` - -### `tolerant-size-ratio` - -+ Controls the `balance` buffer size -+ Default value: `0` (automatically adjusts the buffer size) -+ Minimum value: `0` - -### `disable-remove-down-replica` - -+ Determines whether to disable the feature that automatically removes `DownReplica`. When this parameter is set to `true`, PD does not automatically clean up the copy in the down state. -+ Default value: `false` - -### `disable-replace-offline-replica` - -+ Determines whether to disable the feature that migrates `OfflineReplica`. When this parameter is set to `true`, PD does not migrate the replicas in the offline state. -+ Default value: `false` - -### `disable-make-up-replica` - -+ Determines whether to disable the feature that automatically supplements replicas. When this parameter is set to `true`, PD does not supplement replicas for the Region with insufficient replicas. -+ Default value: `false` - -### `disable-remove-extra-replica` - -+ Determines whether to disable the feature that removes extra replicas. When this parameter is set to `true`, PD does not remove the extra replicas from the Region with excessive replicas. -+ Default value: `false` - -### `disable-location-replacement` - -+ Determines whether to disable isolation level check. When this parameter is set to `true`, PD does not increase the isolation level of the Region replicas through scheduling. 
-+ Default value: `false` - -### `store-balance-rate` - -+ Determines the maximum number of operations related to adding peers within a minute -+ Default value: `15` - -## `replication` - -Configuration items related to replicas - -### `max-replicas` - -+ The number of replicas -+ Default value: `3` - -### `location-labels` - -+ The topology information of a TiKV cluster -+ Default value: `[]` -+ [Cluster topology configuration](/schedule-replicas-by-topology-labels.md) - -### `isolation-level` - -+ The minimum topological isolation level of a TiKV cluster -+ Default value: `""` -+ [Cluster topology configuration](/schedule-replicas-by-topology-labels.md) - -### `strictly-match-label` - -+ Enables the strict check for whether the TiKV label matches PD's `location-labels`. -+ Default value: `false` - -### `enable-placement-rules` - -+ Enables `placement-rules`. -+ Default value: `false` -+ See [Placement Rules](/configure-placement-rules.md). -+ An experimental feature of TiDB 4.0. - -## `label-property` ->>>>>>> 24f31fb... improve location-awareness (#3825) - Join the cluster dynamically - Default: "" diff --git a/schedule-replicas-by-topology-labels.md b/schedule-replicas-by-topology-labels.md index 546bab38743aa..8e82085c55ebf 100644 --- a/schedule-replicas-by-topology-labels.md +++ b/schedule-replicas-by-topology-labels.md @@ -1,14 +1,14 @@ --- title: Schedule Replicas by Topology Labels summary: Learn how to schedule replicas by topology labels. -aliases: ['/docs/dev/location-awareness/','/docs/dev/how-to/deploy/geographic-redundancy/location-awareness/','/tidb/dev/location-awareness'] +aliases: ['/docs/v2.1/location-awareness/','/docs/v2.1/how-to/deploy/geographic-redundancy/location-awareness/','/tidb/v2.1/location-awareness'] --- # Schedule Replicas by Topology Labels To improve the high availability and disaster recovery capability of TiDB clusters, it is recommended that TiKV nodes are physically scattered as much as possible. For example, TiKV nodes can be distributed on different racks or even in different data centers. According to the topology information of TiKV, the PD scheduler automatically performs scheduling at the background to isolate each replica of a Region as much as possible, which maximizes the capability of disaster recovery. -To make this mechanism effective, you need to properly configure TiKV and PD so that the topology information of the cluster, especially the TiKV location information, is reported to PD during deployment. Before you begin, see [Deploy TiDB Using TiUP](/production-deployment-using-tiup.md) first. +To make this mechanism effective, you need to properly configure TiKV and PD so that the topology information of the cluster, especially the TiKV location information, is reported to PD during deployment. Before you begin, see [Deploy TiDB Using TiDB Ansible](/online-deployment-using-ansible.md) first. ## Configure `labels` based on the cluster topology @@ -62,94 +62,7 @@ The `location-labels` configuration is an array of strings, and each item corres > > You must configure `location-labels` for PD and `labels` for TiKV at the same time for the configurations to take effect. Otherwise, PD does not perform scheduling according to the topology. -### Configure `isolation-level` for PD - -If `location-labels` has been configured, you can further enhance the topological isolation requirements on TiKV clusters by configuring `isolation-level` in the PD configuration file. 
- -Assume that you have made a three-layer cluster topology by configuring `location-labels` according to the instructions above: zone -> rack -> host, you can configure the `isolation-level` to `zone` as follows: - -{{< copyable "" >}} - -```toml -[replication] -isolation-level = "zone" -``` - -If the PD cluster is already initialized, you need to use the pd-ctl tool to make online changes: - -{{< copyable "shell-regular" >}} - -```bash -pd-ctl config set isolation-level zone -``` - -The `location-level` configuration is an array of strings, which needs to correspond to a key of `location-labels`. This parameter limits the minimum and mandatory isolation level requirements on TiKV topology clusters. - -> **Note:** -> -> `isolation-level` is empty by default, which means there is no mandatory restriction on the isolation level. To set it, you need to configure `location-labels` for PD and ensure that the value of `isolation-level` is one of `location-labels` names. - -### Configure a cluster using TiUP (recommended) - -When using TiUP to deploy a cluster, you can configure the TiKV location in the [initialization configuration file](/production-deployment-using-tiup.md#step-3-edit-the-initialization-configuration-file). TiUP will generate the corresponding TiKV and PD configuration files during deployment. - -In the following example, a two-layer topology of `zone/host` is defined. The TiKV nodes of the cluster are distributed among three zones, each zone with two hosts. In z1, two TiKV instances are deployed per host. In z2 and z3, one TiKV instance is deployed per host. In the following example, `tikv-n` represents the IP address of the `n`th TiKV node. - -``` -server_configs: - pd: - replication.location-labels: ["zone", "host"] - -tikv_servers: -# z1 - - host: tikv-1 - config: - server.labels: - zone: z1 - host: h1 - - host: tikv-2 - config: - server.labels: - zone: z1 - host: h1 - - host: tikv-3 - config: - server.labels: - zone: z1 - host: h2 - - host: tikv-4 - config: - server.labels: - zone: z1 - host: h2 -# z2 - - host: tikv-5 - config: - server.labels: - zone: z2 - host: h1 - - host: tikv-6 - config: - server.labels: - zone: z2 - host: h2 -# z3 - - host: tikv-7 - config: - server.labels: - zone: z3 - host: h1 - - host: tikv-8 - config: - server.labels: - zone: z3 - host: h2s -``` - -For details, see [Geo-distributed Deployment topology](/geo-distributed-deployment-topology.md). - -
- Configure a cluster using TiDB Ansible +### Configure a cluster using TiDB Ansible When using TiDB Ansible to deploy a cluster, you can directly configure the TiKV location in the `inventory.ini` file. `tidb-ansible` will generate the corresponding TiKV and PD configuration files during deployment. @@ -173,8 +86,6 @@ tikv-8 labels="zone=z3,host=h2" location_labels = ["zone", "host"] ``` -
- ## PD schedules based on topology label PD schedules replicas according to the label layer to make sure that different replicas of the same data are scattered as much as possible. @@ -185,10 +96,6 @@ Assume that the number of cluster replicas is 3 (`max-replicas=3`). Because ther Then, assume that the number of cluster replicas is 5 (`max-replicas=5`). Because there are only 3 zones in total, PD cannot guarantee the isolation of each replica at the zone level. In this situation, the PD scheduler will ensure replica isolation at the host level. In other words, multiple replicas of a Region might be distributed in the same zone but not on the same host. -In the case of the 5-replica configuration, if z3 fails or is isolated as a whole, and cannot be recovered after a period of time (controlled by `max-store-down-time`), PD will make up the 5 replicas through scheduling. At this time, only 3 hosts are available. This means that host-level isolation cannot be guaranteed and that multiple replicas might be scheduled to the same host. But if the `isolation-level` value is set to `zone` instead of being left empty, this specifies the minimum physical isolation requirements for Region replicas. That is to say, PD will ensure that replicas of the same Region are scattered among different zones. PD will not perform corresponding scheduling even if following this isolation restriction does not meet the requirement of `max-replicas` for multiple replicas. - -For example, a TiKV cluster is distributed across three data zones z1, z2, and z3. Each Region has three replicas as required, and PD distributes the three replicas of the same Region to these three data zones respectively. If a power outage occurs in z1 and cannot be recovered after a period of time, PD determines that the Region replicas on z1 are no longer available. However, because `isolation-level` is set to `zone`, PD needs to strictly guarantee that different replicas of the same Region will not be scheduled on the same data zone. Because both z2 and z3 already have replicas, PD will not perform any scheduling under the minimum isolation level restriction of `isolation-level`, even if there are only two replicas at this moment. - -Similarly, when `isolation-level` is set to `rack`, the minimum isolation level applies to different racks in the same data center. With this configuration, the isolation at the zone layer is guaranteed first if possible. When the isolation at the zone level cannot be guaranteed, PD tries to avoid scheduling different replicas to the same rack in the same zone. The scheduling works similarly when `isolation-level` is set to `host` where PD first guarantees the isolation level of rack, and then the level of host. +In the case of the 5-replica configuration, if z3 fails or is isolated as a whole, and cannot be recovered after a period of time (controlled by `max-store-down-time`), PD will make up the 5 replicas through scheduling. At this time, only 3 hosts are available. This means that host-level isolation cannot be guaranteed and that multiple replicas might be scheduled to the same host. -In summary, PD maximizes the disaster recovery of the cluster according to the current topology. Therefore, if you want to achieve a certain level of disaster recovery, deploy more machines on different sites according to the topology than the number of `max-replicas`. 
TiDB also provides mandatory configuration items such as `isolation-level` for you to more flexibly control the topological isolation level of data according to different scenarios. +In summary, PD maximizes the disaster recovery of the cluster according to the current topology. Therefore, if you want to achieve a certain level of disaster recovery, deploy more machines on different sites according to the topology than the number of `max-replicas`. diff --git a/tidb-scheduling.md b/tidb-scheduling.md deleted file mode 100644 index 84e9cae9f2225..0000000000000 --- a/tidb-scheduling.md +++ /dev/null @@ -1,141 +0,0 @@ ---- -title: TiDB Scheduling -summary: Introduces the PD scheduling component in a TiDB cluster. ---- - -# TiDB Scheduling - -PD works as the manager in a TiDB cluster, and it also schedules Regions in the cluster. This article introduces the design and core concepts of the PD scheduling component. - -## Scheduling situations - -TiKV is the distributed key-value storage engine used by TiDB. In TiKV, data is organized as Regions, which are replicated on several stores. In all replicas, a leader is responsible for reading and writing, and followers are responsible for replicating Raft logs from the leader. - -Now consider about the following situations: - -* To utilize storage space in a high-efficient way, multiple Replicas of the same Region need to be properly distributed on different nodes according to the Region size; -* For multiple data center topologies, one data center failure only causes one replica to fail for all Regions; -* When a new TiKV store is added, data can be rebalanced to it; -* When a TiKV store fails, PD needs to consider: - * Recovery time of the failed store. - * If it's short (e.g. the service is restarted), whether scheduling is necessary or not. - * If it's long (e.g. disk fault, data is lost, etc.), how to do scheduling. - * Replicas of all Regions. - * If the number of replicas is not enough for some Regions, PD needs to complete them. - * If the number of replicas is more than expected (e.g. the failed store re-joins into the cluster after recovery), PD needs to delete them. -* Read/Write operations are performed on leaders, which can not be distributed only on a few individual stores; -* Not all Regions are hot, so loads of all TiKV stores need to be balanced; -* When Regions are in balancing, data transferring utilizes much network/disk traffic and CPU time, which can influence online services. - -These situations can occur at the same time, which makes it harder to resolve. Also, the whole system is changing dynamically, so a scheduler is needed to collect all information about the cluster, and then adjust the cluster. So, PD is introduced into the TiDB cluster. - -## Scheduling requirements - -The above situations can be classified into two types: - -1. A distributed and highly available storage system must meet the following requirements: - - * The right number of replicas. - * Replicas need to be distributed on different machines according to different topologies. - * The cluster can perform automatic disaster recovery from TiKV peers' failure. - -2. A good distributed system needs to have the following optimizations: - - * All Region leaders are distributed evenly on stores; - * Storage capacity of all TiKV peers are balanced; - * Hot spots are balanced; - * Speed of load balancing for the Regions needs to be limited to ensure that online services are stable; - * Maintainers are able to to take peers online/offline manually. 
- -After the first type of requirements is satisfied, the system will be failure tolerable. After the second type of requirements is satisfied, resources will be utilized more efficiently and the system will have better scalability. - -To achieve the goals, PD needs to collect information firstly, such as state of peers, information about Raft groups and the statistics of accessing the peers. Then we need to specify some strategies for PD, so that PD can make scheduling plans from these information and strategies. Finally, PD distributes some operators to TiKV peers to complete scheduling plans. - -## Basic scheduling operators - -All scheduling plans contain three basic operators: - -* Add a new replica -* Remove a replica -* Transfer a Region leader between replicas in a Raft group - -They are implemented by the Raft commands `AddReplica`, `RemoveReplica`, and `TransferLeader`. - -## Information collection - -Scheduling is based on information collection. In short, the PD scheduling component needs to know the states of all TiKV peers and all Regions. TiKV peers report the following information to PD: - -- State information reported by each TiKV peer: - - Each TiKV peer sends heartbeats to PD periodically. PD not only checks whether the store is alive, but also collects [`StoreState`](https://github.com/pingcap/kvproto/blob/release-3.1/proto/pdpb.proto#L421) in the heartbeat message. `StoreState` includes: - - * Total disk space - * Available disk space - * The number of Regions - * Data read/write speed - * The number of snapshots that are sent/received (The data might be replicated between replicas through snapshots) - * Whether the store is overloaded - * Labels (See [Perception of Topology](/schedule-replicas-by-topology-labels.md)) - -- Information reported by Region leaders: - - Each Region leader sends heartbeats to PD periodically to report [`RegionState`](https://github.com/pingcap/kvproto/blob/release-3.1/proto/pdpb.proto#L271), including: - - * Position of the leader itself - * Positions of other replicas - * The number of offline replicas - * data read/write speed - -PD collects cluster information by the two types of heartbeats and then makes decision based on it. - -Besides, PD can get more information from an expanded interface to make a more precise decision. For example, if a store's heartbeats are broken, PD can't know whether the peer steps down temporarily or forever. It just waits a while (by default 30min) and then treats the store as offline if there are still no heartbeats received. Then PD balances all regions on the store to other stores. - -But sometimes stores are manually set offline by a maintainer, so the maintainer can tell PD this by the PD control interface. Then PD can balance all regions immediately. - -## Scheduling strategies - -After collecting the information, PD needs some strategies to make scheduling plans. - -**Strategy 1: The number of replicas of a Region needs to be correct** - -PD can know that the replica count of a Region is incorrect from the Region leader's heartbeat. If it happens, PD can adjust the replica count by adding/removing replica(s). The reason for incorrect replica count can be: - -* Store failure, so some Region's replica count is less than expected; -* Store recovery after failure, so some Region's replica count could be more than expected; -* [`max-replicas`](https://github.com/pingcap/pd/blob/v4.0.0-beta/conf/config.toml#L95) is changed. 
- -**Strategy 2: Replicas of a Region need to be at different positions** - -Note that here "position" is different from "machine". Generally PD can only ensure that replicas of a Region are not at a same peer to avoid that the peer's failure causes more than one replicas to become lost. However in production, you might have the following requirements: - -* Multiple TiKV peers are on one machine; -* TiKV peers are on multiple racks, and the system is expected to be available even if a rack fails; -* TiKV peers are in multiple data centers, and the system is expected to be available even if a data center fails; - -The key to these requirements is that peers can have the same "position", which is the smallest unit for failure toleration. Replicas of a Region must not be in one unit. So, we can configure [labels](https://github.com/tikv/tikv/blob/v4.0.0-beta/etc/config-template.toml#L140) for the TiKV peers, and set [location-labels](https://github.com/pingcap/pd/blob/v4.0.0-beta/conf/config.toml#L100) on PD to specify which labels are used for marking positions. - -**Strategy 3: Replicas need to be balanced between stores** - -The size limit of a Region replica is fixed, so keeping the replicas balanced between stores is helpful for data size balance. - -**Strategy 4: Leaders need to be balanced between stores** - -Read and write operations are performed on leaders according to the Raft protocol, so that PD needs to distribute leaders into the whole cluster instead of several peers. - -**Strategy 5: Hot spots need to be balanced between stores** - -PD can detect hot spots from store heartbeats and Region heartbeats, so that PD can distribute hot spots. - -**Strategy 6: Storage size needs to be balanced between stores** - -When started up, a TiKV store reports `capacity` of storage, which indicates the store's space limit. PD will consider this when scheduling. - -**Strategy 7: Adjust scheduling speed to stabilize online services** - -Scheduling utilizes CPU, memory, network and I/O traffic. Too much resource utilization will influence online services. Therefore, PD needs to limit the number of the concurrent scheduling tasks. By default this strategy is conservative, while it can be changed if quicker scheduling is required. - -## Scheduling implementation - -PD collects cluster information from store heartbeats and Region heartbeats, and then makes scheduling plans from the information and strategies. Scheduling plans are a sequence of basic operators. Every time PD receives a Region heartbeat from a Region leader, it checks whether there is a pending operator on the Region or not. If PD needs to dispatch a new operator to a Region, it puts the operator into heartbeat responses, and monitors the operator by checking follow-up Region heartbeats. - -Note that here "operators" are only suggestions to the Region leader, which can be skipped by Regions. Leader of Regions can decide whether to skip a scheduling operator or not based on its current status.
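To make the heartbeat-driven dispatch described above more concrete, the following self-contained Go sketch models the flow: PD keeps at most one pending operator per Region, attaches it to the Region heartbeat response, and keeps monitoring follow-up heartbeats until the operator is done. All type and function names here are invented for illustration and do not correspond to the real PD or kvproto definitions, and the decision logic is deliberately trivial.

```go
package main

import "fmt"

// Illustrative types only -- not the real PD or kvproto definitions.

type OperatorKind string

const (
	AddReplica     OperatorKind = "add-replica"
	RemoveReplica  OperatorKind = "remove-replica"
	TransferLeader OperatorKind = "transfer-leader"
)

type Operator struct {
	Kind        OperatorKind
	RegionID    uint64
	TargetStore uint64
}

type RegionHeartbeat struct {
	RegionID uint64
	Leader   uint64   // store ID of the current leader
	Peers    []uint64 // store IDs that currently hold a replica
}

type HeartbeatResponse struct {
	Operator *Operator // nil when PD has nothing to dispatch
}

// scheduler keeps at most one pending operator per Region.
type scheduler struct {
	maxReplicas int
	pending     map[uint64]*Operator
}

// handleHeartbeat mirrors the described flow: check for a pending operator,
// clear it if a follow-up heartbeat shows it has completed, and otherwise
// possibly dispatch a new one.
func (s *scheduler) handleHeartbeat(hb RegionHeartbeat) HeartbeatResponse {
	if op, ok := s.pending[hb.RegionID]; ok {
		if operatorDone(op, hb) {
			delete(s.pending, hb.RegionID)
		} else {
			// Keep suggesting the operator; the Region leader may still skip it.
			return HeartbeatResponse{Operator: op}
		}
	}
	// Trivial stand-in for a real strategy: make up a missing replica on store 4.
	if len(hb.Peers) < s.maxReplicas {
		op := &Operator{Kind: AddReplica, RegionID: hb.RegionID, TargetStore: 4}
		s.pending[hb.RegionID] = op
		return HeartbeatResponse{Operator: op}
	}
	return HeartbeatResponse{}
}

// operatorDone checks whether the Region state reported in a heartbeat
// already satisfies the pending operator.
func operatorDone(op *Operator, hb RegionHeartbeat) bool {
	switch op.Kind {
	case AddReplica:
		for _, p := range hb.Peers {
			if p == op.TargetStore {
				return true
			}
		}
	case TransferLeader:
		return hb.Leader == op.TargetStore
	}
	return false
}

func main() {
	s := &scheduler{maxReplicas: 3, pending: map[uint64]*Operator{}}

	// First heartbeat: the Region has only two replicas, so an operator is dispatched.
	r1 := s.handleHeartbeat(RegionHeartbeat{RegionID: 2, Leader: 1, Peers: []uint64{1, 2}})
	fmt.Printf("first heartbeat:  %+v\n", r1.Operator)

	// Follow-up heartbeat: the new replica is present, so the operator is cleared.
	r2 := s.handleHeartbeat(RegionHeartbeat{RegionID: 2, Leader: 1, Peers: []uint64{1, 2, 4}})
	fmt.Printf("second heartbeat: %+v\n", r2.Operator)
}
```

In the real system, the decision step is driven by the strategies listed above rather than a single replica-count check, and the operator travels inside PD's Region heartbeat response, but the one-pending-operator-per-Region bookkeeping is the essential shape of the mechanism.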