Add scheduler concept guide (#14637)

* Add scheduler concept page
* Rename scheduling overview
* Fix non-ASCII colon symbols
* Reword scheduler concept page
* Move scheduler performance tuning into scheduling
* Signpost from overview to kube-scheduler, etc
* Add whatsnext section to scheduler concept
* Restructure scheduling concept

  Now there's a concept page for scheduling, some of the details in the
  performance tuning page can have a better home.
* Omit link to (unwritten) scheduling extensions page
* Drop deprecated / superseded filtering rules
kubernetes · Jul 9, 2019 · d550d11
1 parent 6b0711a
commit d550d11
Showing 5 changed files with 211 additions and 10 deletions.
diff --git a/content/en/docs/concepts/overview/components.md b/content/en/docs/concepts/overview/components.md
@@ -111,5 +111,8 @@ A [Cluster-level logging](/docs/concepts/cluster-administration/logging/) mechan
 saving container logs to a central log store with search/browsing interface.
 
 {{% /capture %}}
-
-
+{{% capture whatsnext %}}
+* Learn about [Nodes](/docs/concepts/architecture/nodes/)
+* Learn about [kube-scheduler](/docs/concepts/scheduling/kube-scheduler/)
+* Read etcd's official [documentation](https://etcd.io/docs/)
+{{% /capture %}}
diff --git a/content/en/docs/concepts/scheduling/_index.md b/content/en/docs/concepts/scheduling/_index.md
@@ -0,0 +1,5 @@
+---
+title: "Scheduling"
+weight: 90
+---
+
diff --git a/content/en/docs/concepts/scheduling/kube-scheduler.md b/content/en/docs/concepts/scheduling/kube-scheduler.md
@@ -0,0 +1,186 @@
+---
+title: Kubernetes Scheduler
+content_template: templates/concept
+weight: 60
+---
+
+{{% capture overview %}}
+
+In Kubernetes, _scheduling_ refers to making sure that {{< glossary_tooltip text="Pods" term_id="pod" >}}
+are matched to {{< glossary_tooltip text="Nodes" term_id="node" >}} so that
+{{< glossary_tooltip term_id="kubelet" >}} can run them.
+
+{{% /capture %}}
+
+{{% capture body %}}
+
+## Scheduling overview {#scheduling}
+
+A scheduler watches for newly created Pods that have no Node assigned. For
+every Pod that the scheduler discovers, the scheduler becomes responsible
+for finding the best Node for that Pod to run on. The scheduler reaches
+this placement decision taking into account the scheduling principles
+described below.
+
+If you want to understand why Pods are placed onto a particular Node,
+or if you're planning to implement a custom scheduler yourself, this
+page will help you learn about scheduling.
+
+## kube-scheduler
+
+[kube-scheduler](https://kubernetes.io/docs/reference/command-line-tools-reference/kube-scheduler/)
+is the default scheduler for Kubernetes and runs as part of the
+{{< glossary_tooltip text="control plane" term_id="control-plane" >}}.
+kube-scheduler is designed so that, if you want and need to, you can
+write your own scheduling component and use that instead.
+
+For every newly created pod or other unscheduled pod, kube-scheduler
+selects an optimal node for it to run on. However, every container in
+a pod has different requirements for resources and every pod also has
+different requirements. Therefore, existing nodes need to be filtered
+according to the specific scheduling requirements.
+
+In a cluster, Nodes that meet the scheduling requirements for a Pod
+are called _feasible_ nodes. If none of the nodes are suitable, the pod
+remains unscheduled until the scheduler is able to place it.
+
+The scheduler finds feasible Nodes for a Pod and then runs a set of
+functions to score the feasible Nodes and picks a Node with the highest
+score among the feasible ones to run the Pod. The scheduler then notifies
+the API server about this decision in a process called _binding_.
+
+Factors that need to be taken into account for scheduling decisions include
+individual and collective resource requirements, hardware / software /
+policy constraints, affinity and anti-affinity specifications, data
+locality, inter-workload interference, and so on.
+
+## Scheduling with kube-scheduler {#kube-scheduler-implementation}
+
+kube-scheduler selects a node for the pod in a 2-step operation:
+
+1. Filtering
+
+2. Scoring
+
+
+The _filtering_ step finds the set of Nodes where it's feasible to
+schedule the Pod. For example, the PodFitsResources filter checks whether a
+candidate Node has enough available resources to meet a Pod's specific
+resource requests. After this step, the node list contains any suitable
+Nodes; often, there will be more than one. If the list is empty, that
+Pod isn't (yet) schedulable.
+
+In the _scoring_ step, the scheduler ranks the remaining nodes to choose
+the most suitable Pod placement. The scheduler assigns a score to each Node
+that survived filtering, basing this score on the active scoring rules.
+
+Finally, kube-scheduler assigns the Pod to the Node with the highest ranking.
+If there is more than one node with equal scores, kube-scheduler selects
+one of these at random.
+
+
+### Default policies
+
+kube-scheduler has a default set of scheduling policies.
+
+### Filtering
+
+- `PodFitsHostPorts`: Checks if a Node has free ports (the network protocol kind)
+  for the Pod ports the Pod is requesting.
+
+- `PodFitsHost`: Checks if a Pod specifies a specific Node by its hostname.
+
+- `PodFitsResources`: Checks if the Node has free resources (for example, CPU and memory)
+  to meet the requirements of the Pod.
+
+- `PodMatchNodeSelector`: Checks if a Pod's Node {{< glossary_tooltip term_id="selector" >}}
+   matches the Node's {{< glossary_tooltip text="label(s)" term_id="label" >}}.
+
+- `NoVolumeZoneConflict`: Evaluates if the {{< glossary_tooltip text="Volumes" term_id="volume" >}}
+  that a Pod requests are available on the Node, given the failure zone restrictions for
+  that storage.
+
+- `NoDiskConflict`: Evaluates if a Pod can fit on a Node due to the volumes it requests,
+   and those that are already mounted.
+
+- `MaxCSIVolumeCount`: Decides how many {{< glossary_tooltip text="CSI" term_id="csi" >}}
+  volumes should be attached, and whether that's over a configured limit.
+
+- `CheckNodeMemoryPressure`: If a Node is reporting memory pressure, and there's no
+  configured exception, the Pod won't be scheduled there.
+
+- `CheckNodePIDPressure`: If a Node is reporting that process IDs are scarce, and
+  there's no configured exception, the Pod won't be scheduled there.
+
+- `CheckNodeDiskPressure`: If a Node is reporting storage pressure (a filesystem that
+   is full or nearly full), and there's no configured exception, the Pod won't be
+   scheduled there.
+
+- `CheckNodeCondition`: Nodes can report that they have a completely full filesystem,
+  that networking isn't available or that kubelet is otherwise not ready to run Pods.
+  If such a condition is set for a Node, and there's no configured exception, the Pod
+  won't be scheduled there.
+
+- `PodToleratesNodeTaints`: Checks if a Pod's {{< glossary_tooltip text="tolerations" term_id="toleration" >}}
+  can tolerate the Node's {{< glossary_tooltip text="taints" term_id="taint" >}}.
+
+- `CheckVolumeBinding`: Evaluates if a Pod can fit due to the volumes it requests.
+  This applies for both bound and unbound
+  {{< glossary_tooltip text="PVCs" term_id="persistent-volume-claim" >}}.
+
+### Scoring
+
+- `SelectorSpreadPriority`: Spreads Pods across hosts, considering Pods that
+   belong to the same {{< glossary_tooltip text="Service" term_id="service" >}},
+   {{< glossary_tooltip term_id="statefulset" >}} or
+   {{< glossary_tooltip term_id="replica-set" >}}.
+
+- `InterPodAffinityPriority`: Computes a sum by iterating through the elements
+  of weightedPodAffinityTerm and adding “weight” to the sum if the corresponding
+  PodAffinityTerm is satisfied for that node; the node(s) with the highest sum
+  are the most preferred.
+
+- `LeastRequestedPriority`: Favors nodes with fewer requested resources. In other
+  words, the more Pods that are placed on a Node, and the more resources those
+  Pods use, the lower the ranking this policy will give.
+
+- `MostRequestedPriority`: Favors nodes with the most requested resources. This policy
+  will fit the scheduled Pods onto the smallest number of Nodes needed to run your
+  overall set of workloads.
+
+- `RequestedToCapacityRatioPriority`: Creates a requestedToCapacity-based ResourceAllocationPriority using the default resource scoring function shape.
+
+- `BalancedResourceAllocation`: Favors nodes with balanced resource usage.
+
+- `NodePreferAvoidPodsPriority`: Prioritizes nodes according to the node annotation
+  `scheduler.alpha.kubernetes.io/preferAvoidPods`. You can use this to hint that
+  two different Pods shouldn't run on the same Node.
+
+- `NodeAffinityPriority`: Prioritizes nodes according to node affinity scheduling
+   preferences indicated in PreferredDuringSchedulingIgnoredDuringExecution.
+   You can read more about this in [Assigning Pods to Nodes](https://kubernetes.io/docs/concepts/configuration/assign-pod-node/).
+
+- `TaintTolerationPriority`: Prepares the priority list for all the nodes, based on
+  the number of intolerable taints on the node. This policy adjusts a node's rank
+  taking that list into account.
+
+- `ImageLocalityPriority`: Favors nodes that already have the
+  {{< glossary_tooltip text="container images" term_id="image" >}} for that
+  Pod cached locally.
+
+- `ServiceSpreadingPriority`: For a given Service, this policy aims to make sure that
+  the Pods for the Service run on different nodes. It favours scheduling onto nodes
+  that don't have Pods for the service already assigned there. The overall outcome is
+  that the Service becomes more resilient to a single Node failure.
+
+- `CalculateAntiAffinityPriorityMap`: This policy helps implement
+  [pod anti-affinity](https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity).
+
+- `EqualPriorityMap`: Gives an equal weight of one to all nodes.
+
+{{% /capture %}}
+{{% capture whatsnext %}}
+* Read about [scheduler performance tuning](/docs/concepts/scheduling/scheduler-perf-tuning/)
+* Read the [reference documentation](/docs/reference/command-line-tools-reference/kube-scheduler/) for kube-scheduler
+* Learn about [configuring multiple schedulers](https://kubernetes.io/docs/tasks/administer-cluster/configure-multiple-schedulers/)
+{{% /capture %}}
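
Reviewer note on `kube-scheduler.md`: the two-step operation the new page describes (filter, then score, then pick the top node with a random tie-break) can be sketched in a few lines. This is a minimal illustration, not kube-scheduler's implementation. The `Node` and `Pod` classes, their fields, and both helper functions are hypothetical names invented for the sketch; the filter only loosely mirrors `PodFitsResources` and the score only loosely mirrors `LeastRequestedPriority`.

```python
import random
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_cpu: int     # unallocated CPU in millicores (hypothetical field)
    free_memory: int  # unallocated memory in MiB (hypothetical field)


@dataclass
class Pod:
    name: str
    cpu_request: int
    memory_request: int


def fits_resources(pod, node):
    """Filter, in the spirit of PodFitsResources: do the requests fit?"""
    return (pod.cpu_request <= node.free_cpu
            and pod.memory_request <= node.free_memory)


def least_requested_score(pod, node):
    """Toy score, in the spirit of LeastRequestedPriority:
    the more headroom left after placement, the higher the score."""
    return ((node.free_cpu - pod.cpu_request)
            + (node.free_memory - pod.memory_request))


def schedule(pod, nodes, rng=random):
    # Step 1, filtering: keep only the feasible nodes.
    feasible = [n for n in nodes if fits_resources(pod, n)]
    if not feasible:
        return None  # no feasible node: the Pod stays unscheduled for now

    # Step 2, scoring: rank every node that survived filtering.
    scores = {n.name: least_requested_score(pod, n) for n in feasible}
    best = max(scores.values())

    # Nodes with equal top scores are chosen between at random.
    return rng.choice([n for n in feasible if scores[n.name] == best])
```

Calling `schedule(Pod("web", 1000, 1024), nodes)` returns the chosen `Node`, or `None` when filtering leaves nothing. A real scheduler combines many weighted scoring rules; the single score here only keeps the sketch short.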
diff --git a/...ts/configuration/scheduler-perf-tuning.md → ...cepts/scheduling/scheduler-perf-tuning.md b/...ts/configuration/scheduler-perf-tuning.md → ...cepts/scheduling/scheduler-perf-tuning.md
@@ -10,13 +10,19 @@ weight: 70
 
 {{< feature-state for_k8s_version="1.14" state="beta" >}}
 
-Kube-scheduler is the Kubernetes default scheduler. It is responsible for
-placement of Pods on Nodes in a cluster. Nodes in a cluster that meet the
-scheduling requirements of a Pod are called "feasible" Nodes for the Pod. The
-scheduler finds feasible Nodes for a Pod and then runs a set of functions to
-score the feasible Nodes and picks a Node with the highest score among the
-feasible ones to run the Pod. The scheduler then notifies the API server about
-this decision in a process called "Binding".
+[kube-scheduler](/docs/concepts/scheduling/kube-scheduler/#kube-scheduler)
+is the Kubernetes default scheduler. It is responsible for placement of Pods
+on Nodes in a cluster.
+
+Nodes in a cluster that meet the scheduling requirements of a Pod are
+called _feasible_ Nodes for the Pod. The scheduler finds feasible Nodes
+for a Pod and then runs a set of functions to score the feasible Nodes,
+picking a Node with the highest score among the feasible ones to run
+the Pod. The scheduler then notifies the API server about this decision
+in a process called _Binding_.
+
+This page explains performance tuning optimizations that are relevant for
+large Kubernetes clusters.
 
 {{% /capture %}}
 
@@ -37,7 +43,7 @@ size of the cluster if it is not specified in the configuration. It uses a
 linear formula which yields 50% for a 100-node cluster. The formula yields 10%
 for a 5000-node cluster. The lower bound for the automatic value is 5%. In other
 words, the scheduler always scores at least 5% of the cluster no matter how
-large the cluster is, unless the user provides the config option with a value 
+large the cluster is, unless the user provides the config option with a value
 smaller than 5.
 
 Below is an example configuration that sets `percentageOfNodesToScore` to 50%.

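Reviewer note on `scheduler-perf-tuning.md`: the context lines above describe a linear formula that yields 50% for a 100-node cluster and 10% for a 5000-node cluster, with a 5% lower bound. The sketch below shows one linear rule that reproduces those three figures. The constants `50` and `125`, the integer division, and the function name are assumptions chosen to match the quoted numbers; the diff itself does not state the formula.

```python
def adaptive_percentage(num_nodes, floor=5):
    """Return the share of nodes (in percent) to score by default.

    Assumed rule: start from 50 and drop one point per 125 nodes,
    never going below the floor of 5 quoted in the page.
    """
    return max(50 - num_nodes // 125, floor)


for size in (100, 5000, 10000):
    print(size, adaptive_percentage(size))
```

With these assumed constants, a 100-node cluster gets 50, a 5000-node cluster gets 10, and anything large enough to fall below the floor is clamped to 5.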
diff --git a/static/_redirects b/static/_redirects
@@ -103,6 +103,7 @@
 /docs/concepts/clusters/logging/     /docs/concepts/cluster-administration/logging/ 301
 /docs/concepts/configuration/container-command-arg/     /docs/tasks/inject-data-application/define-command-argument-container/ 301
 /docs/concepts/configuration/container-command-args/     /docs/tasks/inject-data-application/define-command-argument-container/     301
+/docs/concepts/configuration/scheduler-perf-tuning/      /docs/concepts/scheduling/scheduler-perf-tuning/    301
 /docs/concepts/ecosystem/thirdpartyresource/     /docs/tasks/access-kubernetes-api/extend-api-third-party-resource/ 301
 /docs/concepts/jobs/cron-jobs/     /docs/concepts/workloads/controllers/cron-jobs/ 301
 /docs/concepts/jobs/run-to-completion-finite-workloads/     /docs/concepts/workloads/controllers/jobs-run-to-completion/ 301