Skip to content

Commit

Permalink
Add scheduler concept guide (#14637)
Browse files Browse the repository at this point in the history
* Add scheduler concept page

* Rename scheduling overview

* Fix non-ASCII colon symbols

* Reword scheduler concept page

* Move scheduler performance tuning into scheduling

* Signpost from overview to kube-scheduler, etc

* Add whatsnext section to scheduler concept

* Restructure scheduling concept

Now there's a concept page for scheduling, some of the details
in the performance tuning page can have a better home.

* Omit link to (unwritten) scheduling extensions page

* Drop deprecated / superseded filtering rules
  • Loading branch information
sftim authored and k8s-ci-robot committed Jul 9, 2019
1 parent 6b0711a commit d550d11
Show file tree
Hide file tree
Showing 5 changed files with 211 additions and 10 deletions.
7 changes: 5 additions & 2 deletions content/en/docs/concepts/overview/components.md
Original file line number Diff line number Diff line change
Expand Up @@ -111,5 +111,8 @@ A [Cluster-level logging](/docs/concepts/cluster-administration/logging/) mechan
saving container logs to a central log store with search/browsing interface.

{{% /capture %}}


{{% capture whatsnext %}}
* Learn about [Nodes](/docs/concepts/architecture/nodes/)
* Learn about [kube-scheduler](/docs/concepts/scheduling/kube-scheduler/)
* Read etcd's official [documentation](https://etcd.io/docs/)
{{% /capture %}}
5 changes: 5 additions & 0 deletions content/en/docs/concepts/scheduling/_index.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
---
title: "Scheduling"
weight: 90
---

186 changes: 186 additions & 0 deletions content/en/docs/concepts/scheduling/kube-scheduler.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,186 @@
---
title: Kubernetes Scheduler
content_template: templates/concept
weight: 60
---

{{% capture overview %}}

In Kubernetes, _scheduling_ refers to making sure that {{< glossary_tooltip text="Pods" term_id="pod" >}}
are matched to {{< glossary_tooltip text="Nodes" term_id="node" >}} so that
{{< glossary_tooltip term_id="kubelet" >}} can run them.

{{% /capture %}}

{{% capture body %}}

## Scheduling overview {#scheduling}

A scheduler watches for newly created Pods that have no Node assigned. For
every Pod that the scheduler discovers, the scheduler becomes responsible
for finding the best Node for that Pod to run on. The scheduler reaches
this placement decision taking into account the scheduling principles
described below.

If you want to understand why Pods are placed onto a particular Node,
or if you're planning to implement a custom scheduler yourself, this
page will help you learn about scheduling.

## kube-scheduler

[kube-scheduler](https://kubernetes.io/docs/reference/command-line-tools-reference/kube-scheduler/)
is the default scheduler for Kubernetes and runs as part of the
{{< glossary_tooltip text="control plane" term_id="control-plane" >}}.
kube-scheduler is designed so that, if you want and need to, you can
write your own scheduling component and use that instead.

For every newly created pods or other unscheduled pods, kube-scheduler
selects a optimal node for them to run on. However, every container in
pods has different requirements for resources and every pod also has
different requirements. Therefore, existing nodes need to be filtered
according to the specific scheduling requirements.

In a cluster, Nodes that meet the scheduling requirements for a Pod
are called _feasible_ nodes. If none of the nodes are suitable, the pod
remains unscheduled until the scheduler is able to place it.

The scheduler finds feasible Nodes for a Pod and then runs a set of
functions to score the feasible Nodes and picks a Node with the highest
score among the feasible ones to run the Pod. The scheduler then notifies
the API server about this decision in a process called _binding_.

Factors that need taken into account for scheduling decisions include
individual and collective resource requirements, hardware / software /
policy constraints, affinity and anti-affinity specifications, data
locality, inter-workload interference, and so on.

## Scheduling with kube-scheduler {#kube-scheduler-implementation}

kube-scheduler selects a node for the pod in a 2-step operation:

1. Filtering

2. Scoring


The _filtering_ step finds the set of Nodes where it's feasible to
schedule the Pod. For example, the PodFitsResources filter checks whether a
candidate Node has enough available resource to meet a Pod's specific
resource requests. After this step, the node list contains any suitable
Nodes; often, there will be more than one. If the list is empty, that
Pod isn't (yet) schedulable.

In the _scoring_ step, the scheduler ranks the remaining nodes to choose
the most suitable Pod placement. The scheduler assigns a score to each Node
that survived filtering, basing this score on the active scoring rules.

Finally, kube-scheduler assigns the Pod to the Node with the highest ranking.
If there is more than one node with equal scores, kube-scheduler selects
one of these at random.


### Default policies

kube-scheduler has a default set of scheduling policies.

### Filtering

- `PodFitsHostPorts`: Checks if a Node has free ports (the network protocol kind)
for the Pod ports the the Pod is requesting.

- `PodFitsHost`: Checks if a Pod specifies a specific Node by it hostname.

- `PodFitsResources`: Checks if the Node has free resources (eg, CPU and Memory)
to meet the requirement of the Pod.

- `PodMatchNodeSelector`: Checks if a Pod's Node {{< glossary_tooltip term_id="selector" >}}
matches the Node's {{< glossary_tooltip text="label(s)" term_id="label" >}}.

- `NoVolumeZoneConflict`: Evaluate if the {{< glossary_tooltip text="Volumes" term_id="volume" >}}
that a Pod requests are available on the Node, given the failure zone restrictions for
that storage.

- `NoDiskConflict`: Evaluates if a Pod can fit on a Node due to the volumes it requests,
and those that are already mounted.

- `MaxCSIVolumeCount`: Decides how many {{< glossary_tooltip text="CSI" term_id="csi" >}}
volumes should be attached, and whether that's over a configured limit.

- `CheckNodeMemoryPressure`: If a Node is reporting memory pressure, and there's no
configured exception, the Pod won't be scheduled there.

- `CheckNodePIDPressure`: If a Node is reporting that process IDs are scarce, and
there's no configured exception, the Pod won't be scheduled there.

- `CheckNodeDiskPressure`: If a Node is reporting storage pressure (a filesystem that
is full or nearly full), and there's no configured exception, the Pod won't be
scheduled there.

- `CheckNodeCondition`: Nodes can report that they have a completely full filesystem,
that networking isn't available or that kubelet is otherwise not ready to run Pods.
If such a condition is set for a Node, and there's no configured exception, the Pod
won't be scheduled there.

- `PodToleratesNodeTaints`: checks if a Pod's {{< glossary_tooltip text="tolerations" term_id="toleration" >}}
can tolerate the Node's {{< glossary_tooltip text="taints" term_id="taint" >}}.

- `CheckVolumeBinding`: Evaluates if a Pod can fit due to the volumes it requests.
This applies for both bound and unbound
{{< glossary_tooltip text="PVCs" term_id="persistent-volume-claim" >}}

### Scoring

- `SelectorSpreadPriority`: Spreads Pods across hosts, considering Pods that
belonging to the same {{< glossary_tooltip text="Service" term_id="service" >}},
{{< glossary_tooltip term_id="statefulset" >}} or
{{< glossary_tooltip term_id="replica-set" >}}.

- `InterPodAffinityPriority`: Computes a sum by iterating through the elements
of weightedPodAffinityTerm and adding “weight” to the sum if the corresponding
PodAffinityTerm is satisfied for that node; the node(s) with the highest sum
are the most preferred.

- `LeastRequestedPriority`: Favors nodes with fewer requested resources. In other
words, the more Pods that are placed on a Node, and the more resources those
Pods use, the lower the ranking this policy will give.

- `MostRequestedPriority`: Favors nodes with most requested resources. This policy
will fit the scheduled Pods onto the smallest number of Nodes needed to run your
overall set of workloads.

- `RequestedToCapacityRatioPriority`: Creates a requestedToCapacity based ResourceAllocationPriority using default resource scoring function shape.

- `BalancedResourceAllocation`: Favors nodes with balanced resource usage.

- `NodePreferAvoidPodsPriority`: Priorities nodes according to the node annotation
`scheduler.alpha.kubernetes.io/preferAvoidPods`. You can use this to hint that
two different Pods shouldn't run on the same Node.

- `NodeAffinityPriority`: Prioritizes nodes according to node affinity scheduling
preferences indicated in PreferredDuringSchedulingIgnoredDuringExecution.
You can read more about this in [Assigning Pods to Nodes](https://kubernetes.io/docs/concepts/configuration/assign-pod-node/)

- `TaintTolerationPriority`: Prepares the priority list for all the nodes, based on
the number of intolerable taints on the node. This policy adjusts a node's rank
taking that list into account.

- `ImageLocalityPriority`: Favors nodes that already have the
{{< glossary_tooltip text="container images" term_id="image" >}} for that
Pod cached locally.

- `ServiceSpreadingPriority`: For a given Service, this policy aims to make sure that
the Pods for the Service run on different nodes. It favouring scheduling onto nodes
that don't have Pods for the service already assigned there. The overall outcome is
that the Service becomes more resilient to a single Node failure.

- `CalculateAntiAffinityPriorityMap`: This policy helps implement
[pod anti-affinity](https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity).

- `EqualPriorityMap`: Gives an equal weight of one to all nodes.

{{% /capture %}}
{{% capture whatsnext %}}
* Read about [scheduler performance tuning](/docs/concepts/scheduling/scheduler-perf-tuning/)
* Read the [reference documentation](/docs/reference/command-line-tools-reference/kube-scheduler/) for kube-scheduler
* Learn about [configuring multiple schedulers](https://kubernetes.io/docs/tasks/administer-cluster/configure-multiple-schedulers/)
{{% /capture %}}
Original file line number Diff line number Diff line change
Expand Up @@ -10,13 +10,19 @@ weight: 70

{{< feature-state for_k8s_version="1.14" state="beta" >}}

Kube-scheduler is the Kubernetes default scheduler. It is responsible for
placement of Pods on Nodes in a cluster. Nodes in a cluster that meet the
scheduling requirements of a Pod are called "feasible" Nodes for the Pod. The
scheduler finds feasible Nodes for a Pod and then runs a set of functions to
score the feasible Nodes and picks a Node with the highest score among the
feasible ones to run the Pod. The scheduler then notifies the API server about
this decision in a process called "Binding".
[kube-scheduler](/docs/concepts/scheduling/kube-scheduler/#kube-scheduler)
is the Kubernetes default scheduler. It is responsible for placement of Pods
on Nodes in a cluster.

Nodes in a cluster that meet the scheduling requirements of a Pod are
called _feasible_ Nodes for the Pod. The scheduler finds feasible Nodes
for a Pod and then runs a set of functions to score the feasible Nodes,
picking a Node with the highest score among the feasible ones to run
the Pod. The scheduler then notifies the API server about this decision
in a process called _Binding_.

This page explains performance tuning optimizations that are relevant for
large Kubernetes clusters.

{{% /capture %}}

Expand All @@ -37,7 +43,7 @@ size of the cluster if it is not specified in the configuration. It uses a
linear formula which yields 50% for a 100-node cluster. The formula yields 10%
for a 5000-node cluster. The lower bound for the automatic value is 5%. In other
words, the scheduler always scores at least 5% of the cluster no matter how
large the cluster is, unless the user provides the config option with a value
large the cluster is, unless the user provides the config option with a value
smaller than 5.

Below is an example configuration that sets `percentageOfNodesToScore` to 50%.
Expand Down
1 change: 1 addition & 0 deletions static/_redirects
Original file line number Diff line number Diff line change
Expand Up @@ -103,6 +103,7 @@
/docs/concepts/clusters/logging/ /docs/concepts/cluster-administration/logging/ 301
/docs/concepts/configuration/container-command-arg/ /docs/tasks/inject-data-application/define-command-argument-container/ 301
/docs/concepts/configuration/container-command-args/ /docs/tasks/inject-data-application/define-command-argument-container/ 301
/docs/concepts/configuration/scheduler-perf-tuning/ /docs/concepts/scheduling/scheduler-perf-tuning/ 301
/docs/concepts/ecosystem/thirdpartyresource/ /docs/tasks/access-kubernetes-api/extend-api-third-party-resource/ 301
/docs/concepts/jobs/cron-jobs/ /docs/concepts/workloads/controllers/cron-jobs/ 301
/docs/concepts/jobs/run-to-completion-finite-workloads/ /docs/concepts/workloads/controllers/jobs-run-to-completion/ 301
Expand Down

0 comments on commit d550d11

Please sign in to comment.