[zh] sync pod-scheduling-readiness.md
windsonsea committed Apr 13, 2023
1 parent 1bb518d commit 8e89830
Showing 2 changed files with 81 additions and 20 deletions.
@@ -3,7 +3,6 @@ title: Pod Scheduling Readiness
content_type: concept
weight: 40
---

<!--
title: Pod Scheduling Readiness
content_type: concept
@@ -27,7 +26,7 @@ to be considered for scheduling.
Pods were considered ready for scheduling once created.
The Kubernetes scheduler does its due diligence to find nodes to place all pending Pods.
However, in a real-world case, some Pods may stay in a "missing essential resources" state for a long period.
These Pods actually churn the scheduler (and downstream integrators like Cluster AutoScaler) in an unnecessary way.

By specifying or removing a Pod's `.spec.schedulingGates`, you can control when a Pod is ready to be considered for scheduling.

@@ -47,7 +46,8 @@ each schedulingGate can be removed in arbitrary order, but addition of a new sch
This field can be initialized only when a Pod is created (either by the client, or mutated during admission).
After creation, each schedulingGate can be removed in arbitrary order, but addition of a new scheduling gate is disallowed.
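
For orientation, here is a minimal sketch of a gated Pod manifest. The gate names match the example output shown later on this page and the Pod name follows the `test-pod` used below; the container image is an assumed placeholder rather than something stated here.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  schedulingGates:
  - name: example.com/foo
  - name: example.com/bar
  containers:
  - name: pause
    image: registry.k8s.io/pause:3.6  # assumed placeholder image
```

Until every listed gate is removed, the scheduler keeps the Pod in the `SchedulingGated` state and does not attempt to bind it.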

{{< figure src="/docs/images/podSchedulingGates.svg" alt="pod-scheduling-gates-diagram" caption="Figure. Pod SchedulingGates" class="diagram-large" link="https://mermaid.live/edit#pako:eNplkktTwyAUhf8KgzuHWpukaYszutGlK3caFxQuCVMCGSDVTKf_XfKyPlhxz4HDB9wT5lYAptgHFuBRsdKxenFMClMYFIdfUdRYgbiD6ItJTEbR8wpEq5UpUfnDTf-5cbPoJjcbXdcaE61RVJIiqJvQ_Y30D-OCt-t3tFjcR5wZayiVnIGmkv4NiEfX9jijKTmmRH5jf0sRugOP0HyHUc1m6KGMFP27cM28fwSJDluPpNKaXqVJzmFNfHD2APRKSjnNFx9KhIpmzSfhVls3eHdTRrwG8QnxKfEZUUNeYTDBNbiaKRF_5dSfX-BQQQ0FpnEqQLJWhwIX5hyXsjbYl85wTINrgeC2EZd_xFQy7b_VJ6GCdd-itkxALE84dE3fAqXyIUZya6Qqe711OspVCI2ny2Vv35QqVO3-htt66ZWomEvVcZcv8zTfsiSFfJOydZoKvl_ttjLJVlJsblcJw-czwQ0zr9ZeqGDgeR77b2jD8xdtjtDn" >}}

<!--
## Usage example
@@ -93,7 +93,7 @@ The output is:
The output is:

```none
[{"name":"foo"},{"name":"bar"}]
[{"name":"example.com/foo"},{"name":"example.com/bar"}]
```

<!--
@@ -126,7 +126,8 @@ kubectl get pod test-pod -o wide
Given the test-pod doesn't request any CPU/memory resources, it's expected that this Pod's state get
transited from previous `SchedulingGated` to `Running`:
-->
Given that test-pod does not request any CPU/memory resources, it is expected that this Pod's state
transitions from the previous `SchedulingGated` to `Running`:

```none
NAME READY STATUS RESTARTS AGE IP NODE
@@ -146,9 +147,61 @@ scheduling. You can use `scheduler_pending_pods{queue="gated"}` to check the met
to distinguish whether a Pod has been tried scheduling but claimed as unschedulable, or explicitly marked as not ready for scheduling.
You can use `scheduler_pending_pods{queue="gated"}` to check the metric result.

<!--
## Mutable Pod Scheduling Directives
-->
## Mutable Pod Scheduling Directives {#mutable-pod-scheduling-directives}

{{< feature-state for_k8s_version="v1.27" state="beta" >}}

<!--
You can mutate scheduling directives of Pods while they have scheduling gates, with certain constraints.
At a high level, you can only tighten the scheduling directives of a Pod. In other words, the updated
directives would cause the Pods to only be able to be scheduled on a subset of the nodes that it would
previously match. More concretely, the rules for updating a Pod's scheduling directives are as follows:
-->
You can mutate the scheduling directives of Pods while they have scheduling gates, with certain constraints.
At a high level, you can only tighten the scheduling directives of a Pod. In other words, the updated
directives would cause the Pod to only be able to be scheduled on a subset of the nodes that it would
previously match. More concretely, the rules for updating a Pod's scheduling directives are as follows
(a sketch illustrating rule 1 follows the list below):

<!--
1. For `.spec.nodeSelector`, only additions are allowed. If absent, it will be allowed to be set.
2. For `spec.affinity.nodeAffinity`, if nil, then setting anything is allowed.
-->
1. For `.spec.nodeSelector`, only additions are allowed. If absent, it will be allowed to be set.

2. For `spec.affinity.nodeAffinity`, if nil, then setting anything is allowed.

<!--
3. If `NodeSelectorTerms` was empty, it will be allowed to be set.
If not empty, then only additions of `NodeSelectorRequirements` to `matchExpressions`
or `fieldExpressions` are allowed, and no changes to existing `matchExpressions`
and `fieldExpressions` will be allowed. This is because the terms in
`.requiredDuringSchedulingIgnoredDuringExecution.NodeSelectorTerms`, are ORed
while the expressions in `nodeSelectorTerms[].matchExpressions` and
`nodeSelectorTerms[].fieldExpressions` are ANDed.
-->
3. If `NodeSelectorTerms` was empty, it will be allowed to be set.
   If not empty, then only additions of `NodeSelectorRequirements` to `matchExpressions`
   or `fieldExpressions` are allowed, and no changes to existing `matchExpressions`
   and `fieldExpressions` will be allowed. This is because the terms in
   `.requiredDuringSchedulingIgnoredDuringExecution.NodeSelectorTerms` are ORed,
   while the expressions in `nodeSelectorTerms[].matchExpressions` and
   `nodeSelectorTerms[].fieldExpressions` are ANDed.

<!--
4. For `.preferredDuringSchedulingIgnoredDuringExecution`, all updates are allowed.
This is because preferred terms are not authoritative, and so policy controllers
don't validate those terms.
-->
4. For `.preferredDuringSchedulingIgnoredDuringExecution`, all updates are allowed.
   This is because preferred terms are not authoritative, so policy controllers
   don't validate those terms.
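
As the sketch referenced above, here is an illustrative update that rule 1 permits on a still-gated Pod; the Pod name, label keys, zone value, and image are assumptions for illustration, not taken from this page:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: test-pod                      # illustrative name
spec:
  schedulingGates:
  - name: example.com/foo
  nodeSelector:
    disktype: ssd                     # key present when the Pod was created
    topology.kubernetes.io/zone: antarctica-east1  # added while gated: allowed, it narrows the node set
  containers:
  - name: pause
    image: registry.k8s.io/pause:3.6  # assumed placeholder image
```

Removing the `disktype` key or changing its value would widen or shift the matching node set and would therefore be rejected.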

## {{% heading "whatsnext" %}}

<!--
* Read the [PodSchedulingReadiness KEP](https://github.com/kubernetes/enhancements/blob/master/keps/sig-scheduling/3521-pod-scheduling-readiness) for more details
-->
* Read the [PodSchedulingReadiness KEP](https://github.com/kubernetes/enhancements/blob/master/keps/sig-scheduling/3521-pod-scheduling-readiness) for more details
@@ -54,7 +54,8 @@ For example,
-->
## Concepts {#concepts}

You add a taint to a node using [kubectl taint](/docs/reference/generated/kubectl/kubectl-commands#taint).
For example:

```shell
kubectl taint nodes node1 key1=value1:NoSchedule
@@ -82,7 +83,7 @@ to schedule onto `node1`:
-->
You specify a toleration for a Pod in the Pod spec.
Both of the following tolerations "match" the taint created by the `kubectl taint` command above,
and thus a Pod with either toleration would be able to be scheduled onto `node1`:

```yaml
tolerations:
@@ -119,11 +120,10 @@ A toleration "matches" a taint if the keys are the same and the effects are the
-->
A toleration "matches" a taint if they have the same key and effect, and:

* the `operator` is `Exists` (in which case no `value` should be specified), or
* the `operator` is `Equal` and the `value`s are equal.

{{< note >}}

<!--
There are two special cases:
@@ -182,7 +182,7 @@ scheduled onto the node (if it is not yet running on the node).
<!--
For example, imagine you taint a node like this
-->
For example, imagine you taint a node like this:

```shell
kubectl taint nodes node1 key1=value1:NoSchedule
@@ -279,7 +279,7 @@ onto nodes labeled with `dedicated=groupName`.
is easily done).
Pods with the toleration are then able to be scheduled onto the dedicated nodes,
as well as onto any other nodes in the cluster.
If you want these Pods to be scheduled only onto the dedicated nodes,
you should additionally add a label similar to the taint to those nodes (for example, `dedicated=groupName`),
and have the admission controller additionally add a node affinity requiring that the Pods
can only be scheduled onto nodes labeled with `dedicated=groupName`.
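
A rough sketch of what such an admission controller could inject into each Pod of the dedicated group; the group name reuses `groupName` from the text, while the rest of the snippet is an illustrative assumption rather than the controller's literal output:

```yaml
spec:
  tolerations:
  # Tolerate the dedicated=groupName:NoSchedule taint on the dedicated nodes.
  - key: "dedicated"
    operator: "Equal"
    value: "groupName"
    effect: "NoSchedule"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        # Require the matching dedicated=groupName node label.
        - matchExpressions:
          - key: "dedicated"
            operator: In
            values:
            - "groupName"
```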

@@ -310,7 +310,7 @@ manually add tolerations to your pods.
We want Pods that do not need this hardware to stay off of those special nodes,
in order to leave room for later-arriving Pods that do need the specialized hardware.
To achieve this, you can first taint the nodes that have the specialized hardware
(for example `kubectl taint nodes nodename special=true:NoSchedule` or
`kubectl taint nodes nodename special=true:PreferNoSchedule`),
and then add a corresponding toleration to Pods that use the special hardware.
As in the dedicated nodes use case, the easiest way to apply the toleration is to use a custom
[admission controller](/zh-cn/docs/reference/access-authn-authz/admission-controllers/)
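
A minimal sketch of the matching toleration that such an admission controller might add to Pods that consume the special hardware, mirroring the `special=true:NoSchedule` taint shown above (the exact shape of what a real controller injects is an assumption):

```yaml
tolerations:
- key: "special"
  operator: "Equal"
  value: "true"
  effect: "NoSchedule"
```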
@@ -333,7 +333,7 @@ when there are node problems, which is described in the next section.
<!--
## Taint based Evictions
-->
## Taint based Evictions {#taint-based-evictions}

{{< feature-state for_k8s_version="v1.18" state="stable" >}}

@@ -347,7 +347,7 @@ running on the node as follows
* pods that tolerate the taint with a specified `tolerationSeconds` remain
bound for the specified amount of time
-->
As mentioned earlier, the `NoExecute` taint effect affects Pods that are already running on the node as follows:

* Pods that do not tolerate the taint are evicted immediately.
* Pods that tolerate the taint but do not specify `tolerationSeconds` in their toleration specification remain bound forever.
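
For the case where a toleration specifies `tolerationSeconds`, a hedged sketch follows; the taint key is the built-in taint the node controller adds for unreachable nodes, and the 6000-second value is only an illustrative choice:

```yaml
tolerations:
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 6000
```

Such a Pod stays bound for 6000 seconds after the taint appears and is evicted afterwards if the taint remains.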
@@ -384,8 +384,8 @@ are true. The following taints are built in:
* `node.kubernetes.io/network-unavailable`: The node's network is unavailable.
* `node.kubernetes.io/unschedulable`: The node is unschedulable.
* `node.cloudprovider.kubernetes.io/uninitialized`: When the kubelet is started with an "external" cloud provider,
  this taint is set on the node to mark it as unusable. After a controller from the cloud-controller-manager
  initializes this node, the kubelet removes this taint.

<!--
In case a node is to be evicted, the node controller or the kubelet adds relevant taints
Expand All @@ -395,6 +395,16 @@ controller can remove the relevant taint(s).
When a node is to be evicted, the node controller or the kubelet adds the relevant taints with the `NoExecute` effect.
If the fault condition returns to normal, the kubelet or the node controller can remove the relevant taint(s).

<!--
In some cases when the node is unreachable, the API server is unable to communicate
with the kubelet on the node. The decision to delete the pods cannot be communicated to
the kubelet until communication with the API server is re-established. In the meantime,
the pods that are scheduled for deletion may continue to run on the partitioned node.
-->
In some cases when the node is unreachable, the API server is unable to communicate with the kubelet on the node.
The decision to delete the Pods cannot be communicated to the kubelet until communication with the API server is re-established.
In the meantime, the Pods that are scheduled for deletion may continue to run on the partitioned node.

{{< note >}}
<!--
The control plane limits the rate of adding node new taints to nodes. This rate limiting
@@ -518,7 +528,6 @@ tolerations to all daemons, to prevent DaemonSets from breaking.
* `node.kubernetes.io/unschedulable` (1.10 or later)
* `node.kubernetes.io/network-unavailable` (*host network only*)
-->

The DaemonSet controller automatically adds the following `NoSchedule` tolerations to all daemons, to prevent DaemonSets from breaking:

* `node.kubernetes.io/memory-pressure`
@@ -531,7 +540,6 @@ The DaemonSet controller automatically adds the following `NoSchedule` tolerations
Adding these tolerations ensures backward compatibility. You can also add
arbitrary tolerations to DaemonSets.
-->

Adding these tolerations ensures backward compatibility. You can also add arbitrary tolerations to DaemonSets.
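
For example, a minimal sketch of a DaemonSet carrying an extra, user-chosen toleration on top of the automatically injected ones; the DaemonSet name, labels, taint key, and image are illustrative assumptions:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: example-agent
spec:
  selector:
    matchLabels:
      app: example-agent
  template:
    metadata:
      labels:
        app: example-agent
    spec:
      tolerations:
      # Added in addition to the NoSchedule tolerations the DaemonSet controller injects.
      - key: "example.com/maintenance"   # illustrative taint key
        operator: "Exists"
        effect: "NoExecute"
      containers:
      - name: agent
        image: registry.k8s.io/pause:3.6  # assumed placeholder image
```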

## {{% heading "whatsnext" %}}