New pod incorrectly gets scheduled on the node when there is no capacity #106946
Comments
/sig node |
cc @deads2k |
/sig scheduling |
In the order described above, you created normal pods that got scheduled on a node. Then you added a static pod, which is not "scheduled" (the kubelet receives that pod directly, so the scheduler has to react). In that case, the static pod should start, and I would generally expect 1 other pod on that node to get OutOfPods (because the static pod "wins"). However, why is pod 2 in your list in "Error"? It's possible that the explanation for your "crash and burn" is that pod 2 failed (because the kubelet incorrectly evicted it, or because its own process exited), and then the scheduler saw there was a gap and tried to place pod 3 or 4, which the kubelet immediately rejected as OutOfPods (because the static pod was starting but the scheduler hadn't seen it yet). So if we know why pod 2 is in "Error", that should explain the rest. |
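(The pod object in the next comment is presumably a dump along the lines of the command below; the exact flags are an assumption, not quoted from the report.)

kubectl get pod busybox-2 -n default -o yaml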
Pod 2 with error:
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2021-12-10T14:56:09Z"
  labels:
    run: busybox-2
  name: busybox-2
  namespace: default
  resourceVersion: "469"
  uid: b63af64b-6d00-468e-836a-2e264a0b5e15
spec:
  automountServiceAccountToken: false
  containers:
  - command:
    - sleep
    - inf
    image: busybox
    imagePullPolicy: Always
    name: busybox-2
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: 127.0.0.1
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Never
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2021-12-10T14:56:09Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2021-12-10T14:57:08Z"
    message: 'containers with unready status: [busybox-2]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2021-12-10T14:57:08Z"
    message: 'containers with unready status: [busybox-2]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2021-12-10T14:56:09Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: cri-o://0820e7622875c4ad75a15130063317d5d5fabc7806215f5e7c1d870fedd68437
    image: docker.io/library/busybox:latest
    imageID: docker.io/library/busybox@sha256:50e44504ea4f19f141118a8a8868e6c5bb9856efa33f2183f5ccea7ac62aacc9
    lastState: {}
    name: busybox-2
    ready: false
    restartCount: 0
    started: false
    state:
      terminated:
        containerID: cri-o://0820e7622875c4ad75a15130063317d5d5fabc7806215f5e7c1d870fedd68437
        exitCode: 137
        finishedAt: "2021-12-10T14:57:07Z"
        reason: Error
        startedAt: "2021-12-10T14:56:20Z"
  hostIP: 127.0.0.1
  message: Preempted in order to admit critical pod
  phase: Failed
  podIP: 10.85.0.51
  podIPs:
  - ip: 10.85.0.51
  - ip: 1100:200::33
  qosClass: BestEffort
  reason: Preempting
  startTime: "2021-12-10T14:56:09Z"
|
Just commenting on the scheduler behavior. Static pods are problematic and cause race conditions between the scheduler and the kubelet because they don't go through the scheduler first.
This is not surprising; it depends on when the static pod made it to the API server and when the scheduler got the pod add event. Roughly, the race that causes the scheduler to send another pod is: the kubelet admits the static pod and preempts a running pod to make room for it; the scheduler, which has not yet observed the static pod's mirror pod, sees a free slot on the node and binds one of the Pending pods there; the kubelet then rejects that pod with OutOfpods, because the freed slot was meant for the static pod.
|
Thanks @ahg-g. But in my case, I added just one static pod. So before adding the static pod,
But after adding that one static pod, my cluster ends up with,
Why did 2 pods that were scheduled by the scheduler end up with OutOfpods? |
Ok, looking at pod 2, this is WAD EXCEPT that the "Preempted" reason is masking the fact that it was preempted BECAUSE of OutOfpods. Roughly, the kubelet calculated busybox-2 as "preempted DUE to OutOfpods" but then recorded the reason as "Preempted". I think that's a small usability bug in kubelet - which Reason is really more important to a user in this case? I would probably argue the reason for preemption, not the preemption itself, should be shown in the Reason field. Does the scheduler use Reason programmatically? If not, we can consider changing the behavior to have the kubelet record OutOfpods instead of preempt. If it does, we might want to consider having the scheduler NOT depend on Reason, and instead use a more effective channel. |
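(For illustration only: the two fields being discussed can be read straight off the pod object above. The command and jsonpath expression below are a sketch, not something posted in the thread; with the status shown above it prints the preemption, not the OutOfpods cause.)

kubectl get pod busybox-2 -o jsonpath='{.status.reason}{": "}{.status.message}{"\n"}'
# expected with the status above: Preempting: Preempted in order to admit critical pod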
/triage accepted |
No. |
/remove-sig scheduling |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
/remove-lifecycle stale |
This issue has not been updated in over 1 year, and should be re-triaged. You can:
For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/ /remove-triage accepted |
/triage accepted |
What happened?
When the node is running at its full capacity and no more pods can be scheduled, the rest of the pods are in Pending state as expected. But at this point, if we add a static pod, one of the running pods gets evicted to make room for the incoming static pod. However, the moment the eviction completes, the scheduler tries to send one of the Pending pods to that host, and that pod fails with an OutOfpods error, because the only slot opened by evicting the running pod on that node was for the static pod.
What did you expect to happen?
When the node is running at full capacity (max-pods) and a static pod is added, the scheduler should not schedule an existing Pending pod on that host.
How can we reproduce it (as minimally and precisely as possible)?
KUBELET_FLAGS=--max-pods=3 CONTAINER_RUNTIME_ENDPOINT='unix:///var/run/crio/crio.sock' CGROUP_DRIVER=systemd CONTAINER_RUNTIME=remote hack/local-up-cluster.sh
I am using crio but you don't have to; the issue is with the kubelet, so it doesn't matter which runtime you use. Run enough pods to fill the node (example commands are sketched below): some will be Running and the rest Pending (how many depends on what is already running in the kube-system namespace; if you need to, increase --max-pods and try again). In my case there was only 1 pod in kube-system, so I had 2 Running and the rest Pending.
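(A rough sketch of the reproduction, assuming the local-up-cluster setup above. The pod names and sleep command mirror the busybox-2 manifest shown earlier; the static pod directory and manifest are illustrative, since the exact path depends on the kubelet's staticPodPath / --pod-manifest-path configuration.)

# Fill the node (started with --max-pods=3) with ordinary pods; the extras stay Pending.
for i in 1 2 3 4; do
  kubectl run busybox-$i --image=busybox --restart=Never --command -- sleep inf
done
kubectl get pods -o wide

# Add a single static pod by dropping a manifest into the kubelet's static pod directory.
cat <<'EOF' > /path/to/static-pod-dir/static-busybox.yaml
apiVersion: v1
kind: Pod
metadata:
  name: static-busybox
spec:
  containers:
  - name: static-busybox
    image: busybox
    command: ["sleep", "inf"]
EOF

# Watch: one Running pod is preempted for the static pod, then the scheduler sends a
# Pending pod to the node, which the kubelet rejects with OutOfpods.
kubectl get pods -w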
Anything else we need to know?
No response
Kubernetes version
Cloud provider
OS version
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)