
New pod incorrectly gets scheduled on the node when there is no capacity #106946

Open
harche opened this issue Dec 10, 2021 · 22 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/node Categorizes an issue or PR as relevant to SIG Node. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@harche
Contributor

harche commented Dec 10, 2021

What happened?

When the node is running at its full capacity and no more pods can be scheduled, the remaining pods are in the Pending state as expected. But at this point, if we add a static pod, one of the running pods gets evicted to make room for the incoming static pod.

However, the moment the eviction completes, the scheduler tries to send one of the Pending pods to that node. That pod then fails with an OutOfpods error, because the only slot opened by evicting the running pod was meant for the static pod.

What did you expect to happen?

When the node is running at full capacity (max-pods) and a static pod is added, the scheduler should not schedule an existing Pending pod on that node.

How can we reproduce it (as minimally and precisely as possible)?

  1. Start a local cluster with --max-pods=3 for easy testing (a quick sanity check for this setup is sketched after step 5):

KUBELET_FLAGS=--max-pods=3 CONTAINER_RUNTIME_ENDPOINT='unix:///var/run/crio/crio.sock' CGROUP_DRIVER=systemd CONTAINER_RUNTIME=remote hack/local-up-cluster.sh

I am using CRI-O, but you don't have to; the issue is in the kubelet, so it doesn't matter which runtime you use.

  2. Start about 5 pods:
kubectl run busybox-1 --image=busybox --restart=Never --overrides='{ "spec": {"automountServiceAccountToken": false } }'  --command sleep inf
kubectl run busybox-2 --image=busybox --restart=Never --overrides='{ "spec": {"automountServiceAccountToken": false } }'  --command sleep inf
kubectl run busybox-3 --image=busybox --restart=Never --overrides='{ "spec": {"automountServiceAccountToken": false } }'  --command sleep inf
kubectl run busybox-4 --image=busybox --restart=Never --overrides='{ "spec": {"automountServiceAccountToken": false } }'  --command sleep inf
kubectl run busybox-5 --image=busybox --restart=Never --overrides='{ "spec": {"automountServiceAccountToken": false } }'  --command sleep inf
  3. Wait until some pods are Running and the rest are Pending (how many depends on the number of pods in the kube-system namespace; if needed, increase --max-pods and try again). In my case there was only 1 pod in kube-system, so I had 2 Running and the rest Pending:
$ kubectl get pods
NAME        READY   STATUS    RESTARTS   AGE
busybox-1   1/1     Running   0          33s
busybox-2   1/1     Running   0          20s
busybox-3   0/1     Pending   0          16s
busybox-4   0/1     Pending   0          7s
busybox-5   0/1     Pending   0          3s
  4. Create a static pod in the kubelet's static pod directory:
[root@localhost static-pods]# cat > test.yaml 
apiVersion: v1
kind: Pod
metadata:
  name: static-web
  labels:
    role: myrole
spec:
  containers:
    - name: web
      image: nginx
      ports:
        - name: web
          containerPort: 80
          protocol: TCP
[root@localhost static-pods]# pwd
/run/kubernetes/static-pods
  5. Watch it crash and burn :-)
$ kubectl get pods
NAME                   READY   STATUS              RESTARTS   AGE
busybox-1              1/1     Running             0          88s
busybox-2              0/1     Error               0          75s
busybox-3              0/1     OutOfpods           0          71s
busybox-4              0/1     OutOfpods           0          62s
busybox-5              0/1     Pending             0          58s
static-web-127.0.0.1   0/1     ContainerCreating   0          4s
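
A couple of sanity checks for this setup (a sketch based on the values above; the node name 127.0.0.1 and the static pod directory come from local-up-cluster.sh and step 4, and may differ on other setups):

# confirm the kubelet registered the reduced pod capacity from --max-pods=3
kubectl get node 127.0.0.1 -o jsonpath='{.status.capacity.pods}'

# the kubelet picks up static pods from its staticPodPath (--pod-manifest-path);
# here that is the directory test.yaml was written to in step 4
ls /run/kubernetes/static-pods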

Anything else we need to know?

No response

Kubernetes version

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v0.21.0-beta.1", GitCommit:"d0259f5a5ca1338a68603409a554a554d2c0f6f8", GitTreeState:"clean", BuildDate:"2021-05-21T08:44:40Z", GoVersion:"go1.16.1", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"24+", GitVersion:"v1.24.0-alpha.1.41+cc6f12583f2b61", GitCommit:"cc6f12583f2b611e9469a6b2e0247f028aae246b", GitTreeState:"clean", BuildDate:"2021-12-10T10:31:12Z", GoVersion:"go1.17.2", Compiler:"gc", Platform:"linux/amd64"}
WARNING: version difference between client (0.21) and server (1.24) exceeds the supported minor version skew of +/-1

Cloud provider

N/A

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@harche harche added the kind/bug Categorizes issue or PR as related to a bug. label Dec 10, 2021
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 10, 2021
@harche
Contributor Author

harche commented Dec 10, 2021

/sig node
/sig api-machinery

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Dec 10, 2021
@harche
Contributor Author

harche commented Dec 10, 2021

cc @deads2k

@harche
Contributor Author

harche commented Dec 10, 2021

/sig scheduling

@k8s-ci-robot k8s-ci-robot added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Dec 10, 2021
@smarterclayton
Contributor

When the node is running at full capacity (max-pods) and a static pod is added, the scheduler should not schedule an existing Pending pod on that node.

In the order described above, you created normal pods that got scheduled on a node. Then you added a static pod, which is not "scheduled" (the kubelet directly receives that pod, so the scheduler has to react). In that case, the static pod should start, and I would generally expect 1 other pod on that node to get OutOfPods (because the static pod "wins").

However, why is pod 2 in your list in "Error"? It's possible that the explanation for your "crash and burn" is that pod 2 failed (either because the kubelet incorrectly evicted it or because its own process exited), and then the scheduler saw there was a gap and tried to place 3 or 4, which the kubelet immediately rejected as OutOfPods (because the static pod was starting but the scheduler hadn't seen it yet).

So if we know why pod 2 is in Error, we can figure out what happened. In general, the "crash and burn" looks like normal race behavior, where static pod creation on the kubelet and scheduler placement race to use the gap that the kubelet creates for the static pod (when it shuts down pod 2). However, pod 2 should definitely say OutOfPods, not Error, in that scenario.
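
(A quick way to check that, as a sketch using the pod name from the repro above: dump the full object and look at the status block, in particular status.reason, status.message, and the container's terminated state.)

kubectl get pod busybox-2 -o yaml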

@harche
Contributor Author

harche commented Dec 10, 2021

Pod 2 with error
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2021-12-10T14:56:09Z"
  labels:
    run: busybox-2
  name: busybox-2
  namespace: default
  resourceVersion: "469"
  uid: b63af64b-6d00-468e-836a-2e264a0b5e15
spec:
  automountServiceAccountToken: false
  containers:
  - command:
    - sleep
    - inf
    image: busybox
    imagePullPolicy: Always
    name: busybox-2
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: 127.0.0.1
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Never
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2021-12-10T14:56:09Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2021-12-10T14:57:08Z"
    message: 'containers with unready status: [busybox-2]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2021-12-10T14:57:08Z"
    message: 'containers with unready status: [busybox-2]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2021-12-10T14:56:09Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: cri-o://0820e7622875c4ad75a15130063317d5d5fabc7806215f5e7c1d870fedd68437
    image: docker.io/library/busybox:latest
    imageID: docker.io/library/busybox@sha256:50e44504ea4f19f141118a8a8868e6c5bb9856efa33f2183f5ccea7ac62aacc9
    lastState: {}
    name: busybox-2
    ready: false
    restartCount: 0
    started: false
    state:
      terminated:
        containerID: cri-o://0820e7622875c4ad75a15130063317d5d5fabc7806215f5e7c1d870fedd68437
        exitCode: 137
        finishedAt: "2021-12-10T14:57:07Z"
        reason: Error
        startedAt: "2021-12-10T14:56:20Z"
  hostIP: 127.0.0.1
  message: Preempted in order to admit critical pod
  phase: Failed
  podIP: 10.85.0.51
  podIPs:
  - ip: 10.85.0.51
  - ip: 1100:200::33
  qosClass: BestEffort
  reason: Preempting
  startTime: "2021-12-10T14:56:09Z"

@ahg-g
Member

ahg-g commented Dec 10, 2021

Just commenting on the scheduler behavior. Static pods are problematic and cause race conditions between the scheduler and the kubelet because they don't go through the scheduler first.

However, the moment the eviction completes, the scheduler tries to send one of the Pending pods to that node. That pod then fails with an OutOfpods error, because the only slot opened by evicting the running pod was meant for the static pod.

This is not surprising, because it depends on when the static pod made it to the API server and when the scheduler got the pod add event. The race that causes the scheduler to send another pod is as follows (a rough way to observe it is sketched after the list):

  1. static-pod is created
  2. the kubelet evicts pod1 on the node
  3. the scheduler receives the pod1 remove event, so it thinks the node has space and places another pod, pod2
  4. the kubelet rejects pod2 because it doesn't fit once static-pod is taken into account
  5. static-pod makes it to the API server and the scheduler receives an add event; now the scheduler is in sync with the kubelet on the node state and doesn't place other pods on it
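
A rough way to watch this race from the client side (a sketch, assuming the repro above):

# in one terminal, watch pods while the static pod manifest is dropped into place
kubectl get pods -w -o wide

# afterwards, list the pods the kubelet rejected; they end up Failed with reason OutOfpods
kubectl get pods --field-selector=status.phase=Failed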

@harche
Contributor Author

harche commented Dec 10, 2021

thanks @ahg-g

But in my case, I added just one static pod. So before adding the static pod:

$ kubectl get pods 
NAME        READY   STATUS    RESTARTS   AGE
busybox-1   1/1     Running   0          24s
busybox-2   1/1     Running   0          20s
busybox-3   0/1     Pending   0          16s
busybox-4   0/1     Pending   0          13s
busybox-5   0/1     Pending   0          9s
busybox-6   0/1     Pending   0          2s

But after adding that one static pod, my cluster ends up with:

$ kubectl get pods 
NAME                       READY   STATUS      RESTARTS   AGE
busybox-1                  0/1     Error       0          2m38s
busybox-2                  1/1     Running     0          2m34s
busybox-3                  0/1     OutOfpods   0          2m30s
busybox-4                  0/1     OutOfpods   0          2m27s
busybox-5                  0/1     Pending     0          2m23s
busybox-6                  0/1     Pending     0          2m16s
busybox-static-127.0.0.1   1/1     Running     0          73s

Why did 2 pods scheduled by the scheduler end up with OutOfpods?

cc @smarterclayton

@smarterclayton
Contributor

smarterclayton commented Dec 10, 2021

Ok, looking at pod 2, this is working as designed (WAD), EXCEPT that the fact that it was preempted is masking the fact that it was preempted BECAUSE of OutOfpods. Roughly, the kubelet calculated busybox-2 as "preempted DUE to OutOfpods" but then recorded the reason as "Preempted". I think that's a small usability bug in the kubelet - which Reason is really more important to a user in this case? I would argue the reason for the preemption, not the preemption itself, should be shown in the Reason field.

Does the scheduler use Reason programmatically? If not, we can consider changing the behavior to have the kubelet record OutOfpods instead of Preempted. If it does, we might want the scheduler to NOT depend on Reason and instead use a more effective channel.
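
(For reference, the two reasons being compared can be pulled straight from the pod object; a sketch using busybox-2 from this thread, printing the pod-level status.reason ("Preempting" in the dump above) alongside the container's terminated reason ("Error", matching the STATUS column in the earlier kubectl output):)

kubectl get pod busybox-2 -o jsonpath='{.status.reason} / {.status.containerStatuses[0].state.terminated.reason}{"\n"}'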

@ehashman
Member

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 10, 2021
@ahg-g
Member

ahg-g commented Dec 10, 2021

Does the scheduler use Reason programmatically?

No.

@ahg-g
Member

ahg-g commented Dec 10, 2021

/remove-sig scheduling

@k8s-ci-robot k8s-ci-robot removed the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Dec 10, 2021
@ehashman ehashman added this to Triaged in SIG Node Bugs Dec 10, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 10, 2022
@vaibhav2107
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 23, 2022
@k8s-triage-robot

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 21, 2022
@ffilippopoulos

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 22, 2022
@k8s-triage-robot

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 20, 2022
@george-angel
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 20, 2022
@k8s-triage-robot

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 19, 2022
@sathyanarays
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 19, 2022
@k8s-triage-robot

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Jan 19, 2024
@seans3
Contributor

seans3 commented Feb 9, 2024

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 9, 2024