
New pod incorrectly gets scheduled on the node when there is no capacity #106946

Open
harche opened this issue Dec 10, 2021 · 22 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/node Categorizes an issue or PR as relevant to SIG Node. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@harche
Contributor

harche commented Dec 10, 2021

What happened?

When the node is running at its full capacity and no more pods can be scheduled, the remaining pods are in the Pending state as expected. But at this point, if we add a static pod, one of the running pods gets evicted to make room for the incoming static pod.

However, the moment the eviction completes, the scheduler tries to send one of the Pending pods to that node. That pod then fails with an OutOfpods error, because the only slot opened by evicting the running pod was meant for the static pod.

What did you expect to happen?

When the node is running at full capacity (max-pods) and a static pod is added, the scheduler should not schedule an existing Pending pod on that node.

How can we reproduce it (as minimally and precisely as possible)?

  1. Start a local cluster with --max-pods=3 for easy testing (a quick sanity check for this setup is sketched after step 5):

KUBELET_FLAGS=--max-pods=3 CONTAINER_RUNTIME_ENDPOINT='unix:///var/run/crio/crio.sock' CGROUP_DRIVER=systemd CONTAINER_RUNTIME=remote hack/local-up-cluster.sh

I am using CRI-O, but you don't have to; the issue is in the kubelet, so it doesn't matter which runtime you use.

  2. Start about 5 pods:
kubectl run busybox-1 --image=busybox --restart=Never --overrides='{ "spec": {"automountServiceAccountToken": false } }'  --command sleep inf
kubectl run busybox-2 --image=busybox --restart=Never --overrides='{ "spec": {"automountServiceAccountToken": false } }'  --command sleep inf
kubectl run busybox-3 --image=busybox --restart=Never --overrides='{ "spec": {"automountServiceAccountToken": false } }'  --command sleep inf
kubectl run busybox-4 --image=busybox --restart=Never --overrides='{ "spec": {"automountServiceAccountToken": false } }'  --command sleep inf
kubectl run busybox-5 --image=busybox --restart=Never --overrides='{ "spec": {"automountServiceAccountToken": false } }'  --command sleep inf
  3. Wait until some pods are Running and the rest are Pending (how many depends on the number of pods in the kube-system namespace; if needed, increase --max-pods and try again). In my case there was only 1 pod in kube-system, so I had 2 Running and the rest Pending:
$ kubectl get pods
NAME        READY   STATUS    RESTARTS   AGE
busybox-1   1/1     Running   0          33s
busybox-2   1/1     Running   0          20s
busybox-3   0/1     Pending   0          16s
busybox-4   0/1     Pending   0          7s
busybox-5   0/1     Pending   0          3s
  4. Create a static pod in the kubelet's static pod directory:
[root@localhost static-pods]# cat > test.yaml 
apiVersion: v1
kind: Pod
metadata:
  name: static-web
  labels:
    role: myrole
spec:
  containers:
    - name: web
      image: nginx
      ports:
        - name: web
          containerPort: 80
          protocol: TCP
[root@localhost static-pods]# pwd
/run/kubernetes/static-pods
  5. Watch it crash and burn :-)
$ kubectl get pods
NAME                   READY   STATUS              RESTARTS   AGE
busybox-1              1/1     Running             0          88s
busybox-2              0/1     Error               0          75s
busybox-3              0/1     OutOfpods           0          71s
busybox-4              0/1     OutOfpods           0          62s
busybox-5              0/1     Pending             0          58s
static-web-127.0.0.1   0/1     ContainerCreating   0          4s
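
A couple of sanity checks for this setup (a sketch based on the values above; the node name 127.0.0.1 and the static pod directory come from local-up-cluster.sh and step 4, and may differ on other setups):

# confirm the kubelet registered the reduced pod capacity from --max-pods=3
kubectl get node 127.0.0.1 -o jsonpath='{.status.capacity.pods}'

# the kubelet picks up static pods from its staticPodPath (--pod-manifest-path);
# here that is the directory test.yaml was written to in step 4
ls /run/kubernetes/static-pods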

Anything else we need to know?

No response

Kubernetes version

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v0.21.0-beta.1", GitCommit:"d0259f5a5ca1338a68603409a554a554d2c0f6f8", GitTreeState:"clean", BuildDate:"2021-05-21T08:44:40Z", GoVersion:"go1.16.1", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"24+", GitVersion:"v1.24.0-alpha.1.41+cc6f12583f2b61", GitCommit:"cc6f12583f2b611e9469a6b2e0247f028aae246b", GitTreeState:"clean", BuildDate:"2021-12-10T10:31:12Z", GoVersion:"go1.17.2", Compiler:"gc", Platform:"linux/amd64"}
WARNING: version difference between client (0.21) and server (1.24) exceeds the supported minor version skew of +/-1

Cloud provider

N/A

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@harche harche added the kind/bug Categorizes issue or PR as related to a bug. label Dec 10, 2021
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 10, 2021
@harche
Contributor Author

harche commented Dec 10, 2021

/sig node
/sig api-machinery

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Dec 10, 2021
@harche
Contributor Author

harche commented Dec 10, 2021

cc @deads2k

@harche
Contributor Author

harche commented Dec 10, 2021

/sig scheduling

@k8s-ci-robot k8s-ci-robot added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Dec 10, 2021
@smarterclayton
Contributor

When the node is running at full capacity (max-pods) and a static pod is added, the scheduler should not schedule an existing Pending pod on that node.

In the order described above, you created normal pods that got scheduled on a node. Then you added a static pod, which is not "scheduled" (the kubelet directly receives that pod, so the scheduler has to react). In that case, the static pod should start, and I would generally expect 1 other pod on that node to get OutOfPods (because the static pod "wins").

However, why is pod 2 in your list in "Error"? It's possible that the explanation for your "crash and burn" is that pod 2 failed (either because the kubelet incorrectly evicted it or because its own process exited), and then the scheduler saw there was a gap and tried to place 3 or 4, which the kubelet immediately rejected as OutOfPods (because the static pod was starting but the scheduler hadn't seen it yet).

So if we know why pod 2 is in Error, we can figure out what happened. In general, the "crash and burn" looks like normal race behavior, where static pod creation on the kubelet and scheduler placement race to use the gap that the kubelet creates for the static pod (when it shuts down pod 2). However, pod 2 should definitely say OutOfPods, not Error, in that scenario.
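
(A quick way to check that, as a sketch using the pod name from the repro above: dump the full object and look at the status block, in particular status.reason, status.message, and the container's terminated state.)

kubectl get pod busybox-2 -o yaml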

@harche
Contributor Author

harche commented Dec 10, 2021

Pod 2 with error
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2021-12-10T14:56:09Z"
  labels:
    run: busybox-2
  name: busybox-2
  namespace: default
  resourceVersion: "469"
  uid: b63af64b-6d00-468e-836a-2e264a0b5e15
spec:
  automountServiceAccountToken: false
  containers:
  - command:
    - sleep
    - inf
    image: busybox
    imagePullPolicy: Always
    name: busybox-2
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: 127.0.0.1
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Never
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2021-12-10T14:56:09Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2021-12-10T14:57:08Z"
    message: 'containers with unready status: [busybox-2]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2021-12-10T14:57:08Z"
    message: 'containers with unready status: [busybox-2]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2021-12-10T14:56:09Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: cri-o://0820e7622875c4ad75a15130063317d5d5fabc7806215f5e7c1d870fedd68437
    image: docker.io/library/busybox:latest
    imageID: docker.io/library/busybox@sha256:50e44504ea4f19f141118a8a8868e6c5bb9856efa33f2183f5ccea7ac62aacc9
    lastState: {}
    name: busybox-2
    ready: false
    restartCount: 0
    started: false
    state:
      terminated:
        containerID: cri-o://0820e7622875c4ad75a15130063317d5d5fabc7806215f5e7c1d870fedd68437
        exitCode: 137
        finishedAt: "2021-12-10T14:57:07Z"
        reason: Error
        startedAt: "2021-12-10T14:56:20Z"
  hostIP: 127.0.0.1
  message: Preempted in order to admit critical pod
  phase: Failed
  podIP: 10.85.0.51
  podIPs:
  - ip: 10.85.0.51
  - ip: 1100:200::33
  qosClass: BestEffort
  reason: Preempting
  startTime: "2021-12-10T14:56:09Z"

@ahg-g
Member

ahg-g commented Dec 10, 2021

Just commenting on the scheduler behavior. Static pods are problematic and cause race conditions between the scheduler and the kubelet because they don't go through the scheduler first.

However, the moment the eviction completes, the scheduler tries to send one of the Pending pods to that node. That pod then fails with an OutOfpods error, because the only slot opened by evicting the running pod was meant for the static pod.

This is not surprising, because it depends on when the static pod made it to the API server and when the scheduler got the pod add event. The race that causes the scheduler to send another pod is as follows (a rough way to observe it is sketched after the list):

  1. static-pod is created
  2. the kubelet evicts pod1 on the node
  3. the scheduler receives the pod1 remove event, so it thinks the node has space and places another pod, pod2
  4. the kubelet rejects pod2 because it doesn't fit once static-pod is taken into account
  5. static-pod makes it to the API server and the scheduler receives an add event; now the scheduler is in sync with the kubelet on the node state and doesn't place other pods on it
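
A rough way to watch this race from the client side (a sketch, assuming the repro above):

# in one terminal, watch pods while the static pod manifest is dropped into place
kubectl get pods -w -o wide

# afterwards, list the pods the kubelet rejected; they end up Failed with reason OutOfpods
kubectl get pods --field-selector=status.phase=Failed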

@harche
Contributor Author

harche commented Dec 10, 2021

thanks @ahg-g

But in my case, I added just one static pod. So before adding the static pod:

$ kubectl get pods 
NAME        READY   STATUS    RESTARTS   AGE
busybox-1   1/1     Running   0          24s
busybox-2   1/1     Running   0          20s
busybox-3   0/1     Pending   0          16s
busybox-4   0/1     Pending   0          13s
busybox-5   0/1     Pending   0          9s
busybox-6   0/1     Pending   0          2s

But after adding that one static pod, my cluster ends up with:

$ kubectl get pods 
NAME                       READY   STATUS      RESTARTS   AGE
busybox-1                  0/1     Error       0          2m38s
busybox-2                  1/1     Running     0          2m34s
busybox-3                  0/1     OutOfpods   0          2m30s
busybox-4                  0/1     OutOfpods   0          2m27s
busybox-5                  0/1     Pending     0          2m23s
busybox-6                  0/1     Pending     0          2m16s
busybox-static-127.0.0.1   1/1     Running     0          73s

Why did 2 pods scheduled by the scheduler end up with OutOfpods?

cc @smarterclayton

@smarterclayton
Contributor

smarterclayton commented Dec 10, 2021

Ok, looking at pod 2, this is working as designed (WAD), EXCEPT that the fact that it was preempted is masking the fact that it was preempted BECAUSE of OutOfpods. Roughly, the kubelet calculated busybox-2 as "preempted DUE to OutOfpods" but then recorded the reason as "Preempted". I think that's a small usability bug in the kubelet - which Reason is really more important to a user in this case? I would argue the reason for the preemption, not the preemption itself, should be shown in the Reason field.

Does the scheduler use Reason programmatically? If not, we can consider changing the behavior to have the kubelet record OutOfpods instead of Preempted. If it does, we might want the scheduler to NOT depend on Reason and instead use a more effective channel.
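
(For reference, the two reasons being compared can be pulled straight from the pod object; a sketch using busybox-2 from this thread, printing the pod-level status.reason ("Preempting" in the dump above) alongside the container's terminated reason ("Error", matching the STATUS column in the earlier kubectl output):)

kubectl get pod busybox-2 -o jsonpath='{.status.reason} / {.status.containerStatuses[0].state.terminated.reason}{"\n"}'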

@ehashman
Member

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 10, 2021
@ahg-g
Member

ahg-g commented Dec 10, 2021

Does the scheduler use Reason programmatically?

No.

@ahg-g
Member

ahg-g commented Dec 10, 2021

/remove-sig scheduling

@k8s-ci-robot k8s-ci-robot removed the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Dec 10, 2021
@ehashman ehashman added this to Triaged in SIG Node Bugs Dec 10, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 10, 2022
@vaibhav2107
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 23, 2022
@k8s-triage-robot

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 21, 2022
@ffilippopoulos

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 22, 2022
@k8s-triage-robot

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 20, 2022
@george-angel
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 20, 2022
@k8s-triage-robot

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 19, 2022
@sathyanarays
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 19, 2022
@k8s-triage-robot

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Jan 19, 2024
@seans3
Contributor

seans3 commented Feb 9, 2024

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 9, 2024