
k8s-infra-prow-builds is frequently failing to schedule due to capacity #32157

Closed
BenTheElder opened this issue Mar 4, 2024 · 15 comments
Assignees: BenTheElder
Labels: kind/bug, lifecycle/active, sig/k8s-infra, sig/testing

Comments

BenTheElder (Member) commented Mar 4, 2024

https://prow.k8s.io/?state=error&cluster=k8s-infra-prow-build

failures like:

There are no nodes that your pod can schedule to - check your requests, tolerations, and node selectors (0/149 nodes are available: 1 node(s) had untolerated taint {ToBeDeletedByClusterAutoscaler: 1709593618}, 1 node(s) had untolerated taint {ToBeDeletedByClusterAutoscaler: 1709593672}, 12 Insufficient memory, 142 Insufficient cpu, 5 node(s) had untolerated taint {ToBeDeletedByClusterAutoscaler: 1709593695}. preemption: 0/149 nodes are available: 142 No preemption victims found for incoming pod, 7 Preemption is not helpful for scheduling..)

https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/122422/pull-kubernetes-verify/1764789037396135936

xref https://kubernetes.slack.com/archives/C09QZ4DQB/p1709577399565409

filed #32156 for quick fix

/sig testing k8s-infra

BenTheElder added the kind/bug label Mar 4, 2024
k8s-ci-robot added the sig/testing and sig/k8s-infra labels Mar 4, 2024
BenTheElder (Member, Author) commented:

We should probably increase the scaling limits for this; job migrations are expected to drive more usage ...

cc @ameukam @upodroid @dims

FYI @rjsadow 😅

BenTheElder (Member, Author) commented:

We still have recent failures to schedule, but we're only at 130 nodes currently, and AFAICT we have the limit set to 1-80 nodes per zone ...
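
For reference, those per-zone bounds live on the node pool's autoscaling settings. A rough sketch using the GKE node pool API field names (illustrative only; the actual node pool definition is managed in kubernetes/k8s.io):

```yaml
# Sketch of a GKE node pool autoscaling block (GKE API field names; illustrative,
# not the actual kubernetes/k8s.io config). For a regional cluster these bounds
# apply per zone, so the effective cap is maxNodeCount x the number of zones.
autoscaling:
  enabled: true
  minNodeCount: 1
  maxNodeCount: 80
```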

BenTheElder (Member, Author) commented Mar 4, 2024

We had recently peaked in cluster scale, though:

[Monitoring charts: "Cluster CPU capacity, allocatable, sum(limit)"; "Node - Total, Requested, Allocatable CPU cores"; "Node - Total ephemeral storage"]

BenTheElder (Member, Author) commented Mar 4, 2024

https://prow.k8s.io/?state=error has dropped off for the moment, possibly following #32156 🤞

The last error pod was scheduled at 3:11 Pacific; the config was updated at 3:24.

BenTheElder self-assigned this Mar 4, 2024
BenTheElder added the lifecycle/active label Mar 4, 2024
BenTheElder (Member, Author) commented:

technically unrelated but similar issue: kubernetes/k8s.io#6519 (boskos pool exhausting quota)

BenTheElder (Member, Author) commented Mar 5, 2024

Still happening.

https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/123568/pull-kubernetes-e2e-kind/1764802917736386560

There are no nodes that your pod can schedule to - check your requests, tolerations, and node selectors (0/141 nodes are available: 141 Insufficient cpu, 9 Insufficient memory. preemption: 0/141 nodes are available: 141 No preemption victims found for incoming pod..)

Schrödinger's scale-up??

Type	Reason	Age	Source	Message
Warning	FailedScheduling	15m	default-scheduler	0/135 nodes are available: 11 Insufficient memory, 135 Insufficient cpu. preemption: 0/135 nodes are available: 135 No preemption victims found for incoming pod..
Normal	TriggeredScaleUp	16m	cluster-autoscaler	pod triggered scale-up: [{https://www.googleapis.com/compute/v1/projects/k8s-infra-prow-build/zones/us-central1-b/instanceGroups/gke-prow-build-pool5-2021092812495606-3a8095df-grp 41->42 (max: 80)}]
Normal	NotTriggerScaleUp	16m	cluster-autoscaler	pod didn't trigger scale-up (it wouldn't fit if a new node is added): 3 Insufficient cpu
Warning	FailedScheduling	15m	default-scheduler	0/136 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/not-ready: }, 10 Insufficient memory, 135 Insufficient cpu. preemption: 0/136 nodes are available: 1 Preemption is not helpful for scheduling, 135 No preemption victims found for incoming pod..
Warning	FailedScheduling	15m	default-scheduler	0/136 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/network-unavailable: }, 10 Insufficient memory, 135 Insufficient cpu. preemption: 0/136 nodes are available: 1 Preemption is not helpful for scheduling, 135 No preemption victims found for incoming pod..
Warning	FailedScheduling	14m	default-scheduler	0/136 nodes are available: 10 Insufficient memory, 136 Insufficient cpu. preemption: 0/136 nodes are available: 136 No preemption victims found for incoming pod..
Warning	FailedScheduling	14m	default-scheduler	0/136 nodes are available: 136 Insufficient cpu, 9 Insufficient memory. preemption: 0/136 nodes are available: 136 No preemption victims found for incoming pod..
Warning	FailedScheduling	14m	default-scheduler	0/137 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/not-ready: }, 10 Insufficient memory, 136 Insufficient cpu. preemption: 0/137 nodes are available: 1 Preemption is not helpful for scheduling, 136 No preemption victims found for incoming pod..
Warning	FailedScheduling	14m	default-scheduler	0/137 nodes are available: 10 Insufficient memory, 137 Insufficient cpu. preemption: 0/137 nodes are available: 137 No preemption victims found for incoming pod..
Warning	FailedScheduling	13m	default-scheduler	0/139 nodes are available: 10 Insufficient memory, 137 Insufficient cpu, 2 node(s) had untolerated taint {node.kubernetes.io/not-ready: }. preemption: 0/139 nodes are available: 137 No preemption victims found for incoming pod, 2 Preemption is not helpful for scheduling..
Warning	FailedScheduling	13m	default-scheduler	0/139 nodes are available: 10 Insufficient memory, 139 Insufficient cpu. preemption: 0/139 nodes are available: 139 No preemption victims found for incoming pod..
Warning	FailedScheduling	12m	default-scheduler	(combined from similar events): 0/141 nodes are available: 141 Insufficient cpu, 9 Insufficient memory. preemption: 0/141 nodes are available: 141 No preemption victims found for incoming pod..
Warning	FailedScheduling	12m	default-scheduler	0/141 nodes are available: 141 Insufficient cpu, 9 Insufficient memory. preemption: 0/141 nodes are available: 141 No preemption victims found for incoming pod..

Maybe we need to increase pod_unscheduled_timeout in Prow? We only allow 5m for that, but we have 15m for pod_pending_timeout.

It seems like we're scaling up but not before Prow gives up.
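
If we go that route, both knobs sit under plank in Prow's config. A sketch of what a bump might look like (the current values are the ones quoted above; the new unscheduled timeout is illustrative, not a recommendation):

```yaml
# Sketch of the relevant Plank settings in Prow's config.yaml.
# Current values per the comment above; the bumped unscheduled timeout is illustrative.
plank:
  pod_pending_timeout: 15m
  pod_unscheduled_timeout: 10m  # currently 5m; gives the autoscaler more time to add a node
```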

pod didn't trigger scale-up (it wouldn't fit if a new node is added): 3 Insufficient cpu

That doesn't make sense: we're requesting 7 cores, the nodes are 8-core, and I don't see that we're anywhere near exhausting GCE CPU quota. And then it did also scale up ...

BenTheElder (Member, Author) commented Mar 5, 2024

Possibly due to system pods? (The autoscaler may be getting confused about whether adding a node would help, since we run right up against the limit; also, maybe one of them is requesting more CPU now?)

On a node successfully running pull-kubernetes-e2e-kind:

CPU: 8 total / 7.91 allocatable / 7.87 requested

7.1 of that is the test pod, 0.1 of which is the sidecar; the rest is system pods.
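
That capacity vs. allocatable gap is visible directly on the Node object. A trimmed sketch using the numbers above (everything else omitted; allocatable shown in millicores):

```yaml
# Trimmed Node status sketch for one of these 8-core build nodes
# (CPU numbers from above; memory, storage, and other resources omitted).
status:
  capacity:
    cpu: "8"
  allocatable:
    cpu: 7910m  # ~0.09 CPU held back by kubelet/system reservations
# The scheduler compares allocatable against the sum of pod requests:
# 7.91 allocatable - 7.87 requested leaves only ~0.04 CPU free on this node.
```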

BenTheElder (Member, Author) commented:

This impacts jobs that:

  1. run on cluster: k8s-infra-prow-builds
  2. request 7 CPU for the test container (which leads to 7.1 total with the sidecar); see the sketch below

It appears to have gotten bad sometime just before 10am Pacific.
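
For reference, the resource stanza on these presubmits looks roughly like this (a sketch, not the exact pull-kubernetes-e2e-kind config; the container name, image, and memory values are illustrative):

```yaml
# Sketch of the pod spec on the affected presubmits (illustrative values;
# the real job configs live under config/jobs in this repo).
# requests == limits gives the pod Guaranteed QoS.
spec:
  containers:
    - name: test                                   # illustrative
      image: registry.example.com/test-image:latest # illustrative
      resources:
        requests:
          cpu: 7          # deliberately ~all of an 8-core node
          memory: 9Gi     # illustrative
        limits:
          cpu: 7
          memory: 9Gi
# Prow's decoration adds a sidecar requesting another ~0.1 CPU, for ~7.1 CPU total.
```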

BenTheElder (Member, Author) commented:

We could do one of:

  1. Move all of these jobs to the EKS cluster.
  2. Reduce the jobs' CPU requests (and suffer slower jobs and possibly more flakes).
  3. Reduce the CPU requests elsewhere (sidecar, any system agents we control) to mitigate.

None of these are ideal long term.

We partially did 1), which we would have done for some of these jobs anyhow.

Ideally we'd root cause and resolve the apparent scaling-decision flap; in any case, I'm logging off for the night.

All of this infra is managed in this repo or github.com/kubernetes/k8s.io at least, so someone else could pick this up in the interim. I'm going to be in meetings all morning unfortunately.

BenTheElder (Member, Author) commented:

https://kubernetes.slack.com/archives/CCK68P2Q2/p1709622064921789?thread_ts=1709203049.908399&cid=CCK68P2Q2

This is probably due to kubernetes/k8s.io#6468, which adds a new daemonset requesting a 0.2 CPU limit. Meanwhile, these jobs request almost 100% of schedulable CPU by design (they're very I/O- and CPU-heavy, so we deliberately leave no room for noisy neighbors).
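
To make the shape of the problem concrete: a daemonset schedules one pod per node, and its request counts against every node's allocatable CPU. A minimal sketch (NOT the actual manifest from kubernetes/k8s.io#6468; names and image are hypothetical):

```yaml
# Hypothetical per-node agent (not the real kubernetes/k8s.io#6468 manifest).
# A 200m request lands on every node and no longer fits next to a 7.1 CPU job pod
# plus the existing system pods on a node with 7.91 CPU allocatable.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: example-agent        # hypothetical
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: example-agent
  template:
    metadata:
      labels:
        app: example-agent
    spec:
      containers:
        - name: agent
          image: registry.example.com/agent:latest  # hypothetical
          resources:
            requests:
              cpu: 200m
              memory: 64Mi
            limits:
              cpu: 200m
              memory: 64Mi
```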

BenTheElder (Member, Author) commented:

Given that code freeze is pending in about one day, we should probably revert for now and then evaluate follow-up options?

This is having a significant impact on merging to kubernetes, as required presubmits are failing to schedule.

https://www.kubernetes.dev/resources/release/

upodroid (Member) commented Mar 5, 2024

This should be mitigated now as I deleted the daemonset.

BenTheElder (Member, Author) commented:

Yes, it appears to be: https://prow.k8s.io/?state=error

BenTheElder (Member, Author) commented Mar 5, 2024

So to conclude:

We schedule many jobs that use ~100% of available CPU because:
a) they'll happily use it
b) they're doing builds or running kind/etcd/local-up-cluster and other I/O-heavy workloads, and I/O is not schedulable; by not leaving room for other CI jobs' CPU requests on the same node, we prevent I/O contention

For a long time that has meant requesting 7 cores (+0.1 for Prow's sidecar): we run on 8-core nodes, system reservations cover part of the remaining core, and no job requests <1 core.

Looking at k8s-infra-prow-builds right now, we have:

| Resource type | Capacity | Allocatable | Total requested |
| --- | --- | --- | --- |
| CPU | 8 CPU | 7.91 CPU | 7.88 CPU |

So we can't fit the 200m CPU daemonset (kubernetes/k8s.io#6521) and that breaks auto-scaling.

Pods for sample node running 7.1 core prowjob:

| Name | Status | CPU requested | Memory requested | Storage requested | Namespace | Restarts | Created on |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ip-masq-agent-h9tlv | Running | 10 mCPU | 16.78 MB | 0 B | kube-system | 0 | Mar 4, 2024, 8:54:44 AM |
| tune-sysctls-9bwvr | Running | 0 CPU | 0 B | 0 B | kube-system | 0 | Mar 4, 2024, 8:54:44 AM |
| pdcsi-node-gmm8z | Running | 10 mCPU | 20.97 MB | 0 B | kube-system | 0 | Mar 4, 2024, 8:54:44 AM |
| kube-proxy-gke-prow-build-pool5-2021092812495606-e8f905a4-sjcm | Running | 100 mCPU | 0 B | 0 B | kube-system | 0 | Mar 4, 2024, 8:54:44 AM |
| create-loop-devs-4rvgp | Running | 0 CPU | 0 B | 0 B | kube-system | 0 | Mar 4, 2024, 8:54:44 AM |
| gke-metadata-server-m6nvx | Running | 100 mCPU | 104.86 MB | 0 B | kube-system | 0 | Mar 4, 2024, 8:54:44 AM |
| netd-q2x2q | Running | 2 mCPU | 31.46 MB | 0 B | kube-system | 0 | Mar 4, 2024, 8:54:44 AM |
| network-metering-agent-ttgwq | Running | 0 CPU | 0 B | 0 B | kube-system | 0 | Mar 4, 2024, 8:54:44 AM |
| gke-metrics-agent-tr7hf | Running | 6 mCPU | 104.86 MB | 0 B | kube-system | 0 | Mar 4, 2024, 8:54:44 AM |
| node-local-dns-l4q7j | Running | 25 mCPU | 20.97 MB | 0 B | kube-system | 0 | Mar 4, 2024, 8:54:44 AM |
| fluentbit-gke-xgpxf | Running | 100 mCPU | 209.72 MB | 0 B | kube-system | 0 | Mar 4, 2024, 8:54:44 AM |
| konnectivity-agent-77c57877b6-4n4jx | Running | 10 mCPU | 31.46 MB | 0 B | kube-system | 0 | Mar 5, 2024, 6:10:59 AM |
| calico-node-zdsn2 | Running | 420 mCPU | 0 B | 0 B | kube-system | 0 | Mar 5, 2024, 7:48:41 AM |
| 0ab611de-742f-4f23-b405-fc049c25febf | Running | 7.1 CPU | 37.58 GB | 0 B | test-pods | 0 | Mar 5, 2024, 8:25:27 AM |
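
Summing the kube-system requests above (10m + 10m + 100m + 100m + 2m + 6m + 25m + 100m + 10m + 420m, the rest being 0) gives roughly 0.78 CPU of system pods; add the 7.1 CPU job pod and you get the 7.88 CPU requested against 7.91 CPU allocatable, which is why an extra 0.2 CPU daemonset cannot fit.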

Least loaded node with 7.1 core prowjob:

| Resource type | Capacity | Allocatable | Total requested |
| --- | --- | --- | --- |
| CPU | 8 CPU | 7.91 CPU | 7.85 CPU |

We currently have at most 0.06 CPU of headroom on nodes running these jobs.

Pods on that node:

| Name | Status | CPU requested | Memory requested | Storage requested | Namespace | Restarts | Created on |
| --- | --- | --- | --- | --- | --- | --- | --- |
| f1b12fca-5529-4b74-9bcf-ca265ed18085 | Running | 7.1 CPU | 10.74 GB | 0 B | test-pods | 0 | Mar 5, 2024, 8:56:27 AM |
| calico-node-4xwzb | Running | 400 mCPU | 0 B | 0 B | kube-system | 0 | Mar 5, 2024, 9:22:04 AM |
| fluentbit-gke-4dv2h | Running | 100 mCPU | 209.72 MB | 0 B | kube-system | 0 | Mar 4, 2024, 10:13:25 PM |
| kube-proxy-gke-prow-build-pool5-2021092812495606-e8f905a4-z7zp | Running | 100 mCPU | 0 B | 0 B | kube-system | 0 | Mar 4, 2024, 10:13:25 PM |
| gke-metadata-server-q5hld | Running | 100 mCPU | 104.86 MB | 0 B | kube-system | 0 | Mar 4, 2024, 10:13:26 PM |
| node-local-dns-qhvl5 | Running | 25 mCPU | 20.97 MB | 0 B | kube-system | 0 | Mar 4, 2024, 10:13:26 PM |
| pdcsi-node-5k9fs | Running | 10 mCPU | 20.97 MB | 0 B | kube-system | 0 | Mar 4, 2024, 10:13:25 PM |
| ip-masq-agent-fjz9r | Running | 10 mCPU | 16.78 MB | 0 B | kube-system | 0 | Mar 4, 2024, 10:13:26 PM |
| gke-metrics-agent-ktb99 | Running | 6 mCPU | 104.86 MB | 0 B | kube-system | 0 | Mar 4, 2024, 10:13:25 PM |
| netd-nxr9b | Running | 2 mCPU | 31.46 MB | 0 B | kube-system | 0 | Mar 4, 2024, 10:13:26 PM |
| network-metering-agent-9wcb5 | Running | 0 CPU | 0 B | 0 B | kube-system | 0 | Mar 4, 2024, 10:13:25 PM |
| create-loop-devs-97mqf | Running | 0 CPU | 0 B | 0 B | kube-system | 0 | Mar 4, 2024, 10:13:25 PM |
| tune-sysctls-27pwv | Running | 0 CPU | 0 B | 0 B | kube-system | 0 | Mar 4, 2024, 10:13:25 PM |

We either have to keep daemonset additions extremely negligible, or we need to reduce the CPU available to these heavy jobs (and that means identifying and updating ALL of them so we don't leave jobs failing to schedule).

Presumably, we have slightly different resources available on the EKS nodes, enough to fit this daemonset alongside while still scheduling 7.1 cores, but we fundamentally have the same risk there.

Additionally: we ensure all of our jobs have guaranteed QoS via presubmit tests on the job configs; we should probably be doing this, at least manually, for anything else we install. The create-loop-devs and tune-sysctls daemonsets are an exception because they're doing almost nothing and don't really need any guaranteed resources.
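
Concretely, guaranteed QoS just means every container in the pod sets requests equal to limits for both CPU and memory. A minimal sketch of what to check on anything we install (container name, image, and values are hypothetical):

```yaml
# Minimal Guaranteed-QoS check for anything we install (hypothetical example):
# every container must set requests == limits for both cpu and memory.
spec:
  containers:
    - name: agent                              # hypothetical
      image: registry.example.com/agent:v1     # hypothetical
      resources:
        requests:
          cpu: 100m
          memory: 128Mi
        limits:
          cpu: 100m
          memory: 128Mi
# The API then reports the result on the pod as status.qosClass: Guaranteed;
# if any container omits limits or sets requests < limits, the pod drops to Burstable.
```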

BenTheElder (Member, Author) commented:

@upodroid points out in kubernetes/k8s.io#6525 (comment) that we should probably just disable Calico network policy and get back 0.4 CPU/node for custom metrics daemonsets.

We are not running it on the old build cluster and I don't think we need it.
