gce-master-scale-performance is failing #100621
@mborsz: The provided milestone is not valid for this repository. Milestones in this repository: [...]. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@mborsz: This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/milestone v1.21
/sig scalability
In both cases we have no master logs, which makes it nearly impossible to debug this. kubernetes/test-infra#21553 should fix the issue with missing logs. In the meantime, I'm doing a manual run with the same config as the last failing run.
One of the reasons for failure is:
We also see quite a few OOMs on ~10 nodes. The processes that are OOM-killed look quite random: pause pods, systemd-logind, systemd-journal. It's quite unlikely that those are the processes that eat all the memory.
I checked serial console logs for one of the nodes where an OOM was reported, and most likely metrics-server is the component that consumes a lot of memory and triggers the OOM killer on the nodes. In the manual run that is currently in progress, we see that it fails and jumps from node to node, triggering OOMs on one node after another:
From a successful run:
There are multiple instances of metrics-server, but all of them within a short period of time (when the cluster starts). They are recreated by pod_nanny. It looks like in the failed runs the pod_nanny for metrics-server failed to resize the metrics-server deployment to a size appropriate for a 5k-node cluster.
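For context, pod_nanny (the addon-resizer sidecar) sizes metrics-server roughly linearly in a cluster-size signal: resources = base + per-unit increment × cluster size. Below is a minimal sketch of that scaling model; the constants are hypothetical, not the real addon-resizer defaults.

```go
package main

import "fmt"

// Illustrative nanny-style linear scaling: resources = base + perUnit*size.
// These constants are assumptions for the sketch, not addon-resizer's config.
const (
	baseCPUm     = 100 // CPU (millicores) granted regardless of cluster size
	perNodeCPUm  = 1   // extra CPU per node / "unit" of cluster size
	baseMemMB    = 100 // memory (MiB) granted regardless of cluster size
	perNodeMemMB = 8   // extra memory per node / "unit" of cluster size
)

func desiredResources(clusterSize int) (cpuMilli, memMB int) {
	return baseCPUm + perNodeCPUm*clusterSize, baseMemMB + perNodeMemMB*clusterSize
}

func main() {
	// On a 5000-node cluster the computed request is far larger than the base,
	// which is why a failed resize leaves metrics-server badly under-provisioned.
	cpu, mem := desiredResources(5000)
	fmt.Printf("5000 nodes -> %dm CPU, %dMi memory\n", cpu, mem)
}
```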
From pod_nanny logs (in the manual run):
Indeed, 'etcd_object_counts' disappeared:
➜ ~ kubectl get --raw /metrics | grep etcd_object_counts
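To illustrate why a rename breaks consumers of this metric, here is a minimal sketch that parses Prometheus text-format output (the same format returned by `kubectl get --raw /metrics`) and checks whether a metric family is present. The canned exposition string, and the assumption that the new name is apiserver_storage_objects, are illustrative; this is not pod_nanny's actual code.

```go
package main

import (
	"fmt"
	"strings"

	"github.com/prometheus/common/expfmt"
)

// lookupMetric parses Prometheus text-format output and reports whether the
// named metric family exists in it.
func lookupMetric(exposition, name string) (bool, error) {
	var parser expfmt.TextParser
	families, err := parser.TextToMetricFamilies(strings.NewReader(exposition))
	if err != nil {
		return false, err
	}
	_, ok := families[name]
	return ok, nil
}

func main() {
	// Hypothetical /metrics excerpt: only the renamed metric is exported,
	// so a lookup by the old name finds nothing.
	metricsOutput := `# TYPE apiserver_storage_objects gauge
apiserver_storage_objects{resource="pods"} 150000
`
	for _, name := range []string{"etcd_object_counts", "apiserver_storage_objects"} {
		found, _ := lookupMetric(metricsOutput, name)
		fmt.Printf("%s present: %v\n", name, found)
	}
}
```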
It looks like etcd_object_counts has been renamed (#99785), and this makes pod_nanny unable to resize metrics-server. For that reason metrics-server has no resource requests set (they should be set by pod_nanny based on the value of the metric), which allows it to schedule on small nodes (other than the "heapster" one) and OOM continuously, breaking other components running on those nodes (e.g. daemonset pods). The questions here:
/cc @erain @logicalhan Is it intentional that etcd_object_counts disappears in the 1.21 version? Shouldn't we move this to at least 1.22 to give a grace period during which both metrics are available?
@erain I can't remember, did we delete the original? We should probably retain it for a release or two, for the reason above.
cc @ehashman |
Are you sure they are gone? |
We are double-writing at the moment.
Ah actually I think the deprecated version was incorrect, so the metric was hidden a release too soon. |
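For background on "hidden a release too soon": metrics registered through k8s.io/component-base/metrics carry a DeprecatedVersion; at that version the metric is still exported but marked deprecated, and one minor release later it becomes hidden (dropped from /metrics unless --show-hidden-metrics-for-version is set). A minimal sketch of such a registration, with illustrative field values rather than the actual in-tree definition of etcd_object_counts:

```go
package main

import (
	"k8s.io/component-base/metrics"
	"k8s.io/component-base/metrics/legacyregistry"
)

// Illustrative definition only; the real metric lives in the apiserver storage
// code, and its DeprecatedVersion is the value under discussion in this issue.
var objectCounts = metrics.NewGaugeVec(
	&metrics.GaugeOpts{
		Name: "etcd_object_counts",
		Help: "Number of stored objects at the time of last check, split by kind.",
		// At DeprecatedVersion the metric is still exported but marked deprecated;
		// one minor release later it becomes hidden and no longer appears in
		// /metrics unless --show-hidden-metrics-for-version is set. Setting this
		// one release too early hides the metric a release too soon.
		DeprecatedVersion: "1.21.0",
		StabilityLevel:    metrics.ALPHA,
	},
	[]string{"resource"},
)

func main() {
	legacyregistry.MustRegister(objectCounts)
	objectCounts.WithLabelValues("pods").Set(150000)
}
```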
/reopen Thanks for the fix!
@mborsz: Reopened this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
In the kube-scheduler logs I see that metrics-server was correctly upsized by addon-manager (the resize happens at the very beginning of the test, and the hash in the name changes, which means that the deployment's spec has changed).
The issue seems to be resolved. I triggered another manual run in the internal project to make sure that it's not a flake. We expect results in ~4h.
The manual test has passed. I believe this is resolved. |
Which jobs are failing:
https://k8s-testgrid.appspot.com/sig-scalability-gce#gce-master-scale-performance
Which test(s) are failing:
Since when has it been failing:
03-27
Testgrid link:
Reason for failure:
Two different failure reasons
Anything else we need to know:
https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1375855502717620224
https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1376217890679689216
/assign
/milestone 1.20
/priority critical-urgent
/cc @wojtek-t