gce-master-scale-performance is failing #100621

Closed · mborsz opened this issue Mar 29, 2021 · 22 comments · Fixed by #100632
Labels: kind/failing-test · needs-triage · priority/critical-urgent · sig/scalability

Comments

mborsz (Member) commented Mar 29, 2021

Which jobs are failing:

https://k8s-testgrid.appspot.com/sig-scalability-gce#gce-master-scale-performance

Which test(s) are failing:

Since when has it been failing:

03-27

Testgrid link:

Reason for failure:

Two different failure reasons

Anything else we need to know:

https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1375855502717620224
https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1376217890679689216

/assign
/milestone 1.20
/priority critical-urgent
/cc @wojtek-t

mborsz added the kind/failing-test label on Mar 29, 2021
k8s-ci-robot (Contributor):
@mborsz: The provided milestone is not valid for this repository. Milestones in this repository: [next-candidate, v1.16, v1.17, v1.18, v1.19, v1.20, v1.21, v1.22]

Use /milestone clear to clear the milestone.

In response to this: the issue description above (including the "/milestone 1.20" command).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot added the priority/critical-urgent, needs-sig, and needs-triage labels on Mar 29, 2021
k8s-ci-robot (Contributor):
@mborsz: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.


mborsz (Member, Author) commented Mar 29, 2021

/milestone v1.21

k8s-ci-robot added this to the v1.21 milestone on Mar 29, 2021
mborsz (Member, Author) commented Mar 29, 2021

/sig scalability
cc @kubernetes/release-team

k8s-ci-robot added the sig/scalability label and removed the needs-sig label on Mar 29, 2021
mborsz (Member, Author) commented Mar 29, 2021

In both cases we have no master logs, which makes it nearly impossible to debug this. kubernetes/test-infra#21553 should fix the issue with the missing logs.

In the meantime, I'm running a manual run with the same config as the last failing run.

mborsz (Member, Author) commented Mar 29, 2021

https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1375855502717620224:

One of the reasons for the failure is:

[measurement call WaitForControlledPodsRunning -  error: [measurement call WaitForControlledPodsRunning - WaitForRunningDaemonSets error: 1 objects timed out: DaemonSets: test-52h5qo-1/daemonset-0]]
I0327 18:05:43.909687   14612 wait_for_pods.go:91] WaitForControlledPodsRunning: namespace(test-52h5qo-1), labelSelector(name=daemonset-0): Pods: 811 out of 5000 created, 385 running (385 updated), 113 pending scheduled, 313 not scheduled, 0 inactive, 0 terminating, 0 unknown, 0 runningButNotReady
I0327 18:05:48.926736   14612 wait_for_pods.go:91] WaitForControlledPodsRunning: namespace(test-52h5qo-1), labelSelector(name=daemonset-0): Pods: 1310 out of 5000 created, 724 running (724 updated), 141 pending scheduled, 445 not scheduled, 0 inactive, 0 terminating, 0 unknown, 0 runningButNotReady
I0327 18:05:53.949881   14612 wait_for_pods.go:91] WaitForControlledPodsRunning: namespace(test-52h5qo-1), labelSelector(name=daemonset-0): Pods: 1810 out of 5000 created, 1058 running (1058 updated), 182 pending scheduled, 570 not scheduled, 0 inactive, 0 terminating, 0 unknown, 0 runningButNotReady
I0327 18:05:58.985054   14612 wait_for_pods.go:91] WaitForControlledPodsRunning: namespace(test-52h5qo-1), labelSelector(name=daemonset-0): Pods: 2311 out of 5000 created, 1419 running (1419 updated), 193 pending scheduled, 699 not scheduled, 0 inactive, 0 terminating, 0 unknown, 0 runningButNotReady
I0327 18:06:04.025771   14612 wait_for_pods.go:91] WaitForControlledPodsRunning: namespace(test-52h5qo-1), labelSelector(name=daemonset-0): Pods: 2812 out of 5000 created, 1795 running (1795 updated), 208 pending scheduled, 809 not scheduled, 0 inactive, 0 terminating, 0 unknown, 0 runningButNotReady
I0327 18:06:09.075226   14612 wait_for_pods.go:91] WaitForControlledPodsRunning: namespace(test-52h5qo-1), labelSelector(name=daemonset-0): Pods: 3314 out of 5000 created, 2186 running (2186 updated), 154 pending scheduled, 974 not scheduled, 0 inactive, 0 terminating, 0 unknown, 0 runningButNotReady
I0327 18:06:14.142296   14612 wait_for_pods.go:91] WaitForControlledPodsRunning: namespace(test-52h5qo-1), labelSelector(name=daemonset-0): Pods: 3815 out of 5000 created, 2573 running (2573 updated), 150 pending scheduled, 1092 not scheduled, 0 inactive, 0 terminating, 0 unknown, 0 runningButNotReady
I0327 18:06:19.209942   14612 wait_for_pods.go:91] WaitForControlledPodsRunning: namespace(test-52h5qo-1), labelSelector(name=daemonset-0): Pods: 4322 out of 5000 created, 2956 running (2956 updated), 157 pending scheduled, 1209 not scheduled, 0 inactive, 0 terminating, 0 unknown, 0 runningButNotReady
I0327 18:06:24.293161   14612 wait_for_pods.go:91] WaitForControlledPodsRunning: namespace(test-52h5qo-1), labelSelector(name=daemonset-0): Pods: 4826 out of 5000 created, 3344 running (3344 updated), 142 pending scheduled, 1340 not scheduled, 0 inactive, 0 terminating, 0 unknown, 0 runningButNotReady
I0327 18:06:29.378822   14612 wait_for_pods.go:91] WaitForControlledPodsRunning: namespace(test-52h5qo-1), labelSelector(name=daemonset-0): Pods: 4999 out of 5000 created, 3651 running (3651 updated), 134 pending scheduled, 1214 not scheduled, 0 inactive, 0 terminating, 0 unknown, 0 runningButNotReady
I0327 18:06:34.481161   14612 wait_for_pods.go:91] WaitForControlledPodsRunning: namespace(test-52h5qo-1), labelSelector(name=daemonset-0): Pods: 4999 out of 5000 created, 4046 running (4046 updated), 131 pending scheduled, 822 not scheduled, 0 inactive, 0 terminating, 0 unknown, 0 runningButNotReady
I0327 18:06:39.581158   14612 wait_for_pods.go:91] WaitForControlledPodsRunning: namespace(test-52h5qo-1), labelSelector(name=daemonset-0): Pods: 4999 out of 5000 created, 4425 running (4425 updated), 133 pending scheduled, 441 not scheduled, 0 inactive, 0 terminating, 0 unknown, 0 runningButNotReady
I0327 18:06:44.685215   14612 wait_for_pods.go:91] WaitForControlledPodsRunning: namespace(test-52h5qo-1), labelSelector(name=daemonset-0): Pods: 4999 out of 5000 created, 4805 running (4805 updated), 140 pending scheduled, 54 not scheduled, 0 inactive, 0 terminating, 0 unknown, 0 runningButNotReady
I0327 18:06:49.795634   14612 wait_for_pods.go:91] WaitForControlledPodsRunning: namespace(test-52h5qo-1), labelSelector(name=daemonset-0): Pods: 4999 out of 5000 created, 4999 running (4999 updated), 0 pending scheduled, 0 not scheduled, 0 inactive, 0 terminating, 0 unknown, 0 runningButNotReady
I0327 18:06:54.904029   14612 wait_for_pods.go:91] WaitForControlledPodsRunning: namespace(test-52h5qo-1), labelSelector(name=daemonset-0): Pods: 4999 out of 5000 created, 4999 running (4999 updated), 0 pending scheduled, 0 not scheduled, 0 inactive, 0 terminating, 0 unknown, 0 runningButNotReady
I0327 18:07:00.011642   14612 wait_for_pods.go:91] WaitForControlledPodsRunning: namespace(test-52h5qo-1), labelSelector(name=daemonset-0): Pods: 4999 out of 5000 created, 4999 running (4999 updated), 0 pending scheduled, 0 not scheduled, 0 inactive, 0 terminating, 0 unknown, 0 runningButNotReady
I0327 18:07:05.118317   14612 wait_for_pods.go:91] WaitForControlledPodsRunning: namespace(test-52h5qo-1), labelSelector(name=daemonset-0): Pods: 4999 out of 5000 created, 4999 running (4999 updated), 0 pending scheduled, 0 not scheduled, 0 inactive, 0 terminating, 0 unknown, 0 runningButNotReady
I0327 18:07:10.223735   14612 wait_for_pods.go:91] WaitForControlledPodsRunning: namespace(test-52h5qo-1), labelSelector(name=daemonset-0): Pods: 4999 out of 5000 created, 4999 running (4999 updated), 0 pending scheduled, 0 not scheduled, 0 inactive, 0 terminating, 0 unknown, 0 runningButNotReady
I0327 18:07:15.329690   14612 wait_for_pods.go:91] WaitForControlledPodsRunning: namespace(test-52h5qo-1), labelSelector(name=daemonset-0): Pods: 4999 out of 5000 created, 4999 running (4999 updated), 0 pending scheduled, 0 not scheduled, 0 inactive, 0 terminating, 0 unknown, 0 runningButNotReady
I0327 18:07:20.434783   14612 wait_for_pods.go:91] WaitForControlledPodsRunning: namespace(test-52h5qo-1), labelSelector(name=daemonset-0): Pods: 5000 out of 5000 created, 4999 running (4999 updated), 1 pending scheduled, 0 not scheduled, 0 inactive, 0 terminating, 0 unknown, 0 runningButNotReady
I0327 18:07:26.141372   14612 wait_for_pods.go:91] WaitForControlledPodsRunning: namespace(test-52h5qo-1), labelSelector(name=daemonset-0): Pods: 5000 out of 5000 created, 4999 running (4999 updated), 1 pending scheduled, 0 not scheduled, 0 inactive, 0 terminating, 0 unknown, 0 runningButNotReady
I0327 18:07:31.260816   14612 wait_for_pods.go:91] WaitForControlledPodsRunning: namespace(test-52h5qo-1), labelSelector(name=daemonset-0): Pods: 5000 out of 5000 created, 4999 running (4999 updated), 1 pending scheduled, 0 not scheduled, 0 inactive, 0 terminating, 0 unknown, 0 runningButNotReady
I0327 18:07:36.376997   14612 wait_for_pods.go:91] WaitForControlledPodsRunning: namespace(test-52h5qo-1), labelSelector(name=daemonset-0): Pods: 5000 out of 5000 created, 4999 running (4999 updated), 1 pending scheduled, 0 not scheduled, 0 inactive, 0 terminating, 0 unknown, 0 runningButNotReady
(...)
I0327 18:15:09.144912   14612 wait_for_pods.go:91] WaitForControlledPodsRunning: namespace(test-52h5qo-1), labelSelector(name=daemonset-0): Pods: 5000 out of 5000 created, 4999 running (4999 updated), 1 pending scheduled, 0 not scheduled, 0 inactive, 0 terminating, 0 unknown, 0 runningButNotReady
I0327 18:15:14.251111   14612 wait_for_pods.go:91] WaitForControlledPodsRunning: namespace(test-52h5qo-1), labelSelector(name=daemonset-0): Pods: 5000 out of 5000 created, 4998 running (4998 updated), 1 pending scheduled, 0 not scheduled, 0 inactive, 0 terminating, 0 unknown, 1 runningButNotReady
(...)
I0327 18:20:37.173323   14612 wait_for_pods.go:91] WaitForControlledPodsRunning: namespace(test-52h5qo-1), labelSelector(name=daemonset-0): Pods: 5000 out of 5000 created, 4998 running (4998 updated), 1 pending scheduled, 0 not scheduled, 0 inactive, 0 terminating, 0 unknown, 1 runningButNotReady

We also see quite a few OOMs on ~10 nodes. The processes that get OOM-killed look quite random: pause containers, systemd-logind, systemd-journald. It's quite unlikely that those are the processes actually eating all the memory.
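
To pin down the single pod that stays runningButNotReady (and the node hosting it), a check along these lines should work (a sketch: the namespace and label selector come from the log above, and the awk filter assumes the usual `kubectl get pods -o wide` column layout):

kubectl get pods -n test-52h5qo-1 -l name=daemonset-0 -o wide --no-headers \
  | awk '$2 != "1/1"'   # keep only pods that are not fully ready; the NODE column shows where they run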

mborsz (Member, Author) commented Mar 29, 2021

I checked the serial console logs for one of the nodes where OOMs were reported, and most likely metrics-server is the component that consumes a lot of memory and triggers the OOM killer on the nodes.

In the manual run that is currently in progress we can see it failing and jumping from node to node, triggering OOMs on each new node:

➜  kubectl get -n kube-system pods -l k8s-app=metrics-server -o wide
NAME                                     READY   STATUS        RESTARTS   AGE     IP             NODE                                    NOMINATED NODE   READINESS GATES
metrics-server-v0.3.6-64f5546c87-2qbgx   2/2     Terminating   0          21m     10.70.187.37   gce-scale-cluster-minion-group-1-6p06   <none>           <none>
metrics-server-v0.3.6-64f5546c87-8dxcg   2/2     Terminating   0          8m56s   10.67.248.15   gce-scale-cluster-minion-group-1-dlvl   <none>           <none>
metrics-server-v0.3.6-64f5546c87-9vxhc   2/2     Terminating   0          35m     10.74.126.12   gce-scale-cluster-minion-group-k6ds     <none>           <none>
metrics-server-v0.3.6-64f5546c87-bpg5m   2/2     Terminating   0          41m     10.64.14.4     gce-scale-cluster-minion-group-1-3vb6   <none>           <none>
metrics-server-v0.3.6-64f5546c87-j4csx   2/2     Running       0          2m43s   10.69.136.15   gce-scale-cluster-minion-group-2-qmcw   <none>           <none>
metrics-server-v0.3.6-64f5546c87-lz5xv   2/2     Terminating   0          15m     10.82.240.37   gce-scale-cluster-minion-group-7sq8     <none>           <none>
metrics-server-v0.3.6-64f5546c87-mtb4c   2/2     Terminating   0          28m     10.66.50.15    gce-scale-cluster-minion-group-1-mcdt   <none>           <none>
metrics-server-v0.3.6-64f5546c87-vpdgr   2/2     Terminating   0          47m     10.78.252.12   gce-scale-cluster-minion-group-3-h4vh   <none>           <none>
metrics-server-v0.3.6-64f5546c87-x77lh   2/2     Terminating   0          54m     10.82.59.8     gce-scale-cluster-minion-group-g7nv     <none>           <none>
➜  
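
One way to confirm the OOM kills on one of those nodes is to grep the kernel log over SSH (a sketch: the node name comes from the listing above, the zone is a placeholder, and SSH access to the test project is assumed):

# the zone below is a placeholder; use the zone of the scale-test node group
gcloud compute ssh gce-scale-cluster-minion-group-1-6p06 --zone us-central1-b -- \
  "sudo dmesg -T | grep -iE 'out of memory|oom-kill' | tail"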

mborsz (Member, Author) commented Mar 29, 2021

From a successful run:

I0319 17:08:36.014075      11 factory.go:334] "Unable to schedule pod; no nodes are registered to the cluster; waiting" pod="kube-system/metrics-server-v0.3.6-64f5546c87-pwsbl"
I0319 17:08:37.774124      11 factory.go:338] "Unable to schedule pod; no fit; waiting" pod="kube-system/metrics-server-v0.3.6-64f5546c87-pwsbl" err="0/1 nodes are available: 1 node(s) were unschedulable."
I0319 17:08:40.992091      11 factory.go:338] "Unable to schedule pod; no fit; waiting" pod="kube-system/metrics-server-v0.3.6-64f5546c87-pwsbl" err="0/2 nodes are available: 1 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate, 1 node(s) were unschedulable."
I0319 17:08:45.766992      11 factory.go:338] "Unable to schedule pod; no fit; waiting" pod="kube-system/metrics-server-v0.3.6-64f5546c87-pwsbl" err="0/2 nodes are available: 1 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate, 1 node(s) were unschedulable."
I0319 17:08:53.776543      11 scheduler.go:604] "Successfully bound pod to node" pod="kube-system/metrics-server-v0.3.6-64f5546c87-pwsbl" node="gce-scale-cluster-minion-heapster" evaluatedNodes=79 feasibleNodes=1
I0319 17:09:05.175919      11 scheduler.go:604] "Successfully bound pod to node" pod="kube-system/metrics-server-v0.3.6-76d9bbb485-dpbnr" node="gce-scale-cluster-minion-heapster" evaluatedNodes=234 feasibleNodes=88
I0319 17:09:34.224913      11 scheduler.go:604] "Successfully bound pod to node" pod="kube-system/metrics-server-v0.3.6-8b98f98c9-hgh2b" node="gce-scale-cluster-minion-group-2-nf6k" evaluatedNodes=786 feasibleNodes=549
I0319 17:09:35.167076      11 scheduler.go:604] "Successfully bound pod to node" pod="kube-system/metrics-server-v0.3.6-74b84b78f5-drbpl" node="gce-scale-cluster-minion-heapster" evaluatedNodes=1459 feasibleNodes=1
I0319 17:10:33.359230      11 scheduler.go:604] "Successfully bound pod to node" pod="kube-system/metrics-server-v0.3.6-7566d85dbf-85h8r" node="gce-scale-cluster-minion-heapster" evaluatedNodes=3709 feasibleNodes=1
I0319 17:10:40.890062      11 scheduler.go:604] "Successfully bound pod to node" pod="kube-system/metrics-server-v0.3.6-8697cbb849-6qp7j" node="gce-scale-cluster-minion-heapster" evaluatedNodes=3740 feasibleNodes=1
I0319 17:11:33.713727      11 scheduler.go:604] "Successfully bound pod to node" pod="kube-system/metrics-server-v0.3.6-754667755-nn6kw" node="gce-scale-cluster-minion-heapster" evaluatedNodes=4221 feasibleNodes=1
I0319 17:13:17.532540      11 scheduler.go:604] "Successfully bound pod to node" pod="kube-system/metrics-server-v0.3.6-7b879c778c-zlz5x" node="gce-scale-cluster-minion-heapster" evaluatedNodes=4940 feasibleNodes=1
I0319 17:13:33.344090      11 scheduler.go:604] "Successfully bound pod to node" pod="kube-system/metrics-server-v0.3.6-58bc6d979c-8ln6p" node="gce-scale-cluster-minion-heapster" evaluatedNodes=4997 feasibleNodes=1

There are multiple instances of metrics-server, but they all appear within a short period of time (when the cluster starts); they are recreated by pod_nanny.

It looks like in the failed runs the pod_nanny for metrics-server failed to resize the metrics-server deployment to a size appropriate for a 5k-node cluster.

mborsz (Member, Author) commented Mar 29, 2021

From pod_nanny logs (in the manual run):

gce-scale-cluster-minion-heapster ~ # crictl logs 4cd514ab46db6
ERROR: logging before flag.Parse: I0329 14:16:56.950878       1 pod_nanny.go:68] Invoked by [/pod_nanny --config-dir=/etc/config --cpu=40m --extra-cpu=0.5m --memory=40Mi --extra-memory=4Mi --threshold=5 --deployment=metrics-server-v0.3.6 --container=metrics-server --poll-period=300000 --estimator=exponential --minClusterSize=16 --use-metrics=true]
ERROR: logging before flag.Parse: I0329 14:16:56.950974       1 pod_nanny.go:69] Version: 1.8.11
ERROR: logging before flag.Parse: I0329 14:16:56.951079       1 pod_nanny.go:85] Watching namespace: kube-system, pod: metrics-server-v0.3.6-64f5546c87-6nhjb, container: metrics-server.
ERROR: logging before flag.Parse: I0329 14:16:56.951097       1 pod_nanny.go:86] storage: MISSING, extra_storage: 0Gi
ERROR: logging before flag.Parse: I0329 14:16:56.953017       1 pod_nanny.go:116] cpu: 40m, extra_cpu: 0.5m, memory: 40Mi, extra_memory: 4Mi
ERROR: logging before flag.Parse: I0329 14:16:56.953051       1 pod_nanny.go:145] Resources: [{Base:{i:{value:40 scale:-3} d:{Dec:<nil>} s:40m Format:DecimalSI} ExtraPerNode:{i:{value:5 scale:-4} d:{Dec:<nil>} s: Format:DecimalSI} Name:cpu} {Base:{i:{value:41943040 scale:0} d:{Dec:<nil>} s: Format:BinarySI} ExtraPerNode:{i:{value:4194304 scale:0} d:{Dec:<nil>} s:4Mi Format:BinarySI} Name:memory}]
ERROR: logging before flag.Parse: E0329 14:17:00.459933       1 nanny_lib.go:128] metric unset
ERROR: logging before flag.Parse: E0329 14:17:13.853906       1 nanny_lib.go:128] metric unset
ERROR: logging before flag.Parse: E0329 14:17:27.859112       1 nanny_lib.go:128] metric unset
ERROR: logging before flag.Parse: E0329 14:17:41.151382       1 nanny_lib.go:128] metric unset
ERROR: logging before flag.Parse: E0329 14:17:54.653267       1 nanny_lib.go:128] metric unset
ERROR: logging before flag.Parse: E0329 14:18:08.154849       1 nanny_lib.go:128] metric unset
ERROR: logging before flag.Parse: E0329 14:18:21.555461       1 nanny_lib.go:128] metric unset
ERROR: logging before flag.Parse: E0329 14:18:34.753702       1 nanny_lib.go:128] metric unset
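
For reference, a back-of-the-envelope calculation from the nanny flags above shows roughly what it should request for a 5000-node cluster (a sketch using the linear form only; the exponential estimator rounds the node count up, so the actual values can be somewhat higher):

# --cpu=40m plus --extra-cpu=0.5m per node; --memory=40Mi plus --extra-memory=4Mi per node
nodes=5000
echo "cpu:    $(( 40 + nodes / 2 ))m"    # ~2540m
echo "memory: $(( 40 + 4 * nodes ))Mi"   # ~20040Mi, i.e. roughly 19.6Gi

A request of that size only fits on the dedicated heapster node (which matches the feasibleNodes=1 lines in the scheduler logs above), whereas a metrics-server that was never resized has no meaningful resource requests set and can land on (and OOM) any regular node.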

mborsz (Member, Author) commented Mar 29, 2021

Indeed, 'etcd_object_counts' disappeared:

➜ ~ kubectl get --raw /metrics | grep etcd_object_counts
➜ ~
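
For comparison, a quick way to check whether only the metric name changed (assuming #99785 renamed it to apiserver_storage_objects; that name is an assumption here, not something confirmed in this thread):

# if it is a plain rename, only the new name should match on a 1.21 master
kubectl get --raw /metrics | grep -E '^(etcd_object_counts|apiserver_storage_objects)' | head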

mborsz (Member, Author) commented Mar 29, 2021

It looks like etcd_object_counts has been renamed (#99785), which leaves pod_nanny unable to resize metrics-server. As a result, metrics-server has no resources set (they should be set by pod_nanny based on the value of that metric), which allows it to be scheduled on small nodes (other than the dedicated "heapster" one) and to OOM continuously, breaking other components running on those nodes (e.g. daemonset pods).

The questions here (see the next comment):

mborsz (Member, Author) commented Mar 29, 2021

/cc @erain @logicalhan

Is it intentional that etcd_object_counts disappears in the 1.21 release? Shouldn't we move this to at least 1.22 to give some grace period during which both metrics are available?

logicalhan (Member):
@erain I can't remember, did we delete the original? We should probably retain it for a release or two prior to removal, for the reason above.

dims (Member) commented Mar 29, 2021

cc @ehashman

logicalhan (Member):
/cc @erain @logicalhan

Is it intentional that etcd_object_counts disappears in the 1.21 release? Shouldn't we move this to at least 1.22 to give some grace period during which both metrics are available?

https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/storage/etcd3/metrics/metrics.go#L46-L62

Are you sure they are gone?

logicalhan (Member):
Ah, actually I think the deprecated version was incorrect, so the metric was hidden a release too soon. (If I recall the metrics stability framework correctly, a metric whose DeprecatedVersion is set to release X is still emitted in X but hidden by default starting in X+1, so an off-by-one DeprecatedVersion makes it disappear one release early.)

#100632
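
As a stopgap until that lands, the hidden metric should be recoverable via the standard hidden-metrics escape hatch (a sketch, assuming the usual flag semantics on a 1.21 apiserver; "<existing flags>" stands for whatever the apiserver already runs with):

# re-surface metrics hidden since the previous minor release
kube-apiserver <existing flags> --show-hidden-metrics-for-version=1.20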

mborsz (Member, Author) commented Mar 29, 2021

/reopen

Thanks for the fix!
I would like to keep this open until we see at least a single green run, to double-check that this was the only issue in the test.

k8s-ci-robot reopened this on Mar 29, 2021
k8s-ci-robot (Contributor):
@mborsz: Reopened this issue.

In response to this: the "/reopen" comment above.

mborsz (Member, Author) commented Mar 30, 2021

https://gke-scalability-prow.corp.goog/view/gcs/gke-scalability-prow/logs/ci-kubernetes-e2e-gke-1.20-5000-performance/1376444124651065347 has passed

In the kube-scheduler logs I see that metrics-server was correctly upsized by addon-manager (the resizes happen at the very beginning of the test, and the hash in the pod names changes, which means the deployment's spec has changed).

➜  tmp.zWSxVvZZhF grep metrics-server-v0.3.6 kube-scheduler.log
I0329 17:08:38.536301      12 factory.go:334] "Unable to schedule pod; no nodes are registered to the cluster; waiting" pod="kube-system/metrics-server-v0.3.6-64f5546c87-q8ddd"
I0329 17:08:39.948109      12 factory.go:338] "Unable to schedule pod; no fit; waiting" pod="kube-system/metrics-server-v0.3.6-64f5546c87-q8ddd" err="0/1 nodes are available: 1 node(s) were unschedulable."
I0329 17:08:43.158658      12 factory.go:338] "Unable to schedule pod; no fit; waiting" pod="kube-system/metrics-server-v0.3.6-64f5546c87-q8ddd" err="0/2 nodes are available: 1 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate, 1 node(s) were unschedulable."
I0329 17:08:47.966437      12 factory.go:338] "Unable to schedule pod; no fit; waiting" pod="kube-system/metrics-server-v0.3.6-64f5546c87-q8ddd" err="0/110 nodes are available: 1 node(s) were unschedulable, 109 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate."
I0329 17:08:56.965331      12 scheduler.go:604] "Successfully bound pod to node" pod="kube-system/metrics-server-v0.3.6-64f5546c87-q8ddd" node="gce-scale-cluster-minion-group-4-p4mv" evaluatedNodes=593 feasibleNodes=143
I0329 17:09:09.498830      12 scheduler.go:604] "Successfully bound pod to node" pod="kube-system/metrics-server-v0.3.6-76d9bbb485-6qf8f" node="gce-scale-cluster-minion-group-4-2vrp" evaluatedNodes=807 feasibleNodes=499
I0329 17:09:37.583802      12 scheduler.go:604] "Successfully bound pod to node" pod="kube-system/metrics-server-v0.3.6-8b98f98c9-246h7" node="gce-scale-cluster-minion-heapster" evaluatedNodes=1096 feasibleNodes=759
I0329 17:09:54.449252      12 scheduler.go:604] "Successfully bound pod to node" pod="kube-system/metrics-server-v0.3.6-8697cbb849-2bjnv" node="gce-scale-cluster-minion-heapster" evaluatedNodes=3391 feasibleNodes=1
I0329 17:10:36.010658      12 scheduler.go:604] "Successfully bound pod to node" pod="kube-system/metrics-server-v0.3.6-754667755-82pw2" node="gce-scale-cluster-minion-heapster" evaluatedNodes=4186 feasibleNodes=1
I0329 17:13:11.174771      12 scheduler.go:604] "Successfully bound pod to node" pod="kube-system/metrics-server-v0.3.6-7b879c778c-78v2p" node="gce-scale-cluster-minion-heapster" evaluatedNodes=4900 feasibleNodes=1
I0329 17:13:36.287961      12 scheduler.go:604] "Successfully bound pod to node" pod="kube-system/metrics-server-v0.3.6-58bc6d979c-rqxjs" node="gce-scale-cluster-minion-heapster" evaluatedNodes=5001 feasibleNodes=1
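
If needed, a quick way to double-check that the resize actually landed in the deployment spec (a sketch; the container name matches the --container flag in the pod_nanny logs above):

kubectl -n kube-system get deployment metrics-server-v0.3.6 \
  -o jsonpath='{.spec.template.spec.containers[?(@.name=="metrics-server")].resources}'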

mborsz (Member, Author) commented Mar 30, 2021

The issue seems to be resolved. I triggered another manual run in the internal project to make sure it's not a flake. We expect results in ~4h.

mborsz (Member, Author) commented Mar 30, 2021

The manual test has passed. I believe this is resolved.

mborsz closed this as completed on Mar 30, 2021