gce-master-scale-performance is failing #100621
@mborsz: The provided milestone is not valid for this repository. Milestones in this repository: [...]. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@mborsz: This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/milestone v1.21
/sig scalability
In both cases we have no master logs, which makes it nearly impossible to debug this. kubernetes/test-infra#21553 should fix the issue with missing logs. In the meantime, I'm doing a manual run with the same config as the last failing run.
One of the reasons for failure is:
We also see quite a few OOMs on ~10 nodes. The processes that are OOM-killed look quite random: pause pods, systemd-logind, systemd-journal. It's quite unlikely that those are the processes that eat all the memory.
I checked serial console logs for one of the nodes where an OOM was reported, and most likely metrics-server is the component that consumes a lot of memory and triggers the OOM killer on the nodes. In the manual run that is currently in progress, we see that it fails and jumps from node to node, triggering OOMs on one node after another:
From a successful run:
There are multiple instances of metrics-server, but all of them within a short period of time (when the cluster starts). They are recreated by pod_nanny. It looks like in the failed runs the pod_nanny for metrics-server failed to resize the metrics-server deployment to a size appropriate for a 5k-node cluster.
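For context, pod_nanny (the addon-resizer sidecar) sizes metrics-server roughly linearly in a cluster-size signal: resources = base + per-unit increment × cluster size. Below is a minimal sketch of that scaling model; the constants are hypothetical, not the real addon-resizer defaults.

```go
package main

import "fmt"

// Illustrative nanny-style linear scaling: resources = base + perUnit*size.
// These constants are assumptions for the sketch, not addon-resizer's config.
const (
	baseCPUm     = 100 // CPU (millicores) granted regardless of cluster size
	perNodeCPUm  = 1   // extra CPU per node / "unit" of cluster size
	baseMemMB    = 100 // memory (MiB) granted regardless of cluster size
	perNodeMemMB = 8   // extra memory per node / "unit" of cluster size
)

func desiredResources(clusterSize int) (cpuMilli, memMB int) {
	return baseCPUm + perNodeCPUm*clusterSize, baseMemMB + perNodeMemMB*clusterSize
}

func main() {
	// On a 5000-node cluster the computed request is far larger than the base,
	// which is why a failed resize leaves metrics-server badly under-provisioned.
	cpu, mem := desiredResources(5000)
	fmt.Printf("5000 nodes -> %dm CPU, %dMi memory\n", cpu, mem)
}
```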
From pod_nanny logs (in the manual run):
Indeed, 'etcd_object_counts' disappeared:
➜ ~ kubectl get --raw /metrics | grep etcd_object_counts
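To illustrate why a rename breaks consumers of this metric, here is a minimal sketch that parses Prometheus text-format output (the same format returned by `kubectl get --raw /metrics`) and checks whether a metric family is present. The canned exposition string, and the assumption that the new name is apiserver_storage_objects, are illustrative; this is not pod_nanny's actual code.

```go
package main

import (
	"fmt"
	"strings"

	"github.com/prometheus/common/expfmt"
)

// lookupMetric parses Prometheus text-format output and reports whether the
// named metric family exists in it.
func lookupMetric(exposition, name string) (bool, error) {
	var parser expfmt.TextParser
	families, err := parser.TextToMetricFamilies(strings.NewReader(exposition))
	if err != nil {
		return false, err
	}
	_, ok := families[name]
	return ok, nil
}

func main() {
	// Hypothetical /metrics excerpt: only the renamed metric is exported,
	// so a lookup by the old name finds nothing.
	metricsOutput := `# TYPE apiserver_storage_objects gauge
apiserver_storage_objects{resource="pods"} 150000
`
	for _, name := range []string{"etcd_object_counts", "apiserver_storage_objects"} {
		found, _ := lookupMetric(metricsOutput, name)
		fmt.Printf("%s present: %v\n", name, found)
	}
}
```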
It looks like etcd_object_counts has been renamed (#99785), and this makes pod_nanny unable to resize metrics-server. For that reason metrics-server has no resource requests set (they should be set by pod_nanny based on the value of the metric), which allows it to schedule on small nodes (other than the "heapster" one) and OOM continuously, breaking other components running on those nodes (e.g. daemonset pods). The questions here:
/cc @erain @logicalhan Is it intentional that etcd_object_counts disappears in the 1.21 version? Shouldn't we move this to at least 1.22 to give a grace period during which both metrics are available?
@erain I can't remember, did we delete the original? We should probably retain it for a release or two, for the reason above.
cc @ehashman |
Are you sure they are gone? |
We are double-writing at the moment.
Ah actually I think the deprecated version was incorrect, so the metric was hidden a release too soon. |
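For background on "hidden a release too soon": metrics registered through k8s.io/component-base/metrics carry a DeprecatedVersion; at that version the metric is still exported but marked deprecated, and one minor release later it becomes hidden (dropped from /metrics unless --show-hidden-metrics-for-version is set). A minimal sketch of such a registration, with illustrative field values rather than the actual in-tree definition of etcd_object_counts:

```go
package main

import (
	"k8s.io/component-base/metrics"
	"k8s.io/component-base/metrics/legacyregistry"
)

// Illustrative definition only; the real metric lives in the apiserver storage
// code, and its DeprecatedVersion is the value under discussion in this issue.
var objectCounts = metrics.NewGaugeVec(
	&metrics.GaugeOpts{
		Name: "etcd_object_counts",
		Help: "Number of stored objects at the time of last check, split by kind.",
		// At DeprecatedVersion the metric is still exported but marked deprecated;
		// one minor release later it becomes hidden and no longer appears in
		// /metrics unless --show-hidden-metrics-for-version is set. Setting this
		// one release too early hides the metric a release too soon.
		DeprecatedVersion: "1.21.0",
		StabilityLevel:    metrics.ALPHA,
	},
	[]string{"resource"},
)

func main() {
	legacyregistry.MustRegister(objectCounts)
	objectCounts.WithLabelValues("pods").Set(150000)
}
```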
/reopen Thanks for the fix!
@mborsz: Reopened this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
In the kube-scheduler logs I see that metrics-server was correctly upsized by addon-manager (the resize happens at the very beginning of the test, and the hash in the name changes, which means that the deployment's spec has changed).
The issue seems to be resolved. I triggered another manual run in the internal project to make sure that it's not a flake. We expect results in ~4h.
The manual test has passed. I believe this is resolved. |
Which jobs are failing:
https://k8s-testgrid.appspot.com/sig-scalability-gce#gce-master-scale-performance
Which test(s) are failing:
Since when has it been failing:
03-27
Testgrid link:
Reason for failure:
Two different failure reasons
Anything else we need to know:
https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1375855502717620224
https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1376217890679689216
/assign
/milestone 1.20
/priority critical-urgent
/cc @wojtek-t