Kubelet OOM killing in 'g1-small' node during huge-cluster perf test #47865
/assign @dchen1107
If you can confirm that the reason is due to not having large-enough nodes, we can rerun with larger ones.
This basically means that between 1.6 and 1.7, resource usage on Nodes grew enough to cause widespread OOMs on 1.7GB machines when they're running ~30 pause Pods.
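For anyone reproducing this, a minimal sketch for eyeballing memory pressure on an affected node (assuming kubectl access; the node name used here is the one quoted later in this thread):

```sh
# Check memory conditions, allocatable memory, and allocated resources on a suspect node.
kubectl describe node e2e-enormous-cluster-minion-group-nxl2 | grep -iE 'MemoryPressure|Allocatable|memory'
```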
Seems like there are nodes crashing from time to time (even a bit more in the load test, I guess):
Most of them are OOMs. Let's try with bigger machines tomorrow and see if the problem still persists.
Yep. There are many add-on pods, so it's hard to guess which one uses more resources in 1.7 without a side-by-side comparison. There was a new daemonset (ip-masq-agent) added too, so the increase in resource usage may be expected. kube-proxy was using quite a lot of memory, but I assumed that's by design since the test created ~13k services.
The only thing that caught my attention is that ip-masq-agent got OOM-killed because it exceeded its own memory limit. I think the limit might be too small for the load? /cc @dnardo
The limits for ip-masq-agent are pretty small, so I doubt it's taking too many resources. If it was OOM-killed then yeah, maybe the limit was too small; that said, I doubled it from a 24-hour max, so I'm a bit surprised.
@dnardo, what is the current limit? I have asked this in another issue/PR, but no one answered my question.
@matchstick Are we sure we want to enable this by default for the 1.7 release? I raised my concern about this before at #46651 (comment).
@davidopp This is the concern I was talking to you about yesterday regarding the 1.7 release: newly added daemonsets on every node. This one can totally make the node useless. We need to make sure the node is large enough to include all those default daemons & daemonsets. Your spreadsheet can help answer this question.
@kubernetes/kubernetes-release-managers We should include this information in our release notes.
The memory limit is only 8MB, which is pretty small. I didn't mean to say that the new daemon is the culprit. Any existing daemon on the node could have had a significant increase in resource usage, or all of them could have collectively pushed memory over the limit. It's hard to pinpoint the exact cause without a baseline (1.6) to compare against.
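For reference, a quick way to check what the addon actually requests and limits on a live cluster (a sketch; it assumes the DaemonSet is named ip-masq-agent in kube-system, which may differ per deployment):

```sh
# Print the resources stanza of the ip-masq-agent DaemonSet's first container.
kubectl -n kube-system get daemonset ip-masq-agent \
  -o jsonpath='{.spec.template.spec.containers[0].resources}'
```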
I'm less concerned about ip-masq-agent than I am about this:

Jun 21 17:07:46.960055 e2e-enormous-cluster-minion-group-nxl2 kernel: iptables-restor invoked oom-killer: gfp_mask=0x24000c0, order=0, oom_score_adj=996

Why is iptables-restore being killed? Wouldn't that be kube-proxy calling iptables-restore? ip-masq-agent doesn't call that. Lastly, even if ip-masq-agent was killed, it wouldn't have caused any issues. It would have run at least once, and that would have set up the ip-masq rules; they would never have needed to change after that.
Why is iptables-restore's oom_score_adj so high (996)? If it is a child process of kube-proxy, it should inherit the oom_score_adj from kube-proxy, which I set to a much lower value back in the 1.4 release as a temporary workaround before we have the full story for #22212. Is there a regression in this release? Did we change kube-proxy's oom_score_adj when making it a critical static pod? cc/ @vishh
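A minimal sketch for checking this on a node: compare the oom_score_adj of kube-proxy with that of any iptables-restore process running at the time (run on the node itself; process names assumed):

```sh
# List oom_score_adj for kube-proxy and iptables-restore processes on the node.
for pid in $(pgrep -f 'kube-proxy|iptables-restore'); do
  printf '%s\t%s\toom_score_adj=%s\n' "$pid" "$(cat /proc/$pid/comm)" "$(cat /proc/$pid/oom_score_adj)"
done
```

If the values don't match, the iptables-restore in the kernel log was likely not forked from kube-proxy; 996 looks like the value kubelet would assign to a burstable pod container with a small memory request, which would point at another pod's cgroup.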
Automatic merge from submit-queue (batch tested with PRs 42252, 42251, 42249, 47512, 47887)

Bump the memory request/limit for ip-masq-daemon.

**Which issue this PR fixes**: fixes #47865
We'll run a test using n1-standard-1s to see if they have enough memory.
This also happens on n1-standard-1 Nodes, which seems bad. Ref. #47899
@dnardo iptables-restore was in the ip-masq-agent's cgroup. That's what caused ip-masq-agent to be OOM-killed. From the numbers above, the new limit would not be enough.
I think what might be happening is that when ip-masq-agent writes out its rules, it may be reading all the iptables rules that are currently configured. That may explain the usage here. Let me take a look and see.
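A rough way to sanity-check that theory on a node (a sketch; run on a node where kube-proxy has programmed the service rules):

```sh
# How big is the iptables dump that iptables-save/iptables-restore has to process?
iptables-save | wc -l   # number of rule lines
iptables-save | wc -c   # bytes the agent has to buffer while handling them
```

With ~13k services, this output should scale roughly with the number of service/endpoint rules, which would explain an 8MB limit being too tight.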
Thanks @dnardo
@gmarek Do we have the apiserver logs available for some 5k-node run for 1.6 somewhere? I can't find them anywhere, and they'd be useful for my debugging work. Also, any way to verify if 1.6 scale tests ran with/without services?
We had several discussions offline related to this. Here is a summary of the decisions and action items: cc/ @kubernetes/kubernetes-release-managers
- The decision is to disable ip-masq-agent by default for the OSS k8s 1.7 release. @dnardo has a pending PR for this.
- @dnardo and the network team are also working on how to reduce the overhead. They have several proposals already.
@dchen1107 Thanks a lot for the detailed update!
Automatic merge from submit-queue

Remove limits from ip-masq-agent for now and disable ip-masq-agent in GCE.

ip-masq-agent, when issuing an iptables-save, will read any iptables rules configured on the node. This means that ip-masq-agent's memory requirements grow with the number of iptables rules (i.e. services) on the node.

**Which issue this PR fixes**: fixes #47865
FYI, I've uploaded the logs for the current run of gce-enormous-cluster to GCS (available here) and brought down the cluster. Re-kicked a new job with services and ip-masq-agent disabled this time. Let's see how much this helps.
@shyamjvs started the test at ~10 PM PDT (thanks a lot!). The load test should finish in ~12h, i.e. Friday 10 am PDT.
@dchen1107 - Load test passed. It's highly likely that Density test will pass as well, which means we're golden. We'll try running those tests with services enabled, but that's not a blocker for release.
Here are the resource usage stats for both 1.6.6 and 1.7.0.
Yup, both load and density tests passed, and with no high-latency requests at that. List-pods 99%ile latency fell all the way from 6s to ~1.5s. I'll verify this weekend whether just the ip-masq-agent created this mischief or services did too.
@shyamjvs and @gmarek Thanks for the test results. Please share the results with services enabled later. From looking at @yguo0905's data, there is not much change in the memory usage footprint of either Kubelet or docker (same 1.11.2 anyway) between 1.6.6 and 1.7.0-beta3. I am closing the issue. Thanks everyone!
Automatic merge from submit-queue (batch tested with PRs 47993, 47892, 47591, 47469, 47845)

Use a different env var to enable the ip-masq-agent addon. We shouldn't mix setting the non-masq-cidr with enabling the addon.

#47865
And.. the load test failed with services enabled. We are seeing high qps, like before (similar to #47899 (comment)). Disabling the ip-masq-agent did help by removing some OOMs and the pod-status/events update requests from kubelet arising due to them. But now fluentd seems to be doing something similar (it was also there before iirc, but ip-masq-agent dominated). Out of 9800 qps of 429s, 7k are from kubelet and the rest from npd. Half of those 7k requests are due to fluentd OOM-killing (which kubelets respond to by sending PUT pod-status and PATCH events). The other half are PATCH node-status calls (same for npd), but that's just a consequence iiuc. From the kernel logs on the nodes, fluentd and event-exporter seem to be OOM-killed frequently. From the fluentd logs, it seems it's not able to handle the log volume:
|
We can either try running fluentd with higher memory limits or try finding and reducing the source of this high log traffic. The only difference between this run and the last run (which passed) is enabling services, so kube-proxy should be the one doing the mischief. We have the logging verbosity level set to v1 (https://github.com/kubernetes/test-infra/blob/master/jobs/ci-kubernetes-e2e-gce-enormous-cluster.env#L23) and kube-proxy logs are still huge.
kube-proxy.log was 920B without services and ~6-7 GB (rotated logs included) with services. It's mainly because of printing out iptables rules, which is too much to log for large clusters with many services.
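For anyone checking this on their own nodes, a small sketch (it assumes the default GCE log location /var/log/kube-proxy.log; adjust the path if needed):

```sh
# Total size of kube-proxy logs, including rotated files.
du -ch /var/log/kube-proxy.log* | tail -1

# Lengths of the longest log lines, to spot multi-MB iptables dumps being logged.
awk '{ print length }' /var/log/kube-proxy.log | sort -rn | head -5
```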
Those are different problems. The quota issues are expected to go away in the coming week; the OOM issues are expected under high load (more than 200KB/sec).
Filed an issue.
If that's the case, we are sure to thrash fluentd even on moderately big clusters with a fair number of service endpoints, since just this one line in kube-proxy can create a log line of multiple MBs.
/remove-priority P0
FYI, we are now running the test with fluentd disabled but services still enabled to check if there's any problem with kube-proxy. |
While running scalability tests today (as part of #47344) on a 4000-node GCE cluster, this happened during density test termination. Currently, the load test is running.
It failed due to some density pod's condition not being updated; on digging a bit, it turned out a couple of kubelets (one of them on the node where the pod was running) crashed:
From the kernel logs:
The cluster is still running; to reach the node:
gcloud compute ssh e2e-enormous-cluster-minion-group-nxl2 --project kubernetes-scale --zone us-east1-a
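Once on the node, a minimal way to pull the OOM evidence out of the kernel log (a sketch; either command should work depending on the node image):

```sh
# List OOM-killer invocations and their victims from the kernel ring buffer / journal.
dmesg -T | grep -iE 'invoked oom-killer|Killed process'
journalctl -k | grep -i 'oom'
```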
cc @kubernetes/sig-node-bugs @kubernetes/sig-scalability-misc @dchen1107 @yujuhong @gmarek