Preliminary ACK PR produces Envoy OOM + loads of ADS disconnects. #28192
Comments
@howardjohn logs for that Envoy OOM. Smells like a memory leak to me, but I don't know enough about Envoy to judge. The ADS disconnects are probably related to the flow control not rate limiting properly. Cheers, |
@sdake one possible thing to try is setting the PROXY_XDS_VIA_AGENT=false env var on the Istio proxy; it controls the new xDS proxy we have added. One thing I noticed is that it also has a hardcoded 5s timeout (for calls from Envoy to istiod), so it may be undoing the increased timeout you set |
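For anyone wanting to try the suggestion above, here is a minimal sketch of one way the variable might be set mesh-wide. It assumes an IstioOperator-based install and relies on `meshConfig.defaultConfig.proxyMetadata`, which passes entries to sidecars as environment variables. Whether this also reaches gateway deployments such as cluster-local-gateway on a given Istio version is an assumption; setting the env var directly on the gateway deployment may be needed instead.

```yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    defaultConfig:
      proxyMetadata:
        # Assumption: the agent reads this variable at startup, so proxies
        # must be restarted (re-injected) for the change to take effect.
        # The value is quoted because proxyMetadata is a string-to-string map.
        PROXY_XDS_VIA_AGENT: "false"
```

Existing pods keep their old environment until they are restarted, so a rollout restart of the affected workloads would be needed before comparing behavior. |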
Turning this off causes flow control to have no visible effect, but it also stops the ADS connection from crashing, and the OOMs stop. I attempted to increase the hardcoded 5s to 35s and got the same result. It appears the knative benchmark has found an Istio master bug. I wonder how much trouble it would be to run this in the long-running scale/perf tests? It's sort of like whack-a-mole with this flow control - hard to get to a stable baseline 😫 Cheers, |
It looks like I'm having the same problem in istio 1.8.1. |
This issue is about an off-by-default experimental feature, so it is probably not your issue. I would get a heap profile and open a new issue (https://github.com/istio/istio/wiki/Analyzing-Istio-Performance). |
Great. glad to see pprof's showing up. |
John,
This issue exhibited in the past without the flow control flag enabled. It
had the appearance of being fixed in 1.8 - although I didn't heavily verify.
Cheers,
-steve
|
Seeing the same symptoms on 1.8.3, disabling the xDS proxy via agent solved the issue. |
🚧 This issue or pull request has been closed due to not having had activity from an Istio team member since 2021-01-03. If you feel this issue or pull request deserves attention, please reopen the issue. Please see this wiki page for more information. Thank you for your contributions. Created by the issue and PR lifecycle manager. |
(NOTE: This is used to report product bugs:
To report a security vulnerability, please visit https://istio.io/about/security-vulnerabilities/
To ask questions about how to use Istio, please visit https://discuss.istio.io
)
Bug description
The preliminary PR https://github.com/istio/istio/pull/27563/files leads to an OOM. The other problems I am less concerned with at this time.
Affected product area (please put an X in all that apply)
[ ] Docs
[ ] Installation
[X] Networking
[ ] Performance and Scalability
[ ] Extensions and Telemetry
[ ] Security
[ ] Test and Release
[ ] User Experience
[ ] Developer Infrastructure
Affected features (please put an X in all that apply)
[ ] Multi Cluster
[ ] Virtual Machine
[ ] Multi Control Plane
Expected behavior
No OOM under heavy load.
Steps to reproduce the bug
Really hard to set up, but once the environment is in place it is 100% reproducible. Requires 15 nodes + a unique tool. Happy to share my environment for debugging via ssh public key.
Version (include the output of istioctl version --remote and kubectl version --short and helm version if you used Helm)
master on commit ceec0d6 with above PR applied. I am preparing to rebase, so YMMV :)
How was Istio installed?
with istio-operator-minimal of:
Environment where bug was observed (cloud vendor, OS, etc)
vSphere 7.0.1, 15-node cluster across 3 bare metal nodes + metallb + calico + containerd + K8s 1.19.2. Each VM has 16 GB RAM + 4 vCPU. K8s control plane has 32 GB RAM + 8 vCPU (iirc)
Additionally, please consider attaching a cluster state archive by attaching
the dump file to this issue.
cluster-local-gateway (a knative ingress) has reset.
As a result of SIGTERM OOMKill (137):
Previous logs:
Current Envoy logs:
Pilot logs attached.
log.tar.gz