istio 1.2.4: envoy segfaults occasionally on istio-proxy pods #16357
Istio Proxy binary in 1.2.4 was built from istio/proxy@568f2b6, istio/envoy@204283e (branched from envoyproxy/envoy@829b905 on 2019-05-16). That backtrace decoded using binary with debug symbols corresponds to:
when exiting this lambda:
Note that this lambda was removed as part of xDS simplification, so it doesn't exist in envoyproxy/envoy@master anymore.
@gotwarlost to help narrow this down:
The other stacktrace has the form:
Do you think that there could be one bad/unexpected config that could be causing this?
Since this is a playground, could you try using …? Since we didn't see those crashes during our testing done by multiple people, I imagine that a specific cluster configuration is triggering this. If you change the log level to "trace", then you'll see the contents of the CDS update immediately before the crash (note that it might contain sensitive information, so please sanitize it before pasting).
@PiotrSikora does this still happen at Envoy master HEAD? You mention that there were a bunch of changes since this release cut.
@gotwarlost please file bugs for the other backtraces you see (the ones present in earlier releases but still occurring with 1.2.4) and link them to this one.
@htuch we don't have a way to replicate this, so I have no idea if that's still happening at …
@gotwarlost I'm worried that we don't know how to make progress here. The 1.1.8 to 1.2.4 delta is large. +1 to @PiotrSikora's suggestion to try to upgrade from 1.1.8 to 1.1.13 first.
Just came back from an offsite + time off. Will sync up with the team and get you additional data. Sorry for the delay.
Deployed
Now trying to get a detailed TRACE level log for one of the crashes.
@gotwarlost thank you, that very much reduces the search space. Could you try 1.1.12? If 1.1.12 does not crash, then I think we could squarely point the finger at the HTTP/2 vulnerability fixes or the switch to libc++.
Running
Still no crashes.
I have a trace file for one of the ingress gateway pods, but there is a lot of disclosure-type information in the file that I'm not comfortable sharing in a public forum. Taking all this out might make the trace useless.
@gotwarlost could you test with 1.1.14 as well?
The context there is that the proxyv2 image in 1.1.13 was built by hand, because we had to build it under private embargo and couldn't leverage our CI/CD pipeline since that builds everything in the clear. Piotr wants to rule out that hand build (done by me) as a source of error. 1.1.14 has the same fixes but was built in the clear using our standard pipeline.
@PiotrSikora corrected me. There's a memory alignment fix in 1.1.14 as well as an upgraded version of libc++, so it's well worth a try.
We'll do this tomorrow morning and report results.
Still happening with
The number of crashes seems to have gone down, but it is still an infrequent occurrence: 20 crashes in the last 20 hours just for the ingress gateway.
17 crashes for the ingress gateway in the last 24 hours. It is much reduced from before but still happening. Any ideas how to move forward/debug this? As it stands we are caught between a rock and a hard place: we really need the upgrade but cannot proceed safely.
The changes between 1.1.12 and 1.1.13 in Envoy (istio/envoy@8e54972...6910569) consist only of HTTP/2 fixes and the switch to libc++ (along with some Lua fixes for it). Since you're the only one who has reported any issues with the new releases, it leads me to believe that it's something unique to your setup (e.g. a non-default option that you enabled, etc.). Would you be willing to share the trace log with me privately? Also, just to make sure, are the resources (mostly memory) on the ingress gateway sufficiently over-provisioned? Envoy's memory usage tends to spike during xDS message parsing, so it's possible that you'd consistently run out of memory at the exact same place in the code.
We are currently trying an ASAN-instrumented build in @gotwarlost's env. Our best guess is that this was introduced by the switch from libstdc++ to libc++.
I'm facing the same issue.
A core file would be most appreciated. This issue seems to be rather subtle.
New backtrace
Potential fix: istio/envoy#103
The crash in this core seems to be caused by calling ClusterManagerInitHelper::removeCluster with a smart pointer to a Cluster which is nullptr, which in turn is caused by cluster_manager_impl.cc:1258. The dereferenced nullptr (Cluster&) is not used if state_ == State::AllClustersInitialized, but the compiler has the option to load the nullptr when it converts the smart pointer to a reference via *, and that could be standard-library dependent. This issue seems to arise with libc++ but not libstdc++. It is not surprising that the behavior could differ, because it depends on the optimizations the compiler performs, which in turn depend on the specifics of the smart pointer implementation and the environment in which it is used.
Fix is upstream: envoyproxy/envoy@1339ed2#diff-982df999d2cabd23afd6c25088035bf5 |
@jplevyak as I commented in the PR, let's do a full cherry-pick of envoyproxy/envoy#8106?
Updated istio/envoy#105 to be a cherry-pick of the upstream fix.
Fixed in 1.1, 1.2, 1.3, and upstream.
Funny, the frame numbers (#N) in these backtraces from the logs are interpreted as references to old PRs in istio; one of my first ones just got referenced :).
This is not done, at least for
@jplevyak please make sure the fix is in master too, else 1.4 will regress.
1.1.15 and 1.2.6 are out. 1.3.1 should be out this week or early next at the latest (@rlenglet).
This issue still occurred after upgrading to version:
client version: 1.2.6
egressgateway version: 94746ccd404a8e056483dd02e4e478097b950da6-dirty
galley version: 1.2.6
ingressgateway version: 1.2.6
pilot version: 1.2.6
policy version: 1.2.6
sidecar-injector version: 1.2.6
telemetry version: 1.2.6
Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.1", GitCommit:"b7394102d6ef778017f2ca4046abbaa23b88c290", GitTreeState:"clean", BuildDate:"2019-04-08T17:11:31Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.1", GitCommit:"b7394102d6ef778017f2ca4046abbaa23b88c290", GitTreeState:"clean", BuildDate:"2019-04-08T17:02:58Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"linux/amd64"}
@jacexh I walked through all the SHAs in the 1.2.6 release and verified that istio/envoy@52a3903 is included in it. When you upgraded to 1.2.6, did you complete this step? https://istio.io/docs/setup/upgrade/steps/#sidecar-upgrade
1.3.1 is out. If there are any more segfaults with a similar backtrace, please reopen.
Here is what we see in
/reopen
@jplevyak PTAL, the backtrace is different.
We should open this as a different issue: since the backtrace is different, it likely has a different root cause, and the historical information here is going to be deceptive.
I opened a new issue: #17699.
Bug description
After upgrading to 1.2.4 (from 1.1.8) we see occasional segfaults on many istio-proxy pods, all of which have a form similar to this:
Affected product area (please put an X in all that apply)
[ ] Configuration Infrastructure
[ ] Docs
[ ] Installation
[X] Networking
[X] Performance and Scalability
[ ] Policies and Telemetry
[ ] Security
[ ] Test and Release
[ ] User Experience
[ ] Developer Infrastructure
Expected behavior
Envoy does not crash.
Steps to reproduce the bug
Very intermittent so hard to reproduce. Creating issue to alert people of potential problem. We will continue to investigate.
Version (include the output of istioctl version --remote and kubectl version)
How was Istio installed?
Helm charts
Environment where bug was observed (cloud vendor, OS, etc)
AWS/ Kops