503NR/BlackHole during istiod rollouts #28120
Comments
It's worth noting that our process removes the old certificate after the workload has rolled out:
But perhaps there's a bit of a race condition, because the old workload could still theoretically be terminating. My understanding of Kubernetes was that deleting a mounted certificate after the pod has started doesn't remove it - it would only stop a new pod from scheduling if the secret wasn't available to mount.
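To make that ordering concrete, here is a rough sketch of the kind of process being described (the deployment, namespace and secret names are placeholders, not the actual resources from this thread): wait for the rollout to finish before removing the old cert, otherwise terminating pods that still reference it can hit the race above.

```bash
# Hypothetical sketch of the rollout process described above.
# my-app, my-namespace and old-istio-cert are placeholders.

# Roll the workload so new pods pick up the new certificate/sidecar.
kubectl -n my-namespace rollout restart deployment/my-app

# Wait until the rollout has actually finished; if this step is skipped
# (or returns early), old pods may still be terminating when the cert
# is deleted, which is the suspected race.
kubectl -n my-namespace rollout status deployment/my-app --timeout=10m

# Only then remove the old certificate secret.
kubectl -n my-namespace delete secret old-istio-cert
```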
ahhhh!
Looking at the timestamps, it looks like a bug in the script caused this deployment to not actually wait, thus deleting certs for an in-use workload.
@howardjohn could you please reopen this? We just saw it again going between versions. Struggling to consistently reproduce at the moment, however.
@howardjohn I've narrowed this down: the two events above were caused by the removal of a flag. Whilst I'm not hugely concerned (as we won't be bringing that flag back), it'd be great if someone could confirm the problem we're seeing is only caused by that flag, and can't manifest as a result of other code paths. Potentially #26861?
I (not too surprisingly) was not able to reproduce this. Do you see this on any environment that we can play around in? Either direct access, or we can set up a call perhaps.
Bringing the Slack conversation here: I suspect there are possibly multiple issues; focusing on UF,URX first. Observed on 1.6 <-> 1.7 rollback/forward. Diff on config:
Logs:
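Aside: for anyone trying to spot the same UF/URX symptom on a suspect pod, one way is to look at the Istio standard metrics exposed by the sidecar's Prometheus endpoint. A sketch, with the pod name and namespace as placeholders:

```bash
# POD is a placeholder for an affected workload pod.
POD=my-app-7d4b9c-xyz

# In one terminal: forward the sidecar's Prometheus stats port locally.
kubectl -n my-namespace port-forward "$POD" 15090:15090

# In another terminal: filter istio_requests_total for requests recorded
# with the UF (upstream connection failure) or URX (retry limit exceeded)
# response flags.
curl -s localhost:15090/stats/prometheus \
  | grep istio_requests_total \
  | grep -E 'response_flags="(UF|URX)"'
```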
I built a custom image with the v2 TLS context and it gives zero downtime. So it seems extremely likely to be due to ^. Looking into what the "proper" fix is.
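For reference, one way to check which TLS context API version a given proxy has been pushed is to grep its cluster config dump for the transport-socket type names. A sketch; the pod name is a placeholder:

```bash
# POD is a placeholder for a workload pod to inspect.
POD=my-app-7d4b9c-xyz

# Count the TLS context types referenced by the sidecar's clusters.
# Types under envoy.api.v2.auth.* are the v2 API; types under
# envoy.extensions.transport_sockets.tls.v3.* are the v3 API.
istioctl proxy-config cluster "$POD" -o json \
  | grep -o 'envoy\.[A-Za-z0-9_.]*TlsContext' \
  | sort | uniq -c
```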
* Do not switch TLS version on 1.6 -> 1.7 upgrade

See #28120. See envoyproxy/envoy#13864.

This resolves a downtime event on in-place upgrade from 1.6 to 1.7 (a couple of seconds of 503s). This is intentionally sent only to 1.7, as it is only relevant for this branch. Please note this feature flag is shipped on by default. We have two choices:

* Off by default: anyone upgrading from 1.6 to 1.7 will continue to get downtime unless they read the release notes and add the flag.
* On by default: anyone with 1.7 already deployed, but that still has 1.6 proxies, will incur downtime unless they read the release notes and remove the flag.

I have chosen on by default, as the set of people with 1.6 proxies and a 1.7.x istiod upgrading to 1.7.5 seems far smaller than the set impacted by "off by default", and the mitigation is the same. Additionally, for those that are impacted, the impact will be exclusively the proxies on 1.6, which is presumably not 100% of proxies, whereas in the other case ALL proxies are on 1.6 and thus impacted.

* fix nil
* Fix initial fetch
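For operators who do need to flip the behaviour described above, istiod feature flags of this kind are typically set as environment variables on the istiod deployment. A sketch only - the variable name below is a placeholder, since the exact flag name comes from the 1.7.5 release notes rather than from this thread:

```bash
# FLAG_NAME_FROM_RELEASE_NOTES is a placeholder, not the real flag name.
kubectl -n istio-system set env deployment/istiod \
  FLAG_NAME_FROM_RELEASE_NOTES=false

# istiod rolls with the new value; wait for it to settle.
kubectl -n istio-system rollout status deployment/istiod
```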
I can confirm @howardjohn's fix for #28498 has fixed the majority of the rollout issues we experienced going from 1.6 to 1.7; however, we are still seeing a very small number of requests get "BlackHole". In this particular example, we have three instances of the source service:
The service they are talking to that failed:
I started looking into this further. The main thing I noticed was that, on the pods where we saw issues, I also see:
Unfortunately, as the failures are so few and I'm unable to consistently reproduce them, it's quite hard to get any more debug info.
For the logs about the failed CSR, that should be unknown to Envoy. From its perspective, it just sees that there is a 2s window where it requested a secret and didn't get one. I could see that making Envoy unhappy (it shouldn't, but it seems a plausible bug), but if it was the cause I would expect roughly a 2s outage.
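One way to sanity-check whether a proxy eventually received a valid workload certificate after such a window is to inspect its SDS state and stats. A sketch; the pod name is a placeholder:

```bash
# POD is a placeholder for an affected pod.
POD=my-app-7d4b9c-xyz

# List the SDS secrets the sidecar currently holds (status, validity).
istioctl proxy-config secret "$POD"

# The sidecar's Envoy stats also count SDS update successes/failures.
kubectl exec "$POD" -c istio-proxy -- pilot-agent request GET stats \
  | grep sds
```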
I've not been able to recreate the BlackHole/Zipkin one, and have had 1.7 on a cluster for 24 hours with 400 different apps and about 300 deployments, and haven't seen any other errors. I'm going to close this issue as sorted. For anyone coming to this, the fix will be released in 1.7.5.
Bug description
We are in the process of restarting all workloads one by one to move them to a 1.6 sidecar, following a control plane update. All applications are identical in terms of their Istio, deployment and pod config, just running a different `image`.

We noticed that when one app was patched, the 1.5 pods shutting down started getting `503NR` (no route configured) to all services they talk to. These metrics were recorded from `reporter=source`, so the source proxy (the 1.5 proxy on the terminating pods).

We have 3 `istiod` instances; here are the logs from each:

There are two services pointing to the same pods:
Unfortunately I don't have the proxy logs persisted.
I have been unable to reproduce this by going back and forth again, so it feels like a non-deterministic ordering or race condition of some sort.
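For context, the per-workload restart and the metric that surfaced the symptom look roughly like this. A sketch only: the deployment name and namespace are placeholders, and the commented Prometheus query just shows the shape of the istio_requests_total series described above.

```bash
# Restart workloads one at a time so they pick up the new sidecar,
# waiting for each rollout to settle before moving on.
# my-app / my-namespace are placeholders.
kubectl -n my-namespace rollout restart deployment/my-app
kubectl -n my-namespace rollout status deployment/my-app

# The symptom was visible as source-reported 503s with the NR flag,
# i.e. Prometheus series of the shape:
#   istio_requests_total{reporter="source", response_code="503", response_flags="NR"}
```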
[ ] Docs
[ ] Installation
[x] Networking
[ ] Performance and Scalability
[ ] Extensions and Telemetry
[ ] Security
[ ] Test and Release
[ ] User Experience
[ ] Developer Infrastructure
Expected behavior
No 503NRs during upgrade of the data plane.
Steps to reproduce the bug
Version (include the output of `istioctl version --remote`, `kubectl version --short`, and `helm version` if you used Helm)
1.6.12
How was Istio installed?
Helm
Environment where bug was observed (cloud vendor, OS, etc)