Connection failures following control plane upgrade to 1.6 #28026
Comments
cc @howardjohn @mandarjog @rshriram, any help would be appreciated. Happy to give you access to the test cluster (or screen share if it helps).
A couple of things I can think of are h2 upgrade differences or a TLS mismatch. @PiotrSikora: `hpe_invalid_method` indicates that it chose the HTTP/1 codec.
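For illustration of the h2 upgrade point: disabling the HTTP/2 upgrade for a destination is done via a `DestinationRule`; the sketch below uses a hypothetical name and host, not anything from this cluster:

```yaml
# Illustrative only: pin a destination to HTTP/1.1 by never upgrading to h2.
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: target-app                              # hypothetical name
spec:
  host: target-app.default.svc.cluster.local    # hypothetical host
  trafficPolicy:
    connectionPool:
      http:
        h2UpgradePolicy: DO_NOT_UPGRADE         # do not upgrade h1 connections to h2
```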
@mandarjog: no HTTP/2 in our deployment, we still have it disabled following the last incident. And remember, this issue is incredibly nuanced: it only affects a specific subset of applications, and restarting the app onto a 1.6 proxy resolves it, which is why I've provided config dumps from both 1.5 and 1.6. Another correlation: the broken workloads on a 1.5 proxy with a 1.6 control plane all have a particular annotation set (more on that below).
@mandarjog @howardjohn sorry to keep tagging you, but I'm really keen to get this 1.6 migration finished (and then start 1.7 testing). This is the last remaining blocking issue for us, as it causes a service outage until the pods are restarted. I can recreate it easily, so I will work with anyone (at any hour) to get to the bottom of it. We've intentionally left a test cluster in this broken state for investigation purposes.
So we continued to try and narrow this down. We did the following:
= Broken
= OK

The errors in the target app were slightly different on this one:

And here are the source app logs:
To update further, we tried again with two workloads:
- Updated the control plane to 1.6
- Restarted workload 1 = fixed

Therefore I'm 100% confident the annotation is the cause during the upgrade process (a sketch of the annotation in question follows below). The reason this is so impactful for us is that we have this annotation on almost every workload. We've diffed the source and destination proxy configs and can't identify any difference in the config being pushed, so it looks like the proxy gets itself into a very funky state until it's restarted.
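For illustration, and assuming the annotation in question is `traffic.sidecar.istio.io/includeInboundPorts` (consistent with the `-b "8080"` argument in the iptables comparison below), it sits on the workload's pod template like this sketch; all names here are hypothetical:

```yaml
# Hypothetical workload; only the annotation is the point of interest.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: workload-1
spec:
  selector:
    matchLabels:
      app: workload-1
  template:
    metadata:
      labels:
        app: workload-1
      annotations:
        # Limits which inbound ports the sidecar captures; surfaces as -b on istio-iptables.
        traffic.sidecar.istio.io/includeInboundPorts: "8080"
    spec:
      containers:
      - name: app                     # hypothetical container
        image: example/app:latest     # hypothetical image
```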
Good pod:

```yaml
- command:
  - istio-iptables
  - -p
  - "15001"
  - -z
  - "15006"
  - -u
  - "1337"
  - -m
  - REDIRECT
  - -i
  - ...
  - -x
  - ...
  - -b
  - -d
  - "15020"
  - -o
  - ...
```

Bad pod:
```yaml
- command:
  - istio-iptables
  - -p
  - "15001"
  - -z
  - "15006"
  - -u
  - "1337"
  - -m
  - REDIRECT
  - -i
  - ....
  - -x
  - foo,bar
  - -b
  - "8080"
  - -d
  - "15020"
  - -o
  - x,y,z,not-relevant
```

The interesting thing is that the arguments actually show the difference: the bad pod passes `-b "8080"`, while the good pod's `-b` carries no port list (the flag-to-annotation mapping is sketched below).
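For reference, a sketch of how these `istio-iptables` flags map back to the injection annotations under the standard sidecar template; the values below are the redacted ones from above, and the mapping is illustrative:

```yaml
# Annotation -> istio-iptables flag (values as redacted above):
metadata:
  annotations:
    traffic.sidecar.istio.io/includeOutboundIPRanges: "...."              # -i
    traffic.sidecar.istio.io/excludeOutboundIPRanges: "foo,bar"           # -x
    traffic.sidecar.istio.io/includeInboundPorts: "8080"                  # -b
    traffic.sidecar.istio.io/excludeInboundPorts: ""                      # -d (15020, the status port, is excluded by default)
    traffic.sidecar.istio.io/excludeOutboundPorts: "x,y,z,not-relevant"   # -o
```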
On 1.5:

On 1.6:
Missing #22686?
It works with the 1.5 pilot because ab88b62 is not in 1.5.
I've tested @howardjohn's fix in #28111 and can confirm it works as expected.
Thanks @Stono! For anyone else who runs into this, please use the latest 1.6 with `PILOT_ENABLE_LEGACY_INBOUND_LISTENERS`, or upgrade to 1.7+, where this is fixed natively. Thanks!
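For illustration, one way to set that flag on a 1.6 istiod; this is a sketch assuming the default `istio-system` install, applied as a strategic-merge patch or edited directly on the deployment:

```yaml
# Sketch: enable legacy inbound listeners on istiod (Istio 1.6.x).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: istiod
  namespace: istio-system
spec:
  template:
    spec:
      containers:
      - name: discovery      # istiod's container name in the default install
        env:
        - name: PILOT_ENABLE_LEGACY_INBOUND_LISTENERS
          value: "true"
```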
Bug description

Another set of apps that doesn't work. We upgrade the control plane to 1.6, but don't touch the apps (so they are still running the 1.5 sidecar), and we start getting the old `upstream connect error or disconnect/reset before headers. reset reason: connection failure` when making requests from within the mesh (from another app to this one).

Looking at the istio-proxy debug logs for the destination service we see:
Here are config dumps from `istio-proxy` on the destination service under the 1.5 and 1.6 control planes: config-dumps.tar.gz
The pod exposes three ports:
The cluster is configured with strict mTLS:
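For illustration, a mesh-wide strict policy on Istio 1.5/1.6 generally takes this shape; this is a sketch, not the cluster's actual resource:

```yaml
# Sketch of a mesh-wide strict mTLS policy (root namespace assumed to be istio-system).
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system   # policies in the root namespace apply mesh-wide
spec:
  mtls:
    mode: STRICT
```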
The only difference I could draw was that this app (and the other broken services) also has a non-mesh service (stfp). However, that service is excluded from Istio with a `Sidecar` resource (a sketch of the general shape follows):
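For illustration, a `Sidecar` that scopes what a workload's proxy handles generally looks like the sketch below; the name, namespace, selector, and hosts are all hypothetical, not the actual resource:

```yaml
# Hypothetical Sidecar scoping the workload's proxy configuration.
apiVersion: networking.istio.io/v1alpha3
kind: Sidecar
metadata:
  name: app-sidecar          # hypothetical
  namespace: app-namespace   # hypothetical
spec:
  workloadSelector:
    labels:
      app: my-app            # hypothetical
  egress:
  - hosts:
    - "./*"                  # limit egress config to the local namespace
    - "istio-system/*"
```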
We tried adding an explicit `PeerAuthentication` policy, even though it shouldn't actually make any difference; a sketch of the shape of such a policy follows. As expected, it didn't change anything.
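For illustration, a workload-scoped policy of that kind takes roughly this shape; names and labels are hypothetical:

```yaml
# Sketch of a workload-scoped strict mTLS policy; not the actual resource.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: my-app               # hypothetical
  namespace: app-namespace   # hypothetical
spec:
  selector:
    matchLabels:
      app: my-app            # hypothetical
  mtls:
    mode: STRICT
```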
Restarting the app to pull in a 1.6 sidecar fixed the issue, therefore it seems to only be an issue with a 1.5 sidecar & 1.6 control plane.
[ ] Docs
[ ] Installation
[x] Networking
[ ] Performance and Scalability
[ ] Extensions and Telemetry
[ ] Security
[x] Test and Release
[x] User Experience
[ ] Developer Infrastructure
Expected behavior
Upgrades to not break services
Steps to reproduce the bug
🤷 happy to live debug my cluster with you
Version (include the output of `istioctl version --remote` and `kubectl version --short` and `helm version` if you used Helm)

1.6.11
How was Istio installed?
helm
Environment where bug was observed (cloud vendor, OS, etc)
gke