ingress SDS not getting secret updates #23715
Another example:
config dump clearly shows httpbin-credential is present |
This doesn't happen 100% of the time. Wrote a simple script to verify:
|
Does the config dump show the old secret? It would take some time for the SDS flow to push the new one. Are there istio-proxy logs available? |
I am pretty sure what happens is:
But we don't handle the update. I will verify this in a bit, just a guess right now |
Btw it's not just changing the secret. I reproduced with a brand new cert, which is why I think ^ is the root cause. Will try to get a 100% reproducer |
I think there are two related problems.
Deleting the secret and creating an empty secret behave differently.
Scenario one:
Scenario two:
Empty secret means:
apiVersion: v1
data:
  ca.crt: ""
  tls.crt: ""
  tls.key: ""
kind: Secret
metadata:
  name: certificate
  namespace: istio-system
type: kubernetes.io/tls
Secret update broken in some cases. Reproducer:
|
I confirmed similar behavior is present in Istio 1.5, so it's not a (new) regression |
Bumping to P0 as we have a reproducer now |
Thanks @howardjohn
Secret update broken in some cases. Reproducer:
@williamaronli could you take a look and try to reproduce it following these steps?
===============
What happens here is that when an empty secret is applied, the SDS agent detects that the secret is empty and rejects it. The cached copy in the SDS agent is the last valid secret, which is the old cert. The next time the gateway is removed and reapplied, the gateway asks the SDS agent for the new secret, and the SDS agent pushes the cached secret to the gateway. This is expected.
=============
Scenario two:
What happens here is that when the secret is deleted, the SDS agent removes the cached secret as well. That is why the gateway does not get a cert after being removed and reapplied.
============== |
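The two behaviours described above can be illustrated with a toy model. This is only a sketch of the explanation in the comment, not Istio's SDS agent code: the class `ToySecretCache` and its method names are hypothetical, and the "empty" check (blank `tls.crt` or `tls.key`) is an assumption made for illustration.

```python
from typing import Dict, Optional


class ToySecretCache:
    """Toy model of the agent-side cache described above (not Istio code)."""

    def __init__(self) -> None:
        self._cache: Dict[str, Dict[str, str]] = {}

    @staticmethod
    def _is_empty(data: Dict[str, str]) -> bool:
        # Assumption: a secret counts as empty when cert or key is missing or blank.
        return not data.get("tls.crt") or not data.get("tls.key")

    def on_secret_applied(self, name: str, data: Dict[str, str]) -> bool:
        """Return True if the secret was accepted into the cache."""
        if self._is_empty(data):
            # Rejected: the last valid copy (if any) stays in the cache.
            return False
        self._cache[name] = dict(data)
        return True

    def on_secret_deleted(self, name: str) -> None:
        # Deleting the secret drops the cached copy as well.
        self._cache.pop(name, None)

    def fetch(self, name: str) -> Optional[Dict[str, str]]:
        """What a removed-and-reapplied gateway would be handed for this name."""
        return self._cache.get(name)


cache = ToySecretCache()
cache.on_secret_applied("certificate", {"tls.crt": "old-cert", "tls.key": "old-key"})

# Empty secret: rejected, so the old cert is still what gets served.
accepted = cache.on_secret_applied(
    "certificate", {"ca.crt": "", "tls.crt": "", "tls.key": ""}
)
after_empty = cache.fetch("certificate")

# Deleted secret: nothing is left in the cache to serve.
cache.on_secret_deleted("certificate")
after_delete = cache.fetch("certificate")
```

In this model the empty secret leaves the old cert in place, while deletion leaves the gateway with no cert at all, which matches the difference between the two scenarios.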
Before creating, I verified it works:
fengxiangli@williamaronli:/istio-1.6.0$ kubectl -n istio-system delete secret httpbin-credential
fengxiangli@williamaronli:/istio-1.6.0$ kubectl delete gateway mygateway
fengxiangli@williamaronli:/istio-1.6.0$ kubectl get pods -n istio-system
fengxiangli@williamaronli:/istio-1.6.0$ istioctl proxy-config secret istio-ingressgateway-74cb7595bd-tgxqn.istio-system |
Restart the ingress pod:
fengxiangli@williamaronli:/istio-1.6.0$ istioctl proxy-config secret istio-ingressgateway-5f9b77885c-99f44.istio-system
Now, after restarting the pod, the cert is active |
check the logs using kubectl logs -n istio-system "$(kubectl get pod -l istio=ingressgateway
|
This is a flaky issue: the time between creating the gateway and creating the secret determines whether the secret is successfully pushed to the ingress gateway proxy.
0.2 second (success):
If we switch the order of creating the gateway and creating the secret, the problem does not appear.
Error debug log: https://docs.google.com/document/d/1z4EnJ-T9caRHbABFw-wVfdRZNh0szs2lFpwD1aInrc0/edit
Some potential root cause:
I guess that if the gateway tries to fetch the secret while the secret is not yet ready or created, the gateway gets stuck there and does not retry fetching it.
Potential solution:
|
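The ordering dependence guessed at in the comment above can be sketched with a toy model. This is not Istio's implementation and not the fix that was eventually merged; `ToyAgent`, `simulate`, and the `notify_waiters` flag are hypothetical names used only to show why "gateway first, secret second" could leave the proxy without a cert if nobody re-notifies the pending request.

```python
from typing import Callable, Dict, List


class ToyAgent:
    """Toy stand-in for an agent that answers secret requests from a proxy."""

    def __init__(self, notify_waiters: bool) -> None:
        self.notify_waiters = notify_waiters
        self.secrets: Dict[str, str] = {}
        self.waiters: Dict[str, List[Callable[[str], None]]] = {}

    def request(self, name: str, push: Callable[[str], None]) -> None:
        if name in self.secrets:
            # Secret already exists: answer immediately.
            push(self.secrets[name])
        else:
            # Secret not there yet: remember who asked.
            self.waiters.setdefault(name, []).append(push)

    def secret_created(self, name: str, cert: str) -> None:
        self.secrets[name] = cert
        if self.notify_waiters:
            # Answer every request that arrived before the secret existed.
            for push in self.waiters.pop(name, []):
                push(cert)


def simulate(gateway_first: bool, notify_waiters: bool) -> List[str]:
    """Return the certs the proxy ends up receiving for one ordering."""
    agent = ToyAgent(notify_waiters)
    received: List[str] = []
    if gateway_first:
        agent.request("httpbin-credential", received.append)
        agent.secret_created("httpbin-credential", "cert-v1")
    else:
        agent.secret_created("httpbin-credential", "cert-v1")
        agent.request("httpbin-credential", received.append)
    return received
```

With `notify_waiters=False`, `simulate(gateway_first=True, ...)` returns an empty list while the secret-first ordering delivers the cert. That is the shape of the flakiness reported here, under the assumption that the guess about the missing retry is right.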
Another finding:
|
Findings so far.
The root cause:
Error log:
Code: https://github.com/istio/istio/blob/master/security/pkg/nodeagent/sds/sdsservice.go#L482
Some workaround methods:
|
This is not the "correct" order. One of the most common use cases is deploying certs with cert-manager, and this is pretty broken today. We cannot add this arbitrary restriction. Pushing some data to Envoy shouldn't need to be this complicated |
Folks, is this fixed? |
It is fixed by this PR: #24817 |
Working with Istio 1.10.3, I see this issue happening again. I restarted every possible artifact, and yet it keeps giving the same error. Details here:
Here is the secret (redacted).
This gateway was earlier associated with a different, old cert, and even after the change the old cert is still being used. This is completely breaking our calls to the services exposed via this gateway, because TLS errors occur when the host names do not match
|
Will add more info here later if I can reproduce it.
Running from 5f8807a
I deployed a secret with cert-manager letsencrypt-staging. I later moved it to prod, resulting in the secret updating. However, the gateway used the old cert. I then deleted the secret and recreated it -- same thing, old secret.
I verified the cert in config_dump does not match the secret.
Then I restarted the ingress pod and it finally picked up the proper cert
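A check like the one described above (cert in config_dump versus cert in the secret) can be scripted. The sketch below is an assumption about how one might do it, not the procedure used in this report: it assumes both values are available as base64-encoded PEM strings (the `tls.crt` value from the Kubernetes secret, and the certificate chain bytes copied from the proxy's config dump), and the function names are made up for illustration.

```python
import base64
import hashlib

BEGIN = "-----BEGIN CERTIFICATE-----"
END = "-----END CERTIFICATE-----"


def first_cert_der(pem_text: str) -> bytes:
    """Extract the DER bytes of the first certificate in a PEM bundle."""
    start = pem_text.index(BEGIN) + len(BEGIN)
    stop = pem_text.index(END, start)
    # Drop line breaks and spaces so formatting differences do not matter.
    body = "".join(pem_text[start:stop].split())
    return base64.b64decode(body)


def fingerprint(b64_pem: str) -> str:
    """SHA-256 hex digest of the first certificate in a base64-encoded PEM."""
    pem_text = base64.b64decode(b64_pem).decode("ascii")
    return hashlib.sha256(first_cert_der(pem_text)).hexdigest()


def same_certificate(secret_tls_crt: str, proxy_cert_chain: str) -> bool:
    """True when both inputs start with the same leaf certificate."""
    return fingerprint(secret_tls_crt) == fingerprint(proxy_cert_chain)
```

Comparing fingerprints of the decoded certificate rather than the raw strings avoids false mismatches caused by line endings or trailing whitespace. A `False` result would correspond to the stale-cert state described here.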