Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ingress SDS missing event to push key/cert pair to proxy #18912

Closed
aaronjwood opened this issue Nov 12, 2019 · 23 comments
Closed

Ingress SDS missing event to push key/cert pair to proxy #18912

aaronjwood opened this issue Nov 12, 2019 · 23 comments
Assignees
Labels
area/config area/security lifecycle/automatically-closed Indicates a PR or issue that has been closed automatically.

Comments

@aaronjwood
Copy link
Contributor

aaronjwood commented Nov 12, 2019

Bug description
In rare cases we're seeing that SDS is missing an event entirely that's causing only the root cert to get pushed but not the key/cert pair. Looking through the logs we can see that the root cert gets pushed out once the secret is eventually found:

2019-11-12T22:05:53.335991Z	warn	secretFetcherLog	Cannot find secret 975cfa2b-9e3f-4c4b-8eb6-2c810aad271b-cacert, searching for fallback secret gateway-fallback
2019-11-12T22:05:53.336191Z	error	secretFetcherLog	cannot find secret 975cfa2b-9e3f-4c4b-8eb6-2c810aad271b-cacert and cannot find fallback secret gateway-fallback
2019-11-12T22:05:53.336244Z	warn	cacheLog	CONNECTION ID: router~192.168.54.237~istio-ingressgateway-66b78f978b-njr66.istio-system~istio-system.svc.cluster.local-112, RESOURCE NAME: 975cfa2b-9e3f-4c4b-8eb6-2c810aad271b-cacert, EVENT: SecretFetcher cannot find secret 975cfa2b-9e3f-4c4b-8eb6-2c810aad271b-cacert from cache
2019-11-12T22:05:53.336308Z	warn	sdsServiceLog	CONNECTION ID: router~192.168.54.237~istio-ingressgateway-66b78f978b-njr66.istio-system~istio-system.svc.cluster.local-112, RESOURCE NAME: 975cfa2b-9e3f-4c4b-8eb6-2c810aad271b-cacert, EVENT: waiting for ingress gateway secret for proxy "router~192.168.54.237~istio-ingressgateway-66b78f978b-njr66.istio-system~istio-system.svc.cluster.local"

2019-11-12T22:05:53.336165Z	warn	secretFetcherLog	Cannot find secret 975cfa2b-9e3f-4c4b-8eb6-2c810aad271b, searching for fallback secret gateway-fallback
2019-11-12T22:05:53.336362Z	error	secretFetcherLog	cannot find secret 975cfa2b-9e3f-4c4b-8eb6-2c810aad271b and cannot find fallback secret gateway-fallback
2019-11-12T22:05:53.336382Z	warn	cacheLog	CONNECTION ID: router~192.168.54.237~istio-ingressgateway-66b78f978b-njr66.istio-system~istio-system.svc.cluster.local-113, RESOURCE NAME: 975cfa2b-9e3f-4c4b-8eb6-2c810aad271b, EVENT: SecretFetcher cannot find secret 975cfa2b-9e3f-4c4b-8eb6-2c810aad271b from cache
2019-11-12T22:05:53.336391Z	warn	sdsServiceLog	CONNECTION ID: router~192.168.54.237~istio-ingressgateway-66b78f978b-njr66.istio-system~istio-system.svc.cluster.local-113, RESOURCE NAME: 975cfa2b-9e3f-4c4b-8eb6-2c810aad271b, EVENT: waiting for ingress gateway secret for proxy "router~192.168.54.237~istio-ingressgateway-66b78f978b-njr66.istio-system~istio-system.svc.cluster.local"

2019-11-12T22:05:59.991770Z	error	sdsServiceLog	CONNECTION ID: router~192.168.54.237~istio-ingressgateway-66b78f978b-njr66.istio-system~istio-system.svc.cluster.local-111, RESOURCE NAME: 975cfa2b-9e3f-4c4b-8eb6-2c810aad271b, EVENT: NotifyProxy failed. No connection with id "router~192.168.54.237~istio-ingressgateway-66b78f978b-njr66.istio-system~istio-system.svc.cluster.local-111" can be found
2019-11-12T22:05:59.991797Z	error	cacheLog	CONNECTION ID: router~192.168.54.237~istio-ingressgateway-66b78f978b-njr66.istio-system~istio-system.svc.cluster.local-111, RESOURCE NAME: 975cfa2b-9e3f-4c4b-8eb6-2c810aad271b, EVENT: failed to notify secret change for proxy: no connection with id "router~192.168.54.237~istio-ingressgateway-66b78f978b-njr66.istio-system~istio-system.svc.cluster.local-111" can be found
2019-11-12T22:05:59.991894Z	info	sdsServiceLog	CONNECTION ID: router~192.168.54.237~istio-ingressgateway-66b78f978b-njr66.istio-system~istio-system.svc.cluster.local-112, RESOURCE NAME: 975cfa2b-9e3f-4c4b-8eb6-2c810aad271b-cacert, EVENT: pushed root cert to proxy

You'll see that no key/cert pair gets pushed. This causes TLS/mTLS gateways to break completely.

The way to recover from this is to delete the secret and add it back again. This results in the following logs:

2019-11-12T23:09:37.570981Z	info	secretFetcherLog	secret 975cfa2b-9e3f-4c4b-8eb6-2c810aad271b is deleted
2019-11-12T23:09:37.571063Z	error	sdsServiceLog	CONNECTION ID: router~192.168.54.237~istio-ingressgateway-66b78f978b-njr66.istio-system~istio-system.svc.cluster.local-111, RESOURCE NAME: 975cfa2b-9e3f-4c4b-8eb6-2c810aad271b, EVENT: NotifyProxy failed. No connection with id "router~192.168.54.237~istio-ingressgateway-66b78f978b-njr66.istio-system~istio-system.svc.cluster.local-111" can be found
2019-11-12T23:09:37.571075Z	error	cacheLog	CONNECTION ID: router~192.168.54.237~istio-ingressgateway-66b78f978b-njr66.istio-system~istio-system.svc.cluster.local-111, RESOURCE NAME: 975cfa2b-9e3f-4c4b-8eb6-2c810aad271b, EVENT: failed to notify secret change for proxy: no connection with id "router~192.168.54.237~istio-ingressgateway-66b78f978b-njr66.istio-system~istio-system.svc.cluster.local-111" can be found
2019-11-12T23:09:37.571085Z	info	secretFetcherLog	secret 975cfa2b-9e3f-4c4b-8eb6-2c810aad271b-cacert is deleted
2019-11-12T23:09:37.571104Z	error	sdsServiceLog	CONNECTION ID: router~192.168.54.237~istio-ingressgateway-66b78f978b-njr66.istio-system~istio-system.svc.cluster.local-110, RESOURCE NAME: 975cfa2b-9e3f-4c4b-8eb6-2c810aad271b-cacert, EVENT: NotifyProxy failed. No connection with id "router~192.168.54.237~istio-ingressgateway-66b78f978b-njr66.istio-system~istio-system.svc.cluster.local-110" can be found
2019-11-12T23:09:37.571115Z	error	cacheLog	CONNECTION ID: router~192.168.54.237~istio-ingressgateway-66b78f978b-njr66.istio-system~istio-system.svc.cluster.local-110, RESOURCE NAME: 975cfa2b-9e3f-4c4b-8eb6-2c810aad271b-cacert, EVENT: failed to notify secret change for proxy: no connection with id "router~192.168.54.237~istio-ingressgateway-66b78f978b-njr66.istio-system~istio-system.svc.cluster.local-110" can be found
2019-11-12T23:10:14.200253Z	info	sdsServiceLog	CONNECTION ID: router~192.168.54.237~istio-ingressgateway-66b78f978b-njr66.istio-system~istio-system.svc.cluster.local-112, RESOURCE NAME: 975cfa2b-9e3f-4c4b-8eb6-2c810aad271b-cacert, EVENT: pushed root cert to proxy

2019-11-12T23:10:14.200251Z	info	sdsServiceLog	CONNECTION ID: router~192.168.54.237~istio-ingressgateway-66b78f978b-njr66.istio-system~istio-system.svc.cluster.local-113, RESOURCE NAME: 975cfa2b-9e3f-4c4b-8eb6-2c810aad271b, EVENT: pushed key/cert pair to proxy

Expected behavior
SDS always pushing the required crypto to the proxy.

Steps to reproduce the bug
Unknown, I've found this very hard to reproduce. It has only affected one user of ours AFAIK.

Version (include the output of istioctl version --remote and kubectl version)

client version: unknown
control plane version: 1.3.0
Client Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.2", GitCommit:"c97fe5036ef3df2967d086711e6c0c405941e14b", GitTreeState:"clean", BuildDate:"2019-10-15T23:41:55Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"13+", GitVersion:"v1.13.11-eks-5876d6", GitCommit:"5876d6b7429820450950ade17fe7b4bf5ccada7f", GitTreeState:"clean", BuildDate:"2019-09-24T20:54:25Z", GoVersion:"go1.11.13", Compiler:"gc", Platform:"linux/amd64"}

How was Istio installed?
Imagine having a Makefile that does this where $ISTIO_VERSION is 1.3.0:

@curl -s -L https://git.io/getLatestIstio | ISTIO_VERSION=$(ISTIO_VERSION) sh -
@cat istio/istio-namespace.yaml > istio/istio.yaml
@cat istio/istio-namespace.yaml > istio/istio-crds.yaml
@helm template istio-$(ISTIO_VERSION)/install/kubernetes/helm/istio-init --name istio-init --namespace istio-system >> istio/istio-crds.yaml
@helm template istio-$(ISTIO_VERSION)/install/kubernetes/helm/istio --name istio --namespace istio-system \
    --set sidecarInjectorWebhook.rewriteAppHTTPProbe=true \
    --set servicegraph.enabled=true \
    --set gateways.istio-ingressgateway.sds.enabled=true \
    --set tracing.enabled=true \
    --set kiali.enabled=true \
    --set "kiali.dashboard.jaegerURL=http://jaeger-query:16686" \
    --set "kiali.dashboard.grafanaURL=http://grafana:3000" \
    --set grafana.enabled=true >> istio/istio.yaml
@cat istio-$(ISTIO_VERSION)/install/kubernetes/helm/istio-init/files/crd-1* > istio/istio-delete-crds.yaml
@rm -rf istio-$(ISTIO_VERSION)

A kubectl apply is later done on the istio-crds.yaml and istio.yaml files.

Environment where bug was observed (cloud vendor, OS, etc)
EKS

@howardjohn
Copy link
Member

I vaguely recall this being a known bug that has been fixed in one of the 1.3 patch releases. I highly recommend you upgrade to 1.3.5 which has a number of high severity fixed.

@JimmyCYJ can verify if this was fixed

@aaronjwood
Copy link
Contributor Author

Thanks! I couldn't find any reference to this in the release notes. @JimmyCYJ can you advise?

@JimmyCYJ
Copy link
Member

Thanks @howardjohn! Yes, there is a bug in Citadel agent in 1.3.0 release. If your Cidatel agent prints logs such as "warn secretFetcherLog unexpected server key/cert change in secret kyma-gateway-certs", then it indicates that Citadel agent hits this bug.

The release notes is here https://istio.io/news/2019/announcing-1.3.4/#minor-enhancements.
Tracking issue is here #17409.

Please upgrade to 1.3.5.

@aaronjwood
Copy link
Contributor Author

aaronjwood commented Nov 13, 2019

Thanks @JimmyCYJ. I'm not seeing logs like that anywhere in our setup. We are seeing these logs from our SDS container though:

2019-11-13T22:46:36.022211Z	error	cacheLog	CONNECTION ID: router~192.168.54.237~istio-ingressgateway-66b78f978b-njr66.istio-system~istio-system.svc.cluster.local-106, RESOURCE NAME: 6dd40d9f-0f30-481f-9c12-086715e685d9, EVENT: failed to notify secret change for proxy: no connection with id "router~192.168.54.237~istio-ingressgateway-66b78f978b-njr66.istio-system~istio-system.svc.cluster.local-106" can be found

Along with these:

2019-11-13T22:46:36.022191Z	error	sdsServiceLog	CONNECTION ID: router~192.168.54.237~istio-ingressgateway-66b78f978b-njr66.istio-system~istio-system.svc.cluster.local-106, RESOURCE NAME: 6dd40d9f-0f30-481f-9c12-086715e685d9, EVENT: NotifyProxy failed. No connection with id "router~192.168.54.237~istio-ingressgateway-66b78f978b-njr66.istio-system~istio-system.svc.cluster.local-106" can be found

Does it look like a different issue to you?

@aaronjwood
Copy link
Contributor Author

I'm testing this against a cluster that's been upgraded to 1.3.5, but I find the original issue is very hard to reproduce. Is there something specific that I can exercise to hit the case above where SDS loses track of the connection?

@incfly incfly changed the title SDS missing event to push key/cert pair to proxy Ingress SDS missing event to push key/cert pair to proxy Nov 18, 2019
@incfly
Copy link

incfly commented Nov 18, 2019

I'm closing this because @JimmyCYJ provide a suggestion to upgrade to 1.3.5

Please feel free to re-open if you can reproduce after upgrading to 1.3.5

@aaronjwood
Copy link
Contributor Author

aaronjwood commented Nov 19, 2019

I still see this issue with 1.3.5. Can we reopen this? @incfly I don't have perms to reopen the issue. Here's how I'm able to repro:

Create a gateway that uses a secret like so:

apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: apa0401f02e6d286f2f0678eff4046e11571e58
  namespace: istio-system
spec:
  selector:
    istio: ingressgateway
  servers:
  - hosts:
    - '*.apa0401f02e6d286f2f0678eff4046e11571e58.myurl.com'
    port:
      name: https-mutual
      number: 443
      protocol: HTTPS
    tls:
      credentialName: apa0401f02e6d286f2f0678eff4046e11571e58
      mode: MUTUAL

Create the secret that the gateway uses:

apiVersion: v1
data:
  cacert: cacertdatahere
  cert: certdatahere
  key: secretdatahere
kind: Secret
metadata:
  name: apa0401f02e6d286f2f0678eff4046e11571e58
  namespace: istio-system
type: Generic

Now delete the secret and gateway and create them again using the exact same values. This causes SDS to fail to push things over to Envoy. In 1.3.5 I see that even the root cert doesn't make it to Envoy so the TLS endpoint is completely broken:

2019-11-19T05:45:20.765756Z	warn	secretFetcherLog	Cannot find secret apa0401f02e6d286f2f0678eff4046e11571e58-cacert, searching for fallback secret gateway-fallback
2019-11-19T05:45:20.765784Z	error	secretFetcherLog	cannot find secret apa0401f02e6d286f2f0678eff4046e11571e58-cacert and cannot find fallback secret gateway-fallback
2019-11-19T05:45:20.765791Z	warn	cacheLog	CONNECTION ID: router~100.96.4.216~istio-ingressgateway-6985dc855-qgc4z.istio-system~istio-system.svc.cluster.local-18, RESOURCE NAME: apa0401f02e6d286f2f0678eff4046e11571e58-cacert, EVENT: SecretFetcher cannot find secret apa0401f02e6d286f2f0678eff4046e11571e58-cacert from cache
2019-11-19T05:45:20.765800Z	warn	sdsServiceLog	CONNECTION ID: router~100.96.4.216~istio-ingressgateway-6985dc855-qgc4z.istio-system~istio-system.svc.cluster.local-18, RESOURCE NAME: apa0401f02e6d286f2f0678eff4046e11571e58-cacert, EVENT: waiting for ingress gateway secret for proxy "router~100.96.4.216~istio-ingressgateway-6985dc855-qgc4z.istio-system~istio-system.svc.cluster.local"

2019-11-19T05:45:20.769565Z	warn	secretFetcherLog	Cannot find secret apa0401f02e6d286f2f0678eff4046e11571e58, searching for fallback secret gateway-fallback
2019-11-19T05:45:20.769693Z	error	secretFetcherLog	cannot find secret apa0401f02e6d286f2f0678eff4046e11571e58 and cannot find fallback secret gateway-fallback
2019-11-19T05:45:20.769723Z	warn	cacheLog	CONNECTION ID: router~100.96.4.216~istio-ingressgateway-6985dc855-qgc4z.istio-system~istio-system.svc.cluster.local-17, RESOURCE NAME: apa0401f02e6d286f2f0678eff4046e11571e58, EVENT: SecretFetcher cannot find secret apa0401f02e6d286f2f0678eff4046e11571e58 from cache
2019-11-19T05:45:20.769739Z	warn	sdsServiceLog	CONNECTION ID: router~100.96.4.216~istio-ingressgateway-6985dc855-qgc4z.istio-system~istio-system.svc.cluster.local-17, RESOURCE NAME: apa0401f02e6d286f2f0678eff4046e11571e58, EVENT: waiting for ingress gateway secret for proxy "router~100.96.4.216~istio-ingressgateway-6985dc855-qgc4z.istio-system~istio-system.svc.cluster.local"

2019-11-19T05:45:41.187776Z	error	sdsServiceLog	CONNECTION ID: router~100.96.4.216~istio-ingressgateway-6985dc855-qgc4z.istio-system~istio-system.svc.cluster.local-16, RESOURCE NAME: apa0401f02e6d286f2f0678eff4046e11571e58, EVENT: NotifyProxy failed. No connection with id "router~100.96.4.216~istio-ingressgateway-6985dc855-qgc4z.istio-system~istio-system.svc.cluster.local-16" can be found
2019-11-19T05:45:41.187808Z	error	cacheLog	CONNECTION ID: router~100.96.4.216~istio-ingressgateway-6985dc855-qgc4z.istio-system~istio-system.svc.cluster.local-16, RESOURCE NAME: apa0401f02e6d286f2f0678eff4046e11571e58, EVENT: failed to notify secret change for proxy: no connection with id "router~100.96.4.216~istio-ingressgateway-6985dc855-qgc4z.istio-system~istio-system.svc.cluster.local-16" can be found
2019-11-19T05:45:41.187980Z	error	sdsServiceLog	CONNECTION ID: router~100.96.4.216~istio-ingressgateway-6985dc855-qgc4z.istio-system~istio-system.svc.cluster.local-15, RESOURCE NAME: apa0401f02e6d286f2f0678eff4046e11571e58-cacert, EVENT: NotifyProxy failed. No connection with id "router~100.96.4.216~istio-ingressgateway-6985dc855-qgc4z.istio-system~istio-system.svc.cluster.local-15" can be found
2019-11-19T05:45:41.187996Z	error	cacheLog	CONNECTION ID: router~100.96.4.216~istio-ingressgateway-6985dc855-qgc4z.istio-system~istio-system.svc.cluster.local-15, RESOURCE NAME: apa0401f02e6d286f2f0678eff4046e11571e58-cacert, EVENT: failed to notify secret change for proxy: no connection with id "router~100.96.4.216~istio-ingressgateway-6985dc855-qgc4z.istio-system~istio-system.svc.cluster.local-15" can be found

@howardjohn
Copy link
Member

@aaronjwood

@jamesmoriarty
Copy link

I've also seen this issue with 1.3.5.

@psychoveter
Copy link

I see the same for 1.4.0 (kube 1.16.2):

istioctl version
client version: 1.4.0
control plane version: 1.4.0

kubectl version
Client Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.2", GitCommit:"c97fe5036ef3df2967d086711e6c0c405941e14b", GitTreeState:"clean", BuildDate:"2019-10-15T19:18:23Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.2", GitCommit:"c97fe5036ef3df2967d086711e6c0c405941e14b", GitTreeState:"clean", BuildDate:"2019-10-15T19:09:08Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"linux/amd64"}

While SIMPLE tls for ingress works fine, MUTUAL tls realized according to task https://istio.io/docs/tasks/traffic-management/ingress/secure-ingress-sds/ behaves like in the issue

@JimmyCYJ
Copy link
Member

JimmyCYJ commented Dec 3, 2019

@aaronjwood could you turn on debug logging at ingress-sds? You can do it via /ctrlz endpoint. Please share a copy of the log, or send me a copy by slack channel. From the piece of log you shared, ingress-sds does not detect the secret apa0401f02e6d286f2f0678eff4046e11571e58. I need to figure out why it fails to detect the secret.

@aaronjwood
Copy link
Contributor Author

Okay I ran through this scenario again with cacheLog, sdsServiceLog, and secretFetcher debug logs turned on:

2019-12-12T00:26:25.294660Z	debug	sdsServiceLog	start staled connection cleanup job
2019-12-12T00:26:25.294729Z	debug	sdsServiceLog	remove staled clients {ConnectionID:router~100.102.204.50~istio-ingressgateway-66b78f978b-dmnp8.istio-system~istio-system.svc.cluster.local-6 ResourceName:awood1018db10718dc0e1994886056a23fd7b9b1-cacert}
2019-12-12T00:26:25.294748Z	debug	sdsServiceLog	remove staled clients {ConnectionID:router~100.102.204.50~istio-ingressgateway-66b78f978b-dmnp8.istio-system~istio-system.svc.cluster.local-4 ResourceName:awood1018db10718dc0e1994886056a23fd7b9b1}
2019-12-12T00:26:25.294756Z	debug	sdsServiceLog	remove staled clients {ConnectionID:router~100.102.204.50~istio-ingressgateway-66b78f978b-dmnp8.istio-system~istio-system.svc.cluster.local-3 ResourceName:awood1018db10718dc0e1994886056a23fd7b9b1-cacert}
2019-12-12T00:26:25.294768Z	debug	sdsServiceLog	remove staled clients {ConnectionID:router~100.102.204.50~istio-ingressgateway-66b78f978b-dmnp8.istio-system~istio-system.svc.cluster.local-5 ResourceName:awood1018db10718dc0e1994886056a23fd7b9b1}
2019-12-12T00:26:25.294660Z	debug	sdsServiceLog	start staled connection cleanup job
2019-12-12T00:26:58.677632Z	debug	sdsServiceLog	CONNECTION ID: router~100.102.204.50~istio-ingressgateway-66b78f978b-dmnp8.istio-system~istio-system.svc.cluster.local-7, RESOURCE NAME: awood1018db10718dc0e1994886056a23fd7b9b1, EVENT: add a new connection
2019-12-12T00:26:58.677691Z	debug	sdsServiceLog	CONNECTION ID: router~100.102.204.50~istio-ingressgateway-66b78f978b-dmnp8.istio-system~istio-system.svc.cluster.local-7, RESOURCE NAME: awood1018db10718dc0e1994886056a23fd7b9b1, EVENT: received first SDS request from proxy "router~100.102.204.50~istio-ingressgateway-66b78f978b-dmnp8.istio-system~istio-system.svc.cluster.local", version info "", error details nil

2019-12-12T00:26:58.677711Z	debug	cacheLog	CONNECTION ID: router~100.102.204.50~istio-ingressgateway-66b78f978b-dmnp8.istio-system~istio-system.svc.cluster.local-7, RESOURCE NAME: awood1018db10718dc0e1994886056a23fd7b9b1, EVENT: calling SecretFetcher to search for secret awood1018db10718dc0e1994886056a23fd7b9b1
2019-12-12T00:26:58.677717Z	debug	secretFetcherLog	SecretFetcher search for secret awood1018db10718dc0e1994886056a23fd7b9b1
2019-12-12T00:26:58.677724Z	debug	secretFetcherLog	load secret awood1018db10718dc0e1994886056a23fd7b9b1 from secret fetcher: false
2019-12-12T00:26:58.677847Z	debug	sdsServiceLog	CONNECTION ID: router~100.102.204.50~istio-ingressgateway-66b78f978b-dmnp8.istio-system~istio-system.svc.cluster.local-8, RESOURCE NAME: awood1018db10718dc0e1994886056a23fd7b9b1-cacert, EVENT: add a new connection
2019-12-12T00:26:58.677897Z	debug	sdsServiceLog	CONNECTION ID: router~100.102.204.50~istio-ingressgateway-66b78f978b-dmnp8.istio-system~istio-system.svc.cluster.local-8, RESOURCE NAME: awood1018db10718dc0e1994886056a23fd7b9b1-cacert, EVENT: received first SDS request from proxy "router~100.102.204.50~istio-ingressgateway-66b78f978b-dmnp8.istio-system~istio-system.svc.cluster.local", version info "", error details nil

2019-12-12T00:26:58.677912Z	debug	cacheLog	CONNECTION ID: router~100.102.204.50~istio-ingressgateway-66b78f978b-dmnp8.istio-system~istio-system.svc.cluster.local-8, RESOURCE NAME: awood1018db10718dc0e1994886056a23fd7b9b1-cacert, EVENT: calling SecretFetcher to search for secret awood1018db10718dc0e1994886056a23fd7b9b1-cacert
2019-12-12T00:26:58.677918Z	debug	secretFetcherLog	SecretFetcher search for secret awood1018db10718dc0e1994886056a23fd7b9b1-cacert
2019-12-12T00:26:58.677924Z	debug	secretFetcherLog	load secret awood1018db10718dc0e1994886056a23fd7b9b1-cacert from secret fetcher: false
2019-12-12T00:26:58.684456Z	warn	secretFetcherLog	Cannot find secret awood1018db10718dc0e1994886056a23fd7b9b1, searching for fallback secret gateway-fallback
2019-12-12T00:26:58.684523Z	error	secretFetcherLog	cannot find secret awood1018db10718dc0e1994886056a23fd7b9b1 and cannot find fallback secret gateway-fallback
2019-12-12T00:26:58.684539Z	warn	cacheLog	CONNECTION ID: router~100.102.204.50~istio-ingressgateway-66b78f978b-dmnp8.istio-system~istio-system.svc.cluster.local-7, RESOURCE NAME: awood1018db10718dc0e1994886056a23fd7b9b1, EVENT: SecretFetcher cannot find secret awood1018db10718dc0e1994886056a23fd7b9b1 from cache
2019-12-12T00:26:58.684549Z	warn	sdsServiceLog	CONNECTION ID: router~100.102.204.50~istio-ingressgateway-66b78f978b-dmnp8.istio-system~istio-system.svc.cluster.local-7, RESOURCE NAME: awood1018db10718dc0e1994886056a23fd7b9b1, EVENT: waiting for ingress gateway secret for proxy "router~100.102.204.50~istio-ingressgateway-66b78f978b-dmnp8.istio-system~istio-system.svc.cluster.local"

2019-12-12T00:26:58.684601Z	warn	secretFetcherLog	Cannot find secret awood1018db10718dc0e1994886056a23fd7b9b1-cacert, searching for fallback secret gateway-fallback
2019-12-12T00:26:58.684608Z	error	secretFetcherLog	cannot find secret awood1018db10718dc0e1994886056a23fd7b9b1-cacert and cannot find fallback secret gateway-fallback
2019-12-12T00:26:58.684612Z	warn	cacheLog	CONNECTION ID: router~100.102.204.50~istio-ingressgateway-66b78f978b-dmnp8.istio-system~istio-system.svc.cluster.local-8, RESOURCE NAME: awood1018db10718dc0e1994886056a23fd7b9b1-cacert, EVENT: SecretFetcher cannot find secret awood1018db10718dc0e1994886056a23fd7b9b1-cacert from cache
2019-12-12T00:26:58.684619Z	warn	sdsServiceLog	CONNECTION ID: router~100.102.204.50~istio-ingressgateway-66b78f978b-dmnp8.istio-system~istio-system.svc.cluster.local-8, RESOURCE NAME: awood1018db10718dc0e1994886056a23fd7b9b1-cacert, EVENT: waiting for ingress gateway secret for proxy "router~100.102.204.50~istio-ingressgateway-66b78f978b-dmnp8.istio-system~istio-system.svc.cluster.local"

2019-12-12T00:27:13.477775Z	debug	secretFetcherLog	secret awood1018db10718dc0e1994886056a23fd7b9b1 is added as a server certificate
2019-12-12T00:27:13.477822Z	debug	cacheLog	CONNECTION ID: router~100.102.204.50~istio-ingressgateway-66b78f978b-dmnp8.istio-system~istio-system.svc.cluster.local-5, RESOURCE NAME: awood1018db10718dc0e1994886056a23fd7b9b1, EVENT: secret cache is updated
2019-12-12T00:27:13.477864Z	error	sdsServiceLog	CONNECTION ID: router~100.102.204.50~istio-ingressgateway-66b78f978b-dmnp8.istio-system~istio-system.svc.cluster.local-5, RESOURCE NAME: awood1018db10718dc0e1994886056a23fd7b9b1, EVENT: NotifyProxy failed. No connection with id "router~100.102.204.50~istio-ingressgateway-66b78f978b-dmnp8.istio-system~istio-system.svc.cluster.local-5" can be found
2019-12-12T00:27:13.477881Z	error	cacheLog	CONNECTION ID: router~100.102.204.50~istio-ingressgateway-66b78f978b-dmnp8.istio-system~istio-system.svc.cluster.local-5, RESOURCE NAME: awood1018db10718dc0e1994886056a23fd7b9b1, EVENT: failed to notify secret change for proxy: no connection with id "router~100.102.204.50~istio-ingressgateway-66b78f978b-dmnp8.istio-system~istio-system.svc.cluster.local-5" can be found
2019-12-12T00:27:13.477900Z	debug	secretFetcherLog	secret awood1018db10718dc0e1994886056a23fd7b9b1-cacert is added as a client CA cert (from a compound Secret)
2019-12-12T00:27:13.477922Z	debug	cacheLog	CONNECTION ID: router~100.102.204.50~istio-ingressgateway-66b78f978b-dmnp8.istio-system~istio-system.svc.cluster.local-6, RESOURCE NAME: awood1018db10718dc0e1994886056a23fd7b9b1-cacert, EVENT: secret cache is updated
2019-12-12T00:27:13.477948Z	error	sdsServiceLog	CONNECTION ID: router~100.102.204.50~istio-ingressgateway-66b78f978b-dmnp8.istio-system~istio-system.svc.cluster.local-6, RESOURCE NAME: awood1018db10718dc0e1994886056a23fd7b9b1-cacert, EVENT: NotifyProxy failed. No connection with id "router~100.102.204.50~istio-ingressgateway-66b78f978b-dmnp8.istio-system~istio-system.svc.cluster.local-6" can be found
2019-12-12T00:27:13.477957Z	error	cacheLog	CONNECTION ID: router~100.102.204.50~istio-ingressgateway-66b78f978b-dmnp8.istio-system~istio-system.svc.cluster.local-6, RESOURCE NAME: awood1018db10718dc0e1994886056a23fd7b9b1-cacert, EVENT: failed to notify secret change for proxy: no connection with id "router~100.102.204.50~istio-ingressgateway-66b78f978b-dmnp8.istio-system~istio-system.svc.cluster.local-6" can be found
2019-12-12T00:28:33.326025Z	info	secretFetcherLog	secret awood1018db10718dc0e1994886056a23fd7b9b1 is deleted
2019-12-12T00:28:33.326115Z	debug	cacheLog	CONNECTION ID: router~100.102.204.50~istio-ingressgateway-66b78f978b-dmnp8.istio-system~istio-system.svc.cluster.local-5, RESOURCE NAME: awood1018db10718dc0e1994886056a23fd7b9b1, EVENT: secret cache is deleted
2019-12-12T00:28:33.326201Z	error	sdsServiceLog	CONNECTION ID: router~100.102.204.50~istio-ingressgateway-66b78f978b-dmnp8.istio-system~istio-system.svc.cluster.local-5, RESOURCE NAME: awood1018db10718dc0e1994886056a23fd7b9b1, EVENT: NotifyProxy failed. No connection with id "router~100.102.204.50~istio-ingressgateway-66b78f978b-dmnp8.istio-system~istio-system.svc.cluster.local-5" can be found
2019-12-12T00:28:33.326284Z	error	cacheLog	CONNECTION ID: router~100.102.204.50~istio-ingressgateway-66b78f978b-dmnp8.istio-system~istio-system.svc.cluster.local-5, RESOURCE NAME: awood1018db10718dc0e1994886056a23fd7b9b1, EVENT: failed to notify secret change for proxy: no connection with id "router~100.102.204.50~istio-ingressgateway-66b78f978b-dmnp8.istio-system~istio-system.svc.cluster.local-5" can be found
2019-12-12T00:28:33.326325Z	info	secretFetcherLog	secret awood1018db10718dc0e1994886056a23fd7b9b1-cacert is deleted
2019-12-12T00:28:33.326356Z	debug	cacheLog	CONNECTION ID: router~100.102.204.50~istio-ingressgateway-66b78f978b-dmnp8.istio-system~istio-system.svc.cluster.local-8, RESOURCE NAME: awood1018db10718dc0e1994886056a23fd7b9b1-cacert, EVENT: secret cache is deleted
2019-12-12T00:28:33.326413Z	debug	sdsServiceLog	CONNECTION ID: router~100.102.204.50~istio-ingressgateway-66b78f978b-dmnp8.istio-system~istio-system.svc.cluster.local-8, RESOURCE NAME: awood1018db10718dc0e1994886056a23fd7b9b1-cacert, EVENT: received push channel request for proxy "router~100.102.204.50~istio-ingressgateway-66b78f978b-dmnp8.istio-system~istio-system.svc.cluster.local"
2019-12-12T00:28:33.326447Z	debug	sdsServiceLog	CONNECTION ID: router~100.102.204.50~istio-ingressgateway-66b78f978b-dmnp8.istio-system~istio-system.svc.cluster.local-8, RESOURCE NAME: awood1018db10718dc0e1994886056a23fd7b9b1-cacert, EVENT: close connection for proxy "router~100.102.204.50~istio-ingressgateway-66b78f978b-dmnp8.istio-system~istio-system.svc.cluster.local"
2019-12-12T00:28:33.326551Z	info	sdsServiceLog	CONNECTION ID: router~100.102.204.50~istio-ingressgateway-66b78f978b-dmnp8.istio-system~istio-system.svc.cluster.local-8, RESOURCE NAME: awood1018db10718dc0e1994886056a23fd7b9b1-cacert, EVENT: connection is terminated: rpc error: code = Canceled desc = context canceled
2019-12-12T00:28:33.400910Z	debug	sdsServiceLog	CONNECTION ID: router~100.102.204.50~istio-ingressgateway-66b78f978b-dmnp8.istio-system~istio-system.svc.cluster.local-9, RESOURCE NAME: awood1018db10718dc0e1994886056a23fd7b9b1-cacert, EVENT: add a new connection
2019-12-12T00:28:33.400968Z	debug	sdsServiceLog	CONNECTION ID: router~100.102.204.50~istio-ingressgateway-66b78f978b-dmnp8.istio-system~istio-system.svc.cluster.local-9, RESOURCE NAME: awood1018db10718dc0e1994886056a23fd7b9b1-cacert, EVENT: received first SDS request from proxy "router~100.102.204.50~istio-ingressgateway-66b78f978b-dmnp8.istio-system~istio-system.svc.cluster.local", version info "", error details nil

2019-12-12T00:28:33.400991Z	debug	cacheLog	CONNECTION ID: router~100.102.204.50~istio-ingressgateway-66b78f978b-dmnp8.istio-system~istio-system.svc.cluster.local-9, RESOURCE NAME: awood1018db10718dc0e1994886056a23fd7b9b1-cacert, EVENT: calling SecretFetcher to search for secret awood1018db10718dc0e1994886056a23fd7b9b1-cacert
2019-12-12T00:28:33.400997Z	debug	secretFetcherLog	SecretFetcher search for secret awood1018db10718dc0e1994886056a23fd7b9b1-cacert
2019-12-12T00:28:33.401004Z	debug	secretFetcherLog	load secret awood1018db10718dc0e1994886056a23fd7b9b1-cacert from secret fetcher: false
2019-12-12T00:28:33.407724Z	warn	secretFetcherLog	Cannot find secret awood1018db10718dc0e1994886056a23fd7b9b1-cacert, searching for fallback secret gateway-fallback
2019-12-12T00:28:33.407750Z	error	secretFetcherLog	cannot find secret awood1018db10718dc0e1994886056a23fd7b9b1-cacert and cannot find fallback secret gateway-fallback
2019-12-12T00:28:33.407757Z	warn	cacheLog	CONNECTION ID: router~100.102.204.50~istio-ingressgateway-66b78f978b-dmnp8.istio-system~istio-system.svc.cluster.local-9, RESOURCE NAME: awood1018db10718dc0e1994886056a23fd7b9b1-cacert, EVENT: SecretFetcher cannot find secret awood1018db10718dc0e1994886056a23fd7b9b1-cacert from cache
2019-12-12T00:28:33.409307Z	warn	sdsServiceLog	CONNECTION ID: router~100.102.204.50~istio-ingressgateway-66b78f978b-dmnp8.istio-system~istio-system.svc.cluster.local-9, RESOURCE NAME: awood1018db10718dc0e1994886056a23fd7b9b1-cacert, EVENT: waiting for ingress gateway secret for proxy "router~100.102.204.50~istio-ingressgateway-66b78f978b-dmnp8.istio-system~istio-system.svc.cluster.local"

At a quick glance it looks like the cleanup job which is taking out stalled clients seems to be the culprit.

@aaronjwood
Copy link
Contributor Author

I should instead say that the recycle logic here

defer recycleConnection(conID, resourceName)
and/or here
recycleConnection(conID, resourceName)
seems to be creating some sort of logical race...

@aaronjwood
Copy link
Contributor Author

Also, for some context the reason why you see that the secret does not exist for some time in the logs (even in the scenarios that work correctly) is due to the behavior of our own software. What we do is we first program the gateway config into the cluster and then later (between 30 seconds and 1 min usually) program the secret into the cluster.

@aaronjwood
Copy link
Contributor Author

Saw an interesting scenario today:

2019-12-13T18:21:37.813823Z	error	sdsServiceLog	CONNECTION ID: router~100.102.204.50~istio-ingressgateway-66b78f978b-dmnp8.istio-system~istio-system.svc.cluster.local-13, RESOURCE NAME: awood01ded641ce22a5805cb501f6413c9fd0c4, EVENT: NotifyProxy failed. No connection with id "router~100.102.204.50~istio-ingressgateway-66b78f978b-dmnp8.istio-system~istio-system.svc.cluster.local-13" can be found
2019-12-13T18:21:37.813875Z	error	cacheLog	CONNECTION ID: router~100.102.204.50~istio-ingressgateway-66b78f978b-dmnp8.istio-system~istio-system.svc.cluster.local-13, RESOURCE NAME: awood01ded641ce22a5805cb501f6413c9fd0c4, EVENT: failed to notify secret change for proxy: no connection with id "router~100.102.204.50~istio-ingressgateway-66

The connection its trying to use called router~100.102.204.50~istio-ingressgateway-66b78f978b-dmnp8.istio-system~istio-system.svc.cluster.local-13 was recycled yesterday:

2019-12-12T01:41:25.294628Z	debug	sdsServiceLog	start staled connection cleanup job
2019-12-12T01:41:25.294675Z	debug	sdsServiceLog	remove staled clients {ConnectionID:router~100.102.204.50~istio-ingressgateway-66b78f978b-dmnp8.istio-system~istio-system.svc.cluster.local-13 ResourceName:awood01ded641ce22a5805cb501f6413c9fd0c4}
2019-12-12T01:41:25.294691Z	debug	sdsServiceLog	remove staled clients {ConnectionID:router~100.102.204.50~istio-ingressgateway-66b78f978b-dmnp8.istio-system~istio-system.svc.cluster.local-12 ResourceName:awood01ded641ce22a5805cb501f6413c9fd0c4-cacert}

@perry9186
Copy link

perry9186 commented Dec 26, 2019

Hi,
It happens to me too with istio 1.4.2
in the ingress gateway :

ingress sds :
2019-12-26T05:25:36.481345Z info sdsServiceLog CONNECTION ID: router10.233.65.7istio-ingressgateway-6677d6c5f5-6z7wq.istio-system~istio-system.svc.cluster.local-1, RESOURCE NAME: test-cert, EVENT: connection is terminated: rpc error: code = Canceled desc = context canceled
2019-12-26T05:25:36.481415Z error sdsServiceLog Remote side closed connection

istio-proxy :
2019-12-26T07:45:46.407147Z error Epoch 0 exited with error: signal: segmentation fault (core dumped)
2019-12-26T07:45:46.407181Z info 1 active epochs running

it seems like the istio-proxy inside the ingress crashed and didn't recover.( i looked for info Envoy proxy is ready)

can someone advise?

thanks,
perry

@JimmyCYJ JimmyCYJ modified the milestone: 1.6 Feb 4, 2020
@JimmyCYJ JimmyCYJ added this to the 1.6 milestone Feb 28, 2020
@istio-policy-bot istio-policy-bot added the lifecycle/stale Indicates a PR or issue hasn't been manipulated by an Istio team member for a while label Mar 3, 2020
@istio-policy-bot istio-policy-bot added the lifecycle/automatically-closed Indicates a PR or issue that has been closed automatically. label Mar 18, 2020
@aaronjwood
Copy link
Contributor Author

Note that SDS in 1.5 is not quite ready yet: #22443

@dkalani
Copy link

dkalani commented Apr 14, 2020

@howardjohn should we reopen this ticket @aaronjwood reported that the problem is still seen.

@JimmyCYJ JimmyCYJ reopened this Apr 14, 2020
@istio-policy-bot istio-policy-bot removed the lifecycle/stale Indicates a PR or issue hasn't been manipulated by an Istio team member for a while label Apr 14, 2020
@JimmyCYJ JimmyCYJ removed this from the 1.6 milestone May 5, 2020
@jullianoacqio
Copy link

I face the same problem in version 1.5.

@JimmyCYJ
Copy link
Member

#24817 is the fix for the issue.

@JimmyCYJ JimmyCYJ closed this as completed Jul 8, 2020
@aaronjwood
Copy link
Contributor Author

@JimmyCYJ what release will this be targeted for? Will it be backported anywhere?

@JimmyCYJ
Copy link
Member

what release will this be targeted for? Will it be backported anywhere?

It will be shipped with release-1.7. Which will be announced on August 11th (https://github.com/istio/istio/wiki/Istio-Release-1.7)
@williamaronli could you cherrypick this fix into 1.5 and 1.6 as well?

@williamaronli
Copy link
Contributor

will cherry-pick to 1.5 and 1.6 this week

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/config area/security lifecycle/automatically-closed Indicates a PR or issue that has been closed automatically.
Projects
None yet
Development

No branches or pull requests