ingress SDS not getting secret updates #23715

howardjohn · 2020-05-11T17:34:59Z

Will add more info here later if I can reproduce it.

Running from 5f8807a

I deployed a secret with cert-manager letsencrypt-staging. I later moved it to prod, resulting in the secret updating. However, the gateway used the old cert. I then deleted the secret and recreated it -- same thing, old secret.

I verified the cert in config_dump does not match the secret.

Then I restarted the ingress pod and it finally picked up the proper cert.


howardjohn · 2020-05-12T20:07:58Z

Another example:

2020-05-12T20:06:59.884794Z     info    secretfetcher   scrtUpdated is called on kubernetes secret httpbin-credential
2020-05-12T20:06:59.885457Z     error   sds     resource:httpbin-credential NotifyProxy failed. No connection with id "router~10.28.0.123~istio-ingressgateway-7fd679cd56-z6gvh.istio-system~istio-system.svc.cluster.local-7" can be found
2020-05-12T20:06:59.885488Z     error   cache   resource:httpbin-credential failed to notify secret change for proxy: no connection with id "router~10.28.0.123~istio-ingressgateway-7fd679cd56-z6gvh.istio-system~istio-system.svc.cluster.local-7" can be found

The config dump clearly shows httpbin-credential is present.

howardjohn · 2020-05-12T20:17:41Z

This doesn't happen 100% of the time. Wrote a simple script to verify:

# Re-sign the cert 51 times with a new serial each time, patch the secret, and dump what the gateway holds.
for i in {0..50}; do
    openssl x509 -req -days 365 -CA example.com.crt -CAkey example.com.key -set_serial "$i" -in httpbin.example.com.csr -out httpbin.example.com.crt
    kubectl patch secret/httpbin-credential -n istio-system --type merge -p '{"data":{"tls.crt":"'"$(base64 -w0 < httpbin.example.com.crt)"'", "tls.key":"'"$(base64 -w0 < httpbin.example.com.key)"'"}}'
    # "ik pc secret" is presumably a local alias for "istioctl proxy-config secret"
    ik pc secret istio-ingressgateway-7fd679cd56-gzw54.istio-system
done
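
To flag a stale iteration without reading each table by eye, the serial reported by the gateway could be compared with the loop counter. A minimal sketch, assuming the SERIAL NUMBER column is printed in decimal as in the tables later in this thread; `gateway_serial` is a hypothetical helper name, not part of istioctl:

```shell
# Hypothetical helper: print the SERIAL NUMBER column for one resource from
# the table that `istioctl proxy-config secret` prints (table read from stdin).
# Prints nothing if the resource is missing or not ACTIVE (e.g. WARMING).
gateway_serial() {
    awk -v name="$1" '
        $1 == name {
            for (i = 2; i <= NF; i++) {
                # The serial sits two fields after STATUS: ACTIVE, VALID CERT, SERIAL
                if ($i == "ACTIVE") { print $(i + 2); exit }
            }
        }'
}
```

Inside the loop, something like `[ "$(istioctl proxy-config secret "$POD" | gateway_serial httpbin-credential)" = "$i" ] || echo "stale at serial $i"` would report mismatches, where `$POD` is the `<pod>.<namespace>` name. A short pause before the comparison is probably needed, since the push is not instantaneous.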

JimmyCYJ · 2020-05-14T20:30:37Z

Does the config dump show the old secret? It would take some time for the SDS flow to push the new one. Are there istio-proxy logs available?

howardjohn · 2020-05-18T23:39:31Z

I am pretty sure what happens is:

Empty secret is created
Agent fails to read it
Secret is updated with key/cert

But we don't handle the update. I will verify this in a bit; it's just a guess right now.

howardjohn · 2020-05-18T23:40:13Z

Btw it's not just changing the secret; I reproduced with a brand new cert, which is why I think ^ is the root cause. Will try to get a 100% reproducer.

howardjohn · 2020-05-19T15:36:19Z

I think there are two related problems

Deleting a secret and creating an empty secret behave differently.

Scenario one:

Apply secret with cert
Apply empty secret
Remove gw and reapply
Nothing happens, old cert remains, old cert is sent

Scenario two:

Apply secret with cert
Delete secret
Nothing happens, old cert remains
Remove gw and reapply, NO cert is sent

Empty secret means

apiVersion: v1
data:
  ca.crt: ""
  tls.crt: ""
  tls.key: ""
kind: Secret
metadata:
  name: certificate
  namespace: istio-system
type: kubernetes.io/tls

Secret update broken in some cases. Reproducer:

Create secret+gw
Delete secret
Delete gw
<wait 45s for drain>
Create gw
Recreate secret
Secret is stuck warming permanently
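
The reproducer above can be scripted roughly as follows. This is a sketch only: `reproduce_stuck_warming` and its arguments are hypothetical names, the secret name and namespace are the httpbin-credential / istio-system pair from the logs earlier in this thread, and it assumes the Gateway manifest has been saved to a file.

```shell
# Sketch of the reproducer; function and variable names are illustrative.
# $1 = Gateway manifest file, $2 = TLS key file, $3 = TLS cert file
reproduce_stuck_warming() {
    gw_manifest="$1"; key="$2"; cert="$3"
    ns="istio-system"; name="httpbin-credential"

    # Create secret+gw
    kubectl create -n "$ns" secret tls "$name" --key="$key" --cert="$cert"
    kubectl apply -f "$gw_manifest"
    # Delete secret, then delete gw
    kubectl delete -n "$ns" secret "$name"
    kubectl delete -f "$gw_manifest"
    # Wait 45s for drain
    sleep 45
    # Create gw, then recreate secret
    kubectl apply -f "$gw_manifest"
    kubectl create -n "$ns" secret tls "$name" --key="$key" --cert="$cert"
}
```

After it returns, `istioctl proxy-config secret <ingress-pod>.istio-system` should show the secret as WARMING when the bug is hit.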

howardjohn · 2020-05-19T15:42:21Z

I confirmed similar behavior is present in Istio 1.5, so it's not a (new) regression.

howardjohn · 2020-05-19T15:42:49Z

Bumping to P0 as we have a reproducer now

JimmyCYJ · 2020-05-19T20:34:36Z

Thanks @howardjohn

Secret update broken in some cases. Reproducer:

Create secret+gw
Delete secret
Delete gw
<wait 45s for drain>
Create gw
Recreate secret
Secret is stuck warming permanently

@williamaronli could you take a look and try to reproduce it following these steps?

===============
Scenario one:

Apply secret with cert
Apply empty secret
Remove gw and reapply
Nothing happens, old cert remains, old cert is sent

What happens here is that when the empty secret is applied, the SDS agent detects that the secret is empty and rejects it. The cached copy in the SDS agent is the last valid secret, which is the old cert. The next time, when the gw is removed and reapplied, the gw asks the SDS agent for the secret, and the SDS agent pushes the cached secret to the gw. This is expected.

=============

Scenario two:

Apply secret with cert
Delete secret
Nothing happens, old cert remains
Remove gw and reapply, NO cert is sent

What happens here is that when the secret is deleted, the SDS agent removes the cached secret as well. That's why the gw does not get a cert after being removed and reapplied.
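
The two explanations above can be captured in a toy model. This is not Istio code; it is a sketch that assumes the agent keeps a single cached cert per secret name, and every function name here is made up for illustration.

```shell
# Toy model only: one variable standing in for the agent's cached secret.
cached_cert=""

# A secret was applied. An empty cert is rejected and the cache is left alone.
on_secret_applied() {
    if [ -z "$1" ]; then
        return 1
    fi
    cached_cert="$1"
}

# A secret was deleted. The cached copy is dropped as well.
on_secret_deleted() {
    cached_cert=""
}

# The gateway was re-applied and asks for the secret again.
on_gateway_request() {
    printf '%s\n' "${cached_cert:-<no cert sent>}"
}
```

In scenario one, applying a cert and then an empty secret leaves `on_gateway_request` printing the old cert. In scenario two, applying a cert and then calling `on_secret_deleted` leaves it with nothing to send.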

==============

williamaronli · 2020-05-26T22:35:05Z

Before creating, I verified it works:
fengxiangli@williamaronli:~/istio-1.6.0$ curl -v -HHost:httpbin.example.com --resolve "httpbin.example.com:$SECURE_INGRESS_PORT:$INGRESS_HOST" \
--cacert example.com.crt "https://httpbin.example.com:$SECURE_INGRESS_PORT/status/418"
< HTTP/2 418
< server: istio-envoy
< date: Tue, 26 May 2020 17:07:55 GMT
< x-more-info: http://tools.ietf.org/html/rfc2324
< access-control-allow-origin: *
< access-control-allow-credentials: true
< content-length: 135
< x-envoy-upstream-service-time: 2
<

-=[ teapot ]=-

   _...._
 .'  _ _ `.
| ."` ^ `". _,
\_;`"---"`|//
  |       ;/
  \_     _/
    `"""`

Create secret+gw:
fengxiangli@williamaronli:/istio-1.6.0$ kubectl create -n istio-system secret tls httpbin-credential --key=httpbin.example.com.key --cert=httpbin.example.com.crt
secret/httpbin-credential created
fengxiangli@williamaronli:/istio-1.6.0$ cat <<EOF | kubectl apply -f -
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: mygateway
spec:
  selector:
    istio: ingressgateway # use istio default ingress gateway
  servers:
  - port:
      number: 443
      name: https
      protocol: HTTPS
    tls:
      mode: SIMPLE
      credentialName: httpbin-credential # must be the same as secret
    hosts:
    - httpbin.example.com
EOF
gateway.networking.istio.io/mygateway created

Delete secret

fengxiangli@williamaronli:/istio-1.6.0$ kubectl -n istio-system delete secret httpbin-credential
secret "httpbin-credential" deleted
fengxiangli@williamaronli:/istio-1.6.0$

Delete gw
<wait 45s for drain>

fengxiangli@williamaronli:/istio-1.6.0$ kubectl delete gateway mygateway
gateway.networking.istio.io "mygateway" deleted

Create gw
fengxiangli@williamaronli:/istio-1.6.0$ cat <<EOF | kubectl apply -f -
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: mygateway
spec:
  selector:
    istio: ingressgateway # use istio default ingress gateway
  servers:
  - port:
      number: 443
      name: https
      protocol: HTTPS
    tls:
      mode: SIMPLE
      credentialName: httpbin-credential # must be the same as secret
    hosts:
    - httpbin.example.com
EOF
gateway.networking.istio.io/mygateway created

Recreate secret
fengxiangli@williamaronli:/istio-1.6.0$ kubectl create -n istio-system secret tls httpbin-credential --key=httpbin.example.com.key --cert=httpbin.example.com.crt
secret/httpbin-credential created

Secret is stuck warming permanently

fengxiangli@williamaronli:/istio-1.6.0$ kubectl get pods -n istio-system
NAME                                    READY   STATUS    RESTARTS   AGE
grafana-54b54568fc-h8gtl                1/1     Running   0          2d23h
istio-egressgateway-5cbb74b6d-7kjf6     1/1     Running   0          2d23h
istio-ingressgateway-74cb7595bd-tgxqn   1/1     Running   0          2d23h
istio-tracing-9dd6c4f7c-hnzsl           1/1     Running   0          2d23h
istiod-6b8bf87986-29tgd                 1/1     Running   0          2d23h
kiali-d45468dc4-2g65t                   1/1     Running   0          2d23h
prometheus-5c84c494dd-5dcbg             2/2     Running   0          2d23h

fengxiangli@williamaronli:/istio-1.6.0$ istioctl proxy-config secret istio-ingressgateway-74cb7595bd-tgxqn.istio-system
RESOURCE NAME          TYPE           STATUS      VALID CERT     SERIAL NUMBER                               NOT AFTER                NOT BEFORE
httpbin-credential                    WARMING     false
default                Cert Chain     ACTIVE      true           117649866128878731488369555542841752511     2020-05-27T11:13:24Z     2020-05-26T11:13:24Z
ROOTCA                 CA             ACTIVE      true           153371244782393518006123463350295732481     2030-05-21T23:08:13Z     2020-05-23T23:08:13Z

williamaronli · 2020-05-26T22:59:34Z

restart the ingress pod:

fengxiangli@williamaronli:/istio-1.6.0$ kubectl -n istio-system rollout restart deployment istio-ingressgateway
deployment.apps/istio-ingressgateway restarted
fengxiangli@williamaronli:/istio-1.6.0$ kubectl get pods -n istio-system
NAME                                    READY   STATUS    RESTARTS   AGE
grafana-54b54568fc-h8gtl                1/1     Running   0          2d23h
istio-egressgateway-5cbb74b6d-7kjf6     1/1     Running   0          2d23h
istio-ingressgateway-5f9b77885c-99f44   1/1     Running   0          67s
istio-tracing-9dd6c4f7c-hnzsl           1/1     Running   0          2d23h
istiod-6b8bf87986-29tgd                 1/1     Running   0          2d23h
kiali-d45468dc4-2g65t                   1/1     Running   0          2d23h
prometheus-5c84c494dd-5dcbg             2/2     Running   0          2d23h

fengxiangli@williamaronli:/istio-1.6.0$ istioctl proxy-config secret istio-ingressgateway-5f9b77885c-99f44.istio-system
RESOURCE NAME          TYPE           STATUS     VALID CERT     SERIAL NUMBER                               NOT AFTER                NOT BEFORE
httpbin-credential     Cert Chain     ACTIVE     true           0                                           2021-05-26T18:44:29Z     2020-05-26T18:44:29Z
default                Cert Chain     ACTIVE     true           74707085384625053020352826882864255223      2020-05-27T22:57:09Z     2020-05-26T22:57:09Z
ROOTCA                 CA             ACTIVE     true           153371244782393518006123463350295732481     2030-05-21T23:08:13Z     2020-05-23T23:08:13Z

Now, after restarting the pod, the cert is active.

williamaronli · 2020-05-27T06:26:54Z

check the logs using kubectl logs -n istio-system "$(kubectl get pod -l istio=ingressgateway -n istio-system -o jsonpath='{.items[0].metadata.name}')"

2020-05-27T01:53:31.525183Z error sds Remote side closed connection
2020-05-27T01:53:31.526147Z info transport: loopyWriter.run returning. connection error: desc = "transport is closing"
2020-05-27T01:58:48.941856Z info sds resource:httpbin-credential new connection
2020-05-27T01:58:48.944907Z warn secretfetcher Cannot find secret httpbin-credential, searching for fallback secret gateway-fallback
2020-05-27T01:58:48.944939Z error secretfetcher cannot find secret httpbin-credential and cannot find fallback secret gateway-fallback
2020-05-27T01:58:48.944945Z warn cache resource:httpbin-credential SecretFetcher cannot find secret httpbin-credential from cache
2020-05-27T01:58:48.944958Z warn sds resource:httpbin-credential waiting for ingress gateway secret for proxy "router~10.244.0.7~istio-ingressgateway-74cb7595bd-2nllr.istio-system~istio-system.svc.cluster.local"
2020-05-27T01:58:58.534441Z error sds resource:httpbin-credential NotifyProxy failed. No connection with id "router~10.244.0.7~istio-ingressgateway-74cb7595bd-2nllr.istio-system~istio-system.svc.cluster.local-4" can be found
2020-05-27T01:58:58.534500Z error cache resource:httpbin-credential failed to notify secret change for proxy: no connection with id "router~10.244.0.7~istio-ingressgateway-74cb7595bd-2nllr.istio-system~istio-system.svc.cluster.local-4" can be found
2020-05-27T02:14:29.905068Z info Subchannel Connectivity change to CONNECTING
2020-05-27T02:14:29.905156Z info transport: loopyWriter.run returning. connection error: desc = "transport is closing"
2020-05-27T02:14:29.905464Z info pickfirstBalancer: HandleSubConnStateChange: 0xc0003247d0, {CONNECTING }
2020-05-27T02:14:29.905535Z info Channel Connectivity change to CONNECTING
2020-05-27T02:14:29.905363Z info Subchannel picks a new address "istiod.istio-system.svc:15012" to connect
2020-05-27T02:14:29.912461Z info Subchannel Connectivity change to READY
2020-05-27T02:14:29.912506Z info pickfirstBalancer: HandleSubConnStateChange: 0xc0003247d0, {READY }
2020-05-27T02:14:29.912513Z info Channel Connectivity change to READY

williamaronli · 2020-05-27T06:36:23Z

From the log we can trace this to https://github.com/williamaronli/istio/blob/504e17395450b26d26f322c5b197771f09d21e21/security/pkg/nodeagent/secretfetcher/secretfetcher.go#L532

williamaronli · 2020-05-28T00:30:11Z

This is a flaky issue:

Whether the secret is successfully pushed to the ingress gateway proxy depends on the time between creating the gw and creating the secret.
If the interval between creating the gateway and creating the secret is too long, the secret stays in WARMING status permanently until the pod is restarted.
1 second interval example (failed):
bash ~/create_gw.sh; sleep 1;bash ~/create_secret.sh

RESOURCE NAME          TYPE           STATUS      VALID CERT     SERIAL NUMBER                               NOT AFTER                NOT BEFORE
httpbin-credential                    WARMING     false

0.2 second interval example (success):

bash ~/create_gw.sh; sleep 0.2;bash ~/create_secret.sh

RESOURCE NAME          TYPE           STATUS     VALID CERT     SERIAL NUMBER                               NOT AFTER                NOT BEFORE
httpbin-credential     Cert Chain     ACTIVE     true           0                                           2021-05-27T23:30:43Z     2020-05-27T23:30:43Z

If we switch the order and create the secret before the gateway, the problem does not show up:

bash ~/create_secret.sh; sleep 5;bash ~/create_gw.sh 
istioctl proxy-config secret istio-ingressgateway-76ccf47fb4-z8vz8.istio-system
RESOURCE NAME          TYPE           STATUS     VALID CERT     SERIAL NUMBER                               NOT AFTER                NOT BEFORE
httpbin-credential     Cert Chain     ACTIVE     true           0                                           2021-05-27T23:30:43Z     2020-05-27T23:30:43Z

error debug log

https://docs.google.com/document/d/1z4EnJ-T9caRHbABFw-wVfdRZNh0szs2lFpwD1aInrc0/edit

Potential root cause

My guess is that if the gateway tries to fetch the secret while the secret is not ready or not yet created, the gateway gets stuck and does not retry fetching it.

Potential solution

Set up a retry policy in the ingress gateway after some timeout.

williamaronli · 2020-06-05T18:19:28Z

Another finding:

Besides restarting the ingress pod, there is another way to recover: if we delete the secret and then recreate it, the ingress can also get that secret and works.

williamaronli · 2020-06-08T22:54:09Z

Root cause, based on the findings so far:
The connection cache is not hit when the secret is created after the gateway is created.

error log:

2020-06-05T05:26:29.309881Z	error	sds	resource:httpbin-credential NotifyProxy failed. No connection with id "router~10.244.0.14~istio-ingressgateway-7b986dc8bc-kb6vk.istio-system~istio-system.svc.cluster.local-4" can be found
2020-06-05T05:26:29.309951Z	error	cache	resource:httpbin-credential failed to notify secret change for proxy: no connection with id

code: https://github.com/istio/istio/blob/master/security/pkg/nodeagent/sds/sdsservice.go#L482
From the code, if the connKey does not hit the SDS clients cache, it returns an error and does not notify the proxy about the secret update.

Some workarounds:

Ask customers to follow the correct order: create the secret first, before the gateway.
Instead of restarting the pod, delete the secret and recreate it; the secret will then be pushed to the proxy.
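
The second workaround can be sketched as a small helper. `recreate_tls_secret` is a hypothetical name; the kubectl invocations are the same ones used earlier in this thread, and the sketch assumes the key and cert files are still on disk.

```shell
# Hypothetical helper: delete a TLS secret and create it again from the same
# key and cert files.
# $1 = namespace, $2 = secret name, $3 = key file, $4 = cert file
recreate_tls_secret() {
    kubectl -n "$1" delete secret "$2"
    kubectl create -n "$1" secret tls "$2" --key="$3" --cert="$4"
}
```

For the example in this thread that would be `recreate_tls_secret istio-system httpbin-credential httpbin.example.com.key httpbin.example.com.crt`.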

howardjohn · 2020-06-08T23:01:06Z

Ask customers to follow the correct order: create the secret first, before the gateway.

This is not the "correct" order. One of the most common use cases is deploying certs with cert-manager, and this is pretty broken today. We cannot add this arbitrary restriction. Pushing some data to envoy shouldn't need to be this complicated.


myidpt · 2020-06-23T17:26:01Z

Folks, is this fixed?

williamaronli · 2020-06-23T20:24:54Z

It is fixed by this PR: #24817

SabySen · 2021-07-29T21:30:51Z

Working with Istio 1.10.3, I see this issue happening again. I restarted every possible artifact and yet it keeps giving the same error. Details here:
-- Gateway --

apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: internal-ingress-gateway
  namespace: default
spec:
  selector:
    istio: internal-ingressgateway # use Istio default gateway implementation
  servers:
  - port:
      number: 443
      name: https
      protocol: HTTPS
    tls:
      mode: SIMPLE
      credentialName: internal-gkegatewaysecret
    hosts:
    - "*"

Here is the secret (redacted):

Name:         internal-gkegatewaysecret
Namespace:    istio-system
Labels:       app.kubernetes.io/instance=istio-certs
Annotations:  argocd.argoproj.io/sync-wave: -15
              cert-manager.io/alt-names:
                internal-non-prod.digital-subscription-qa.cvs.com,internal-sit1.digital-subscription-qa.cvs.com,internal-sit2.digital-subscription-qa.cvs.com,internal-argocd-np.digital-subsc...
              cert-manager.io/certificate-name: internal-gkegatewaysecret
              cert-manager.io/common-name:  internal-non-prod.digital-subscription-qa.cvs.com
              cert-manager.io/ip-sans:
              cert-manager.io/issuer-group: cert-manager.io
              cert-manager.io/issuer-kind: ClusterIssuer
              cert-manager.io/issuer-name: tpp-venafi-issuer
              cert-manager.io/uri-sans:

Type:  Opaque

Data
====
ca.crt:   1338 bytes
cert:     6677 bytes
key:      1679 bytes
tls.crt:  4444 bytes
tls.key:  1679 bytes

This gateway was earlier associated with a different, older cert, and even after the change the old cert is still being served. This is completely screwing up our calls to the services exposed through this gateway, because TLS fails when the host names do not match, e.g. as below:

POST https://internal-sit1.digital-subscription-qa.cvs.com/status
Error: Hostname/IP does not match certificate's altnames: Host: internal-sit1.digital-subscription-qa.cvs.com. is not in the cert's altnames: DNS:dev1.digital-subscription-dev.cvs.com

howardjohn added the area/security label May 11, 2020

istio-policy-bot added the lifecycle/needs-triage label May 14, 2020

JimmyCYJ self-assigned this May 14, 2020

istio-policy-bot removed the lifecycle/needs-triage label May 15, 2020

howardjohn added this to the 1.7 milestone May 19, 2020

istio-policy-bot added the lifecycle/needs-escalation label May 27, 2020

myidpt assigned williamaronli May 27, 2020

istio-policy-bot removed the lifecycle/needs-escalation label Jun 8, 2020

williamaronli added a commit to williamaronli/istio that referenced this issue Jun 12, 2020

fix ingress SDS not getting secret updates issue

ed20447

fix issue: istio#23715

williamaronli mentioned this issue Jun 12, 2020

fix ingress SDS not getting secret updates issue #24643

Closed

istio-policy-bot added the lifecycle/needs-escalation label Jun 16, 2020

williamaronli mentioned this issue Jun 19, 2020

Fix ingress SDS not getting secret updates and create unit tests #24817

Merged

istio-policy-bot removed the lifecycle/needs-escalation label Jun 23, 2020

williamaronli closed this as completed Jun 23, 2020

agoblet mentioned this issue Aug 10, 2020

Support Istio 1.6 kubeflow/kubeflow#5176

Closed

SabySen mentioned this issue Jul 30, 2021

Ingress SDS not updating secrets #34425

Closed
