ingress SDS not getting secret updates #23715

Closed
howardjohn opened this issue May 11, 2020 · 20 comments

@howardjohn
Member

Will add more info here later if I can reproduce it.

Running from 5f8807a

I deployed a secret with cert-manager letsencrypt-staging. I later moved it to prod, resulting in the secret updating. However, the gateway used the old cert. I then deleted the secret and recreated it -- same thing, old secret.

I verified the cert in config_dump does not match the secret.

Then I restarted the ingress pod and it finally picked up the proper cert

@howardjohn
Member Author

Another example:

2020-05-12T20:06:59.884794Z     info    secretfetcher   scrtUpdated is called on kubernetes secret httpbin-credential
2020-05-12T20:06:59.885457Z     error   sds     resource:httpbin-credential NotifyProxy failed. No connection with id "router~10.28.0.123~istio-ingressgateway-7fd679cd56-z6gvh.istio-system~istio-system.svc.cluster.local-7" can be found
2020-05-12T20:06:59.885488Z     error   cache   resource:httpbin-credential failed to notify secret change for proxy: no connection with id "router~10.28.0.123~istio-ingressgateway-7fd679cd56-z6gvh.istio-system~istio-system.svc.cluster.local-7" can be found

config dump clearly shows httpbin-credential is present

@howardjohn
Member Author

This doesn't happen 100% of the time. I wrote a simple script to verify:

for i in {0..50}; do
    openssl x509 -req -days 365 -CA example.com.crt -CAkey example.com.key -set_serial $i -in httpbin.example.com.csr -out httpbin.example.com.crt
    kubectl patch secret/httpbin-credential -n istio-system --type merge -p '{"data":{"tls.crt":"'$(cat httpbin.example.com.crt | base64 -w0)'", "tls.key":"'$(cat httpbin.example.com.key | base64 -w0)'"}}'
    ik pc secret istio-ingressgateway-7fd679cd56-gzw54.istio-system
done
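
The loop above prints the gateway's secret table on every iteration and leaves the comparison to the reader. Below is a minimal sketch of pulling the serial number out of that table instead. It assumes `ik pc secret` prints the same layout as the `istioctl proxy-config secret` output shown later in this thread, with the serial as the third field from the end. `secret_serial` is a hypothetical helper name, not part of istioctl.

```shell
# secret_serial NAME: read a proxy-config secret table on stdin and print the
# serial number listed for resource NAME. Prints nothing when the row has no
# certificate columns (for example while the secret is still WARMING).
# Assumed layout: NAME TYPE STATUS VALID-CERT SERIAL NOT-AFTER NOT-BEFORE.
secret_serial() {
    awk -v name="$1" '$1 == name && NF >= 7 { print $(NF - 2); exit }'
}
```

Inside the loop this could be used as `ik pc secret <pod> | secret_serial httpbin-credential`. Whether the serial is printed in decimal or hex is not established here, so check that before comparing it against `$i`.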

@JimmyCYJ
Member

Does the config dump show the old secret? It can take some time for the SDS flow to push the new one. Are istio-proxy logs available?

@howardjohn
Member Author

I am pretty sure what happens is:

  • Empty secret is created
  • Agent fails to read it
  • Secret update with key/cert

But we don't handle the update. I will verify this in a bit, just a guess right now

@howardjohn
Member Author

Btw, it's not just changing the secret: I reproduced this with a brand new cert, which is why I think the sequence above is the root cause. Will try to get a 100% reproducer.

@howardjohn
Member Author

I think there are two related problems

Deleting a secret and creating an empty secret behave differently

Scenario one:

  • Apply secret with cert
  • Apply empty secret
  • Remove gw and reapply
  • Nothing happens, old cert remains, old cert is sent

Scenario two:

  • Apply secret with cert
  • Delete secret
  • Nothing happens, old cert remains
  • Remove gw and reapply, NO cert is sent

Empty secret means

apiVersion: v1
data:
  ca.crt: ""
  tls.crt: ""
  tls.key: ""
kind: Secret
metadata:
  name: certificate
  namespace: istio-system
type: kubernetes.io/tls

Secret update broken in some cases. Reproducer:

  • Create secret+gw
  • Delete secret
  • Delete gw
    <wait 45s for drain>
  • Create gw
  • Recreate secret
  • Secret is stuck warming permanently
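
The reproducer steps above can be sketched as a script. This is only a sketch under assumptions: the secret, key, cert, and Gateway file names are placeholders passed in by the caller. `KUBECTL` and `DRAIN_SECONDS` are hypothetical overrides added so the sequence can be dry-run.

```shell
# Overridable so the sequence can be dry-run (e.g. KUBECTL=echo DRAIN_SECONDS=0).
KUBECTL=${KUBECTL:-kubectl}
DRAIN_SECONDS=${DRAIN_SECONDS:-45}

# reproduce_stuck_warming SECRET KEY_FILE CERT_FILE GATEWAY_YAML
# All four arguments are placeholders supplied by the caller.
reproduce_stuck_warming() {
    secret=$1 key=$2 cert=$3 gateway=$4
    # Create secret + gw
    $KUBECTL create -n istio-system secret tls "$secret" --key="$key" --cert="$cert"
    $KUBECTL apply -f "$gateway"
    # Delete secret, delete gw, then wait for the drain
    $KUBECTL delete -n istio-system secret "$secret"
    $KUBECTL delete -f "$gateway"
    sleep "$DRAIN_SECONDS"
    # Create gw first, then recreate the secret
    $KUBECTL apply -f "$gateway"
    $KUBECTL create -n istio-system secret tls "$secret" --key="$key" --cert="$cert"
}
```

After it finishes, `istioctl proxy-config secret <ingress pod>` is where the stuck state would show up.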

@howardjohn
Member Author

I confirmed similar behavior is present in Istio 1.5, so it's not a (new) regression

howardjohn added this to the 1.7 milestone May 19, 2020
@howardjohn
Member Author

Bumping to P0 as we have a reproducer now

@JimmyCYJ
Member

JimmyCYJ commented May 19, 2020

Thanks @howardjohn

Secret update broken in some cases. Reproducer:

  • Create secret+gw
  • Delete secret
  • Delete gw
    <wait 45s for drain>
  • Create gw
  • Recreate secret
    Secret is stuck warming permanent

@williamaronli could you take a look and try to reproduce it following these steps?

===============
Scenario one:

  • Apply secret with cert
  • Apply empty secret
  • Remove gw and reapply
  • Nothing happens, old cert remains, old cert is sent

What happens here is that when the empty secret is applied, the SDS agent detects that it is empty and rejects it. The copy cached in the SDS agent is the last valid secret, which is the old cert. When the gw is later removed and reapplied, the gw asks the SDS agent for the secret, and the SDS agent pushes the cached secret to the gw. This is expected.

=============

Scenario two:

  • Apply secret with cert
  • Delete secret
  • Nothing happens, old cert remains
  • Remove gw and reapply, NO cert is sent

What happens here is that when the secret is deleted, the SDS agent removes the cached secret as well. That's why the gw does not get a cert after it is removed and reapplied.

==============

@williamaronli
Contributor

Before running the reproducer, I verified that the setup works:
fengxiangli@williamaronli:~/istio-1.6.0$ curl -v -HHost:httpbin.example.com --resolve "httpbin.example.com:$SECURE_INGRESS_PORT:$INGRESS_HOST" \
    --cacert example.com.crt "https://httpbin.example.com:$SECURE_INGRESS_PORT/status/418"
< HTTP/2 418
< server: istio-envoy
< date: Tue, 26 May 2020 17:07:55 GMT
< x-more-info: http://tools.ietf.org/html/rfc2324
< access-control-allow-origin: *
< access-control-allow-credentials: true
< content-length: 135
< x-envoy-upstream-service-time: 2
<

-=[ teapot ]=-

   _...._
 .'  _ _ `.
| ."` ^ `". _,
\_;`"---"`|//
  |       ;/
  \_     _/
    `"""`

  1. Create secret+gw:
    fengxiangli@williamaronli:/istio-1.6.0$ kubectl create -n istio-system secret tls httpbin-credential --key=httpbin.example.com.key --cert=httpbin.example.com.crt
    secret/httpbin-credential created
    fengxiangli@williamaronli:/istio-1.6.0$ cat <<EOF | kubectl apply -f -
    apiVersion: networking.istio.io/v1alpha3
    kind: Gateway
    metadata:
      name: mygateway
    spec:
      selector:
        istio: ingressgateway # use istio default ingress gateway
      servers:
      - port:
          number: 443
          name: https
          protocol: HTTPS
        tls:
          mode: SIMPLE
          credentialName: httpbin-credential # must be the same as secret
        hosts:
        - httpbin.example.com
    EOF
    gateway.networking.istio.io/mygateway created

  2. Delete secret
    fengxiangli@williamaronli:/istio-1.6.0$ kubectl -n istio-system delete secret httpbin-credential
    secret "httpbin-credential" deleted

  3. Delete gw
    <wait 45s for drain>
    fengxiangli@williamaronli:/istio-1.6.0$ kubectl delete gateway mygateway
    gateway.networking.istio.io "mygateway" deleted

  4. Create gw
    fengxiangli@williamaronli:/istio-1.6.0$ cat <<EOF | kubectl apply -f -
    apiVersion: networking.istio.io/v1alpha3
    kind: Gateway
    metadata:
      name: mygateway
    spec:
      selector:
        istio: ingressgateway # use istio default ingress gateway
      servers:
      - port:
          number: 443
          name: https
          protocol: HTTPS
        tls:
          mode: SIMPLE
          credentialName: httpbin-credential # must be the same as secret
        hosts:
        - httpbin.example.com
    EOF
    gateway.networking.istio.io/mygateway created

  5. Recreate secret
    fengxiangli@williamaronli:/istio-1.6.0$ kubectl create -n istio-system secret tls httpbin-credential --key=httpbin.example.com.key --cert=httpbin.example.com.crt
    secret/httpbin-credential created

  6. Secret is stuck warming permanently

fengxiangli@williamaronli:/istio-1.6.0$ kubectl get pods -n istio-system
NAME                                    READY   STATUS    RESTARTS   AGE
grafana-54b54568fc-h8gtl                1/1     Running   0          2d23h
istio-egressgateway-5cbb74b6d-7kjf6     1/1     Running   0          2d23h
istio-ingressgateway-74cb7595bd-tgxqn   1/1     Running   0          2d23h
istio-tracing-9dd6c4f7c-hnzsl           1/1     Running   0          2d23h
istiod-6b8bf87986-29tgd                 1/1     Running   0          2d23h
kiali-d45468dc4-2g65t                   1/1     Running   0          2d23h
prometheus-5c84c494dd-5dcbg             2/2     Running   0          2d23h

fengxiangli@williamaronli:/istio-1.6.0$ istioctl proxy-config secret istio-ingressgateway-74cb7595bd-tgxqn.istio-system
RESOURCE NAME          TYPE           STATUS      VALID CERT     SERIAL NUMBER                               NOT AFTER                NOT BEFORE
httpbin-credential                    WARMING     false
default                Cert Chain     ACTIVE      true           117649866128878731488369555542841752511     2020-05-27T11:13:24Z     2020-05-26T11:13:24Z
ROOTCA                 CA             ACTIVE      true           153371244782393518006123463350295732481     2030-05-21T23:08:13Z     2020-05-23T23:08:13Z

@williamaronli
Contributor

williamaronli commented May 26, 2020

Restart the ingress pod:

  1. fengxiangli@williamaronli:/istio-1.6.0$ kubectl -n istio-system rollout restart deployment istio-ingressgateway
    deployment.apps/istio-ingressgateway restarted

  2. fengxiangli@williamaronli:/istio-1.6.0$ kubectl get pods -n istio-system
    NAME                                    READY   STATUS    RESTARTS   AGE
    grafana-54b54568fc-h8gtl                1/1     Running   0          2d23h
    istio-egressgateway-5cbb74b6d-7kjf6     1/1     Running   0          2d23h
    istio-ingressgateway-5f9b77885c-99f44   1/1     Running   0          67s
    istio-tracing-9dd6c4f7c-hnzsl           1/1     Running   0          2d23h
    istiod-6b8bf87986-29tgd                 1/1     Running   0          2d23h
    kiali-d45468dc4-2g65t                   1/1     Running   0          2d23h
    prometheus-5c84c494dd-5dcbg             2/2     Running   0          2d23h

fengxiangli@williamaronli:/istio-1.6.0$ istioctl proxy-config secret istio-ingressgateway-5f9b77885c-99f44.istio-system
RESOURCE NAME          TYPE           STATUS      VALID CERT     SERIAL NUMBER                               NOT AFTER                NOT BEFORE
httpbin-credential     Cert Chain     ACTIVE      true           0                                           2021-05-26T18:44:29Z     2020-05-26T18:44:29Z
default                Cert Chain     ACTIVE      true           74707085384625053020352826882864255223      2020-05-27T22:57:09Z     2020-05-26T22:57:09Z
ROOTCA                 CA             ACTIVE      true           153371244782393518006123463350295732481     2030-05-21T23:08:13Z     2020-05-23T23:08:13Z

Now, after restarting the pod, the cert is active.

@williamaronli
Contributor

Check the logs using:
kubectl logs -n istio-system "$(kubectl get pod -l istio=ingressgateway -n istio-system -o jsonpath='{.items[0].metadata.name}')"

2020-05-27T01:53:31.525183Z error sds Remote side closed connection
2020-05-27T01:53:31.526147Z info transport: loopyWriter.run returning. connection error: desc = "transport is closing"
2020-05-27T01:58:48.941856Z info sds resource:httpbin-credential new connection
2020-05-27T01:58:48.944907Z warn secretfetcher Cannot find secret httpbin-credential, searching for fallback secret gateway-fallback
2020-05-27T01:58:48.944939Z error secretfetcher cannot find secret httpbin-credential and cannot find fallback secret gateway-fallback
2020-05-27T01:58:48.944945Z warn cache resource:httpbin-credential SecretFetcher cannot find secret httpbin-credential from cache
2020-05-27T01:58:48.944958Z warn sds resource:httpbin-credential waiting for ingress gateway secret for proxy "router~10.244.0.7~istio-ingressgateway-74cb7595bd-2nllr.istio-system~istio-system.svc.cluster.local"
2020-05-27T01:58:58.534441Z error sds resource:httpbin-credential NotifyProxy failed. No connection with id "router~10.244.0.7~istio-ingressgateway-74cb7595bd-2nllr.istio-system~istio-system.svc.cluster.local-4" can be found
2020-05-27T01:58:58.534500Z error cache resource:httpbin-credential failed to notify secret change for proxy: no connection with id "router~10.244.0.7~istio-ingressgateway-74cb7595bd-2nllr.istio-system~istio-system.svc.cluster.local-4" can be found
2020-05-27T02:14:29.905068Z info Subchannel Connectivity change to CONNECTING
2020-05-27T02:14:29.905156Z info transport: loopyWriter.run returning. connection error: desc = "transport is closing"
2020-05-27T02:14:29.905464Z info pickfirstBalancer: HandleSubConnStateChange: 0xc0003247d0, {CONNECTING }
2020-05-27T02:14:29.905535Z info Channel Connectivity change to CONNECTING
2020-05-27T02:14:29.905363Z info Subchannel picks a new address "istiod.istio-system.svc:15012" to connect
2020-05-27T02:14:29.912461Z info Subchannel Connectivity change to READY
2020-05-27T02:14:29.912506Z info pickfirstBalancer: HandleSubConnStateChange: 0xc0003247d0, {READY }
2020-05-27T02:14:29.912513Z info Channel Connectivity change to READY

@williamaronli
Contributor

williamaronli commented May 28, 2020

This is a flaky issue:

Whether the secret is successfully pushed to the ingress gateway proxy depends on the interval between creating the gw and creating the secret.
If the interval between creating the gateway and creating the secret is too long, the secret stays in WARMING status permanently until the pod is restarted.
1 second interval (failed):
bash ~/create_gw.sh; sleep 1;bash ~/create_secret.sh

RESOURCE NAME          TYPE           STATUS      VALID CERT     SERIAL NUMBER                               NOT AFTER                NOT BEFORE
httpbin-credential                    WARMING     false

0.2 second (success):

bash ~/create_gw.sh; sleep 0.2;bash ~/create_secret.sh

RESOURCE NAME          TYPE           STATUS     VALID CERT     SERIAL NUMBER                               NOT AFTER                NOT BEFORE
httpbin-credential     Cert Chain     ACTIVE     true           0                                           2021-05-27T23:30:43Z     2020-05-27T23:30:43Z

And if we switch the order and create the secret before the gateway, the problem does not show up:

bash ~/create_secret.sh; sleep 5;bash ~/create_gw.sh 
istioctl proxy-config secret istio-ingressgateway-76ccf47fb4-z8vz8.istio-system
RESOURCE NAME          TYPE           STATUS     VALID CERT     SERIAL NUMBER                               NOT AFTER                NOT BEFORE
httpbin-credential     Cert Chain     ACTIVE     true           0                                           2021-05-27T23:30:43Z     2020-05-27T23:30:43Z
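
The timing experiments above check the status by eye after each run. Below is a rough sketch of polling for it instead, assuming the table layout shown above and that ACTIVE and WARMING are the only statuses of interest (they are the only two seen in this thread). `secret_status`, `wait_for_active`, and the `ISTIOCTL` override are hypothetical names introduced for illustration.

```shell
ISTIOCTL=${ISTIOCTL:-istioctl}

# secret_status NAME: read a proxy-config secret table on stdin and print the
# STATUS column (ACTIVE or WARMING) of the row whose first field is NAME.
secret_status() {
    awk -v name="$1" '$1 == name {
        for (i = 2; i <= NF; i++)
            if ($i == "ACTIVE" || $i == "WARMING") { print $i; exit }
    }'
}

# wait_for_active POD NAME [TRIES] [DELAY_SECONDS]
# Polls until NAME is ACTIVE on POD. Returns 0 on success, or 1 if it is
# still not ACTIVE after TRIES polls.
wait_for_active() {
    pod=$1 name=$2 tries=${3:-30} delay=${4:-1}
    while [ "$tries" -gt 0 ]; do
        status=$($ISTIOCTL proxy-config secret "$pod" | secret_status "$name")
        [ "$status" = "ACTIVE" ] && return 0
        tries=$((tries - 1))
        sleep "$delay"
    done
    return 1
}
```

For example, `bash ~/create_gw.sh; sleep 1; bash ~/create_secret.sh; wait_for_active <pod>.istio-system httpbin-credential` would be expected to return non-zero in the failing 1 second case described above.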

error debug log

https://docs.google.com/document/d/1z4EnJ-T9caRHbABFw-wVfdRZNh0szs2lFpwD1aInrc0/edit

Potential root cause

My guess is that if the gateway tries to fetch the secret before the secret has been created or is ready, the gateway gets stuck and never retries the fetch.

Potential solution

  1. Set up a retry policy in the ingress gateway after some timeout

@williamaronli
Contributor

Another finding:

  1. Restarting the ingress pod is one way to recover
  2. Deleting the secret and then recreating it also works: the ingress gets the secret

@williamaronli
Contributor

williamaronli commented Jun 8, 2020

Current finding on the root cause:
The connection cache is not hit when the secret is created after the gateway.

error log:

2020-06-05T05:26:29.309881Z	error	sds	resource:httpbin-credential NotifyProxy failed. No connection with id "router~10.244.0.14~istio-ingressgateway-7b986dc8bc-kb6vk.istio-system~istio-system.svc.cluster.local-4" can be found
2020-06-05T05:26:29.309951Z	error	cache	resource:httpbin-credential failed to notify secret change for proxy: no connection with id 

code: https://github.com/istio/istio/blob/master/security/pkg/nodeagent/sds/sdsservice.go#L482
From the code, if the connKey doesn't hit the SDS clients cache, it returns an error and does not notify the proxy about the secret update.
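
To make the failure mode concrete, here is a toy model of that lookup. It is not the Istio code: the "cache" is just a list of connection keys, and `add_client` and `notify_proxy` are made-up names. It only shows that a key missing from the clients list means the update is never pushed, and says nothing about why the key is missing.

```shell
# Toy model only: one known connection key per line.
sds_clients=""

# add_client CONN_KEY: record a connected client.
add_client() {
    sds_clients=$(printf '%s\n%s' "$sds_clients" "$1")
}

# notify_proxy CONN_KEY RESOURCE: push RESOURCE if CONN_KEY is known,
# otherwise report the miss on stderr and return 1 without pushing.
notify_proxy() {
    conn_key=$1 resource=$2
    if printf '%s\n' "$sds_clients" | grep -Fxq -- "$conn_key"; then
        echo "pushed $resource to $conn_key"
    else
        echo "resource:$resource NotifyProxy failed. No connection with id \"$conn_key\" can be found" >&2
        return 1
    fi
}
```

If the agent only knows a connection ending in `-5` while the secret cache still holds a key ending in `-4`, `notify_proxy` fails and nothing else re-sends the secret. That would match the permanent WARMING state seen above.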

Some workaround methods:

  1. Ask customers to follow the correct order: create the secret before the gateway
  2. Instead of restarting the pod, delete the secret and recreate it; the secret will then be pushed to the proxy
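
The second workaround can be wrapped in a small helper. This is a sketch, assuming a TLS secret in `istio-system` created from key and cert files as earlier in this thread. `recreate_tls_secret` and the `KUBECTL` override are hypothetical names.

```shell
KUBECTL=${KUBECTL:-kubectl}

# recreate_tls_secret NAME KEY_FILE CERT_FILE [NAMESPACE]
# Deletes the secret if it exists, then creates it again from the given files.
recreate_tls_secret() {
    name=$1 key=$2 cert=$3 ns=${4:-istio-system}
    $KUBECTL -n "$ns" delete secret "$name" --ignore-not-found
    $KUBECTL -n "$ns" create secret tls "$name" --key="$key" --cert="$cert"
}
```

For example: `recreate_tls_secret httpbin-credential httpbin.example.com.key httpbin.example.com.crt`.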

@howardjohn
Member Author

ask the customers to follow the correct order. create secret first before

This is not the "correct" order. One of the most common use cases is deploying certs with cert-manager and this is currently pretty broken today. We cannot add this arbitrary restriction. Pushing some data to envoy shouldn't need to be this complicated

@myidpt
Contributor

myidpt commented Jun 23, 2020

Folks, is this fixed?

@williamaronli
Contributor

williamaronli commented Jun 23, 2020

It is fixed by this PR: #24817

@SabySen

SabySen commented Jul 29, 2021

I am working with Istio 1.10.3 and see this issue happening again. I restarted every possible artifact and it still gives the same error. Details below.
-- Gateway --

apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: internal-ingress-gateway
  namespace: default
spec:
  selector:
    istio: internal-ingressgateway # use Istio default gateway implementation
  servers:
  - port:
      number: 443
      name: https
      protocol: HTTPS
    tls:
      mode: SIMPLE
      credentialName: internal-gkegatewaysecret
    hosts:
    - "*"

Here is the secret (redacted):

Name:         internal-gkegatewaysecret
Namespace:    istio-system
Labels:       app.kubernetes.io/instance=istio-certs
Annotations:  argocd.argoproj.io/sync-wave: -15
              cert-manager.io/alt-names:
                internal-non-prod.digital-subscription-qa.cvs.com,internal-sit1.digital-subscription-qa.cvs.com,internal-sit2.digital-subscription-qa.cvs.com,internal-argocd-np.digital-subsc...
              cert-manager.io/certificate-name: internal-gkegatewaysecret
              cert-manager.io/common-name:  internal-non-prod.digital-subscription-qa.cvs.com
              cert-manager.io/ip-sans:
              cert-manager.io/issuer-group: cert-manager.io
              cert-manager.io/issuer-kind: ClusterIssuer
              cert-manager.io/issuer-name: tpp-venafi-issuer
              cert-manager.io/uri-sans:

Type:  Opaque

Data
====
ca.crt:   1338 bytes
cert:     6677 bytes
key:      1679 bytes
tls.crt:  4444 bytes
tls.key:  1679 bytes

This gateway was previously associated with a different, older cert, and even after the change the old cert is still being served. This is completely breaking our calls to the services exposed through it, because TLS fails when the host name does not match the cert, e.g. as below:

POST https://internal-sit1.digital-subscription-qa.cvs.com/status
Error: Hostname/IP does not match certificate's altnames: Host: internal-sit1.digital-subscription-qa.cvs.com. is not in the cert's altnames: DNS:dev1.digital-subscription-dev.cvs.com
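
One way to confirm which cert the gateway is really serving is to read its subjectAltName list and check the host against it. This is a sketch under assumptions: `host_in_sans` is a hypothetical helper, it only does exact matches on `DNS:` entries, and it does not expand wildcards.

```shell
# host_in_sans HOST: succeed if HOST appears as an exact DNS entry in the
# subjectAltName text read from stdin, e.g.
#   "DNS:a.example.com, DNS:b.example.com"
# Wildcard entries such as DNS:*.example.com are NOT expanded in this sketch.
host_in_sans() {
    tr ',' '\n' |
        sed 's/^[[:space:]]*//; s/[[:space:]]*$//' |
        grep -Fxq -- "DNS:$1"
}
```

With a reasonably recent OpenSSL (the `-ext` option, available in 1.1.1 and later as far as I know), the input could come from `openssl s_client -connect "$HOST:443" -servername "$HOST" </dev/null 2>/dev/null | openssl x509 -noout -ext subjectAltName | host_in_sans "$HOST"`.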
