Multiple replicas of istiod #20047

Closed
howardjohn opened this issue Jan 9, 2020 · 5 comments

Member

@howardjohn howardjohn commented Jan 9, 2020

A while back we had issues with multiple Citadels and multiple Galleys. We (allegedly 🙂) fixed these issues, but I am not sure how.

We should verify that whatever fixes were applied still hold in the istiod model. We will of course test with multiple replicas, but subtle race conditions may be missed, so we should also reason through the code to check that the fixes are still valid.

cc @myidpt

@howardjohn howardjohn added this to the 1.5 milestone Jan 9, 2020
Member Author

@howardjohn howardjohn commented Jan 10, 2020

I created 50 replicas and ran into a crash:

2020-01-10T01:20:14.163876Z     info    No certificates specified, skipping DNS certificate controller
2020-01-10T01:20:14.177793Z     info    CRD controller watching namespaces ""
2020-01-10T01:20:14.178057Z     info    Adding Kubernetes registry adapter
2020-01-10T01:20:14.178073Z     info    Service controller watching namespace "" for services, endpoints, nodes and pods, refresh 1m0s
2020-01-10T01:20:14.178607Z     info    Setting up event handlers
2020-01-10T01:20:14.178751Z     info    Use self-signed certificate as the CA certificate
2020-01-10T01:20:14.178774Z     info    Starting Secrets controller
2020-01-10T01:20:14.178852Z     info    Waiting for informer caches to sync
2020-01-10T01:20:14.230757Z     info    pkiCaLog        Load signing key and cert from existing secret istio-system:istio-ca-secret
2020-01-10T01:20:14.245712Z     info    pkiCaLog        Using existing public key: -----BEGIN CERTIFICATE-----
MIIC3TCCAcWgAwIBAgIQUbdcFEzJAw6ogsSHh0D43zANBgkqhkiG9w0BAQsFADAY
MRYwFAYDVQQKEw1jbHVzdGVyLmxvY2FsMB4XDTE5MTIxODAwMTIyOVoXDTI5MTIx
NTAwMTIyOVowGDEWMBQGA1UEChMNY2x1c3Rlci5sb2NhbDCCASIwDQYJKoZIhvcN
AQEBBQADggEPADCCAQoCggEBAMJ/FOHAzpvpRUYBO8Sr2Mkgoo88aXCfKzaatzbc
8uEFuUlRsIEbFpkpTFj04Z3mF7vibGQlx0JVje7/vfZgMn9vY+pHqjflqWO/a6Hq
NjcZWYKrll1FSm7WzKkg12Hhe27+P7pFITqExa4HW3wV+nQgrnM2cms+jzxWTsHR
bM8lg/hfqB7XzdQzqt4spnpEBezxz8ccITCUSNCRxeh//V9OjTlkWynvWmKYQWrR
Gu8QgvH0sFGx2NXc02vzMd0yj2iQhcobGSiGYHSQr3P3h5QNLg/A+0R543XvdtQk
li+zFF7XKmqFd5O4fp3ekwLsQRsJWqNafKd8BpaXE3cEwykCAwEAAaMjMCEwDgYD
VR0PAQH/BAQDAgIEMA8GA1UdEwEB/wQFMAMBAf8wDQYJKoZIhvcNAQELBQADggEB
ALi5GUx6ZtNvnRA/LAO8u43WdCWlw+JrFQoHuKtZToctlpWhlmc+o+fR9o+eFPZn
qYygNvjHYn1uH8C1Jsz1E3E6HRCNAKGygLmEiRFGfflTxbfH42Cp8fzxQVu22Dso
UBwBoiWl+aXe09b+eXb0amGalLDUbDd70Cb/oRa74j2yEMOktDlMmM4OtpuE0lkE
9M4RyYWEVxlwZeda9ayjow2SrqABgi1TWQLp2eSSScfP3NXq2bz0+gk9YScJ43eG
5/Tby+VbCnB40c190vHXuFdRzh3VeZPR1BBxbiLV6u5ygwbdSWtQZjxp3P34J3fl
BtPHwks4v6m9rIDljFAhrCo=
-----END CERTIFICATE-----

2020-01-10T01:20:14.253622Z     info    pkiCaLog        The Citadel's public key is successfully written into configmap istio-security in namespace istio-system.
2020-01-10T01:20:14.253679Z     info    rootCertRotator Set up back off time 56m56s to start rotator.
2020-01-10T01:20:14.253700Z     info    serverCaLog     added client certificate authenticator
2020-01-10T01:20:14.253785Z     info    rootCertRotator Jitter is enabled, wait 56m56s before starting root cert rotator.
2020-01-10T01:20:14.253868Z     info    serverCaLog     added K8s JWT authenticator
2020-01-10T01:20:14.254311Z     info    Istiod CA has started
2020-01-10T01:20:14.254332Z     info    istiod namespace controller has started
2020-01-10T01:20:14.254330Z     info    serverCaLog     Starting GRPC server on port 0
2020-01-10T01:20:14.254600Z     error   failed to create discovery service: grpcDNS: tls: private key does not match public key
Error: failed to create discovery service: grpcDNS: tls: private key does not match public key

44 pods run fine; 6 pods are always crash looping like this. If I kill them, they come back successfully, so somehow getting a new pod fixes it but a new container does not.
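For reference, `tls: private key does not match public key` is the message Go's `crypto/tls` returns from `tls.X509KeyPair` when a certificate is loaded together with a private key that does not correspond to its public key. Below is a minimal sketch that reproduces the message in isolation. The helpers `newKeyAndCert` and `loadPair` are made up for illustration, and this is not istiod's actual certificate loading code:

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/tls"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"fmt"
	"math/big"
	"time"
)

// newKeyAndCert (hypothetical helper) returns a PEM-encoded self-signed
// certificate and the PEM-encoded private key that matches it.
func newKeyAndCert() (certPEM, keyPEM []byte, err error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "istiod.istio-system.svc"},
		NotBefore:    time.Now().Add(-time.Minute),
		NotAfter:     time.Now().Add(time.Hour),
	}
	// Self-signed: the template is its own parent and is signed by key.
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, nil, err
	}
	keyDER, err := x509.MarshalECPrivateKey(key)
	if err != nil {
		return nil, nil, err
	}
	certPEM = pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der})
	keyPEM = pem.EncodeToMemory(&pem.Block{Type: "EC PRIVATE KEY", Bytes: keyDER})
	return certPEM, keyPEM, nil
}

// loadPair (hypothetical helper) reports whether crypto/tls accepts the
// certificate and key as a pair.
func loadPair(certPEM, keyPEM []byte) error {
	_, err := tls.X509KeyPair(certPEM, keyPEM)
	return err
}

func main() {
	certA, keyA, err := newKeyAndCert()
	if err != nil {
		panic(err)
	}
	_, keyB, err := newKeyAndCert()
	if err != nil {
		panic(err)
	}
	// The certificate with its own key loads without error.
	fmt.Println("cert A + key A:", loadPair(certA, keyA))
	// The certificate with another replica's key is rejected.
	fmt.Println("cert A + key B:", loadPair(certA, keyB))
}
```

The second call is the case of interest: a certificate issued for one key, stored next to a different private key, is rejected every time it is loaded. That would fit a crash loop that persists across container restarts for as long as the same files are reused.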

Member Author

@howardjohn howardjohn commented Jan 10, 2020

Some also crash at startup with:

2020-01-09T23:51:42.664999Z     info    pkiCaLog        Load signing key and cert from existing secret istio-system:istio-ca-secret
2020-01-09T23:51:42.665641Z     info    pkiCaLog        Using existing public key: -----BEGIN CERTIFICATE-----
MIIC3TCCAcWgAwIBAgIQUbdcFEzJAw6ogsSHh0D43zANBgkqhkiG9w0BAQsFADAY
MRYwFAYDVQQKEw1jbHVzdGVyLmxvY2FsMB4XDTE5MTIxODAwMTIyOVoXDTI5MTIx
NTAwMTIyOVowGDEWMBQGA1UEChMNY2x1c3Rlci5sb2NhbDCCASIwDQYJKoZIhvcN
AQEBBQADggEPADCCAQoCggEBAMJ/FOHAzpvpRUYBO8Sr2Mkgoo88aXCfKzaatzbc
8uEFuUlRsIEbFpkpTFj04Z3mF7vibGQlx0JVje7/vfZgMn9vY+pHqjflqWO/a6Hq
NjcZWYKrll1FSm7WzKkg12Hhe27+P7pFITqExa4HW3wV+nQgrnM2cms+jzxWTsHR
bM8lg/hfqB7XzdQzqt4spnpEBezxz8ccITCUSNCRxeh//V9OjTlkWynvWmKYQWrR
Gu8QgvH0sFGx2NXc02vzMd0yj2iQhcobGSiGYHSQr3P3h5QNLg/A+0R543XvdtQk
li+zFF7XKmqFd5O4fp3ekwLsQRsJWqNafKd8BpaXE3cEwykCAwEAAaMjMCEwDgYD
VR0PAQH/BAQDAgIEMA8GA1UdEwEB/wQFMAMBAf8wDQYJKoZIhvcNAQELBQADggEB
ALi5GUx6ZtNvnRA/LAO8u43WdCWlw+JrFQoHuKtZToctlpWhlmc+o+fR9o+eFPZn
qYygNvjHYn1uH8C1Jsz1E3E6HRCNAKGygLmEiRFGfflTxbfH42Cp8fzxQVu22Dso
UBwBoiWl+aXe09b+eXb0amGalLDUbDd70Cb/oRa74j2yEMOktDlMmM4OtpuE0lkE
9M4RyYWEVxlwZeda9ayjow2SrqABgi1TWQLp2eSSScfP3NXq2bz0+gk9YScJ43eG
5/Tby+VbCnB40c190vHXuFdRzh3VeZPR1BBxbiLV6u5ygwbdSWtQZjxp3P34J3fl
BtPHwks4v6m9rIDljFAhrCo=
-----END CERTIFICATE-----

2020-01-09T23:51:42.681047Z     info    pkiCaLog        The Citadel's public key is successfully written into configmap istio-security in namespace istio-system.
2020-01-09T23:51:42.681090Z     info    rootCertRotator Set up back off time 32m35s to start rotator.
2020-01-09T23:51:42.681115Z     info    serverCaLog     added client certificate authenticator
2020-01-09T23:51:42.681159Z     info    rootCertRotator Jitter is enabled, wait 32m35s before starting root cert rotator.
2020-01-09T23:51:42.681315Z     info    serverCaLog     added K8s JWT authenticator
2020-01-09T23:51:42.681943Z     info    Istiod CA has started
2020-01-09T23:51:42.681970Z     info    istiod namespace controller has started
2020-01-09T23:51:42.681996Z     info    Generating K8S-signed cert for [istio-pilot.istio-system.svc istiod.istio-system.svc]
2020-01-09T23:51:42.682059Z     info    serverCaLog     Starting GRPC server on port 0
2020-01-09T23:51:43.777654Z     error   failed to delete CSR (domain-cluster.local-ns-istio-system-secret-istio-pilot.csr.secret): certificatesigningrequests.certificates.k8s.io "domain-cluster.local-ns-istio-system-secret-istio-pilot.csr.secret" not found
2020-01-09T23:51:43.777689Z     error   failed to clean up CSR (domain-cluster.local-ns-istio-system-secret-istio-pilot.csr.secret): certificatesigningrequests.certificates.k8s.io "domain-cluster.local-ns-istio-system-secret-istio-pilot.csr.secret" not found
2020-01-09T23:51:43.777705Z     error   failed to create discovery service: grpcDNS: certificatesigningrequests.certificates.k8s.io "domain-cluster.local-ns-istio-system-secret-istio-pilot.csr.secret" not found
Error: failed to create discovery service: grpcDNS: certificatesigningrequests.certificates.k8s.io "domain-cluster.local-ns-istio-system-secret-istio-pilot.csr.secret" not found

Contributor

@lei-tang lei-tang commented Jan 10, 2020

For the CSR part: #20052

Member Author

@howardjohn howardjohn commented Jan 10, 2020

@lei-tang I suspect that actually fixes the original issue as well. The interesting thing to note is that the DNS cert folder is persisted when the container crashes, which is why it never recovers. I suspect what happens is that a replica gets mixed up with another Pilot's CSR, so it ends up with an invalid key pair.

Member Author

@howardjohn howardjohn commented Jan 17, 2020

I have been testing with many replicas for a long time. No issues after the above fix.

@howardjohn howardjohn closed this Jan 17, 2020