x509: certificate has expired or is not yet valid #17718
Comments
This eventually auto-healed after about a half-day outage |
I think the auto-heal was caused by Galley restarting. The sidecar injector also shows 45 restarts, so that is probably why only the validation webhook was down and not injection? Edit: the Galley restart was most likely due to a rolling upgrade of GKE nodes, which rescheduled it |
I just did another upgrade and immediately ran into the same issue, still in a broken state now |
Somehow getting patched with a really old cert? EDIT: never mind, I read the year wrong. The cert looks valid.
Galley also logging Citadel logs when I started the upgrade and it starts failing watches, etc:
|
Can you tell me how you did the upgrade? |
I pressed the "upgrade master" button in GKE. Then a few hours later I updated the node pool as well. This morning I upgraded just the master. I think it's in a state where it's permanently broken now; it only recovered last time because all the pods were restarted by the node upgrade |
I did the update around 15:55:00. Looks like at 16:01 galley actually does reload the cert and key, but somehow things still aren't working with errors like |
Galley logs: Galley has a valid cert as well. We restarted Galley and the problem is probably resolved. We will see what happens in ~4 hours when the cert expires again |
To give more context: Jason: can you spot any changes on Galley between V1.1.12 and V1.1.13 that could impact the cert reload feature? |
My suspicion is it is related to the upgrading, not which version is being upgraded to. So I think 1.12.x -> 1.12.y would cause this too. We will see in a few hours 🙂 |
k8s secret file mount propagation can take several minutes. I verified that Galley re-loads the local cert/key file immediately after the file mount is updated.
No.
$ git diff 1.1.12..1.1.13 --stat
 istio.deps                          | 2 +-
 pilot/docker/Dockerfile.proxy_debug | 8 ++++++++
 pilot/docker/Dockerfile.proxytproxy | 8 ++++++++
 pilot/docker/Dockerfile.proxyv2     | 8 ++++++++
 4 files changed, 25 insertions(+), 1 deletion(-) |
I think there is some confusion here. The upgrades I am referring to are GKE versions. Istio has been the same version of master the entire time. Oliver's mention of 1.1.12 should be 1.12, etc.
The gap between :55 and :01 is not concerning at all. I was stating here that this is actually working -- the caBundle on the webhook is correct it seems. |
The cert in Galley would have expired by now and things are still working, meaning the reloading is working, just not during a GKE master upgrade. @ayj also tried on a different cluster and couldn't reproduce; possible differences are cert TTL and cluster size. |
I had the same issue. After renewing the certificate with root-transition.sh and proceeding with the instructions, everything works fine now. |
I don't think I need to renew the cert? |
I ran into this again, this time no master upgrade or anything |
No master upgrade, but failure to connect to API server
Result is that anytime the API server goes down, for even a short period of time, the cluster is permanently broken until galley is restarted |
From jason ^ |
Galley watched and reloaded key/certs prior to istio 1.3. istio#12571 refactored galley's validation into two parts: (1) config controller and (2) the webhook server. watch/reload logic was retained in (1) and not added at all to (2). This PR adds the missing watch/reload code to (2). fix istio#17718
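A minimal Go sketch of the reload half of the watch/reload idea described above. This is not Galley's actual code: `certReloader`, `reload`, `getCertificate`, `writeSelfSigned`, and `rotateDemo` are hypothetical names, the key pair is a throwaway self-signed one generated only for the demo, and the file watch that would trigger `reload` is left out (it is called by hand here).

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"crypto/tls"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"fmt"
	"math/big"
	"os"
	"path/filepath"
	"sync/atomic"
	"time"
)

// certReloader (hypothetical) holds the serving key pair so it can be swapped at runtime.
type certReloader struct {
	certFile, keyFile string
	current           atomic.Pointer[tls.Certificate]
}

// reload re-reads the pair from disk. On error the previous pair stays active.
func (r *certReloader) reload() error {
	pair, err := tls.LoadX509KeyPair(r.certFile, r.keyFile)
	if err != nil {
		return err
	}
	r.current.Store(&pair)
	return nil
}

// getCertificate has the signature of tls.Config.GetCertificate, so every new
// handshake is served with whichever pair was loaded most recently.
func (r *certReloader) getCertificate(*tls.ClientHelloInfo) (*tls.Certificate, error) {
	return r.current.Load(), nil
}

// writeSelfSigned writes a throwaway self-signed pair to disk (demo only).
func writeSelfSigned(certFile, keyFile string, serial int64) error {
	pub, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		return err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(serial),
		Subject:      pkix.Name{CommonName: "webhook.example"},
		NotBefore:    time.Now().Add(-time.Minute),
		NotAfter:     time.Now().Add(time.Hour),
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, pub, priv)
	if err != nil {
		return err
	}
	keyDER, err := x509.MarshalPKCS8PrivateKey(priv)
	if err != nil {
		return err
	}
	certPEM := pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der})
	keyPEM := pem.EncodeToMemory(&pem.Block{Type: "PRIVATE KEY", Bytes: keyDER})
	if err := os.WriteFile(certFile, certPEM, 0o600); err != nil {
		return err
	}
	return os.WriteFile(keyFile, keyPEM, 0o600)
}

// rotateDemo simulates one rotation: write pair 1, load it, overwrite the files
// with pair 2, reload, and record the serial number served after each load.
func rotateDemo() ([]string, error) {
	dir, err := os.MkdirTemp("", "reload-demo")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)
	r := &certReloader{certFile: filepath.Join(dir, "cert.pem"), keyFile: filepath.Join(dir, "key.pem")}
	var serials []string
	for serial := int64(1); serial <= 2; serial++ {
		if err := writeSelfSigned(r.certFile, r.keyFile, serial); err != nil {
			return nil, err
		}
		if err := r.reload(); err != nil {
			return nil, err
		}
		c, _ := r.getCertificate(nil) // never returns an error in this sketch
		leaf, err := x509.ParseCertificate(c.Certificate[0])
		if err != nil {
			return nil, err
		}
		serials = append(serials, leaf.SerialNumber.String())
	}
	return serials, nil
}

func main() {
	fmt.Println(rotateDemo())
}
```

In a real server the reloader would be wired in with `tls.Config{GetCertificate: r.getCertificate}`, and `reload` would be called whenever the mounted cert/key files change. Without that callback, a server that loads its pair once at startup keeps serving the old certificate until it is restarted, which matches the symptom reported in this thread.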
Yes I think somehow failing to connect to the API server (because it is down, as in the original issue where I was doing an upgrade) is triggering the cert issues, although I'm not sure why really.. maybe just a coincidence |
* make validation watch and reload key/certs (again)
  Galley watched and reloaded key/certs prior to istio 1.3. #12571 refactored galley's validation into two parts: (1) config controller and (2) the webhook server. watch/reload logic was retained in (1) and not added at all to (2). This PR adds the missing watch/reload code to (2).
  fix #17718
* add integration test and fix some bugs
* fix dashboard test, linter errors, and TestReloadConfig
* print cert info on reload
* improve logging
* deflake pod fetch
* fix galley validation key/cert rotation (#17995)
* make validation watch and reload key/certs (again)
  Galley watched and reloaded key/certs prior to istio 1.3. #12571 refactored galley's validation into two parts: (1) config controller and (2) the webhook server. watch/reload logic was retained in (1) and not added at all to (2). This PR adds the missing watch/reload code to (2).
  fix #17718
  (cherry picked from commit 39635d7)
* prevent webhook test from using stale pod from previous test
* add // nolint: interfacer
* reorder test startup
* fix integration test for real this time
* make validation watch and reload key/certs (again)
  Galley watched and reloaded key/certs prior to istio 1.3. #12571 refactored galley's validation into two parts: (1) config controller and (2) the webhook server. watch/reload logic was retained in (1) and not added at all to (2). This PR adds the missing watch/reload code to (2).
  fix #17718
* add integration test and fix some bugs
* fix dashboard test, linter errors, and TestReloadConfig
* print cert info on reload
* improve logging
* deflake pod fetch
* prevent webhook test from using stale pod from previous test (cherry picked from commit 22386dc)
* add // nolint: interfacer
* reorder test startup
* fix integration test for real this time
@howardjohn We are seeing this issue in my cluster as well. Verified the cert expiry is well beyond today. Istio Version - 1.3.0 Galley logs -
What are the options to get this issue fixed in my cluster? Is restarting Galley the only option? |
IIRC, Istio implemented certificate rotation in 1.3. Can you check what happened to Galley? |
@hzxuzhonghu - What specifically would you want to check in Galley? |
@hzxuzhonghu k8s RS describe:
No errors or warnings in the Galley logs, only info.
IP "10.132.26.79" is the EKS API server. Istio was installed from the Helm chart and worked like a charm for about 109 days. |
I was getting the same error in Kubeflow, if it helps someone. I did the following and it started working. |
I am getting errors
Internal error occurred: failed calling webhook "pilot.validation.istio.io": Post https://istio-galley.istio-system.svc:443/admitpilot?timeout=30s: x509: certificate has expired or is not yet valid
in my cluster. Seems a re-occurrence of #14517. Has been ongoing for ~12 hours or so, after a GKE node update. Will collect some more info tomorrow
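For anyone wanting to rule out a misread expiry date, here is a minimal Go sketch that prints a PEM certificate's validity window and classifies a point in time against it. Go's x509 verification rejects a certificate when the current time falls outside `NotBefore`/`NotAfter`, which is the condition behind the error text above. The function names (`validityStatus`, `parseFirstCert`, `parseTime`) and the `certcheck` usage string are hypothetical, and the sketch only looks at the first certificate in the file.

```go
package main

import (
	"crypto/x509"
	"encoding/pem"
	"errors"
	"fmt"
	"os"
	"time"
)

// validityStatus classifies now against the certificate validity window.
func validityStatus(notBefore, notAfter, now time.Time) string {
	switch {
	case now.Before(notBefore):
		return "not yet valid"
	case now.After(notAfter):
		return "expired"
	default:
		return "valid"
	}
}

// parseFirstCert decodes the first PEM block and parses it as a certificate.
func parseFirstCert(pemBytes []byte) (*x509.Certificate, error) {
	block, _ := pem.Decode(pemBytes)
	if block == nil || block.Type != "CERTIFICATE" {
		return nil, errors.New("no PEM CERTIFICATE block found")
	}
	return x509.ParseCertificate(block.Bytes)
}

// parseTime parses an RFC 3339 timestamp such as 2019-10-08T16:01:00Z.
func parseTime(s string) (time.Time, error) {
	return time.Parse(time.RFC3339, s)
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: certcheck <cert.pem> [RFC3339 time]")
		return
	}
	data, err := os.ReadFile(os.Args[1])
	if err != nil {
		fmt.Println("read error:", err)
		return
	}
	cert, err := parseFirstCert(data)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	// Default to the current time; an optional second argument overrides it.
	now := time.Now()
	if len(os.Args) > 2 {
		if now, err = parseTime(os.Args[2]); err != nil {
			fmt.Println("time error:", err)
			return
		}
	}
	fmt.Printf("NotBefore=%s NotAfter=%s status=%s\n",
		cert.NotBefore.Format(time.RFC3339), cert.NotAfter.Format(time.RFC3339),
		validityStatus(cert.NotBefore, cert.NotAfter, now))
}
```

Note that a certificate file that checks out as valid on disk does not prove the server is serving it: a process that loaded an older pair at startup and never reloaded will keep presenting the old one.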