[istio-pilot] Istio pilot failed to fetch pubkey sometimes. #16341
Comments
@johnzheng1975 We have seen the same issue before: the JWT public key failed to refresh due to intermittent network errors. We fixed it by improving the retry logic in Pilot; see issue #14638 for details. The fix was checked in to Istio 1.1, so upgrading to Istio 1.1 or above should resolve the issue.
Good to know that Pilot has retry logic in 1.1 (https://github.com/istio/istio/pull/14640/files). I am upgrading Istio.
Thanks in advance for your answer.
The second question is:
Thanks in advance for your answer.
When Pilot pushes a new config to the proxy, the pubkey in the new config overrides the old one in the proxy. When Pilot failed to fetch the pubkey, it put an empty pubkey in the config, and this caused the 401. Note that we improved the code with retries and better caching in 1.1: if a fetch fails, Pilot falls back to the cached pubkey instead of returning an empty one.
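The difference between the two behaviors described above can be sketched roughly as follows. This is a minimal illustration, not Pilot's actual code: the `JwksResolver` class, its `resolve` method, and the injected `fetch` function are all hypothetical names, and a network failure is modeled as an `OSError`.

```python
from typing import Callable, Dict


class JwksResolver:
    """Hypothetical resolver that falls back to the last good key."""

    def __init__(self, fetch: Callable[[str], str]) -> None:
        self._fetch = fetch                # downloads the JWKS text for a URI
        self._cache: Dict[str, str] = {}   # last successfully fetched key per URI

    def resolve(self, jwks_uri: str) -> str:
        try:
            key = self._fetch(jwks_uri)
        except OSError:
            # Fetch failed: reuse the cached key if one exists.
            # The empty string models the older behavior, where an
            # empty pubkey ended up in the pushed config.
            return self._cache.get(jwks_uri, "")
        self._cache[jwks_uri] = key
        return key
```

With this shape, an empty key is only returned when no fetch has ever succeeded for that URI, which matches the description that a cached pubkey keeps being used after a later failure.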
Actually, the proxy doesn't fetch the pubkey by itself in Istio. That was the old behavior; we later switched to having Pilot fetch the pubkey. In other words, this code is not used now.
Please note when making custom changes to the proxy code: we are switching to the Envoy JWT filter in 1.3, which means the current JWT filter in the proxy is going to be deprecated and removed later.
@johnzheng1975 I have a question about Pilot failing to get the pubkey from S3. Do you know how often this happens? How long does the network error last? Thanks.
@yangminzhu, thanks a lot for the explanation. I have further questions about Istio 1.0.0; thanks in advance for your answers.
Am I correct? And does that mean there is no benefit in moving the public key from outside to inside?
@yangminzhu, I am upgrading to Istio 1.1. However, some clusters are still using Istio 1.0. Assume my jwksURI/public key will not expire. (I will restart istio-pilot in case it changes.)
According to the log below from Istio 1.1, the fetch failed 39 times in one day. FYI
How many pilot pods do you have? The other pods may be connecting to a different pilot instance that fetches the pubkey successfully.
So most of the re-fetches should be issued by the refresh job, which fetches every 20 minutes and tries 3 times each time; it's possible that all 3 fetches failed. But note that once it fetches the pubkey successfully, the pubkey is cached and used even if a later fetch fails.
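A rough sketch of one refresh cycle as described here, assuming three attempts per cycle and a 20-minute period. The function names, the `OSError` failure model, and the optional backoff are illustrative assumptions, not taken from Pilot's source.

```python
import time
from typing import Callable, Dict, Optional

REFRESH_INTERVAL_SECONDS = 20 * 60  # refresh period mentioned in the thread
ATTEMPTS_PER_REFRESH = 3            # fetch attempts per refresh, per the thread


def fetch_with_retries(
    fetch: Callable[[str], str],
    jwks_uri: str,
    attempts: int = ATTEMPTS_PER_REFRESH,
    backoff_seconds: float = 0.0,   # assumed knob; the thread names no backoff
) -> Optional[str]:
    """Try up to `attempts` times; return None if every attempt fails."""
    for attempt in range(attempts):
        try:
            return fetch(jwks_uri)
        except OSError:
            # Wait before the next attempt, but not after the last one.
            if attempt + 1 < attempts and backoff_seconds > 0:
                time.sleep(backoff_seconds)
    return None


def refresh_once(
    cache: Dict[str, str], fetch: Callable[[str], str], jwks_uri: str
) -> bool:
    """Run one refresh cycle; leave the cache untouched on total failure."""
    key = fetch_with_retries(fetch, jwks_uri)
    if key is None:
        return False  # logged as a failed re-fetch; the old key stays in use
    cache[jwks_uri] = key
    return True
```

A scheduler would call `refresh_once` every `REFRESH_INTERVAL_SECONDS`. A cycle that returns `False` corresponds to a failed re-fetch in the logs, while requests continue to be verified against the previously cached key.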
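I think FetchPubkey is not called because the pubkey will be populated by pilot: https://github.com/istio/proxy/blob/2a21f69ce04f2acadde488ba2d986dca2223a624/src/envoy/http/jwt_auth/jwt_authenticator.cc#L134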
As a workaround, you can run a k8s job that applies and deletes a dummy service entry periodically (for example, every minute); this will force Pilot to fetch the pubkey every minute. However, this will also make Pilot much busier. You may also try https://github.com/istio/istio/releases/tag/1.0.9; I think we actually cherry-picked the fix to 1.0.9.
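The periodic apply/delete workaround could look roughly like the sketch below. It assumes `kubectl` is on the PATH and that a dummy ServiceEntry manifest exists at a path you supply (for example a hypothetical `dummy-service-entry.yaml`); the function names and the injectable `run` parameter are illustrative only.

```python
import subprocess
import time
from typing import Callable, Sequence


def kubectl(args: Sequence[str]) -> None:
    """Run a kubectl command and raise if it exits non-zero."""
    subprocess.run(["kubectl", *args], check=True)


def touch_service_entry(
    manifest_path: str, run: Callable[[Sequence[str]], None] = kubectl
) -> None:
    """Apply, then delete, the dummy manifest so that Pilot sees a config change."""
    run(["apply", "-f", manifest_path])
    run(["delete", "-f", manifest_path])


def run_forever(manifest_path: str, interval_seconds: float = 60.0) -> None:
    """Repeat the apply/delete cycle at a fixed interval (every minute by default)."""
    while True:
        touch_service_entry(manifest_path)
        time.sleep(interval_seconds)
```

Inside a cluster the same loop would more likely run as a Kubernetes Job or CronJob. As noted in the comment above, every cycle triggers extra work in Pilot, so the interval is a trade-off between key freshness and load.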
Yeah, the workaround seems okay, but I would suggest updating to 1.0.9 to avoid custom changes to Pilot.
Thanks for the info; it's happening much more frequently than I would expect. Do you still see 401s after upgrading? I think the logs above are just for failed re-fetches; Pilot should keep using the cached pubkey, and there should be no 401 in this case.
Thanks very much, Zhu, yangmin. I have never gotten such a detailed answer from an open source community.
Zhu, yangmin: How many pilot pods do you have? The other pods may be connecting to a different pilot instance that fetches the pubkey successfully.
Zhu, yangmin: So most of the re-fetch should be issued by the re-fresh job that does the fetch every 20 mins and it retries 3 times each time, it's possible that all the 3 fetches failed. But one thing to note is once it fetches the pubkey successfully once, it will be cached and used even if it fails next time.
Zhu, yangmin: I think FetchPubkey is not called because the pubkey will be populated by pilot: https://github.com/istio/proxy/blob/2a21f69ce04f2acadde488ba2d986dca2223a624/src/envoy/http/jwt_auth/jwt_authenticator.cc#L134
Zhu, yangmin: The other thing is I think you may also try about the https://github.com/istio/istio/releases/tag/1.0.9, I think we actually cherrypicked the fix to 1.0.9.
Bug description
Istio pilot sometimes fails to fetch the pubkey. This causes JWT verification to fail (401).
Affected product area (please put an X in all that apply)
[*] Security
Expected behavior
Pilot should get the public key from S3 for JWT verification. If a fetch fails, it should retry.
Istio-proxy should have a way to get the key from outside the cluster.
Steps to reproduce the bug
1 Run Kubernetes and Istio.
2 Enable JWT verification.
3 We found that sometimes istio-pilot cannot get the public key from S3.
4 All HTTP requests return 401 in istio-proxy.
5 Istio-proxy tries to get the key from S3, but it also fails because S3 is not in the cluster.
Version (include the output of istioctl version --remote and kubectl version)
1.0.0
How was Istio installed?
Helm
Environment where bug was observed (cloud vendor, OS, etc)
AWS / Coreos
Note that we can customize the istio-proxy code; it would be great if you could suggest
how to fix this with customized istio-proxy code.
Thanks a lot.