Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[istio-pilot] Istio pilot failed to fetch pubkey sometimes. #16341

Closed
johnzheng1975 opened this issue Aug 16, 2019 · 10 comments
Closed

[istio-pilot] Istio pilot failed to fetch pubkey sometimes. #16341

johnzheng1975 opened this issue Aug 16, 2019 · 10 comments

Comments

@johnzheng1975
Copy link
Member

johnzheng1975 commented Aug 16, 2019

Bug description
Istio pilot failed to fetch pubkey sometimes. This will bring JWT verify fail (401).

2019-08-15T17:47:29.805647Z	error	model	Failed to fetch pubkey from "https://s3.amazonaws.com/fs-jwk/use1b.prod/use1b-prod-backoffice-jwks": Get https://s3.amazonaws.com/fs-jwk/use1b.prod/use1b-prod-backoffice-jwks: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2019-08-14T08:26:33.265361Z	error	model	Cannot fetch JWT public key from "https://s3.amazonaws.com/fs-jwk/use1b.prod/use1b-prod-ds-manufacturer-api-jwks", Get https://s3.amazonaws.com/fs-jwk/use1b.prod/use1b-prod-ds-manufacturer-api-jwks: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2019-08-16T09:40:36.916487Z	error	Failed to update lock: Operation cannot be fulfilled on configmaps "istio-ingress-controller-leader-istio": the object has been modified; please apply your changes to the latest version and try again

Affected product area (please put an X in all that apply)
[*] Security

Expected behavior
Pilot should get public key from s3, for jwt verify. If fail once, should retry.
Istio-proxy should have a way to get the key out of cluster.

Steps to reproduce the bug
1 Run kubernetes and istio
2 Enable JWT verify.
3 We found sometimes the public key cannot be get by istio-pilot from s3.
4 All http request will return 401 in istio-proxy
5 Istio-proxy will try to get the key from s3, but it also fail because s3 is not in cluster.

jwt_authenticator.cc
void JwtAuthenticator::FetchPubkey(PubkeyCacheItem* issuer) {
  uri_ = issuer->jwt_config().remote_jwks().http_uri().uri();
  ...
  const auto& cluster = issuer->jwt_config().remote_jwks().http_uri().cluster();
  if (cm_.get(cluster) == nullptr) {
    ENVOY_LOG(error, "Fetch pubkey can not find cluster: {}", cluster);
    DoneWithStatus(Status::FAILED_FETCH_PUBKEY);
    return;
  }

Version (include the output of istioctl version --remote and kubectl version)
1.0.0

How was Istio installed?
Helm

Environment where bug was observed (cloud vendor, OS, etc)
AWS / Coreos

Note that we can customizing istio-proxy code, it will be great if you can give me suggestion:
how to fix this with customized istio-proxy code.
Thanks a lot.

@liminw
Copy link
Contributor

liminw commented Aug 16, 2019

@johnzheng1975 We have seen the same issuer earlier, i.e, JWT public key failed to refresh due to intermittent network error. We fixed it by improving the retry logic in Pilot. See issue#14638 for details. The fix was checked to Istio 1.1. If you upgrade your Istio version to 1.1 or above, the issue should be resolved.

@johnzheng1975
Copy link
Member Author

johnzheng1975 commented Aug 19, 2019

Good to know pilot has retry in 1.1(https://github.com/istio/istio/pull/14640/files), I am upgrading istio.
Two things I still do not understand.
First question is: why jwk will expire in istio-proxy, it should never expire since it is inline_jwks.
I thought it is cached in istio-proxy forever until istio-proxy restart. However, the old jwk seems may disappear after sometime without istio-proxy restart.
Is it deleted by istio-pilot? It will recheck and get jwk each hour? and it will delete jwk in istio-proxy in case check fail?

    auto inline_jwks = ReadDataStore(jwt_config_.local_jwks(), true);
    if (!inline_jwks.empty()) {
      Status status = SetKey(inline_jwks,
                             // inline jwks never expires.
                             std::chrono::steady_clock::time_point::max());

Thanks in advance for your answer.

@johnzheng1975
Copy link
Member Author

johnzheng1975 commented Aug 19, 2019

The second question is:
Seems istio-proxy will try to get publickey by itself, if istio-proxy cannot get the key from istio-pilot
Why it always fail? Is it necessary to put the key inside k8s cluster?

jwt_authenticator.cc
void JwtAuthenticator::FetchPubkey(PubkeyCacheItem* issuer) {
  uri_ = issuer->jwt_config().remote_jwks().http_uri().uri();
  ...
  const auto& cluster = issuer->jwt_config().remote_jwks().http_uri().cluster();
  if (cm_.get(cluster) == nullptr) {
    ENVOY_LOG(error, "Fetch pubkey can not find cluster: {}", cluster);
    DoneWithStatus(Status::FAILED_FETCH_PUBKEY);
    return;
  }

Thanks in advance for your answer.

@yangminzhu
Copy link
Contributor

Good to know pilot has retry in 1.1(https://github.com/istio/istio/pull/14640/files), I am upgrading istio.
Two things I still do not understand.
First question is: why jwk will expire in istio-proxy, it should never expire since it is inline_jwks.
I thought it is cached in istio-proxy forever until istio-proxy restart. However, the old jwk seems may disappear after sometime without istio-proxy restart.
Is it deleted by istio-pilot? It will recheck and get jwk each hour? and it will delete jwk in istio-proxy in case check fail?

when pilot pushes new config to proxy, the pubkey in the new config will override the old one in proxy. When pilot failed to fetch the pubkey, it uses an empty pubkey in the config and this is causing the 401.

Note, we improved the code with retry and better caching in 1.1. Pilot will fallback to use the cached pubkey if it failed to fetch it instead of returning an empty pubkey.

    auto inline_jwks = ReadDataStore(jwt_config_.local_jwks(), true);
    if (!inline_jwks.empty()) {
      Status status = SetKey(inline_jwks,
                             // inline jwks never expires.
                             std::chrono::steady_clock::time_point::max());

Thanks in advance for your answer.

The second question is:
Seems istio-proxy will try to get publickey by itself, if istio-proxy cannot get the key from istio-pilot
Why it always fail? Is it necessary to put the key inside k8s cluster?

Actually proxy doesn't get the pubkey by itself in istio. That is the old behavior before but later we switched to use pilot to fetch the pubkey. In other words, these code are not used now.

jwt_authenticator.cc
void JwtAuthenticator::FetchPubkey(PubkeyCacheItem* issuer) {
  uri_ = issuer->jwt_config().remote_jwks().http_uri().uri();
  ...
  const auto& cluster = issuer->jwt_config().remote_jwks().http_uri().cluster();
  if (cm_.get(cluster) == nullptr) {
    ENVOY_LOG(error, "Fetch pubkey can not find cluster: {}", cluster);
    DoneWithStatus(Status::FAILED_FETCH_PUBKEY);
    return;
  }

Thanks in advance for your answer.

Please be noted when making custom changes in proxy code: we are switching to use Envoy JWT filter in 1.3 which means the current JWT filter in proxy is going to be deprecated and removed later.

@johnzheng1975 I have a question about pilot failed to get pubkey from s3. Do you know how often does this happen? How long does the network error last? thanks.

@johnzheng1975
Copy link
Member Author

johnzheng1975 commented Aug 20, 2019

@yangminzhu , Thanks a lot for your introduction.
Your question: Do you know how often does this happen?
My answer: For one cluster, there are 5 public key, basically will raise once 1 - 2 times each month.
Your question: How long does the network error last?
My answer: Here is an example:
For one service, it has three pods. One of three has issue(http request enter will return 200), the other two are fine(http request enter will return 200).
Case 1: It last 1 hour, recovery by itself.
Case 2: It last 3 hours, did not recovery until we restart the pods manually.

I have further question for istio 1.0.0, thanks in advance for your answer.
1 As you said, pilot will empty public key in istio-proxy. Why it only impact one pod, not all three pods in my upper case?
2 During my testing, I found: if pilot public key fetch fail, it will FETCH AGAIN for EACH new coming http request to your service, until it got successfully. This means it has a lot of changes to recovery, right? Why my case took 1 hour to recovery?
3 As you said "proxy doesn't get the pubkey by itself in istio", same with my testing result.
If I pilot cannot get public key, the uri_ in code is empty string (in my testing)

void JwtAuthenticator::FetchPubkey(PubkeyCacheItem* issuer) {
  uri_ = issuer->jwt_config().remote_jwks().http_uri().uri();

Am I correct? And, is that means no any benefits to change public key from outside to inside?

@johnzheng1975
Copy link
Member Author

johnzheng1975 commented Aug 20, 2019

@yangminzhu , I am upgrading to istio1.1. However, there is still some cluster using istio1.0.
So, do you have any suggestion how to simple fix it in istio1.0?

Assume my jwksURI/public key will not expire. (I will restart istio-pilot in case it change)
Here is my simple quick fix before istio upgrade, is that fine?
jwks_resolver.go_old.txt
jwks_resolver.go_change.txt

func (r *jwksResolver) refresh(t time.Time) {
        ... ...
		// Remove cached item if it hasn't been used for a while.
		if now.Sub(e.lastUsedTime) >= r.evictionDuration {
			//r.keyEntries.Delete(jwksURI)
			//return true
		}
        ... ...
				resp, err := r.getRemoteContent(jwksURI)
				if err != nil {
					log.Errorf("Cannot fetch JWT public key from %q, %v", jwksURI, err)
					//r.keyEntries.Delete(jwksURI)
					return
				}

@johnzheng1975
Copy link
Member Author

From below log from istio1.1, fetch fail is 39 times in a day. FYI

export ns=istio-system;kubectl get pods -n $ns|grep pilot|awk '{print $1}'|xargs -i sh -c 'echo "\n {}";kubectl logs {} -n $ns -c discovery '|grep "Failed to fetch jwt public key from"
2019-08-20T17:58:30.930042Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-kiosk-jwks"
2019-08-20T20:18:31.273683Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-natsgrpcproxy-jwks"
2019-08-20T22:38:30.493749Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-miniappinapp-jwks"
2019-08-20T22:38:31.385557Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-miniappinapp-jwks"
2019-08-20T22:38:31.407981Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-miniappinapp-jwks"
2019-08-21T02:05:16.100496Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-backoffice-jwks"
2019-08-21T02:52:16.373393Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-rsscan-agent-jwks"
2019-08-21T03:18:16.695951Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-backoffice-jwks"
2019-08-21T03:18:16.898469Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-backoffice-jwks"
2019-08-21T03:18:17.009218Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-l1volumental-jwks"
2019-08-21T03:18:17.658411Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-l1volumental-jwks"
2019-08-21T03:18:17.678332Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-nflapp-jwks"
2019-08-21T05:17:36.445281Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-natsgrpcproxy-jwks"
2019-08-21T06:27:09.395468Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-natsgrpcproxy-jwks"
2019-08-21T06:56:12.066143Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-desma-manufacturer-api-jwks"
2019-08-21T06:56:13.008360Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-desma-manufacturer-api-jwks"
2019-08-21T07:06:19.760952Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-miniappinapp-jwks"
2019-08-21T07:06:20.060005Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-miniappinapp-jwks"
2019-08-21T07:58:31.017560Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-l1volumental-jwks"
2019-08-21T09:28:55.026637Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-ham-jwks"
2019-08-21T09:28:55.515989Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-ham-jwks"
2019-08-21T09:53:59.078427Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-backoffice-jwks1"
2019-08-21T09:54:00.385390Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-backoffice-jwks1"
2019-08-21T09:56:07.689899Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-backoffice-jwks1"
2019-08-21T09:56:08.715166Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-backoffice-jwks1"
2019-08-21T10:08:59.523039Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-miniappinapp-jwks"
2019-08-21T10:09:01.363518Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-miniappinapp-jwks"
2019-08-21T10:09:01.517533Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-miniappinapp-jwks"
2019-08-21T10:15:44.698259Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-desma-manufacturer-api-jwks"
2019-08-21T10:15:44.757006Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-desma-manufacturer-api-jwks"
2019-08-21T10:18:32.803396Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-desma-manufacturer-api-jwks"
2019-08-21T10:18:32.824396Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-desma-manufacturer-api-jwks"
2019-08-21T10:24:22.443047Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-l1steitz-jwks"
2019-08-21T10:24:22.443193Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-rsscan-agent-jwks"
2019-08-21T10:24:22.827108Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-l1nfl-jwks"
2019-08-21T10:24:22.938737Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-l1nfl-jwks"
2019-08-21T10:51:26.377193Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-nflapp-jwks"
2019-08-21T11:07:59.769353Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-l1volumental-jwks"
2019-08-21T11:08:00.769330Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-l1volumental-jwks"

@yangminzhu
Copy link
Contributor

@yangminzhu , Thanks a lot for your introduction.
Your question: Do you know how often does this happen?
My answer: For one cluster, there are 5 public key, basically will raise once 1 - 2 times each month.
Your question: How long does the network error last?
My answer: Here is an example:
For one service, it has three pods. One of three has issue(http request enter will return 200), the other two are fine(http request enter will return 200).
Case 1: It last 1 hour, recovery by itself.
Case 2: It last 3 hours, did not recovery until we restart the pods manually.

I have further question for istio 1.0.0, thanks in advance for your answer.
1 As you said, pilot will empty public key in istio-proxy. Why it only impact one pod, not all three pods in my upper case?

How many pilot pods do you have? The other pods may be connecting to a different pilot instance that fetches the pubkey successfully.

2 During my testing, I found: if pilot public key fetch fail, it will FETCH AGAIN for EACH new coming http request to your service, until it got successfully. This means it has a lot of changes to recovery, right? Why my case took 1 hour to recovery?

So most of the re-fetch should be issued by the re-fresh job that does the fetch every 20 mins and it reties 3 times each time, it's possible that all the 3 fetches failed. But one thing to note is once it fetches the pubkey successfully once, it will be cached and used even if it fails next time.

3 As you said "proxy doesn't get the pubkey by itself in istio", same with my testing result.
If I pilot cannot get public key, the uri_ in code is empty string (in my testing)

void JwtAuthenticator::FetchPubkey(PubkeyCacheItem* issuer) {
  uri_ = issuer->jwt_config().remote_jwks().http_uri().uri();

Am I correct? And, is that means no any benefits to change public key from outside to inside?

I think FetchPubkey is not called because the pubkey will be populated by pilot: https://github.com/istio/proxy/blob/2a21f69ce04f2acadde488ba2d986dca2223a624/src/envoy/http/jwt_auth/jwt_authenticator.cc#L134

@yangminzhu , I am upgrading to istio1.1. However, there is still some cluster using istio1.0.
So, do you have any suggestion how to simple fix it in istio1.0?

There is a workaround that you can apply a k8s job that apply/delete a dummy service entry periodically (like every 1 mins), this will force pilot to fetch the pubkey every 1 min. However, this will also make pilot much more busy.

The other thing is I think you may also try about the https://github.com/istio/istio/releases/tag/1.0.9, I think we actually cherrypicked the fix to 1.0.9.

Assume my jwksURI/public key will not expire. (I will restart istio-pilot in case it change)
Here is my simple quick fix before istio upgrade, is that fine?
jwks_resolver.go_old.txt
jwks_resolver.go_change.txt

func (r *jwksResolver) refresh(t time.Time) {
        ... ...
		// Remove cached item if it hasn't been used for a while.
		if now.Sub(e.lastUsedTime) >= r.evictionDuration {
			//r.keyEntries.Delete(jwksURI)
			//return true
		}
        ... ...
				resp, err := r.getRemoteContent(jwksURI)
				if err != nil {
					log.Errorf("Cannot fetch JWT public key from %q, %v", jwksURI, err)
					//r.keyEntries.Delete(jwksURI)
					return
				}

Yeah, the workaround seems okay, but I would suggest you to try to update to 1.0.9 to avoid custom changes to the Pilot.

@yangminzhu
Copy link
Contributor

From below log from istio1.1, fetch fail is 39 times in a day. FYI

export ns=istio-system;kubectl get pods -n $ns|grep pilot|awk '{print $1}'|xargs -i sh -c 'echo "\n {}";kubectl logs {} -n $ns -c discovery '|grep "Failed to fetch jwt public key from"
2019-08-20T17:58:30.930042Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-kiosk-jwks"
2019-08-20T20:18:31.273683Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-natsgrpcproxy-jwks"
2019-08-20T22:38:30.493749Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-miniappinapp-jwks"
2019-08-20T22:38:31.385557Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-miniappinapp-jwks"
2019-08-20T22:38:31.407981Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-miniappinapp-jwks"
2019-08-21T02:05:16.100496Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-backoffice-jwks"
2019-08-21T02:52:16.373393Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-rsscan-agent-jwks"
2019-08-21T03:18:16.695951Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-backoffice-jwks"
2019-08-21T03:18:16.898469Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-backoffice-jwks"
2019-08-21T03:18:17.009218Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-l1volumental-jwks"
2019-08-21T03:18:17.658411Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-l1volumental-jwks"
2019-08-21T03:18:17.678332Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-nflapp-jwks"
2019-08-21T05:17:36.445281Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-natsgrpcproxy-jwks"
2019-08-21T06:27:09.395468Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-natsgrpcproxy-jwks"
2019-08-21T06:56:12.066143Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-desma-manufacturer-api-jwks"
2019-08-21T06:56:13.008360Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-desma-manufacturer-api-jwks"
2019-08-21T07:06:19.760952Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-miniappinapp-jwks"
2019-08-21T07:06:20.060005Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-miniappinapp-jwks"
2019-08-21T07:58:31.017560Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-l1volumental-jwks"
2019-08-21T09:28:55.026637Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-ham-jwks"
2019-08-21T09:28:55.515989Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-ham-jwks"
2019-08-21T09:53:59.078427Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-backoffice-jwks1"
2019-08-21T09:54:00.385390Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-backoffice-jwks1"
2019-08-21T09:56:07.689899Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-backoffice-jwks1"
2019-08-21T09:56:08.715166Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-backoffice-jwks1"
2019-08-21T10:08:59.523039Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-miniappinapp-jwks"
2019-08-21T10:09:01.363518Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-miniappinapp-jwks"
2019-08-21T10:09:01.517533Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-miniappinapp-jwks"
2019-08-21T10:15:44.698259Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-desma-manufacturer-api-jwks"
2019-08-21T10:15:44.757006Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-desma-manufacturer-api-jwks"
2019-08-21T10:18:32.803396Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-desma-manufacturer-api-jwks"
2019-08-21T10:18:32.824396Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-desma-manufacturer-api-jwks"
2019-08-21T10:24:22.443047Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-l1steitz-jwks"
2019-08-21T10:24:22.443193Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-rsscan-agent-jwks"
2019-08-21T10:24:22.827108Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-l1nfl-jwks"
2019-08-21T10:24:22.938737Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-l1nfl-jwks"
2019-08-21T10:51:26.377193Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-nflapp-jwks"
2019-08-21T11:07:59.769353Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-l1volumental-jwks"
2019-08-21T11:08:00.769330Z	warn	Failed to fetch jwt public key from "https://s3.amazonaws.com/f..n-jwk/use1b.dev/use1b-dev-l1volumental-jwks"

Thanks for the info, it's happening much more frequent than I would expect. do you still see 401 after upgrading? I think the above logs are just for failed re-fetch, pilot should keep using the cached pubkey and there should be no 401 in this case.

@johnzheng1975
Copy link
Member Author

johnzheng1975 commented Aug 22, 2019

Thanks Zhu,yangmin very much, I never get so detailed answer from open source community.

Zhu, yangmin: How many pilot pods do you have? The other pods may be connecting to a different pilot instance that fetches the pubkey successfully.
John Zheng: I have 2 - 4 pilot pods, your answer make sense.

Zhu, yangmin: So most of the re-fetch should be issued by the re-fresh job that does the fetch every 20 mins and it reties 3 times each time, it's possible that all the 3 fetches failed. But one thing to note is once it fetches the pubkey successfully once, it will be cached and used even if it fails next time.
John Zheng: Our code is based on old istio1.0.0, which has no retry. It make sense for your answer: try once each 20 mins, and fetch successfully after 1 hour.

Zhu, yanming I think FetchPubkey is not called because the pubkey will be populated by pilot: https://github.com/istio/proxy/blob/2a21f69ce04f2acadde488ba2d986dca2223a624/src/envoy/http/jwt_auth/jwt_authenticator.cc#L134
John Zheng: If issuer has no pubkey() or issuer is Expired, FetchPubkey still will be called, right?
In such case, it will return 401 then. There is a lot of such calling log in my cluster (include istio1.1.7)
image

Zhu, yangmin: The other thing is I think you may also try about the https://github.com/istio/istio/releases/tag/1.0.9, I think we actually cherrypicked the fix to 1.0.9.
Yeah, your workaround seems okay, but I would suggest you to try to update to 1.0.9 to avoid custom changes to the Pilot.
John Zheng: Thanks a lot. Rebase our customized istio-proxy code to 1.0.9 may bring workload, I will use my workarround, and istio1.0.0 may retire in 2 - 4 weeks.
I already applied my changes in my environment, it runs well and no related error message again. ^_^

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants