Stale Quay Authentication #2055

Closed
dchw opened this issue Mar 31, 2021 · 0 comments · Fixed by #2062

Comments

@dchw
Contributor

dchw commented Mar 31, 2021

I think I found a bug where Buildkit will use stale/expired credentials when attempting to communicate with Quay. Some Earthly users appear to be affected by this: earthly/earthly#890

Let's see if I can provide a decent summary. It's one of those bugs that's a confluence of a few factors, so here it goes:

  1. Quay returns the bare minimum in its authentication response. See here for details, heading "Token Response Fields". Notably, Quay omits the expires_in field. The documentation says that if it is missing, you should assume the duration is 60 seconds.

Here is a sample Quay auth payload (JWT elided):

{
  "token": "eyJhb..."
}
  2. Containerd interprets this missing field as an expires_in of 0 (see this function, and this struct for deserialization), and passes that zero on to Buildkit. This happens because FetchTokenResponse cannot distinguish a missing expires_in field from one that is truly zero.
  3. Buildkit treats an expires_in of 0 as "doesn't expire" (see here), and dutifully caches the token to avoid the overhead of re-authentication.
  4. Additionally, Buildkit's GC cleans up all cached credentials that haven't been used for more than 10 minutes, checking every 5 minutes (see here).

So, to reproduce this, you need to run a build that talks to Quay at least once every 10-ish minutes for at least an hour (the JWT coming from Quay in my case has an exp of 1 hour in the future; why it's not also in the external JSON is beyond me).
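
For anyone who wants to check the exp claim on their own token, here is a rough Go sketch that reads it from the JWT payload. It does not verify the signature, so it is only suitable for inspection. `jwtExpiry` and `makeToken` are hypothetical helpers written for this example, and the token built in `main` is a stand-in, not a real Quay token.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
	"time"
)

// jwtExpiry extracts the "exp" claim from a JWT without verifying the
// signature. Use it for inspection only, never for trust decisions.
func jwtExpiry(token string) (time.Time, error) {
	parts := strings.Split(token, ".")
	if len(parts) != 3 {
		return time.Time{}, errors.New("not a three-part JWT")
	}
	// JWT segments are base64url-encoded without padding.
	payload, err := base64.RawURLEncoding.DecodeString(parts[1])
	if err != nil {
		return time.Time{}, err
	}
	var claims struct {
		Exp *float64 `json:"exp"` // seconds since the Unix epoch
	}
	if err := json.Unmarshal(payload, &claims); err != nil {
		return time.Time{}, err
	}
	if claims.Exp == nil {
		return time.Time{}, errors.New("no exp claim in token")
	}
	return time.Unix(int64(*claims.Exp), 0), nil
}

// makeToken builds an unsigned stand-in token carrying only an exp claim,
// so the example can run without a real registry.
func makeToken(exp int64) string {
	enc := base64.RawURLEncoding.EncodeToString
	header := enc([]byte(`{"alg":"none"}`))
	payload := enc([]byte(fmt.Sprintf(`{"exp":%d}`, exp)))
	return header + "." + payload + ".signature"
}

func main() {
	token := makeToken(time.Now().Add(time.Hour).Unix())
	exp, err := jwtExpiry(token)
	if err != nil {
		panic(err)
	}
	fmt.Println("token expires in about", time.Until(exp).Round(time.Minute))
}
```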

You should be able to reproduce this with a build that uses a private Quay image. I reproduced the error with Earthly (see linked bug above), but an equivalent Dockerfile would be something like this:

Dockerfile:

FROM quay.io/dchw/testing
RUN echo bye > goodbye.txt

After running this for an hour, I start to see logs like this: unexpected status code [manifests ci]: 401 UNAUTHORIZED

So - this unfortunately means that workarounds are few and far between. Options are to either:

  1. Restart the affected Buildkit daemon
  2. Wait 10-15 minutes for the GC to clean up the "unused" token

I should also mention that I don't know whose side of the fence this lands on: containerd or buildkit. I am raising it here since it is visible from buildkit, and I trust that you'll send me to the right place if it doesn't belong here. If the fix ends up in containerd, buildkit would probably need an update too.
