-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a Timeout to Image Digest resolution #8724
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@julz: 0 warnings.
In response to this:
Image digest resolution happens synchronously in the revision reconciler. This means slow registries can potentially block up the reconciliation queue.
This PR:
- bumps
go-containerregistry
to pick up theWithContext()
feature; and- uses it to install a high (60 second) time-out for image digest resolution.
Started out with a pretty generous time-out. As a follow-on we could make the timeout configurable and/or place a deadline on the reconcile itself (in which case the context threaded down here will cancel the digest resolution correctly), and/or make it more aggressive, but at least this stops a slow registry gumming up an entire queue indefinitely for now.
Release Note
Adds a 60 second timeout for image digest resolution to guard against slow registries
/assign @mattmoor @vagababov @markusthoemmes
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
}, | ||
}) | ||
|
||
ctx, cancel := context.WithTimeout(context.Background(), 1*time.Second) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can significantly tighten this. Maybe 100ms instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to make sure the thing we cancel inside is the actual manifest get, if this is too short we can accidentally end up only testing cancelling the ping / credential lookup :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could make this super safe by adding a second channel that is closed by the server so you know in the test when the actual request has been received. That'd allow us to ditch the hardcoded timeout completely and just use a cancel function on the context after that signal has arrived.
Up to you if you think that's worth here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, considered something like that but seems like a tiny bit more ceremony than this is worth here. I've bumped the timeout way down to 200ms tho, that seems good enough to be very confident we're timing out in go-containerregistry rather than credential lookup.
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: julz, markusthoemmes The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
The following is the coverage report on the affected files.
|
/lgtm |
Apparently I had unsubmitted comments to the knative#8724, but also make the file feel more unified Finally I removed all the fatal calls from the handlers since they run in test HTTP server goroutines and go doesn't like when you fatal inside a go routine.
* Fix some nits in the resolve test Apparently I had unsubmitted comments to the #8724, but also make the file feel more unified Finally I removed all the fatal calls from the handlers since they run in test HTTP server goroutines and go doesn't like when you fatal inside a go routine. * rm
Image digest resolution happens synchronously in the revision reconciler. This means slow registries can potentially block up the reconciliation queue.
This PR:
go-containerregistry
to pick up theWithContext()
feature; andStarted out with a pretty generous time-out. As a follow-on we could make the timeout configurable and/or place a deadline on the reconcile itself (in which case the context threaded down here will cancel the digest resolution correctly), and/or make it more aggressive, but at least this stops a slow registry gumming up an entire queue indefinitely for now.
Release Note
/assign @mattmoor @vagababov @markusthoemmes