Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a Timeout to Image Digest resolution #8724

Merged
merged 3 commits into from
Jul 21, 2020

Conversation

julz
Copy link
Member

@julz julz commented Jul 21, 2020

Image digest resolution happens synchronously in the revision reconciler. This means slow registries can potentially block up the reconciliation queue.

This PR:

  • bumps go-containerregistry to pick up the WithContext() feature; and
  • uses it to install a high (60 second) time-out for image digest resolution.

Started out with a pretty generous time-out. As a follow-on we could make the timeout configurable and/or place a deadline on the reconcile itself (in which case the context threaded down here will cancel the digest resolution correctly), and/or make it more aggressive, but at least this stops a slow registry gumming up an entire queue indefinitely for now.

Release Note

Adds a 60 second timeout for image digest resolution to guard against slow registries

/assign @mattmoor @vagababov @markusthoemmes

@knative-prow-robot knative-prow-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. area/API API objects and controllers labels Jul 21, 2020
@googlebot googlebot added the cla: yes Indicates the PR's author has signed the CLA. label Jul 21, 2020
Copy link
Contributor

@knative-prow-robot knative-prow-robot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@julz: 0 warnings.

In response to this:

Image digest resolution happens synchronously in the revision reconciler. This means slow registries can potentially block up the reconciliation queue.

This PR:

  • bumps go-containerregistry to pick up the WithContext() feature; and
  • uses it to install a high (60 second) time-out for image digest resolution.

Started out with a pretty generous time-out. As a follow-on we could make the timeout configurable and/or place a deadline on the reconcile itself (in which case the context threaded down here will cancel the digest resolution correctly), and/or make it more aggressive, but at least this stops a slow registry gumming up an entire queue indefinitely for now.

Release Note

Adds a 60 second timeout for image digest resolution to guard against slow registries

/assign @mattmoor @vagababov @markusthoemmes

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Copy link
Contributor

@markusthoemmes markusthoemmes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice and tidy, very nice!

I'll leave the LGTM to the API WG.

/assign @mattmoor @dprotaso

},
})

ctx, cancel := context.WithTimeout(context.Background(), 1*time.Second)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can significantly tighten this. Maybe 100ms instead?

Copy link
Member Author

@julz julz Jul 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to make sure the thing we cancel inside is the actual manifest get, if this is too short we can accidentally end up only testing cancelling the ping / credential lookup :(

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could make this super safe by adding a second channel that is closed by the server so you know in the test when the actual request has been received. That'd allow us to ditch the hardcoded timeout completely and just use a cancel function on the context after that signal has arrived.

Up to you if you think that's worth here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, considered something like that but seems like a tiny bit more ceremony than this is worth here. I've bumped the timeout way down to 200ms tho, that seems good enough to be very confident we're timing out in go-containerregistry rather than credential lookup.

@knative-prow-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: julz, markusthoemmes

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow-robot knative-prow-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 21, 2020
@knative-metrics-robot
Copy link

The following is the coverage report on the affected files.
Say /test pull-knative-serving-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/revision/revision.go 97.8% 97.9% 0.1

@dprotaso
Copy link
Member

/lgtm

@knative-prow-robot knative-prow-robot added the lgtm Indicates that a PR is ready to be merged. label Jul 21, 2020
@knative-prow-robot knative-prow-robot merged commit 7c33491 into knative:master Jul 21, 2020
vagababov added a commit to vagababov/serving that referenced this pull request Sep 3, 2020
Apparently I had unsubmitted comments to the knative#8724, but also make the file feel more unified

Finally I removed all the fatal calls from the handlers since they run in test HTTP server goroutines
and go doesn't like when you fatal inside a go routine.
knative-prow-robot pushed a commit that referenced this pull request Sep 3, 2020
* Fix some nits in the resolve test

Apparently I had unsubmitted comments to the #8724, but also make the file feel more unified

Finally I removed all the fatal calls from the handlers since they run in test HTTP server goroutines
and go doesn't like when you fatal inside a go routine.

* rm
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/API API objects and controllers cla: yes Indicates the PR's author has signed the CLA. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants