Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make retrying transport and http errors configurable #1122

Merged
merged 5 commits into from
Sep 22, 2021

Conversation

DennisDenuto
Copy link
Contributor

@DennisDenuto DennisDenuto commented Sep 9, 2021

Related to #1114

Adds the following 4 options:

  • WithRetryHTTPBackoff
  • WithRetryHTTPPredicate
  • WithRetryTransportBackoff
  • WithRetryTransportPredicate

Went with a finer level API around configuring these (in my head) 2 groups of retry logic (low-level transport errors and higher level HTTP errors).

I was also aiming to keep the 'default' retry logic values the same (if these options are not provided).

@jonjohnsonjr
Copy link
Collaborator

So I don't love how many options this adds, but I get the need for this to be more configurable.

For the transport pieces, I think I'd like to revisit this: #740

Ideally, you could supply your own transport via remote.WithTransport and use transport.NewRetry to configure it. The only issue is around double-wrapping because of how we wrap transports by default. One option here is to just not wrap the supplied transport if you pass in a custom transport, but then you'd lose out on some "niceness". If we exposed this stuff in a nice way, you could claw that back by opting into it, e.g.:

// Wraps with retries and useragent and debug logging
remote.Image(..., remote.WithTransport(remote.Transport{t}))

// No wrapping of t
remote.Image(..., remote.WithTransport(t))

// Configurable backoff stuff.
remote.Image(..., remote.WithTransport(transport.NewRetry(t, ...)))

My only issue with that approach is that it's probably a breaking change :/

Another approach might be exposing more on the errors we return from transport, e.g. the request:

That would allow you to write a predicate that checked the request method, URL, and status code yourself to determine if you should retry it.

Note: This is a breaking change

Authored-by: Dennis Leon <leonde@vmware.com>
@codecov-commenter
Copy link

codecov-commenter commented Sep 10, 2021

Codecov Report

Merging #1122 (4e8dccb) into main (8388fde) will increase coverage by 0.22%.
The diff coverage is 83.95%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1122      +/-   ##
==========================================
+ Coverage   75.16%   75.39%   +0.22%     
==========================================
  Files         108      108              
  Lines        7724     7827     +103     
==========================================
+ Hits         5806     5901      +95     
  Misses       1363     1363              
- Partials      555      563       +8     
Impacted Files Coverage Δ
pkg/v1/remote/options.go 69.23% <65.78%> (+7.98%) ⬆️
pkg/v1/google/list.go 70.81% <100.00%> (+0.64%) ⬆️
pkg/v1/remote/multi_write.go 62.92% <100.00%> (+3.72%) ⬆️
pkg/v1/remote/transport/error.go 100.00% <100.00%> (ø)
pkg/v1/remote/transport/retry.go 100.00% <100.00%> (ø)
pkg/v1/remote/transport/transport.go 100.00% <100.00%> (ø)
pkg/v1/remote/write.go 64.82% <100.00%> (+1.96%) ⬆️
pkg/v1/daemon/image.go 75.51% <0.00%> (-24.49%) ⬇️
pkg/legacy/tarball/write.go 67.18% <0.00%> (-1.16%) ⬇️
pkg/v1/mutate/image.go 69.72% <0.00%> (+0.49%) ⬆️
... and 4 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8388fde...4e8dccb. Read the comment docs.

@DennisDenuto
Copy link
Contributor Author

DennisDenuto commented Sep 10, 2021

@jonjohnsonjr

Yeah I think iterating over the API change to make it as minimal as needed makes lots of sense. Both from maintainability and from a usability perspective! With the changes above sounds like a trade-off between a simpler API vs breaking changes to existing APIs.

I played around with the idea of using WithTransport as a way to allow a consumer to configure retrying.

Another approach might be exposing more on the errors we return from transport

I did this by creating a new RetryError that wraps the transport error (if it exists), and exposes the status code and request.
I also made it implement the Temporary function to make it a compatible with the 'default' transport predicate. (This part of the change IMO feels the most awkward, since we are exposing higher level errors (http status code) to a lower level function call.)

I pushed the changes. I'd love some early feedback to make sure i'm properly understanding your idea to begin with!

This PR comes with breaking changes:

  • Using WithTransport(t) with a non remote.Transport no longer wraps retry, logging etc...
  • Retrying with the transport.Retry will return a RetryError (even if the RoundTripper did not return an error). It always returns a RetryError since it contains http data (status code) to allow specifying a Predicate to determine whether to retry or not.

Note:

Some tests are failing due to these breaking changes. But I wanted to get feedback before proceeding down this path.

@jonjohnsonjr
Copy link
Collaborator

I don't think we need RetryError if we just expose the existing request field on the transport.Error struct. Callers can check if an error is transport.Error and use that information. If it's some other kind of error, they can just do that in an "else" rather than using Inner.

Using WithTransport(t) with a non remote.Transport no longer wraps retry, logging etc...

This feels kinda backwards. I think if someone gives us a transport.Transport, we should avoid wrapping it at all. We could use that as a signal that the caller knows what they're doing and we should get out of their way.

It would still make sense to me to expose Backoff/Predicate options in remote for higher level things that cannot be retried at a request level, e.g. layer uploads. I'm not sure how I would want to structure those hooks, but there are a handful of places that might make sense to inject some kind of retry check.

I'd expect the default predicate to be somewhat conservative, but having the option would let callers do some pretty flexible things like only retry for certain methods or for certain paths.

- Useful by consumers providing their own Predicate to determine whether to retry or not

Authored-by: Dennis Leon <leonde@vmware.com>
@DennisDenuto
Copy link
Contributor Author

DennisDenuto commented Sep 10, 2021

if we just expose the existing request field on the transport.Error

hmmm. I don't see how transport.Error ever gets returned by a RoundTripper. I can however see how it is returned at the http req/resp layer via CheckError(...)

or put another way...

I don't see how a consumer providing their own Retry Transport would ever get an error of type transport.Error

Although I see how exposing the Request on the transport.Error can be useful in conjunction with exposing Backoff/Predicate options in remote for higher level http retries.

I think if someone gives us a transport.Transport, we should avoid wrapping it at all.

hmm ok, I mis-understood your code snippet above then

// Wraps with retries and useragent and debug logging
remote.Image(..., remote.WithTransport(remote.Transport{t}))

but thinking about it more, it does make sense that it should not wrap, and i think will result in this not being a breaking change. I updated the PR with this.

It would still make sense to me to expose Backoff/Predicate options in remote

I think this helps clarify things a lot for me.

I was trying to have transport.Retry provide a way to retry both low level transport errors and http level errors. I updated the PR to put back Backoff/Predicate options in remote :-)

I'm not sure how I would want to structure those hooks,

I made a first attempt at it. Essentially the writer has predicate and backoff as configurable options

… http retries

Authored-by: Dennis Leon <leonde@vmware.com>
@jonjohnsonjr
Copy link
Collaborator

hmm ok, I mis-understood your code snippet above then

Yeah I didn't love my code snippet, but what you've got here is actually a better solution, I think.

I think this helps clarify things a lot for me.

I was trying to have transport.Retry provide a way to retry both low level transport errors and http level errors. I updated the PR to put back Backoff/Predicate options in remote :-)

Ah yeah, sorry this wasn't more clear. There were two categories of retry stuff going on, but we kind of lumped them together. Current PR is pretty close to optimal, I think.


// Transport results in *not* wrapping supplied transport with additional logic such as retries, useragent and debug logging
// Consumers are opt-ing into providing their own transport without any additional wrapping.
type Transport struct {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think all that's left is to make the transport package actually return these.

I don't love the transport.Transport name. Maybe transport.Wrapper or something?

Another issue is that it's kind of a pain to actually construct a transport outside of the remote package (you need to know where layers came from, where you're pushing, etc). We could make that a lot more ergonomic by introducing a new transport implementation that defers those decisions until they're needed, like containerd (#666 (comment)).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep! changed it to transport.Wrapper

make the transport package actually return these.

I wasn't 100% sure what you meant here. I took this to mean NewWithContext should return transport.Wrapper. I pushed the change.

Another issue is that it's kind of a pain to actually construct a transport outside of the remote package (you need to know where layers came from, where you're pushing, etc). We could make that a lot more ergonomic by introducing a new transport implementation that defers those decisions until they're needed, like containerd (#666 (comment)).

hmm ok, is it possible to split introducing a new deferring transport implementation into a separate PR? Or do you see this as a blocker to this PR?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't 100% sure what you meant here. I took this to mean NewWithContext should return transport.Wrapper. I pushed the change.

Yep, that's perfect.

hmm ok, is it possible to split introducing a new deferring transport implementation into a separate PR? Or do you see this as a blocker to this PR?

Not a blocker, just thinking out loud.

if logs.Enabled(logs.Debug) {
o.transport = transport.NewLogger(o.transport)
}
if _, ok := o.transport.(*transport.Transport); !ok {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comment here and in google describing why we're doing things this way.

Also for the WithTransport option, we'll want to describe this workaround.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done!

@DennisDenuto DennisDenuto force-pushed the make-retry-configurable branch 2 times, most recently from b591fe2 to 8178e55 Compare September 11, 2021 00:01
- add test to assert behavior around using a transport.Wrapper results
in no additional wrapping such as retry is done.

refactoring

- add comments
- rename transport.Transport -> transport.Wrapper
- make transport package return transport.Wrapper

Authored-by: Dennis Leon <leonde@vmware.com>
@DennisDenuto
Copy link
Contributor Author

@jonjohnsonjr Just checking in to see if theres anything else needed to have this merged in?

@jonjohnsonjr
Copy link
Collaborator

Sorry I missed your follow-up commits!

One final thing I forgot to mention. If we're dealing with a transport.Wrapper, we probably also want to skip all of this:

pr, err := ping(ctx, reg, t)
if err != nil {
return nil, err
}
// Wrap t with a useragent transport unless we already have one.
if _, ok := t.(*userAgentTransport); !ok {
t = NewUserAgent(t, "")
}
// Wrap t in a transport that selects the appropriate scheme based on the ping response.
t = &schemeTransport{
scheme: pr.scheme,
registry: reg,
inner: t,
}

We can do a little introspection to determine if the the given transport can be reused by looking at reg to see if it's the same as what t already pinged. Then we can skip some pinging. For the bearer case, we probably want to merge the existing scopes with the given scopes like with:

// Add any scopes that we don't already request.
got := stringSet(bt.scopes)
for _, want := range scopes {
if _, ok := got[want]; !ok {
bt.scopes = append(bt.scopes, want)
}
}

@DennisDenuto
Copy link
Contributor Author

Thanks I appreciate the pointers @jonjohnsonjr

We can do a little introspection to determine if the the given transport can be reused by looking at reg to see if it's the same as what t already pinged

I did this by saving state ('pingedRegistries') in transport.Wrapper

For the bearer case, we probably want to merge the existing scopes with the given scopes like with:

I figured that the scopes should be tied to the registry that the transport pinged against. Hence scopes being a map on the wrapped transport.

@jonjohnsonjr
Copy link
Collaborator

jonjohnsonjr commented Sep 21, 2021

Okay so this might make you hate me, but I'm not thrilled with how this turned out after my suggestion (I know... I'm sorry).

Do you mind backing this out to the previous revision? We can just submit that because it's useful on its own, then I can revisit the ping optimizations when I have more time to look at it.

Again, sorry, I appreciate your patience.

@DennisDenuto
Copy link
Contributor Author

@jonjohnsonjr lol no worries. i've reverted that commit

fwiw I have enjoyed working on this PR and we get a ton of value from using this library. so more than happy to be patient and get things done right! 😄

pkg/v1/remote/options.go Outdated Show resolved Hide resolved
pkg/v1/remote/transport/transport.go Outdated Show resolved Hide resolved
- Consumers should construct a transport.Wrapper via constructor
transport.NewWithContext
- options retryBackoff and retryPredicate should only apply to http
errors and not lower level transport errors. (Consumers can still provide
a transport with the retry behavior they want)

Authored-by: Dennis Leon <leonde@vmware.com>
Copy link
Collaborator

@jonjohnsonjr jonjohnsonjr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for sticking with this. This looks good to me -- let's merge and let it bake for a bit while I try to figure out how we can get rid of extra pings :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants