Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/crypto/acme: ACME client's internal retry implementation results in hanging retries on 429s #40376

Open
viola opened this issue Jul 23, 2020 · 15 comments · May be fixed by golang/crypto#149
Open

Comments

@viola
Copy link

@viola viola commented Jul 23, 2020

What version of Go are you using (go version)?

$ go version
go version go1.14.4 darwin/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN="/Users/viola/go/bin"
GOCACHE="/Users/viola/Library/Caches/go-build"
GOENV="/Users/viola/Library/Application Support/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOINSECURE=""
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/viola/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/Cellar/go/1.14.4/libexec"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/Cellar/go/1.14.4/libexec/pkg/tool/darwin_amd64"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/Users/viola/crypto/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/0s/nyt41p_j69d8vzq1qfkg5nrh0000gn/T/go-build384678003=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

Start off with triggering a rate-limit, to get a 429 from Let's Encrypt. Here's how I triggered one by going over the default 5 per week duplicate certificate limit set by Let's Encrypt :

  • Call AuthorizeOrder with a valid AuthzID identifier.
  • Repeat the request 6 times.
  • On the 6th attempt, the request winds up hanging with a much longer execution. It wound up hitting a timeout set up an upstream caller at 45 seconds.
  • Unless the context is canceled or timed-out, or a "non-retriable error status" is received, retries are indefinite. In this case, the client never returns, because 429 is a retriable error code, and this will be a somewhat terminal state (for the rest of the week).
  • Here's a look at the 429 that the client was receiving and retrying on in this example.
429 urn:ietf:params:acme:error:rateLimited: Error creating new order :: too many certificates already issued for exact set of domains: violababola.best: see https://letsencrypt.org/docs/rate-limits/

What did you expect to see?

  • Fail fast and not to retry on the client side error => http.StatusTooManyRequests since this is a not recoverable response error code. At least, the facility to configure this.
  • Also, that there is some bound on the number of retries.

What did you see instead?

Retries on the client side error => http.StatusTooManyRequests
It seems like http.StatusTooManyRequests is something that should not be considered as a retriable response error code as it's not recoverable.

Related PR Fix

  • This PR introduces a custom ShouldRetry func option that can be set on the ACME client to allow the default set of retriable response error codes to be overridden. This will keep the current behaviour backwards compatible, but provide more flexible retry configuration.
@cagedmantis
Copy link
Contributor

@cagedmantis cagedmantis commented Jul 27, 2020

@cagedmantis
Copy link
Contributor

@cagedmantis cagedmantis commented Jul 28, 2020

Hi @viola! Thank you for contributing to the Go project and welcome.

@viola
Copy link
Author

@viola viola commented Jul 28, 2020

Hi @cagedmantis, good talking to you again! Please let me know if there is anything else I can elaborate on to help this issue move forward.
PS. Go cubs go!

@icholy
Copy link

@icholy icholy commented Jul 28, 2020

Related issue #40161

@andrewlouis93
Copy link

@andrewlouis93 andrewlouis93 commented Aug 4, 2020

cc @FiloSottile 👋 Just doing a gentle nudge on this one 👀

@andrewlouis93
Copy link

@andrewlouis93 andrewlouis93 commented Sep 7, 2020

Hello folks and apologies for the ping @FiloSottile @x1ddos. I just wanted to check in on this and see if we could move it forwards? The PR that fixes this is ready for review: golang/crypto#149

Please do let me know if there's anything I can do to help trudge this along.

@alicethorne-ab
Copy link

@alicethorne-ab alicethorne-ab commented Oct 26, 2020

Hi all. I'm a developer at @1Password and we're currently experiencing this exact issue with the library. It'd be a great help if we could get golang/crypto#149 moved along and merged before the upcoming code freeze. Please let me know if any testing is needed to assist with that.

@andrewlouis93
Copy link

@andrewlouis93 andrewlouis93 commented Oct 27, 2020

cc @FiloSottile if you have some time, would love to pick this one up. Especially since there is more interest now ☝️

@ZhiminXiang
Copy link

@ZhiminXiang ZhiminXiang commented Oct 27, 2020

I also hit this issue. Could the fix be prioritized? :)

@viola
Copy link
Author

@viola viola commented Oct 27, 2020

@icholy @alicethorne-ab @ZhiminXiang thanks for letting me know you've hit the same issue. This farther validates that our golang/crypto#149 fix would be really nice to bring over to crypto.
@FiloSottile @x1ddos @cagedmantis folks any eyes on that PR would be greatly appreciated! Please let me know if there is anything I can do to help. ❤️

@ghost
Copy link

@ghost ghost commented Nov 19, 2020

I would love to see this fixed since 429 is being returned in a case where you will need to retry for days before it would succeed. That doesn't seem to match the purpose of 429.

@rolandshoemaker
Copy link
Member

@rolandshoemaker rolandshoemaker commented Nov 24, 2020

Hey @viola, sorry it has taken so long to address this. I agree this isn't the correct behavior, as far as I am aware there is only one class of 429 returned from most ACME servers that is likely to be retry-able in the short term (an overall req/s limit). Rather than expanding the API surface of the client I think it makes sense to just remove the behavior of retrying on 429 responses in general.

@gopherbot
Copy link

@gopherbot gopherbot commented Nov 24, 2020

Change https://golang.org/cl/272927 mentions this issue: x/crypto/acme: only retry GET requests on 429

@viola
Copy link
Author

@viola viola commented Nov 24, 2020

@rolandshoemaker thank you so much for looking into it for me! While https://golang.org/cl/272927 will address my original issue with retrying on 429 in AuthorizeOrder that uses post I still think there is a value in my proposed extension for the api. Mainly ability for the caller to control what to retry on if desired or needed. Given that golang/crypto#149 is not a breaking change would you still consider it? Once again, thank you for taking the time to review my issue and PR! <3

@rolandshoemaker
Copy link
Member

@rolandshoemaker rolandshoemaker commented Nov 25, 2020

Typically when we extend the public API of a package we go through a proposal process before moving to a CL as once the API is landed it becomes much harder to make any changes (although less so for x/ packages, it is still something we want to follow). If you could write up the proposed API for this change in this issue it would help clear up a few things. It doesn't need to be super detailed, just what the public methods would look like, and what the use case is.

In particular I'd be interested in what you see the use cases for this API would be, beyond the initial 429 one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

8 participants