export resolver.obtainError? #793

dhduvall · 2019-02-09T00:08:57Z

I got a badNonce during Obtain(). Based on the logs, it looks like it failed before it got confirmation of validation: I see the five badNonce errors after Checking DNS record propagation, but not the The server validated our request message; if I GET the auth URL, it tells me that the status is valid.

Obtain() returns an error that, according to %T, is of type resolver.obtainError. If there's a way to "cast" this to the underlying map[string]error, I can't figure it out, which means I can't get at any of the underlying errors and either report them, store them, act on any data they may contain, etc. In this case, I think I'd be able to assert that the error value in the map is actually an acme.NonceError and retry more (though I don't think I have access to either the URL I'd need or the routines to operate on it).

The only workaround I can think of is to parse the error string. At least that's fairly nicely structured, but it's still kinda gross.

Can resolver.obtainError be made public? That should be non-breaking, yes? Are there other errors that might be useful, as well?

I guess what I'm really looking for is a way for lego to tell me that although it failed, the operation is retryable (the failure isn't terminal), and to give me a way to retry it from whatever point it failed. This would take care of #771, too, but is probably a huge change.

The text was updated successfully, but these errors were encountered:

dhduvall · 2019-02-09T00:24:19Z

This also means that when I'm mocking Obtain(), I don't have any way of making my mock return one of the actual errors it'll get, so my testing won't be especially complete. This probably speaks to exporting the type, regardless of whatever other immediate uses it might or might not have.

ldez · 2019-02-09T00:29:44Z

If you have an issue related to badNone this need to fixed inside lego.

I made the choice and spend time on v2 to not export obtainError (it was exported in v1), then I don't really want to change that.

dhduvall · 2019-02-09T00:40:35Z

It's not clear to me if I run into the 5-badNonce limit, if that really means something's wrong, and there's no point in trying again, or whether the error is still temporary, but may need more tries than that.

In some ways, I don't really care; doubling the limit to 10 might make it work for more cases, but I suspect that there will always be some scenario where there's no suitable number, and lego really has no choice but to fail. That's fine, but my code has to be able to deal with that situation, and I have to be able to write tests for my code that can exercise its handling of it.

Other than doing a bunch of string processing, I don't know how to do either with lego as it is now.

ldez · 2019-02-09T00:43:05Z

I need more context, because badNonce can have multiple sources.
Doubling the limit is not a solution for me.

dhduvall · 2019-02-09T00:49:37Z

What context are you looking for? I can give you the logs from that run. Is there anything else?

ldez · 2019-02-09T00:52:45Z

Do you have many SAN? Do you make concurrency calls? etc.

I need to understand how the issue appears for you.

dhduvall · 2019-02-09T01:19:02Z

This happened with two concurrent calls to Obtain(), each for an order containing two domains, one of which is a wildcard. That is, order 1 is for a.domain.com and *.a.domain.com, and order 2 is for b.domain.com and *.b.domain.com.

I hope to drive the concurrency up a lot (to anywhere between 50 and 500, probably), but have been testing with small batches so that the logs are easier to read.

I'm running this from a VM inside Google Cloud. It doesn't have a permanent external IP address, at least for IPv4; not sure whether it's ever connecting to LE via IPv6, and if so, what that looks like.

I do have the patch from #783 applied, and am using it to institute a two minute delay prior pre-check, and that seems to be working for me as I expect.

For the order that did work, the time from "preparing to solve" to "validated our request" was 2m18.5s for the wildcard, and 2m15s for the other, spanning a total of 2m19s. The "Wait for apply change" and "Wait for propagation" aren't tagged with their respective domains, but there aren't many of them; only two of the latter (suggesting that it didn't actually have to do any waiting), and six of the former.

There are actually six nonce errors in the logs; the first one is actually for a challenge URI for the order that ended up succeeding. About .15s later, there are five nonce errors in a row for an authorization URI from the failed order. They're spaced by .12s, .31, .19, and .24s, respectively (the "too many retry" error comes .2s later).

ldez · 2019-02-09T01:32:17Z

That was what I thought.

I know there is a concurrency problem that produces badNonce, for now I haven't found the best way to handle this yet.

I wanted to correct that for v2, I de-prioritized because I had no issue on the subject.

glasser · 2019-10-18T00:23:53Z

We have a service that uses LetsEncrypt to provide certificates for many different user apps. Some of our users misconfigure their DNS and so we are sometimes not able to provide them certificates.

We used a different Go ACME library (github.com/hlandau/acme/acmeapi) which does not support ACMEv2 (though the author may be fixing that right now?). Porting to Lego has overall been a great experience that let us delete the majority of our code.

However, our old implementation did parse the specific errors (which here you have as acme.ProblemDetails.Type) returned from LetsEncrypt for monitoring purposes. Some error types we wanted to get alerted on; others were just user misconfigurations and we wanted to display them in our dashboard to users but not alert us.

We can almost implement that with Lego, except that we can't break into the obtainError returned by Obtain to get the nested ProblemDetails.

mholt · 2019-10-18T03:05:13Z

@glasser Just curious, what is the service you're using? (I ask because several of us in the industry are currently drafting up a document about best practices for ACME clients, especially those which operate at scale.)

See go-acme#793

csandanov · 2022-03-12T13:21:32Z

I want to log original errors received from let's encrypt but since obtainError not exported, there's no way to cast the err to map[string]error and obtain the error message. Is there a reason it's not exported? I've worked with other contrib libraries that had a need to log multiple error messages and they usually have exported an custom error type with an additional map field that contains these messages

ldez · 2022-03-12T13:37:09Z

For me, the current obtainError type is still not exportable, this type was created a long time ago, the error system provided by Go has been improved.
I think I can create a new type that can be better for lego API consumers.

csandanov · 2022-04-18T11:05:14Z

any updates on this issue? or maybe a recommendations for a workaround?

csandanov · 2022-09-01T07:07:33Z

any updates?

ldez · 2023-08-18T10:17:40Z

sorry it takes more time than I was thinking, but now the underlying errors related to obtainError are usable.
See #1993

ldez added the area/dnsprovider label Feb 9, 2019

ldez mentioned this issue Feb 9, 2019

Uses a jittered exponential backoff #794

Merged

glasser added a commit to meteor/lego that referenced this issue Oct 18, 2019

Expose ObtainError

f1e3479

See go-acme#793

mholt mentioned this issue Apr 29, 2020

Replace the underlying ACME library caddyserver/certmagic#71

Closed

ldez self-assigned this Mar 12, 2022

ldez mentioned this issue Aug 18, 2023

chore: update to go1.20 #1993

Merged

ldez added the enhancement label Aug 18, 2023

ldez closed this as completed in #1993 Aug 19, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

export resolver.obtainError? #793

export resolver.obtainError? #793

dhduvall commented Feb 9, 2019

dhduvall commented Feb 9, 2019

ldez commented Feb 9, 2019

dhduvall commented Feb 9, 2019

ldez commented Feb 9, 2019 •

edited

dhduvall commented Feb 9, 2019

ldez commented Feb 9, 2019

dhduvall commented Feb 9, 2019

ldez commented Feb 9, 2019

glasser commented Oct 18, 2019

mholt commented Oct 18, 2019

csandanov commented Mar 12, 2022

ldez commented Mar 12, 2022

csandanov commented Apr 18, 2022

csandanov commented Sep 1, 2022

ldez commented Aug 18, 2023

export resolver.obtainError? #793

export resolver.obtainError? #793

Comments

dhduvall commented Feb 9, 2019

dhduvall commented Feb 9, 2019

ldez commented Feb 9, 2019

dhduvall commented Feb 9, 2019

ldez commented Feb 9, 2019 • edited

dhduvall commented Feb 9, 2019

ldez commented Feb 9, 2019

dhduvall commented Feb 9, 2019

ldez commented Feb 9, 2019

glasser commented Oct 18, 2019

mholt commented Oct 18, 2019

csandanov commented Mar 12, 2022

ldez commented Mar 12, 2022

csandanov commented Apr 18, 2022

csandanov commented Sep 1, 2022

ldez commented Aug 18, 2023

ldez commented Feb 9, 2019 •

edited