Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

export resolver.obtainError? #793

Closed
dhduvall opened this issue Feb 9, 2019 · 15 comments · Fixed by #1993
Closed

export resolver.obtainError? #793

dhduvall opened this issue Feb 9, 2019 · 15 comments · Fixed by #1993

Comments

@dhduvall
Copy link
Contributor

dhduvall commented Feb 9, 2019

I got a badNonce during Obtain(). Based on the logs, it looks like it failed before it got confirmation of validation: I see the five badNonce errors after Checking DNS record propagation, but not the The server validated our request message; if I GET the auth URL, it tells me that the status is valid.

Obtain() returns an error that, according to %T, is of type resolver.obtainError. If there's a way to "cast" this to the underlying map[string]error, I can't figure it out, which means I can't get at any of the underlying errors and either report them, store them, act on any data they may contain, etc. In this case, I think I'd be able to assert that the error value in the map is actually an acme.NonceError and retry more (though I don't think I have access to either the URL I'd need or the routines to operate on it).

The only workaround I can think of is to parse the error string. At least that's fairly nicely structured, but it's still kinda gross.

Can resolver.obtainError be made public? That should be non-breaking, yes? Are there other errors that might be useful, as well?

I guess what I'm really looking for is a way for lego to tell me that although it failed, the operation is retryable (the failure isn't terminal), and to give me a way to retry it from whatever point it failed. This would take care of #771, too, but is probably a huge change.

@dhduvall
Copy link
Contributor Author

dhduvall commented Feb 9, 2019

This also means that when I'm mocking Obtain(), I don't have any way of making my mock return one of the actual errors it'll get, so my testing won't be especially complete. This probably speaks to exporting the type, regardless of whatever other immediate uses it might or might not have.

@ldez
Copy link
Member

ldez commented Feb 9, 2019

If you have an issue related to badNone this need to fixed inside lego.

I made the choice and spend time on v2 to not export obtainError (it was exported in v1), then I don't really want to change that.

@dhduvall
Copy link
Contributor Author

dhduvall commented Feb 9, 2019

It's not clear to me if I run into the 5-badNonce limit, if that really means something's wrong, and there's no point in trying again, or whether the error is still temporary, but may need more tries than that.

In some ways, I don't really care; doubling the limit to 10 might make it work for more cases, but I suspect that there will always be some scenario where there's no suitable number, and lego really has no choice but to fail. That's fine, but my code has to be able to deal with that situation, and I have to be able to write tests for my code that can exercise its handling of it.

Other than doing a bunch of string processing, I don't know how to do either with lego as it is now.

@ldez
Copy link
Member

ldez commented Feb 9, 2019

I need more context, because badNonce can have multiple sources.
Doubling the limit is not a solution for me.

@dhduvall
Copy link
Contributor Author

dhduvall commented Feb 9, 2019

What context are you looking for? I can give you the logs from that run. Is there anything else?

@ldez
Copy link
Member

ldez commented Feb 9, 2019

Do you have many SAN? Do you make concurrency calls? etc.

I need to understand how the issue appears for you.

@dhduvall
Copy link
Contributor Author

dhduvall commented Feb 9, 2019

This happened with two concurrent calls to Obtain(), each for an order containing two domains, one of which is a wildcard. That is, order 1 is for a.domain.com and *.a.domain.com, and order 2 is for b.domain.com and *.b.domain.com.

I hope to drive the concurrency up a lot (to anywhere between 50 and 500, probably), but have been testing with small batches so that the logs are easier to read.

I'm running this from a VM inside Google Cloud. It doesn't have a permanent external IP address, at least for IPv4; not sure whether it's ever connecting to LE via IPv6, and if so, what that looks like.

I do have the patch from #783 applied, and am using it to institute a two minute delay prior pre-check, and that seems to be working for me as I expect.

For the order that did work, the time from "preparing to solve" to "validated our request" was 2m18.5s for the wildcard, and 2m15s for the other, spanning a total of 2m19s. The "Wait for apply change" and "Wait for propagation" aren't tagged with their respective domains, but there aren't many of them; only two of the latter (suggesting that it didn't actually have to do any waiting), and six of the former.

There are actually six nonce errors in the logs; the first one is actually for a challenge URI for the order that ended up succeeding. About .15s later, there are five nonce errors in a row for an authorization URI from the failed order. They're spaced by .12s, .31, .19, and .24s, respectively (the "too many retry" error comes .2s later).

@ldez
Copy link
Member

ldez commented Feb 9, 2019

That was what I thought.

I know there is a concurrency problem that produces badNonce, for now I haven't found the best way to handle this yet.

I wanted to correct that for v2, I de-prioritized because I had no issue on the subject.

@glasser
Copy link

glasser commented Oct 18, 2019

We have a service that uses LetsEncrypt to provide certificates for many different user apps. Some of our users misconfigure their DNS and so we are sometimes not able to provide them certificates.

We used a different Go ACME library (github.com/hlandau/acme/acmeapi) which does not support ACMEv2 (though the author may be fixing that right now?). Porting to Lego has overall been a great experience that let us delete the majority of our code.

However, our old implementation did parse the specific errors (which here you have as acme.ProblemDetails.Type) returned from LetsEncrypt for monitoring purposes. Some error types we wanted to get alerted on; others were just user misconfigurations and we wanted to display them in our dashboard to users but not alert us.

We can almost implement that with Lego, except that we can't break into the obtainError returned by Obtain to get the nested ProblemDetails.

@mholt
Copy link
Contributor

mholt commented Oct 18, 2019

@glasser Just curious, what is the service you're using? (I ask because several of us in the industry are currently drafting up a document about best practices for ACME clients, especially those which operate at scale.)

glasser added a commit to meteor/lego that referenced this issue Oct 18, 2019
@csandanov
Copy link

I want to log original errors received from let's encrypt but since obtainError not exported, there's no way to cast the err to map[string]error and obtain the error message. Is there a reason it's not exported? I've worked with other contrib libraries that had a need to log multiple error messages and they usually have exported an custom error type with an additional map field that contains these messages

@ldez
Copy link
Member

ldez commented Mar 12, 2022

For me, the current obtainError type is still not exportable, this type was created a long time ago, the error system provided by Go has been improved.
I think I can create a new type that can be better for lego API consumers.

@ldez ldez self-assigned this Mar 12, 2022
@csandanov
Copy link

any updates on this issue? or maybe a recommendations for a workaround?

@csandanov
Copy link

any updates?

@ldez
Copy link
Member

ldez commented Aug 18, 2023

sorry it takes more time than I was thinking, but now the underlying errors related to obtainError are usable.
See #1993

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging a pull request may close this issue.

5 participants