Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pd: 🙏 propagate errors from ACME worker #4652

Merged
merged 1 commit into from
Jun 21, 2024

Conversation

cratelyn
Copy link
Contributor

when auto-https is enabled, we spawn a task in the background to handle https certificate resolution, via rustls_acme::AcmeState.

if that task encounters errors, they should be propagated up to the daemon, so that pd does not rapidly retry lookups, potentially hitting rate-limits and causing service interruptions.

this changes the pd entrypoint, binding the [JoinHandle] to a variable and polling upon that future in the tokio::select block that represents the core steady-state event loop of the daemon.

checklist before requesting a review

  • If this code contains consensus-breaking changes, I have added the "consensus-breaking" label. Otherwise, I declare my belief that there are not consensus-breaking changes, for the following reason:

    while this does change node behavior, none of the components changed are consensus-critical.

@cratelyn cratelyn added A-node Area: System design and implementation for node software C-enhancement Category: an enhancement to the codebase labels Jun 21, 2024
@cratelyn cratelyn self-assigned this Jun 21, 2024
@cratelyn cratelyn requested a review from conorsch June 21, 2024 16:41
@cratelyn cratelyn marked this pull request as ready for review June 21, 2024 16:57
@conorsch
Copy link
Contributor

First attempt on this diff to join preview with an intentionally incorrect auto-https value failed to exit:

2024-06-21T19:40:27.521161Z DEBUG acme_worker: rustls_acme::acme: response: "{\n  \"identifier\": {\n    \"type\": \"dns\",\n    \"value\": \"reindex-spike-4.testnet.plinfra.net\"\n  },\n  \"status\": \"invalid\",\n  \"expires\": \"2024-06-28T19:40:25Z\",\n  \"challenges\": [\n    {\n      \"type\": \"tls-alpn-01\",\n      \"url\": \"https://acme-staging-v02.api.letsencrypt.org/acme/chall-v3/12857105223/rG5gHg\",\n      \"status\": \"invalid\",\n  \"validated\": \"2024-06-21T19:40:26Z\",\n      \"error\": {\n        \"type\": \"urn:ietf:params:acme:error:dns\",\n        \"detail\": \"DNS problem: NXDOMAIN looking up A for reindex-spike-4.testnet.plinfra.net - check that a DNS record exists for this domain; DNS problem: NXDOMAIN looking up AAAA for reindex-spike-4.testnet.plinfra.net - check that a DNS record exists for this domain\",\n        \"status\": 400\n      },\n      \"token\": \"d1Cv86DI5UzIuR7k3bqlfb-al3LQO3Qk1G6S3ASRSGo\",\n      \"validationRecord\": [\n        {\n          \"hostname\": \"reindex-spike-4.testnet.plinfra.net\",\n          \"port\": \"443\"\n        }\n      ]\n    }\n  ]\n}"

2024-06-21T19:40:27.522013Z ERROR acme_worker: penumbra_auto_https: acme error: Order(BadAuth(Auth { status: Invalid, identifier: Dns("reindex-spike-4.testnet.plinfra.net"), challenges: [Challenge { typ: TlsAlpn01, url: "https://acme-staging-v02.api.letsencrypt.org/acme/chall-v3/12857105223/rG5gHg", token: "d1Cv86DI5UzIuR7k3bqlfb-al3LQO3Qk1G6S3ASRSGo", error: Some(Problem { typ: Some("urn:ietf:params:acme:error:dns"), detail: Some("DNS problem: NXDOMAIN looking up A for reindex-spike-4.testnet.plinfra.net - check that a DNS record exists for this domain; DNS problem: NXDOMAIN looking up AAAA for reindex-spike-4.testnet.plinfra.net - check that a DNS record exists for this domain") }) }] }))

Desired behavior is for pd to exit. Trying again with the following patch, pd did indeed exit when encountering the error:

2024-06-21T19:46:50.741166Z DEBUG acme_worker: rustls_acme::acme: response: "{\n  \"identifier\": {\n    \"type\": \"dns\",\n    \"value\": \"reindex-spike-4.testnet.plinfra.net\"\n  },\n  \"status\": \"invalid\",\n  \"expires\": \"2024-06-28T19:46:49Z\",\n  \"challenges\": [\n    {\n      \"type\": \"tls-alpn-01\",\n      \"url\": \"https://acme-staging-v02.api.letsencrypt.org/acme/chall-v3/12857168003/H3I5Jg\",\n      \"status\": \"invalid\",\n      \"validated\": \"2024-06-21T19:46:49Z\",\n      \"error\": {\n        \"type\": \"urn:ietf:params:acme:error:dns\",\n        \"detail\": \"DNS problem: NXDOMAIN looking up A for reindex-spike-4.testnet.plinfra.net - check that a DNS record exists for this domain; DNS problem: NXDOMAIN looking up AAAA for reindex-spike-4.testnet.plinfra.net - check that a DNS record exists for this domain\",\n        \"status\": 400\n      },\n      \"token\": \"ER6-Fug4ZM4ho71PgTaniupKdI6tRQdVFDnzZEhogBw\",\n      \"validationRecord\": [\n        {\n          \"hostname\": \"reindex-spike-4.testnet.plinfra.net\",\n          \"port\": \"443\"\n        }\n      ]\n    }\n  ]\n}"
2024-06-21T19:46:50.741279Z ERROR acme_worker: penumbra_auto_https: acme error: Order(BadAuth(Auth { status: Invalid, identifier: Dns("reindex-spike-4.testnet.plinfra.net"), challenges: [Challenge { typ: TlsAlpn01, url: "https://acme-staging-v02.api.letsencrypt.org/acme/chall-v3/12857168003/H3I5Jg", token: "ER6-Fug4ZM4ho71PgTaniupKdI6tRQdVFDnzZEhogBw", error: Some(Problem { typ: Some("urn:ietf:params:acme:error:dns"), detail: Some("DNS problem: NXDOMAIN looking up A for reindex-spike-4.testnet.plinfra.net - check that a DNS record exists for this domain; DNS problem: NXDOMAIN looking up AAAA for reindex-spike-4.testnet.plinfra.net - check that a DNS record exists for this domain") }) }] }))
2024-06-21T19:46:50.741444Z ERROR pd: acme worker failed: exiting due to acme error
Error: acme worker failed: exiting due to acme error
patch
diff --git a/crates/util/auto-https/src/lib.rs b/crates/util/auto-https/src/lib.rs
index 6c98b8aaa..27a4bd42f 100644
--- a/crates/util/auto-https/src/lib.rs
+++ b/crates/util/auto-https/src/lib.rs
@@ -70,7 +70,11 @@ where
     loop {
         match state.next().await {
             Some(Ok(ok)) => tracing::debug!("received acme event: {:?}", ok),
-            Some(Err(err)) => tracing::error!("acme error: {:?}", err),
+            Some(Err(err)) => {
+                tracing::error!("acme error: {:?}", err);
+                anyhow::bail!("exiting due to acme error");
+            }
             None => {
                 debug_assert!(false, "acme worker unexpectedly reached end-of-stream");
                 tracing::error!("acme worker unexpectedly reached end-of-stream");

This is a huge improvement over prior behavior! Will do a bit more testing before calling it final.

@conorsch conorsch force-pushed the kate/pd-auto-https-errors-are-propagated branch from d386312 to 1b8f5d6 Compare June 21, 2024 19:55
when auto-https is enabled, we spawn a task in the background to handle
https certificate resolution, via `rustls_acme::AcmeState`.

if that task encounters errors, they should be propagated up to the
daemon, so that `pd` does not rapidly retry lookups, potentially hitting
rate-limits and causing service interruptions.

this changes the `pd` entrypoint, binding the [`JoinHandle`] to a
variable and polling upon that future in the `tokio::select` block that
represents the core steady-state event loop of the daemon.

we also update the acme_worker loop to log-and-bail on error, ensuring
that pd exits when the error hits the select loop in pd main.

Co-Authored-By: Conor Schaefer <conor@penumbralabs.xyz>
@conorsch conorsch force-pushed the kate/pd-auto-https-errors-are-propagated branch from 1b8f5d6 to 21cefbd Compare June 21, 2024 20:56
Copy link
Contributor

@conorsch conorsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for sprucing this up, much more confident about recommending this behavior as a first-run experience!

@conorsch conorsch merged commit adf260b into main Jun 21, 2024
13 checks passed
@conorsch conorsch deleted the kate/pd-auto-https-errors-are-propagated branch June 21, 2024 21:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-node Area: System design and implementation for node software C-enhancement Category: an enhancement to the codebase
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

2 participants