Nomad does not support agent service health checks when tls.verify_https_client is enabled #6923

schmichael · 2020-01-09T16:57:50Z

Nomad version

v0.10.2

Issue

The Nomad TLS doc at https://www.nomadproject.io/guides/security/encryption.html#network-isolation-with-tls says "Consul will not attempt to health check agents with verify_https_client set as it is unable to use client certificates." However, the Consul docs for health checks indicate that health checks do support TLS certificates (see https://www.consul.io/docs/agent/checks.html and https://www.consul.io/docs/agent/options.html#enable_agent_tls_for_checks).

We should take advantage of Consul's ability to use its certificates for doing health checks. It's possible, perhaps likely, that Consul's certificate is invalid for accessing Nomad, so we would also need an alternative method (e.g. Nomad could create a TTL check and heartbeat).

troyweber · 2021-06-24T03:19:06Z

Is there a solution to this yet? Nomad health checks fail when I use mTLS with Consul.

schmichael · 2022-04-27T17:02:01Z

There is not a solution to this yet @troyweber. There are 2 big blockers:

1. The Consul agent will need a certificate signed by the same CA as the Nomad agents' mTLS certificates to perform the check. It must not be the Nomad agent's client certificate, as that would grant Consul capabilities it should not have. For clusters that use the same CA for both Nomad and Consul mTLS certificates, that should Just Work. For clusters that do not, Nomad would need new agent configuration parameters to use a secondary Nomad certificate in the Consul health check. None of this is impossible, but there are a lot of opportunities, both in code and in operations, to accidentally harm security.
2. We could just use a TCP check to ensure the Nomad agent is alive. However, TCP checks against the HTTP port would spam errors in the agent logs, and Nomad client agents don't listen on the RPC port, so we can't do a TCP check against it.

#1 seems like the best path, but it requires a significant amount of extra agent config plumbing and then operator effort to generate and distribute the extra certificate.

I'm curious what folks are hoping to use this health check for? The only case where the health check would fail without the whole node having crashed is if the Nomad agent crashed. It's common to use a load balancer in front of Nomad agents to handle mTLS on the Nomad side, and load balancers should be able to detect and route around down Nomad agents more quickly than Consul would perform this health check and update the catalog.

For general cluster health monitoring, the client agent's metrics and logs provide far more detailed insight into the health of the agent than the simple HTTP liveness check.

So while I'd definitely love for this to Just Work, the reason we've avoided implementing #1 is that we couldn't justify the effort for the minimal value it seems to provide. Please let me know what you'd like to use this for!

suikast42 · 2023-03-11T22:15:51Z

Is there any progress on this issue?

schmichael · 2023-06-29T17:43:01Z

Option 3: Implement a script check that runs a (new?) Nomad subcommand that would be able to handle mTLS. Getting that working securely in a cross-platform way might be very tricky because the command would be configured by the Nomad agent but would need access to a Nomad CLI certificate. Right now agents only know about their own agent certificate, which you would not want to expose to Consul. ACLs would be tricky as well, although we could also add a new agent-local token: a locally generated UUID that only works for the health check endpoint on the local agent. That is a lot of one-off plumbing, though.

Option 4: Implementing a Unix domain socket for Nomad's Agent HTTP API (#17574) solves all of the ugly certificate handling issues. We would still need to handle the ACL token somehow, but the agent-local token might be an appealing solution here. An agent-local token might be a reasonable solution for securing our current unsecured metrics endpoint as well. Consul does not support Unix sockets for HTTP checks, so we would still need to switch to a script check as in option 3 above.

tgross · 2023-06-29T18:27:29Z

For a possible option 5, what about changing the check to a TTL check? That way the agent is heartbeating to Consul rather than the other way around. The agent already needs to have a cert for the local Consul agent anyways to do Consul API operations.

schmichael added type/enhancement theme/core theme/consul labels Jan 9, 2020

tgross added this to Needs Roadmapping in Nomad - Community Issues Triage Feb 12, 2021

tgross removed this from Needs Roadmapping in Nomad - Community Issues Triage Mar 4, 2021

jrasell added the theme/service-discovery/consul label Apr 11, 2022

louievandyke added the hcc/cst Admin - internal label Aug 18, 2023
