Nomad does not support agent service health checks when tls.verify_https_client is enabled #6923

Open
schmichael opened this issue Jan 9, 2020 · 5 comments

Comments

@schmichael
Member

Nomad version

v0.10.2

Issue

The Nomad TLS doc at https://www.nomadproject.io/guides/security/encryption.html#network-isolation-with-tls says "Consul will not attempt to health check agents with verify_https_client set as it is unable to use client certificates." However, the Consul docs for health checks indicate that health checks do support TLS certificates. See https://www.consul.io/docs/agent/checks.html and https://www.consul.io/docs/agent/options.html#enable_agent_tls_for_checks.

We should take advantage of Consul's ability to use its certificates for doing health checks. It's possible, perhaps likely, that Consul's certificate is invalid for accessing Nomad, so we need an alternative method (e.g. Nomad could create a TTL check and heartbeat).

@troyweber

Is there a solution to this yet? Nomad health checks fail when I use mTLS with Consul.

@schmichael
Member Author

There is not a solution to this yet @troyweber. There are 2 big blockers:

  1. The Consul agent will need a certificate signed by the same CA as the Nomad agents' mTLS certificates to perform the check. It must not be the Nomad agent's client certificate, as that would grant Consul capabilities it should not have. For clusters that use the same CA for both Nomad and Consul mTLS certificates, that should Just Work. For clusters that do not, Nomad would need new agent configuration parameters to use a secondary Nomad certificate in the Consul health check. None of this is impossible, but there are a lot of opportunities, both in code and operations, to accidentally harm security.
  2. We could just use a TCP check to ensure the Nomad agent is alive. TCP checks against the HTTP port would spam errors in the agent logs. Nomad client agents don't listen on the RPC port, so we can't do a TCP check against it.

#1 seems like the best path, but it requires a significant amount of extra agent config plumbing and then operator effort to generate and distribute the extra certificate.
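To make the certificate requirement in #1 concrete, here is a minimal Go sketch of what a checker presenting a dedicated certificate would do. It is not Consul's or Nomad's implementation. The file paths, the idea of a separate "check" certificate file, and the use of 127.0.0.1:4646 are assumptions for illustration; the server certificate would also have to be valid for whatever address the check uses.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
	"net/http"
	"os"
	"time"
)

// newMTLSClient builds an HTTP client that trusts only the CA in caFile and
// presents the certificate/key pair in certFile/keyFile to the server.
func newMTLSClient(caFile, certFile, keyFile string) (*http.Client, error) {
	caPEM, err := os.ReadFile(caFile)
	if err != nil {
		return nil, fmt.Errorf("read CA: %w", err)
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, errors.New("no certificates found in CA file")
	}
	cert, err := tls.LoadX509KeyPair(certFile, keyFile)
	if err != nil {
		return nil, fmt.Errorf("load check certificate: %w", err)
	}
	return &http.Client{
		Timeout: 5 * time.Second,
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{
				RootCAs:      pool,
				Certificates: []tls.Certificate{cert},
				MinVersion:   tls.VersionTLS12,
			},
		},
	}, nil
}

func main() {
	// All three paths are hypothetical; the "check" certificate stands in for
	// the secondary certificate that would be signed by Nomad's CA.
	client, err := newMTLSClient(
		"/etc/nomad.d/tls/ca.pem",
		"/etc/nomad.d/tls/check.pem",
		"/etc/nomad.d/tls/check-key.pem",
	)
	if err != nil {
		fmt.Println("cannot build check client:", err)
		return
	}
	resp, err := client.Get("https://127.0.0.1:4646/v1/agent/health")
	if err != nil {
		fmt.Println("check failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("agent health status:", resp.Status)
}
```

The sketch only covers the client side. The operational cost described above (issuing, distributing, and rotating that extra certificate) is exactly the part it does not solve.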

I'm curious what folks are hoping to use this health check for? The only case where the health check would fail without the whole node having crashed is if the Nomad agent crashed. It's common to use a load balancer in front of Nomad agents to handle mTLS on the Nomad side, and load balancers should be able to detect and route around down Nomad agents more quickly than Consul would perform this health check and update the catalog.

For general cluster health monitoring, the client agent's metrics and logs provide far more detailed insight into the health of the agent than the simple HTTP liveness check.

So while I'd definitely love for this to Just Work, the reason we've avoided implementing #1 is that we couldn't justify the effort for the minimal value it seems to provide. Please let me know what you'd like to use this for!

@suikast42
Contributor

suikast42 commented Mar 11, 2023

Is there any progress on this issue?

@schmichael
Member Author

schmichael commented Jun 29, 2023

Option 3: Implement a script check that runs a (new?) Nomad subcommand that would be able to handle mTLS. Getting that working securely in a cross-platform way might be very tricky because the command would be configured by the Nomad agent but need access to a Nomad CLI certificate. Right now agents only know about their own agent certificate, which you would not want to expose to Consul. ACLs would be tricky as well, although we could also add a new agent-local token that is a locally generated UUID that only works for the health check endpoint on the local agent. ...lots of one-off plumbing for this though.

Option 4: Implementing a Unix domain socket for Nomad's Agent HTTP API (#17574) solves all of the ugly certificate handling issues. We would still need to handle the ACL token somehow, but the agent-local token might be an appealing solution here. An agent-local token might be a reasonable solution for securing our current unsecured metrics endpoint as well. Consul does not support Unix sockets for HTTP checks, so we would still need to switch to a script check as in option 3 above.
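A rough Go sketch of how options 3 and 4 could combine: a small command, run by Consul as a script check, that queries the agent health endpoint over a Unix domain socket. Nothing here exists in Nomad today. The socket path is hypothetical, and reading the agent-local token from `NOMAD_TOKEN` is only an assumption about how such a token might be passed in. The exit codes follow Consul's script check convention (0 passing, 1 warning, anything else critical).

```go
package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"os"
	"time"
)

// newUnixClient returns an HTTP client whose connections always go to the
// given Unix domain socket, whatever host appears in the request URL.
func newUnixClient(socketPath string) *http.Client {
	return &http.Client{
		Timeout: 5 * time.Second,
		Transport: &http.Transport{
			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
				var d net.Dialer
				return d.DialContext(ctx, "unix", socketPath)
			},
		},
	}
}

// exitCode maps an HTTP status to a script check exit code:
// 0 for any 2xx response (passing), 2 otherwise (critical).
func exitCode(status int) int {
	if status >= 200 && status < 300 {
		return 0
	}
	return 2
}

// check performs one health request and returns the exit code to report.
// The host "nomad" in the URL is a placeholder; the socket decides the target.
func check(client *http.Client, token string) int {
	req, err := http.NewRequest(http.MethodGet, "http://nomad/v1/agent/health", nil)
	if err != nil {
		fmt.Fprintln(os.Stderr, "build request:", err)
		return 2
	}
	if token != "" {
		req.Header.Set("X-Nomad-Token", token)
	}
	resp, err := client.Do(req)
	if err != nil {
		fmt.Fprintln(os.Stderr, "agent unreachable:", err)
		return 2
	}
	defer resp.Body.Close()
	return exitCode(resp.StatusCode)
}

func main() {
	// Hypothetical socket path; the agent would have to be configured to
	// listen there.
	client := newUnixClient("/var/run/nomad/agent.sock")
	os.Exit(check(client, os.Getenv("NOMAD_TOKEN")))
}
```

Because access is governed by file permissions on the socket, no certificate needs to be shared with Consul. The token handling is the remaining open question.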

@tgross
Member

tgross commented Jun 29, 2023

For a possible option 5, what about changing the check to a TTL check? That way the agent is heartbeating to Consul rather than the other way around. The agent already needs to have a cert for the local Consul agent anyways to do Consul API operations.
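The heartbeat direction of a TTL check could look roughly like the Go sketch below, which uses only the standard library and Consul's agent HTTP endpoint for marking a TTL check as passing. It is an illustration, not Nomad's code. The Consul address, the check ID `nomad-agent-ttl`, and the interval are assumptions, and the sketch presumes the TTL check has already been registered with Consul. A cluster with Consul mTLS or ACLs would also need a TLS-configured client and a token header, both omitted here.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/url"
	"time"
)

// passURL builds the Consul agent endpoint that marks a TTL check as passing.
func passURL(consulAddr, checkID string) string {
	return consulAddr + "/v1/agent/check/pass/" + url.PathEscape(checkID)
}

// heartbeat sends a single "pass" update for the TTL check.
func heartbeat(ctx context.Context, c *http.Client, consulAddr, checkID string) error {
	req, err := http.NewRequestWithContext(ctx, http.MethodPut, passURL(consulAddr, checkID), nil)
	if err != nil {
		return err
	}
	resp, err := c.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("consul returned %s", resp.Status)
	}
	return nil
}

// run heartbeats immediately and then once per interval until ctx is done.
// The interval should be comfortably shorter than the check's TTL.
func run(ctx context.Context, c *http.Client, consulAddr, checkID string, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if err := heartbeat(ctx, c, consulAddr, checkID); err != nil {
			fmt.Println("heartbeat failed:", err)
		}
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
	}
}

func main() {
	// Bounded to a few seconds so the sketch terminates; an agent would keep
	// heartbeating for its whole lifetime.
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	client := &http.Client{Timeout: 2 * time.Second}
	run(ctx, client, "http://127.0.0.1:8500", "nomad-agent-ttl", time.Second)
}
```

If the agent crashes, the heartbeats stop and Consul marks the check critical once the TTL expires, so no inbound connection to Nomad's mTLS-protected HTTP API is needed.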

@louievandyke louievandyke added the hcc/cst Admin - internal label Aug 18, 2023
6 participants