[feature] allow reload of Consul config stanza and Consul client #4593
I assume this only applies if you configured the Nomad agent to talk with the Consul agent over HTTPS (https://www.nomadproject.io/docs/agent/configuration/consul.html#ssl)? If you use plain HTTP (over localhost) and you've configured Consul with short-lived certificates then you'll only need to restart the Consul agent when certificate renewal occurs.
@rkettelerij you're correct.
Allowing reloading of the Consul config stanza would also allow for refreshing the Consul ACL token, which I suspect would be a more common use case.
I'd like to see both reloading of the ACL token and reloading of client certificates when reloading the Nomad agent. We are obligated to use verify_incoming on the Consul agents with short-lived certificates, and restarting the Nomad agent at every cert and/or token renewal is painful.
I'd really like this feature as well. Nomad is really lacking in support for reloading TLS configurations. Right now you can only update the tls configuration for the nomad agents themselves. That doesn't help you when your entire cluster (i.e., Nomad + Vault + Consul) uses the same root CA. You end up in a situation where you might not lose your quorum, but you can't actually schedule any work. There are many issues that I've run into which are all related to this core problem (#3247, #3746, #4413, #4593, #6052). I've been doing a little bit of digging and it seems like the reloading logic is scattered across the agent, client, and server code. So the reloading logic is very inconsistent across the board. From what I can gather, we seem to be in this state: Agent
Server
Clients
The other downside to all of this is that because Nomad has partial support for SIGHUP reloading, you'd think that you could use some combination of reloads + a full restart to refresh all of your TLS configuration, but if you don't orchestrate it right, you run into #3885. This is a major problem which really hurts the operational side of things. It's a shame because other HashiCorp tools like Vault and consul-template already support reloading TLS configurations via SIGHUP. I know that Nomad has far more configuration to update, but honestly this has been a big problem for at least the last 3 years I've used Nomad. Rolling your own PKI with Vault and using that in your hashistack cluster should be a best practice! I would really like to help address this problem, but I think it may require some significant refactoring. Any help would be greatly appreciated.
Couple notes - for those using:
It is crucial that Nomad reload these properties on a SIGHUP (as otherwise it requires an unacceptably risky restart vs what should be a simple reload):
Seeing as Consul Connect by design/default uses a CA which itself rotates frequently and autonomously, all three of key_file, cert_file, and ca_file should absolutely be reloaded.
For our use case, we bootstrap Consul/Nomad client agents via a preinstalled root CA cert (air-gapped, with several Vault clusters' signed as intermediaries w/
This is all well and good, until Nomad completely fails on an expired Consul client certificate, despite having a valid keypair available and configured (and ignoring it on reload).
Otherwise, is there an intention in the future to perhaps have Nomad work with Consul's
☝🏻 this, but on the Nomad end of things, would be pretty great 👍🏻
Any updates on intentions around this issue?
Would be great not to need to coordinate multiple services' safe restarts solely because Nomad doesn't reload its client TLS keypair on SIGHUP as it should.
Something that was not immediately obvious to me at first is that restarting the Nomad agent on a client does not interrupt existing allocs as long as the agent comes back up sufficiently quickly (perhaps within the heartbeat time which defaults to 10s). So gracefully reloading a Nomad client becomes less important.
The several services in this case aren't the workload, but rather Nomad/Vault/Consul sidecars, since these all have issues with selectively reloading client TLS agent credentials (whether it's nomad->consul, vault->consul, or Vault Agent not reloading credentials at all, see hashicorp/vault#8216). Yes, the workloads aren't usually directly affected, but it becomes a much more difficult problem when coordinating a restart of Nomad + Vault + Consul sidecars + mesh gateways + Vault Agent instances.
This sounds very similar to what I'm doing on my Nomad workers. Consul reloads TLS certificates on SIGHUP fine (or consul reload against the local agent), which is important because services would disappear briefly if Consul required a restart. So no coordination needed here. Restarting Vault Agent will cause secrets to be reissued. I could see this being an issue if Vault Agent has a listener running acting as a proxy, but if you don't use this pattern (I don't, and I don't think Nomad supports it, at least not for clients anyway) then no coordination is needed. In practice I very rarely need to restart Vault Agent since it's the piece that's getting credentials automatically for everything else, so there would be a bit of a dependency problem if it was trying to arrange its own credentials too. Maybe if you were using TLS client certificates for Vault (I'm using AppRole these days) I could see this being more likely, but I'd be curious to know how you're managing that TLS certificate at that point. And finally Nomad, which for clients needs a restart, but since it doesn't interrupt running allocs and reattaches fine, also no coordination needed. I think some better docs could be useful (especially something like "here's how you can run Nomad + Consul + Vault together with security options all turned on with Vault managing secrets", bonus points if Terraform provisions it all too) since I know I fell into some incorrect assumptions early on, so I imagine others would too, but I'm not seeing any hard blockers in today's capabilities?
Ended up having to abandon For those discovering this ticket when encountering these issues, I've found a periodic nomad batch job, making use of In an ideal world, all of these would reload their client tls keypairs, but currently don't:
Long and short of it: as it is today, there are several areas which prevent full process<->process mTLS when using Consul/Vault/Nomad, forcing restarts which require coordination. Even with coordination, there are still issues users will encounter; for example, mesh gateways will fail and be restarted/rescheduled, batch jobs can be killed/restarted, etc. Would reeeeeeeeally appreciate it if every process which uses client TLS keypairs also reloaded those keypairs; the biggest headache is when it's a 50/50 chance whether a reload actually reloads everything it should. If a client TLS keypair is needed, that TLS keypair comes with an expiry, so it is completely nonsensical to let an expired keypair stay in use when new credentials have already been reloaded in several other areas using the same keypair.
Cascading failures from Vault not reloading its Consul client TLS keypair can then cause every other downstream system to fail, miss renewals, and force a bootstrap of the TLS keypair before anything can go green again. Massively annoying when graceful reload/shutdown are common sense, particularly in long-lived services which depend on TLS and involve leadership/quorum considerations when requiring a restart. Why reload only 1/3 of the places valid TLS keypairs are required? Implementing a code path which requires TLS, but doesn't reload the keypair, means every other implementation which DOES support reloads is dead code, because it is useless when 2/3 of connections fail entirely without a restart.
If 100% of TLS keypairs in use were properly reloaded, end users wouldn't be needing to consider all of the following areas and whether or not they will need a full restart and how they affect each other:
This is still a problem; it would be great not to have to fully restart Nomad across a ridiculous number of instances every day just because TLS keypairs are only partially reloaded.
When using TLS for Consul connectivity, it is preferable to use low-TTL certificates with frequent renewal. Currently, renewing the Consul certificates requires restarting the Nomad client. It would be preferable if the Nomad client could use SIGHUP to reload the Consul client/config, as it already does for the Nomad TLS certificates. This would be a much less intrusive operation and involve less risk.