Vault 1.11 Multi-Issuer CA breaks Connect CA Intermediate CAs (<-> Vault Provider) #15217

Closed
exo-cedric opened this issue Nov 1, 2022 · 16 comments
Labels
theme/consul-vault Relating to Consul & Vault interactions type/bug Feature does not function as expected type/docs Documentation needs to be created/updated/clarified

Comments

@exo-cedric

Overview of the Issue

The Vault 1.11-introduced Multi-Issuer feature breaks Consul Connect CA (Vault Provider).

When the time comes to issue a new Intermediate CA (ICA, governed by IntermediateCertTTL), Consul successfully calls pki/.../root/sign-intermediate and pki/.../intermediate/set-signed but fails to pick up the new ICA, because it never switches the default issuer.

Reproduction Steps

  1. Set up Consul Connect CA with the Vault provider
  2. Set a sufficiently short IntermediateCertTTL for the sake of the test (we stumbled on the issue using 168h)
  3. Wait until the Intermediate CA is due for renewal (~50% of TTL IIRC)
  4. Observe Vault being asked for, and issuing, a new Intermediate CA every hour (in audit logs, "path":"pki/.../root/sign-intermediate/")
  5. Observe that Consul nevertheless keeps using the old Intermediate CA (and keeps growing its IntermediateCerts list)
  6. Eventually, Consul fails to renew leaf certificates with the error message: PUT https://vault:8200/v1/pki/.../sign/leaf-cert\nCode: 400. Errors:\n\n* cannot satisfy request, as TTL would result in notAfter 2022-11-03T16:21:21.775770518Z that is beyond the expiration of the CA certificate at 2022-11-02T14:56:33Z

(Another tell-tale sign of the issue is the error message "tls: handshake message of length 118637 bytes exceeds maximum of 65536 bytes", presumably caused by the nonsensically ever-growing Intermediate CA list.)
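
The failure in step 6 comes down to a date comparison: Vault refuses to sign a leaf whose notAfter would fall after the notAfter of the issuing CA. A minimal shell sketch of that check, assuming epoch-second inputs; the function name leaf_fits_ca is made up for illustration and this is not Vault's actual implementation:

```shell
# Hypothetical helper (not part of Vault or Consul): succeed only if a leaf
# issued at NOW with the given TTL would expire no later than the CA does.
# All arguments are integers: epoch seconds, and a TTL in seconds.
leaf_fits_ca() {
  local now="$1" leaf_ttl="$2" ca_not_after="$3"
  [ $(( now + leaf_ttl )) -le "$ca_not_after" ]
}
```

Assuming GNU date, the inputs can be derived from the values in the error message, e.g. `leaf_fits_ca "$(date -u +%s)" $((72 * 3600)) "$(date -u -d '2022-11-02T14:56:33Z' +%s)"`, which returns non-zero once fewer than 72 hours remain on the default issuer.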

Consul info for both Client and Server

Consul Server and Agent: 1.12.3
Vault Server: 1.11.4

Logs etc.

Problem:

## On Vault Server

# vault list pki/consul-connect/issuers | wc -l
152

# vault read pki/consul-connect/config/issuers
Key        Value
---        -----
default    4c96d2a0-cf0e-bbc9-0fc6-394bff16a3dd

# vault read -field=certificate pki/consul-connect/issuer/4c96d2a0-cf0e-bbc9-0fc6-394bff16a3dd/json | openssl x509 -noout -issuer -subject -dates
issuer=C = CH, L = ZRH1, O = Exoscale, OU = Consul, CN = Exoscale Consul CA (ZRH1)
subject=CN = pri-lyjdnh2.vault.ca.b9a7d174.consul
notBefore=Oct 19 14:56:03 2022 GMT
notAfter=Nov  2 14:56:33 2022 GMT


## On Consul Agent

# wget -qO- http://127.0.0.1:8500/v1/connect/ca/roots | jq -r '.Roots[]|.IntermediateCerts[]' | ... | openssl x509 ...
issuer=C = CH, L = ZRH1, O = Exoscale, OU = Consul, CN = Exoscale Consul CA (ZRH1)
subject=CN = pri-lyjdnh2.vault.ca.b9a7d174.consul
notBefore=Oct 19 14:56:03 2022 GMT
notAfter=Nov  2 14:56:33 2022 GMT

Note the notAfter=Nov 2 14:56:33 2022 GMT: that Intermediate CA should have been renewed/rotated long ago (and it can no longer issue new 72h leaf certificates).

Code base

I believe the issue lies in those portions of the code base:

More specifically, in the ActivateIntermediate function, which ought to PUT pki/.../config/issuers with the ID of the new Intermediate CA.
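
For illustration, the missing step corresponds to one extra write against the mount's config/issuers endpoint, the same endpoint read with `vault read pki/consul-connect/config/issuers` in the logs above. A rough sketch with curl, assuming VAULT_ADDR and VAULT_TOKEN are exported; the function name and the CURL override (for dry runs) are made up for this example, and this is not Consul's actual code:

```shell
# Hypothetical sketch: point a PKI mount's default issuer at a given issuer ID.
# Assumes VAULT_ADDR and VAULT_TOKEN are exported. CURL can be overridden
# (e.g. CURL=echo) to print the request, token included, instead of sending it.
set_default_issuer() {
  local mount="$1" issuer_id="$2"
  "${CURL:-curl}" --silent --fail --request PUT \
    --header "X-Vault-Token: ${VAULT_TOKEN}" \
    --data "{\"default\":\"${issuer_id}\"}" \
    "${VAULT_ADDR}/v1/${mount}/config/issuers"
}
```

Usage would be along the lines of `set_default_issuer pki/consul-connect <new-issuer-id>`, run right after the set-signed call has returned the new issuer.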

Validation

We have validated the above hypothesis by:

  • manually changing the default issuer to the most recent one
  • restarting the Consul servers
  • observing that the IntermediateCerts list reflected the change (and was back to a single entry)

## On Vault Server

# for issuer in $(vault list pki/consul-connect/issuers | grep '^[0-9a-f]'); do echo "$(vault read -field=certificate pki/consul-connect/issuer/${issuer}/json | openssl x509 -noout -enddate) ${issuer}"; done | sort | tail -n 1
notAfter=Nov 14 17:08:08 2022 GMT e97d39ec-d3da-50ca-c03f-276278774ad7

# echo '{"default":"e97d39ec-d3da-50ca-c03f-276278774ad7"}' | vault write pki/consul-connect/config/issuers -


## On Consul Servers

# systemctl restart consul

# wget -qO- http://127.0.0.1:8500/v1/connect/ca/roots | jq -r '.Roots[]|.IntermediateCerts[]' | ... | openssl x509 ...
issuer=C = CH, L = ZRH1, O = Exoscale, OU = Consul, CN = Exoscale Consul CA (ZRH1)
subject=CN = pri-1cpskcek.vault.ca.b9a7d174.consul
notBefore=Oct 31 17:07:38 2022 GMT
notAfter=Nov 14 17:08:08 2022 GMT
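
One caveat with the issuer-selection one-liner above: `sort` compares the `notAfter=Nov 14 ...` strings lexically, so month names are ordered alphabetically and an issuer expiring in December would sort before one expiring in November. A slightly more robust sketch that sorts on epoch seconds instead, assuming GNU date (for `-d`) and an authenticated vault CLI; the helper names are made up:

```shell
# Hypothetical helpers; assume GNU date, openssl, and an authenticated vault CLI.

# Print the notAfter of the PEM certificate on stdin as epoch seconds.
cert_enddate_epoch() {
  date -u -d "$(openssl x509 -noout -enddate | cut -d= -f2)" +%s
}

# Read "<epoch> <issuer-id>" lines on stdin; print the ID that expires last.
latest_issuer() {
  sort -n -k1,1 | tail -n 1 | cut -d' ' -f2
}

# Emit one "<epoch> <issuer-id>" line per issuer of the given mount.
list_issuer_expiries() {
  local mount="$1" id
  for id in $(vault list "${mount}/issuers" | grep '^[0-9a-f]'); do
    echo "$(vault read -field=certificate "${mount}/issuer/${id}/json" | cert_enddate_epoch) ${id}"
  done
}
```

With these, `list_issuer_expiries pki/consul-connect | latest_issuer` should print the candidate ID to pass to `vault write pki/consul-connect/config/issuers`.
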
@jkirschner-hashicorp
Contributor

Hi @exo-cedric,

Thank you for reporting this. We'll have to look into the changes resulting from Vault 1.11's multi-issuer feature, as you described.

Separately, you mentioned that the intermediate CA list grows indefinitely. That should be fixed as of 1.11.9, 1.12.5*, and 1.13.2 by PR #14429. That PR prunes expired certificates. In the case of your observations, it sounds like you might have a growing list of not-yet-expired certificates (one added per hour?), so even if you were to upgrade from 1.12.3 to 1.12.6, you might still see a growing list (until some of the certs pass their expiration).

*If you consider upgrading from 1.12.3 (to a later 1.12.x or 1.13.x), I strongly recommend you first review this guidance: https://developer.hashicorp.com/consul/docs/upgrading/upgrade-specific#modify-vault-policy-for-vault-ca-provider. For example, if you stay on 1.12.x, I would go straight to 1.12.6 (skip 1.12.5).

@jkirschner-hashicorp jkirschner-hashicorp added type/bug Feature does not function as expected type/docs Documentation needs to be created/updated/clarified theme/consul-vault Relating to Consul & Vault interactions labels Nov 1, 2022
@exo-cedric
Author

Hi @jkirschner-hashicorp

Thank you for the quick feedback.

> Separately, you mentioned that the intermediate CA list grows indefinitely. That should be fixed as of 1.11.9, 1.12.5*, and 1.13.2 by PR #14429. That PR prunes expired certificates. In the case of your observations, it sounds like you might have a growing list of not-yet-expired certificates (one added per hour?), so even if you were to upgrade from 1.12.3 to 1.12.6, you might still see a growing list (until some of the certs pass their expiration).

The list is growing on an hourly basis, always with the same IntermediateCert (the one that is erroneously being kept in use):

# wget -qO- http://127.0.0.1:8500/v1/connect/ca/roots | jq -r '.Roots[]|.IntermediateCerts[]' | ... | openssl x509 -noout -subject | sort | uniq -c
     26 subject=CN = pri-m39qdbn.vault.ca.764ccfb4.consul

> *If you consider upgrading from 1.12.3 (to a later 1.12.x or 1.13.x), I strongly recommend you first review this guidance: https://developer.hashicorp.com/consul/docs/upgrading/upgrade-specific#modify-vault-policy-for-vault-ca-provider. For example, if you stay on 1.12.x, I would go straight to 1.12.6 (skip 1.12.5).

Thanks for the heads-up! (that was already on our radar... and applied as the first potential culprit for the case at hand :-) )

@jkirschner-hashicorp
Contributor

Hi @exo-cedric,

Just to confirm, where/when are you seeing the following error?

tls: handshake message of length 118637 bytes exceeds maximum of 65536 bytes

Are you seeing that because the new leaf certificates generated after step 3 (>50% of intermediate CA cert TTL) have an ever-growing list of intermediate certs, so services in the mesh using those new leaf certificates begin to fail the TLS handshake when communicating with other services?

@exo-cedric
Author

Hi @jkirschner-hashicorp

> Just to confirm, where/when are you seeing the following error?
>
> tls: handshake message of length 118637 bytes exceeds maximum of 65536 bytes

I don't have a precise analysis of when/how exactly this error shows up, except that it appeared at the same time we started experiencing the IntermediateCert problem. My hunch is that it is related to the growing list of (identical) IntermediateCerts that Consul Connect CA keeps in store (queryable via wget -qO- http://127.0.0.1:8500/v1/connect/ca/roots | jq -r '.Roots[]|.IntermediateCerts[]').

A typical log message is: Oct 31, 2022 @ 12:51:03.634 [GET /health/ready] Error getting leader status: Get "https://consul-agent.consul:8501/v1/status/leader": tls: handshake message of length 92819 bytes exceeds maximum of 65536 bytes

@jkirschner-hashicorp
Contributor

@exo-cedric : We're actively looking into this to understand cause(s), potential workaround(s), and fix(es) we could make in Consul. We'll reach out if we have further questions as we go. Thank you for the detailed report!

@jkirschner-hashicorp
Contributor

jkirschner-hashicorp commented Nov 14, 2022

@exo-cedric: Are your Consul client agents using either auto-config or auto-encrypt? I'm wondering if you are just seeing the TLS handshake error messages on control plane traffic, and whether that's because Consul client agents are using certificates issued by the Connect CA (which only happens if using auto-config or auto-encrypt).

@exo-cedric
Author

@jkirschner-hashicorp

Thanks for actively looking into this and your feedback. We're using auto-encrypt { tls = true } (and a few - 3-odd - dns_san)

@jkirschner-hashicorp
Contributor

jkirschner-hashicorp commented Nov 17, 2022

Status update

For anyone using Vault 1.11.0+ as Consul's Connect CA provider, we've published this knowledge base article with more details on the issue, including the recommended workaround. We've also added mentions of this known issue to relevant places in the Consul and Vault docs.

The Consul team is working on fixes to be included in an upcoming Consul patch release. Refer to PR #15253.

@exo-cedric
Author

Awesome! Thanks to all parties involved for quickly addressing and fixing this issue.
(I'll report back on whether the issue is solved once we have applied the Consul patch release, hopefully in the days to come)

@jkirschner-hashicorp
Contributor

jkirschner-hashicorp commented Nov 30, 2022

Latest status as of Dec 2:

At this time, we recommend that multi-datacenter deployments wait until an upcoming patch release for a full fix.

Consul 1.12.7, 1.13.4, and 1.14.2 were released on Nov 30 with a fix that resolves this issue in primary datacenters. An additional fix is still needed to resolve this issue in the secondary datacenters that exist in multi-datacenter deployments using WAN federation.

For now, those affected should continue to refer to the knowledge base article with more details on the issue, including the recommended workaround.

@jkirschner-hashicorp
Contributor

jkirschner-hashicorp commented Dec 2, 2022

@exo-cedric : I updated the previous comment to reflect our latest understanding. For now, we recommend that multi-DC deployments use the workaround and wait until the next patch release before upgrading.

@exo-cedric
Author

Hello @jkirschner-hashicorp

We've updated our preprod Consul servers to 1.13.4 and the default issuer appears to be updated as expected:

root@infra-vault-pp004:~# vault-login-root; /root/vault-consul-connect-issuer-fix -l
INFO[vault-consul-connect-issuer-fix]: Fetching the list of issuers ...
INFO[vault-consul-connect-issuer-fix]: There currently are 6 issuers configured
cc4cb190-6268-be4a-c563-795b5baac723 CN=pri-1ef74anj.vault.ca.764ccfb4.consul 2022-12-24T10:39:03Z
18b06cd5-6be0-adec-5c8d-3b69e420fd72 CN=pri-u3damcoq.vault.ca.764ccfb4.consul 2022-12-24T11:08:53Z
86299bbf-445c-ac66-6bef-0ef0d7c82365 CN=pri-d5flz1r.vault.ca.764ccfb4.consul 2022-12-24T11:09:09Z
07b6c64f-21fd-e4b0-ad20-64ac2c05bfdf CN=pri-wgntxgi.vault.ca.764ccfb4.consul 2023-01-02T12:41:14Z
d87d4186-c0c1-4ce7-cf25-f6e960d17755 CN=pri-z0okcitf.vault.ca.764ccfb4.consul 2023-01-02T12:46:14Z
127ed149-312f-178f-2ecf-56161d3b01a4 CN=pri-sgnm2e70.vault.ca.764ccfb4.consul 2023-01-02T12:52:36Z
INFO[vault-consul-connect-issuer-fix]: Current issuer: 127ed149-312f-178f-2ecf-56161d3b01a4 CN=pri-sgnm2e70.vault.ca.764ccfb4.consul 2023-01-02T12:52:36Z
INFO[vault-consul-connect-issuer-fix]: Candidate issuer: 127ed149-312f-178f-2ecf-56161d3b01a4 CN=pri-sgnm2e70.vault.ca.764ccfb4.consul 2023-01-02T12:52:36Z
INFO[vault-consul-connect-issuer-fix]: Current and candidate issuers are equal

# wget -qO- http://127.0.0.1:8500/v1/connect/ca/roots | jq -r '.Roots[]|.IntermediateCerts[]' > /tmp/consul-connect-cas.pem && _ssl_crt /tmp/consul-connect-cas.pem | grep subject= | sort | uniq -c && rm -f /tmp/consul-connect-cas.pem
      1 subject=CN = pri-sgnm2e70.vault.ca.764ccfb4.consul

On the other hand, we don't see the "obsolete" issuers being cleaned up in the Vault PKI "store". I'm afraid this might lead to issues in the medium (to very long?) term; maybe something worth keeping on the radar too?

PS: In retrospect, I now believe this to be a Vault (1.11+) issue; API behavior should not have changed in such a significant/breaking manner (?).

@jkirschner-hashicorp
Contributor

jkirschner-hashicorp commented Dec 5, 2022

> On the other hand, we don't see the "obsolete" issuers being cleaned up in the Vault PKI "store". I'm afraid this might lead to issues in the medium (to very long?) term; maybe something worth keeping on the radar too?

My non-expert understanding is that Vault performance may be affected once the number of issuers in a PKI secrets engine approaches ~100+. It should take a while to reach that point assuming intermediate TTL isn't really low.

Vault 1.13.0+ will include the tidy_expired_issuers option to allow users to opt into automatically removing expired issuers after a post-expiration delay of issuer_safety_buffer (defaults to 1 year). Refer to this Vault PR.
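
Once on Vault 1.13+, opting in might look like the sketch below. The parameter names are the ones mentioned in this comment; the `config/auto-tidy` path and the wrapper function are assumptions to verify against the Vault PKI API docs for your version:

```shell
# Hypothetical wrapper: opt a PKI mount into automatic removal of expired issuers.
# Assumes Vault 1.13+ and an authenticated vault CLI. VAULT_BIN can be
# overridden (e.g. VAULT_BIN=echo) to print the command instead of running it.
enable_expired_issuer_tidy() {
  local mount="$1" buffer="${2:-8760h}"   # 8760h = the 1-year default noted above
  "${VAULT_BIN:-vault}" write "${mount}/config/auto-tidy" \
    enabled=true \
    tidy_expired_issuers=true \
    issuer_safety_buffer="${buffer}"
}
```

For example, `VAULT_BIN=echo enable_expired_issuer_tidy pki/consul-connect 720h` prints the command that would be run with a 30-day safety buffer.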

For now, an operator could manually remove "obsolete" issuers if the number of obsolete issuers becomes problematic. We realize that's not ideal long-term.

Once Vault 1.13 is released, perhaps Consul could set tidy_expired_issuers = true if using Consul-managed PKI paths, and recommend that the operator set tidy_expired_issuers = true if using Vault-managed PKI paths.

Alternatively, we could consider having Consul try to delete issuers itself, but that would require giving Consul additional privileges to do that in its Vault token (delete on <pki_path>/issuer/*). I'm not sure if that's desirable for operators.
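
For reference, that extra privilege would amount to a policy stanza along these lines. The generator function and the policy name in the usage note are only illustrations, and `pki/consul-connect` stands in for the real <pki_path>:

```shell
# Hypothetical helper: print the Vault policy stanza that would let a token
# delete issuers under the given PKI mount.
issuer_delete_policy() {
  local mount="$1"
  cat <<EOF
path "${mount}/issuer/*" {
  capabilities = ["delete"]
}
EOF
}
```

It could be loaded with something like `issuer_delete_policy pki/consul-connect | vault policy write consul-issuer-cleanup -`.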

What are your thoughts?

@exo-cedric
Author

> Vault 1.13.0+ will include the tidy_expired_issuers option to allow users to opt into automatically removing expired issuers after a post-expiration delay of issuer_safety_buffer (defaults to 1 year). Refer to this Vault PR.
>
> Once Vault 1.13 is released, perhaps Consul could set tidy_expired_issuers = true if using Consul-managed PKI paths, and recommend that the operator set tidy_expired_issuers = true if using Vault-managed PKI paths.

Yes. I think this would be the best approach

> Alternatively, we could consider having Consul try to delete issuers itself, but that would require giving Consul additional privileges to do that in its Vault token (delete on <pki_path>/issuer/*). I'm not sure if that's desirable for operators.

Entirely agree this approach might not be desirable

@exo-cedric
Author

I can now confirm our 32-day Intermediate CA has been successfully rotated at 50% of its lifetime, without manual intervention, including the Vault default issuer:

root@infra-vault-pp004:~# /root/vault-consul-connect-issuer-fix -l 
INFO[vault-consul-connect-issuer-fix]: Fetching the list of issuers ...
INFO[vault-consul-connect-issuer-fix]: There currently are 2 issuers configured
127ed149-312f-178f-2ecf-56161d3b01a4 CN=pri-sgnm2e70.vault.ca.764ccfb4.consul 2023-01-02T12:52:36Z
acaf6dd7-b294-e4a1-bba8-06d2882d5e2f CN=pri-1tk0p3qh.vault.ca.764ccfb4.consul 2023-01-18T12:52:43Z
INFO[vault-consul-connect-issuer-fix]: Current issuer: acaf6dd7-b294-e4a1-bba8-06d2882d5e2f CN=pri-1tk0p3qh.vault.ca.764ccfb4.consul 2023-01-18T12:52:43Z
INFO[vault-consul-connect-issuer-fix]: Candidate issuer: acaf6dd7-b294-e4a1-bba8-06d2882d5e2f CN=pri-1tk0p3qh.vault.ca.764ccfb4.consul 2023-01-18T12:52:43Z
INFO[vault-consul-connect-issuer-fix]: Current and candidate issuers are equal
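
As a sanity check on the output above: the two issuers expire roughly 16 days apart, which is what rotation at 50% of a 32-day TTL would produce. A small sketch of that threshold, assuming the 50% figure discussed in this thread and epoch-second inputs (the function name is made up):

```shell
# Hypothetical helper: succeed once NOW has reached the midpoint of the
# certificate's validity window. All arguments are epoch seconds.
rotation_due() {
  local now="$1" not_before="$2" not_after="$3"
  [ "$now" -ge $(( not_before + (not_after - not_before) / 2 )) ]
}
```
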

As far as I'm concerned, this issue may be considered Solved

@jkirschner-hashicorp
Contributor

Hi @exo-cedric,

I'm glad to hear that. With both your confirmation and the merge/release of the fix for secondary datacenters (#15661), I'm marking this closed.

The full fix (for both primary and secondary datacenters) is available in:

  • 1.12.x release series: 1.12.8+
  • 1.13.x release series: 1.13.5+
  • 1.14.x release series: 1.14.3+
