Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Vault certificates with no lease #1106

Closed
wants to merge 1 commit into from

Conversation

dacohen
Copy link
Contributor

@dacohen dacohen commented May 25, 2018

According to the Vault documentation, when issuing a large number of certificates, it is best practice to not generate a lease for the certificate:

When large number of certificates are generated with long lifetimes, it is recommended that lease generation be disabled, as large amount of leases adversely affect the startup time of Vault.

However, consul-template uses lease durations to determine how often to render a template. If there is no lease associated with a secret, it defaults to 5 minutes, and renews approximately every minute and a half, which is too frequently for most certificates. Thus, it is not possible to follow Vault recommendations for large number of certs when using consul-template.

This PR is an attempt to rectify this. If the returned secret is a PKI-issued cert, then we can infer what the lease duration would have been by decoding the certificate data itself, removing the requirement for Vault to track the leases of all active certificates, and generate CRL entries for them.

This is my attempt at getting around the mismatch between consul-template functionality and Vault recommendations. If there is a better way of doing this, I'd love to start a conversation.

@jefferai
Copy link
Member

How many certificates are you expecting? Probably your best bet is to just enable leases unless you have huge numbers of certs.

@dacohen
Copy link
Contributor Author

dacohen commented May 26, 2018

It looks like we currently have on the order of 20k active certificates, mostly with 7 day TTLs. We use consul-template to provision certs for K8S pods, and frequently start and stop pods in response to demand, so the number of certificates climbs quickly.

Enabling leases for these certificates leads to long startup times for Vault, as mentioned in the documentation, on the order of 5-10 minutes, and what seems like excessive memory usage, currently about 6 Gigabytes. In addition, Vault frequently fails to revoke expired leases due to size constraints on the CRL, and calls to the PKI Tidy endpoint take tens of hours to complete, and don't seem to affect the memory or startup profile.

We attributed these symptoms to certificate leases, but if this seems unlikely, then we can probably close this PR and move the conversation to an issue on the Vault repo.

@jefferai
Copy link
Member

Vault doesn't have any startup delay due to leases since about 0.8.3, the documentation is out of date at this point.

If tidy is taking 10+ hours to analyze 20k entries then your storage seems suspect. It should be a few minutes.

@jefferai
Copy link
Member

Also expired leases shouldn't go onto the CRL so long as the reason for expiration is ttl. Are you manually revoking these leases?

@dacohen
Copy link
Contributor Author

dacohen commented May 26, 2018

Thanks for the suggestions. We're using a DynamoDB backend, which could be the culprit when it comes to the abysmal tidy speed. Whenever Vault starts, it prints out several hundred errors regarding CRL size, and is unresponsive for several minutes until the rate of these errors decreases. I do not believe we're revoking certificates, unless consul-template does something of this sort when it shuts down.

@jefferai
Copy link
Member

I have heard that DynamoDB is very, very, very slow w.r.t. Vault. Especially when Vault starts up or a node takes over active duty, it does make a lot of I/O as it reads leases, and you may well be getting throttled. This could also be the issue with tidy taking a long time.

What version of Vault?

@dacohen
Copy link
Contributor Author

dacohen commented May 26, 2018

We're currently running v0.10.1

@jefferai
Copy link
Member

With that version there shouldn't be correlation with those messages you're seeing (which you're seeing because leases are attempting to be revoked and failing due to CRL size -- you can use revoke-force to ignore the error if you want) and time until requests can be serviced, since leases are loaded lazily. So you're probably just incredibly I/O bound.

@dacohen
Copy link
Contributor Author

dacohen commented May 27, 2018

That makes sense. Thanks again for all your help. We were trying to avoid running Consul, which I know is the recommended backend for Vault, since it's one more system for our team to manage. We're looking to use a managed service (hence DynamoDB) in order to avoid this. Looking at the supported storage backends, Google Cloud Storage/Spanner, MySQL/PostgreSQL, and S3 would fit that requirement. In your experience, do any of these offer better IO performance than DynamoDB?

@jefferai
Copy link
Member

Honestly, I wouldn't know. They're all community supported so we don't use/test them.

@dmicanzerofox
Copy link

dmicanzerofox commented Jul 12, 2018

I think this may be related to an issue we just had this week.

Our CRL list had grown so large and was prohibiting us from issuing new certs. We had tried to tidy it and the tidy appeared to do nothing. It was actually doing something but because our TTLs were still valid, tiday was not finding them as candidates and cleaning them. Our "fix" was to disable leases but now we are stuck with the 5 minute Cycle as well. Our current action is reenabling the leases.

hashicorp/vault#4877

We have patched the pki_tidy command to clear all entries that have been revoked, so that we are at least able to manage our CRL list using vault tooling (opposed to just deleting stuff from our backend :p ) .

https://github.com/hashicorp/vault/compare/master...dmicanzerofox:feature/pki-tidy-revoked-certs?expand=1

@eikenb
Copy link
Contributor

eikenb commented Jun 18, 2019

Hey @dacohen, I'm taking over as maintainer of consul-template here at hashi and am curious if you are still having the issues that drove this PR? A good amount of time has passed and Vault has put out several releases that might have impacted your case. In short I'm curious if this feature/PR is still desired. Thanks.

@eikenb
Copy link
Contributor

eikenb commented Jun 19, 2019

Hey @dacohen, since you submitted your pull request we've implemented a policy of requiring a signed CLA before merging. Would you please consider signing the CLA? Thanks.

@dacohen
Copy link
Contributor Author

dacohen commented Jun 19, 2019

@eikenb, as far as I know the issues addressed by this PR still exist in current versions of Vault.

The underlying issue is that consul-template won't attempt to renew a secret unless it has a lease duration associated with it. However, this requires Vault to keep track of leases for every active certificate. In recent versions, Vault does a better job of handling large numbers of leases, but it would still be convenient for Vault to not have to issue a lease, while still supporting renewal.

Let me know if this answers your question.

Thanks,

Daniel

@eikenb
Copy link
Contributor

eikenb commented Jul 16, 2019

@dacohen, thanks for the additional information.

I'll want to check with the vault devs to make sure that this would compromise security in any way before thinking about working on merging it. But I'll try to at least kick that off...

@hashicorp/vault, in case you are listening... what do you think about this?

@eikenb eikenb added this to the 0.22.0 - Community PR focused milestone Aug 21, 2019
@eikenb
Copy link
Contributor

eikenb commented Aug 21, 2019

There are no security issues and this PR seems reasonable even with Vault not having the startup time issue as there seems to be other reasons to disable leases as @dmicanzerofox seems to indicate. So I'm thinking this is good to go.

@eikenb eikenb closed this in 2734e51 Sep 10, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement vault Related to the Vault integration
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants