Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Renewal check for vault-pki certificates fails, renewing constantly #1272

Closed
sfudeus opened this issue Sep 2, 2019 · 11 comments · Fixed by #1277
Closed

Renewal check for vault-pki certificates fails, renewing constantly #1272

sfudeus opened this issue Sep 2, 2019 · 11 comments · Fixed by #1277
Labels
Milestone

Comments

@sfudeus
Copy link

sfudeus commented Sep 2, 2019

Consul Template version

consul-template v0.21.2 (3957cdf)

Configuration

log_level = "trace"
vault {
  address = "http://192.168.0.254:8200"
   ssl {
     enabled = false
   }
 }
template {
  source = "/etc/consul-template/etcd_ca.tmpl"
  destination = "/tmp/ca.pem"
  # Required to not have matchbox interpolate
  left_delimiter  = "[["
  right_delimiter = "]]"
}
[[- $hostname := env "HOSTNAME" -]]
[[- $hostip := env "HOSTIP" -]]
[[- $cn_arg := printf "common_name=%s.%s" $hostname "po.k8s.zone" -]]
[[- $ip_arg := printf "ip_sans=127.0.0.1,%s" $hostip -]]
[[- with secret "/ui/poinfra/unicorn-prod-01-bs-etcd/issue/etcd" $cn_arg $ip_arg -]]
[[ .Data.issuing_ca ]][[ end ]]```

### Command

```shell
consul-template -config /etc/consul-template/test.hcl

Debug output

https://gist.github.com/sfudeus/397bfd48758355c3b952ecebfdf3a2c6

Expected behavior

Our cert ttl is 8h. With consul-template 0.21.0, renewal occurred with around 5h certificate lifetime left. With 0.21.2 certificates are renewed continuously (i.e. approx. all 17s in this test setup.

Certs should have been fetched and renewed when 1/6 to 1/3 of their TTL is left (see #1267)

Actual behavior

Certificates are updated continuously.

Steps to reproduce

  1. Start consul-template against vault with pki in given trivial configuration.
  2. Observe behaviour in log output

References

Are there any other GitHub issues (open or closed) that should
be linked here? For example:

@eikenb eikenb added the bug label Sep 3, 2019
@eikenb
Copy link
Contributor

eikenb commented Sep 3, 2019

Hey @sfudeus, thanks for the report.

Seems test coverage wasn't as good as I thought as this is the second recent regression that doesn't seem to have test coverage. As I want to get CT stable before adding new features, I'll be getting right to this.

@eikenb eikenb added this to the 0.21.3 milestone Sep 3, 2019
@eikenb
Copy link
Contributor

eikenb commented Sep 4, 2019

@sfudeus.

Did you change that template at all between 0.21.0 and 0.21.2? One of the fixes in 0.21.1 was fixing uses of vault/secret write, where you use have..

[[- with secret "/ui/poinfra/unicorn-prod-01-bs-etcd/issue/etcd" $cn_arg $ip_arg -]]

Before that fix version 1 of vault kv secret stores just triggered a panic and version-2 only returned an 'no data provided' message and retried, see #1252 for more. Going back to a 0.21.0 build, using a config and template based your original post, I get the same panic I got while looking at that issue.

I created the ui secrets store in vault to replicate your case, making it a version 1 kv store... vault secrets enable -path=ui --version=1 kv.

@eikenb
Copy link
Contributor

eikenb commented Sep 4, 2019

I was reading some about the pki secrets engine and noticed is supported -path... still getting up to speed with everything, so I didn't realize this was possible (all docs I had seen used the /pki path). So does that ui/ path use the pki engine? I'm guess it does and I need to read more about how to set that up for testing.

@eikenb
Copy link
Contributor

eikenb commented Sep 4, 2019

Ok. I followed the docs and created a mix of the version you gave with the version in the docs (trying to use your code as much as possible w/ the documented paths)... and I got it to work. I think I may even have replicated the issue. I need to try it on the 0.21.0 version to make sure (which I'll do tomorrow... getting late).

@sfudeus
Copy link
Author

sfudeus commented Sep 4, 2019

Hi @eikenb, thanks for diving into this.

The template was not changed between 0.21.0 and 0.21.2. In fact it wasn't changed since 0.19.something, changing the behavior already regarding the point in time when renewal occurs, but now better aligned with the definition from #1267.

The vault pki is created via vault secrets enable -path /ui/poinfra/unicorn-prod-01-bs-etcd pki and tuned with vault secrets tune -max-lease-ttl=87600h /ui/poinfra/unicorn-prod-01-bs-etcd initially.

@eikenb
Copy link
Contributor

eikenb commented Sep 4, 2019

Using 0.21.0 I seem to have somewhat replicated the old behavior except that instead of sleeping for 5 hours it sleeps for 4.5 minutes. The code tells me that this means that vault didn't return a specified lease duration, so it went with a default.

The fact that it was 4.5 minutes also tells me that it doesn't think the secret is renewable (non-renewable secrets are refreshed after 85% of the lease or default). But you indicate in your initial post that these should be renewable.

These 2 things seem to indicate I don't have vault setup quite right yet. I'm running vault in dev mode (vault server -dev -log-level=warn) and setup the pki engine like so...

    vault secrets enable pki
    vault secrets tune -max-lease-ttl=8760h pki/
    vault write -field=certificate pki/root/generate/internal \
            common_name="example.com" \
            ttl=8760h > CA_cert.crt
    vault write pki/config/urls \
        issuing_certificates="http://127.0.0.1:8200/v1/pki/ca" \
        crl_distribution_points="http://127.0.0.1:8200/v1/pki/crl"
    vault write pki/roles/example-dot-com \
        allowed_domains=example.com \
        allow_subdomains=true \
        max_ttl=72h

Those paths were taken from the docs, I changed your template to use those paths...

[[- $hostname := env "HOSTNAME" -]]
[[- $hostip := env "HOSTIP" -]]
[[- $cn_arg := printf "common_name=%s.%s" $hostname "example.com" -]]
[[- $ip_arg := printf "ip_sans=127.0.0.1,%s" $hostip -]]
[[- with secret "pki/issue/example-dot-com" $cn_arg $ip_arg -]]
[[ .Data.issuing_ca ]][[ end ]]

If you see anything there that doesn't look right, please let me know.

Aside from all that, I still think I'll be able to figure out the issue as that is related to how long it will sleep and not whether is sleeps at all or not (which is the problem) nor whether it is renewable.

@eikenb
Copy link
Contributor

eikenb commented Sep 4, 2019

Oh.. and i set the env-vars to...

export HOSTNAME=foo
export HOSTIP=192.168.1.1

[edit: looked at template again and saw HOSTNAME was not supposed to be the FQDN]

@sfudeus
Copy link
Author

sfudeus commented Sep 4, 2019

It might well be that I observed the (roughly) 5h timeframe for a certificate which is issued with a ttl of 72h. Template for a certificate differs here:

[[- $ttl_arg := printf "ttl=%s" "72h" -]]
[[- with secret "/ui/poinfra/unicorn-prod-01-bs-etcd/issue/etcd" $cn_arg $ip_arg $ttl_arg -]]
[[ .Data.certificate ]][[ end ]]

We do issue a bunch of certs, keys and ca-certs via consul-template.

Question:
Might it be related that we usually have 3 templates running against the same vault path (here: /ui/poinfra/unicorn-prod-01-bs-etcd/issue/etcd) but differing arguments (ttl, alt_names) for extracting private_key, certificate and issuing_ca. Would aligning the arguments help, possibly with a dedup config?
It still looks like a regression, but maybe we are doing things wrongly.

Edit:
Ok, dedup wouldn't be any help, since it is only needed to sync multiple consul-template instances, which we do not have.
And the renewable loop was already triggered for this small testing case where only the issuing_ca is fetched, regardless of a second template with other args, which we have in the full setup with templates for crt, key and ca-cert.

@eikenb
Copy link
Contributor

eikenb commented Sep 4, 2019

I think I've got it... it looks like an issue with sleeping that was missed in the tests due to the low values sleeps are set to during testing. I need to add some tests then will push up a PR.

eikenb added a commit that referenced this issue Sep 4, 2019
Over-zealous refactoring resulted in <-time.After() channel used for
non-renewable sleep to basically be ignored. Originally had channel of
the time channels but I "simplified" it when I shouldn't have. Bascially
undo that refactor by switching the way the sleep is passed to the next
iteraction from a `<-chan time.Time` to a `chan (<-chan time.time)`.
This way it can see it has a sleep to do, then do it instead of skipping
it as the time hadn't passed yet.

This also let me write a test for it as I don't need to actually sleep
to see that there is something in the channel.

Fixes #1272
eikenb added a commit that referenced this issue Sep 4, 2019
An over-zealous refactoring resulted in <-time.After() channel used for
non-renewable sleep to be ignored. I originally had channel of the time
channels but "simplified" it when I shouldn't have. This undoes that
refactor by switching the way the sleep is passed to the next iteration
from a `<-chan time.Time` to a `chan (<-chan time.time)`. This way it
can see it has a sleep to do, then do it instead of skipping it as the
time hadn't passed yet.

This also let me write a test for it as I don't need to actually sleep
to see that there is something in the channel.

Fixes #1272
@eikenb
Copy link
Contributor

eikenb commented Sep 5, 2019

Just pushed up #1277 that should fix this issue. Once I have it reviewed I'll merge it and make the 0.21.3 release.

eikenb added a commit that referenced this issue Sep 5, 2019
An over-zealous refactoring resulted in <-time.After() channel used for
non-renewable sleep to be ignored. I originally had channel of the time
channels but "simplified" it too much. This reworks it the way the sleep
is passed to the next iteration to just pass the time.Duration of the
sleep down the channel, then time.Sleep-ing when reading off the
channel. This way it can see it has a sleep to do, then do it instead of
skipping it as the time hadn't passed yet.

This also let me write a test for it as I don't need to actually sleep
to see that there is something in the channel.

Fixes #1272
@sfudeus
Copy link
Author

sfudeus commented Sep 6, 2019

Looks good on first glance - thanks for the fast fix!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants